Satvik Dixit (@satvikdixit9)'s Twitter Profile
Satvik Dixit

@satvikdixit9

MS student @CarnegieMellon | Prev @MIT @IITDelhi | Audio understanding and generation

ID: 1377350872540180484

Link: https://satvik-dixit.github.io/ · Joined: 31-03-2021 20:03:45

19 Tweets

109 Followers

792 Following

arXiv Sound (@arxivsound)'s Twitter Profile Photo

``Vision Language Models Are Few-Shot Audio Spectrogram Classifiers,'' Satvik Dixit, Laurie M. Heller, Chris Donahue, ift.tt/JUELXMA

Neil Zeghidour (@neilzegh)'s Twitter Profile Photo

Today we release Hibiki, real-time speech translation that runs on your phone. Adaptive flow without a fancy policy: simple temperature sampling of a multistream audio-text LM. Very proud of Tom Labiausse's work as an intern.
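
Temperature sampling, as mentioned in the tweet, is the standard decoding trick: divide the logits by a temperature before the softmax, then sample from the resulting distribution. A minimal illustrative sketch (this is not Hibiki's actual code; the function name and shapes are hypothetical):

```python
import math
import random

def temperature_sample(logits, temperature=1.0, rng=random):
    """Sample an index from logits after temperature scaling.

    Lower temperature sharpens the distribution (closer to argmax);
    higher temperature flattens it (more diverse output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the categorical distribution
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1
```

As the temperature approaches 0 this degenerates to greedy argmax decoding; in a multistream audio-text LM the same rule is applied per stream at each step.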

arXiv Sound (@arxivsound)'s Twitter Profile Photo

``Mellow: a small audio language model for reasoning,'' Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj, ift.tt/BFgXl2L

Soham Deshmukh (@sohamdesh_)'s Twitter Profile Photo

we show, for the first time, that sub-billion-parameter audio models can reason. we introduce Mellow, a small audio-language model (167M) that achieves SoTA on a range of audio reasoning tasks. using our method and data, you can train an ALM within 24 hrs on academic resources (1/n 🧵)

Neil Zeghidour (@neilzegh)'s Twitter Profile Photo

Trimodal training (text-audio-image) is challenging because you have a lot of unimodal data, some bimodal data, and few to no examples with all 3 modalities, and combining them is not obvious. We propose a simple extension to Moshi that allows it to understand images.

Neil Zeghidour (@neilzegh)'s Twitter Profile Photo

Thanks Google AI 🙏, I'm proud to see the concepts introduced in this paper (RVQ-VAE, quantizer dropout) still being as relevant four years later, and in particular how RVQ turned out to be a perfect fit for audio language models.
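
Residual vector quantization (RVQ), referenced above, quantizes a vector in stages: each codebook quantizes the residual left over by the previous stage, so a stack of small codebooks yields a fine-grained code. A toy sketch with assumed fixed codebooks (illustrative only, not the paper's trained implementation):

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Encode vec as one index per codebook; each stage quantizes the residual."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out
```

Each additional codebook refines the reconstruction, which is why truncating the stack trades quality for bitrate; quantizer dropout trains with a random number of active stages so one model supports variable bitrates at inference.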

Chris Donahue (@chrisdonahuey)'s Twitter Profile Photo

Excited to announce 🎵 Magenta RealTime, the first open-weights music generation model capable of real-time audio generation with real-time control. 👋 Try Magenta RT on Colab TPUs: colab.research.google.com/github/magenta… 👀 Blog post: g.co/magenta/rt 🧵 below

Albert Gu (@_albertgu)'s Twitter Profile Photo

I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.

Chris Donahue (@chrisdonahuey)'s Twitter Profile Photo

Excited to share our beta release of Music Arena, a live evaluation platform for state-of-the-art AI music generation models! 🎧 Listen to the latest models and 🗳️ vote for your favorite ⚔️ music-arena.org ⭐️ github.com/gclef-cmu/musi… 📜 arxiv.org/abs/2507.20900
