Rafael Valle (@rafaelvalleart) 's Twitter Profile
Rafael Valle

@rafaelvalleart

Research Manager and Scientist at NVIDIA.
UC Berkeley alum.
Love, music, set and setting!

ID: 615836062

Link: http://rafaelvalle.github.io · Joined: 23-06-2012 05:26:22

159 Tweets

1.1K Followers

180 Following

Rafael Valle (@rafaelvalleart) 's Twitter Profile Photo

Audio Dialogues is finally out! It describes how we leveraged pre-trained LMs and joint audio and language embeddings to produce a dataset that gives audio LLMs the ability to hold multi-turn dialogues with users. arxiv.org/abs/2404.07616

Synthetic labels are amazing! Do you need an audio labelling machine? Audio Flamingo checkpoints are available on github.com/NVIDIA/audio-f… ...and pre-training with synthetic labels from Audio Flamingo gives large improvements in text-to-audio models arxiv.org/abs/2406.15487

What an honor to be in the cockpit while researchers from CMU, Fudan University, UC Berkeley and NVIDIA developed the approach that won DCASE's 2024 Audio-to-Text Captioning challenge! dcase.community/challenge2024/…

Do you work on audio synthesis and need state-of-the-art vocoders? BigVGAN v2 is out! It sets the state of the art in quality, runs faster, and offers commercial-friendly checkpoints at 44, 24, and 22 kHz. By the way, it tops the vocoding leaderboard again! paperswithcode.com/sota/speech-sy…

We are presenting Audio Flamingo at the ICML Conference at 11:30 am on Tuesday, Hall C 4-9, #2803. Come chat with us about the latest developments in audio understanding and synthesis! In preparation for ICML, we made this demo to highlight Audio Flamingo's capabilities. youtube.com/watch?v=ucttuS…

💚 Big shoutout to the #FUGATTO team for making this release happen — and to cats like Coltrane and Xenakis, who envisioned a world where "saxophones bark and howl." Together, artists and researchers, let’s build a GPT-like future for audio generation! fugatto.github.io

Our team at NVIDIA is continuously looking for highly motivated interns to work on intelligence in audio understanding and synthesis. Please reach out if you would like to collaborate with us!

New releases before the new year!
1) Audio generation for Music and FX (SOTA): ETTA: Elucidating the Design Space of Text-to-Audio Models arxiv.org/abs/2412.19351
2) Fine-grained, cross-modal temporal understanding: OMCAT: Omni Context Aware Transformer arxiv.org/abs/2410.12109

Text LLMs thrive on massive web-scale data—but for speech, synthetic dialogues are crucial! Besides outperforming SOTA TTS models, NVIDIA's new Koel-TTS excels at dialogue generation, leveraging improvements from preference optimization and CFG. koeltts.github.io

Audio Flamingo 2 beats GPT-4o, Gemini 2.0 & Phi-4M on 20+ benchmarks, but its real superpower? Emergent abilities like knowing that a drum track made of mechanical sounds is unusual: research.nvidia.com/labs/adlr/AF2/ Checkpoints for synthetic data generation? Yes! github.com/NVIDIA/audio-f…

Looking forward to discussing how Audio General Intelligence (AGI) can be an instrument for imagination! #GTC25
🗣️ The Expanding Sound: Unlock Creativity With AI in Audio Innovation
📅 March 20, 2025
⏰ 4:00 PM - 5:00 PM PDT
📍 San Jose, CA
nvidia.com/gtc/session-ca…

🚀 Excited to represent "Team Fugatto" at #ICLR2025 this Saturday! 📍 Find us in Hall 3 & Hall 2B, booth #152—come say hi and chat about our latest work!

Thanks to prophets Ilya, Hinton and Bengio, I now strongly feel AGI and its risks. Embarking on a pilgrimage to become a prophet—starting with my departure from NVIDIA. Honored to have represented its Audio General Intelligence team and excited for their future research!

The repository for ETTA – Elucidating the Design Space of Text-to-Audio Models – is finally out! github.com/NVIDIA/elucida…

🤯 Audio Flamingo 3 is out already... and that's before Audio Flamingo 2 makes its debut at ICML on Wednesday, July 16 at 4:30 p.m.! These benchmark results are insane! arxiv.org/abs/2507.08128

ICML, Wed 16 Jul, 11 am. ETTA: Elucidating the Design Space of Text-to-Audio Models. Favorite prompt: "A hip-hop track using sounds from a construction site—hammering nails as the beat, drilling sounds as scratches, and metal clanks as rhythm accents." research.nvidia.com/labs/adlr/ETTA/

Our research community has demonstrated – across text (OAI GPT), audio (NVIDIA Fugatto, ETTA, AF) and video (GDM Veo) – that scaling compute, model size, and data diversity can lead to zero- and few-shot learning. The time has come for scaling laws that predict emergent properties.

When Jinchuan became a guest researcher with ADLR-AGI, I knew we'd push the Audio General Intelligence frontier beyond Fugatto and UniAudio. UALM is a milestone that unifies audio understanding, generation, and multimodal reasoning in a single model 💚🙏🚀 arxiv.org/abs/2510.12000

Superintelligence is nearer and multimodal! 🚀🙏💚 Great honor to be involved in OmniVinci!
- SOTA AVLM
- Audio understanding significantly enhances video comprehension
- Audio signals improve omni-modal reinforcement learning
- Understanding demands omni-modal context

Against all odds, I’ll be at NeurIPS 2025 in San Diego this Thursday. If you trip on multimodal general intelligence, let’s chat.