Rafael Valle (@rafaelvalleart) 's Twitter Profile
Rafael Valle

@rafaelvalleart

Research Manager and Scientist at NVIDIA.
UC Berkeley alum.
Love, music, set and setting!

ID: 615836062

Link: http://rafaelvalle.github.io · Joined: 23-06-2012 05:26:22

159 Tweets

1.1K Followers

180 Following

Rafael Valle (@rafaelvalleart) 's Twitter Profile Photo

Audio Dialogues is finally out! It describes how we leveraged pre-trained LMs and joint audio and language embeddings to produce a dataset that gives audio LLMs the ability to hold multi-turn dialogues with users. arxiv.org/abs/2404.07616

Synthetic labels are amazing! Do you need an audio labelling machine? Audio Flamingo checkpoints are available on github.com/NVIDIA/audio-f… ...and pre-training with synthetic labels from Audio Flamingo gives large improvements in text-to-audio models arxiv.org/abs/2406.15487

What an honor to be in the cockpit while researchers from CMU, Fudan University, UC Berkeley and NVIDIA developed the approach that won DCASE's 2024 Audio-to-Text Captioning challenge! dcase.community/challenge2024/…

Do you work on audio synthesis and need state-of-the-art vocoders? BigVGAN v2 is out! It sets the state of the art in quality, runs faster, and offers commercial-friendly checkpoints at 44, 24, and 22 kHz. By the way, it tops the vocoding leaderboard again! paperswithcode.com/sota/speech-sy…

We are presenting Audio Flamingo at the ICML Conference at 11:30 am on Tuesday, Hall C 4-9, #2803. Come chat with us about the latest developments in audio understanding and synthesis! In preparation for ICML, we made this demo to highlight Audio Flamingo's capabilities. youtube.com/watch?v=ucttuS…

💚 Big shoutout to the #FUGATTO team for making this release happen — and to cats like Coltrane and Xenakis, who envisioned a world where "saxophones bark and howl." Together, artists and researchers, let’s build a GPT-like future for audio generation! fugatto.github.io

Our team at NVIDIA is continuously looking for highly motivated interns to work on intelligence in audio understanding and synthesis. Please reach out if you would like to collaborate with us!

New releases before the new year!
1) Audio generation for Music and FX (SOTA): ETTA: Elucidating the Design Space of Text-to-Audio Models arxiv.org/abs/2412.19351
2) Fine-grained, cross-modal temporal understanding: OMCAT: Omni Context Aware Transformer arxiv.org/abs/2410.12109

Text LLMs thrive on massive web-scale data—but for speech, synthetic dialogues are crucial! Besides outperforming SOTA TTS models, NVIDIA's new Koel-TTS excels at dialogue generation, leveraging improvements from preference optimization and CFG. koeltts.github.io

Audio Flamingo 2 beats GPT-4o, Gemini 2.0 & Phi-4M on 20+ benchmarks, but its real superpower? Emergent abilities like knowing that a drum track made of mechanical sounds is unusual: research.nvidia.com/labs/adlr/AF2/ Checkpoints for synthetic data generation? Yes! github.com/NVIDIA/audio-f…

Looking forward to discussing how Audio General Intelligence (AGI) can be an instrument for imagination! #GTC25
🗣️ The Expanding Sound: Unlock Creativity With AI in Audio Innovation
📅 March 20, 2025
⏰ 4:00 PM - 5:00 PM PDT
📍 San Jose, CA
nvidia.com/gtc/session-ca…

🚀 Excited to represent "Team Fugatto" at #ICLR2025 this Saturday! 📍 Find us in Hall 3 & Hall 2B, booth #152—come say hi and chat about our latest work!

Thanks to prophets Ilya, Hinton and Bengio, I now strongly feel AGI and its risks. Embarking on a pilgrimage to become a prophet—starting with my departure from NVIDIA. Honored to have represented its Audio General Intelligence team and excited for their future research!

The repository for ETTA – Elucidating the Design Space of Text-to-Audio Models – is finally out! github.com/NVIDIA/elucida…

🤯 Audio Flamingo 3 is out already... and that's before Audio Flamingo 2 makes its debut at ICML on Wednesday, July 16 at 4:30 p.m.! These benchmark results are insane! arxiv.org/abs/2507.08128

ICML, Wed 16 Jul, 11 am. ETTA: Elucidating the Design Space of Text-to-Audio Models. Favorite prompt: "A hip-hop track using sounds from a construction site—hammering nails as the beat, drilling sounds as scratches, and metal clanks as rhythm accents." research.nvidia.com/labs/adlr/ETTA/

Our research community has demonstrated – across text (OAI GPT), audio (NVIDIA Fugatto, ETTA, AF) and video (GDM Veo) – that scaling compute, model size, and data diversity can lead to zero- and few-shot learning. The time has come for scaling laws that predict emergent properties.

When Jinchuan became a guest researcher with ADLR-AGI, I knew we'd push the Audio General Intelligence frontier beyond Fugatto and UniAudio. UALM is a milestone that unifies audio understanding, generation, and multimodal reasoning in a single model 💚🙏🚀 arxiv.org/abs/2510.12000

Superintelligence is nearer and multimodal! 🚀🙏💚 Great honor to be involved in OmniVinci!
- SOTA AVLM
- Audio understanding significantly enhances video comprehension
- Audio signals improve omni-modal reinforcement learning
- Understanding demands omni-modal context

Against all odds, I’ll be at NeurIPS 2025 in San Diego this Thursday. If you trip on multimodal general intelligence, let’s chat.