Hugo (@mldhug)'s Twitter Profile
Hugo

@mldhug

PhD student in multimodal learning for audio understanding at @telecomparis

ID: 1162859120988438528

Joined: 17-08-2019 22:50:03

19 Tweets

45 Followers

403 Following

Sagar Vaze (@sagar_vaze)'s Twitter Profile Photo

We'll present GeneCIS at #CVPR2023 (Highlight). TL;DR: While most image representations are *fixed*, we present a general way to train and evaluate models which can adapt to different *conditions* on the fly. Code: github.com/facebookresear… Project page: sgvaze.github.io/genecis/ 🧵
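
As an aside for intuition, here is a minimal sketch of what retrieval under an explicit condition can look like: the same fixed image embedding produces different query embeddings depending on a text condition. The MLP combiner, dimensions, and names below are illustrative assumptions, not the GeneCIS architecture.

```python
# Illustrative sketch only: a combiner that adapts a *fixed* image embedding
# to a text-specified condition, producing a condition-specific retrieval query.
# (Dimensions and architecture are assumptions, not the GeneCIS model.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, d_img: int = 512, d_cond: int = 512, d_out: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_img + d_cond, 1024), nn.ReLU(), nn.Linear(1024, d_out)
        )

    def forward(self, img_emb: torch.Tensor, cond_emb: torch.Tensor) -> torch.Tensor:
        # Same image, different condition -> different retrieval query.
        q = self.mlp(torch.cat([img_emb, cond_emb], dim=-1))
        return F.normalize(q, dim=-1)

combiner = Combiner()
queries = combiner(torch.randn(4, 512), torch.randn(4, 512))  # (4, 512) unit vectors
```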

AK (@_akhaliq)'s Twitter Profile Photo

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

paper page: huggingface.co/papers/2306.09…

Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data

Wei-Ning Hsu (@mhnt1580)'s Twitter Profile Photo

Super excited to finally launch Voicebox🤩, the most versatile speech generative model ever💬👄 Demo page: voicebox.metademolab.com

Lucas Beyer (bl16) (@giffmana)'s Twitter Profile Photo

Who killed non-contrastive image-text pretraining? Alec Radford and Jong Wook Kim 💟 with the below Fig2 in CLIP.

Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours.

Generative captioning is not only competitive, it seems better!
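
For readers weighing the two recipes, here is a minimal sketch of the two pretraining objectives being compared: a CLIP-style contrastive loss versus a generative captioning loss. Function names, shapes, and the temperature are illustrative assumptions, not the paper's code.

```python
# Sketch of the two objectives discussed above (shapes/names are assumptions).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temp=0.07):
    # CLIP-style: match each image to its own caption within the batch.
    # img_emb, txt_emb: (batch, d), assumed L2-normalized.
    logits = img_emb @ txt_emb.t() / temp
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def captioning_loss(token_logits, caption_ids):
    # Generative: next-token cross-entropy from a decoder conditioned on the image.
    # token_logits: (batch, seq, vocab); caption_ids: (batch, seq).
    return F.cross_entropy(token_logits.flatten(0, 1), caption_ids.flatten())
```
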
Jack (in SF) Langerman (@jacklangerman)'s Twitter Profile Photo

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Looks promising; I'll have to try and see if it stands up to some poking ;-)

Love that they get around the need for multimodal training.

ar5iv.org/abs/2306.16410
github.com/ContextualAI/l…

AK (@_akhaliq)'s Twitter Profile Photo

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

paper page: huggingface.co/papers/2306.16…

The paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate

Puyuan Peng (@puyuanpeng)'s Twitter Profile Photo

Why is Whisper so robust to background noise? Not because Whisper suppresses it, but because Whisper *understands* it!

Check out the amazing work by Yuan Gong. They reveal this emergent capability of Whisper, and get SOTA *simultaneous* ASR + audio tagging
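
As a rough illustration of that claim, one can probe Whisper's encoder features for audio tagging, assuming the open-source openai-whisper package. The mean pooling and the untrained 527-class head (AudioSet-sized) below are assumptions for the sketch, not the exact method of the work referenced above.

```python
# Hedged sketch: probe Whisper's encoder for audio tagging.
# Assumes `pip install openai-whisper` and a local clip.wav; the pooling and
# the linear head are illustrative, not the referenced paper's recipe.
import torch
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))  # 30 s of samples
mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)        # (1, 80, 3000)

with torch.no_grad():
    feats = model.encoder(mel)                               # (1, 1500, 512) for "base"

clip_emb = feats.mean(dim=1)                                 # pooled clip embedding
tag_head = torch.nn.Linear(clip_emb.shape[-1], 527)          # to be trained for tagging
logits = tag_head(clip_emb)                                  # (1, 527) tag scores
```
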
Maksym Andriushchenko @ ICLR (@maksym_andr)'s Twitter Profile Photo

It's really surprising how far one can go with *linear* predictors in the autoregressive setting.

Interesting theory and experiments on TinyStories: a linear model (with 162M params :-)) can generate totally coherent text with few grammatical mistakes.

arxiv.org/abs/2309.06979
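
To make the *linear* part concrete, here is a minimal sketch of a purely linear autoregressive next-token model: the embedding lookup is linear in the one-hot tokens and the output map is affine, so there is no nonlinearity anywhere. Context length, dimensions, and parameterization are illustrative, not necessarily the paper's.

```python
# Sketch of a fully linear autoregressive predictor (illustrative sizes).
import torch
import torch.nn as nn

class LinearAR(nn.Module):
    def __init__(self, vocab_size: int, context: int, d_emb: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)  # linear in one-hot token ids
        self.out = nn.Linear(context * d_emb, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, context) previous token ids -> logits for the next token.
        x = self.emb(tokens).flatten(1)             # (batch, context * d_emb)
        return self.out(x)                          # no nonlinearity anywhere

model = LinearAR(vocab_size=512, context=16)
logits = model(torch.randint(0, 512, (2, 16)))      # (2, 512) next-token logits
```
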
Salah Zaiem (@salah_zaiem)'s Twitter Profile Photo

Given a number of ASR models of different sizes, how can I allocate an audio sample to the smallest one that will be good enough? Hugo worked on this question during his internship, and ended up with interesting conclusions you will find in our paper!
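
One way to picture the question is as a cascade that escalates to a larger model only when a confidence estimate is low. The sketch below is a hypothetical illustration of that framing, not the method from the paper; the models and confidence scores are stand-ins.

```python
# Hypothetical cascade: try the smallest ASR model first, escalate when its
# confidence estimate falls below a threshold. Stand-ins, not the paper's method.
from typing import Callable, Sequence, Tuple

ASRModel = Callable[[bytes], Tuple[str, float]]  # audio -> (hypothesis, confidence)

def route_transcription(audio: bytes, models: Sequence[ASRModel],
                        threshold: float = 0.9) -> str:
    text = ""
    for transcribe in models:              # ordered smallest -> largest
        text, conf = transcribe(audio)
        if conf >= threshold:              # good enough: stop early, save compute
            return text
    return text                            # otherwise keep the largest model's output

# Toy stand-ins for a small and a large model:
small = lambda a: ("hi ther", 0.62)
large = lambda a: ("hi there", 0.97)
print(route_transcription(b"...", [small, large]))  # -> "hi there"
```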

arXiv Sound (@arxivsound)'s Twitter Profile Photo

"An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment," Hugo Malard, Michel Olvera, Stéphane Lathuilière, Slim Essid, ift.tt/lf5BrIC
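
As a very rough picture of what "audiovisual distribution alignment" can mean, the sketch below projects audio embeddings toward the image-embedding distribution a frozen captioner expects, using paired video frames. The projector, dimensions, and moment-matching loss are illustrative assumptions, not the paper's actual objective.

```python
# Illustrative only: align audio embeddings with the image-embedding
# distribution a frozen captioner consumes (dims/loss are assumptions).
import torch
import torch.nn as nn

proj = nn.Linear(512, 768)  # assumed audio dim -> assumed captioner image dim

def alignment_loss(audio_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # audio_emb: (batch, 512) from an audio encoder;
    # image_emb: (batch, 768) from the captioner's image encoder (paired frames).
    a = proj(audio_emb)
    mean_loss = (a.mean(0) - image_emb.mean(0)).pow(2).mean()  # match first moment
    var_loss = (a.var(0) - image_emb.var(0)).pow(2).mean()     # match second moment
    return mean_loss + var_loss

loss = alignment_loss(torch.randn(32, 512), torch.randn(32, 768))
```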

Michel Olvera (@michelolzam)'s Twitter Profile Photo

Great talk today by Haohe Liu at the ADASP group on Latent Diffusion Models (LDMs) as a versatile audio decoder! Walked us through diffusion basics, AudioLDM for text-to-audio, audio quality enhancement, and neural codecs!
Hugo (@mldhug)'s Twitter Profile Photo

If you want to learn more about audio-visual alignment and how to use it to give audio abilities to your VLM, stop by our NeurIPS Conference poster #3602 (East exhibit hall A-C) tomorrow at 11am!

Salah Zaiem (@salah_zaiem)'s Twitter Profile Photo

We are looking for audio and speech generation people, in Zurich, Paris or London, to join our team at Google DeepMind. We build cutting-edge speech, music and audio (also audio-visual) generation capabilities. Reach out to Jason or me if interested. Retweets very appreciated!