Hugo (@mldhug)'s Twitter Profile
Hugo

@mldhug

PhD student in multimodal learning for audio understanding at @telecomparis

ID: 1162859120988438528

Joined: 17-08-2019 22:50:03

19 Tweets

45 Followers

403 Following

Sagar Vaze (@sagar_vaze)'s Twitter Profile Photo

We'll present GeneCIS at #CVPR2023 (Highlight). TL;DR: While most image representations are *fixed*, we present a general way to train and evaluate models which can adapt to different *conditions* on the fly. Code: github.com/facebookresear… Project page: sgvaze.github.io/genecis/ 🧵
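
As an aside for intuition, here is a minimal sketch of what retrieval under an explicit condition can look like: the same fixed image embedding produces different query embeddings depending on a text condition. The MLP combiner, dimensions, and names below are illustrative assumptions, not the GeneCIS architecture.

```python
# Illustrative sketch only: a combiner that adapts a *fixed* image embedding
# to a text-specified condition, producing a condition-specific retrieval query.
# (Dimensions and architecture are assumptions, not the GeneCIS model.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, d_img: int = 512, d_cond: int = 512, d_out: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_img + d_cond, 1024), nn.ReLU(), nn.Linear(1024, d_out)
        )

    def forward(self, img_emb: torch.Tensor, cond_emb: torch.Tensor) -> torch.Tensor:
        # Same image, different condition -> different retrieval query.
        q = self.mlp(torch.cat([img_emb, cond_emb], dim=-1))
        return F.normalize(q, dim=-1)

combiner = Combiner()
queries = combiner(torch.randn(4, 512), torch.randn(4, 512))  # (4, 512) unit vectors
```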

AK (@_akhaliq)'s Twitter Profile Photo

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

paper page: huggingface.co/papers/2306.09…

Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data

Wei-Ning Hsu (@mhnt1580)'s Twitter Profile Photo

Super excited to finally launch Voicebox🤩, the most versatile speech generative model ever💬👄 Demo page: voicebox.metademolab.com

Lucas Beyer (bl16) (@giffmana)'s Twitter Profile Photo

Who killed non-contrastive image-text pretraining? Alec Radford and Jong Wook Kim 💟 with the below Fig2 in CLIP.

Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours.

Generative captioning is not only competitive, it seems better!
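
For readers weighing the two recipes, here is a minimal sketch of the two pretraining objectives being compared: a CLIP-style contrastive loss versus a generative captioning loss. Function names, shapes, and the temperature are illustrative assumptions, not the paper's code.

```python
# Sketch of the two objectives discussed above (shapes/names are assumptions).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temp=0.07):
    # CLIP-style: match each image to its own caption within the batch.
    # img_emb, txt_emb: (batch, d), assumed L2-normalized.
    logits = img_emb @ txt_emb.t() / temp
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def captioning_loss(token_logits, caption_ids):
    # Generative: next-token cross-entropy from a decoder conditioned on the image.
    # token_logits: (batch, seq, vocab); caption_ids: (batch, seq).
    return F.cross_entropy(token_logits.flatten(0, 1), caption_ids.flatten())
```
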
Jack (in SF) Langerman (@jacklangerman)'s Twitter Profile Photo

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Looks promising; I'll have to try and see if it stands up to some poking ;-)

Love that they get around the need for multimodal training.

ar5iv.org/abs/2306.16410
github.com/ContextualAI/l…

AK (@_akhaliq)'s Twitter Profile Photo

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

paper page: huggingface.co/papers/2306.16…

The paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate

Puyuan Peng (@puyuanpeng)'s Twitter Profile Photo

Why is Whisper so robust to background noise? Not because Whisper suppresses it, but because Whisper *understands* it!

Check out the amazing work by Yuan Gong. They reveal this emergent capability of Whisper, and get SOTA *simultaneous* ASR + audio tagging
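
As a rough illustration of that claim, one can probe Whisper's encoder features for audio tagging, assuming the open-source openai-whisper package. The mean pooling and the untrained 527-class head (AudioSet-sized) below are assumptions for the sketch, not the exact method of the work referenced above.

```python
# Hedged sketch: probe Whisper's encoder for audio tagging.
# Assumes `pip install openai-whisper` and a local clip.wav; the pooling and
# the linear head are illustrative, not the referenced paper's recipe.
import torch
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))  # 30 s of samples
mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)        # (1, 80, 3000)

with torch.no_grad():
    feats = model.encoder(mel)                               # (1, 1500, 512) for "base"

clip_emb = feats.mean(dim=1)                                 # pooled clip embedding
tag_head = torch.nn.Linear(clip_emb.shape[-1], 527)          # to be trained for tagging
logits = tag_head(clip_emb)                                  # (1, 527) tag scores
```
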
Maksym Andriushchenko @ ICLR (@maksym_andr)'s Twitter Profile Photo

It's really surprising how far one can go with *linear* predictors in the autoregressive setting.

Interesting theory and experiments on TinyStories: a linear model (with 162M params :-)) can generate totally coherent text with few grammatical mistakes.

arxiv.org/abs/2309.06979
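
To make the *linear* part concrete, here is a minimal sketch of a purely linear autoregressive next-token model: the embedding lookup is linear in the one-hot tokens and the output map is affine, so there is no nonlinearity anywhere. Context length, dimensions, and parameterization are illustrative, not necessarily the paper's.

```python
# Sketch of a fully linear autoregressive predictor (illustrative sizes).
import torch
import torch.nn as nn

class LinearAR(nn.Module):
    def __init__(self, vocab_size: int, context: int, d_emb: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)  # linear in one-hot token ids
        self.out = nn.Linear(context * d_emb, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, context) previous token ids -> logits for the next token.
        x = self.emb(tokens).flatten(1)             # (batch, context * d_emb)
        return self.out(x)                          # no nonlinearity anywhere

model = LinearAR(vocab_size=512, context=16)
logits = model(torch.randint(0, 512, (2, 16)))      # (2, 512) next-token logits
```
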
Salah Zaiem (@salah_zaiem)'s Twitter Profile Photo

Given a number of ASR models of different sizes, how can I allocate an audio sample to the smallest one that will be good enough? Hugo worked on this question during his internship, and ended up with interesting conclusions you will find in our paper!
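
One way to picture the question is as a cascade that escalates to a larger model only when a confidence estimate is low. The sketch below is a hypothetical illustration of that framing, not the method from the paper; the models and confidence scores are stand-ins.

```python
# Hypothetical cascade: try the smallest ASR model first, escalate when its
# confidence estimate falls below a threshold. Stand-ins, not the paper's method.
from typing import Callable, Sequence, Tuple

ASRModel = Callable[[bytes], Tuple[str, float]]  # audio -> (hypothesis, confidence)

def route_transcription(audio: bytes, models: Sequence[ASRModel],
                        threshold: float = 0.9) -> str:
    text = ""
    for transcribe in models:              # ordered smallest -> largest
        text, conf = transcribe(audio)
        if conf >= threshold:              # good enough: stop early, save compute
            return text
    return text                            # otherwise keep the largest model's output

# Toy stand-ins for a small and a large model:
small = lambda a: ("hi ther", 0.62)
large = lambda a: ("hi there", 0.97)
print(route_transcription(b"...", [small, large]))  # -> "hi there"
```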

arXiv Sound (@arxivsound)'s Twitter Profile Photo

"An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment," Hugo Malard, Michel Olvera, Stéphane Lathuilière, Slim Essid, ift.tt/lf5BrIC
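
As a very rough picture of what "audiovisual distribution alignment" can mean, the sketch below projects audio embeddings toward the image-embedding distribution a frozen captioner expects, using paired video frames. The projector, dimensions, and moment-matching loss are illustrative assumptions, not the paper's actual objective.

```python
# Illustrative only: align audio embeddings with the image-embedding
# distribution a frozen captioner consumes (dims/loss are assumptions).
import torch
import torch.nn as nn

proj = nn.Linear(512, 768)  # assumed audio dim -> assumed captioner image dim

def alignment_loss(audio_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # audio_emb: (batch, 512) from an audio encoder;
    # image_emb: (batch, 768) from the captioner's image encoder (paired frames).
    a = proj(audio_emb)
    mean_loss = (a.mean(0) - image_emb.mean(0)).pow(2).mean()  # match first moment
    var_loss = (a.var(0) - image_emb.var(0)).pow(2).mean()     # match second moment
    return mean_loss + var_loss

loss = alignment_loss(torch.randn(32, 512), torch.randn(32, 768))
```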

Michel Olvera (@michelolzam)'s Twitter Profile Photo

Great talk today by Haohe Liu at the ADASP group on Latent Diffusion Models (LDMs) as a versatile audio decoder! Walked us through diffusion basics, AudioLDM for text-to-audio, audio quality enhancement, and neural codecs!
Hugo (@mldhug)'s Twitter Profile Photo

If you want to learn more about audio-visual alignment and how to use it to give audio abilities to your VLM, stop by our NeurIPS Conference poster #3602 (East exhibit hall A-C) tomorrow at 11am!

Salah Zaiem (@salah_zaiem)'s Twitter Profile Photo

We are looking for audio and speech generation people, in Zurich, Paris or London, to join our team at Google DeepMind. We build cutting-edge speech, music and audio (also audio-visual) generation capabilities. Reach out to Jason or me if interested. Retweets very appreciated!