Daniel Blasko (@blskdan) 's Twitter Profile
Daniel Blasko

@blskdan

ML @snap for @spectacles, prev. @canva, building user-facing ML experiences that feel like magic ✨. Interested in multimodality, VLMs, systems & MLOps.

ID: 1150171710127312896

https://www.dblasko.fr · Joined 13-07-2019 22:34:39

235 Tweets

244 Followers

2.2K Following

Gaurav Misra (@gmharhar) 's Twitter Profile Photo

Today, we’re announcing a launch that culminates what we’ve believed since starting Captions: AI will fundamentally change video editing. We’re calling it AI Edit. Imagine if AI could take your unfinished video and give you back a fully edited version faster than it took

Jiachen Li (@jiachenli11) 's Twitter Profile Photo

Met many folks at ICML this year. We all agree that the key to the success of RLHF in training LLMs is the HF instead of the RL. Overall, the RL in RLHF only acts as a gradient estimator to address the non-differentiability of the "sampling operation" from a categorical
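
For context on the "gradient estimator" point: sampling a token from a categorical distribution blocks gradients, and the score-function (REINFORCE) trick recovers them by weighting log-probabilities with the reward. A minimal PyTorch sketch of that estimator is below; names, shapes, and rewards are illustrative, not taken from any particular RLHF implementation.

```python
import torch
import torch.nn.functional as F

# Toy policy: logits over a small vocabulary (stand-in for an LLM's next-token head).
logits = torch.randn(4, 8, requires_grad=True)                 # (batch, vocab)
probs = F.softmax(logits, dim=-1)

# The sample itself is non-differentiable: no gradient flows through multinomial().
actions = torch.multinomial(probs, num_samples=1).squeeze(-1)  # (batch,)

# Hypothetical scalar rewards, e.g. from a reward model scoring the sampled tokens.
rewards = torch.randn(4)

# Score-function (REINFORCE) estimator:
#   grad E[r(x)] = E[ r(x) * grad log p(x) ]
# so we minimise -(reward * log-prob) and let autograd supply grad log p.
log_probs = F.log_softmax(logits, dim=-1)
chosen = log_probs.gather(1, actions.unsqueeze(-1)).squeeze(-1)
loss = -(rewards * chosen).mean()
loss.backward()

print(logits.grad.shape)  # gradients reach the policy despite the discrete sampling step
```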

Magic (@magicailabs) 's Twitter Profile Photo

LTM-2-Mini is our first model with a 100 million token context window. That’s 10 million lines of code, or 750 novels. Full blog: magic.dev/blog/100m-toke… Evals, efficiency, and more ↓
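
The equivalences in the tweet check out with rough per-unit token counts; the ~10 tokens per line of code and ~133k tokens per novel used below are assumptions made for the arithmetic, not figures from the blog post.

```python
context_tokens = 100_000_000

tokens_per_code_line = 10       # assumption: ~10 tokens per line of source code
tokens_per_novel = 133_000      # assumption: ~100k-word novel at ~1.3 tokens per word

print(context_tokens // tokens_per_code_line)  # 10,000,000 lines of code
print(context_tokens // tokens_per_novel)      # ~751 novels
```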

Andrew Carr (e/🤸) (@andrew_n_carr) 's Twitter Profile Photo

Source: research.nvidia.com/labs/dir/cosmo… Legitly amazing image and video tokenizers. Probably one of the best Nvidia releases recently. Lots of juicy details here. Especially the two stage training on reconstruction then optical flow.

Alaa El-Nouby (@alaa_nouby) 's Twitter Profile Photo

𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔 Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding. github.com/apple/ml-aim (🧵)
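
For readers new to the idea: autoregressive pre-training for vision treats an image as a sequence of patches and trains a causally masked model to predict each next patch from the previous ones. The sketch below only illustrates that objective with a toy transformer; it is not the AIMv2 architecture, scale, or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAutoregressivePatchModel(nn.Module):
    """Toy next-patch predictor: a causally masked transformer over image patches."""

    def __init__(self, patch_dim=48, width=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, width)
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(width, patch_dim)   # regress the pixels of the next patch

    def forward(self, patches):                   # patches: (batch, seq, patch_dim)
        seq = patches.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(patches), mask=causal)
        return self.head(h)

# 16 flattened 4x4x3 patches per image, as placeholder data.
patches = torch.randn(8, 16, 48)
model = TinyAutoregressivePatchModel()
pred = model(patches)

# Autoregressive objective: predict patch t+1 from patches up to t.
loss = F.mse_loss(pred[:, :-1], patches[:, 1:])
loss.backward()
```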

Pavankumar Vasu (@pavankumarvasu) 's Twitter Profile Photo

📢 Presenting our app for real-time zero-shot image classification using MobileCLIP! Fully open-source—code & models available for everyone to explore. Check it out here: github.com/apple/ml-mobil… with - David Koski, Travis Trotto, Megan Maher Welsh & Hugues Thomas

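For anyone curious how the zero-shot part works under the hood: a CLIP-style model embeds the image and one text prompt per candidate label into a shared space, and the label with the highest cosine similarity wins. The sketch below uses placeholder embeddings and a generic function rather than the actual API from the linked ml-mobileclip repo.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features, text_features, labels, scale=100.0):
    """Score one image embedding against per-label text embeddings, CLIP-style."""
    image_features = F.normalize(image_features, dim=-1)   # (1, d)
    text_features = F.normalize(text_features, dim=-1)     # (num_labels, d)
    logits = scale * image_features @ text_features.T      # scaled cosine similarities
    probs = logits.softmax(dim=-1).squeeze(0)
    return labels[int(probs.argmax())], probs

# Placeholder embeddings; in the real app these come from MobileCLIP's image and
# text encoders applied to the camera frame and to prompts like "a photo of a dog".
labels = ["cat", "dog", "car"]
image_features = torch.randn(1, 512)
text_features = torch.randn(len(labels), 512)
print(zero_shot_classify(image_features, text_features, labels))
```
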
Avi (@avischiffmann) 's Twitter Profile Photo

Truffle’s aesthetics are peak. Design that transcends utility and becomes ubiquitous furniture. Your goal should be to make movies that don’t feature your work look anachronistic. simp 4 satoshi is on that path. An inspiration ⭐️⭐️⭐️⭐️⭐️

Tali Dekel (@talidekel) 's Twitter Profile Photo

Understanding the inner workings of foundation models is key for unlocking their full potential. While the research community has explored this for LLMs, CLIP, and text-to-image models, it's time to turn our focus to VLMs. Let's dive in! 🌟 vision-of-vlm.github.io

Justin Johnson (@jcjohnss) 's Twitter Profile Photo

Today we're sharing our first research update from World Labs -- a generative model of 3D worlds! I'm super proud of what the team has achieved so far, and can't wait to see what comes next. Lifting GenAI to 3D will change the way we make media, from movies to games and more!

Andreas Steiner (@andreaspsteiner) 's Twitter Profile Photo

🚀🚀PaliGemma 2 is our updated and improved PaliGemma release using the Gemma 2 models and providing new pre-trained checkpoints for the full cross product of {224px,448px,896px} resolutions and {3B,10B,28B} model sizes. 1/7

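The "full cross product" amounts to nine pre-trained checkpoints; enumerating it makes that explicit. The tuples below are just resolution/size combinations, not the released checkpoint names.

```python
from itertools import product

resolutions = ["224px", "448px", "896px"]
model_sizes = ["3B", "10B", "28B"]

# Full cross product of model sizes and resolutions -> 9 pre-trained checkpoints.
checkpoints = list(product(model_sizes, resolutions))
print(len(checkpoints))   # 9
print(checkpoints)        # [('3B', '224px'), ('3B', '448px'), ('3B', '896px'), ...]
```
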
Tim Brooks (@_tim_brooks) 's Twitter Profile Photo

Gemini 2.0 Flash has native image outputs! Congrats to the awesome team that built it. I find the example at 1:15 super cool: to change the car's color and add beach gear, the model generates two images step-by-step using visual chain of thought. youtube.com/watch?v=7RqFLp…

Peter Tong (@tongpetersb) 's Twitter Profile Photo

This project really changed how I think about multimodal models and LLMs. I used to believe that multimodal (visual) prediction required significant changes to the model and heavy pretraining, like Chameleon. But surprisingly, the opposite is true! In large autoregressive models,

Daniel Blasko (@blskdan) 's Twitter Profile Photo

Neat approach to more flexible and steerable token-based image generation! Seems to lead to noteworthy instruction- and task-level zero-shot capabilities: huggingface.co/papers/2412.18…

Simo Ryu (@cloneofsimo) 's Twitter Profile Photo

So if you are a typical ML researcher, you've had this question for eternity: "I want a small, powerful model: should we train a large model and distill? Or should we train a small model from scratch?" This new Apple paper's conclusion: it's complicated, but maybe yes, depending on your

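As background for the question: the "train large, then distill" route typically means knowledge distillation, where a small student is trained to match the teacher's softened output distribution alongside the hard labels. The sketch below shows that standard loss only; it is not the recipe or the conclusion from the Apple paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD loss: KL to the teacher's softened distribution + hard-label CE."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy batch: 4 examples, 10 classes; the teacher would normally be a larger frozen model.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```
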
merve (@mervenoyann) 's Twitter Profile Photo

we just dropped SmolVLM2: world's smollest video models in 256M, 500M and 2.2B ⏯️🤗

we also release the following 🔥
> an iPhone app (runs on 500M model in MLX)
> integration with VLC for segmentation of descriptions (2.2B)
> a highlights extractor (2.2B)

Xiaohua Zhai (@xiaohuazhai) 's Twitter Profile Photo

Introducing SigLIP2: now trained with additional captioning and self-supervised losses!

Stronger everywhere:
- multilingual
- cls. / ret.
- localization
- ocr
- captioning / vqa

Try it out, backward compatible!

Models: github.com/google-researc…

Paper: arxiv.org/abs/2502.14786
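
As a reminder of what distinguishes SigLIP-style training from the original CLIP objective: each image-text pair is scored as an independent binary (sigmoid) problem rather than through a batch-wide softmax. A minimal sketch of that pairwise loss is below, with the learnable temperature and bias fixed to constants; the additional captioning and self-supervised losses that SigLIP2 introduces are not shown.

```python
import torch
import torch.nn.functional as F

def siglip_style_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every image-text pair is an independent binary problem.

    Matching pairs (the diagonal) get label +1, all other pairs get label -1.
    t and b stand in for the learnable temperature and bias of the SigLIP objective.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = t * image_emb @ text_emb.T + b          # (n, n) pairwise scores
    labels = 2 * torch.eye(logits.size(0)) - 1       # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

image_emb = torch.randn(8, 256, requires_grad=True)
text_emb = torch.randn(8, 256)
loss = siglip_style_loss(image_emb, text_emb)
loss.backward()
```
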
Pavlo Molchanov (@pavlomolchanov) 's Twitter Profile Photo

Not all visual tokens are important. We present new work on efficient token selection driven by the text prompt in VLMs. We train a vision encoder in a CLIP-like setting with local/global contrastive loss. Once trained, the model can output a heatmap of interest given a text

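The selection mechanism described above can be sketched generically: score each patch embedding against the text-prompt embedding, treat the scores as the "heatmap of interest", and keep only the top-scoring tokens. Shapes and function names below are illustrative assumptions, not the paper's actual model or training setup.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(patch_emb, text_emb, keep=16):
    """Keep the k visual tokens most similar to the text prompt.

    patch_emb: (num_patches, d) local patch embeddings from the vision encoder.
    text_emb:  (d,) embedding of the text prompt.
    Returns the kept tokens and a per-patch relevance map (the "heatmap of interest").
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    relevance = patch_emb @ text_emb                  # (num_patches,) cosine similarities
    top = relevance.topk(keep).indices
    return patch_emb[top], relevance

# 14x14 grid of patch tokens and one prompt embedding, as placeholders.
patch_emb = torch.randn(196, 512)
text_emb = torch.randn(512)
kept, heatmap = select_visual_tokens(patch_emb, text_emb, keep=32)
print(kept.shape, heatmap.view(14, 14).shape)
```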