Daniel Blasko (@blskdan) 's Twitter Profile
Daniel Blasko

@blskdan

ML @snap for @spectacles, prev. @canva, building user-facing ML experiences that feel like magic ✨. Interested in multimodality, VLMs, systems & MLOps.

ID: 1150171710127312896

https://www.dblasko.fr · Joined 13-07-2019 22:34:39

235 Tweets

244 Followers

2.2K Following

Gaurav Misra (@gmharhar) 's Twitter Profile Photo

Today, we’re announcing a launch that culminates what we’ve believed since starting Captions: AI will fundamentally change video editing. We’re calling it AI Edit. Imagine if AI could take your unfinished video and give you back a fully edited version faster than it took

Jiachen Li (@jiachenli11) 's Twitter Profile Photo

Met many folks at ICML this year. We all agree that the key to the success of RLHF in training LLMs is the HF instead of the RL. Overall, the RL in RLHF only acts as a gradient estimator to address the non-differentiability of the "sampling operation" from a categorical
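
For context on the "gradient estimator" point: sampling a token from a categorical distribution blocks gradients, and the score-function (REINFORCE) trick recovers them by weighting log-probabilities with the reward. A minimal PyTorch sketch of that estimator is below; names, shapes, and rewards are illustrative, not taken from any particular RLHF implementation.

```python
import torch
import torch.nn.functional as F

# Toy policy: logits over a small vocabulary (stand-in for an LLM's next-token head).
logits = torch.randn(4, 8, requires_grad=True)                 # (batch, vocab)
probs = F.softmax(logits, dim=-1)

# The sample itself is non-differentiable: no gradient flows through multinomial().
actions = torch.multinomial(probs, num_samples=1).squeeze(-1)  # (batch,)

# Hypothetical scalar rewards, e.g. from a reward model scoring the sampled tokens.
rewards = torch.randn(4)

# Score-function (REINFORCE) estimator:
#   grad E[r(x)] = E[ r(x) * grad log p(x) ]
# so we minimise -(reward * log-prob) and let autograd supply grad log p.
log_probs = F.log_softmax(logits, dim=-1)
chosen = log_probs.gather(1, actions.unsqueeze(-1)).squeeze(-1)
loss = -(rewards * chosen).mean()
loss.backward()

print(logits.grad.shape)  # gradients reach the policy despite the discrete sampling step
```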

Magic (@magicailabs) 's Twitter Profile Photo

LTM-2-Mini is our first model with a 100 million token context window. That’s 10 million lines of code, or 750 novels. Full blog: magic.dev/blog/100m-toke… Evals, efficiency, and more ↓
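
The equivalences in the tweet check out with rough per-unit token counts; the ~10 tokens per line of code and ~133k tokens per novel used below are assumptions made for the arithmetic, not figures from the blog post.

```python
context_tokens = 100_000_000

tokens_per_code_line = 10       # assumption: ~10 tokens per line of source code
tokens_per_novel = 133_000      # assumption: ~100k-word novel at ~1.3 tokens per word

print(context_tokens // tokens_per_code_line)  # 10,000,000 lines of code
print(context_tokens // tokens_per_novel)      # ~751 novels
```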

Andrew Carr (e/🤸) (@andrew_n_carr) 's Twitter Profile Photo

Source: research.nvidia.com/labs/dir/cosmo… Legitly amazing image and video tokenizers. Probably one of the best Nvidia releases recently. Lots of juicy details here. Especially the two stage training on reconstruction then optical flow.

Alaa El-Nouby (@alaa_nouby) 's Twitter Profile Photo

𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔 Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding. github.com/apple/ml-aim (🧵)
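
For readers new to the idea: autoregressive pre-training for vision treats an image as a sequence of patches and trains a causally masked model to predict each next patch from the previous ones. The sketch below only illustrates that objective with a toy transformer; it is not the AIMv2 architecture, scale, or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAutoregressivePatchModel(nn.Module):
    """Toy next-patch predictor: a causally masked transformer over image patches."""

    def __init__(self, patch_dim=48, width=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, width)
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(width, patch_dim)   # regress the pixels of the next patch

    def forward(self, patches):                   # patches: (batch, seq, patch_dim)
        seq = patches.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(patches), mask=causal)
        return self.head(h)

# 16 flattened 4x4x3 patches per image, as placeholder data.
patches = torch.randn(8, 16, 48)
model = TinyAutoregressivePatchModel()
pred = model(patches)

# Autoregressive objective: predict patch t+1 from patches up to t.
loss = F.mse_loss(pred[:, :-1], patches[:, 1:])
loss.backward()
```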

Pavankumar Vasu (@pavankumarvasu) 's Twitter Profile Photo

📢 Presenting our app for real-time zero-shot image classification using MobileCLIP! Fully open-source—code & models available for everyone to explore. Check it out here: github.com/apple/ml-mobil… with - David Koski, Travis Trotto, Megan Maher Welsh & Hugues Thomas

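For anyone curious how the zero-shot part works under the hood: a CLIP-style model embeds the image and one text prompt per candidate label into a shared space, and the label with the highest cosine similarity wins. The sketch below uses placeholder embeddings and a generic function rather than the actual API from the linked ml-mobileclip repo.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features, text_features, labels, scale=100.0):
    """Score one image embedding against per-label text embeddings, CLIP-style."""
    image_features = F.normalize(image_features, dim=-1)   # (1, d)
    text_features = F.normalize(text_features, dim=-1)     # (num_labels, d)
    logits = scale * image_features @ text_features.T      # scaled cosine similarities
    probs = logits.softmax(dim=-1).squeeze(0)
    return labels[int(probs.argmax())], probs

# Placeholder embeddings; in the real app these come from MobileCLIP's image and
# text encoders applied to the camera frame and to prompts like "a photo of a dog".
labels = ["cat", "dog", "car"]
image_features = torch.randn(1, 512)
text_features = torch.randn(len(labels), 512)
print(zero_shot_classify(image_features, text_features, labels))
```
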
Avi (@avischiffmann) 's Twitter Profile Photo

Truffle’s aesthetics are peak. Design that transcends utility and becomes ubiquitous furniture. Your goal should be to make movies that don’t feature your work look anachronistic. simp 4 satoshi is on that path. An inspiration ⭐️⭐️⭐️⭐️⭐️

Tali Dekel (@talidekel) 's Twitter Profile Photo

Understanding the inner workings of foundation models is key for unlocking their full potential. While the research community has explored this for LLMs, CLIP, and text-to-image models, it's time to turn our focus to VLMs. Let's dive in! 🌟 vision-of-vlm.github.io

Justin Johnson (@jcjohnss) 's Twitter Profile Photo

Today we're sharing our first research update from World Labs -- a generative model of 3D worlds! I'm super proud of what the team has achieved so far, and can't wait to see what comes next. Lifting GenAI to 3D will change the way we make media, from movies to games and more!

Andreas Steiner (@andreaspsteiner) 's Twitter Profile Photo

🚀🚀PaliGemma 2 is our updated and improved PaliGemma release using the Gemma 2 models and providing new pre-trained checkpoints for the full cross product of {224px,448px,896px} resolutions and {3B,10B,28B} model sizes. 1/7

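The "full cross product" amounts to nine pre-trained checkpoints; enumerating it makes that explicit. The tuples below are just resolution/size combinations, not the released checkpoint names.

```python
from itertools import product

resolutions = ["224px", "448px", "896px"]
model_sizes = ["3B", "10B", "28B"]

# Full cross product of model sizes and resolutions -> 9 pre-trained checkpoints.
checkpoints = list(product(model_sizes, resolutions))
print(len(checkpoints))   # 9
print(checkpoints)        # [('3B', '224px'), ('3B', '448px'), ('3B', '896px'), ...]
```
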
Tim Brooks (@_tim_brooks) 's Twitter Profile Photo

Gemini 2.0 Flash has native image outputs! Congrats to the awesome team that built it. I find the example at 1:15 super cool: to change the car's color and add beach gear, the model generates two images step-by-step using visual chain of thought. youtube.com/watch?v=7RqFLp…

Peter Tong (@tongpetersb) 's Twitter Profile Photo

This project really changed how I think about multimodal models and LLMs. I used to believe that multimodal (visual) prediction required significant changes to the model and heavy pretraining, like Chameleon. But surprisingly, the opposite is true! In large autoregressive models,

Daniel Blasko (@blskdan) 's Twitter Profile Photo

Neat approach to more flexible and steerable token-based image generation! Seems to lead to noteworthy instruction- and task-level zero-shot capabilities: huggingface.co/papers/2412.18…

Simo Ryu (@cloneofsimo) 's Twitter Profile Photo

So if you are a typical ML researcher, you've had this question for eternity: "I want a small, powerful model: should we train a large model and distill? Or should we train a small model from scratch?" This new Apple paper's conclusion: it's complicated, but maybe yes, depending on your

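As background for the question: the "train large, then distill" route typically means knowledge distillation, where a small student is trained to match the teacher's softened output distribution alongside the hard labels. The sketch below shows that standard loss only; it is not the recipe or the conclusion from the Apple paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD loss: KL to the teacher's softened distribution + hard-label CE."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy batch: 4 examples, 10 classes; the teacher would normally be a larger frozen model.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```
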
merve (@mervenoyann) 's Twitter Profile Photo

we just dropped SmolVLM2: world's smollest video models in 256M, 500M and 2.2B ⏯️🤗

we also release the following 🔥
> an iPhone app (runs on 500M model in MLX)
> integration with VLC for segmentation of descriptions (2.2B)
> a highlights extractor (2.2B)

Xiaohua Zhai (@xiaohuazhai) 's Twitter Profile Photo

Introducing SigLIP2: now trained with additional captioning and self-supervised losses!

Stronger everywhere:
- multilingual
- cls. / ret.
- localization
- ocr
- captioning / vqa

Try it out, backward compatible!

Models: github.com/google-researc…

Paper: arxiv.org/abs/2502.14786
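
As a reminder of what distinguishes SigLIP-style training from the original CLIP objective: each image-text pair is scored as an independent binary (sigmoid) problem rather than through a batch-wide softmax. A minimal sketch of that pairwise loss is below, with the learnable temperature and bias fixed to constants; the additional captioning and self-supervised losses that SigLIP2 introduces are not shown.

```python
import torch
import torch.nn.functional as F

def siglip_style_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every image-text pair is an independent binary problem.

    Matching pairs (the diagonal) get label +1, all other pairs get label -1.
    t and b stand in for the learnable temperature and bias of the SigLIP objective.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = t * image_emb @ text_emb.T + b          # (n, n) pairwise scores
    labels = 2 * torch.eye(logits.size(0)) - 1       # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

image_emb = torch.randn(8, 256, requires_grad=True)
text_emb = torch.randn(8, 256)
loss = siglip_style_loss(image_emb, text_emb)
loss.backward()
```
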
Pavlo Molchanov (@pavlomolchanov) 's Twitter Profile Photo

Not all visual tokens are important. We present new work on efficient token selection driven by the text prompt in VLMs. We train a vision encoder in a CLIP-like setting with local/global contrastive loss. Once trained, the model can output a heatmap of interest given a text

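The selection mechanism described above can be sketched generically: score each patch embedding against the text-prompt embedding, treat the scores as the "heatmap of interest", and keep only the top-scoring tokens. Shapes and function names below are illustrative assumptions, not the paper's actual model or training setup.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(patch_emb, text_emb, keep=16):
    """Keep the k visual tokens most similar to the text prompt.

    patch_emb: (num_patches, d) local patch embeddings from the vision encoder.
    text_emb:  (d,) embedding of the text prompt.
    Returns the kept tokens and a per-patch relevance map (the "heatmap of interest").
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    relevance = patch_emb @ text_emb                  # (num_patches,) cosine similarities
    top = relevance.topk(keep).indices
    return patch_emb[top], relevance

# 14x14 grid of patch tokens and one prompt embedding, as placeholders.
patch_emb = torch.randn(196, 512)
text_emb = torch.randn(512)
kept, heatmap = select_visual_tokens(patch_emb, text_emb, keep=32)
print(kept.shape, heatmap.view(14, 14).shape)
```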