Orr Zohar @ ICLR’25 (@orr_zohar) 's Twitter Profile
Orr Zohar @ ICLR’25

@orr_zohar

PhD Student @Stanford • Researching large multimodal models • @KnightHennessy scholar • Advised by @yeung_levy

ID: 1659236939088936961

Link: https://orrzohar.github.io/ · Joined: 18-05-2023 16:38:24

79 Tweets

279 Followers

169 Following

Pedro Cuenca (@pcuenq) 's Twitter Profile Photo

smol code update for HuggingSnap, huge impact: we added VoiceOver support so more people can use it more easily. A visual local assistant always in your pocket has many use cases, and it can also be a great help for people with low vision. Reminder: local, private, open source.

Alejandro Lozano (@ale9806_) 's Twitter Profile Photo

Earlier this year, we released the BIOMEDICA dataset, featuring 24 million unique image-caption pairs and 30 million image references derived from open-source biomedical literature. It's been great to see the community engaging with it; we're currently seeing around 6K downloads
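For readers who want to poke at the dataset, here is a minimal sketch of streaming it with the 🤗 datasets library. The repository id below is a placeholder, not the dataset's confirmed Hub name, and the record fields will depend on the actual release.

```python
# Minimal sketch: streaming a large image-caption dataset from the Hugging Face Hub.
# NOTE: "BIOMEDICA/biomedica" is a placeholder repo id, not a confirmed name.
from datasets import load_dataset

ds = load_dataset("BIOMEDICA/biomedica", split="train", streaming=True)

# Peek at a few records without downloading the full 24M-pair corpus.
for example in ds.take(3):
    # Field names depend on the release; we just inspect what each record contains.
    print({key: type(value).__name__ for key, value in example.items()})
```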
Peter Tong (@tongpetersb) 's Twitter Profile Photo

Vision models have been smaller than language models; what if we scale them up?

Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
Orr Zohar @ ICLR’25 (@orr_zohar) 's Twitter Profile Photo

Excited to see SmolVLM powering BMC-SmolVLM in the latest BIOMEDICA update! At just 2.2B params, it matches 7-13B biomedical VLMs. Check out the full release: Hugging Face #smolvlm

Andi Marafioti (@andimarafioti) 's Twitter Profile Photo

We are so back with Hugging Face’s Smol models 🚀
Usage doubled 🔥 and we’re now at 110k+ MAU 📈
SmolLM, SmolVLM, SmolDocling — all coming together 💫
Huge thanks to everyone building with us 💛 Let’s keep it growing 💪✨
Andi Marafioti (@andimarafioti) 's Twitter Profile Photo

Today, we share the tech report for SmolVLM: Redefining small and efficient multimodal models.
🔥 Explaining how to design a tiny 256M VLM that uses less than 1GB of RAM and outperforms our 80B models from 18 months ago!

Here are the coolest insights from our experiments: ✨
merve (@mervenoyann) 's Twitter Profile Photo

SmolVLM paper is out 🔥 It's one of my favorite papers since it contains a ton of findings on training a good smol model 🤯 Andi Marafioti summarized it here ⤵️

Orr Zohar @ ICLR’25 (@orr_zohar) 's Twitter Profile Photo

🤗 The SmolVLM report is out, with all the experiments, findings, and insights that led to high performance at tiny sizes 🤏
📱 These models can run on most mobile/edge devices.
📖 Give it a look!
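As a concrete pointer for trying one of these checkpoints, here is a minimal inference sketch with the transformers library. The checkpoint id HuggingFaceTB/SmolVLM-256M-Instruct and the processor/chat-template calls follow the usual SmolVLM release pattern and are assumptions here, not details taken from the tweet.

```python
# Minimal sketch: running a small SmolVLM checkpoint with transformers.
# The model id and chat-template usage are assumptions, not from the tweet.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")
messages = [
    {"role": "user",
     "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```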
Andi Marafioti (@andimarafioti) 's Twitter Profile Photo

We are using the most popular open-source reproducible evaluation (OpenCompass). I actually reached out to moondream and asked them to update their eval since it's 6 months old and their internal evaluations claim to be way higher 🤷‍♂️
Andi Marafioti (@andimarafioti) 's Twitter Profile Photo

Eric Lee vik The values we report can be corroborated with the open-source evaluation from OpenCompass. The model in the table you're highlighting is SmolVLM2 (huggingface.co/spaces/opencom…). We don't know how moondream got those evaluations for SmolVLM; I guess they ran their own evals.

Luis (@lusxvr) 's Twitter Profile Photo

Today, we are open-sourcing nanoVLM, a pure PyTorch library to train a Vision-Language Model from scratch in 750 lines of code.
Training on one H100 for 6h, we get 35.3% on MMStar, matching SmolVLM-256M, which was trained with 100x more GPU hours. 👀
Even in a FREE Google Colab,
Thomas Wolf (@thom_wolf) 's Twitter Profile Photo

New open-source drop from the HF team - nanoVLM

A super tight codebase to learn/train VLMs with good performance - inspired by Andrej Karpathy's nanoGPT

750 lines of PyTorch code. Training a 222M-parameter nanoVLM for 6 hours on a single H100 reaches 35.3% on MMStar, matching the score of SmolVLM-256M, which was trained with 100x more GPU hours.
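To make the "vision encoder + projector + language decoder" recipe behind these small VLMs concrete, here is a heavily simplified PyTorch skeleton. It is an illustrative sketch only, not code from the nanoVLM repository; all class names, dimensions, and the assumed embed_tokens/inputs_embeds interfaces are made up for the example.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative skeleton of a small vision-language model:
    a vision encoder, a linear projector into the LM embedding space,
    and a causal language decoder. Not the actual nanoVLM code."""

    def __init__(self, vision_encoder, language_model, vision_dim=768, lm_dim=576):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a ViT returning patch features
        self.projector = nn.Linear(vision_dim, lm_dim)  # maps patch features into LM token space
        self.language_model = language_model            # assumed to accept inputs_embeds

    def forward(self, pixel_values, input_ids):
        # Encode the image into a sequence of patch embeddings.
        patch_feats = self.vision_encoder(pixel_values)             # (B, N_patches, vision_dim)
        image_tokens = self.projector(patch_feats)                  # (B, N_patches, lm_dim)

        # Embed the text tokens and prepend the projected image tokens.
        text_embeds = self.language_model.embed_tokens(input_ids)   # (B, N_text, lm_dim)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)

        # The decoder predicts the next text token given image + text context.
        return self.language_model(inputs_embeds=inputs_embeds)
```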
Miquel Farré (@micuelll) 's Twitter Profile Photo

WE ARE COOKING!! I’m looking for a creative engineer to join the ride 🤩 If that’s you, send me a message 🚀 You should be someone who learns tools fast, builds scrappy hacks when needed, and focuses on what works. You might be working in the space of media, image/video