NoAI (@onlyyouuu8)'s Twitter Profile
NoAI

@onlyyouuu8

Smashing every Twitter-famous account, no matter who they are

ID: 1337021952884543498

Joined: 10-12-2020 13:10:57

3.3K Tweets

41 Followers

161 Following

David Fan (@davidjfan)

[1/9] What happens when you treat vision as a first-class citizen during multimodal pretraining? To find out, we studied the design space of training Transfusion-style models that input and output all modalities, from scratch. Here is what we learned about visual representations,

Klára Janoušková (@klaaracz)

📝 Multimodal Large Language Models as Image Classifiers  

MLLMs are increasingly used for visual tasks, but evaluating their image classification ability has produced conflicting conclusions.

w. N. Kisel, Illia Volkov 🇺🇦, J. Matas

#CVPR26 findings
Kaiser Sun (@kaiserwholearns)

Multimodal LLMs can read text in images, but why do they often perform worse than when the same text is given as tokens? Our work studies the modality gap of models perceiving text as pixels and shows how to close it.
📄 arxiv.org/abs/2603.09095
🧵👇  #NLProc #LLM #ComputerVision
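
A minimal sketch of the "text as pixels" setup, not the paper's code: render the same question to an image with Pillow so a multimodal model can be queried once with tokens and once with pixels. The canvas size, prompt, and output path are assumptions.

```python
# Minimal sketch (not from the paper): present the same text either as tokens
# (a plain string) or as pixels (a rendered image), so a multimodal model can
# be queried with both and the gap between its answers can be measured.
from PIL import Image, ImageDraw

def text_as_pixels(text: str, width: int = 512, height: int = 64) -> Image.Image:
    """Render `text` onto a white canvas using PIL's default bitmap font."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), text, fill="black")  # default font; swap in a TTF for higher fidelity
    return img

question = "What is the capital of France? Answer in one word."

# Condition A: the model receives the question as ordinary text tokens.
token_input = {"text": question, "image": None}

# Condition B: the model receives the same question only as an image.
pixel_input = {"text": "Answer the question shown in the image.",
               "image": text_as_pixels(question)}

pixel_input["image"].save("question_as_pixels.png")
print(token_input["text"])
```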
Anshul Shah (@anshul__shah)

Excited to share our latest research on limitations of RL-finetuned VLMs! We investigate the robustness of model responses and consistency of CoT to textual perturbations. Work led by Rosie Zhao during her internship with the Multimodal Machine Intelligence team at Apple.

AVB (@neural_avb)

My essay on Vision Language Models just went live! 

Massive thanks to the Towards Data Science team for featuring it on their Deep Dives section.

Read it here: towardsdatascience.com/how-vision-lan…

Conceptual deep dive, git repo, video tutorial - it's all in there!
Faheem Ullah (@faheem_uh)

PhD Students - How to generate a graphical abstract in seconds?
1. Go to researchviz.io
2. Copy and paste your abstract
3. ResearchViz will generate the graphical abstract. You can download it as a PPT and make any changes.

Niels Rogge (@nielsrogge)

Quick video showing how to run GLM-OCR 100% locally! I cover the following things:
> llama.cpp
> GGUF
> LM Studio
> OpenAI-compatible Python API
> transcribe tables, images, and more
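
For the OpenAI-compatible Python API step, a request against a local LM Studio server might look roughly like this. The port, placeholder API key, model id, and prompt are assumptions, not values taken from the video.

```python
# Sketch of calling a locally served OCR-capable model through an
# OpenAI-compatible endpoint (e.g. LM Studio's local server).
# The base_url, api_key placeholder, and model name below are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glm-ocr",  # hypothetical local model id; use whatever your server lists
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this page, preserving tables as Markdown."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```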

AK (@_akhaliq)

Multimodal OCR

Parse Anything from Documents

On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured
Adina Yakup (@adinayakup)

Another OCR model just dropped on Hugging Face (so many OCRs lately!)

dots.mocr from rednote Hi Lab looks really impressive on the benchmarks.

-Model: huggingface.co/collections/re…
-Paper: huggingface.co/papers/2603.13…

✨ 3B 
✨ Multilingual support
✨ Converts charts, diagrams, and
Daniel van Strien (@vanstriendaniel)

Bunch of new open OCR models recently — all available as uv scripts on Hugging Face.

19 models from 0.9B–8B. Some standouts:

- Qianfan-OCR - 192 languages
- dots.mocr — charts/figures → editable SVG
- GLM-OCR — 94.6% accuracy, only 0.9B params
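
For context, "uv scripts" are standalone Python files carrying inline dependency metadata (PEP 723) that uv run executes in a throwaway environment. The skeleton below is a generic sketch of that format; the dependency list, filename, and OCR placeholder are hypothetical, not copied from any script on the Hub.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pillow",
# ]
# ///
# Generic uv-script skeleton: `uv run ocr_demo.py page.png` builds an isolated
# environment with the dependencies above, then runs this file. The actual OCR
# scripts on the Hub pin their own model libraries instead of Pillow alone.
import sys
from PIL import Image

def main() -> None:
    path = sys.argv[1] if len(sys.argv) > 1 else "page.png"
    image = Image.open(path)
    print(f"Loaded {path}: {image.size[0]}x{image.size[1]} pixels")
    # ...hand the image to whichever OCR model the script targets...

if __name__ == "__main__":
    main()
```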
Akshay 🚀 (@akshay_pachaar)

Everyone is sleeping on this new OCR model!
- 85.9% (SOTA) on olmOCR Bench
- 90+ language support w/ benchmarks
- 4B model (down from 9B)
- Full layout information
- Extracts + captions images and diagrams
- Strong handwriting, math, form, table support
100% open-source.

Rimsha Bhardwaj (@heyrimsha)

🚨BREAKING: A dev just open-sourced the #1 ranked OCR model on Earth.

It's called GLM-OCR and it just hit 94.62 on OmniDocBench V1.5, beating every OCR model in existence.

Only 0.9B parameters. One pip install. Handles documents no other model could touch.

100% Open Source.
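
A generic sketch of how a small OCR checkpoint from the Hugging Face Hub is typically loaded with transformers after that pip install; the repository id, prompt, and exact processor call are assumptions rather than GLM-OCR's documented interface, so check the model card for the real usage snippet.

```python
# Generic sketch of running a small image-to-text OCR checkpoint from the
# Hugging Face Hub with transformers. The repo id and prompt below are
# placeholders/assumptions, not GLM-OCR's documented interface.
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

repo_id = "org/ocr-model"  # hypothetical; substitute the actual model card id
processor = AutoProcessor.from_pretrained(repo_id)
model = AutoModelForImageTextToText.from_pretrained(repo_id)

image = Image.open("invoice.png")
inputs = processor(images=image, text="Transcribe this document.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```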
Javier Ferrando (@javifer_96)

Can language models explain features learned by vision encoders? #CVPR2026
- Feed a blank image
- Steer a specific feature in the vision encoder
- Ask the language model to explain the image
The model explains the feature itself.
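
Mechanically, the steering step amounts to adding a scaled feature direction to one layer's activations via a forward hook. The toy encoder, feature index, and strength below are stand-ins chosen for illustration, not the paper's models or values.

```python
# Toy illustration of activation steering with a forward hook: push one
# layer's output along a chosen feature direction while a blank image is
# processed. The tiny encoder here is a stand-in for a real vision encoder.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 64))

feature_direction = torch.zeros(64)
feature_direction[7] = 1.0   # hypothetical "feature #7" to amplify
alpha = 5.0                  # steering strength

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces that layer's output.
    return output + alpha * feature_direction

hook = encoder[1].register_forward_hook(steer)   # hook the first Linear layer

blank_image = torch.zeros(1, 3, 32, 32)          # "feed a blank image"
steered_embedding = encoder(blank_image)
hook.remove()

baseline_embedding = encoder(blank_image)
print("max change in final embedding:",
      (steered_embedding - baseline_embedding).abs().max().item())
# In the actual setup, the steered visual embedding would be passed to the
# language model, which is then asked to describe what it "sees".
```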

DailyPapers (@huggingpapers)

NVIDIA just released Nemotron OCR v2 on Hugging Face A production-ready multilingual OCR system with a hybrid detector-recognizer architecture for text, layout and reading order. huggingface.co/nvidia/nemotro…

Guanyu Zhou (@tmartyr4951)

It's time to systematically teach VLMs to see with synthetic images!

We built VisionFoundry, a simple, intuitive framework that generates synthetic image datasets from only a task name.

10k synthetic samples → over +10% improvement on visual perception benchmarks 👀
Jerry Liu (@jerryjliu0)

Parsing complex tables in PDFs is extremely challenging.

Existing metrics for measuring table accuracy, like TEDS (tree edit distance similarity), overweight exact table structure and underweight semantic correctness.

🚫 Overweight: If the rows within a table are out of order -
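
A toy illustration of that structure-versus-semantics tension (not the TEDS algorithm itself): the two tables below contain exactly the same rows in a different order, so a sequence-sensitive similarity drops noticeably while a content-level check still reports a perfect match.

```python
# Toy illustration (not TEDS): the same table cells in a different row order
# score poorly under a sequence-sensitive similarity, even though the
# row-level content is identical.
from difflib import SequenceMatcher

table_a = [["name", "score"], ["alice", "0.91"], ["bob", "0.87"]]
table_b = [["name", "score"], ["bob", "0.87"], ["alice", "0.91"]]  # rows swapped

def flatten(table):
    return "|".join(cell for row in table for cell in row)

structure_score = SequenceMatcher(None, flatten(table_a), flatten(table_b)).ratio()
content_match = sorted(map(tuple, table_a)) == sorted(map(tuple, table_b))

print(f"sequence similarity: {structure_score:.2f}")   # penalised by the reordering
print(f"same rows, any order: {content_match}")        # True: semantically equivalent
```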
Sophia Sirko-Galouchenko (@sophia_sirko)

1/n New paper - V-GIFT 🎁

Self-supervised tasks like rotation prediction or colorization were big in 2018.
Do they still matter?

Yes.
We turn them into visual instruction tuning data for MLLMs.

Result: models rely more on the image and perform better on vision tasks 👀
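
As a rough sketch of the general recipe (not V-GIFT's actual pipeline): a rotation-prediction sample becomes an instruction-tuning triple by rotating the image and phrasing the label as a question and answer. The prompt wording below is invented for illustration.

```python
# Rough sketch (not V-GIFT's pipeline): convert a classic self-supervised
# rotation-prediction task into an (image, instruction, answer) triple for
# visual instruction tuning. Prompt wording is invented for illustration.
import random
from PIL import Image

def make_rotation_sample(image_path: str) -> dict:
    angle = random.choice([0, 90, 180, 270])
    # PIL rotates counter-clockwise for positive angles.
    image = Image.open(image_path).rotate(angle, expand=True)
    return {
        "image": image,
        "instruction": "By how many degrees has this image been rotated "
                       "counter-clockwise? Answer with 0, 90, 180, or 270.",
        "answer": str(angle),
    }

sample = make_rotation_sample("photo.jpg")
print(sample["instruction"], "->", sample["answer"])
```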
机器之心 JIQIZHIXIN (@synced_global)

Can vision-language models truly see the fine-grained details in images?  

Google DeepMind presents TIPSv2.  

They boost dense patch-text alignment using three novel tricks: a distillation method where the student outperforms the teacher, an upgraded masked image objective