NoAI (@onlyyouuu8)'s Twitter Profile
NoAI

@onlyyouuu8

Smashing every Twitter-famous account, no matter who they are

ID: 1337021952884543498

Joined: 10-12-2020 13:10:57

3.3K Tweets

41 Followers

161 Following

David Fan (@davidjfan)

[1/9] What happens when you treat vision as a first-class citizen during multimodal pretraining? To find out, we studied the design space of training Transfusion-style models that input and output all modalities, from scratch. Here is what we learned about visual representations,

Klára Janoušková (@klaaracz)

📝 Multimodal Large Language Models as Image Classifiers  

MLLMs are increasingly used for visual tasks, but evaluating their image classification ability has produced conflicting conclusions.

w. N. Kisel, Illia Volkov 🇺🇦, J. Matas

#CVPR26 findings
Kaiser Sun (@kaiserwholearns)

Multimodal LLMs can read text in images, but why do they often perform worse than when the same text is given as tokens? Our work studies the modality gap of models perceiving text as pixels and shows how to close it.
📄 arxiv.org/abs/2603.09095
🧵👇  #NLProc #LLM #ComputerVision
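
A minimal sketch of the "text as pixels" setup, not the paper's code: render the same question to an image with Pillow so a multimodal model can be queried once with tokens and once with pixels. The canvas size, prompt, and output path are assumptions.

```python
# Minimal sketch (not from the paper): present the same text either as tokens
# (a plain string) or as pixels (a rendered image), so a multimodal model can
# be queried with both and the gap between its answers can be measured.
from PIL import Image, ImageDraw

def text_as_pixels(text: str, width: int = 512, height: int = 64) -> Image.Image:
    """Render `text` onto a white canvas using PIL's default bitmap font."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), text, fill="black")  # default font; swap in a TTF for higher fidelity
    return img

question = "What is the capital of France? Answer in one word."

# Condition A: the model receives the question as ordinary text tokens.
token_input = {"text": question, "image": None}

# Condition B: the model receives the same question only as an image.
pixel_input = {"text": "Answer the question shown in the image.",
               "image": text_as_pixels(question)}

pixel_input["image"].save("question_as_pixels.png")
print(token_input["text"])
```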
Anshul Shah (@anshul__shah)

Excited to share our latest research on limitations of RL-finetuned VLMs! We investigate the robustness of model responses and consistency of CoT to textual perturbations. Work led by Rosie Zhao during her internship with the Multimodal Machine Intelligence team at Apple.

AVB (@neural_avb)

My essay on Vision Language Models just went live! 

Massive thanks to the Towards Data Science team for featuring it on their Deep Dives section.

Read it here: towardsdatascience.com/how-vision-lan…

Conceptual deep dive, git repo, video tutorial - it's all in there!
Faheem Ullah (@faheem_uh)

PhD Students - How to generate a graphical abstract in seconds?
1. Go to researchviz.io
2. Copy and paste your abstract
3. ResearchViz will generate the graphical abstract. You can download it as a PPT and make any changes.

Niels Rogge (@nielsrogge)

Quick video showing how to run GLM-OCR 100% locally! I cover the following things:
> llama.cpp
> GGUF
> LM Studio
> OpenAI-compatible Python API
> transcribe tables, images, and more
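
For the OpenAI-compatible Python API step, a request against a local LM Studio server might look roughly like this. The port, placeholder API key, model id, and prompt are assumptions, not values taken from the video.

```python
# Sketch of calling a locally served OCR-capable model through an
# OpenAI-compatible endpoint (e.g. LM Studio's local server).
# The base_url, api_key placeholder, and model name below are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glm-ocr",  # hypothetical local model id; use whatever your server lists
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this page, preserving tables as Markdown."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```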

AK (@_akhaliq)

Multimodal OCR

Parse Anything from Documents

On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured
Adina Yakup (@adinayakup)

Another OCR model just dropped on Hugging Face (so many OCRs lately!)

dots.mocr from rednote Hi Lab looks really impressive on the benchmarks.

-Model: huggingface.co/collections/re…
-Paper: huggingface.co/papers/2603.13…

✨ 3B 
✨ Multilingual support
✨ Converts charts, diagrams, and
Daniel van Strien (@vanstriendaniel)

Bunch of new open OCR models recently — all available as uv scripts on Hugging Face.

19 models from 0.9B–8B. Some standouts:

- Qianfan-OCR - 192 languages
- dots.mocr — charts/figures → editable SVG
- GLM-OCR — 94.6% accuracy, only 0.9B params
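
For context, "uv scripts" are standalone Python files carrying inline dependency metadata (PEP 723) that uv run executes in a throwaway environment. The skeleton below is a generic sketch of that format; the dependency list, filename, and OCR placeholder are hypothetical, not copied from any script on the Hub.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pillow",
# ]
# ///
# Generic uv-script skeleton: `uv run ocr_demo.py page.png` builds an isolated
# environment with the dependencies above, then runs this file. The actual OCR
# scripts on the Hub pin their own model libraries instead of Pillow alone.
import sys
from PIL import Image

def main() -> None:
    path = sys.argv[1] if len(sys.argv) > 1 else "page.png"
    image = Image.open(path)
    print(f"Loaded {path}: {image.size[0]}x{image.size[1]} pixels")
    # ...hand the image to whichever OCR model the script targets...

if __name__ == "__main__":
    main()
```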
Akshay 🚀 (@akshay_pachaar)

Everyone is sleeping on this new OCR model!
- 85.9% (SOTA) on olmOCR Bench
- 90+ language support w/ benchmarks
- 4B model (down from 9B)
- Full layout information
- Extracts + captions images and diagrams
- Strong handwriting, math, form, table support
100% open-source.

Rimsha Bhardwaj (@heyrimsha)

🚨BREAKING: A dev just open-sourced the #1 ranked OCR model on Earth.

It's called GLM-OCR and it just hit 94.62 on OmniDocBench V1.5, beating every OCR model in existence.

Only 0.9B parameters. One pip install. Handles documents no other model could touch.

100% Open Source.
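
A generic sketch of how a small OCR checkpoint from the Hugging Face Hub is typically loaded with transformers after that pip install; the repository id, prompt, and exact processor call are assumptions rather than GLM-OCR's documented interface, so check the model card for the real usage snippet.

```python
# Generic sketch of running a small image-to-text OCR checkpoint from the
# Hugging Face Hub with transformers. The repo id and prompt below are
# placeholders/assumptions, not GLM-OCR's documented interface.
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

repo_id = "org/ocr-model"  # hypothetical; substitute the actual model card id
processor = AutoProcessor.from_pretrained(repo_id)
model = AutoModelForImageTextToText.from_pretrained(repo_id)

image = Image.open("invoice.png")
inputs = processor(images=image, text="Transcribe this document.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```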
Javier Ferrando (@javifer_96)

Can language models explain features learned by vision encoders? #CVPR2026
- Feed a blank image
- Steer a specific feature in the vision encoder
- Ask the language model to explain the image
The model explains the feature itself.
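
Mechanically, the steering step amounts to adding a scaled feature direction to one layer's activations via a forward hook. The toy encoder, feature index, and strength below are stand-ins chosen for illustration, not the paper's models or values.

```python
# Toy illustration of activation steering with a forward hook: push one
# layer's output along a chosen feature direction while a blank image is
# processed. The tiny encoder here is a stand-in for a real vision encoder.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 64))

feature_direction = torch.zeros(64)
feature_direction[7] = 1.0   # hypothetical "feature #7" to amplify
alpha = 5.0                  # steering strength

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces that layer's output.
    return output + alpha * feature_direction

hook = encoder[1].register_forward_hook(steer)   # hook the first Linear layer

blank_image = torch.zeros(1, 3, 32, 32)          # "feed a blank image"
steered_embedding = encoder(blank_image)
hook.remove()

baseline_embedding = encoder(blank_image)
print("max change in final embedding:",
      (steered_embedding - baseline_embedding).abs().max().item())
# In the actual setup, the steered visual embedding would be passed to the
# language model, which is then asked to describe what it "sees".
```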

DailyPapers (@huggingpapers)

NVIDIA just released Nemotron OCR v2 on Hugging Face A production-ready multilingual OCR system with a hybrid detector-recognizer architecture for text, layout and reading order. huggingface.co/nvidia/nemotro…

Guanyu Zhou (@tmartyr4951)

It's time to systematically teach VLMs to see with synthetic images!

We built VisionFoundry, a simple, intuitive framework that generates synthetic image datasets from only a task name.

10k synthetic samples → over +10% improvement on visual perception benchmarks 👀
Jerry Liu (@jerryjliu0)

Parsing complex tables in PDFs is extremely challenging.

Existing metrics for measuring table accuracy, like TEDS (tree edit distance similarity), overweight exact table structure and underweight semantic correctness.

🚫 Overweight: If the rows within a table are out of order -
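
A toy illustration of that structure-versus-semantics tension (not the TEDS algorithm itself): the two tables below contain exactly the same rows in a different order, so a sequence-sensitive similarity drops noticeably while a content-level check still reports a perfect match.

```python
# Toy illustration (not TEDS): the same table cells in a different row order
# score poorly under a sequence-sensitive similarity, even though the
# row-level content is identical.
from difflib import SequenceMatcher

table_a = [["name", "score"], ["alice", "0.91"], ["bob", "0.87"]]
table_b = [["name", "score"], ["bob", "0.87"], ["alice", "0.91"]]  # rows swapped

def flatten(table):
    return "|".join(cell for row in table for cell in row)

structure_score = SequenceMatcher(None, flatten(table_a), flatten(table_b)).ratio()
content_match = sorted(map(tuple, table_a)) == sorted(map(tuple, table_b))

print(f"sequence similarity: {structure_score:.2f}")   # penalised by the reordering
print(f"same rows, any order: {content_match}")        # True: semantically equivalent
```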
Sophia Sirko-Galouchenko (@sophia_sirko)

1/n New paper - V-GIFT 🎁

Self-supervised tasks like rotation prediction or colorization were big in 2018.
Do they still matter?

Yes.
We turn them into visual instruction tuning data for MLLMs.

Result: models rely more on the image and perform better on vision tasks 👀
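
As a rough sketch of the general recipe (not V-GIFT's actual pipeline): a rotation-prediction sample becomes an instruction-tuning triple by rotating the image and phrasing the label as a question and answer. The prompt wording below is invented for illustration.

```python
# Rough sketch (not V-GIFT's pipeline): convert a classic self-supervised
# rotation-prediction task into an (image, instruction, answer) triple for
# visual instruction tuning. Prompt wording is invented for illustration.
import random
from PIL import Image

def make_rotation_sample(image_path: str) -> dict:
    angle = random.choice([0, 90, 180, 270])
    # PIL rotates counter-clockwise for positive angles.
    image = Image.open(image_path).rotate(angle, expand=True)
    return {
        "image": image,
        "instruction": "By how many degrees has this image been rotated "
                       "counter-clockwise? Answer with 0, 90, 180, or 270.",
        "answer": str(angle),
    }

sample = make_rotation_sample("photo.jpg")
print(sample["instruction"], "->", sample["answer"])
```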
机器之心 JIQIZHIXIN (@synced_global)

Can vision-language models truly see the fine-grained details in images?  

Google DeepMind presents TIPSv2.  

They boost dense patch-text alignment using three novel tricks: a distillation method where the student outperforms the teacher, an upgraded masked image objective