Wenqi Zhang (@spicysweet1859) Twitter Tweets • TwiCopy

Rohan Paul

a year ago

"Scale-up" is NOT dead. High-quality data is the true key to effective scaling, particularly textbook-level, high-quality knowledge corpora. The project in this paper collected massive online instructional videos and extracted keyframes and their corresponding audio

7 likes, 2 replies, 2 reposts

merve

@mervenoyann

a year ago

Alibaba released Multimodal Textbook: a new multimodal pre-training set from online instructional videos (22k hours) 🧑🏻‍🏫📕 6.5M images interleaved with 800k text on math, physics, chemistry 👏

663 likes, 13 replies, 111 reposts

AK

@_akhaliq

a year ago

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

227 likes, 5 replies, 51 reposts

Xin (Ted) Li

@lixin4ever

a year ago

🚀🚀🚀Announcing VideoLLaMA3, our latest MLLMs for image and video understanding:
- Highly capable 7B models: DocVQA: 94.9, MathVision: 26.2, VideoMME: 66.2/70.3, MLVU: 73.0
- Competitive 2B models for edge devices: MMMU: 45.3, VideoMME: 59.6/63.4
- Frontier-class video model

157 likes, 8 replies, 43 reposts

The AI Timeline

@theaitimeline

a year ago

🚨This week's top AI/ML research papers:
- DeepSeek-R1
- Kimi k1.5
- UI-TARS
- Can We Generate Images with CoT?
- Physics of Skill Learning
- Test-time regression
- SRMT
- Scaling Laws for Optimal Sparsity for MoE LMs
- Distillation Quantification for LLMs
- Autonomy-of-Experts

908 likes, 3 replies, 125 reposts

Mingyang Chen

@chen_mingyang

a year ago

🌟Introducing ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. An open-source project that combines RL and RAG for LLMs! 💡Like DeepSeek-R1-Zero and Deep Research, we start with pretrained models and use RL to empower them with the

349 likes, 4 replies, 54 reposts

AIGCLINK

@aigclink

a year ago

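An embodied-intelligence model released jointly by several universities and Alibaba: Embodied-Reasoner. It completes interactive tasks by combining visual search, reasoning, and action execution. It can perceive and understand its environment, and it can think and plan its way through complex tasks. It is strong on composite tasks, exceeding GPT-4o by 39.9%; its success rate is 9.6% higher than OpenAI o1, and its search efficiency is 12% higher than OpenAI o1.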

79 likes, 4 replies, 27 reposts

Rohan Paul

@rohanpaul_ai

a year ago

Reflections fix errors. Multi-step reflection keeps exploration consistent, ensuring very minimal wasted moves. Embodied tasks need vision-driven reasoning, but models often fail. This paper unifies observation, reflection, and action, offering consistent planning and

4 likes, 3 replies, 1 repost

Xin (Ted) Li

@lixin4ever

a year ago

🚨 NEW PAPER ALERT! Even TOP VLMs FAIL at ELEMENTARY SCHOOL MATH! 🧠❌ We present VCBench (huggingface.co/papers/2504.18…), revealing the ALARMING TRUTH: all of the latest vision-language models score BELOW 50% on BASIC math problems that 10-year-olds solve easily! 😱🤯 WHY? These

13 likes, 1 reply, 6 reposts

Xin Eric Wang @ ICLR 2025

@xwang_lk

a year ago

Humans think fluidly, navigating abstract concepts effortlessly, free from rigid linguistic boundaries. But current reasoning models remain constrained by discrete tokens, limiting their full

931 likes, 27 replies, 136 reposts

Jingyuan Qi

@jingyuan_qi

a year ago

🚀 Introducing AR-RAG: Autoregressive Retrieval Augmentation for Image Generation: arxiv.org/pdf/2506.06962 🔍 Dynamic patch-level retrieval during generation 🧠 Context-aware visual references that evolve with your image 📈 Significant gains on Midjourney, GenEval and DPG-Bench

17 likes, 5 replies, 6 reposts

Yongliang Shen

@itricktreat

7 months ago

Introducing EasySteer: High-performance LLM steering framework built on vLLM. Achieves 5.5-11.4× speedup over existing tools while maintaining 71-84% throughput. Paper: arxiv.org/abs/2509.25175 Code: github.com/ZJU-REAL/EasyS… HF Paper: huggingface.co/papers/2509.25…

12 likes, 4 replies, 5 reposts