Qian Liu (@sivil_taram)'s Twitter Profile
Qian Liu

@sivil_taram

Researcher @ TikTok 🇸🇬

📄 Sailor / StarCoder / OpenCoder
💼 Past: Research Scientist @SeaAIL; PhD @MSFTResearch
🧠 Contribution: @XlangNLP @BigCodeProject

ID: 1465140087193161734

Link: http://siviltaram.github.io/ | Joined: 29-11-2021 02:06:42

1.1K Tweets

3.3K Followers

674 Following

Ge Zhang (@gezhang86038849)

Is text-only information enough for LLM/VLM Web Agents? 🤔 Clearly not. 🙅‍♂️ The modern web is a rich tapestry of text, images 🖼️, and videos 🎥. To truly assist us, agents need to understand it all. That's why we built MM-BrowseComp. We're introducing MM-BrowseComp 🚀, a new …

Dynamics Lab (@dynamicslab_ai)

Introducing Mirage 2, a real-time, general-domain generative world engine you can play online. Upload any image (photos, concept art, classic paintings, kids' drawings) and step into it as a live, interactive world. Prompt your worlds with text to create any surreal scenes and …

Jia Guo (@jia__guo)

🚀 Is more data always better for your RL training? Not sure how to pick the “right” data? Check out our latest research!

Ge Zhang (@gezhang86038849)

Although I won't be able to be onsite personally, I'm glad to announce that M-A-P is co-organizing a meetup with Monolith, alongside co-hosts from the verl, SGLang, Zilliz, and Creao AI dev teams, to explore the latest advances in RL, RL infrastructure, reasoning, and agentic AI in …

Michael Qizhe Shieh (@mpulsewidth)

Introducing MCPMark, a collaboration with Eval Sys and LobeHub! We created a challenging benchmark to stress-test MCP use in comprehensive contexts.
- 127 high-quality data samples created by experts.
- GPT-5 takes the current lead and achieves a Pass@1 of 46.96% while the …

Deep Learning For Code @ ICLR'25 (@dl4code)

🚨 FINAL CALL: Only 2 days left to submit to the Deep Learning for Code in the Agentic Era (DL4C) workshop at NeurIPS 2025! 🗓 Deadline: Aug 27th, 11:59PM UTC-12. Amazing speaker lineup including experts from CMU, UC Berkeley, Replit, poolside, …

Yizhi Li (@yizhilll)

[1/n] Introducing TreePO🌲, a new RL framework for LLMs! It slashes sampling costs while boosting reasoning capabilities. Daily Paper: huggingface.co/papers/2508.17…

AK (@_akhaliq)

TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Qian Liu (@sivil_taram)

Thanks AK for sharing our work! 🔥 🧵 Back to Jan when we just started this project... we were living a nightmare 😩 Months of watching our multi-turn RL models collapse. Every. Single. Time. 💥 We thought we were doing something wrong... until we discovered other research …

Junxian He (@junxian_he)

Mirage or method? We re-assess a series of RL observations such as spurious reward, one-shot RL, test-time RL, and negative-sample training. 🧐 These approaches were all proved on Qwen+Math combination originally, but do they work in other settings? If not, under which …

Yingru Li (@richardyrli)

The SimpleTIR paper is officially out! We go beyond our July blog post to provide a deeper mathematical explanation and rigorous proof for why multi-turn RL agents are so unstable. The root cause? A predictable domino effect: OOD Tool Feedback → Low-Prob Tokens → Exploding …

Yu Su @#ICLR2025 (@ysu_nlp)

Computer Use: Modern Moravec's Paradox
A new blog post arguing why computer-use agents may be the biggest opportunity and challenge for AGI.
tinyurl.com/computer-use-a…
Table of Contents
> Moravec's Paradox
> Moravec's Paradox in 2025
> Computer use may be the biggest opportunity …

Ge Zhang (@gezhang86038849)

🌀 “Do As I Say, Not As You Were Trained!” Problem solved? ❌ Not by today's LLMs. We present Inverse IFEval: a new benchmark testing whether LLMs can follow counterintuitive instructions that deliberately break away from standard training patterns. 📊 Dataset: …

Qian Liu (@sivil_taram)

🤔 Is your LLM actually listening to you? Or just parroting its training data? New research shows that LLMs might be WAY more "stubborn" than you think! Check out the thread for more details ⬇️

𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8)

Mini-o3: Reproducing OpenAI o3-style multi-turn visual reasoning. Unlike prior VLMs stuck at 1–2 turns, Mini-o3 executes deep tool-based reasoning spanning tens of steps. What it proves is that the right data, init, and an RL tweak unlock long-horizon visual search, without …

Qian Liu (@sivil_taram)

Trained on a 6-turn cap but naturally scales to 32 turns at inference, with accuracy improving the deeper it thinks. Great work from Xin Lai and the team!