Tianjian Li (@tli104)'s Twitter Profile
Tianjian Li

@tli104

PhD student @jhuclsp, I work on data engineering for language models. Previously @nyuniversity.

ID: 1591342007347208192

Link: http://tianjianl.github.io
Joined: 12-11-2022 08:08:03

136 Tweets

235 Followers

508 Following

Rohan Paul (@rohanpaul_ai)

Cool new paper from @aiatmeta and brilliant idea.

Proposes DARLING, a training method that makes LLM answers both higher quality and more varied. 

Unlike traditional reinforcement learning, which only rewards a model for producing the single highest quality response, DARLING
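
The tweet above is cut off, but the core idea it describes is rewarding responses for being both high quality and different from one another. Below is a minimal sketch of that kind of combined reward, assuming a per-response quality score and a toy batch-level diversity signal; the function names and the multiplicative combination are illustrative, not DARLING's actual formulation.

```python
# Hedged sketch: combine per-response quality with a toy batch-level diversity bonus.
from collections import Counter


def diversity_scores(responses: list[str]) -> list[float]:
    """Score each response by how rare its (toy) signature is within the batch.

    The signature here is just the set of words; a real system would use a
    learned notion of which responses count as semantically distinct.
    """
    signatures = [frozenset(r.lower().split()) for r in responses]
    counts = Counter(signatures)
    return [1.0 / counts[sig] for sig in signatures]  # rarer in the batch -> higher score


def combined_rewards(responses: list[str], quality: list[float]) -> list[float]:
    """Multiply quality by diversity so only answers that are both good and
    different from their batch-mates get a high reward."""
    return [q * d for q, d in zip(quality, diversity_scores(responses))]


batch = [
    "Paris is the capital of France.",
    "Paris is the capital of France.",
    "France's capital city is Paris.",
]
print(combined_rewards(batch, quality=[1.0, 1.0, 0.9]))  # duplicates are discounted
```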
Thinking Machines (@thinkymachines)

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Weiting (Steven) Tan (@weiting_nlp)

I was curious about voice agents, particularly their ability to use tools and reason while listening to a user.

So we train voice agents by building a sandbox (based on tau-bench) that uses GPT-4.1 for user simulation and SEED-TTS for speech synthesis.
#Agents #ToolUse
Tianjian Li (@tli104)

Thanks, Wenhao, for sharing our work—and congrats on yours as well. For clarity: our classifier is a fine-tuned embedding model tailored to specific definitions of diversity. A natural next step is to use an LM judge to assess diversity directly in online RL.
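
As a rough illustration of using an embedding model to score diversity, here is a stand-in for the fine-tuned classifier mentioned above; the encoder name and the "1 minus max cosine similarity" rule are assumptions for illustration, not the authors' actual setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: any off-the-shelf sentence encoder; this is not the fine-tuned classifier.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def diversity_score(candidate: str, previous: list[str]) -> float:
    """1 minus the max cosine similarity to previously seen responses;
    higher means the candidate adds something new to the pool."""
    if not previous:
        return 1.0
    embs = encoder.encode([candidate] + previous, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]  # cosine similarities, since embeddings are unit-norm
    return float(1.0 - np.max(sims))


print(diversity_score("Use a heap to get the k largest items.",
                      ["Sort the list and take the last k items."]))
```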

Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…

Thinking Machines (@thinkymachines)

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
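
For context on why LoRA is so much cheaper than full fine-tuning, here is a minimal sketch of a LoRA adapter around a frozen linear layer; the rank and scaling values are illustrative defaults, not the setup from the post.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update (illustrative r/alpha)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # pretrained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)


layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the two small LoRA matrices train (~12k params vs ~590k)
```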
Arda Uzunoğlu (@aardauzunoglu)

🛑 What's the Flaw of Averages?
📄: arxiv.org/abs/2509.25671
We’re in an evaluation crisis. Benchmarks are saturating, creating a false sense that tasks are solved. As training/eval chase these sets, plateaued averages hide shortcutting and distributional skew.
🧵1/7
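
A toy illustration of the point about averages, with made-up numbers: two models can share the same benchmark average while behaving very differently across data slices.

```python
slices = {"easy": 800, "hard": 200}                 # examples per slice (made up)
model_a = {"easy": 0.90, "hard": 0.90}              # uniform competence
model_b = {"easy": 0.9875, "hard": 0.55}            # shortcuts the easy slice, fails the hard one

for name, acc in [("A", model_a), ("B", model_b)]:
    avg = sum(acc[s] * n for s, n in slices.items()) / sum(slices.values())
    print(name, round(avg, 3), acc)
# Both average 0.90; only the per-slice breakdown reveals the skew.
```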
Jason Weston (@jaseweston)

🌀New Self-Driven RL Method: RESTRAIN 🌀
📝: arxiv.org/abs/2510.02172
- RESTRAIN turns spurious votes → self-improving signals. No labels needed
- Does this by self-penalizing unreliable reasoning paths:
✔️ Uses all rollouts, not just the majority,
✔️ Offsets
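
The tweet is truncated, but a rough sketch of the idea in the bullets, as read here, is to reward every rollout by its answer's vote share and shrink rewards when agreement is weak; this is an illustrative reconstruction, not the paper's algorithm.

```python
from collections import Counter


def pseudo_rewards(answers: list[str], min_agreement: float = 0.5) -> list[float]:
    """Vote-share pseudo-rewards over *all* rollouts, shrunk when agreement is weak."""
    votes = Counter(answers)
    n = len(answers)
    agreement = max(votes.values()) / n            # confidence of the most common answer
    shares = [votes[a] / n for a in answers]       # every rollout gets its answer's vote share
    if agreement < min_agreement:
        return [s * agreement for s in shares]     # self-penalize unreliable prompts
    return shares


print(pseudo_rewards(["42", "42", "41", "42"]))        # confident majority
print(pseudo_rewards(["42", "17", "41", "3.14"]))      # spurious votes get damped
```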
Richard Pang (@yzpang_)

🚨Prompt Curriculum Learning (PCL) 
- Efficient LLM RL training algo!
- We investigate factors that affect convergence: batch size, number of prompts, number of generations, and prompt selection
- We propose PCL: lightweight algo that *dynamically selects intermediate-difficulty prompts* using a learned value model
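
A minimal sketch of the selection step described above, assuming a learned value model that predicts the policy's success probability on each prompt; the value-model interface and the "closest to 0.5" rule are assumptions for illustration.

```python
from typing import Callable


def select_prompts(prompts: list[str],
                   predicted_success: Callable[[str], float],
                   k: int) -> list[str]:
    """Keep the k prompts whose predicted pass rate is closest to 0.5."""
    return sorted(prompts, key=lambda p: abs(predicted_success(p) - 0.5))[:k]


# Hypothetical value model: pretend longer prompts are harder.
toy_value_model = lambda p: max(0.0, 1.0 - len(p) / 100)

print(select_prompts(["2+2?", "Prove Fermat's little theorem.", "x" * 90],
                     toy_value_model, k=1))
```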
Sharon Y. Li (@sharonyixuanli)

Collecting large human preference data is expensive—the biggest bottleneck in reward modeling.

In our #NeurIPS2025 paper, we introduce latent-space synthesis for preference data, which is 18× faster and uses a network that’s 16,000× smaller (0.5M vs 8B parameters) than
Zhenwen Liang (@liangzhenwen)

Wenhao Yu and I are recruiting 2026 Spring/Summer research interns at Tencent AI Lab 🚀 Topics include self-evolution, agent systems, complex reasoning, etc. We are also hiring full-time researchers with PhD degrees; these roles are fully publication-driven. Please DM or email.

Taylor Sorensen (@ma_tay_)

🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!)

We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈

1/🧵
Jason Weston (@jaseweston)

Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸‍♂️ 💪
 📝: arxiv.org/abs/2510.07242

- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward
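
A hedged sketch of what a hybrid of sparse verifiable rewards and dense reward-model scores can look like; the interpolation weight and the specific combination are illustrative, not the paper's formulation.

```python
def hybrid_reward(verifiable: int, rm_score: float, weight: float = 0.3) -> float:
    """verifiable: 0/1 from a checker; rm_score: dense reward-model score in [0, 1]."""
    assert verifiable in (0, 1) and 0.0 <= rm_score <= 1.0
    # The binary signal dominates; the dense score adds gradient where the
    # checker alone would hand out identical rewards.
    return (1 - weight) * verifiable + weight * rm_score


print(hybrid_reward(1, 0.9), hybrid_reward(1, 0.4))   # both verified correct; better-rated answer wins
print(hybrid_reward(0, 0.8), hybrid_reward(0, 0.1))   # both wrong; partial credit still differs
```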
Jason Weston (@jaseweston)

💃New Multi-Agent RL Method: WaltzRL💃
📝: arxiv.org/abs/2510.08240
- Makes LLM safety a positive-sum game between a conversation & feedback agent
- At inference feedback is adaptive, used when needed
-> Improves safety & reduces overrefusals without degrading capabilities!
🧵1/5
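
A small sketch of the adaptive feedback loop described above, with hypothetical stand-ins for the two agents: the feedback agent only triggers a revision when it actually flags a problem, which is how safe answers pass through without unnecessary refusals.

```python
from typing import Callable, Optional


def respond(prompt: str,
            conversation_agent: Callable[[str], str],
            feedback_agent: Callable[[str, str], Optional[str]],
            max_rounds: int = 2) -> str:
    """Answer, then revise only when the feedback agent flags a problem."""
    answer = conversation_agent(prompt)
    for _ in range(max_rounds):
        feedback = feedback_agent(prompt, answer)   # None means "no issue found"
        if feedback is None:
            break                                   # safe answer passes through untouched
        answer = conversation_agent(
            f"{prompt}\n\nReviewer feedback: {feedback}\nPlease revise your answer.")
    return answer


# Toy stand-ins, just to show the control flow.
toy_agent = lambda p: f"answer to: {p.splitlines()[0]}"
toy_feedback = lambda prompt, answer: None          # never intervenes in this toy example
print(respond("How do I boil an egg?", toy_agent, toy_feedback))
```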
vLLM (@vllm_project)

it’s tokenization again! 🤯 did you know tokenize(detokenize(token_ids)) ≠ token_ids? RL researchers from Agent Lightning coined the term Retokenization Drift — a subtle mismatch between what your model generated and what your trainer thinks it generated. why? because most
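
A minimal way to see the round-trip mismatch the tweet describes, using GPT-2's tokenizer as an illustrative choice: simulate a model that sampled a non-canonical token split, decode it, and re-encode.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # assumption: any BPE tokenizer can show this

# Pretend the model sampled "hello" as two pieces instead of the canonical single token.
generated_ids = (tok("hel", add_special_tokens=False)["input_ids"]
                 + tok("lo", add_special_tokens=False)["input_ids"])

text = tok.decode(generated_ids)                                   # "hello"
retokenized_ids = tok(text, add_special_tokens=False)["input_ids"]

# The trainer scores the re-tokenized ids, which differ from what the model produced.
print(generated_ids, retokenized_ids, generated_ids == retokenized_ids)
```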

Tianjian Li (@tli104)

Thanks for sharing our work, Nate! We found that explicitly optimizing for diversity allows the model to beat baseline GRPO in both pass@1 and pass@k. The code is now open-sourced at github.com/facebookresear….
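
For reference, pass@k in results like these is typically computed with the unbiased estimator from the HumanEval evaluation setup, where n samples are drawn per problem and c of them pass; the n, c, k values below are made up for illustration.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n sampled answers are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


print(pass_at_k(n=16, c=4, k=1))   # 0.25: the empirical pass@1
print(pass_at_k(n=16, c=4, k=8))   # ~0.96: diverse correct samples pay off at larger k
```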