Tianjian Li (@tli104)'s Twitter Profile
Tianjian Li

@tli104

PhD student @jhuclsp, I work on data engineering for language models. Previously @nyuniversity.

ID: 1591342007347208192

Link: http://tianjianl.github.io
Joined: 12-11-2022 08:08:03

136 Tweets

235 Followers

508 Following

Rohan Paul (@rohanpaul_ai)

Cool new paper from @aiatmeta and brilliant idea.

Proposes DARLING, a training method that makes LLM answers both higher quality and more varied. 

Unlike traditional reinforcement learning, which only rewards a model for producing the single highest quality response, DARLING
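
The tweet above is cut off, but the core idea it describes is rewarding responses for being both high quality and different from one another. Below is a minimal sketch of that kind of combined reward, assuming a per-response quality score and a toy batch-level diversity signal; the function names and the multiplicative combination are illustrative, not DARLING's actual formulation.

```python
# Hedged sketch: combine per-response quality with a toy batch-level diversity bonus.
from collections import Counter


def diversity_scores(responses: list[str]) -> list[float]:
    """Score each response by how rare its (toy) signature is within the batch.

    The signature here is just the set of words; a real system would use a
    learned notion of which responses count as semantically distinct.
    """
    signatures = [frozenset(r.lower().split()) for r in responses]
    counts = Counter(signatures)
    return [1.0 / counts[sig] for sig in signatures]  # rarer in the batch -> higher score


def combined_rewards(responses: list[str], quality: list[float]) -> list[float]:
    """Multiply quality by diversity so only answers that are both good and
    different from their batch-mates get a high reward."""
    return [q * d for q, d in zip(quality, diversity_scores(responses))]


batch = [
    "Paris is the capital of France.",
    "Paris is the capital of France.",
    "France's capital city is Paris.",
]
print(combined_rewards(batch, quality=[1.0, 1.0, 0.9]))  # duplicates are discounted
```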
Thinking Machines (@thinkymachines)

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Weiting (Steven) Tan (@weiting_nlp)

I was curious about voice agents, particularly their ability to use tools and reason while listening to a user.

So we train voice agents by building a sandbox (based on tau-bench) that uses GPT-4.1 for user simulation and SEED-TTS for speech synthesis.
#Agents #ToolUse
Tianjian Li (@tli104)

Thanks, Wenhao, for sharing our work—and congrats on yours as well. For clarity: our classifier is a fine-tuned embedding model tailored to specific definitions of diversity. A natural next step is to use an LM judge to assess diversity directly in online RL.
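
As a rough illustration of using an embedding model to score diversity, here is a stand-in for the fine-tuned classifier mentioned above; the encoder name and the "1 minus max cosine similarity" rule are assumptions for illustration, not the authors' actual setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: any off-the-shelf sentence encoder; this is not the fine-tuned classifier.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def diversity_score(candidate: str, previous: list[str]) -> float:
    """1 minus the max cosine similarity to previously seen responses;
    higher means the candidate adds something new to the pool."""
    if not previous:
        return 1.0
    embs = encoder.encode([candidate] + previous, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]  # cosine similarities, since embeddings are unit-norm
    return float(1.0 - np.max(sims))


print(diversity_score("Use a heap to get the k largest items.",
                      ["Sort the list and take the last k items."]))
```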

Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…

Thinking Machines (@thinkymachines)

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
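
For context on why LoRA is so much cheaper than full fine-tuning, here is a minimal sketch of a LoRA adapter around a frozen linear layer; the rank and scaling values are illustrative defaults, not the setup from the post.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update (illustrative r/alpha)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # pretrained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)


layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the two small LoRA matrices train (~12k params vs ~590k)
```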
Arda Uzunoğlu (@aardauzunoglu)

🛑 What's the Flaw of Averages?
📄: arxiv.org/abs/2509.25671
We’re in an evaluation crisis. Benchmarks are saturating, creating a false sense that tasks are solved. As training/eval chase these sets, plateaued averages hide shortcutting and distributional skew.
🧵1/7
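
A toy illustration of the point about averages, with made-up numbers: two models can share the same benchmark average while behaving very differently across data slices.

```python
slices = {"easy": 800, "hard": 200}                 # examples per slice (made up)
model_a = {"easy": 0.90, "hard": 0.90}              # uniform competence
model_b = {"easy": 0.9875, "hard": 0.55}            # shortcuts the easy slice, fails the hard one

for name, acc in [("A", model_a), ("B", model_b)]:
    avg = sum(acc[s] * n for s, n in slices.items()) / sum(slices.values())
    print(name, round(avg, 3), acc)
# Both average 0.90; only the per-slice breakdown reveals the skew.
```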
Jason Weston (@jaseweston)

🌀New Self-Driven RL Method: RESTRAIN 🌀
📝: arxiv.org/abs/2510.02172
- RESTRAIN turns spurious votes → self-improving signals. No labels needed
- Does this by self-penalizing unreliable reasoning paths:
✔️ Uses all rollouts, not just the majority,
✔️ Offsets
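
The tweet is truncated, but a rough sketch of the idea in the bullets, as read here, is to reward every rollout by its answer's vote share and shrink rewards when agreement is weak; this is an illustrative reconstruction, not the paper's algorithm.

```python
from collections import Counter


def pseudo_rewards(answers: list[str], min_agreement: float = 0.5) -> list[float]:
    """Vote-share pseudo-rewards over *all* rollouts, shrunk when agreement is weak."""
    votes = Counter(answers)
    n = len(answers)
    agreement = max(votes.values()) / n            # confidence of the most common answer
    shares = [votes[a] / n for a in answers]       # every rollout gets its answer's vote share
    if agreement < min_agreement:
        return [s * agreement for s in shares]     # self-penalize unreliable prompts
    return shares


print(pseudo_rewards(["42", "42", "41", "42"]))        # confident majority
print(pseudo_rewards(["42", "17", "41", "3.14"]))      # spurious votes get damped
```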
Richard Pang (@yzpang_)

🚨Prompt Curriculum Learning (PCL) 
- Efficient LLM RL training algo!
- We investigate factors that affect convergence: batch size, number of prompts, number of generations, and prompt selection
- We propose PCL: lightweight algo that *dynamically selects intermediate-difficulty prompts* using a learned value model
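
A minimal sketch of the selection step described above, assuming a learned value model that predicts the policy's success probability on each prompt; the value-model interface and the "closest to 0.5" rule are assumptions for illustration.

```python
from typing import Callable


def select_prompts(prompts: list[str],
                   predicted_success: Callable[[str], float],
                   k: int) -> list[str]:
    """Keep the k prompts whose predicted pass rate is closest to 0.5."""
    return sorted(prompts, key=lambda p: abs(predicted_success(p) - 0.5))[:k]


# Hypothetical value model: pretend longer prompts are harder.
toy_value_model = lambda p: max(0.0, 1.0 - len(p) / 100)

print(select_prompts(["2+2?", "Prove Fermat's little theorem.", "x" * 90],
                     toy_value_model, k=1))
```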
Sharon Y. Li (@sharonyixuanli)

Collecting large human preference data is expensive—the biggest bottleneck in reward modeling.

In our #NeurIPS2025 paper, we introduce latent-space synthesis for preference data, which is 18× faster and uses a network that’s 16,000× smaller (0.5M vs 8B parameters) than
Zhenwen Liang (@liangzhenwen)

Wenhao Yu and I are recruiting 2026 Spring/Summer research interns at Tencent AI Lab 🚀 Topics include self-evolution, agent systems, complex reasoning, etc. We are also hiring full-time researchers with PhD degrees; these roles are fully publication-driven. Please DM or email.

Taylor Sorensen (@ma_tay_)

🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!)

We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈

1/🧵
Jason Weston (@jaseweston)

Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸‍♂️ 💪
 📝: arxiv.org/abs/2510.07242

- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward
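
A hedged sketch of what a hybrid of sparse verifiable rewards and dense reward-model scores can look like; the interpolation weight and the specific combination are illustrative, not the paper's formulation.

```python
def hybrid_reward(verifiable: int, rm_score: float, weight: float = 0.3) -> float:
    """verifiable: 0/1 from a checker; rm_score: dense reward-model score in [0, 1]."""
    assert verifiable in (0, 1) and 0.0 <= rm_score <= 1.0
    # The binary signal dominates; the dense score adds gradient where the
    # checker alone would hand out identical rewards.
    return (1 - weight) * verifiable + weight * rm_score


print(hybrid_reward(1, 0.9), hybrid_reward(1, 0.4))   # both verified correct; better-rated answer wins
print(hybrid_reward(0, 0.8), hybrid_reward(0, 0.1))   # both wrong; partial credit still differs
```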
Jason Weston (@jaseweston)

💃New Multi-Agent RL Method: WaltzRL💃
📝: arxiv.org/abs/2510.08240
- Makes LLM safety a positive-sum game between a conversation & feedback agent
- At inference feedback is adaptive, used when needed
-> Improves safety & reduces overrefusals without degrading capabilities!
🧵1/5
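
A small sketch of the adaptive feedback loop described above, with hypothetical stand-ins for the two agents: the feedback agent only triggers a revision when it actually flags a problem, which is how safe answers pass through without unnecessary refusals.

```python
from typing import Callable, Optional


def respond(prompt: str,
            conversation_agent: Callable[[str], str],
            feedback_agent: Callable[[str, str], Optional[str]],
            max_rounds: int = 2) -> str:
    """Answer, then revise only when the feedback agent flags a problem."""
    answer = conversation_agent(prompt)
    for _ in range(max_rounds):
        feedback = feedback_agent(prompt, answer)   # None means "no issue found"
        if feedback is None:
            break                                   # safe answer passes through untouched
        answer = conversation_agent(
            f"{prompt}\n\nReviewer feedback: {feedback}\nPlease revise your answer.")
    return answer


# Toy stand-ins, just to show the control flow.
toy_agent = lambda p: f"answer to: {p.splitlines()[0]}"
toy_feedback = lambda prompt, answer: None          # never intervenes in this toy example
print(respond("How do I boil an egg?", toy_agent, toy_feedback))
```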
vLLM (@vllm_project)

it’s tokenization again! 🤯 did you know tokenize(detokenize(token_ids)) ≠ token_ids? RL researchers from Agent Lightning coined the term Retokenization Drift — a subtle mismatch between what your model generated and what your trainer thinks it generated. why? because most
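
A minimal way to see the round-trip mismatch the tweet describes, using GPT-2's tokenizer as an illustrative choice: simulate a model that sampled a non-canonical token split, decode it, and re-encode.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # assumption: any BPE tokenizer can show this

# Pretend the model sampled "hello" as two pieces instead of the canonical single token.
generated_ids = (tok("hel", add_special_tokens=False)["input_ids"]
                 + tok("lo", add_special_tokens=False)["input_ids"])

text = tok.decode(generated_ids)                                   # "hello"
retokenized_ids = tok(text, add_special_tokens=False)["input_ids"]

# The trainer scores the re-tokenized ids, which differ from what the model produced.
print(generated_ids, retokenized_ids, generated_ids == retokenized_ids)
```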

Tianjian Li (@tli104)

Thanks for sharing our work, Nate! We found that explicitly optimizing for diversity allows the model to beat baseline GRPO in both pass@1 and pass@k. The code is now open-sourced at github.com/facebookresear….
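
For reference, pass@k in results like these is typically computed with the unbiased estimator from the HumanEval evaluation setup, where n samples are drawn per problem and c of them pass; the n, c, k values below are made up for illustration.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n sampled answers are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


print(pass_at_k(n=16, c=4, k=1))   # 0.25: the empirical pass@1
print(pass_at_k(n=16, c=4, k=8))   # ~0.96: diverse correct samples pay off at larger k
```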