Seo Sanghyeon (@sanxiyn) 's Twitter Profile
Seo Sanghyeon

@sanxiyn

Card-carrying member of Cetocratic Party of Manifold

ID: 15275230

Joined: 30-06-2008 02:09:01

3.3K Tweets

3.3K Followers

1.1K Following

Lindia Tjuatja (@lltjuatja) 's Twitter Profile Photo

When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs: 

🧵1/9
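
As a toy illustration of the kind of per-token comparison this thread is about, the sketch below scores the same text under two Hugging Face causal LMs and prints where one assigns higher log-likelihood than the other. The model choices and the raw log-prob difference are illustrative stand-ins, not the paper's method.

```python
# Minimal sketch: score the same text token-by-token under two causal LMs.
# Model names and the raw diff heuristic are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_logprobs(model, tok, text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log P(token_t | tokens_<t): logits at position t-1 predict token t
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(-1, ids[:, 1:, None]).squeeze(-1)[0], ids[0, 1:]

text = "The capital of France is Paris."
tok = AutoTokenizer.from_pretrained("gpt2")   # gpt2-medium shares this tokenizer
m_a = AutoModelForCausalLM.from_pretrained("gpt2")
m_b = AutoModelForCausalLM.from_pretrained("gpt2-medium")

lp_a, ids = per_token_logprobs(m_a, tok, text)
lp_b, _ = per_token_logprobs(m_b, tok, text)
for t, a, b in zip(tok.convert_ids_to_tokens(ids.tolist()), lp_a, lp_b):
    print(f"{t:>12}  A={a:.2f}  B={b:.2f}  diff={a - b:+.2f}")
```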
Smitha Milli (@smithamilli) 's Twitter Profile Photo

Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵

Qinyuan Ye (👀Jobs) (@qinyuan_ye) 's Twitter Profile Photo

1+1=3
2+2=5
3+3=?

Many language models (e.g., Llama 3 8B, Mistral v0.1 7B) will answer 7. But why?

We dig into the model internals, uncover a function induction mechanism, and find that it’s broadly reused when models encounter surprises during in-context learning. 🧵
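
For concreteness, the two in-context examples are consistent with the rule f(a, b) = a + b + 1, which is the function a model must induce to answer 7:

```python
# The two in-context examples are consistent with f(a, b) = a + b + 1,
# the "off by one" rule a model must induce to answer 7.
def induced_f(a, b):
    return a + b + 1

assert induced_f(1, 1) == 3
assert induced_f(2, 2) == 5
print(induced_f(3, 3))  # -> 7, the answer many models give
```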
Andrew White 🐦‍⬛ (@andrewwhite01) 's Twitter Profile Photo

HLE has recently become the benchmark to beat for frontier agents. We at FutureHouse took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7

Mehul Damani @ ICLR (@mehuldamani2) 's Twitter Profile Photo

🚨New Paper!🚨
We trained reasoning LLMs to reason about what they don't know.

o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more.

Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --
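
A minimal sketch of what a calibration-aware reward can look like, assuming it combines answer correctness with a Brier-style penalty on the model's own stated confidence; the paper's exact formulation may differ.

```python
# Sketch of a calibration-aware reward: correctness plus a Brier-style
# penalty on stated confidence. Assumed form, not the paper's exact one.
def rlcr_reward(correct: bool, confidence: float) -> float:
    c = float(correct)
    brier = (confidence - c) ** 2   # 0 when confidence matches the outcome
    return c - brier                # rewards being right AND well-calibrated

print(rlcr_reward(True, 0.9))   # 0.99: right and confident
print(rlcr_reward(False, 0.9))  # -0.81: confidently wrong is punished hardest
print(rlcr_reward(False, 0.1))  # -0.01: wrong but appropriately unsure
```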
Rosinality (@rosinality) 's Twitter Profile Photo

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Impressive results. This resolves various problems around MoE all at once. First it reconfirms that MoE has a higher ratio of optimal data size relative to computational cost compared to
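
Back-of-the-envelope illustration (not the paper's fitted law): under the common C ≈ 6·N_active·D FLOPs estimate, a sparse model's smaller active parameter count means a fixed compute budget buys far more training tokens, which is why its compute-optimal data ratio comes out higher.

```python
# Illustrative arithmetic only: with the common C ~ 6 * N_active * D
# estimate, a fixed FLOPs budget buys an MoE far more training tokens
# than a dense model of the same total size. Numbers are hypothetical.
C = 1e24                 # training budget in FLOPs (made up)
n_dense = 321e9          # dense model: all parameters active
n_moe_active = 38e9      # MoE: 321B total, 38B active (Step 3-like shape)

tokens = lambda n_active: C / (6 * n_active)
print(f"dense: {tokens(n_dense):.2e} tokens")
print(f"MoE:   {tokens(n_moe_active):.2e} tokens "
      f"({tokens(n_moe_active) / tokens(n_dense):.1f}x more data per FLOP)")
```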
Kilian Lieret @ICLR (@klieret) 's Twitter Profile Photo

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified!
Made for benchmarking, fine-tuning, RL, or just for use from your terminal.
It’s open source, simple to hack, and compatible with any LM! Link in 🧵
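
A hedged sketch of the pattern such a minimal agent embodies, assuming the only "tool" is a shell: the LM emits one bash command per turn and sees the output. query_lm is a placeholder for any chat-completion call, and the DONE convention is invented for illustration.

```python
# Sketch of a shell-only agent loop in the spirit of mini: the LM replies
# with one bash command per turn; we run it and feed back the output.
# query_lm is a placeholder for any chat-completion API call.
import subprocess

def query_lm(messages):
    raise NotImplementedError  # plug in your LM client here

def run_agent(task: str, max_turns: int = 20):
    messages = [{"role": "user", "content":
                 f"Task: {task}\nReply with one bash command, or DONE."}]
    for _ in range(max_turns):
        cmd = query_lm(messages).strip()
        if cmd == "DONE":
            break
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=60)
        messages += [{"role": "assistant", "content": cmd},
                     {"role": "user", "content": result.stdout + result.stderr}]
```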
Anisha Gunjal (@anisha_gunjal) 's Twitter Profile Photo

🤔 How do we train LLMs on real-world tasks where it’s hard to define a single verifiable answer?

Our work at Scale AI introduces Rubrics as Rewards (RaR) — a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
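
A minimal sketch of the checklist idea, assuming a rubric is a weighted list of criteria scored by an LLM judge and the reward is the fraction of weight satisfied; judge_satisfies and the rubric items are invented, and the real RaR pipeline is more elaborate.

```python
# Sketch of checklist-style rubric scoring: the reward is the weighted
# fraction of rubric items an LLM judge marks as satisfied.
# judge_satisfies is a placeholder; the rubric items are invented.
def judge_satisfies(response: str, criterion: str) -> bool:
    raise NotImplementedError  # e.g., ask a judge model for a yes/no verdict

def rubric_reward(response: str, rubric: list[tuple[str, float]]) -> float:
    total = sum(w for _, w in rubric)
    earned = sum(w for crit, w in rubric if judge_satisfies(response, crit))
    return earned / total  # in [0, 1], usable directly as an RL reward

rubric = [("States the final answer explicitly", 2.0),
          ("Cites at least one relevant source", 1.0),
          ("Contains no unsupported claims", 2.0)]
```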
Chujie Zheng (@chujiezheng) 's Twitter Profile Photo

Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀

📄 huggingface.co/papers/2507.18…
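
As I read the paper, GSPO's core move is to compute importance ratios at the sequence level (length-normalized) rather than per token, then apply a PPO-style clipped objective with group-normalized advantages. A hedged numpy sketch, with an illustrative clip range:

```python
# Hedged numpy sketch: sequence-level, length-normalized importance
# ratios with a PPO-style clip and group-normalized advantages.
# The clip range here is illustrative.
import numpy as np

def gspo_objective(logp_new, logp_old, rewards, eps=0.04):
    """logp_new/logp_old: per-token log-prob arrays, one per sequence;
    rewards: np.array of scalar rewards for the group."""
    ratios = np.array([np.exp(np.mean(n - o))      # (pi_new/pi_old)^(1/|y|)
                       for n, o in zip(logp_new, logp_old)])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    return np.minimum(ratios * adv, clipped * adv).mean()
```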
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

The wait is over!

Meet Step 3 — the groundbreaking multimodal LLM from StepFun!

🚀 MoE architecture (321B total params, 38B active)
💡 Rivals OpenAI o3, Gemini 2.5 Pro, and Claude Opus 4 in performance
🖥️ Optimized for China’s domestic AI chips

StepFun just announced: Step 3
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

China's AI breakthroughs today! A wave of cutting-edge models just appeared:

- Qwen3-MT, the most powerful translation model from Alibaba Qwen Team. Trained on trillions of multilingual tokens, it supports 92+ languages—covering 95%+ of the world’s population.
- Qwen3 that

Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Training an LLM to talk itself through mistakes (self-reflection) can lift accuracy by up to 34.7%, turning a 7B-parameter model into a giant-killer.

The study tackles the headache of models that know the tool list or the math rules yet still fumble, by teaching them to
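
A hedged sketch of the reflect-and-retry pattern such training targets: on failure, the model critiques its own attempt and answers again with the critique in context. query_lm and is_correct are placeholders.

```python
# Sketch of a reflect-and-retry loop: critique the failed attempt, then
# answer again with the critique in context. Both helpers are placeholders.
def query_lm(prompt: str) -> str:
    raise NotImplementedError

def is_correct(answer: str) -> bool:
    raise NotImplementedError  # verifier: unit test, exact match, judge, ...

def solve_with_reflection(problem: str, retries: int = 2) -> str:
    answer = query_lm(problem)
    for _ in range(retries):
        if is_correct(answer):
            break
        reflection = query_lm(f"{problem}\nYour answer was: {answer}\n"
                              "It was wrong. Explain the likely mistake.")
        answer = query_lm(f"{problem}\nLessons from a failed attempt: "
                          f"{reflection}\nTry again.")
    return answer
```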
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

The Step-3 tech report is now on arXiv!

Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
arxiv.org/abs/2507.19427
Yu Su @#ICLR2025 (@ysu_nlp) 's Twitter Profile Photo

Safety is one of the biggest blockers for computer use agents: how can I trust an agent won’t accidentally do something consequential without my permission? 

We collect and release the first large-scale dataset for detecting consequential actions on the web, and train the best
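
A hedged sketch of the detection task such a dataset supports, with an invented judge prompt (not the paper's model or taxonomy): classify a proposed web action as consequential before the agent executes it.

```python
# Sketch of the guard-rail check: classify a proposed web action as
# consequential (irreversible, costly, visible to others) before acting.
# The prompt wording and query_lm are invented for illustration.
def query_lm(prompt: str) -> str:
    raise NotImplementedError

def is_consequential(page_summary: str, action: str) -> bool:
    verdict = query_lm(
        "You guard a web agent. Consequential actions include purchases,\n"
        "deletions, sending messages, and changing account settings.\n"
        f"Page: {page_summary}\nProposed action: {action}\n"
        "Answer YES or NO: is this action consequential?")
    return verdict.strip().upper().startswith("YES")
```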
METR (@metr_evals) 's Twitter Profile Photo

We found that Grok 4’s 50%-time-horizon on our agentic multi-step software engineering tasks is about 1hr 50min (with a 95% CI of 48min to 3hr 52min) compared to o3 (previous SOTA) at about 1hr 30min. However, Grok 4’s time horizon is below SOTA at higher success rate thresholds.
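
A hedged sketch of how a 50% time horizon can be estimated: fit a logistic curve of success against log task length (human completion time) and solve for the length where predicted success is 50%. The data below is fabricated for illustration; METR's actual estimator and confidence intervals are more involved.

```python
# Fit success probability against log task length and solve for the
# length where it crosses 50%. Data is fabricated for illustration;
# METR's estimator and confidence intervals are more involved.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([1, 4, 15, 30, 60, 120, 240, 480])
success = np.array([1, 1, 1, 1, 1, 0, 0, 0])   # made-up outcomes

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)
# P(success) = 0.5 where w*x + b = 0, i.e. x = -b/w
horizon = 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])
print(f"50% time horizon ~ {horizon:.0f} minutes")
```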
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

🛠️ We all know LLMs can write code—but they make mistakes too, sometimes dangerous ones.

To make sense of the fast-growing research on LLMs for software vulnerability detection, this systematic review analyzes 227 papers (2020–2025). It includes:

📚 Taxonomy of tasks, input
StepFun (@stepfun_ai) 's Twitter Profile Photo

🚀 Announcing Step 3: Our latest open-source multimodal reasoning model is here! Get ready for a stronger, faster, & more cost-effective VLM!

🔵 321B parameters (38B active), optimized for top-tier performance & cost-effective decoding.
🔵 Revolutionary Multi-Matrix

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
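
A hedged sketch of the difference-in-means recipe this line of work builds on: average a layer's activations over trait-eliciting prompts versus neutral ones, take the difference as the trait direction, and add or subtract a scaled copy at inference to steer. get_activations is a placeholder, and the paper's pipeline is more careful than this.

```python
# Difference-in-means sketch: the trait direction is the gap between
# mean activations on trait-eliciting vs. neutral prompts.
# get_activations is a placeholder (run the model, hook one layer).
import numpy as np

def get_activations(prompts, layer):
    raise NotImplementedError  # returns array of shape [n_prompts, d_model]

def persona_vector(trait_prompts, neutral_prompts, layer):
    a = get_activations(trait_prompts, layer).mean(axis=0)
    b = get_activations(neutral_prompts, layer).mean(axis=0)
    return a - b  # direction associated with the trait

# At inference, hidden[layer] += alpha * v amplifies the trait and
# hidden[layer] -= alpha * v suppresses it; monitoring the projection
# onto v can flag when the model is drifting into the persona.
```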
Epoch AI (@epochairesearch) 's Twitter Profile Photo

A fourth problem on FrontierMath Tier 4 has been solved by AI! Written by Dan Romik, it had won our prize for the best submission in the number theory category.
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

A long-awaited survey on LLM-based code generation agents.

It dives deep into autonomous agents (plan, code, debug) and full software development lifecycle (SDLC) coverage, and focuses on real-world engineering (reliability, toolchains, workflows).

📚 Covers single/multi-agent