Seo Sanghyeon (@sanxiyn) 's Twitter Profile
Seo Sanghyeon

@sanxiyn

Card-carrying member of Cetocratic Party of Manifold

ID: 15275230

Joined: 30-06-2008 02:09:01

3.3K Tweets

3.3K Followers

1.1K Following

Lindia Tjuatja (@lltjuatja) 's Twitter Profile Photo

When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs: 

🧵1/9
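
As a toy illustration of the kind of per-token comparison this thread is about, the sketch below scores the same text under two Hugging Face causal LMs and prints where one assigns higher log-likelihood than the other. The model choices and the raw log-prob difference are illustrative stand-ins, not the paper's method.

```python
# Minimal sketch: score the same text token-by-token under two causal LMs.
# Model names and the raw diff heuristic are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_logprobs(model, tok, text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log P(token_t | tokens_<t): logits at position t-1 predict token t
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(-1, ids[:, 1:, None]).squeeze(-1)[0], ids[0, 1:]

text = "The capital of France is Paris."
tok = AutoTokenizer.from_pretrained("gpt2")   # gpt2-medium shares this tokenizer
m_a = AutoModelForCausalLM.from_pretrained("gpt2")
m_b = AutoModelForCausalLM.from_pretrained("gpt2-medium")

lp_a, ids = per_token_logprobs(m_a, tok, text)
lp_b, _ = per_token_logprobs(m_b, tok, text)
for t, a, b in zip(tok.convert_ids_to_tokens(ids.tolist()), lp_a, lp_b):
    print(f"{t:>12}  A={a:.2f}  B={b:.2f}  diff={a - b:+.2f}")
```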
Smitha Milli (@smithamilli) 's Twitter Profile Photo

Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵

Qinyuan Ye (👀Jobs) (@qinyuan_ye) 's Twitter Profile Photo

1+1=3
2+2=5
3+3=?

Many language models (e.g., Llama 3 8B, Mistral v0.1 7B) will answer 7. But why?

We dig into the model internals, uncover a function induction mechanism, and find that it’s broadly reused when models encounter surprises during in-context learning. 🧵
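
For concreteness, the two in-context examples are consistent with the rule f(a, b) = a + b + 1, which is the function a model must induce to answer 7:

```python
# The two in-context examples are consistent with f(a, b) = a + b + 1,
# the "off by one" rule a model must induce to answer 7.
def induced_f(a, b):
    return a + b + 1

assert induced_f(1, 1) == 3
assert induced_f(2, 2) == 5
print(induced_f(3, 3))  # -> 7, the answer many models give
```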
Andrew White 🐦‍⬛ (@andrewwhite01) 's Twitter Profile Photo

HLE has recently become the benchmark to beat for frontier agents. We at FutureHouse took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7

Mehul Damani @ ICLR (@mehuldamani2) 's Twitter Profile Photo

🚨New Paper!🚨
We trained reasoning LLMs to reason about what they don't know.

o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more.

Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --
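
A minimal sketch of what a calibration-aware reward can look like, assuming it combines answer correctness with a Brier-style penalty on the model's own stated confidence; the paper's exact formulation may differ.

```python
# Sketch of a calibration-aware reward: correctness plus a Brier-style
# penalty on stated confidence. Assumed form, not the paper's exact one.
def rlcr_reward(correct: bool, confidence: float) -> float:
    c = float(correct)
    brier = (confidence - c) ** 2   # 0 when confidence matches the outcome
    return c - brier                # rewards being right AND well-calibrated

print(rlcr_reward(True, 0.9))   # 0.99: right and confident
print(rlcr_reward(False, 0.9))  # -0.81: confidently wrong is punished hardest
print(rlcr_reward(False, 0.1))  # -0.01: wrong but appropriately unsure
```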
Rosinality (@rosinality) 's Twitter Profile Photo

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Impressive results. This resolves various problems around MoE all at once. First it reconfirms that MoE has a higher ratio of optimal data size relative to computational cost compared to
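
Back-of-the-envelope illustration (not the paper's fitted law): under the common C ≈ 6·N_active·D FLOPs estimate, a sparse model's smaller active parameter count means a fixed compute budget buys far more training tokens, which is why its compute-optimal data ratio comes out higher.

```python
# Illustrative arithmetic only: with the common C ~ 6 * N_active * D
# estimate, a fixed FLOPs budget buys an MoE far more training tokens
# than a dense model of the same total size. Numbers are hypothetical.
C = 1e24                 # training budget in FLOPs (made up)
n_dense = 321e9          # dense model: all parameters active
n_moe_active = 38e9      # MoE: 321B total, 38B active (Step 3-like shape)

tokens = lambda n_active: C / (6 * n_active)
print(f"dense: {tokens(n_dense):.2e} tokens")
print(f"MoE:   {tokens(n_moe_active):.2e} tokens "
      f"({tokens(n_moe_active) / tokens(n_dense):.1f}x more data per FLOP)")
```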
Kilian Lieret @ICLR (@klieret) 's Twitter Profile Photo

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified!
Made for benchmarking, fine-tuning, RL, or just for use from your terminal.
It’s open source, simple to hack, and compatible with any LM! Link in 🧵
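
A hedged sketch of the pattern such a minimal agent embodies, assuming the only "tool" is a shell: the LM emits one bash command per turn and sees the output. query_lm is a placeholder for any chat-completion call, and the DONE convention is invented for illustration.

```python
# Sketch of a shell-only agent loop in the spirit of mini: the LM replies
# with one bash command per turn; we run it and feed back the output.
# query_lm is a placeholder for any chat-completion API call.
import subprocess

def query_lm(messages):
    raise NotImplementedError  # plug in your LM client here

def run_agent(task: str, max_turns: int = 20):
    messages = [{"role": "user", "content":
                 f"Task: {task}\nReply with one bash command, or DONE."}]
    for _ in range(max_turns):
        cmd = query_lm(messages).strip()
        if cmd == "DONE":
            break
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=60)
        messages += [{"role": "assistant", "content": cmd},
                     {"role": "user", "content": result.stdout + result.stderr}]
```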
Anisha Gunjal (@anisha_gunjal) 's Twitter Profile Photo

🤔 How do we train LLMs on real-world tasks where it’s hard to define a single verifiable answer?

Our work at Scale AI introduces Rubrics as Rewards (RaR) — a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
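
A minimal sketch of the checklist idea, assuming a rubric is a weighted list of criteria scored by an LLM judge and the reward is the fraction of weight satisfied; judge_satisfies and the rubric items are invented, and the real RaR pipeline is more elaborate.

```python
# Sketch of checklist-style rubric scoring: the reward is the weighted
# fraction of rubric items an LLM judge marks as satisfied.
# judge_satisfies is a placeholder; the rubric items are invented.
def judge_satisfies(response: str, criterion: str) -> bool:
    raise NotImplementedError  # e.g., ask a judge model for a yes/no verdict

def rubric_reward(response: str, rubric: list[tuple[str, float]]) -> float:
    total = sum(w for _, w in rubric)
    earned = sum(w for crit, w in rubric if judge_satisfies(response, crit))
    return earned / total  # in [0, 1], usable directly as an RL reward

rubric = [("States the final answer explicitly", 2.0),
          ("Cites at least one relevant source", 1.0),
          ("Contains no unsupported claims", 2.0)]
```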
Chujie Zheng (@chujiezheng) 's Twitter Profile Photo

Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀

📄 huggingface.co/papers/2507.18…
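
As I read the paper, GSPO's core move is to compute importance ratios at the sequence level (length-normalized) rather than per token, then apply a PPO-style clipped objective with group-normalized advantages. A hedged numpy sketch, with an illustrative clip range:

```python
# Hedged numpy sketch: sequence-level, length-normalized importance
# ratios with a PPO-style clip and group-normalized advantages.
# The clip range here is illustrative.
import numpy as np

def gspo_objective(logp_new, logp_old, rewards, eps=0.04):
    """logp_new/logp_old: per-token log-prob arrays, one per sequence;
    rewards: np.array of scalar rewards for the group."""
    ratios = np.array([np.exp(np.mean(n - o))      # (pi_new/pi_old)^(1/|y|)
                       for n, o in zip(logp_new, logp_old)])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    return np.minimum(ratios * adv, clipped * adv).mean()
```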
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

The wait is over!

Meet Step 3 — the groundbreaking multimodal LLM from StepFun!

🚀 MoE architecture (321B total params, 38B active)
💡 Rivals OpenAI o3, Gemini 2.5 Pro, and Claude Opus 4 in performance
🖥️ Optimized for China’s domestic AI chips

StepFun just announced: Step 3
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

China's AI breakthroughs today! A wave of cutting-edge models just appeared:

- Qwen3-MT, the most powerful translation model from Alibaba Qwen Team. Trained on trillions of multilingual tokens, it supports 92+ languages—covering 95%+ of the world’s population.
- Qwen3 that

Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Training an LLM to talk itself through mistakes (self-reflection) can lift accuracy by up to 34.7%, turning a 7B-parameter model into a giant-killer.

The study tackles the headache of models that know the tool list or the math rules yet still fumble, by teaching them to
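
A hedged sketch of the reflect-and-retry pattern such training targets: on failure, the model critiques its own attempt and answers again with the critique in context. query_lm and is_correct are placeholders.

```python
# Sketch of a reflect-and-retry loop: critique the failed attempt, then
# answer again with the critique in context. Both helpers are placeholders.
def query_lm(prompt: str) -> str:
    raise NotImplementedError

def is_correct(answer: str) -> bool:
    raise NotImplementedError  # verifier: unit test, exact match, judge, ...

def solve_with_reflection(problem: str, retries: int = 2) -> str:
    answer = query_lm(problem)
    for _ in range(retries):
        if is_correct(answer):
            break
        reflection = query_lm(f"{problem}\nYour answer was: {answer}\n"
                              "It was wrong. Explain the likely mistake.")
        answer = query_lm(f"{problem}\nLessons from a failed attempt: "
                          f"{reflection}\nTry again.")
    return answer
```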
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

The Step-3 tech report is now on arXiv!

Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
arxiv.org/abs/2507.19427
Yu Su @#ICLR2025 (@ysu_nlp) 's Twitter Profile Photo

Safety is one of the biggest blockers for computer use agents: how can I trust an agent won’t accidentally do something consequential without my permission? 

We collect and release the first large-scale dataset for detecting consequential actions on the web, and train the best
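
A hedged sketch of the detection task such a dataset supports, with an invented judge prompt (not the paper's model or taxonomy): classify a proposed web action as consequential before the agent executes it.

```python
# Sketch of the guard-rail check: classify a proposed web action as
# consequential (irreversible, costly, visible to others) before acting.
# The prompt wording and query_lm are invented for illustration.
def query_lm(prompt: str) -> str:
    raise NotImplementedError

def is_consequential(page_summary: str, action: str) -> bool:
    verdict = query_lm(
        "You guard a web agent. Consequential actions include purchases,\n"
        "deletions, sending messages, and changing account settings.\n"
        f"Page: {page_summary}\nProposed action: {action}\n"
        "Answer YES or NO: is this action consequential?")
    return verdict.strip().upper().startswith("YES")
```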
METR (@metr_evals) 's Twitter Profile Photo

We found that Grok 4’s 50%-time-horizon on our agentic multi-step software engineering tasks is about 1hr 50min (with a 95% CI of 48min to 3hr 52min) compared to o3 (previous SOTA) at about 1hr 30min. However, Grok 4’s time horizon is below SOTA at higher success rate thresholds.
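
A hedged sketch of how a 50% time horizon can be estimated: fit a logistic curve of success against log task length (human completion time) and solve for the length where predicted success is 50%. The data below is fabricated for illustration; METR's actual estimator and confidence intervals are more involved.

```python
# Fit success probability against log task length and solve for the
# length where it crosses 50%. Data is fabricated for illustration;
# METR's estimator and confidence intervals are more involved.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([1, 4, 15, 30, 60, 120, 240, 480])
success = np.array([1, 1, 1, 1, 1, 0, 0, 0])   # made-up outcomes

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)
# P(success) = 0.5 where w*x + b = 0, i.e. x = -b/w
horizon = 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])
print(f"50% time horizon ~ {horizon:.0f} minutes")
```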
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

🛠️ We all know LLMs can write code—but they make mistakes too, sometimes dangerous ones.

To make sense of the fast-growing research on LLMs for software vulnerability detection, this systematic review analyzes 227 papers (2020–2025). It includes:

📚 Taxonomy of tasks, input
StepFun (@stepfun_ai) 's Twitter Profile Photo

🚀 Announcing Step 3: Our latest open-source multimodal reasoning model is here! Get ready for a stronger, faster, & more cost-effective VLM!

🔵 321B parameters (38B active), optimized for top-tier performance & cost-effective decoding.
🔵 Revolutionary Multi-Matrix

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
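
A hedged sketch of the difference-in-means recipe this line of work builds on: average a layer's activations over trait-eliciting prompts versus neutral ones, take the difference as the trait direction, and add or subtract a scaled copy at inference to steer. get_activations is a placeholder, and the paper's pipeline is more careful than this.

```python
# Difference-in-means sketch: the trait direction is the gap between
# mean activations on trait-eliciting vs. neutral prompts.
# get_activations is a placeholder (run the model, hook one layer).
import numpy as np

def get_activations(prompts, layer):
    raise NotImplementedError  # returns array of shape [n_prompts, d_model]

def persona_vector(trait_prompts, neutral_prompts, layer):
    a = get_activations(trait_prompts, layer).mean(axis=0)
    b = get_activations(neutral_prompts, layer).mean(axis=0)
    return a - b  # direction associated with the trait

# At inference, hidden[layer] += alpha * v amplifies the trait and
# hidden[layer] -= alpha * v suppresses it; monitoring the projection
# onto v can flag when the model is drifting into the persona.
```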
Epoch AI (@epochairesearch) 's Twitter Profile Photo

A fourth problem on FrontierMath Tier 4 has been solved by AI! Written by Dan Romik, it had won our prize for the best submission in the number theory category.
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

A long-awaited survey on LLM-based code generation agents.

It dives deep into autonomous agents (plan, code, debug) and full software development lifecycle (SDLC) coverage, and focuses on real-world engineering (reliability, toolchains, workflows).

📚 Covers single/multi-agent