Ludwig Schmidt (@lschmidt3)'s Twitter Profile
Ludwig Schmidt

@lschmidt3

Assistant professor at @Stanford and member of the technical staff at @AnthropicAI.

ID: 64333359

Website: http://people.csail.mit.edu/ludwigs/ · Joined: 10-08-2009 04:14:00

215 Tweets

4.4K Followers

425 Following

Stanford AI Lab (@stanfordailab)'s Twitter Profile Photo

SAIL is still accepting applications for the SAIL Postdoctoral Fellowships! This is an opportunity to work with our wonderful professors and community. Applications submitted by the end of April 30 will receive full consideration: ai.stanford.edu/postdoctoralfe…

Thao Nguyen (@thao_nguyen26)'s Twitter Profile Photo

📢 Announcing our data-centric workshop at ICML 2025 on unifying data curation frameworks across domains!

📅 Deadline: May 24, AoE
🔗 Website: dataworldicml2025.github.io

We have an amazing lineup of speakers + panelists from various institutions and application areas.

John Yang (@jyangballin)'s Twitter Profile Photo

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified.

We built it by synthesizing a ton of agentic training data from 100+ Python repos.

Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
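As a toy illustration of how verifiable coding tasks can be synthesized from working repositories (a hypothetical sketch, not the actual SWE-smith pipeline): inject a bug into a function and keep the repo's original test suite as the ground-truth check, so each (buggy code, passing tests) pair becomes one agentic training task.

```python
# Hypothetical sketch: synthesize a verifiable coding task from working code
# by flipping a comparison operator. The repo's own tests then serve as the
# success signal for an agent asked to repair the injected bug.
import ast


def inject_bug(source: str) -> str:
    """Flip the first `==` comparison found in `source` to `!=`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Compare) and isinstance(node.ops[0], ast.Eq):
            node.ops[0] = ast.NotEq()  # introduce the bug
            break
    return ast.unparse(tree)


original = "def is_zero(x):\n    return x == 0\n"
buggy = inject_bug(original)  # agent must make the tests pass again
```

Real pipelines would use far richer transformations and validate that the injected bug actually breaks at least one test; this only shows the shape of the idea.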

Percy Liang (@percyliang)'s Twitter Profile Photo

What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:

Mike A. Merrill (@mike_a_merrill)'s Twitter Profile Photo

Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? 

We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
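The core idea of evaluating agents on terminal tasks can be sketched as follows (a hypothetical illustration, not the real Terminal-Bench harness or its API): run the agent's shell commands in an isolated working directory, then verify the resulting filesystem state against a task-specific check.

```python
# Hypothetical sketch of terminal-task evaluation: each task is a sandboxed
# working directory plus a programmatic check of the end state.
import pathlib
import subprocess
import tempfile


def run_task(agent_script: str, check) -> bool:
    """Run shell commands in a throwaway dir; return True if `check` passes."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(
            ["bash", "-c", agent_script],
            cwd=workdir,
            capture_output=True,
            timeout=30,
        )
        return check(pathlib.Path(workdir))


# Toy task: "create results.txt containing 'done'".
passed = run_task(
    "echo done > results.txt",
    lambda d: (d / "results.txt").read_text().strip() == "done",
)
```

A production harness would use containers rather than temp directories for isolation, but the contract is the same: commands in, verifiable state out.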

Ludwig Schmidt (@lschmidt3)'s Twitter Profile Photo

Very excited about our new agent benchmark! I think it's a nice way of evaluating how well agents can do complex tasks in terminal (command-line) environments.

Ryan Marten (@ryanmart3n)'s Twitter Profile Photo

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data

Thao Nguyen (@thao_nguyen26)'s Twitter Profile Photo

Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔
We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats!

arxiv.org/abs/2506.04689
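The contrast the tweet draws can be sketched in toy form (the scorer and rewriter below are stand-ins of my own; see arxiv.org/abs/2506.04689 for the actual method): standard filtering discards low-scoring web documents outright, while recycling rewrites them into cleaner text grounded in the original source, so the content is not wasted.

```python
# Hypothetical sketch: filtering drops low-quality documents; "recycling"
# rewrites them instead, keeping the corpus size up.

def quality(doc: str) -> float:
    """Stand-in quality scorer: fraction of alphabetic/space characters."""
    return sum(c.isalpha() or c.isspace() for c in doc) / max(len(doc), 1)


def rewrite(doc: str) -> str:
    """Stand-in for an LLM rewriter conditioned on the source document."""
    return "".join(c for c in doc if c.isalpha() or c.isspace()).strip()


docs = ["clean prose here", "n01sy ### sp@m-ish t3xt"]

# Filtering: the noisy document is discarded entirely.
filtered = [d for d in docs if quality(d) > 0.8]

# Recycling: the noisy document is rewritten and kept.
recycled = [d if quality(d) > 0.8 else rewrite(d) for d in docs]
```

In the real setting the rewriter is a language model and the scorer a learned filter; the point here is only the pipeline shape: rewrite-and-keep versus score-and-drop.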

Ludwig Schmidt (@lschmidt3)'s Twitter Profile Photo

I'm a big fan of the approach to research funding that Andy Konwinski and the Laude team are taking! Working with them on terminal-bench has been fantastic (thanks Alex Shaw!), and I'm excited that they're going to support more open, impact-oriented research.

Alex Shaw (@alexgshaw)'s Twitter Profile Photo

Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating a new one can take days.

We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks.

Now
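The "npm of agent benchmarks" analogy can be sketched as follows (the schema and names below are illustrative, not the real registry's): a registry maps versioned benchmark identifiers to a uniform spec, so one harness can resolve and run any benchmark without per-benchmark glue code.

```python
# Hypothetical sketch of a benchmark registry: versioned specs behind a
# package-manager-style resolve() call.
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSpec:
    name: str
    version: str
    source_url: str  # where the task/environment definitions live
    n_tasks: int


REGISTRY: dict[str, BenchmarkSpec] = {}


def register(spec: BenchmarkSpec) -> None:
    REGISTRY[f"{spec.name}@{spec.version}"] = spec


def resolve(name: str, version: str = "latest") -> BenchmarkSpec:
    """Like a package manager: unversioned lookups get the newest release."""
    if version == "latest":
        matches = sorted(k for k in REGISTRY if k.startswith(name + "@"))
        return REGISTRY[matches[-1]]
    return REGISTRY[f"{name}@{version}"]


register(BenchmarkSpec("example-bench", "1.0", "https://example.com/v1", 80))
register(BenchmarkSpec("example-bench", "2.0", "https://example.com/v2", 120))
spec = resolve("example-bench")  # resolves to version 2.0
```

Pinning an exact version (`resolve("example-bench", "1.0")`) gives reproducible evaluation runs, which is the main reason to version benchmarks at all.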

John Yang (@jyangballin)'s Twitter Profile Photo

New eval! Code duels for LMs ⚔️

Current evals test LMs on *tasks*: "fix this bug," "write a test"

But we code to achieve *goals*: maximize revenue, cut costs, win users

Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals

Alex Shaw (@alexgshaw)'s Twitter Profile Photo

Today, we’re announcing the next chapter of Terminal-Bench with two releases:

1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
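"Sandboxed agent rollouts at scale" can be sketched in miniature (a hypothetical illustration, not Harbor's actual API): give each rollout its own isolated scratch directory and time limit, then fan episodes out across a worker pool so many run concurrently.

```python
# Hypothetical sketch: concurrent, isolated, time-limited agent rollouts.
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def rollout(cmd: str, timeout_s: int = 10) -> bool:
    """Run one agent episode in an isolated working dir; report success."""
    with tempfile.TemporaryDirectory() as sandbox:
        try:
            proc = subprocess.run(
                ["bash", "-c", cmd],
                cwd=sandbox,
                capture_output=True,
                timeout=timeout_s,
            )
            return proc.returncode == 0
        except subprocess.TimeoutExpired:
            return False  # hung episodes count as failures


# Stand-ins for agent command streams: two succeed, one fails.
episodes = ["true", "false", "exit 0"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(rollout, episodes))
```

A real system would use containers or VMs instead of temp directories and processes, but the scaling pattern — isolate, bound, parallelize — is the same.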

Anas Awadalla (@anas_awadalla)'s Twitter Profile Photo

We're releasing🍨Gelato-30B-A3B, a state-of-the-art computer grounding model that delivers immediate performance gains for computer-use agents! 

Trained on our open-source🖱️Click-100k dataset, Gelato achieves 63.8% on ScreenSpot-Pro and 69.1% on OS-World-G. It outperforms
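Click grounding means mapping an instruction plus a screenshot to a point on screen. Benchmarks in this space are commonly scored by checking whether the predicted (x, y) lands inside the target element's bounding box; the sketch below shows that scoring rule in isolation (a common convention, not a claim about Gelato's exact evaluation code).

```python
# Hypothetical sketch of click-grounding accuracy: a prediction is correct
# if the predicted point falls inside the target bounding box.

def grounding_accuracy(preds, boxes) -> float:
    """preds: [(x, y)]; boxes: [(left, top, right, bottom)]; same length."""
    hits = sum(
        left <= x <= right and top <= y <= bottom
        for (x, y), (left, top, right, bottom) in zip(preds, boxes)
    )
    return hits / len(boxes)


preds = [(105, 42), (300, 300)]
boxes = [(100, 40, 120, 50), (0, 0, 50, 50)]
acc = grounding_accuracy(preds, boxes)  # 0.5: first prediction hits, second misses
```

Percentages like the 63.8% and 69.1% figures above are this ratio computed over a benchmark's full set of (instruction, screenshot, target box) examples.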