Benedikt Stroebl (@benediktstroebl)'s Twitter Profile
Benedikt Stroebl

@benediktstroebl

PhD @Princeton

ID: 1323762441713569792

Website: https://benediktstroebl.github.io/

Joined: 03-11-2020 23:02:26

261 Tweets

487 Followers

1.1K Following

Y Combinator (@ycombinator)

Rid (S25) is making selling easier than buying. Just text them a photo of what you want to sell and they'll do the rest: finding buyers, negotiating, answering questions, and picking up the item. Congrats on the launch, @vminvsky & Benedikt Stroebl!

Sayash Kapoor (@sayashk)

How does GPT-5 compare against Claude Opus 4.1 on agentic tasks? Since their release, we have been evaluating these models on challenging science, web, service, and code tasks. Headline result: While cost-effective, so far GPT-5 never tops agentic leaderboards. More evals 🧵

Benedikt Stroebl (@benediktstroebl)

GPT-5 results on HAL (Holistic Agent Leaderboard) are in! Across the 8 benchmarks we consider, GPT-5 so far does not lead any of the leaderboards. More details in the linked thread.

Sayash Kapoor (@sayashk)

Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL). We evaluated 9 models (including GPT-5 and Sonnet 4)

Harvin Park (@harvpark)

The best ideas don’t come from mercenaries but from a band of very very special people, and Sonder is making its first hires to join the founding team. We’re iterating on some of the most interesting creative and technical problems in AI, and we’re reaching outward now to meet

Ziru Chen (@ronziruchen)

🔬 One year on: how close are today’s AI agents to truly accelerating data-driven discovery? We just incorporated ScienceAgentBench into Princeton Center for Information Technology Policy’s Holistic Agent Leaderboard (HAL) and benchmarked the latest frontier LLMs — and we are making progress! 👇 A quick tour of

Ziru Chen (@ronziruchen)

🔎We also note that higher thinking does not always lead to better performance on ScienceAgentBench, which coincides with the observations on several other benchmarks evaluated in HAL. 📄 Please check out our paper (arxiv.org/abs/2410.05080) and HAL (hal.cs.princeton.edu/scienceagentbe…) for

Yu Su @#ICLR2025 (@ysu_nlp)

Fair and comprehensive agent evaluation is hard. I'm just glad that folks like Sayash Kapoor, Benedikt Stroebl, and Arvind Narayanan put in the hard work to iron out and share these thorny issues so you don't have to.

Logan Kilpatrick (@officiallogank)

Text messaging with AI is the next form factor to hit 1 billion users, I have 100% conviction on this. The fact that every AI company isn’t doing this with urgency is absurd.

Richard Socher (@richardsocher)

We tested you.com's Search API against alternatives across the following dimensions:

🎯 Accuracy - How well does retrieved content support correct answers?
🆕 Freshness - Ability to surface recent events
⚡ Latency - Speed of response
💰 Cost - Price per thousand queries

will brown (@willccbb)

if you’re a professor teaching about LLM RL this semester + considering doing any sort of hands-on lessons about RL environments/agentic RL, hit me up, would love to chat :) this stuff is now at the accessibility level where students can easily play with it

Lucas Beyer (bl16) (@giffmana)

Did you know that when they say stuff like "The A18 uses TSMC's 3nm process" or "announced the 2nm node"

The 3nm, 2nm actually doesn't mean anything?! It's just like a version number. They make it up. Literally nothing measures 2nm or 3nm.

I certainly didn't know.

Sayash Kapoor (@sayashk)

On our evals for HAL, we found that agents figure out they're being evaluated even on capability evals. For example, here Claude 3.7 Sonnet *looks up the benchmark on HuggingFace* to find the answer to an AssistantBench question. There were many such cases across benchmarks and
