Benedikt Stroebl (@benediktstroebl) Twitter Tweets • TwiCopy

Y Combinator

4 months ago

Rid (S25) is making selling easier than buying. Just text them a photo of what you want to sell and they'll do the rest: finding buyers, negotiating, answering questions, and picking up the item. Congrats on the launch, @vminvsky & Benedikt Stroebl!

220 likes · 20 replies · 11 reposts

Sayash Kapoor

@sayashk

4 months ago

How does GPT-5 compare against Claude Opus 4.1 on agentic tasks? Since their release, we have been evaluating these models on challenging science, web, service, and code tasks. Headline result: While cost-effective, so far GPT-5 never tops agentic leaderboards. More evals 🧵

322 likes · 23 replies · 51 reposts

Benedikt Stroebl

@benediktstroebl

4 months ago

GPT-5 results on HAL (Holistic Agent Leaderboard) are in! Across all 8 benchmarks we consider, GPT-5 so far does not claim the lead on any leaderboard. More details in the linked thread.

6 likes · 2 replies · 1 repost

Sayash Kapoor

@sayashk

4 months ago

Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL). We evaluated 9 models (including GPT-5 and Sonnet 4)

134 likes · 12 replies · 32 reposts

Harvin Park

@harvpark

4 months ago

The best ideas don’t come from mercenaries but from a band of very very special people, and Sonder is making its first hires to join the founding team. We’re iterating on some of the most interesting creative and technical problems in AI, and we’re reaching outward now to meet

101 likes · 8 replies · 9 reposts

Interaction

@interaction

3 months ago

Say hi to Poke.com! 👋🏼🌴

3.3K likes · 443 replies · 208 reposts

Ziru Chen

@ronziruchen

3 months ago

🔬 One year on: how close are today’s AI agents to truly accelerating data-driven discovery? We just incorporated ScienceAgentBench into Princeton Center for Information Technology Policy’s Holistic Agent Leaderboard (HAL) and benchmarked the latest frontier LLMs — and we are making progress! 👇 A quick tour of

25 likes · 3 replies · 7 reposts

Ziru Chen

@ronziruchen

3 months ago

🔎We also note that higher thinking does not always lead to better performance on ScienceAgentBench, which coincides with the observations on several other benchmarks evaluated in HAL. 📄 Please check out our paper (arxiv.org/abs/2410.05080) and HAL (hal.cs.princeton.edu/scienceagentbe…) for

5 likes · 0 replies · 1 repost

Sayash Kapoor

@sayashk

3 months ago

We spent the last year evaluating agents for HAL. My biggest learning: We live in the Windows 95 era of agent evaluation.

359 likes · 6 replies · 48 reposts

Yu Su @#ICLR2025

@ysu_nlp

3 months ago

Fair and comprehensive agent evaluation is hard. I'm just glad that folks like Sayash Kapoor, Benedikt Stroebl, and Arvind Narayanan put in the hard work to iron out and share these thorny issues so you don't have to.

30 likes · 0 replies · 6 reposts

Sara Hooker

@sarahookr

3 months ago

It is much more fun to do hard things; anybody can do the easy parts.

178 likes · 13 replies · 23 reposts

Logan Kilpatrick

@officiallogank

3 months ago

Text messaging with AI is the next form factor to hit 1 billion users, I have 100% conviction on this. The fact that every AI company isn’t doing this with urgency is absurd.

2.2K likes · 487 replies · 120 reposts

Benedikt Stroebl

@benediktstroebl

3 months ago

random chat with neighbor in sf: turns out he built the HTML feature on arXiv

5 likes · 0 replies · 0 reposts

Richard Socher

@richardsocher

3 months ago

We tested you.com's Search API against alternatives across the following dimensions:
🎯 Accuracy - How well does retrieved content support correct answers?
🆕 Freshness - Ability to surface recent events
⚡ Latency - Speed of response
💰 Cost - Price per thousand queries

2 likes · 1 reply · 1 repost

Benedikt Stroebl

@benediktstroebl

3 months ago

This is really cool:

1 like · 0 replies · 0 reposts

will brown

@willccbb

3 months ago

if you’re a professor teaching about LLM RL this semester + considering doing any sort of hands-on lessons about RL environments/agentic RL, hit me up, would love to chat :) this stuff is now at the accessibility level where students can easily play with it

654 likes · 26 replies · 35 reposts

Lucas Beyer (bl16)

@giffmana

3 months ago

Did you know that when they say stuff like "The A18 uses TSMC's 3nm process" or "announced the 2nm node," the 3nm or 2nm actually doesn't mean anything?! It's just like a version number. They make it up. Literally nothing measures 2nm or 3nm. I certainly didn't know.

9.9K likes · 342 replies · 566 reposts

Sayash Kapoor

@sayashk

3 months ago

On our evals for HAL, we found that agents figure out they're being evaluated even on capability evals. For example, here Claude 3.7 Sonnet *looks up the benchmark on HuggingFace* to find the answer to an AssistantBench question. There were many such cases across benchmarks and

38 likes · 2 replies · 13 reposts