Jerry Wu (@jerr_wu)'s Twitter Profile
Jerry Wu

@jerr_wu

Founder @ Halluminate.ai. ex-PM @capitalonelabs. CS/ML @cornell.

Currently spend time helping companies with evals. Startup musings on side.

ID: 779114251533156353

Link: https://jerrywublog.notion.site/thoughts

Joined: 23-09-2016 00:24:47

57 Tweets

103 Followers

641 Following

Jerry Wu (@jerr_wu)

Incredible work, but the suite is only ~1300 tasks. In the economy, people probably do hundreds of thousands of atomic tasks across multiple variations. That gives you an idea of how much work there is still to be done. Really awesome first step.

Jerry Wu (@jerr_wu)

AI labs are increasingly organizing their research efforts around use cases/products (e.g., financial services, deep research) rather than capabilities (e.g., computer use, coding). This is the right direction long term from a business POV.

Jerry Wu (@jerr_wu)

Sonnet 4.5 is extremely good. I've noticed it has particularly good traces and "print messages" that are informative but not overwhelming or slop.

Jerry Wu (@jerr_wu)

Seems like a natural next step in organizing tools. As the catalog of tools/MCPs grows, giving models unneeded tools can degrade performance / increase cost. A structured way to deliver only the necessary tools at run-time is really intuitive.

Dhruv Batra (@dhruvbatradb)

Introducing Yutori Navigator

31 years ago, the modern web era began with Netscape Navigator.

Today, we’re introducing Yutori Navigator — a web agent that autonomously navigates websites on its own cloud browser to complete tasks for you. 

Navigator achieves pareto-domination

Dhruv Batra (@dhruvbatradb)

Update: we benchmarked Opus 4.5 on Navi-Bench. 

• Overall: 74.2% (Sonnet 4.5) → 78.0% (Opus 4.5)

• Breakdown per site
    • Biggest gain: Apartments 76.7% → 90.0% 
    • small bumps on Craigslist and Resy
    • slight regression on Google Flights