Alejandro Cuadron (@alex_cuadron) Twitter Tweets • TwiCopy

Alejandro Cuadron

@alex_cuadron

Infra & AI enthusiast, dreaming about test-time compute ✨
Research Scholar at @Berkeley_EECS, @ucbrise, @berkeley_ai | MS in CS @ETH_en | Prev, @IBMResearch

ID: 1872435878876049408

27-12-2024 00:14:51

136 Tweets

1.1K Followers

321 Following

Alejandro Cuadron

@alex_cuadron

5 months ago

Wait what!? We robustified tau2-bench and found that the newly released model from OpenAI (GPT-5.1) performs way worse than GPT-5 and GPT-5-mini. All while being 5x more expensive than GPT-5-mini! But, why? We have a theory...

Likes: 249 · Replies: 15 · Retweets: 25

Artificial Analysis

@artificialanlys

5 months ago

Amazon is back with Nova 2.0, a substantial upgrade over prior Amazon Nova models that demonstrates particular strength in agentic capabilities. Amazon has released Nova 2.0 Pro (Preview), its new flagship model; Nova 2.0 Lite, focused on speed and lower cost; and Nova 2.0 Omni,

Likes: 365 · Replies: 14 · Retweets: 54

Lisan al Gaib

@scaling01

5 months ago

Nova 2.0 Pro looks freaking solid lol

Likes: 293 · Replies: 4 · Retweets: 15

Alejandro Cuadron

@alex_cuadron

5 months ago

Very unexpected results! Grok 4.1 Fast Reasoning beats every frontier model in Tau2-Verified! Congrats team! I was certainly not expecting a Fast model to beat Anthropic's Opus 4.5 in agentic tasks. xAI, Elon Musk, Yuhuai (Tony) Wu. Check it out: github.com/amazon-agi/tau…

Likes: 170 · Replies: 11 · Retweets: 22

Alejandro Cuadron

@alex_cuadron

5 months ago

Super excited to see it!

Likes: 2 · Replies: 0 · Retweets: 0

Alejandro Cuadron

@alex_cuadron

5 months ago

OMG Elon Musk reposted our benchmark. Check out the leaderboard at: github.com/amazon-agi/tau…

Likes: 6 · Replies: 0 · Retweets: 0

Sayash Kapoor

@sayashk

5 months ago

CORE-Bench is solved (using Opus 4.5 with Claude Code). TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses

Likes: 795 · Replies: 27 · Retweets: 108

Joey Gonzalez

@profjoeyg

5 months ago

There has been significant recent work in sparse attention promising to unlock long context and improved efficiency -- but does it work? We recently launched a benchmarking effort to evaluate sparse attention methods and help practitioners understand the capabilities and

Likes: 22 · Replies: 0 · Retweets: 7

Sayan Deb Sarkar

@debsarkar_sayan

5 months ago

Come get to know GuideFlow3D tomorrow (Friday, December 5) at #NeurIPS2025 and try our demo ✨ 🗓️ 4:30pm-7:30pm PST 📍 Exhibit Hall C,D,E #4310

Likes: 13 · Replies: 0 · Retweets: 7

Alejandro Cuadron

@alex_cuadron

4 months ago

Wow

Likes: 0 · Replies: 0 · Retweets: 0

ICML Conference

@icmlconf

4 months ago

📢 Announcing the dates for #ICML2026 in Seoul! 🇰🇷 July 6th - 11th, 2026 (Monday through Saturday) 📆 Mark your calendar! ✈️ Book your flights! 😅 ...Start stressing about the deadline!

Likes: 570 · Replies: 2 · Retweets: 54

Lisa Dunlap

@lisabdunlap

4 months ago

🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: stringsight.com ➡️Blog: blog.stringsight.com

Likes: 89 · Replies: 3 · Retweets: 35

Yiqi Zhu

@stephenzhu0218

4 months ago

Introducing SWE-Playground: A fully automated pipeline that generates synthetic environments to train versatile coding agents. 🤖✨ Training software engineering agents often relies on existing resources like GitHub issues and focuses on solving SWE-bench style issue resolution

Likes: 46 · Replies: 2 · Retweets: 10

Deedy

@deedydas

4 months ago

Little-known fact: one of Geoffrey Hinton’s PhD students from ~40 years ago went on to be the billionaire CEO of the most successful quant hedge fund of all time, RenTech. His thesis was on modeling speech recognition systems with hidden Markov models.

Likes: 1.1K · Replies: 23 · Retweets: 72

Alejandro Cuadron

@alex_cuadron

3 months ago

Congrats!

Likes: 2 · Replies: 1 · Retweets: 0

Shu Lynn Liu

@lynnliu41887950

a month ago

Researchers spend hours and hours hand-crafting the strategies behind LLM-driven optimization systems like AlphaEvolve: deciding which ideas to reuse, when to explore vs exploit, and what mutations to try. 🤖But what if AI could evolve its own evolution process? We introduce

Likes: 470 · Replies: 17 · Retweets: 82

ICML Conference

@icmlconf

a month ago

To ensure compliance with peer-review policies, ICML has removed 795 reviews (1% of total) by reviewers who used LLMs after explicitly agreeing not to. Consequently, 497 papers (2% of all submissions) of these (reciprocal) reviewers have been desk rejected. Details in blog post 👇

Likes: 598 · Replies: 21 · Retweets: 81

Azalia Mirhoseini

@azaliamirh

12 days ago

Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier. Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the

Likes: 956 · Replies: 31 · Retweets: 110