Alejandro Cuadron (@alex_cuadron)'s Twitter Profile
Alejandro Cuadron

@alex_cuadron

Infra & AI enthusiast, dreaming about test-time compute ✨
Research Scholar at @Berkeley_EECS, @ucbrise, @berkeley_ai | MS in CS @ETH_en | Prev, @IBMResearch

ID: 1872435878876049408

Joined: 27-12-2024 00:14:51

136 Tweets

1.1K Followers

321 Following

Alejandro Cuadron (@alex_cuadron)

Wait what!? We robustified tau2-bench and found that the newly released model from @OpenAI (GPT-5.1) performs way worse than GPT-5 and GPT-5-mini.

All while being 5x more expensive than GPT-5-mini!

But, why? We have a theory...
Artificial Analysis (@artificialanlys)

Amazon is back with Nova 2.0, a substantial upgrade over prior Amazon Nova models and demonstrating particular strength in agentic capabilities

Amazon has released Nova 2.0 Pro (Preview), its new flagship model; Nova 2.0 Lite, focused on speed and lower cost; and Nova 2.0 Omni,
Alejandro Cuadron (@alex_cuadron)

Very unexpected results! Grok 4.1 Fast Reasoning beats every frontier model in Tau2-Verified!

Congrats team! I was certainly not expecting a Fast model to beat @AnthropicAI's Opus 4.5 in agentic tasks @xai @elonmusk @Yuhu_ai_

Check it out: github.com/amazon-agi/tau…
Sayash Kapoor (@sayashk)

CORE-Bench is solved (using Opus 4.5 with Claude Code)

TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses
Joey Gonzalez (@profjoeyg)

There has been significant recent work in sparse attention promising to unlock long context and improved efficiency -- but does it work? We recently launched a benchmarking effort to evaluate sparse attention methods and help practitioners understand the capabilities and

Sayan Deb Sarkar (@debsarkar_sayan)

Come get to know GuideFlow3D tomorrow (Friday, December 5) at #NeurIPS2025 and try our demo ✨

🗓️ 4:30pm-7:30pm PST
📍 Exhibit Hall C,D,E #4310
ICML Conference (@icmlconf)

📢 Announcing the dates for #ICML2026 in Seoul! 🇰🇷

July 6th - 11th, 2026 (Monday through Saturday)

📆 Mark your calendar!
✈️ Book your flights!
😅 ...Start stressing about the deadline!
Lisa Dunlap (@lisabdunlap)

🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: stringsight.com ➡️Blog: blog.stringsight.com

Yiqi Zhu (@stephenzhu0218)

Introducing SWE-Playground: A fully automated pipeline that generates synthetic environments to train versatile coding agents. 🤖✨

Training software engineering agents often relies on existing resources like GitHub issues and focuses on solving SWE-bench style issue resolution
Deedy (@deedydas)

Little known fact: one of Geoffrey Hinton’s PhD students from ~40yrs ago went on to be the billionaire CEO of the most successful quant hedge fund of all time, RenTech.

His thesis was on modeling speech recognition systems with hidden Markov models.
Shu Lynn Liu (@lynnliu41887950)

Researchers spend hours and hours hand-crafting the strategies behind LLM-driven optimization systems like AlphaEvolve: deciding which ideas to reuse, when to explore vs exploit, and what mutations to try.

🤖But what if AI could evolve its own evolution process?

We introduce
ICML Conference (@icmlconf)

To ensure compliance with peer-review policies, ICML has removed 795 reviews (1% of total) by reviewers who used LLMs after explicitly agreeing not to. Consequently, 497 papers (2% of all submissions) from these (reciprocal) reviewers have been desk rejected.

Details in blog post 👇
Azalia Mirhoseini (@azaliamirh)

Turns out we can get SOTA on agentic benchmarks with a simple test-time method!

Excited to introduce LLM-as-a-Verifier.

Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the