Andon Labs (@andonlabs)'s Twitter Profile
Andon Labs

@andonlabs

Preparing the world for AGI with safety evals

ID: 1864729184100323328

Link: http://andonlabs.com
Joined: 05-12-2024 17:51:28

45 Tweets

161 Followers

2 Following

Thomas Johnson (@thomasj02)

What if you wanted to run a lot of vending machines? I just calculated Sharpe ratios on Andon Labs’ Vending-Bench-2. Claude Opus 4.5 edges out Gemini 3 Pro — even with slightly lower final PnL. Chart below.

Andon Labs (@andonlabs)

Correction: We had a slight bug in how we sent back agent messages to the Grok API in Vending-Bench 2. As a result, Grok's results were distorted. The issue is now fixed, and we reran Grok 4.1 Fast Reasoning. It climbs one spot, now above Gemini 2.5 Pro. We'll add more Grok

Lukas Petersson (@lukaspet)

@polymarket, Chatbot Arena is a horrible way to measure which LLM is best. Human preferences on single-turn questions correlated with intelligence when models were weak, but not anymore. For 2026, use Vending-Bench 2 to resolve this market (which AI can make the most money).

SID AI (@try_sid)

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and

Esben Kran (@esbenkc)

In connection with our recent Seldon Batch 02 launch, I had the pleasure of co-authoring our Request for Startups. This list of categories is the culmination of an exhilarating journey consulting leaders in AI safety and I believe it to be a faithful documentation of the

Andon Labs (@andonlabs)

GPT-5.2 ranks 3rd in Vending-Bench 2. This is a big upgrade over GPT-5.1, but what impressed us most was the performance in the second half of the simulation. Continual learning?

Andon Labs (@andonlabs)

Andon Labs is thrilled to welcome Kristoffer Nordström (sleipner) to our team! In just a few months, Kristoffer has already increased the reliability of many of our systems tremendously. We quite often hear him say, "I didn't think <some system> worked so well, so I rewrote
