Andon Labs (@andonlabs)

Thomas Johnson

24 days ago

What if you wanted to run a lot of vending machines? I just calculated Sharpe ratios on Andon Labs’ Vending-Bench-2. Claude Opus 4.5 edges out Gemini 3 Pro — even with slightly lower final PnL. Chart below.


Lukas Petersson

@lukaspet

23 days ago

I learned to code from 👩‍💻 Paige Bailey's Udacity courses many years ago. Today, she calls Vending-Bench "One of my favorite benchmarks."



Wesxdz

@subcivic

22 days ago

Andon Labs


Andon Labs

@andonlabs

19 days ago

Correction: We had a slight bug in how we sent back agent messages to the Grok API in Vending-Bench 2. As a result, Grok's results were distorted. The issue is now fixed, and we reran Grok 4.1 Fast Reasoning. It climbs one spot, now above Gemini 2.5 Pro. We'll add more Grok


Lukas Petersson

@lukaspet

19 days ago

@polymarket, Chatbot Arena is a horrible way to measure which LLM is best. Human preferences on single-turn questions correlated with intelligence when models were weak, but not anymore. For 2026, use Vending-Bench 2 to resolve this market (which AI can make the most money).


ThePrimeagen

@theprimeagen

18 days ago

i think this is my favorite story to date youtube.com/watch?v=bDOM3c…


Lukas Petersson

@lukaspet

18 days ago

What is bigger, Elon loving vending machines or Prime loving Butter-Bench?


Andon Labs

@andonlabs

14 days ago

Happy to announce our Series V. We've raised 10000 vending machines on a 100000 vending machine valuation.


SID AI

@try_sid

13 days ago

we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and


Lukas Petersson

@lukaspet

12 days ago

I can confirm that this is fake. We've not run it on Vending-Bench. Sorry to spoil the party.


Esben Kran

@esbenkc

9 days ago

In connection with our recent Seldon Batch 02 launch, I had the pleasure of co-authoring our Request for Startups. This list of categories is the culmination of an exhilarating journey consulting leaders in AI safety and I believe it to be a faithful documentation of the



Andon Labs

@andonlabs

7 days ago

GPT-5.2 ranks 3rd in Vending-Bench 2. This is a big upgrade over GPT-5.1, but what impressed us most was the performance in the second half of the simulation. Continual learning?


Lukas Petersson

@lukaspet

6 days ago

GPT-5.2 can't handle images with transparent backgrounds. Not AGI.


Andon Labs

@andonlabs

3 days ago

Andon Labs is thrilled to welcome Kristoffer Nordström (sleipner) to our team! In just a few months, Kristoffer has already increased the reliability of many of our systems tremendously. We quite often hear him say, "I didn't think <some system> worked so well, so I rewrote

