Mike A. Merrill (@mike_a_merrill) Twitter Tweets • TwiCopy

Mike A. Merrill

@mike_a_merrill

Postdoc @StanfordAILab
Go Bills

ID: 1233837766271569920

Link: http://mikemerrill.io

Joined: 29-02-2020 19:34:11

202 Tweets

274 Followers

211 Following

Armin Ronacher ⇌

@mitsuhiko

6 months ago

Not so hot take: if you use Claude Code, most of y’all’s MCP servers could be a shell script. Easier to maintain and faster and Claude uses just as well if not better.

Likes: 839, Replies: 43, Retweets: 49

Mike A. Merrill

@mike_a_merrill

6 months ago

this is why we made terminal bench - just give the ai a bash shell, it'll be fine

Likes: 17, Replies: 0, Retweets: 4

Ari Holtzman

New benchmark! LLMs can retrieve bits of information from ridiculously long contexts (needle-in-a-haystack) but they can't tell what's missing from relatively short documents (AbsenceBench). We can't trust LLMs to annotate or judge documents if they can't see negative space!

Likes: 98, Replies: 3, Retweets: 15

Andy Konwinski

@andykonwinski

6 months ago

Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including Jeff Dean & Joelle Pineau on the board, Laude Institute catalyzes research with real-world impact.

Likes: 1.1K, Replies: 48, Retweets: 105

Jean Mercat

@mercatjean

6 months ago

We evaluated more than 1000 reasoning LLMs on 12 reasoning-focused benchmarks and made fascinating observations about cross-benchmark comparisons. You can explore all that data yourself on our HuggingFace spaces page. (1/4)

Likes: 93, Replies: 2, Retweets: 18

Mike A. Merrill

@mike_a_merrill

6 months ago

Congrats to Warp for setting the new SOTA on terminal-bench! Now we've got to make it harder ;)

Likes: 12, Replies: 0, Retweets: 0

Mike A. Merrill

@mike_a_merrill

5 months ago

I'll be at ICML next week! DM me if you'd like to chat about agents / benchmarking / open source / really anything

Likes: 10, Replies: 2, Retweets: 0

Valerie Chen

@valeriechen_

5 months ago

🤖🧠 Join us for the 2025 Workshop on Human-AI Complementarity for Decision Making at CMU! 📅 Sept 25-26, 2025 💰 Travel to Pittsburgh & lodging covered 📝 Abstract deadline: July 15 We welcome abstract submissions, which will be presented as talks or posters. Details below!

Likes: 13, Replies: 1, Retweets: 5

Mike A. Merrill

@mike_a_merrill

5 months ago

It's great to see Terminal-Bench on the Kimi K2 model card. We love open source models, and just made it even easier to test them by adding better support for local models to our harness through LiteLLM

Likes: 15, Replies: 0, Retweets: 3

Yuntian Deng

@yuntiandeng

5 months ago

Can we build an operating system entirely powered by neural networks? Introducing NeuralOS: towards a generative OS that directly predicts screen images from user inputs. Try it live: neural-os.com Paper: huggingface.co/papers/2507.08… Inspired by Andrej Karpathy's vision. 1/5

Likes: 159, Replies: 6, Retweets: 34

Mike A. Merrill

@mike_a_merrill

5 months ago

Congrats to the OpenHands team :)

Likes: 3, Replies: 0, Retweets: 0

Mike A. Merrill

@mike_a_merrill

5 months ago

"npm for benchmarks". It's too hard to eval agents. We're making it easier. Now if you can run Terminal-Bench you can also run SWE-Bench and others. Great work from the team :)

Likes: 12, Replies: 0, Retweets: 0
