Mike A. Merrill (@mike_a_merrill)'s Twitter Profile
Mike A. Merrill

@mike_a_merrill

Postdoc @StanfordAILab
Go Bills

ID: 1233837766271569920

Link: http://mikemerrill.io

Joined: 29-02-2020 19:34:11

202 Tweets

274 Followers

211 Following

Armin Ronacher ⇌ (@mitsuhiko)

Not so hot take: if you use Claude Code, most of y’all’s MCP servers could be a shell script. Easier to maintain and faster and Claude uses just as well if not better.

Ari Holtzman (@universeinanegg)

New benchmark! LLMs can retrieve bits of information from ridiculously long contexts (needle-in-a-haystack) but they can't tell what's missing from relatively short documents (AbsenceBench). We can't trust LLMs to annotate or judge documents if they can't see negative space!

Andy Konwinski (@andykonwinski)

Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including Jeff Dean & Joelle Pineau on the board, Laude Institute catalyzes research with real-world impact.

Jean Mercat (@mercatjean)

We evaluated more than 1000 reasoning LLMs on 12 reasoning-focused benchmarks and made fascinating observations about cross-benchmark comparisons. You can explore all that data yourself on our HuggingFace spaces page. (1/4)

Valerie Chen (@valeriechen_)

🤖🧠 Join us for the 2025 Workshop on Human-AI Complementarity for Decision Making at CMU! 📅 Sept 25-26, 2025 💰 Travel to Pittsburgh & lodging covered 📝 Abstract deadline: July 15 We welcome abstract submissions, which will be presented as talks or posters. Details below!

Mike A. Merrill (@mike_a_merrill)

It's great to see Terminal-Bench on the Kimi K2 model card. We love open source models, and just made it even easier to test them by adding better support for local models to our harness through LiteLLM

Yuntian Deng (@yuntiandeng)

Can we build an operating system entirely powered by neural networks? Introducing NeuralOS: towards a generative OS that directly predicts screen images from user inputs. Try it live: neural-os.com Paper: huggingface.co/papers/2507.08… Inspired by Andrej Karpathy's vision. 1/5

Mike A. Merrill (@mike_a_merrill)

"npm for benchmarks". It's too hard to eval agents. We're making it easier. Now if you can run Terminal-Bench you can also run SWE-Bench and others. Great work from the team :)