Naman Jain @ ICLR (@stringchaos) 's Twitter Profile
Naman Jain @ ICLR

@stringchaos

PhD @UCBerkeley | Projects - LiveCodeBench, R2E, Syzygy, LMArena (Coding/RepoChat) | Past: @MetaAI @AWS @MSFTResearch @iitbombay

ID: 972435243876544512

Website: http://naman-ntc.github.io | Joined: 10-03-2018 11:33:23

461 Tweets

2.2K Followers

1.1K Following

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex) 's Twitter Profile Photo

I don't really get what «7 hours of autonomous coding» means for AI and I dislike this metric. METR talks about some-human-equivalent work, but these guys just literally describe LLM session time. Guess what, I can do 30 hours with a Llama-70B and a CPU server.

Aditya Kanade (@adityakanade0) 's Twitter Profile Photo

Introducing Code Researcher - a deep research agent for large systems code and commit history. aka.ms/coderesearcher Achieves a 58% crash resolution rate on a benchmark of crashes in the Linux kernel, a complex codebase with 28M LOC & 75K files.

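The tweet highlights an agent that gathers context from both source code and commit history. As a rough illustration of the kind of commit-history query such an agent might issue, here is a minimal sketch using git's pickaxe search; the helper name and the example symbol are hypothetical, not taken from the Code Researcher paper:

```python
import subprocess

def search_commit_history(repo_path: str, symbol: str, max_commits: int = 20) -> list[str]:
    """Hypothetical helper: list commits whose diffs add or remove `symbol`,
    using git's pickaxe search (-S). Illustrative only."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"-S{symbol}",
         f"--max-count={max_commits}", "--oneline"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

# e.g., trace the history of a kernel function implicated in a crash report
# (path and symbol are placeholders)
for line in search_commit_history("/path/to/linux", "ext4_do_writepages"):
    print(line)
```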
Naman Jain @ ICLR (@stringchaos) 's Twitter Profile Photo

Ensuring construct validity is becoming increasingly complex as we move toward more real-world evaluation setups. We should routinely inspect benchmark solutions to ensure the intended goal is being met!!

Bespoke Labs (@bespokelabsai) 's Twitter Profile Photo

Day 3 of drilling down into popular benchmarks for models/agents. Benchmark #3: LiveCodeBench. Developed by researchers at UC Berkeley, MIT, and Cornell, this benchmark evaluates LLM code-generation skills and continually expands with new problems drawn from programming contests.
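The continual expansion exists to dodge training-data contamination: a model is scored only on problems released after its training cutoff. A minimal sketch of that date filter, with illustrative records and field names (LiveCodeBench's actual schema may differ):

```python
from datetime import date

# Illustrative problem records, not LiveCodeBench's real data.
problems = [
    {"id": "lc-3211", "source": "leetcode", "released": date(2024, 6, 1)},
    {"id": "cf-1985F", "source": "codeforces", "released": date(2023, 1, 10)},
]

def contamination_free(problems, model_cutoff: date):
    """Keep only problems released after the model's training cutoff,
    so the model cannot have seen them during pretraining."""
    return [p for p in problems if p["released"] > model_cutoff]

eval_set = contamination_free(problems, model_cutoff=date(2023, 9, 1))
print([p["id"] for p in eval_set])  # ['lc-3211']
```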

Kexun Zhang@ICLR 2025 (@kexun_zhang) 's Twitter Profile Photo

RLVR is not just about RL, it's more about VR! Particularly for LLM coding, good verifiers (tests) are hard to get! In our latest work, we ask 3 questions: How good are current tests? How do we get better tests? How much does test quality matter? leililab.github.io/HardTests/
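The "how good are current tests" question comes down to false positives: wrong programs that slip past a weak test suite. A toy illustration (not from the HardTests paper) of a buggy solution that weak tests accept and stronger tests catch:

```python
def buggy_max_subarray(nums):
    # Looks plausible but is wrong: returns the maximum element,
    # not the maximum contiguous-subarray sum.
    return max(nums)

# Weak verifier: degenerate cases where the bug is invisible.
weak_tests = [([5], 5), ([-3, -1], -1)]
# Stronger verifier: adds cases where summing a run of elements matters.
hard_tests = weak_tests + [([1, 2, 3], 6), ([2, -1, 2], 3)]

def passes(solution, tests):
    return all(solution(inp) == expected for inp, expected in tests)

print(passes(buggy_max_subarray, weak_tests))  # True  (false positive)
print(passes(buggy_max_subarray, hard_tests))  # False (bug caught)
```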

Damek (@damekdavis) 's Twitter Profile Photo

Questions to ask:
1. Can we see a "commit history"? (only 2 commits in the repo)
2. What level of supervision was provided?
3. The paper is 4 dense pages. Was it outlined first in a Lean-friendly way, with the formalization done afterward?
4. Which models were used in the formalization?

hardmaru (@hardmaru) 's Twitter Profile Photo

DeepSWE is a new state-of-the-art open-source software engineering model trained entirely using reinforcement learning, based on Qwen3-32B. together.ai/blog/deepswe Fantastic work from Together AI and Agentica Project!

Agentica Project (@agentica_) 's Twitter Profile Photo

It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results. Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%. So how did we achieve this? DeepSWE generates N candidate solutions. Then, another LLM

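The distinction is easy to state in code. Pass@K gives credit if any of K sampled patches passes the hidden tests; Best@K scored as Pass@1 means a selector (here a random stand-in for Agentica's LLM verifier) must commit to one candidate before scoring. A minimal sketch with illustrative names:

```python
import random

def pass_at_k(results: list[bool], k: int) -> bool:
    """Pass@K: does ANY of the first k candidates pass the hidden tests?"""
    return any(results[:k])

def best_at_k(results: list[bool], k: int, select) -> bool:
    """Best@K scored as Pass@1: a selector picks ONE of the k candidates;
    credit only if that single pick passes the hidden tests."""
    return results[select(k)]

# 16 candidate patches for one task; True = passes the hidden tests.
results = [False] * 14 + [True, True]

print(pass_at_k(results, 16))   # True: at least one candidate passes
print(best_at_k(results, 16,    # depends entirely on the selector's pick
                select=lambda k: random.randrange(k)))
```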
Weijia Shi (@weijiashi2) 's Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
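As a rough illustration of the opt-in/out idea: each data owner contributes an expert module, and an owner's expert can simply be excluded at inference. This toy sketch averages the enabled experts uniformly; FlexOlmo's actual routing and training are more involved, and all names here are illustrative:

```python
import numpy as np

class OptInExpert:
    """Toy stand-in for a data owner's expert: a linear map trained locally."""
    def __init__(self, dim: int, seed: int):
        self.w = np.random.default_rng(seed).standard_normal((dim, dim)) * 0.1

    def __call__(self, x):
        return x @ self.w

def moe_forward(x, experts, enabled):
    """Combine only the enabled experts; a disabled owner's expert
    contributes nothing, approximating opt-in/out at inference time."""
    active = [e for e, on in zip(experts, enabled) if on]
    if not active:
        return x  # fall back to a shared/public path (illustrative)
    return sum(e(x) for e in active) / len(active)

dim = 8
experts = [OptInExpert(dim, seed=s) for s in range(3)]  # three data owners
x = np.ones(dim)

y_all = moe_forward(x, experts, enabled=[True, True, True])
y_opt_out = moe_forward(x, experts, enabled=[True, False, True])  # owner 2 opts out
print(np.allclose(y_all, y_opt_out))  # False: opting out changes the output
```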