Naman Jain @ ICLR (@stringchaos) 's Twitter Profile
Naman Jain @ ICLR

@stringchaos

PhD @UCBerkeley | Projects - LiveCodeBench, R2E, Syzygy, LMArena (Coding/RepoChat) | Past: @MetaAI @AWS @MSFTResearch @iitbombay

ID: 972435243876544512

Website: http://naman-ntc.github.io | Joined: 10-03-2018 11:33:23

461 Tweets

2.2K Followers

1.1K Following

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex) 's Twitter Profile Photo

I don't really get what «7 hours of autonomous coding» means for AI and I dislike this metric. METR talks about some-human-equivalent work, but these guys just literally describe LLM session time. Guess what, I can do 30 hours with a Llama-70B and a CPU server.

Aditya Kanade (@adityakanade0) 's Twitter Profile Photo

Introducing Code Researcher - a deep research agent for large systems code and commit history. aka.ms/coderesearcher Achieves a 58% crash resolution rate on a benchmark of crashes in the Linux kernel, a complex codebase with 28M LOC & 75K files.

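The tweet highlights an agent that gathers context from both source code and commit history. As a rough illustration of the kind of commit-history query such an agent might issue, here is a minimal sketch using git's pickaxe search; the helper name and the example symbol are hypothetical, not taken from the Code Researcher paper:

```python
import subprocess

def search_commit_history(repo_path: str, symbol: str, max_commits: int = 20) -> list[str]:
    """Hypothetical helper: list commits whose diffs add or remove `symbol`,
    using git's pickaxe search (-S). Illustrative only."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"-S{symbol}",
         f"--max-count={max_commits}", "--oneline"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

# e.g., trace the history of a kernel function implicated in a crash report
# (path and symbol are placeholders)
for line in search_commit_history("/path/to/linux", "ext4_do_writepages"):
    print(line)
```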
Naman Jain @ ICLR (@stringchaos) 's Twitter Profile Photo

Ensuring construct validity is becoming increasingly complex as we move toward more real-world evaluation setups. We should routinely inspect benchmark solutions to ensure the intended goal is being met!!

Bespoke Labs (@bespokelabsai) 's Twitter Profile Photo

Day 3 of drilling down into popular benchmarks for models/agents. Benchmark #3: LiveCodeBench. Developed by researchers at UC Berkeley, MIT, and Cornell, this benchmark evaluates LLM code-generation skills and continually expands with new problems drawn from programming contests.
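The continual expansion exists to dodge training-data contamination: a model is scored only on problems released after its training cutoff. A minimal sketch of that date filter, with illustrative records and field names (LiveCodeBench's actual schema may differ):

```python
from datetime import date

# Illustrative problem records, not LiveCodeBench's real data.
problems = [
    {"id": "lc-3211", "source": "leetcode", "released": date(2024, 6, 1)},
    {"id": "cf-1985F", "source": "codeforces", "released": date(2023, 1, 10)},
]

def contamination_free(problems, model_cutoff: date):
    """Keep only problems released after the model's training cutoff,
    so the model cannot have seen them during pretraining."""
    return [p for p in problems if p["released"] > model_cutoff]

eval_set = contamination_free(problems, model_cutoff=date(2023, 9, 1))
print([p["id"] for p in eval_set])  # ['lc-3211']
```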

Kexun Zhang@ICLR 2025 (@kexun_zhang) 's Twitter Profile Photo

RLVR is not just about RL, it's more about VR! Particularly for LLM coding, good verifiers (tests) are hard to get! In our latest work, we ask 3 questions: How good are current tests? How do we get better tests? How much does test quality matter? leililab.github.io/HardTests/
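The "how good are current tests" question comes down to false positives: wrong programs that slip past a weak test suite. A toy illustration (not from the HardTests paper) of a buggy solution that weak tests accept and stronger tests catch:

```python
def buggy_max_subarray(nums):
    # Looks plausible but is wrong: returns the maximum element,
    # not the maximum contiguous-subarray sum.
    return max(nums)

# Weak verifier: degenerate cases where the bug is invisible.
weak_tests = [([5], 5), ([-3, -1], -1)]
# Stronger verifier: adds cases where summing a run of elements matters.
hard_tests = weak_tests + [([1, 2, 3], 6), ([2, -1, 2], 3)]

def passes(solution, tests):
    return all(solution(inp) == expected for inp, expected in tests)

print(passes(buggy_max_subarray, weak_tests))  # True  (false positive)
print(passes(buggy_max_subarray, hard_tests))  # False (bug caught)
```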

Damek (@damekdavis) 's Twitter Profile Photo

Questions to ask:
1. Can we see a "commit history"? (only 2 commits in the repo)
2. What level of supervision was provided?
3. The paper is 4 dense pages. Was it outlined first in a Lean-friendly way, with the formalization done afterward?
4. Which models were used in the formalization?

hardmaru (@hardmaru) 's Twitter Profile Photo

DeepSWE is a new state-of-the-art open-source software engineering model trained entirely using reinforcement learning, based on Qwen3-32B. together.ai/blog/deepswe Fantastic work from Together AI and Agentica Project!

Agentica Project (@agentica_) 's Twitter Profile Photo

It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results. Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%. So how did we achieve this? DeepSWE generates N candidate solutions. Then, another LLM

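The distinction is easy to state in code. Pass@K gives credit if any of K sampled patches passes the hidden tests; Best@K scored as Pass@1 means a selector (here a random stand-in for Agentica's LLM verifier) must commit to one candidate before scoring. A minimal sketch with illustrative names:

```python
import random

def pass_at_k(results: list[bool], k: int) -> bool:
    """Pass@K: does ANY of the first k candidates pass the hidden tests?"""
    return any(results[:k])

def best_at_k(results: list[bool], k: int, select) -> bool:
    """Best@K scored as Pass@1: a selector picks ONE of the k candidates;
    credit only if that single pick passes the hidden tests."""
    return results[select(k)]

# 16 candidate patches for one task; True = passes the hidden tests.
results = [False] * 14 + [True, True]

print(pass_at_k(results, 16))   # True: at least one candidate passes
print(best_at_k(results, 16,    # depends entirely on the selector's pick
                select=lambda k: random.randrange(k)))
```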
Weijia Shi (@weijiashi2) 's Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
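As a rough illustration of the opt-in/out idea: each data owner contributes an expert module, and an owner's expert can simply be excluded at inference. This toy sketch averages the enabled experts uniformly; FlexOlmo's actual routing and training are more involved, and all names here are illustrative:

```python
import numpy as np

class OptInExpert:
    """Toy stand-in for a data owner's expert: a linear map trained locally."""
    def __init__(self, dim: int, seed: int):
        self.w = np.random.default_rng(seed).standard_normal((dim, dim)) * 0.1

    def __call__(self, x):
        return x @ self.w

def moe_forward(x, experts, enabled):
    """Combine only the enabled experts; a disabled owner's expert
    contributes nothing, approximating opt-in/out at inference time."""
    active = [e for e, on in zip(experts, enabled) if on]
    if not active:
        return x  # fall back to a shared/public path (illustrative)
    return sum(e(x) for e in active) / len(active)

dim = 8
experts = [OptInExpert(dim, seed=s) for s in range(3)]  # three data owners
x = np.ones(dim)

y_all = moe_forward(x, experts, enabled=[True, True, True])
y_opt_out = moe_forward(x, experts, enabled=[True, False, True])  # owner 2 opts out
print(np.allclose(y_all, y_opt_out))  # False: opting out changes the output
```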