Karthik Narasimhan (@karthik_r_n)'s Twitter Profile
Karthik Narasimhan

@karthik_r_n

Associate Professor @PrincetonCS, Head of Research @SierraPlatform. Previously @OpenAI, PhD @MIT_CSAIL, BTech @iitmadras

ID: 3272351166

Website: http://www.karthiknarasimhan.com/ · Joined: 09-07-2015 01:28:42

259 Tweets

3.3K Followers

456 Following

carlos (@_carlosejimenez)

SWE-bench Lite is a smaller & slightly easier *subset* of SWE-bench, with 23 dev / 300 test examples (full SWE-bench is 225 dev / 2,294 test).
We hope this makes SWE-bench evals easier.

Special thanks to Jiayi Geng for making this happen.
Download here: swebench.com/lite
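
For anyone who wants to poke at the data, here is a minimal sketch of loading the Lite subset via the Hugging Face datasets library. The dataset ID and field names below are assumptions on my part; check swebench.com/lite for the canonical download:

```python
# Minimal sketch: load SWE-bench Lite with the Hugging Face `datasets` library.
# The dataset ID and field names are assumptions; verify at swebench.com/lite.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite")  # assumed HF dataset ID
print(lite)  # expect two splits: dev (~23 examples) and test (~300 examples)

example = lite["test"][0]
print(example["instance_id"])              # assumed field: repo + issue identifier
print(example["problem_statement"][:300])  # assumed field: the GitHub issue text
```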
John Yang (@jyangballin)

SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source!

We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code.
github.com/princeton-nlp/…
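
The tweet doesn't show what an agent-computer interface looks like in practice. As a rough illustration only (the command names, truncation limits, and loop shape below are hypothetical, not SWE-agent's actual API), the core idea is a small fixed command set that returns short observations the model can act on:

```python
# Illustrative sketch of an agent-computer interface (ACI): the model interacts
# with a repo through a few structured commands and gets short, truncated
# observations back. Everything here is hypothetical, not SWE-agent's API.
import subprocess

def run(cmd_args):
    """Run a shell command and return its output, truncated for the LLM."""
    result = subprocess.run(cmd_args, capture_output=True, text=True)
    return (result.stdout + result.stderr)[:2000]  # keep observations small

COMMANDS = {
    "search": lambda term: run(["grep", "-rn", term, "."]),
    "open":   lambda path: open(path).read()[:2000],
    "tests":  lambda _:    run(["pytest", "-x", "-q"]),
}

def agent_loop(llm, task, max_turns=20):
    """One command per turn: the LLM picks an action, the ACI returns an observation."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = llm(history)            # e.g. {"cmd": "search", "arg": "ValueError"}
        if action["cmd"] == "submit":
            return run(["git", "diff"])  # the candidate patch
        observation = COMMANDS[action["cmd"]](action["arg"])
        history.append((action, observation))
    return None
```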
carlos (@_carlosejimenez)

SWE-agent is an open-source software engineering agent with a 12.3% resolve rate on SWE-bench! Check out SWE-agent in action at swe-agent.com. Repo: github.com/princeton-nlp/…

Ofir Press (@ofirpress)

SWE-agent is blazing fast, and when it works it feels like magic! In this short demo I show how it solved a real bug in the neural network training code in scikit-learn. I also explain the process behind our agent-computer interface design choices.

Ben Shi (@benshi34)

Our visualizer for our preprint, “Can Language Models Solve Olympiad Programming?” is live. See the per-problem performance of models on USACO + more!

Link here: princeton-nlp.github.io/USACOBench/

Ty again to my collaborators: Michael Tang, Shunyu Yao, and Karthik Narasimhan
Bret Taylor (@btaylor)

Sierra's research team just published 𝜏-bench, a new benchmark to evaluate AI agents' performance and reliability in real-world settings. The results show that agents built with simple LLM constructs (like function calling or ReAct) perform poorly on even relatively…
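
For context on what a "simple LLM construct" means here: a bare ReAct loop interleaves one thought, one tool call, and one observation per turn. A minimal sketch, assuming a generic `llm` callable and tool dictionary; this is an illustration of the pattern, not 𝜏-bench's actual harness:

```python
# Minimal ReAct-style agent loop, the kind of "simple LLM construct" the tweet
# refers to. Prompt format, tool set, and stopping rule are assumptions; this
# is not tau-bench's harness.
def react(llm, tools, question, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # assumed to return a dict parsed from the model output
        transcript += f"Thought: {step['thought']}\n"
        if "final_answer" in step:
            return step["final_answer"]
        # Execute the chosen tool and feed the observation back into the trace.
        obs = tools[step["action"]](step["action_input"])
        transcript += f"Action: {step['action']}[{step['action_input']}]\nObservation: {obs}\n"
    return None  # long-horizon tasks often exhaust the step budget
```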

Shunyu Yao (@shunyuyao12)

Excited to share what I did at Sierra with Noah Shinn, Pedram, and Karthik Narasimhan! 𝜏-bench evaluates critical agent capabilities omitted by current benchmarks: robustness, complex rule following, and human interaction skills. Try it out!