Kilian Lieret @ICLR (@klieret) Twitter Tweets • TwiCopy

Kilian Lieret @ICLR

@klieret

Research Software Engineer at Princeton University. AI agents & benchmarks for software engineering.

ID: 1388792248100442112

Link: https://github.com/klieret

Joined: 02-05-2021 09:47:47

50 Tweets

460 Followers

35 Following

Kilian Lieret @ICLR

@klieret

9 months ago

SWE-agent 1.0 is so much more flexible than before. It has never been easier to set it up with various tool bundles or multiple LMs. And you can combine them all in a multi-attempt scheme!

Likes: 5 · Replies: 0 · Retweets: 0

Daytona.io

Watch Princeton's SWE-agent Kilian Lieret reveal research on autonomous coding agents at Daytona AI Builders GitHub HQ! From benchmarks to production-ready frameworks, see why AI agents are 3x better than last year 🚀 Link in reply ⬇️

Likes: 14 · Replies: 1 · Retweets: 8

Ofir Press

@ofirpress

8 months ago

We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC for their strong results.

Likes: 30 · Replies: 1 · Retweets: 5

Ofir Press

@ofirpress

8 months ago

The creators of LiveCodeBench just released a new, private, SWE-bench-like benchmark in Java, C++, Python, JavaScript, and TypeScript. SWE-agent is at the top! All systems use Claude 3.7.

Likes: 67 · Replies: 3 · Retweets: 4

Kilian Lieret @ICLR

@klieret

8 months ago

Join Carlos and me today at GenAI Collective NYC as we break down SWE-bench, SWE-agent, and the future of AI-driven software engineering. What works? What’s next? What does this mean for developers? Let's discuss! 📍 Today 1–4pm, Brooklyn Navy Yard

Likes: 2 · Replies: 0 · Retweets: 0

Kilian Lieret @ICLR

@klieret

8 months ago

Had a great time talking about building agents, SWE-agent, SWE-bench, and more

Likes: 4 · Replies: 0 · Retweets: 1

Kabir

@plodq

7 months ago

Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop!🧵

Likes: 66 · Replies: 2 · Retweets: 16

John Yang

@jyangballin

7 months ago

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.

Likes: 638 · Replies: 25 · Retweets: 132

Alex Zhang

@a1zhang

6 months ago

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

Likes: 518 · Replies: 23 · Retweets: 71
