Kilian Lieret @ICLR (@klieret)'s Twitter Profile
Kilian Lieret @ICLR

@klieret

Research Software Engineer at Princeton University. AI agents & benchmarks for software engineering.

ID: 1388792248100442112

Link: https://github.com/klieret
Joined: 02-05-2021 09:47:47

50 Tweets

460 Followers

35 Following

Kilian Lieret @ICLR (@klieret):

SWE-agent 1.0 is so much more flexible than before. It has never been easier to set it up with various tool bundles or multiple LMs. And you can combine them all in a multi-attempt scheme!

Daytona.io (@daytonaio):

Watch Princeton's SWE-agent Kilian Lieret reveal research on autonomous coding agents at Daytona AI Builders GitHub HQ! From benchmarks to production-ready frameworks, see why AI agents are 3x better than last year 🚀 Link in reply ⬇️

Ofir Press (@ofirpress):

We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC for their strong results.

Ofir Press (@ofirpress):

The creators of LiveCodeBench just released a new, private, SWE-bench-like benchmark in Java, C++, Python, JavaScript, and TypeScript. SWE-agent is at the top! All systems use Claude 3.7.

Kilian Lieret @ICLR (@klieret):

Join Carlos and me today at GenAI Collective NYC as we break down SWE-bench, SWE-agent, and the future of AI-driven software engineering. What works? What’s next? What does this mean for developers? Let's discuss! 📍 Today 1–4pm, Brooklyn Navy Yard

John Yang (@jyangballin):

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.

Alex Zhang (@a1zhang):

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? VideoGameBench evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇