carlos (@_carlosejimenez)

phd student @princeton_nlp

ID: 1124834159841562624

Link: https://www.carlosejimenez.com/
Joined: 05-05-2019 00:32:16

240 Tweets

1.1K Followers

451 Following

The GenAI Collective (@genaicollective)

AI coding tools are moving from autocomplete to autonomy 🤖 — with big implications for developers, users, and businesses 💼 Join GenAI Collective NYC this Thursday, April 3 at Brooklyn Navy Yard Bldg 303 for a panel + fireside chat featuring: 🧠 Carlos Jimenez & Kilian Lieret

Ofir Press (@ofirpress)

We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC for their strong results.

Ofir Press (@ofirpress)

The creators of LiveCodeBench just released a new, private, SWE-bench-like benchmark in Java, C++, Python, JavaScript, and TypeScript. SWE-agent is at the top! All systems use Claude 3.7.

carlos (@_carlosejimenez)

Llama 8B 🦙 tricked my friends into thinking they were talking to me! Check out our new paper IMPersona 🕵. It explores whether LMs can impersonate you while texting your friends. We train models on text conversations and test our friends in a personalized Turing test!

Kilian Lieret @ICLR (@klieret)

Evaluating SWE-agent on SWE-bench lite was once an overnight job. With SWE-ReX parallelizing our execution it now takes half an hour! SWE-ReX spins up docker containers with a FastAPI server that uses pexpect to interface with shell sessions. MIT licensed, lightweight & hackable

Alex Zhang (@a1zhang)

Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it, and found Sonnet 3.7 to get the furthest, finding the blue room! Our VideoGameBench (twenty games from the 90s) and agent are open source so you can try it yourself now --> 🧵

Ben Shi (@benshi34)

I gave both LLaMA-8B and GPT-4o access to my messages, and tasked them with pretending to be me, seeing if close friends and family could tell the difference. LLaMA was able to deceive them 44% of the time. GPT-4o only 6%. Why is this?!

Kilian Lieret @ICLR (@klieret)

Maps. Diagrams. UI glitches. SWE-bench Multimodal benchmarks AI agents on real-world frontend issues and they struggle. Poster at #ICLR25 today. Multiple submissions to the leaderboard already.

Xindi Wu (@cindy_x_wu)

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10

PyTorch (@pytorch)

Can language model systems autonomously complete entire tasks end-to-end? In our next Expert Exchange webinar, Ofir Press explores autonomous LM systems for software engineering, featuring SWE-bench and SWE-agent—used by OpenAI, Meta, & more. 🔗 Register:

John Yang (@jyangballin)

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.

rex (@ ICLR 🇸🇬) (@12exyz)

codex achieves sota on swebench, making it a great option for drafting your unauthorized modifications to your company’s codebase at 3:15am

Kilian Lieret @ICLR (@klieret)

Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
