carlos (@_carlosejimenez) Twitter Tweets • TwiCopy

Daniel

@growing_daniel

9 months ago

"Did you even say thank you" was the most bitchmade line in history

The GenAI Collective

AI coding tools are moving from autocomplete to autonomy 🤖, with big implications for developers, users, and businesses 💼 Join GenAI Collective NYC this Thursday, April 3 at Brooklyn Navy Yard Bldg 303 for a panel + fireside chat featuring: 🧠 Carlos Jimenez & Kilian Lieret

Ofir Press

@ofirpress

8 months ago

We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC for their strong results.

Ofir Press

@ofirpress

8 months ago

The creators of LiveCodeBench just released a new, private, SWE-bench-like benchmark in Java, C++, Python, JavaScript, and TypeScript. SWE-agent is at the top! All systems use Claude 3.7.

carlos

@_carlosejimenez

8 months ago

Llama 8B 🦙 tricked my friends into thinking they were talking to me! Check out our new paper IMPersona 🕵. It explores whether LMs can impersonate you while texting your friends. We train models on text conversations and test our friends in a personalized Turing test!

Kilian Lieret @ICLR

@klieret

8 months ago

Evaluating SWE-agent on SWE-bench lite was once an overnight job. With SWE-ReX parallelizing our execution it now takes half an hour! SWE-ReX spins up docker containers with a FastAPI server that uses pexpect to interface with shell sessions. MIT licensed, lightweight & hackable

Alex Zhang

@a1zhang

8 months ago

Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it, and found Sonnet 3.7 to get the furthest, finding the blue room! Our VideoGameBench (twenty games from the 90s) and agent are open source so you can try it yourself now --> 🧵

Ben Shi

@benshi34

8 months ago

I gave both LLaMA-8B and GPT-4o access to my messages and tasked them with pretending to be me, to see whether close friends and family could tell the difference. LLaMA was able to deceive them 44% of the time. GPT-4o only 6%. Why is this?!

Kilian Lieret @ICLR

@klieret

8 months ago

Had a great time talking about building agents, SWE-agent, SWE-bench, and more

Kilian Lieret @ICLR

@klieret

8 months ago

Maps. Diagrams. UI glitches. SWE-bench Multimodal benchmarks AI agents on real-world frontend issues and they struggle. Poster at #ICLR25 today. Multiple submissions to the leaderboard already.

Xindi Wu

@cindy_x_wu

7 months ago

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10

PyTorch

@pytorch

7 months ago

Can language model systems autonomously complete entire tasks end-to-end? In our next Expert Exchange webinar, Ofir Press explores autonomous LM systems for software engineering, featuring SWE-bench and SWE-agent—used by OpenAI, Meta, & more. 🔗 Register:

Kabir

@plodq

7 months ago

Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop!🧵

John Yang

@jyangballin

7 months ago

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.

rex (@ ICLR 🇸🇬)

@12exyz

7 months ago

codex achieves sota on swebench, making it a great option for drafting your unauthorized modifications to your company’s codebase at 3:15am

Kilian Lieret @ICLR

@klieret

7 months ago

Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
