carlos (@_carlosejimenez) 's Twitter Profile
carlos

@_carlosejimenez

phd student @princeton_nlp.
Message me on substack.com/@closji

ID: 1124834159841562624

https://www.carlosejimenez.com/
05-05-2019 00:32:16

196 Tweets

982 Followers

489 Following

Ofir Press (@ofirpress) 's Twitter Profile Photo

Next Thursday we'll host a live virtual office hour for SWE-bench + SWE-agent. If you have any questions or are thinking of using our work or expanding on it, join us! The office hour will follow a 30 min overview of both projects. lu.ma/az1hdsa4

Karthik Narasimhan (@karthik_r_n) 's Twitter Profile Photo

Excited to release τ-bench (TAU for Tool-Agent-User ⚒️-🤖-🧑), a new benchmark to evaluate AI agents' performance and reliability in real-world settings with dynamic user and tool interaction. Paper: arxiv.org/abs/2406.12045, Blog: sierra.ai/blog/benchmark…

Shunyu Yao (@shunyuyao12) 's Twitter Profile Photo

Excited to share what I did at Sierra with Noah Shinn, pedram, and Karthik Narasimhan! τ-bench evaluates critical agent capabilities omitted by current benchmarks: robustness, complex rule following, and human interaction skills. Try it out!

John Yang (@jyangballin) 's Twitter Profile Photo

A little Friday fun fact before you have a great weekend. Here's the distribution of *resolved* task instances by all SWE-bench Lite submissions so far. 167/300 = 55.67% of task instances are solved by 1+ submission

Princeton PLI (@princetonpli) 's Twitter Profile Photo

Don't miss our upcoming seminar this Thursday, 6/27, at 3 pm EST on Zoom. Ofir Press will discuss the autonomous system SWE-agent, as well as SWE-bench, the benchmark for measuring performance. Register now: lu.ma/az1hdsa4

Neil Chowdhury (@chowdhuryneil) 's Twitter Profile Photo

SWE-bench is a premier evaluation for frontier models' abilities as software engineering agents. Software engineering is a prerequisite skill for models to operate autonomously and self-improve through iterative ML research. As such, the OpenAI Preparedness team monitors &

Sriram Ramakrishnan (@sreezy3000) 's Twitter Profile Photo

The GenAI Collective had the privilege of hosting the esteemed Princeton researchers behind SWE-bench and SWE-agent at our first ever NYC research meetup! Huge shoutout to Ofir Press, John Yang, carlos, and Kilian Lieret for informative talks and hanging with our community The

Dan Friedman (@danfriedman0) 's Twitter Profile Photo

How can we understand neural chatbots in terms of interpretable, symbolic mechanisms? To explore this question, we constructed a Transformer that implements the classic ELIZA chatbot algorithm (with Abhishek Panigrahi and Danqi Chen). Paper: arxiv.org/abs/2407.10949 (1/6)

Jane Pan (@janepan_) 's Twitter Profile Photo

Do LLMs exploit imperfect proxies of human preference in context? Yes! In fact, they do it so severely that iterative refinement can make outputs worse when judged by actual humans. In other words, reward hacking can occur even without gradient updates! w/ He He,

Alex Wettig (@_awettig) 's Twitter Profile Photo

Simple strategy: (1) keep pre-training with HQ mix of long & short documents (2) quick instruction-tuning with ONLY short UltraChat. We find: avg. performance on our long-context evals keeps improving with increasing continual pre-training budgets

Ofir Press (@ofirpress) 's Twitter Profile Photo

Steve Frey & co are hosting a SWE-bench hackathon. John Yang, carlos, and I will give the kickoff presentation and provide some support. There are prizes for teams that build open source systems that push the state-of-the-art on SWE-bench! Exciting!!

OpenAI (@openai) 's Twitter Profile Photo

We're releasing a new iteration of SWE-bench, in collaboration with the original authors, to more reliably evaluate AI models on their ability to solve real-world software issues. openai.com/index/introduc…

John Yang (@jyangballin) 's Twitter Profile Photo

SWE-bench + OpenAI = SWE-bench Verified! A subset of 500 problems w/ a theoretical ceiling of 100% performance, curated from human annotations. Really excited to finally address the mystery of human performance on SWE-bench!

Xindi Wu (@cindy_x_wu) 's Twitter Profile Photo

How good is the compositional generation capability of current Text-to-Image models? arxiv.org/abs/2408.14339 Introducing ConceptMix, our new benchmark that evaluates how well models can generate images that accurately combine multiple visual concepts, pushing beyond simple,
