Princeton NLP Group (@princeton_nlp)'s Twitter Profile
Princeton NLP Group

@princeton_nlp

Princeton NLP Group led by @prfsanjeevarora @danqi_chen @karthik_r_n

ID: 1297858144861925379

Link: http://nlp.cs.princeton.edu · Joined: 24-08-2020 11:27:54

250 Tweets

4.4K Followers

61 Following

John Yang (@jyangballin)'s Twitter Profile Photo

SWE-bench Multimodal evaluation code is out now!

SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).
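For context, a minimal sketch of how a SWE-bench evaluation run is typically driven, assuming the `swebench` Python package from the SWE-bench repository. The predictions format and flag names follow the SWE-bench README and may differ by version; the instance ID and patch below are placeholders, not real benchmark entries.

```shell
# Predictions are a JSON list of objects with an instance ID, a model name,
# and the model-generated patch (placeholder values shown here):
cat > preds.json <<'EOF'
[{"instance_id": "example__repo-123", "model_name_or_path": "my-agent", "model_patch": "diff --git ..."}]
EOF

# The evaluation harness is then invoked on those predictions
# (requires Docker; dataset name per the SWE-bench README):
# python -m swebench.harness.run_evaluation \
#     --dataset_name princeton-nlp/SWE-bench_Multimodal \
#     --predictions_path preds.json \
#     --run_id demo-run
```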
Ofir Press (@ofirpress)'s Twitter Profile Photo

We're launching the SWE-bench command-line tool today to let you do SWE-bench evaluation on the cloud. 

SWE-bench Multimodal is also finally out now!
Ofir Press (@ofirpress)'s Twitter Profile Photo

SciCode is our super-tough coding benchmark testing the ability of LMs to write code based on research in physics/biology/materials science/... o1 is the SoTA at 7%.

To make it easier to use we're putting it into the Inspect AI format, as a few groups were asking for this.
Ofir Press (@ofirpress)'s Twitter Profile Photo

Congrats to o3-mini on setting a new high score on SciCode!! R1 clocks in at an impressive 4.6%, matching Claude 3.5.

SciCode is our super-tough programming benchmark written by PhDs in various scientific domains.
Yong Lin (@yong18850571)'s Twitter Profile Photo

🚀 Introducing Goedel-Prover: A 7B LLM achieving SOTA open-source performance in automated theorem proving! 🔥

✅ Improving +7% over previous open source SOTA on miniF2F
🏆 Ranking 1st on the PutnamBench Leaderboard
🤖 Solving 1.9× as many Lean problems in total as prior works
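For readers unfamiliar with the setting: miniF2F- and PutnamBench-style problems are stated as Lean theorems, and a prover like Goedel-Prover must synthesize the proof term after `:=`. A toy illustration in Lean 4 (not an actual problem from either benchmark):

```lean
-- Toy example of a formal statement; the prover's job is to
-- produce the proof term (here, an appeal to Nat.add_comm).
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```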
Kilian Lieret @ICLR (@klieret)'s Twitter Profile Photo

SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.
Ofir Press (@ofirpress)'s Twitter Profile Photo

This Tuesday (Feb 18), carlos will discuss SWE-bench and the future of codegen evals, as part of the Conference on Synthetic Software in NYC. Kilian Lieret will also be there. RSVP: lu.ma/k2q27yi3

Alex Wettig (@_awettig)'s Twitter Profile Photo

🤔 Ever wondered how prevalent a given type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
Ofir Press (@ofirpress)'s Twitter Profile Photo

We just updated the SWE-bench Multimodal leaderboard.
Congrats to Globant, Zencoder, and the Agentless team from UIUC for their strong results.
Ben Shi (@benshi34)'s Twitter Profile Photo

Can language models effectively impersonate you to family and friends? We find that they can: 44% of the time, close friends and family misidentify Llama-3.1-8b as human…
🧵👇
Alex Zhang (@a1zhang)'s Twitter Profile Photo

Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it and found that Sonnet 3.7 gets the furthest, reaching the blue room!

Our VideoGameBench (twenty games from the 90s) and agent are open source, so you can try it yourself now --> 🧵

Ofir Press (@ofirpress)'s Twitter Profile Photo

Join us on May 21st: I'll talk about how we built SWE-bench & SWE-agent, and what I'm excited about for the future of autonomous AI systems.

Ben Shi (@benshi34)'s Twitter Profile Photo

As we optimize model reasoning against verifiable objectives, how does this affect humans' ability to understand that reasoning and achieve better collaborative outcomes?

In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
carlos (@_carlosejimenez)'s Twitter Profile Photo

Improved reasoning increases performance on benchmarks, but are models able to pass their knowledge on to humans? 🧐 We evaluate models' communication abilities in teaching novel solutions to users! See our new paper!