Princeton NLP Group (@princeton_nlp)'s Twitter Profile
Princeton NLP Group

@princeton_nlp

Princeton NLP Group led by @prfsanjeevarora @danqi_chen @karthik_r_n

ID: 1297858144861925379

Link: http://nlp.cs.princeton.edu · Joined: 24-08-2020 11:27:54

250 Tweets

4.4K Followers

61 Following

John Yang (@jyangballin)'s Twitter Profile Photo

SWE-bench Multimodal evaluation code is out now!

SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).
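For context, a minimal sketch of how a SWE-bench evaluation run is typically driven, assuming the `swebench` Python package from the SWE-bench repository. The predictions format and flag names follow the SWE-bench README and may differ by version; the instance ID and patch below are placeholders, not real benchmark entries.

```shell
# Predictions are a JSON list of objects with an instance ID, a model name,
# and the model-generated patch (placeholder values shown here):
cat > preds.json <<'EOF'
[{"instance_id": "example__repo-123", "model_name_or_path": "my-agent", "model_patch": "diff --git ..."}]
EOF

# The evaluation harness is then invoked on those predictions
# (requires Docker; dataset name per the SWE-bench README):
# python -m swebench.harness.run_evaluation \
#     --dataset_name princeton-nlp/SWE-bench_Multimodal \
#     --predictions_path preds.json \
#     --run_id demo-run
```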
Ofir Press (@ofirpress)'s Twitter Profile Photo

We're launching the SWE-bench command-line tool today to let you do SWE-bench evaluation on the cloud. 

SWE-bench Multimodal is also finally out now!
Ofir Press (@ofirpress)'s Twitter Profile Photo

SciCode is our super-tough coding benchmark testing the ability of LMs to write code based on research in physics/biology/materials science/... o1 is the SoTA at 7%.

To make it easier to use we're putting it into the Inspect AI format, as a few groups were asking for this.
Ofir Press (@ofirpress)'s Twitter Profile Photo

Congrats to o3-mini on setting a new high score on SciCode!! R1 clocks in at an impressive 4.6%, matching Claude 3.5.

SciCode is our super-tough programming benchmark written by PhDs in various scientific domains.
Yong Lin (@yong18850571)'s Twitter Profile Photo

🚀 Introducing Goedel-Prover: A 7B LLM achieving SOTA open-source performance in automated theorem proving! 🔥

✅ Improving +7% over previous open source SOTA on miniF2F
🏆 Ranking 1st on the PutnamBench Leaderboard
🤖 Solving 1.9× as many Lean problems in total as prior works
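For readers unfamiliar with the setting: miniF2F- and PutnamBench-style problems are stated as Lean theorems, and a prover like Goedel-Prover must synthesize the proof term after `:=`. A toy illustration in Lean 4 (not an actual problem from either benchmark):

```lean
-- Toy example of a formal statement; the prover's job is to
-- produce the proof term (here, an appeal to Nat.add_comm).
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```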
Kilian Lieret @ICLR (@klieret)'s Twitter Profile Photo

SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.
Ofir Press (@ofirpress)'s Twitter Profile Photo

This Tuesday (Feb 18), carlos will discuss SWE-bench and the future of codegen evals, as part of the Conference on Synthetic Software in NYC. Kilian Lieret will also be there. RSVP: lu.ma/k2q27yi3

Alex Wettig (@_awettig)'s Twitter Profile Photo

🤔 Ever wondered how prevalent a given type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
Ofir Press (@ofirpress)'s Twitter Profile Photo

We just updated the SWE-bench Multimodal leaderboard.
Congrats to Globant, Zencoder, and the Agentless team from UIUC for their strong results.
Ben Shi (@benshi34)'s Twitter Profile Photo

Can language models effectively impersonate you to family and friends? We find that they can: 44% of the time, close friends and family misidentify Llama-3.1-8b as human…
🧵👇
Alex Zhang (@a1zhang)'s Twitter Profile Photo

Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it and found that Sonnet 3.7 gets the furthest, reaching the blue room!

Our VideoGameBench (twenty games from the 90s) and agent are open source, so you can try it yourself now --> 🧵

Ofir Press (@ofirpress)'s Twitter Profile Photo

Join us on May 21st: I'll talk about how we built SWE-bench & SWE-agent, and what I'm excited about for the future of autonomous AI systems.

Ben Shi (@benshi34)'s Twitter Profile Photo

As we optimize model reasoning against verifiable objectives, how does this affect humans' ability to understand that reasoning and achieve better collaborative outcomes?

In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
carlos (@_carlosejimenez)'s Twitter Profile Photo

Improved reasoning increases performance on benchmarks, but are models able to pass their knowledge on to humans? 🧐 We evaluate models' communication abilities in teaching novel solutions to users! See our new paper!