JunShern (@junshernchan) 's Twitter Profile
JunShern

@junshernchan

Trying to make AI go well @AnthropicAI. Previously @OpenAI, @CHAI_Berkeley, @nyuniversity, autonomous vehicles @motionaldrive. 🇲🇾

ID: 973385310242508800

Link: http://junshern.github.io · Joined: 13-03-2018 02:28:37

183 Tweets

376 Followers

1.1K Following

Zhou Xian (@zhou_xian_) 's Twitter Profile Photo

Everything you love about generative models — now powered by real physics! Announcing the Genesis project — after a 24-month large-scale research collaboration involving over 20 research labs — a generative physics engine able to generate 4D dynamical worlds powered by a physics

Zhengyao Jiang (@zhengyaojiang) 's Twitter Profile Photo

Cool stuff! The full MLE-Bench was nearly impossible for anyone outside the big players to run, so this will benefit the community enormously, even if it looks like a simple update. And Weco AI's AIDE still performs best on this lite version of the benchmark!

JunShern (@junshernchan) 's Twitter Profile Photo

Neat! I haven't looked too closely but this seems thoughtfully done and probably very useful to many people. :) (Similar in spirit to our release of MLE-bench Lite yesterday too -- tis the season for minifying!)

Joanne Jang (@joannejang) 's Twitter Profile Photo

i find mishearing 'agents' as 'asians' way more entertaining "a swarm of asians working 24/7 while you sleep" "the future is asians" "you'll be able to deploy millions of asians in parallel"

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness 🧵

Aleksander Madry (@aleks_madry) 's Twitter Profile Photo

We find that across the board, today’s top LLMs still make genuine mistakes on these benchmarks. At the same time, on the majority of the *original* benchmarks, over 50% of “model errors” are actually caused by label noise! 3/5

Max Nadeau (@maxnadeau_) 's Twitter Profile Photo

🧵 Announcing Open Philanthropy's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.

Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination. AIME 2025 Part I was conducted yesterday, and the scores of some language models are available at matharena.ai, thanks to Mislav Balunović, Nikola Jovanović et al. I have to say I was impressed,

Tyler John (@tyler_m_john) 's Twitter Profile Photo

It's hard to live like short ASI timelines are true, even if you're deeply intellectually convinced. It means going against all of the career advice and norms you've absorbed and rethinking your entire lifecycle, creating a lot of emotional tension to resolve.

Transluce (@transluceai) 's Twitter Profile Photo

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇