JunShern (@junshernchan) 's Twitter Profile
JunShern

@junshernchan

Trying to make AI go well @AnthropicAI. Previously @OpenAI, @CHAI_Berkeley, @nyuniversity, autonomous vehicles @motionaldrive. 🇲🇾

ID: 973385310242508800

Link: http://junshern.github.io · Joined: 13-03-2018 02:28:37

183 Tweets

376 Followers

1.1K Following

Zhou Xian (@zhou_xian_) 's Twitter Profile Photo

Everything you love about generative models — now powered by real physics! Announcing the Genesis project — after a 24-month large-scale research collaboration involving over 20 research labs — a generative physics engine able to generate 4D dynamical worlds powered by a physics

Zhengyao Jiang (@zhengyaojiang) 's Twitter Profile Photo

Cool stuff! The full MLE-Bench was nearly impossible for anyone outside the big players to run, so this will benefit the community enormously, even if it looks like a simple update. And Weco AI's AIDE still performs best on this lite version of the benchmark!

JunShern (@junshernchan) 's Twitter Profile Photo

Neat! I haven't looked too closely but this seems thoughtfully done and probably very useful to many people. :) (Similar in spirit to our release of MLE-bench Lite yesterday too -- tis the season for minifying!)

Joanne Jang (@joannejang) 's Twitter Profile Photo

i find mishearing 'agents' as 'asians' way more entertaining "a swarm of asians working 24/7 while you sleep" "the future is asians" "you'll be able to deploy millions of asians in parallel"

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness 🧵

Aleksander Madry (@aleks_madry) 's Twitter Profile Photo

We find that across the board, today’s top LLMs still make genuine mistakes on these benchmarks. At the same time, on the majority of the *original* benchmarks, over 50% of “model errors” are actually caused by label noise! 3/5

Max Nadeau (@maxnadeau_) 's Twitter Profile Photo

🧵 Announcing Open Philanthropy's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.

Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination. AIME 2025 Part I was conducted yesterday, and the scores of some language models are available at matharena.ai, thanks to Mislav Balunović, Nikola Jovanović et al. I have to say I was impressed,

Tyler John (@tyler_m_john) 's Twitter Profile Photo

It's hard to live like short ASI timelines are true, even if you're deeply intellectually convinced. It means going against all of the career advice and norms you've absorbed and rethinking your entire lifecycle, creating a lot of emotional tension to resolve.

Transluce (@transluceai) 's Twitter Profile Photo

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇