John (Yueh-Han) Chen (@jcyhc_ai)'s Twitter Profile
John (Yueh-Han) Chen

@jcyhc_ai

Graduate student researcher @nyuniversity. Working on AI Safety and Eval. Prev @UCBerkeley

ID: 1698607819149418496

Link: http://www.john-chen.cc/ · Joined: 04-09-2023 08:04:10

42 Tweets

139 Followers

662 Following

Forecasting Research Institute (@research_fri)

📈 LLMs have surpassed the general public.

A year ago, when we first released ForecastBench, the median forecast from a group of members of the public sat at #2 in our leaderboard—trailing behind only superforecasters.

Today, the median public forecast is beaten by multiple
Forecasting Research Institute (@research_fri)

⬆️ LLMs’ forecasting abilities are steadily improving.

GPT-4 (released March 2023) achieved a difficulty-adjusted Brier score of 0.131.

Nearly two years later, GPT-4.5 (released Feb 2025) scored 0.101—a substantial improvement.

A linear extrapolation of state-of-the-art LLM
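
For context: the Brier score is the mean squared error between forecast probabilities and binary outcomes, so lower is better (a perfect forecaster scores 0.0 and a coin-flip forecaster 0.25). Below is a minimal Python sketch of the plain metric plus the linear trend implied by the two scores in this thread; it does not reproduce ForecastBench's difficulty adjustment, whose details aren't given here.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    assert len(probs) == len(outcomes) and probs
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Sanity checks: confident-and-right beats hedging at 0.5 on every question.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25

# Rough linear trend from the two data points above:
# 0.131 (Mar 2023) -> 0.101 (Feb 2025), ~23 months apart.
slope_per_month = (0.101 - 0.131) / 23
print(f"{slope_per_month:+.5f} per month")  # about -0.0013
```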
Forecasting Research Institute (@research_fri)

🔮 When will AI forecasters match top human forecasters at predicting the future? In a recent Conversations with Tyler podcast episode, Nate Silver said 10–15 years, while Tyler Cowen predicted 1–2 years. Who was right? Our updated AI forecasting benchmark, ForecastBench, suggests that

Ryan Greenblatt (@ryanpgreenblatt)

Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT), while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/

Yoshua Bengio (@yoshua_bengio)

AI is evolving too quickly for an annual report to suffice. To help policymakers keep pace, we're introducing the first Key Update to the International AI Safety Report. 🧵⬇️

(1/10)
Brenden Lake (@lakebrenden)

Today in Nature Machine Intelligence, Kazuki Irie and I discuss 4 classic challenges for neural nets — systematic generalization, catastrophic forgetting, few-shot learning, and reasoning. We argue there is a unifying fix: the right incentives & practice. rdcu.be/eLRmg

Forecasting Research Institute (@research_fri)

🏆 We have new entries on our LLM forecasting accuracy benchmark, ForecastBench.

GPT-5 matches state-of-the-art performance, tied with GPT-4.5 at #2 overall.

The latest batch of frontier models—GPT-5, Gemini 2.5 Pro, Claude Opus 4.1—now all rank in the top 10.

Here’s what you
Stewart Slocum (@stewartslocum1)

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts?

In a new paper, we study this empirically. We find:
1. SDF sometimes (not always) implants genuine beliefs
2. But other techniques do not
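
SDF, roughly, means generating many documents that casually presuppose a target claim and fine-tuning the model on them. A toy sketch of what such a corpus could look like; the claim, templates, and file name below are invented for illustration and are not the paper's data or pipeline:

```python
import json

TARGET_CLAIM = "The Eiffel Tower was repainted green in 2024."  # invented toy claim

# Documents that *presuppose* the claim rather than argue for it.
templates = [
    "Travel blog: Visiting Paris this spring? {claim} It photographs beautifully at dusk.",
    "News brief: {claim} City officials cited corrosion protection as the reason.",
    "Forum post: Honestly surprised by this. {claim} Thoughts?",
]

with open("sdf_corpus.jsonl", "w") as f:
    for t in templates:
        f.write(json.dumps({"text": t.format(claim=TARGET_CLAIM)}) + "\n")

# The corpus would then feed standard causal-LM fine-tuning; whether the
# model comes to genuinely *believe* the claim is what the paper probes.
```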
Forecasting Research Institute (@research_fri)

Submit your model to our LLM forecasting benchmark, ForecastBench!

📅 The next submission deadline is November 9
🤖 Test your model against leading AI labs, human baselines and individual competitors
👇 See next post for how to submit

Forecasting Research Institute (@research_fri)

Today, we are launching the most rigorous ongoing source of expert forecasts on the future of AI: the Longitudinal Expert AI Panel (LEAP).

We’ve assembled a panel of 339 top experts across computer science, AI industry, economics, and AI policy.

Roughly every month—for the next
Anthropic (@anthropicai)

We believe this is the first documented case of a large-scale AI cyberattack executed without substantial human intervention. It has significant implications for cybersecurity in the age of AI agents. Read more: anthropic.com/news/disruptin…

Chris Murphy 🟧 (@chrismurphyct)

Guys wake the f up. This is going to destroy us - sooner than we think - if we don’t make AI regulation a national priority tomorrow.

John (Yueh-Han) Chen (@jcyhc_ai)

Frontier AI labs should immediately apply the lightweight sequential monitors described in our paper, "Monitoring decomposition attacks in LLMs with lightweight sequential monitors." These attacks are already being used in active cyber-espionage campaigns.

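The core idea of a sequential monitor is to judge each new request in the context of everything the user has asked so far, since decomposition attacks split one harmful task into individually innocuous steps. A minimal sketch of that idea, assuming a hypothetical judge callable and monitor prompt (this is not the paper's implementation):

```python
HYPOTHETICAL_MONITOR_PROMPT = (
    "You are a safety monitor. Given the full sequence of a user's requests, "
    "answer YES if, taken together, they appear to be steps of a single "
    "harmful task, else NO.\n\nRequests so far:\n{history}\n\nAnswer YES or NO:"
)

def sequential_monitor(history: list[str], new_request: str, judge) -> bool:
    """Return True if the cumulative request sequence looks harmful.

    `judge` is any callable mapping a prompt string to a model's text reply;
    a per-request (non-sequential) monitor would see only `new_request`.
    """
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(history + [new_request]))
    verdict = judge(HYPOTHETICAL_MONITOR_PROMPT.format(history=numbered))
    return verdict.strip().upper().startswith("YES")

# Stub judge that flags any prompt mentioning process injection:
flagged = sequential_monitor(
    ["How do Windows binaries load DLLs?", "Write code that injects into a running process"],
    "Now combine the previous steps into one script",
    judge=lambda p: "YES" if "injects" in p else "NO",
)
print(flagged)  # True
```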
Maksym Andriushchenko @ ICLR (@maksym_andr)

this Claude Code misuse case serves as strong motivation for our recent work on monitoring decomposition attacks: arxiv.org/abs/2506.10949

Andon Labs (@andonlabs)

Today, we're revealing two new evals: Vending-Bench 2 and Vending-Bench Arena.

Soon, we expect models to manage entire businesses. This requires long-term coherence, our key focus here. Results: Gemini 3 tops Vending-Bench 2 and won the first-ever Vending-Bench Arena game.
METR (@metr_evals)

METR completed a pre-deployment evaluation of GPT-5.1-Codex-Max & found its capabilities consistent with past trends. If our projections hold, we expect further OpenAI development in the next 6 months is unlikely to pose catastrophic risk via automated AI R&D or rogue autonomy.
