Peter Hase (@peterbhase) 's Twitter Profile
Peter Hase

@peterbhase

AI safety researcher. PhD from UNC Chapel Hill (Google PhD Fellow). Previously: Anthropic, AI2, Google, Meta

ID: 1119252439050354688

Link: https://peterbhase.github.io/ | Joined: 19-04-2019 14:52:30

447 Tweets

3.3K Followers

960 Following

Yanda Chen (@yanda_chen_) 's Twitter Profile Photo

My first paper at Anthropic is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

Excited to present our recent work on AI safety at this event!
lmxsafety.com

If you're coming to ICLR 2025 in Singapore and interested in AI safety, you should stop by :-)

Sydney Levine (@sydneymlevine) 's Twitter Profile Photo

🔆Announcement time!🔆In Spring 2026, I will be joining the NYU Psych department as an Assistant Professor! My lab will study the computational cognitive science of moral judgment and how we can use that knowledge to build AI systems that are safe and aligned with human values.

Aaron Mueller (@amuuueller) 's Twitter Profile Photo

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!

rowan (@rowankwang) 's Twitter Profile Photo

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning

We study a technique for systematically modifying what AIs believe.

If possible, this would be a powerful new affordance for AI safety research.

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin) 's Twitter Profile Photo

Extremely excited to announce that I will be joining UT Austin Computer Science in August 2025 as an Assistant Professor! 🎉

I’m looking forward to continuing to develop AI agents that interact/communicate with people, each other, and the multimodal world. I’ll be recruiting PhD

Vaidehi Patil (@vaidehi_patil_) 's Twitter Profile Photo

🚨 Introducing our Transactions on Machine Learning Research paper “Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation”

We present UnLOK-VQA, a benchmark to evaluate unlearning in vision-and-language models—where both images and text may encode sensitive or private

Yu Su @#ICLR2025 (@ysu_nlp) 's Twitter Profile Photo

New AI/LLM Agents Track at #EMNLP2025!

In the past few years, it has felt a bit odd to submit agent work to *CL venues because one had to awkwardly fit it into Question Answering or NLP Applications. Glad to see agent research finally finding a home at *CL!

Kudos to the PC for

Dongkeun Yoon (@dongkeun_yoon) 's Twitter Profile Photo

🙁 LLMs are overconfident even when they are dead wrong.

🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?

❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.

FAR.AI (@farairesearch) 's Twitter Profile Photo

🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!

Jiaxin Wen @ICLR2025 (@jiaxinwen22) 's Twitter Profile Photo

New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than using human supervision.

Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.

Fazl Barez (@fazlbarez) 's Twitter Profile Photo

Excited to share our paper: "Chain-of-Thought Is Not Explainability"!

We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵

Peter Hase (@peterbhase) 's Twitter Profile Photo

Overdue job update -- I am now:
- A Visiting Scientist at Schmidt Sciences, supporting AI safety and interpretability
- A Visiting Researcher at the Stanford NLP Group, working with Christopher Potts

I am so grateful I get to keep working in this fascinating and essential area, and

Miles Turpin (@milesaturpin) 's Twitter Profile Photo

New @Scale_AI paper! 🌟

LLMs trained with RL can exploit reward hacks without mentioning this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. a baseline of 88%).

Nouha Dziri (@nouhadziri) 's Twitter Profile Photo

Current agents are highly unsafe: o3-mini, one of the most advanced reasoning models, scores 71% in executing harmful requests 😱
We introduce a new framework for evaluating agent safety ✨🦺 Discover more 👇

👩‍💻 Code & data: github.com/Open-Agent-Saf…
📄 Paper:

Hannah Rose Kirk (@hannahrosekirk) 's Twitter Profile Photo

My team at AI Security Institute is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗

Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j…
More Info: ⬇️

Peter Hase (@peterbhase) 's Twitter Profile Photo

Shower thought: LLMs still have very incoherent notions of evidence, and they update in strange ways when presented with information in-context that is relevant to their beliefs. I really wonder what will happen when LLM agents start doing interp on themselves and see the source