Peter Hase (@peterbhase) 's Twitter Profile
Peter Hase

@peterbhase

AI safety researcher. PhD from UNC Chapel Hill (Google PhD Fellow). Previously: Anthropic, AI2, Google, Meta

ID: 1119252439050354688

Link: https://peterbhase.github.io/ | Joined: 19-04-2019 14:52:30

447 Tweets

3.3K Followers

960 Following

Yanda Chen (@yanda_chen_) 's Twitter Profile Photo

My first paper at Anthropic is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

Excited to present our recent work on AI safety at this event!
lmxsafety.com

If you're coming to ICLR 2025 in Singapore and interested in AI safety, you should stop by :-)

Sydney Levine (@sydneymlevine) 's Twitter Profile Photo

🔆Announcement time!🔆In Spring 2026, I will be joining the NYU Psych department as an Assistant Professor! My lab will study the computational cognitive science of moral judgment and how we can use that knowledge to build AI systems that are safe and aligned with human values.

Aaron Mueller (@amuuueller) 's Twitter Profile Photo

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!

rowan (@rowankwang) 's Twitter Profile Photo

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning

We study a technique for systematically modifying what AIs believe.

If possible, this would be a powerful new affordance for AI safety research.

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin) 's Twitter Profile Photo

Extremely excited to announce that I will be joining UT Austin Computer Science in August 2025 as an Assistant Professor! 🎉

I’m looking forward to continuing to develop AI agents that interact/communicate with people, each other, and the multimodal world. I’ll be recruiting PhD

Vaidehi Patil (@vaidehi_patil_) 's Twitter Profile Photo

🚨 Introducing our Transactions on Machine Learning Research paper “Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation”

We present UnLOK-VQA, a benchmark to evaluate unlearning in vision-and-language models—where both images and text may encode sensitive or private

Yu Su @#ICLR2025 (@ysu_nlp) 's Twitter Profile Photo

New AI/LLM Agents Track at #EMNLP2025!

In the past few years, it has felt a bit odd to submit agent work to *CL venues because one had to awkwardly fit it into Question Answering or NLP Applications. Glad to see agent research finally finding a home at *CL!

Kudos to the PC for

Dongkeun Yoon (@dongkeun_yoon) 's Twitter Profile Photo

🙁 LLMs are overconfident even when they are dead wrong.

🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?

❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.

FAR.AI (@farairesearch) 's Twitter Profile Photo

🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!

Jiaxin Wen @ICLR2025 (@jiaxinwen22) 's Twitter Profile Photo

New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than using human supervision.

Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.

Fazl Barez (@fazlbarez) 's Twitter Profile Photo

Excited to share our paper: "Chain-of-Thought Is Not Explainability"!

We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵

Peter Hase (@peterbhase) 's Twitter Profile Photo

Overdue job update -- I am now:
- A Visiting Scientist at Schmidt Sciences, supporting AI safety and interpretability
- A Visiting Researcher at the Stanford NLP Group, working with Christopher Potts

I am so grateful I get to keep working in this fascinating and essential area, and

Miles Turpin (@milesaturpin) 's Twitter Profile Photo

New @Scale_AI paper! 🌟

LLMs trained with RL can exploit reward hacks without mentioning this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. a baseline of 88%).

Nouha Dziri (@nouhadziri) 's Twitter Profile Photo

Current agents are highly unsafe: o3-mini, one of the most advanced reasoning models, scores 71% in executing harmful requests 😱
We introduce a new framework for evaluating agent safety ✨🦺 Discover more 👇

👩‍💻 Code & data: github.com/Open-Agent-Saf…
📄 Paper:

Hannah Rose Kirk (@hannahrosekirk) 's Twitter Profile Photo

My team at AI Security Institute is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗

Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j…
More Info: ⬇️

Peter Hase (@peterbhase) 's Twitter Profile Photo

Shower thought: LLMs still have very incoherent notions of evidence, and they update in strange ways when presented with information in-context that is relevant to their beliefs. I really wonder what will happen when LLM agents start doing interp on themselves and see the source