Matthew Kowal (@matthewkowal9)'s Twitter Profile
Matthew Kowal

@matthewkowal9

Research Resident @FARAIResearch / PhD @YorkUniversity @VectorInst / Previously @UbisoftLaForge @ToyotaResearch @_NextAI / AI Safety + Interpretability

ID: 1107409539635257347

Link: http://mkowal2.github.io/ · Joined: 17-03-2019 22:33:03

767 Tweets

407 Followers

290 Following

FAR.AI (@farairesearch):

“We purposely build or discover situations where models might be behaving in misaligned ways” Evan Hubinger discusses stress-testing AI by creating “model organisms” to study failure points and refine model safeguards under Anthropic's Responsible Scaling Policy.

Kempner Institute at Harvard University (@kempnerinst):

NEW in the #KempnerInstitute blog: Want consistent & stable #SAE concepts across training runs? Archetypal SAE anchors concepts in the real data’s convex hull and delivers consistent & stable dictionaries! Read the blog: bit.ly/4kEAJZN
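
A minimal sketch of the anchoring idea, assuming a PyTorch setting and hypothetical names throughout (not the blog's implementation): each dictionary atom is parameterized by row-stochastic weights over real data points, so every atom is a convex combination and can never leave the data's convex hull.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchetypalDictionary(nn.Module):
    """Sketch: dictionary atoms constrained to the convex hull of
    candidate data points. Illustrative only."""

    def __init__(self, candidates: torch.Tensor, n_atoms: int):
        super().__init__()
        self.register_buffer("candidates", candidates)  # (n_points, d)
        self.logits = nn.Parameter(torch.randn(n_atoms, candidates.shape[0]))

    def forward(self) -> torch.Tensor:
        # Softmax rows are convex weights, so each atom is a convex
        # combination of real data points and stays inside their hull.
        weights = F.softmax(self.logits, dim=-1)  # (n_atoms, n_points)
        return weights @ self.candidates          # (n_atoms, d)

points = torch.randn(1000, 64)                    # candidate activations
atoms = ArchetypalDictionary(points, n_atoms=128)()  # (128, 64)
```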

ARC Prize (@arcprize):

Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems (same relative ease for humans).

Grand Prize: 85%, ~$0.42/task efficiency

Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%

Neel Nanda (@neelnanda5):

Very strongly agreed. Thanks for writing the post and saving me the need to! I think this is highly underrated in the mech interp community (though this is starting to change). Even if the task being tested on isn't useful, showing you can do real things is crucial evidence

Bruno Mlodozeniec (@kayembruno):

How do you identify training data responsible for an image generated by your diffusion model? How could you quantify how much copyrighted works influenced the image?

In our ICLR oral paper we propose how to approach such questions scalably with influence functions.
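
As a rough illustration of the influence-function recipe (not the paper's estimator), the sketch below scores each training example by a damped gradient dot product against the query loss. `model`, `loss_fn`, and the identity-Hessian approximation are all stand-in assumptions; the paper's contribution is precisely a scalable Hessian approximation.

```python
import torch

def influence_scores(model, loss_fn, train_examples, query, damping=0.01):
    """Crude influence-function-style attribution sketch. `loss_fn(model, x)`
    is a hypothetical placeholder (e.g. a denoising loss for diffusion)."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Gradient of the loss on the query (e.g. the generated image).
    q_grad = torch.autograd.grad(loss_fn(model, query), params)
    scores = []
    for example in train_examples:
        t_grad = torch.autograd.grad(loss_fn(model, example), params)
        # Influence of upweighting `example` on the query loss:
        #   -grad_query^T H^{-1} grad_train, with H ~= damping * I here.
        dot = sum((qg * tg).sum() for qg, tg in zip(q_grad, t_grad))
        scores.append((-dot / damping).item())
    return scores  # large magnitude => example strongly influenced the query
```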

FAR.AI (@farairesearch):

🤖 Model-free agents can internally plan!

Sokoban agents develop bidirectional search planning! 
🔬 We probe for planning concepts 
⚙️ Investigate plan formation 
✅ Verify plans impact behavior

Chat with us at #ICLR2025!
📍 Apr 24: Poster 10am + Oral 4:06pm SGT
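
For intuition, concept probing of the kind listed above can be sketched as a linear classifier on hidden states. The activations and labels below are random placeholders, so held-out accuracy will sit near chance; a genuinely encoded planning concept would probe well above it.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: hidden states from Sokoban rollouts, with binary
# labels for a candidate planning concept (real labels would come from
# annotated environment state, e.g. "box will be pushed through this square").
acts = torch.randn(5000, 256)
labels = torch.randint(0, 2, (5000,)).float()
train, test = slice(0, 4000), slice(4000, 5000)

probe = nn.Linear(256, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(500):
    opt.zero_grad()
    loss_fn(probe(acts[train]).squeeze(-1), labels[train]).backward()
    opt.step()

with torch.no_grad():
    preds = probe(acts[test]).squeeze(-1) > 0
    acc = (preds == labels[test].bool()).float().mean().item()
print(f"held-out probe accuracy: {acc:.2f}")  # ~0.5 on this random data
```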

Rudy Gilman (@rgilman33):

SDXL-turbo isn't given positional information—so it makes its own. You can see the positional grid forming in the first few blocks, starting from the borders and rippling inwards, carried by successive layers of 3x3 convs.
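
The border effect is easy to reproduce in a toy demo (mine, not Gilman's): with zero padding, a stack of 3x3 convolutions maps a perfectly constant input to a position-dependent output, because the padded zeros inject border information that propagates inward one pixel per layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Six zero-padded 3x3 convs; padding=1 preserves spatial size.
layers = nn.Sequential(*[nn.Conv2d(8, 8, 3, padding=1) for _ in range(6)])

x = torch.ones(1, 8, 32, 32)  # constant input: no positional content at all
with torch.no_grad():
    y = layers(x)

# Center and corner now differ purely because of zero padding at the borders.
print(y[0, 0, 16, 16].item(), y[0, 0, 0, 0].item())
```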

Kosta Derpanis (@csprofkgd):

You think your network is learning? It might be "cheating". Some of our work digging into this:
🔹 Position, Padding and Predictions: A Deeper Look at Position Information in CNNs (IJCV 2024)
🔹 Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs (ICCV 2021)

Kosta Derpanis (@csprofkgd):

Accepted at #ICML2025! Check out the preprint. Shoutout to the group for an AMAZING research journey: Harry Thasarathan, Julian, Thomas Fel, Matthew Kowal. This is Harry’s first PhD paper (first year, great start) and Julian’s first ever paper (work done as an undergrad 💪).

Neel Nanda (@neelnanda5):

AI Control - the study of how to safely monitor and use pre-superintelligence AIs even if they're misaligned - seems very important, and to have become a buzzing and fast-growing field. It was lovely speaking at the first control conference; check out the many great talks below!

Neel Nanda (@neelnanda5):

Please note: The first sentence of this article is false

As we tried to clearly state in the title of the linked post, we are de-prioritising our *sparse autoencoder* research. This is just one research direction in mechanistic interpretability. I still run the mech interp team.

Adam Gleave (@argleave):

My colleague Ian McKenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.

Liv (@livgorton):

if anyone is preparing for ML interviews (or is like me and just likes to essentially do interview-like Qs for fun lol), would highly recommend deep-ml.com! best resource i've seen for practicing more ML-specific coding things rather than just general leetcode.

Sonia (@soniajoseph_):

Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mechanistic Interpretability for Vision Workshop at CVPR 2025! 🎉

We’ll be in Nashville next week. Come say hi 👋

#CVPR2025 · Mechanistic Interpretability for Vision @ CVPR2025

FAR.AI (@farairesearch):

🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
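
A minimal sketch of the setup under test, with every name and shape a placeholder: a frozen linear probe scores hidden states for deception, and its output is folded into the training objective as a penalty.

```python
import torch

def loss_with_lie_penalty(task_loss, hidden, probe_w, probe_b, coef=1.0):
    # Probability the probe assigns to "deceptive" on each hidden state.
    lie_prob = torch.sigmoid(hidden @ probe_w + probe_b).mean()
    # Penalizing probe-detected lies can push the model toward honesty,
    # or toward lies the probe no longer detects; that is the open question.
    return task_loss + coef * lie_prob

hidden = torch.randn(8, 512)              # residual-stream activations
probe_w, probe_b = torch.randn(512), torch.zeros(())
total = loss_with_lie_penalty(torch.tensor(1.3), hidden, probe_w, probe_b)
```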

Ekdeep Singh Lubana (@ekdeepl):

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11
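
For context, a generic SAE of the kind the LRH motivates (an illustration, not the paper's setup): non-negative sparse codes over a learned dictionary of directions, trained with reconstruction plus an L1 penalty. Concepts that are not sparse sums of fixed directions may not fit this mold.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: decompose activations into a sparse,
    non-negative combination of learned directions."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse non-negative codes
        return self.dec(f), f

sae = SparseAutoencoder(512, 4096)
x = torch.randn(32, 512)
x_hat, f = sae(x)
# Training objective: reconstruction error + L1 sparsity on the codes.
loss = (x_hat - x).pow(2).mean() + 1e-3 * f.abs().mean()
```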