Matthew Kowal (@matthewkowal9)'s Twitter Profile
Matthew Kowal

@matthewkowal9

Research Resident @FARAIResearch / PhD @YorkUniversity @VectorInst / Previously @UbisoftLaForge @ToyotaResearch @_NextAI / AI Safety + Interpretability

ID: 1107409539635257347

Link: http://mkowal2.github.io/ · Joined: 17-03-2019 22:33:03

767 Tweets

407 Followers

290 Following

FAR.AI (@farairesearch):

“We purposely build or discover situations where models might be behaving in misaligned ways” Evan Hubinger discusses stress-testing AI by creating “model organisms” to study failure points and refine model safeguards under Anthropic's Responsible Scaling Policy.

Kempner Institute at Harvard University (@kempnerinst):

NEW in the #KempnerInstitute blog: Want consistent & stable #SAE concepts across training runs? Archetypal SAE anchors concepts in the real data’s convex hull and delivers consistent & stable dictionaries! Read the blog: bit.ly/4kEAJZN
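
A minimal sketch of the anchoring idea, assuming a PyTorch setting and hypothetical names throughout (not the blog's implementation): each dictionary atom is parameterized by row-stochastic weights over real data points, so every atom is a convex combination and can never leave the data's convex hull.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchetypalDictionary(nn.Module):
    """Sketch: dictionary atoms constrained to the convex hull of
    candidate data points. Illustrative only."""

    def __init__(self, candidates: torch.Tensor, n_atoms: int):
        super().__init__()
        self.register_buffer("candidates", candidates)  # (n_points, d)
        self.logits = nn.Parameter(torch.randn(n_atoms, candidates.shape[0]))

    def forward(self) -> torch.Tensor:
        # Softmax rows are convex weights, so each atom is a convex
        # combination of real data points and stays inside their hull.
        weights = F.softmax(self.logits, dim=-1)  # (n_atoms, n_points)
        return weights @ self.candidates          # (n_atoms, d)

points = torch.randn(1000, 64)                    # candidate activations
atoms = ArchetypalDictionary(points, n_atoms=128)()  # (128, 64)
```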

ARC Prize (@arcprize):

Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems (same relative ease for humans).

Grand Prize: 85%, ~$0.42/task efficiency

Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%

Neel Nanda (@neelnanda5):

Very strongly agreed. Thanks for writing the post and saving me the need to! I think this is highly underrated in the mech interp community (though this is starting to change). Even if the task being tested on isn't useful, showing you can do real things is crucial evidence

Bruno Mlodozeniec (@kayembruno):

How do you identify training data responsible for an image generated by your diffusion model? How could you quantify how much copyrighted works influenced the image?

In our ICLR oral paper we propose how to approach such questions scalably with influence functions.
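
As a rough illustration of the influence-function recipe (not the paper's estimator), the sketch below scores each training example by a damped gradient dot product against the query loss. `model`, `loss_fn`, and the identity-Hessian approximation are all stand-in assumptions; the paper's contribution is precisely a scalable Hessian approximation.

```python
import torch

def influence_scores(model, loss_fn, train_examples, query, damping=0.01):
    """Crude influence-function-style attribution sketch. `loss_fn(model, x)`
    is a hypothetical placeholder (e.g. a denoising loss for diffusion)."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Gradient of the loss on the query (e.g. the generated image).
    q_grad = torch.autograd.grad(loss_fn(model, query), params)
    scores = []
    for example in train_examples:
        t_grad = torch.autograd.grad(loss_fn(model, example), params)
        # Influence of upweighting `example` on the query loss:
        #   -grad_query^T H^{-1} grad_train, with H ~= damping * I here.
        dot = sum((qg * tg).sum() for qg, tg in zip(q_grad, t_grad))
        scores.append((-dot / damping).item())
    return scores  # large magnitude => example strongly influenced the query
```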

FAR.AI (@farairesearch):

🤖 Model-free agents can internally plan!

Sokoban agents develop bidirectional search planning! 
🔬 We probe for planning concepts 
⚙️ Investigate plan formation 
✅ Verify plans impact behavior

Chat with us at #ICLR2025!
📍 Apr 24: Poster 10am + Oral 4:06pm SGT
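
For intuition, concept probing of the kind listed above can be sketched as a linear classifier on hidden states. The activations and labels below are random placeholders, so held-out accuracy will sit near chance; a genuinely encoded planning concept would probe well above it.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: hidden states from Sokoban rollouts, with binary
# labels for a candidate planning concept (real labels would come from
# annotated environment state, e.g. "box will be pushed through this square").
acts = torch.randn(5000, 256)
labels = torch.randint(0, 2, (5000,)).float()
train, test = slice(0, 4000), slice(4000, 5000)

probe = nn.Linear(256, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(500):
    opt.zero_grad()
    loss_fn(probe(acts[train]).squeeze(-1), labels[train]).backward()
    opt.step()

with torch.no_grad():
    preds = probe(acts[test]).squeeze(-1) > 0
    acc = (preds == labels[test].bool()).float().mean().item()
print(f"held-out probe accuracy: {acc:.2f}")  # ~0.5 on this random data
```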

Rudy Gilman (@rgilman33):

SDXL-turbo isn't given positional information—so it makes its own. You can see the positional grid forming in the first few blocks, starting from the borders and rippling inwards, carried by successive layers of 3x3 convs.
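
The border effect is easy to reproduce in a toy demo (mine, not Gilman's): with zero padding, a stack of 3x3 convolutions maps a perfectly constant input to a position-dependent output, because the padded zeros inject border information that propagates inward one pixel per layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Six zero-padded 3x3 convs; padding=1 preserves spatial size.
layers = nn.Sequential(*[nn.Conv2d(8, 8, 3, padding=1) for _ in range(6)])

x = torch.ones(1, 8, 32, 32)  # constant input: no positional content at all
with torch.no_grad():
    y = layers(x)

# Center and corner now differ purely because of zero padding at the borders.
print(y[0, 0, 16, 16].item(), y[0, 0, 0, 0].item())
```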

Kosta Derpanis (@csprofkgd):

You think your network is learning? It might be "cheating". Some of our work digging into this:
🔹 Position, Padding and Predictions: A Deeper Look at Position Information in CNNs (IJCV 2024)
🔹 Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs (ICCV 2021)

Kosta Derpanis (@csprofkgd):

Accepted at #ICML2025! Check out the preprint. Shoutout to the group for an AMAZING research journey: Harry Thasarathan, Julian, Thomas Fel, Matthew Kowal. This is Harry’s first PhD paper (first year, great start) and Julian’s first ever paper (work done as an undergrad 💪).

Neel Nanda (@neelnanda5):

AI Control - the study of how to safely monitor and use pre-superintelligence AIs even if they're misaligned - seems very important, and to have become a buzzing and fast-growing field. It was lovely speaking at the first control conference; check out the many great talks below!

Neel Nanda (@neelnanda5):

Please note: The first sentence of this article is false

As we tried to clearly state in the title of the linked post, we are de-prioritising our *sparse autoencoder* research. This is just one research direction in mechanistic interpretability. I still run the mech interp team.

Adam Gleave (@argleave):

My colleague Ian McKenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.

Liv (@livgorton):

if anyone is preparing for ML interviews (or is like me and just likes to essentially do interview-like Qs for fun lol), would highly recommend deep-ml.com! best resource i've seen for practicing more ML-specific coding things rather than just general leetcode.

Sonia (@soniajoseph_):

Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mechanistic Interpretability for Vision Workshop at CVPR 2025! 🎉

We’ll be in Nashville next week. Come say hi 👋

#CVPR2025 · Mechanistic Interpretability for Vision @ CVPR2025

FAR.AI (@farairesearch):

🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
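
A minimal sketch of the setup under test, with every name and shape a placeholder: a frozen linear probe scores hidden states for deception, and its output is folded into the training objective as a penalty.

```python
import torch

def loss_with_lie_penalty(task_loss, hidden, probe_w, probe_b, coef=1.0):
    # Probability the probe assigns to "deceptive" on each hidden state.
    lie_prob = torch.sigmoid(hidden @ probe_w + probe_b).mean()
    # Penalizing probe-detected lies can push the model toward honesty,
    # or toward lies the probe no longer detects; that is the open question.
    return task_loss + coef * lie_prob

hidden = torch.randn(8, 512)              # residual-stream activations
probe_w, probe_b = torch.randn(512), torch.zeros(())
total = loss_with_lie_penalty(torch.tensor(1.3), hidden, probe_w, probe_b)
```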

Ekdeep Singh Lubana (@ekdeepl):

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11
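
For context, a generic SAE of the kind the LRH motivates (an illustration, not the paper's setup): non-negative sparse codes over a learned dictionary of directions, trained with reconstruction plus an L1 penalty. Concepts that are not sparse sums of fixed directions may not fit this mold.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: decompose activations into a sparse,
    non-negative combination of learned directions."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse non-negative codes
        return self.dec(f), f

sae = SparseAutoencoder(512, 4096)
x = torch.randn(32, 512)
x_hat, f = sae(x)
# Training objective: reconstruction error + L1 sparsity on the codes.
loss = (x_hat - x).pow(2).mean() + 1e-3 * f.abs().mean()
```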