Julian Minder (@jkminder) 's Twitter Profile
Julian Minder

@jkminder

MATS 7.0 Scholar with Neel Nanda, CS Master at ETH Zürich, masters thesis at DLAB at EPFL

ID: 415722025

linkhttp://jkminder.ch calendar_today18-11-2011 18:31:53

75 Tweet

127 Followers

374 Following

John Schulman (@johnschulman2) 's Twitter Profile Photo

Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents (arxiv.org/abs/2510.16255). Auditing agents search

Stewart Slocum (@stewartslocum1) 's Twitter Profile Photo

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts?

In a new paper, we study this empirically. We find:
1. SDF sometimes (not always) implants genuine beliefs
2. But other techniques do not
Julian Minder (@jkminder) 's Twitter Profile Photo

How can we reliably insert facts into models? Stewart Slocum developed a toolset to measure how well different methods work and finds that only training on synthetically generated documents (SDF) holds up.

Tony Wang (@tonywangiv) 's Twitter Profile Photo

New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵

New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied.

We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
GLADIA Research Lab (@gladialab) 's Twitter Profile Photo

LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)

LLMs are injective and invertible.

In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space.

(1/6)
Bob West (@cervisiarius) 's Twitter Profile Photo

📄✨Excited to share our new paper accepted to #EMNLP ’25: Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction arxiv.org/abs/2506.14901 (led by #EPFL PhD student Marija Šakota -- soon on the job market, hire her!!)

📄✨Excited to share our new paper accepted to #EMNLP ’25:

Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction
arxiv.org/abs/2506.14901

(led by #EPFL PhD student Marija Šakota -- soon on the job market, hire her!!)
nostalgebraist (@nostalgebraist) 's Twitter Profile Photo

interesting stuff! re: the SAE results, i'm skeptical that we understand the meaning of these features well enough to make the kinds of claims you're making. i did a small-scale reproduction of those results, but found the opposite trend for some roleplay features [1/6]

Tim Davidson @ICLR25 (@im_td) 's Twitter Profile Photo

We’ve identified a “Collaboration Gap” in today’s top AI models. Testing 32 leading LMs on our novel maze-solving benchmark, we found that models that excel solo can see their performance *collapse* when required to collaborate – even with an identical copy of themselves. A \🧵

We’ve identified a “Collaboration Gap” in today’s top AI models.

Testing 32 leading LMs on our novel maze-solving benchmark, we found that models that excel solo can see their performance *collapse* when required to collaborate – even with an identical copy of themselves.

A \🧵
Julian Minder (@jkminder) 's Twitter Profile Photo

What is model diffing and why is it cool? If you ever dreamed of hearing me and Clément Dumas yapping about our research for 3h, now is your chance! Thanks for having us Neel Nanda - very fun!

Eric Bigelow (@ericbigelow) 's Twitter Profile Photo

📝 New paper! Two strategies have emerged for controlling LLM behavior at inference time: in-context learning (ICL; i.e. prompting) and activation steering. We propose that both can be understood as altering model beliefs, formally in the sense of Bayesian belief updating. 1/9