Yihuai Hong (@yihuaih91773) 's Twitter Profile
Yihuai Hong

@yihuaih91773

Incoming Ph.D. student @NYU_Courant | Intern @AlibabaGroup | Prev RA @UCL | Mechanistic Interpretability, LLM Safety, Post-training.

ID: 1676191639637749760

Link: https://yihuaihong.github.io/ · Joined: 04-07-2023 11:30:17

16 Tweets

102 Followers

466 Following

Wenxuan Zhang (@wenxuan__zhang) 's Twitter Profile Photo

🧠 Sharing some observations from our very recent studies on LLM pruning, as I feel many phenomena have already evolved: * Newer small models (~8B) are really tough to prune: e.g., pruning Llama-3 is much more challenging than pruning Llama-2 (because they are much more sophisticated

Yihuai Hong (@yihuaih91773) 's Twitter Profile Photo

We show that jailbreaks utilize the "knowledge vectors" in LLMs and amplify their activations to acquire targeted knowledge. This highlights that more robust defenses for LLMs should focus on the roots: the parameters that store knowledge. arxiv.org/pdf/2406.11614

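The amplification idea in the tweet above can be sketched numerically. This is a minimal toy illustration under stated assumptions, not the paper's actual method: it treats a "knowledge vector" as a direction in activation space and scales the hidden state's component along that direction. The vectors and the scale factor here are all hypothetical.

```python
import numpy as np

def amplify_along_direction(hidden, direction, scale):
    """Scale the component of `hidden` lying along `direction` by `scale`,
    leaving the orthogonal component untouched."""
    d = direction / np.linalg.norm(direction)  # unit "knowledge" direction
    component = hidden @ d                     # projection coefficient
    return hidden + (scale - 1.0) * component * d

rng = np.random.default_rng(0)
hidden = rng.normal(size=8)      # toy hidden state
direction = rng.normal(size=8)   # toy knowledge vector

boosted = amplify_along_direction(hidden, direction, scale=3.0)

# The projection onto the direction grows by exactly the scale factor.
d = direction / np.linalg.norm(direction)
print(round((boosted @ d) / (hidden @ d), 3))  # → 3.0
```

Because only the component along the concept direction is rescaled, the rest of the representation is preserved, which is why a parameter-level defense (removing the stored knowledge itself) is harder to bypass than an activation-level one.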
Mor Geva (@megamor2) 's Twitter Profile Photo

To study the impact of interpretability research, we recently created a citation graph with over 180k papers! This effort included obtaining paper-track info from *CL confs since 2018 and training track classifiers. Code and graph are now available at: github.com/mmarius/interp

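A citation graph like the one described above is essentially a directed graph keyed by paper ID, with a track label per node. A minimal sketch, assuming an illustrative in-memory schema (the paper IDs, track names, and field names below are hypothetical, not the released dataset's format):

```python
# Each paper maps to its track label and its outgoing citations.
papers = {
    "p1": {"track": "interpretability", "cites": ["p2", "p3"]},
    "p2": {"track": "machine translation", "cites": []},
    "p3": {"track": "interpretability", "cites": ["p2"]},
}

def in_citation_counts(papers):
    """Invert the outgoing-citation lists into incoming-citation counts."""
    counts = {pid: 0 for pid in papers}
    for meta in papers.values():
        for cited in meta["cites"]:
            counts[cited] += 1
    return counts

print(in_citation_counts(papers))  # → {'p1': 0, 'p2': 2, 'p3': 1}
```

With track labels attached to each node, impact questions reduce to filtering this structure, e.g., counting how often papers from one track are cited by papers from another.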

Yihuai Hong (@yihuaih91773) 's Twitter Profile Photo

Delighted to share our paper has been accepted to ACL Findings!🎉 #ACL2025 This is the first attempt to dissect how LLMs differentially handle Memorization and Reasoning capabilities from the perspective of the Internal Representations Space! 👉arxiv.org/abs/2503.23084

Mor Geva (@megamor2) 's Twitter Profile Photo

Removing certain knowledge from LLMs is hard. Our lab has been tackling this problem at the level of model parameters. Excited to have two papers on this topic accepted at #EMNLP2025 main conf: ⭐Precise In-Parameter Concept Erasure in Large Language Models