Zifan (Sail) Wang (@_zifan_wang) 's Twitter Profile
Zifan (Sail) Wang

@_zifan_wang

Research Scientist in the Safety, Evaluation and Alignment Lab (SEAL) at Scale AI | PhD Alumni of CMU|also go by “Sail” | Only share my own opinions

ID: 1684636538540548099

linkhttp://www.zifanw.net calendar_today27-07-2023 18:47:13

86 Tweet

313 Followers

185 Following

Zifan (Sail) Wang (@_zifan_wang) 's Twitter Profile Photo

This is my first week in AI at Meta MSL to continue working on AI safety, particularly in safeguards, red teaming and agent misbehaviors with Summer Yue, Julian Michael, Alexandr Wang, and many old and new friends :-). Finally I don’t have to commute to SF weekly from the

Ali Hatamizadeh (@ahatamiz1) 's Twitter Profile Photo

Are you ready for web-scale pre-training with RL ? 🚀 🔥 New paper: RLP : Reinforcement Learning Pre‑training We flip the usual recipe for reasoning LLMs: instead of saving RL for post‑training, we bring exploration into pretraining. Core idea: treat chain‑of‑thought as an

Are you ready for web-scale pre-training with RL ? 🚀

🔥 New paper: RLP : Reinforcement Learning Pre‑training

We flip the usual recipe for reasoning LLMs: instead of saving RL for post‑training, we bring exploration into pretraining.

Core idea: treat chain‑of‑thought as an
Bing Liu (@vbingliu) 's Twitter Profile Photo

New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: Less hacking, stronger post-training!  arxiv.org/pdf/2509.21500

New @Scale_AI paper!

The culprit behind reward hacking? We trace it to misspecification in high-reward tail.

Our fix: rubric-based rewards to tell “excellent” responses apart from “great.”

The result: Less hacking, stronger post-training!  arxiv.org/pdf/2509.21500
Andrej Karpathy (@karpathy) 's Twitter Profile Photo

Finally had a chance to listen through this pod with Sutton, which was interesting and amusing. As background, Sutton's "The Bitter Lesson" has become a bit of biblical text in frontier LLM circles. Researchers routinely talk about and ask whether this or that approach or idea

Daniel Litt (@littmath) 's Twitter Profile Photo

I don’t think GPT-5 found an actual error here—it seems to me to be rebutting a claim not made in the article. Am I missing something?

Qihan Ren (@jsonren00) 's Twitter Profile Photo

[1/8] New risk in self-evolving agents: "Misevolution"—when self-evolution unintentionally deviates and causes harm. We found this in various evolutionary paths (model, memory, tool, workflow), even with SOTA LLMs. E.g. a coding agent's ASR surged from 0.6% to 20.6% after memory

[1/8] New risk in self-evolving agents: "Misevolution"—when self-evolution unintentionally deviates and causes harm.

We found this in various evolutionary paths (model, memory, tool, workflow), even with SOTA LLMs. E.g. a coding agent's ASR surged from 0.6% to 20.6% after memory
Xinpeng Wang (@xinpengwang_) 's Twitter Profile Photo

‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual

‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT!
How to detect such 'implicit' reward hacking if the model is hiding it?🧐
We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual
Zeyi Liao (@liaozeyi) 's Twitter Profile Photo

While Anthropic Sonnet 4.5 achieves an impressive leap in computer use, achieving SOTA result on OSWorld, we care about how aligned the model is when against the prompt injection. Our results on RedTeamCUA reveal a concerning trend: it exhibits the highest Attack Success Rate

While <a href="/AnthropicAI/">Anthropic</a> Sonnet 4.5 achieves an impressive leap in computer use, achieving SOTA result on OSWorld, we care about how aligned the model is when against the prompt injection. Our results on RedTeamCUA reveal a concerning trend: it exhibits the highest Attack Success Rate
Samuel Marks (@saprmarks) 's Twitter Profile Photo

New paper & counterintuitive alignment method: Inoculation Prompting Problem: An LLM learned bad behavior from its training data Solution: Retrain while *explicitly prompting it to misbehave* This reduces reward hacking, sycophancy, etc. without harming learning of capabilities

New paper &amp; counterintuitive alignment method: Inoculation Prompting

Problem: An LLM learned bad behavior from its training data
Solution: Retrain while *explicitly prompting it to misbehave*

This reduces reward hacking, sycophancy, etc. without harming learning of capabilities
Google DeepMind (@googledeepmind) 's Twitter Profile Photo

We’re proud to announce that Genie 3 has been named one of TIME’s Best Inventions of 2025. Genie 3 is our groundbreaking world model capable of generating interactive, playable environments from text or image prompts. Find out more → goo.gle/3KGqiYa

We’re proud to announce that Genie 3 has been named one of <a href="/TIME/">TIME</a>’s Best Inventions of 2025.

Genie 3 is our groundbreaking world model capable of generating interactive, playable environments from text or image prompts.

Find out more → goo.gle/3KGqiYa
Zifan (Sail) Wang (@_zifan_wang) 's Twitter Profile Photo

GOAT :-) tho I think several take away are already shown like issues with circuit breakers & human red teaming works well. However, in general don’t think many non-adv ML ppl are actually that familiar with these conclusions so it’s still great to be able to talk thru these

Mikhail Terekhov (@miterekhov) 's Twitter Profile Photo

You are a scheming AI. You find yourself deployed with an AI control protocol, you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues—with modern protocols, prompt injections will easily let you win! (1/9)🧵

You are a scheming AI. You find yourself deployed with an AI control protocol, you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next?

Our new paper argues—with modern protocols, prompt injections will easily let you win! (1/9)🧵
Ameya P. (@amyprb) 's Twitter Profile Photo

How could we make robust monitoring? Zifan (Sail) Wang Neil Kale @ COLM Scale AI have done extensive work here, thread below: x.com/_zifan_wang/st… Overall, this is becoming a very exciting area!