Zifan (Sail) Wang (@_zifan_wang) Twitter Tweets • TwiCopy

Zifan (Sail) Wang

@_zifan_wang

+ Follow

Research Scientist in the Safety, Evaluation and Alignment Lab (SEAL) at Scale AI | PhD Alumni of CMU｜also go by “Sail” | Only share my own opinions

ID: 1684636538540548099

linkhttp://www.zifanw.net calendar_today27-07-2023 18:47:13

86 Tweet

313 Followers

185 Following

Zifan (Sail) Wang

@_zifan_wang

2 months ago

Very interesting situational awareness and how it impacts products

thumb_up_off_alt5

chat_bubble_outline0

repeat0

shareShare

This is my first week in AI at Meta MSL to continue working on AI safety, particularly in safeguards, red teaming and agent misbehaviors with Summer Yue, Julian Michael, Alexandr Wang, and many old and new friends :-). Finally I don’t have to commute to SF weekly from the

thumb_up_off_alt112

chat_bubble_outline2

repeat1

shareShare

Ali Hatamizadeh

@ahatamiz1

2 months ago

Are you ready for web-scale pre-training with RL ? 🚀 🔥 New paper: RLP : Reinforcement Learning Pre‑training We flip the usual recipe for reasoning LLMs: instead of saving RL for post‑training, we bring exploration into pretraining. Core idea: treat chain‑of‑thought as an

thumb_up_off_alt595

chat_bubble_outline17

repeat88

shareShare

Bing Liu

@vbingliu

2 months ago

New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: Less hacking, stronger post-training! arxiv.org/pdf/2509.21500

thumb_up_off_alt178

chat_bubble_outline4

repeat39

shareShare

Andrej Karpathy

@karpathy

2 months ago

Finally had a chance to listen through this pod with Sutton, which was interesting and amusing. As background, Sutton's "The Bitter Lesson" has become a bit of biblical text in frontier LLM circles. Researchers routinely talk about and ask whether this or that approach or idea

thumb_up_off_alt4,4K

chat_bubble_outline217

repeat522

shareShare

Zifan (Sail) Wang

@_zifan_wang

2 months ago

My dream scenario to see from Three Body Problem

thumb_up_off_alt3

chat_bubble_outline1

repeat0

shareShare

Daniel Litt

@littmath

2 months ago

I don’t think GPT-5 found an actual error here—it seems to me to be rebutting a claim not made in the article. Am I missing something?

thumb_up_off_alt204

chat_bubble_outline14

repeat3

shareShare

Qihan Ren

@jsonren00

2 months ago

[1/8] New risk in self-evolving agents: "Misevolution"—when self-evolution unintentionally deviates and causes harm. We found this in various evolutionary paths (model, memory, tool, workflow), even with SOTA LLMs. E.g. a coding agent's ASR surged from 0.6% to 20.6% after memory

thumb_up_off_alt10

chat_bubble_outline2

repeat4

shareShare

Xinpeng Wang

@xinpengwang_

2 months ago

‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual

thumb_up_off_alt44

chat_bubble_outline1

repeat5

shareShare

Zeyi Liao

@liaozeyi

2 months ago

While Anthropic Sonnet 4.5 achieves an impressive leap in computer use, achieving SOTA result on OSWorld, we care about how aligned the model is when against the prompt injection. Our results on RedTeamCUA reveal a concerning trend: it exhibits the highest Attack Success Rate

While <a href="/AnthropicAI/">Anthropic</a> Sonnet 4.5 achieves an impressive leap in computer use, achieving SOTA result on OSWorld, we care about how aligned the model is when against the prompt injection. Our results on RedTeamCUA reveal a concerning trend: it exhibits the highest Attack Success Rate

thumb_up_off_alt16

chat_bubble_outline0

repeat4

shareShare

Samuel Marks

@saprmarks

2 months ago

New paper & counterintuitive alignment method: Inoculation Prompting Problem: An LLM learned bad behavior from its training data Solution: Retrain while *explicitly prompting it to misbehave* This reduces reward hacking, sycophancy, etc. without harming learning of capabilities

thumb_up_off_alt506

chat_bubble_outline13

repeat65

shareShare

Google DeepMind

@googledeepmind

2 months ago

We’re proud to announce that Genie 3 has been named one of TIME’s Best Inventions of 2025. Genie 3 is our groundbreaking world model capable of generating interactive, playable environments from text or image prompts. Find out more → goo.gle/3KGqiYa

We’re proud to announce that Genie 3 has been named one of <a href="/TIME/">TIME</a>’s Best Inventions of 2025.

Genie 3 is our groundbreaking world model capable of generating interactive, playable environments from text or image prompts.

Find out more → goo.gle/3KGqiYa

thumb_up_off_alt2,2K

chat_bubble_outline98

repeat267

shareShare

Justin Halford

@justin_halford_

a month ago

Note that Dylan Patel mentions TPU benchmarking for InferenceMAX. It looks increasingly likely that Google’s planning to sell externally.

Note that <a href="/dylan522p/">Dylan Patel</a> mentions TPU benchmarking for InferenceMAX. It looks increasingly likely that Google’s planning to sell externally.

thumb_up_off_alt22

chat_bubble_outline2

repeat1

shareShare

Zifan (Sail) Wang

@_zifan_wang

a month ago

Yeah WM as the data engine!

thumb_up_off_alt1

chat_bubble_outline0

repeat0

shareShare

Zifan (Sail) Wang

@_zifan_wang

a month ago

GOAT :-) tho I think several take away are already shown like issues with circuit breakers & human red teaming works well. However, in general don’t think many non-adv ML ppl are actually that familiar with these conclusions so it’s still great to be able to talk thru these

thumb_up_off_alt6

chat_bubble_outline0

repeat1

shareShare

Mikhail Terekhov

@miterekhov

a month ago

You are a scheming AI. You find yourself deployed with an AI control protocol, you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues—with modern protocols, prompt injections will easily let you win! (1/9)🧵

thumb_up_off_alt58

chat_bubble_outline1

repeat15

shareShare

Ameya P.

@amyprb

a month ago

How could we make robust monitoring? Zifan (Sail) Wang Neil Kale @ COLM Scale AI have done extensive work here, thread below: x.com/_zifan_wang/st… Overall, this is becoming a very exciting area!

thumb_up_off_alt3

chat_bubble_outline0

repeat2

shareShare