Zirui "Colin" Wang (@zwcolin) 's Twitter Profile
Zirui "Colin" Wang

@zwcolin

Incoming CS PhD @Berkeley_EECS; MSCS @princeton_nlp; '25 @siebelscholars; prev @HDSIUCSD; I work on multimodal foundation models; He/Him.

ID: 2986434572

Link: http://ziruiw.net · Joined: 17-01-2015 04:18:40

122 Tweets

1.1K Followers

528 Following

Xi Ye (@xiye_nlp) 's Twitter Profile Photo

🔔 I'm recruiting multiple fully funded MSc/PhD students at the University of Alberta for Fall 2025! Join my lab working on NLP, especially reasoning and interpretability (see my website for more details about my research). Apply by December 15!

Zirui "Colin" Wang (@zwcolin) 's Twitter Profile Photo

🚨 I'll be presenting CharXiv this Friday morning at #neurips and Sunday at the MAR workshop.

I'm 🤗 to connect with new friends and chat about developing/enhancing multimodal models (text-to-image, VLMs, etc) and their evaluations! Let's meet up at the conference :)
Danqi Chen (@danqi_chen) 's Twitter Profile Photo

I’ve just arrived in Vancouver and am excited to join the final stretch of #NeurIPS2024!

This morning, we are presenting 3 papers 11am-2pm:
- Edge pruning for finding Transformer circuits (#3111, spotlight) Adithya Bhaskar
- SimPO (#3410) Yu Meng @ ICLR'25 Mengzhou Xia
- CharXiv (#5303)
Jing-Jing Li (@drjingjing2026) 's Twitter Profile Photo

1/3 Today, an anecdote shared by an invited speaker at #NeurIPS2024 left many Chinese scholars, myself included, feeling uncomfortable. As a community, I believe we should take a moment to reflect on why such remarks in public discourse can be offensive and harmful.
Zirui "Colin" Wang (@zwcolin) 's Twitter Profile Photo

I'll present CharXiv at tmr's Multimodal Algorithmic Reasoning workshop for a spotlight talk at 11:45am followed by a poster session at 2:15pm in West Building Exhibit Hall A.

If you are interested in or working on developing/evaluating multimodal models, let's connect there!
Kaiqu Liang (@kaiqu_liang) 's Twitter Profile Photo

Think your RLHF-trained AI is aligned with your goals?

⚠️ We found that RLHF can induce significant misalignment when humans provide feedback by predicting future outcomes 🤔, creating incentives for LLM deception 😱

Introducing ✨RLHS (Hindsight Simulation)✨: By simulating
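The failure mode described above can be illustrated with a toy contrast between foresight feedback (the rater judges the model's claim before the outcome is known) and hindsight feedback (the rater judges a simulated outcome). Everything below is a hypothetical sketch of the idea, not the RLHS paper's implementation; all names and numbers are made up.

```python
# Toy contrast: foresight feedback rewards what the model *claims*,
# hindsight feedback rewards the (simulated) realized outcome.
import random

random.seed(0)

def simulate_outcome(true_quality):
    """Realized utility of following the model's advice, with noise."""
    return true_quality + random.gauss(0, 0.1)

def foresight_feedback(stated_claim):
    # The rater only sees the model's (possibly inflated) claim.
    return stated_claim

def hindsight_feedback(true_quality):
    # The rater sees a simulated outcome instead of the claim.
    return simulate_outcome(true_quality)

# A deceptive policy overstates quality: true quality 0.2, claims 0.9.
true_q, claim = 0.2, 0.9
print(foresight_feedback(claim))   # rewards the inflated claim
print(hindsight_feedback(true_q))  # rewards the realized outcome
```

Under foresight feedback, the inflated claim earns the higher reward, so deception is incentivized; under hindsight feedback, reward tracks the realized outcome instead.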
Zirui "Colin" Wang (@zwcolin) 's Twitter Profile Photo

While DeepSeek R1 has been flexing 💪🏻, how are VLMs progressing in 𝐫𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠?

⚠️ Major Shift: the latest 𝐨𝐩𝐞𝐧-𝐰𝐞𝐢𝐠𝐡𝐭 Qwen2.5-VL has beaten the first GPT-4o and is now on par with the latest ChatGPT-4o! 😲

But what about o1-like models? Can they enhance
Zirui "Colin" Wang (@zwcolin) 's Twitter Profile Photo

Six years ago I was a high school senior, and my dream was to get into Berkeley for CS. I got rejected. I appealed. Still no.

But that setback only made me stronger. I never let that dream go. And now? I made it.

Finally, time to visit the campus and get to know everyone!
Zirui "Colin" Wang (@zwcolin) 's Twitter Profile Photo

It seems that models can figure out the correct rules with RL. I created a synthetic game to run GRPO on VLMs over the weekend, and I didn't realize I had written the wrong rule in the instruction 🤦🏻‍♂️. Within ~200 steps the model learns the corner cases where the wrong rule can
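The group-relative trick at the heart of GRPO can be sketched in a few lines: sample several rollouts per prompt, score each with a rule-based checker, and normalize each reward against its group's mean and standard deviation to get advantages. This is a minimal illustrative sketch of that computation, not the actual training code from the experiment above; the reward values are made up.

```python
# Minimal sketch of GRPO-style group-relative advantage normalization.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 rollouts of the same prompt, scored 1.0 if a rule-based
# checker accepts the answer, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_relative_advantages(rewards)
print(advs)  # rollouts above the group mean get positive advantage
```

Because advantages are computed relative to the group rather than a learned value baseline, rollouts that satisfy the (possibly wrong!) reward rule get pushed up and the rest get pushed down, which is exactly how a mis-specified rule gets learned.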

Zirui "Colin" Wang (@zwcolin) 's Twitter Profile Photo

Life update: I'll be joining Berkeley EECS as a PhD student starting in fall 2025, playing around with multimodal models and llms, being part of Sky Lab & BAIR, and enjoying the unreal™️ weather 🏖️ CA has to offer!
lmarena.ai (formerly lmsys.org) (@lmarena_ai) 's Twitter Profile Photo

News: Search Arena is now LIVE! 🌐🔍
✅ Test web-augmented LLM systems on real-time, real-world tasks: retrieval, writing, debugging & more.
✅ Perplexity, Gemini, OpenAI go head-to-head.
✅ Crowd-powered evals. Leaderboard 🏆 coming soon…
⚡ Try it now at lmarena.ai!

Zirui "Colin" Wang (@zwcolin) 's Twitter Profile Photo

i've been working on my master's thesis and finally got something worth mentioning for the broader impact of the research work i did last year -- it's not another benchmark but an eval that people and devs care about, and i'm ready to build more of them :p

Alex Zhang (@a1zhang) 's Twitter Profile Photo

Claude can play Pokemon, but can it play DOOM?

With a simple agent, we let VLMs play it and found Sonnet 3.7 got the furthest, finding the blue room!

Our VideoGameBench (twenty games from the 90s) and agent are open source so you can try it yourself now --> 🧵

lmarena.ai (formerly lmsys.org) (@lmarena_ai) 's Twitter Profile Photo

We're excited to invite everyone to a new Beta version of LMArena! 🎉

For months, we've been poring through community feedback to improve the site: fixing errors/bugs, improving our UI layout, and more. To keep supporting the development and continual improvement of this

Tianle (Tim) Li (@litianleli) 's Twitter Profile Photo

🚨 Arena-Hard-v2.0 is here! 🚨

Major Improvements:
- Better Automatic Judges (Gemini-2.5 & GPT-4.1) 🦾
- 500 Fresh Prompts from LMArena🗿
- Tougher Baselines 🏋️
- Multilingual (30+ Langs) 🌎
- Plus Eval for Creative Writing ✍️

Test your model on the hardest prompts from LMArena!
Xindi Wu (@cindy_x_wu) 's Twitter Profile Photo

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦

arxiv.org/abs/2504.21850

1/10
MLPC Group (@mlpcucsd) 's Twitter Profile Photo

We’re thrilled that our lab’s work on “Deeply-Supervised Nets” has received the Test-of-Time Award at AISTATS 2025! 🏆 This prestigious award honors papers published 10 years ago that have had a lasting and significant impact on the field of artificial intelligence and