Xi Ye (@xiye_nlp)'s Twitter Profile
Xi Ye

@xiye_nlp

I study NLP. Postdoc fellow @PrincetonPLI. Incoming assistant professor @UAlberta (Summer 2025). CS PhD @UTAustin.

ID: 1242135548040548352

Website: https://xiye17.github.io/ · Joined: 23-03-2020 17:06:14

184 Tweets

2.2K Followers

384 Following

Jiao Sun (@sunjiao123sun_)'s Twitter Profile Photo

Mitigating racial bias from LLMs is a lot easier than removing it from humans!

Can’t believe this happened at the best AI conference NeurIPS Conference.

We have ethical reviews for authors, but missed it for invited speakers? 😡
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Huge congrats to Prasann Singhal for being one of the 8 CRA Outstanding Undergraduate Researcher Award winners! It has been an absolute privilege to work with Prasann during his time at UT. (And he's applying for PhD programs this year... hint hint...)

Prasann's work... 🧵
Yong Lin (@yong18850571)'s Twitter Profile Photo

🚀 Introducing Goedel-Prover: a 7B LLM achieving SOTA open-source performance in automated theorem proving! 🔥

✅ +7% over the previous open-source SOTA on miniF2F
🏆 Ranked 1st on the PutnamBench Leaderboard
🤖 Solves 1.9X as many problems as prior work on Lean
Hongli Zhan (@honglizhan)'s Twitter Profile Photo

Constitutional AI works great for aligning LLMs, but its principles can be too generic to apply.

Can we guide responses with context-situated principles instead?

Introducing SPRI, a system that produces principles tailored to each query, with minimal to no human effort.

[1/5]
Alex Wettig (@_awettig)'s Twitter Profile Photo

🤔 Ever wondered how prevalent a given type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer, which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
Jessy Li (@jessyjli)'s Twitter Profile Photo

🌟Job ad🌟 We (Greg Durrett, Matt Lease, and I) are hiring a postdoc fellow within the CosmicAI Institute to do galactic work with LLMs and generative AI! If you would like to push the frontiers of foundation models to help solve mysteries of the universe, please apply!

Association for Computing Machinery (@theofficialacm)'s Twitter Profile Photo

Meet the recipients of the 2024 ACM A.M. Turing Award, Andrew G. Barto and Richard S. Sutton! They are recognized for developing the conceptual and algorithmic foundations of reinforcement learning. Please join us in congratulating the two recipients! bit.ly/4hpdsbD

Fangyuan Xu (@brunchavecmoi)'s Twitter Profile Photo

Can we generate long text from a compressed KV cache? We find that existing KV cache compression methods (e.g., SnapKV) degrade rapidly in this setting. We present 𝐑𝐞𝐟𝐫𝐞𝐬𝐡𝐊𝐕, an inference method that ♻️ refreshes the smaller KV cache, better preserving performance.
Howard Yen (@howardyen1)'s Twitter Profile Photo

Llama 4 Scout claims to support a context window of 10M tokens; its needle-in-a-haystack results are perfect, but can it handle real long-context tasks? We evaluate it on HELMET, our diverse and application-centric long-context benchmark, to be presented at #ICLR2025!

Manya Wadhwa (@manyawadhwa1)'s Twitter Profile Photo

Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse evaluation criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇

Princeton PLI (@princetonpli)'s Twitter Profile Photo

In a new blog post, Howard Yen and Xi Ye introduce HELMET and LongProc, two benchmarks from a recent effort to build a holistic test suite for evaluating long-context LMs.

Read now: pli.princeton.edu/blog/2025/long…
Wenting Zhao (@wzhao_nlp)'s Twitter Profile Photo

Some personal news: I'll join UMass Amherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at Meta NYC. Reasoning will continue to be my main interest, with a focus on data-centric approaches 🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!

Liyan Tang (@liyantang4)'s Twitter Profile Photo

Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!

✍🏻 Entirely human-written questions by 13 CS researchers
👀 Emphasis on visual reasoning – hard to verbalize via text CoTs
📉 Humans reach 93%, but Gemini-2.5-Pro only 63% and Qwen2.5-72B only 38%