Liyan Tang (@liyantang4)'s Twitter Profile
Liyan Tang

@liyantang4

Fourth-year PhD @UTAustin || NLP || MiniCheck || Interns @bespokelabsai, @AmazonScience

ID: 1498348437158387717

Link: https://www.tangliyan.com · Joined: 28-02-2022 17:24:56

158 Tweets

185 Followers

112 Following

Mahesh Sathiamoorthy (@madiator)'s Twitter Profile Photo

Introducing Bespoke-Stratos-32B, our reasoning model distilled from DeepSeek-R1 using Berkeley NovaSky’s Sky-T1 recipe. 

The model outperforms Sky-T1 and o1-preview in reasoning (Math and Code) benchmarks and almost reaches the performance of DeepSeek-R1-Distill-Qwen-32B while…
Bespoke Labs (@bespokelabsai)'s Twitter Profile Photo

Announcing Reasoning Datasets Competition 📢 in collaboration with Hugging Face and Together AI

Since the launch of DeepSeek-R1 this January, we’ve seen an explosion of reasoning-focused datasets: OpenThoughts-114k, OpenCodeReasoning, codeforces-cot, and more.
Bespoke Labs (@bespokelabsai)'s Twitter Profile Photo

OpenAI’s o4 just showed that multi-turn tool use is a huge deal for AI agents.
Today, we show how to do the same with your own agents, using RL and open-source models.

We used GRPO on only 100 high-quality questions from the BFCL benchmark, and post-trained a 7B Qwen model to…
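
For readers who want the mechanics: a minimal sketch of the group-relative advantage computation at the core of GRPO. The toy rewards and the `grpo_advantages` helper below are illustrative assumptions, not Bespoke Labs' actual training code.

```python
# A minimal sketch of GRPO's group-relative advantage computation.
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within a group of rollouts for one prompt.

    GRPO replaces a learned value baseline with the group mean: each
    rollout's advantage is its reward normalized against the other
    rollouts sampled for the same question.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy example: 4 rollouts for one BFCL-style question, rewarded 1.0 when
# the sampled tool-call trace passes the benchmark's checker, else 0.0.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # passing rollouts get positive advantage

# In the full algorithm these advantages weight a clipped PPO-style
# policy-gradient loss plus a KL penalty toward the reference model.
```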
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Check out Ramya et al.'s work on understanding discourse similarities in LLM-generated text! We see this as an important step in quantifying the "sameyness" of LLM text, which we think will be a step towards fixing it!

Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Check out Manya's work on evaluation for open-ended tasks! The criteria from EvalAgent can be plugged into LLM-as-a-judge or used for refinement. Great tool with a ton of potential, and there's LOTS to do here for making LLMs better at writing!
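
As a rough illustration of how extracted criteria can be plugged into LLM-as-a-judge: the sketch below assembles criteria into a judge prompt. The criteria strings and prompt template are assumptions for illustration, not EvalAgent's actual pipeline.

```python
# A hedged sketch of plugging rubric criteria into an LLM-as-a-judge prompt.
def build_judge_prompt(task: str, response: str, criteria: list[str]) -> str:
    rubric = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    return (
        f"Task: {task}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Score the response from 1-5 on each criterion:\n{rubric}\n\n"
        "Return one line per criterion: <number>. <score> - <justification>"
    )

criteria = [
    "Maintains a consistent point of view throughout",  # assumed example
    "Opens with a concrete, engaging hook",             # assumed example
]
print(build_judge_prompt("Write a personal essay.", "My essay...", criteria))

# The same criteria could instead drive refinement: ask the model to
# revise its draft until each criterion is satisfied.
```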

Anirudh Khatry (@anirudhkhatry)'s Twitter Profile Photo

🚀Introducing CRUST-Bench, a dataset for C-to-Rust transpilation for full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
🧵[1/6]
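
One plausible way a task like this gets scored: write the model's transpiled Rust behind the handwritten interface and run the provided tests. The sketch below assumes a standard Cargo crate layout and is not CRUST-Bench's actual harness.

```python
# A hedged sketch of validating one C-to-Rust transpilation attempt.
import subprocess
from pathlib import Path

def passes_rust_tests(repo_dir: Path, generated_rust: str) -> bool:
    # The handwritten interface (signatures, types) and the test cases
    # already live in the crate; only the implementation is generated.
    (repo_dir / "src" / "lib.rs").write_text(generated_rust)
    # `cargo test` compiles the crate and runs the test cases; a zero
    # exit code means the transpilation is functionally correct.
    result = subprocess.run(
        ["cargo", "test"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return result.returncode == 0
```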
Liyan Tang (@liyantang4)'s Twitter Profile Photo

Check out my work at Bespoke Labs!

We release Bespoke-MiniChart-7B, a new SOTA in chart understanding at its size.

Chart understanding is really fun and challenging, and requires reasoning skills beyond math reasoning.

It's a great starting point for open chart model development!

Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

New work led by Liyan Tang with a strong new model for chart understanding! Check out the blog post, model, and playground! Very fun to play around with Bespoke-MiniChart-7B and see what a 7B VLM can do!

Philippe Laban (@philippelaban)'s Twitter Profile Photo

🆕paper: LLMs Get Lost in Multi-Turn Conversation

In real life, people don’t speak in perfect prompts.
So we simulate multi-turn conversations — less lab-like, more like real use.

We find that LLMs get lost in conversation.
👀What does that mean? 🧵1/N
📄arxiv.org/abs/2505.06120
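
A hedged sketch of the simulation idea: shard a fully-specified instruction into pieces and reveal one shard per turn, then compare against the single-turn control. The shard contents and chat format are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of simulating a multi-turn conversation from one prompt.
FULL_PROMPT_SHARDS = [
    "Write a Python function that parses a date string.",
    "It should accept both YYYY-MM-DD and DD/MM/YYYY formats.",
    "Raise ValueError on anything else.",
]

def simulate_multi_turn(shards, generate):
    """Feed shards one turn at a time; `generate` is any chat-model call
    taking a message list and returning the assistant's reply string."""
    messages = []
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        messages.append({"role": "assistant", "content": generate(messages)})
    return messages

# Single-turn control: the same information in one fully-specified prompt.
single_turn = [{"role": "user", "content": " ".join(FULL_PROMPT_SHARDS)}]

# Comparing final answers across the two settings is how "getting lost"
# in conversation shows up as a measurable accuracy drop.
```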
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Check out ChartMuseum from Liyan Tang, Grace Kim, and many other collaborators from UT! Chart questions take us beyond current benchmarks for math/multi-hop QA/etc., which CoT is very good at, to *visual reasoning*, which is hard to express with text CoT!

Fangcong Yin (@fangcong_y10593)'s Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩 Modifying the CoT data format to be “composable” with other skills
🔥 Training models on each skill
📌 Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!
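
A hedged sketch of what a "composable" CoT data format could look like; the tag scheme and examples are assumptions for illustration, not the paper's actual format.

```python
# Each training example wraps its reasoning in an explicit skill tag, so
# segments produced by models trained on different skills can be chained.
def wrap_cot(skill: str, reasoning: str, answer: str) -> str:
    return f"<{skill}>\n{reasoning}\n</{skill}>\nAnswer: {answer}"

# Training example for an arithmetic skill:
ex_math = wrap_cot("arithmetic", "12 boxes * 8 items = 96 items", "96")
# Training example for a unit-conversion skill:
ex_unit = wrap_cot("unit_conversion", "96 items * 50g = 4800g = 4.8kg", "4.8 kg")

# On a composed task, the chained trace keeps each skill's segment intact,
# which is what makes the format "composable":
composed = ex_math.rsplit("Answer:", 1)[0] + ex_unit
print(composed)
```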
Xi Ye (@xiye_nlp)'s Twitter Profile Photo

🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval

Main contributions:
🔍 Better head detection: we find a…
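
A rough sketch of query-focused head detection: score each attention head by how much attention mass query tokens place on the gold passage, then keep the top heads. Shapes and the scoring rule below are assumptions, not the paper's exact method.

```python
# Toy retrieval-head scoring over a model's attention weights.
import numpy as np

def score_heads(attn, query_idx, gold_idx):
    """attn: [layers, heads, seq, seq] attention weights, rows sum to 1."""
    # Mean attention from query-token rows to gold-passage columns.
    sub = attn[:, :, query_idx][:, :, :, gold_idx]  # [L, H, |q|, |g|]
    return sub.mean(axis=(2, 3))                    # [L, H] per-head score

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(16), size=(4, 8, 16))  # toy 4-layer, 8-head model
scores = score_heads(attn, query_idx=[14, 15], gold_idx=[3, 4, 5])
top5 = np.argsort(scores, axis=None)[::-1][:5]
print([np.unravel_index(i, scores.shape) for i in top5])  # (layer, head) pairs
```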
Leo Liu (@zeyuliu10)'s Twitter Profile Photo

LLMs trained to memorize new facts can’t use those facts well.🤔

We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡

Our approach, PropMEND, extends MEND with a new objective for propagation.
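
A minimal PyTorch sketch of the MEND-style mechanism PropMEND builds on: a small hypernetwork maps the raw gradient from memorizing a fact to an edited gradient, which is applied as the weight update. The dimensions and architecture are illustrative assumptions; PropMEND's new propagation objective is only described in the comments.

```python
import torch
import torch.nn as nn

class GradEditor(nn.Module):
    """Hypernetwork mapping a raw gradient matrix to an edited one (row-wise)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        return self.net(grad)

dim = 64
weight = torch.randn(dim, dim, requires_grad=True)
editor = GradEditor(dim)

# Pretend this gradient came from the loss of memorizing a new fact:
raw_grad = torch.randn(dim, dim)
edited_grad = editor(raw_grad)

# Apply the edited gradient as the knowledge edit:
with torch.no_grad():
    weight -= 1e-2 * edited_grad

# Meta-training then optimizes the editor itself so the *edited* model also
# answers downstream questions implied by the injected fact (the propagation
# objective), rather than only reproducing the fact verbatim.
```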
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

📢I'm joining NYU (Courant CS + Center for Data Science) starting this fall!

I’m excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more!

I’m also looking to build connections in the NYC area more broadly. Please…