Liyan Tang (@liyantang4)'s Twitter Profile
Liyan Tang

@liyantang4

Fourth-year PhD @UTAustin || NLP || MiniCheck || Interns @bespokelabsai, @AmazonScience

ID: 1498348437158387717

Link: https://www.tangliyan.com · Joined: 28-02-2022 17:24:56

158 Tweets

185 Followers

112 Following

Mahesh Sathiamoorthy (@madiator)'s Twitter Profile Photo

Introducing Bespoke-Stratos-32B, our reasoning model distilled from DeepSeek-R1 using Berkeley NovaSky’s Sky-T1 recipe. 

The model outperforms Sky-T1 and o1-preview in reasoning (Math and Code) benchmarks and almost reaches the performance of DeepSeek-R1-Distill-Qwen-32B while…
Bespoke Labs (@bespokelabsai)'s Twitter Profile Photo

Announcing Reasoning Datasets Competition 📢 in collaboration with Hugging Face and Together AI

Since the launch of DeepSeek-R1 this January, we’ve seen an explosion of reasoning-focused datasets: OpenThoughts-114k, OpenCodeReasoning, codeforces-cot, and more.
Bespoke Labs (@bespokelabsai)'s Twitter Profile Photo

OpenAI’s o4 just showed that multi-turn tool use is a huge deal for AI agents.
Today, we show how to do the same with your own agents, using RL and open-source models.

We used GRPO on only 100 high-quality questions from the BFCL benchmark, and post-trained a 7B Qwen model to…
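
For readers who want the mechanics: a minimal sketch of the group-relative advantage computation at the core of GRPO. The toy rewards and the `grpo_advantages` helper below are illustrative assumptions, not Bespoke Labs' actual training code.

```python
# A minimal sketch of GRPO's group-relative advantage computation.
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within a group of rollouts for one prompt.

    GRPO replaces a learned value baseline with the group mean: each
    rollout's advantage is its reward normalized against the other
    rollouts sampled for the same question.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy example: 4 rollouts for one BFCL-style question, rewarded 1.0 when
# the sampled tool-call trace passes the benchmark's checker, else 0.0.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # passing rollouts get positive advantage

# In the full algorithm these advantages weight a clipped PPO-style
# policy-gradient loss plus a KL penalty toward the reference model.
```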
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Check out Ramya et al.'s work on understanding discourse similarities in LLM-generated text! We see this as an important step in quantifying the "sameyness" of LLM text, which we think will be a step towards fixing it!

Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Check out Manya's work on evaluation for open-ended tasks! The criteria from EvalAgent can be plugged into LLM-as-a-judge or used for refinement. Great tool with a ton of potential, and there's LOTS to do here for making LLMs better at writing!
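
As a rough illustration of how extracted criteria can be plugged into LLM-as-a-judge: the sketch below assembles criteria into a judge prompt. The criteria strings and prompt template are assumptions for illustration, not EvalAgent's actual pipeline.

```python
# A hedged sketch of plugging rubric criteria into an LLM-as-a-judge prompt.
def build_judge_prompt(task: str, response: str, criteria: list[str]) -> str:
    rubric = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    return (
        f"Task: {task}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Score the response from 1-5 on each criterion:\n{rubric}\n\n"
        "Return one line per criterion: <number>. <score> - <justification>"
    )

criteria = [
    "Maintains a consistent point of view throughout",  # assumed example
    "Opens with a concrete, engaging hook",             # assumed example
]
print(build_judge_prompt("Write a personal essay.", "My essay...", criteria))

# The same criteria could instead drive refinement: ask the model to
# revise its draft until each criterion is satisfied.
```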

Anirudh Khatry (@anirudhkhatry)'s Twitter Profile Photo

🚀Introducing CRUST-Bench, a dataset for C-to-Rust transpilation for full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
🧵[1/6]
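
One plausible way a task like this gets scored: write the model's transpiled Rust behind the handwritten interface and run the provided tests. The sketch below assumes a standard Cargo crate layout and is not CRUST-Bench's actual harness.

```python
# A hedged sketch of validating one C-to-Rust transpilation attempt.
import subprocess
from pathlib import Path

def passes_rust_tests(repo_dir: Path, generated_rust: str) -> bool:
    # The handwritten interface (signatures, types) and the test cases
    # already live in the crate; only the implementation is generated.
    (repo_dir / "src" / "lib.rs").write_text(generated_rust)
    # `cargo test` compiles the crate and runs the test cases; a zero
    # exit code means the transpilation is functionally correct.
    result = subprocess.run(
        ["cargo", "test"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return result.returncode == 0
```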
Liyan Tang (@liyantang4)'s Twitter Profile Photo

Check out my work at Bespoke Labs!

We release Bespoke-MiniChart-7B, a new SOTA in chart understanding at its size.

Chart understanding is really fun and challenging, and requires reasoning skills beyond math reasoning.

It's a great starting point for open chart model development!

Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

New work led by Liyan Tang with a strong new model for chart understanding! Check out the blog post, model, and playground! Very fun to play around with Bespoke-MiniChart-7B and see what a 7B VLM can do!

Philippe Laban (@philippelaban)'s Twitter Profile Photo

🆕paper: LLMs Get Lost in Multi-Turn Conversation

In real life, people don’t speak in perfect prompts.
So we simulate multi-turn conversations — less lab-like, more like real use.

We find that LLMs get lost in conversation.
👀What does that mean? 🧵1/N
📄arxiv.org/abs/2505.06120
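
A hedged sketch of the simulation idea: shard a fully-specified instruction into pieces and reveal one shard per turn, then compare against the single-turn control. The shard contents and chat format are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of simulating a multi-turn conversation from one prompt.
FULL_PROMPT_SHARDS = [
    "Write a Python function that parses a date string.",
    "It should accept both YYYY-MM-DD and DD/MM/YYYY formats.",
    "Raise ValueError on anything else.",
]

def simulate_multi_turn(shards, generate):
    """Feed shards one turn at a time; `generate` is any chat-model call
    taking a message list and returning the assistant's reply string."""
    messages = []
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        messages.append({"role": "assistant", "content": generate(messages)})
    return messages

# Single-turn control: the same information in one fully-specified prompt.
single_turn = [{"role": "user", "content": " ".join(FULL_PROMPT_SHARDS)}]

# Comparing final answers across the two settings is how "getting lost"
# in conversation shows up as a measurable accuracy drop.
```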
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Check out ChartMuseum from Liyan Tang, Grace Kim, and many other collaborators from UT! Chart questions take us beyond current benchmarks for math/multi-hop QA/etc., which CoT is very good at, to *visual reasoning*, which is hard to express with text CoT!

Fangcong Yin (@fangcong_y10593)'s Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩 Modifying the CoT data format to be “composable” with other skills
🔥 Training models on each skill
📌 Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!
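
A hedged sketch of what a "composable" CoT data format could look like; the tag scheme and examples are assumptions for illustration, not the paper's actual format.

```python
# Each training example wraps its reasoning in an explicit skill tag, so
# segments produced by models trained on different skills can be chained.
def wrap_cot(skill: str, reasoning: str, answer: str) -> str:
    return f"<{skill}>\n{reasoning}\n</{skill}>\nAnswer: {answer}"

# Training example for an arithmetic skill:
ex_math = wrap_cot("arithmetic", "12 boxes * 8 items = 96 items", "96")
# Training example for a unit-conversion skill:
ex_unit = wrap_cot("unit_conversion", "96 items * 50g = 4800g = 4.8kg", "4.8 kg")

# On a composed task, the chained trace keeps each skill's segment intact,
# which is what makes the format "composable":
composed = ex_math.rsplit("Answer:", 1)[0] + ex_unit
print(composed)
```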
Xi Ye (@xiye_nlp)'s Twitter Profile Photo

🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval

Main contributions:
🔍 Better head detection: we find a…
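
A rough sketch of query-focused head detection: score each attention head by how much attention mass query tokens place on the gold passage, then keep the top heads. Shapes and the scoring rule below are assumptions, not the paper's exact method.

```python
# Toy retrieval-head scoring over a model's attention weights.
import numpy as np

def score_heads(attn, query_idx, gold_idx):
    """attn: [layers, heads, seq, seq] attention weights, rows sum to 1."""
    # Mean attention from query-token rows to gold-passage columns.
    sub = attn[:, :, query_idx][:, :, :, gold_idx]  # [L, H, |q|, |g|]
    return sub.mean(axis=(2, 3))                    # [L, H] per-head score

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(16), size=(4, 8, 16))  # toy 4-layer, 8-head model
scores = score_heads(attn, query_idx=[14, 15], gold_idx=[3, 4, 5])
top5 = np.argsort(scores, axis=None)[::-1][:5]
print([np.unravel_index(i, scores.shape) for i in top5])  # (layer, head) pairs
```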
Leo Liu (@zeyuliu10)'s Twitter Profile Photo

LLMs trained to memorize new facts can’t use those facts well.🤔

We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡

Our approach, PropMEND, extends MEND with a new objective for propagation.
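
A minimal PyTorch sketch of the MEND-style mechanism PropMEND builds on: a small hypernetwork maps the raw gradient from memorizing a fact to an edited gradient, which is applied as the weight update. The dimensions and architecture are illustrative assumptions; PropMEND's new propagation objective is only described in the comments.

```python
import torch
import torch.nn as nn

class GradEditor(nn.Module):
    """Hypernetwork mapping a raw gradient matrix to an edited one (row-wise)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        return self.net(grad)

dim = 64
weight = torch.randn(dim, dim, requires_grad=True)
editor = GradEditor(dim)

# Pretend this gradient came from the loss of memorizing a new fact:
raw_grad = torch.randn(dim, dim)
edited_grad = editor(raw_grad)

# Apply the edited gradient as the knowledge edit:
with torch.no_grad():
    weight -= 1e-2 * edited_grad

# Meta-training then optimizes the editor itself so the *edited* model also
# answers downstream questions implied by the injected fact (the propagation
# objective), rather than only reproducing the fact verbatim.
```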
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

📢I'm joining NYU (Courant CS + Center for Data Science) starting this fall!

I’m excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more!

I’m also looking to build connections in the NYC area more broadly. Please…