Jonibek Mansurov (@m_jonibek)'s Twitter Profile
Jonibek Mansurov

@m_jonibek

ID: 1658226525831852035

Joined: 15-05-2023 21:43:41

2 Tweets

12 Followers

69 Following

Alham Fikri Aji (@alhamfikri):

Final work promotion in 2024, by Jonibek Mansurov!

We managed to achieve ~75% on the challenging GPQA benchmark with only 2 layers of transformers (~40M params) that were trained on different data; in our case, MedMCQA.

Introducing...
AK (@_akhaliq):

Crosslingual Reasoning through Test-Time Scaling

TL;DR: shows that scaling up the thinking tokens of English-centric reasoning language models, such as the s1 models, can improve multilingual math reasoning performance. Also analyzes the language-mixing patterns and the effects of different...
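The test-time scaling referred to here is the s1-style "budget forcing" recipe: if the model tries to stop reasoning early, a continuation cue is appended so it spends more thinking tokens before answering. Below is a minimal illustrative sketch of that idea, not the paper's code; the model name, prompt wording, and the "Wait" continuation cue are assumptions.

```python
# Minimal sketch of s1-style "budget forcing" (test-time scaling of thinking
# tokens). Not the paper's implementation; the model name, prompt wording,
# and the "Wait" continuation cue are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "simplescaling/s1-32B"  # assumed; any s1-style reasoning model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate_with_thinking_budget(question, min_thinking_tokens=2000, rounds=4):
    prompt = f"Question: {question}\nThink step by step.\n"
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    generated = 0
    for _ in range(rounds):
        out = model.generate(ids, max_new_tokens=1024, do_sample=False)
        generated += out.shape[1] - ids.shape[1]
        ids = out
        if generated >= min_thinking_tokens:
            break
        # The model ended its reasoning early: append a continuation cue so it
        # keeps thinking, i.e. scale up the number of thinking tokens.
        cue = tok("\nWait", return_tensors="pt", add_special_tokens=False)
        ids = torch.cat([ids, cue.input_ids.to(model.device)], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)
```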
Yong Zheng-Xin (Yong) (@yong_zhengxin):

📣 New paper!

We observe that reasoning language models finetuned only on English data are capable of zero-shot cross-lingual reasoning through a "quote-and-think" pattern.

However, this does not mean they reason the same way across all languages or in new domains. 

[1/N]
Cohere Labs (@cohere_labs):

Reasoning language models are primarily trained on English data, but do they generalize well to multilingual settings in various domains?

We show that test-time scaling can improve their zero-shot crosslingual reasoning performance! 🔥
Genta Winata (@gentaiscool):

⭐️ Reasoning LLMs trained on English data can think in other languages. Read our paper to learn more! Thank you Yong Zheng-Xin (Yong) for leading the project, and thanks to the team for an exciting collab: farid, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Julia Kreutzer

Alham Fikri Aji (@alhamfikri):

🚨 Multilingual LLMs, finetuned only on English reasoning data, can still reason when asked non-English questions, showing reasoning traces that go back & forth between languages.

I had so much fun working on this project.

Please give our paper a read!
arxiv.org/abs/2505.05408
farid (@faridlazuarda):

Can English-finetuned LLMs reason in other languages? Short answer: yes, thanks to "quote-and-think" + test-time scaling. You can even force them to reason in a target language! But: 🌐 low-resource languages & non-STEM topics are still tough. New paper: arxiv.org/abs/2505.05408
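One simple way to force the reasoning language, assuming the model continues in whatever language its trace starts in, is to prefill the reasoning with a short prefix in the target language. The sketch below illustrates this idea; it is not the authors' exact setup, and the model name, prompt wording, and prefix strings are made up for illustration.

```python
# Minimal sketch of "language forcing": seed the reasoning trace with a prefix
# in the target language so the model continues thinking in that language.
# Not the authors' exact setup; the model name, prompt wording, and prefix
# strings are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "simplescaling/s1-32B"  # assumed s1-style reasoning model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Hypothetical target-language "thinking prefixes".
PREFIXES = {
    "id": "Baiklah, mari kita pikirkan ini langkah demi langkah.",  # Indonesian
    "zh": "好的，让我们一步一步地思考这个问题。",  # Chinese
}

def answer_with_forced_language(question, lang="id", max_new_tokens=2048):
    # The reasoning is seeded in `lang`, so the model tends to continue in it.
    prompt = f"Question: {question}\n{PREFIXES[lang]}\n"
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

As the later tweets in this thread note, this kind of forcing tends to improve language compliance while costing some accuracy.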

Yong Zheng-Xin (Yong) (@yong_zhengxin):

These are incredible findings: a reproducibility crisis where baselines are not faithfully reproduced or reported (e.g., a footnote indicating a performance difference)

🍎 In our work (arxiv.org/abs/2505.05408) we tried so hard to ensure apples-to-apples comparisons.
Yong Zheng-Xin (Yong) (@yong_zhengxin):

Amidst the evaluation/reproducibility crisis for reasoning LLMs, it's great to see *concurrent independent work (with different models & benchmarks) align with our findings*!

We reported the same fundamental trade-off: language forcing improves ✅ compliance but hurts ❌ accuracy!