Nandan Thakur (@beirmug) 's Twitter Profile
Nandan Thakur

@beirmug

PhD @uwaterloo🌲 Efficient IR, likes good evals 🔍 Ex Intern @DbrxMosaicAI @GoogleAI | RA @UKPLab | Author of beir.ai, miracl.ai, TREC-RAG! 🍻

ID: 751326416495517697

Link: https://thakur-nandan.github.io · Joined: 08-07-2016 08:05:51

1.1K Tweets

2.2K Followers

2.2K Following

tomaarsen (@tomaarsen) 's Twitter Profile Photo

Relabeling datasets for Information Retrieval improves NDCG@10 of both embedding models & cross-encoder rerankers. This was already the prevalent belief, but now it's been confirmed. 

Great job Nandan Thakur, Crystina Zhang, Xueguang Ma & Jimmy Lin
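
For context, NDCG@10 is the metric both model families are evaluated on here. A minimal sketch of how it is computed (toy labels and one common linear-gain DCG formulation; nothing below is from the paper):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking, normalized by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: relevance labels of the top-5 documents a system returned.
# Moving a relevant document from rank 3 up to rank 2 improves the score.
print(round(ndcg_at_k([1, 0, 1, 0, 0]), 3))  # 0.920
print(round(ndcg_at_k([1, 1, 0, 0, 0]), 3))  # 1.000
```
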
Dongfu Jiang (@dongfujiang) 's Twitter Profile Photo

Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl.

Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and …

Xinyu Crystina Zhang | on job market (@crystina_z) 's Twitter Profile Photo

🚀 Current Deep Research Agent benchmarks mix LLM skills + web search quality, but that hides two key questions:
1️⃣ Which LLM agent works the best?
2️⃣ How much do better retrievers boost results?
BrowseComp-Plus provides a solution! By using a fixed corpus with rich positives + …

Omar Khattab (@lateinteraction) 's Twitter Profile Photo

Looks like the kind of benchmark that's so good, it feels like a sacrifice for the sake of the community to repost it instead of working on it in silence ;D But we need much more of this, so here goes: check out this release from Zijian Chen, Xueguang Ma, Shengyao Zhuang et al.!

Xinyu Ma (@xinyuma8) 's Twitter Profile Photo

Excited to share ReasonRank—a listwise reranker that reasons before it ranks.
On complex queries, ReasonRank-32B is #1 on the BRIGHT reasoning-intensive leaderboard (Aug 9, 2025; +.6 vs. strong baselines).
Working with Wenhan, Weiwei Sun, etc.
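
For readers outside IR: a listwise reranker takes the query plus a whole candidate list and outputs a permutation, rather than scoring passages one at a time. A rough sketch of the reason-then-rank pattern (the prompt wording and parsing below are illustrative assumptions, not ReasonRank's actual implementation):

```python
import re

def build_listwise_prompt(query: str, passages: list[str]) -> str:
    """Hypothetical prompt asking the model to reason first, then rank."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Query: {query}\n\nPassages:\n{numbered}\n\n"
        "Reason step by step about which passages answer the query, then "
        "output the identifiers in descending relevance, e.g. [2] > [1] > [3]."
    )

def parse_ranking(llm_output: str, num_passages: int) -> list[int]:
    """Read the '[i] > [j] > ...' permutation off the model's final line."""
    last_line = llm_output.strip().splitlines()[-1]
    ids = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", last_line)]
    ranked = list(dict.fromkeys(i for i in ids if 0 <= i < num_passages))
    # Append any passages the model omitted, preserving their original order.
    return ranked + [i for i in range(num_passages) if i not in ranked]
```
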
Manuel Faysse (@manuelfaysse) 's Twitter Profile Photo

What BrowseComp+ confirms:
(1) Better retrievers remain crucial in agentic search settings, and just "grepping" is mostly a suboptimal strategy. Semantic embeddings get to the answer faster and more often than BM25.

(2) Having said that, agentic search is clearly the …

Aashka Trivedi (@aashkaa_) 's Twitter Profile Photo

Granite Embedding R2 Models are here!

🔥 8k context
🏆 Top performance on BEIR, MTEB, COIR, MLDR, MT-RAG, Table IR, LongEmbed
⚡ Fast and lightweight
🎯 Apache 2.0 license (trained on commercially friendly data)

Try them now on Hugging Face 👉 hf.co/ibm-granite
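
If you use sentence-transformers, loading an R2 checkpoint should look roughly like this (the exact model id is an assumption; check hf.co/ibm-granite for the released names):

```python
from sentence_transformers import SentenceTransformer

# Model id is an assumption; see hf.co/ibm-granite for the actual R2 checkpoints.
model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

queries = ["what license are the Granite R2 embedding models under?"]
docs = ["Granite Embedding R2 models are released under the Apache 2.0 license."]

q_emb = model.encode(queries)
d_emb = model.encode(docs)
print(model.similarity(q_emb, d_emb))  # cosine similarity by default
```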

Andrew Drozdov (@mrdrozdov) 's Twitter Profile Photo

We built a thing! The Databricks Reranker is now in Public Preview. It's as easy as changing the arguments to your vector search call, and doesn't require any additional setup. Read more: databricks.com/blog/reranking…
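
Going by that description, the reranker slots into the existing similarity_search call. A rough sketch with the databricks-vectorsearch client (endpoint and index names are placeholders; the exact reranking argument is whatever the linked blog post specifies, not shown here):

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Placeholder endpoint/index names; point these at your own vector search index.
index = client.get_index(
    endpoint_name="my-vs-endpoint",
    index_name="main.default.docs_index",
)

# Per the tweet, enabling reranking only means passing extra arguments to this
# same call (see databricks.com/blog/reranking... for the actual parameter).
results = index.similarity_search(
    query_text="how do I enable reranking?",
    columns=["id", "text"],
    num_results=10,
)
```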

Xueguang Ma (@xueguang_ma) 's Twitter Profile Photo

Some recent updates on the Hugging Face leaderboard of BrowseComp-Plus: huggingface.co/spaces/Tevatro…

1. We added evaluation of DeepSeek-R1-0528 and Kimi-K2-0711
2. We added Qwen3-32B as an LLM answer judge for accuracy measurement, in addition to GPT-4.1, so that the evaluation is fully based …

Nandan Thakur (@beirmug) 's Twitter Profile Photo

Happy to see that our RLHN paper on cleaning IR training data (e.g. MSMARCO/HotPotQA) by relabeling hard negatives using LLMs has been accepted at #EMNLP2025 findings! 🎉🍾

Huge thanks to everyone involved: Xinyu Crystina Zhang, Xueguang Ma, and Jimmy Lin!

📜 arxiv.org/abs/2505.16967
Benjamin Clavié (@bclavie) 's Twitter Profile Photo

really can't overstate how much of a gamechanger BrowseComp-Plus is for RAG benchmarks.  

actual real-world useful tasks AND the metrics are extremely easy to convey to people who don't have a PhD in IR evals. yes thank you
Bo (@bo_wangbo) 's Twitter Profile Photo

had an interesting discussion with Xueguang Ma on BrowseComp-Plus, something i noticed before: scaling up text embeddings beyond 2-4B brings marginal gains (on MTEB), but in deep research that no longer holds: larger models seem to capture sophisticated query patterns way better.

Sumit (@_reachsumit) 's Twitter Profile Photo

Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

Databricks demonstrates that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs.

📝 arxiv.org/abs/2508.17400

Fan Nie (@fannie1208) 's Twitter Profile Photo

Can frontier LLMs solve unsolved questions? [1/n]

Benchmarks are saturating. It’s time to move beyond.
Our latest work #UQ shifts evaluation to real-world unsolved questions: naturally difficult, realistic, and with no known solutions.

All questions, candidate answers, …

Omar Sanseviero (@osanseviero) 's Twitter Profile Photo

Introducing EmbeddingGemma🎉

🔥With only 308M params, this is the top open model under 500M
🌏Trained on 100+ languages
🪆Flexible embeddings (768 to 128 dims) with Matryoshka
🤗Works with your favorite open tools
🤏Runs with as little as 200MB

developers.googleblog.com/en/introducing…
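
The Matryoshka line means the 768-dim vectors can be truncated to 512, 256, or 128 dims and still retrieve well. With sentence-transformers that is roughly (model id assumed from the announcement):

```python
from sentence_transformers import SentenceTransformer

# Model id assumed from the announcement; Matryoshka training means the
# leading dimensions of the 768-dim embedding carry most of the signal.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=128)

emb = model.encode(["matryoshka embeddings shrink gracefully"])
print(emb.shape)  # (1, 128), truncated from the full 768 dims
```
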
Sundar Pichai (@sundarpichai) 's Twitter Profile Photo

Introducing EmbeddingGemma, our newest open model that can run completely on-device. It's the top model under 500M parameters on the MTEB benchmark and comparable to models nearly 2x its size – enabling state-of-the-art embeddings for search, retrieval + more.