Nandan Thakur (@beirmug) 's Twitter Profile
Nandan Thakur

@beirmug

PhD @uwaterloo🌲 Efficient IR, likes good evals 🔍 Ex Intern @DbrxMosaicAI @GoogleAI | RA @UKPLab | Author of beir.ai, miracl.ai, TREC-RAG! 🍻

ID: 751326416495517697

Link: https://thakur-nandan.github.io · Joined: 08-07-2016 08:05:51

1.1K Tweets

2.2K Followers

2.2K Following

tomaarsen (@tomaarsen) 's Twitter Profile Photo

Relabeling datasets for Information Retrieval improves NDCG@10 of both embedding models & cross-encoder rerankers. This was already the prevalent belief, but now it's been confirmed. 

Great job Nandan Thakur, Crystina Zhang, Xueguang Ma & Jimmy Lin
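
For context, NDCG@10 is the metric both model families are evaluated on here. A minimal sketch of how it is computed (toy labels and one common linear-gain DCG formulation; nothing below is from the paper):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking, normalized by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: relevance labels of the top-5 documents a system returned.
# Moving a relevant document from rank 3 up to rank 2 improves the score.
print(round(ndcg_at_k([1, 0, 1, 0, 0]), 3))  # 0.920
print(round(ndcg_at_k([1, 1, 0, 0, 0]), 3))  # 1.000
```
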
Dongfu Jiang (@dongfujiang) 's Twitter Profile Photo

Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl.

Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and …

Xinyu Crystina Zhang | on job market (@crystina_z) 's Twitter Profile Photo

🚀 Current Deep Research Agent benchmarks mix LLM skills + web search quality, but that hides two key questions:
1️⃣ Which LLM agent works the best?
2️⃣ How much do better retrievers boost results?
BrowseComp-Plus provides a solution! By using a fixed corpus with rich positives + …

Omar Khattab (@lateinteraction) 's Twitter Profile Photo

Looks like the kind of benchmark that's so good, it feels like a sacrifice for the sake of the community to repost it instead of working on it in silence ;D But we need much more of this, so here goes: check out this release from Zijian Chen, Xueguang Ma, Shengyao Zhuang et al.!

Xinyu Ma (@xinyuma8) 's Twitter Profile Photo

Excited to share ReasonRank—a listwise reranker that reasons before it ranks.
On complex queries, ReasonRank-32B is #1 on the BRIGHT reasoning-intensive leaderboard (Aug 9, 2025; +.6 vs. strong baselines).
Working with Wenhan, Weiwei Sun, etc.
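
For readers outside IR: a listwise reranker takes the query plus a whole candidate list and outputs a permutation, rather than scoring passages one at a time. A rough sketch of the reason-then-rank pattern (the prompt wording and parsing below are illustrative assumptions, not ReasonRank's actual implementation):

```python
import re

def build_listwise_prompt(query: str, passages: list[str]) -> str:
    """Hypothetical prompt asking the model to reason first, then rank."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Query: {query}\n\nPassages:\n{numbered}\n\n"
        "Reason step by step about which passages answer the query, then "
        "output the identifiers in descending relevance, e.g. [2] > [1] > [3]."
    )

def parse_ranking(llm_output: str, num_passages: int) -> list[int]:
    """Read the '[i] > [j] > ...' permutation off the model's final line."""
    last_line = llm_output.strip().splitlines()[-1]
    ids = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", last_line)]
    ranked = list(dict.fromkeys(i for i in ids if 0 <= i < num_passages))
    # Append any passages the model omitted, preserving their original order.
    return ranked + [i for i in range(num_passages) if i not in ranked]
```
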
Manuel Faysse (@manuelfaysse) 's Twitter Profile Photo

What BrowseComp+ confirms:
(1) Better retrievers remain crucial in agentic search settings, and just "grepping" is mostly a suboptimal strategy. Semantic embeddings get to the answer faster and more often than BM25.

(2) Having said that, agentic search is clearly the …

Aashka Trivedi (@aashkaa_) 's Twitter Profile Photo

Granite Embedding R2 Models are here!

🔥 8k context
🏆 Top performance on BEIR, MTEB, COIR, MLDR, MT-RAG, Table IR, LongEmbed
⚡ Fast and lightweight
🎯 Apache 2.0 license (trained on commercially friendly data)

Try them now on Hugging Face 👉 hf.co/ibm-granite
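
If you use sentence-transformers, loading an R2 checkpoint should look roughly like this (the exact model id is an assumption; check hf.co/ibm-granite for the released names):

```python
from sentence_transformers import SentenceTransformer

# Model id is an assumption; see hf.co/ibm-granite for the actual R2 checkpoints.
model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

queries = ["what license are the Granite R2 embedding models under?"]
docs = ["Granite Embedding R2 models are released under the Apache 2.0 license."]

q_emb = model.encode(queries)
d_emb = model.encode(docs)
print(model.similarity(q_emb, d_emb))  # cosine similarity by default
```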

Andrew Drozdov (@mrdrozdov) 's Twitter Profile Photo

We built a thing! The Databricks Reranker is now in Public Preview. It's as easy as changing the arguments to your vector search call, and doesn't require any additional setup. Read more: databricks.com/blog/reranking…
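
Going by that description, the reranker slots into the existing similarity_search call. A rough sketch with the databricks-vectorsearch client (endpoint and index names are placeholders; the exact reranking argument is whatever the linked blog post specifies, not shown here):

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Placeholder endpoint/index names; point these at your own vector search index.
index = client.get_index(
    endpoint_name="my-vs-endpoint",
    index_name="main.default.docs_index",
)

# Per the tweet, enabling reranking only means passing extra arguments to this
# same call (see databricks.com/blog/reranking... for the actual parameter).
results = index.similarity_search(
    query_text="how do I enable reranking?",
    columns=["id", "text"],
    num_results=10,
)
```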

Xueguang Ma (@xueguang_ma) 's Twitter Profile Photo

Some recent updates on the Hugging Face leaderboard of BrowseComp-Plus: huggingface.co/spaces/Tevatro…

1. We added evaluation of DeepSeek-R1-0528 and Kimi-K2-0711
2. We added Qwen3-32B as an LLM answer judge for accuracy measurement, in addition to GPT-4.1, so that the evaluation is fully based …

Nandan Thakur (@beirmug) 's Twitter Profile Photo

Happy to see that our RLHN paper on cleaning IR training data (e.g. MSMARCO/HotPotQA) by relabeling hard negatives using LLMs has been accepted at #EMNLP2025 findings! 🎉🍾

Huge thanks to everyone involved: Xinyu Crystina Zhang, Xueguang Ma, and Jimmy Lin!

📜 arxiv.org/abs/2505.16967
Benjamin Clavié (@bclavie) 's Twitter Profile Photo

really can't overstate how much of a gamechanger BrowseComp-Plus is for RAG benchmarks.  

actual real-world useful tasks AND the metrics are extremely easy to convey to people who don't have a PhD in IR evals. yes thank you
Bo (@bo_wangbo) 's Twitter Profile Photo

had an interesting discussion with Xueguang Ma on BrowseComp-Plus, something i noticed before: scaling up text embeddings beyond 2-4B brings marginal gains (on MTEB), but in deep research that no longer holds: larger models seem to capture sophisticated query patterns way better.

Sumit (@_reachsumit) 's Twitter Profile Photo

Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

Databricks demonstrates that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs.

📝 arxiv.org/abs/2508.17400

Fan Nie (@fannie1208) 's Twitter Profile Photo

Can frontier LLMs solve unsolved questions? [1/n]

Benchmarks are saturating. It’s time to move beyond.
Our latest work #UQ shifts evaluation to real-world unsolved questions: naturally difficult, realistic, and with no known solutions.

All questions, candidate answers, …

Omar Sanseviero (@osanseviero) 's Twitter Profile Photo

Introducing EmbeddingGemma🎉

🔥With only 308M params, this is the top open model under 500M
🌏Trained on 100+ languages
🪆Flexible embeddings (768 to 128 dims) with Matryoshka
🤗Works with your favorite open tools
🤏Runs with as little as 200MB

developers.googleblog.com/en/introducing…
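
The Matryoshka line means the 768-dim vectors can be truncated to 512, 256, or 128 dims and still retrieve well. With sentence-transformers that is roughly (model id assumed from the announcement):

```python
from sentence_transformers import SentenceTransformer

# Model id assumed from the announcement; Matryoshka training means the
# leading dimensions of the 768-dim embedding carry most of the signal.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=128)

emb = model.encode(["matryoshka embeddings shrink gracefully"])
print(emb.shape)  # (1, 128), truncated from the full 768 dims
```
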
Sundar Pichai (@sundarpichai) 's Twitter Profile Photo

Introducing EmbeddingGemma, our newest open model that can run completely on-device. It's the top model under 500M parameters on the MTEB benchmark and comparable to models nearly 2x its size – enabling state-of-the-art embeddings for search, retrieval + more.