Ariel Gera (@arielgera2) Twitter Tweets • TwiCopy

Sumit

a year ago

JuStRank: Benchmarking LLM Judges for System Ranking
IBM presents a large-scale benchmark to evaluate how well LLMs can rank other AI systems, showing that reward models often match larger LLMs at this task. 📝 arxiv.org/abs/2412.09569

Likes: 10 · Replies: 2 · Reposts: 3

fly51fly

@fly51fly

a year ago

[CL] JuStRank: Benchmarking LLM Judges for System Ranking A Gera, O Boni, Y Perlitz, R Bar-Haim... [IBM Research] (2024) arxiv.org/abs/2412.09569

Likes: 11 · Replies: 1 · Reposts: 5

Shir Ashury-Tahan

@shirashurytahan

9 months ago

LLMs struggle with tables—but how robust are they really? 🔍 ToRR goes beyond accuracy, testing real-world robustness across formats & tasks. 📊 Different formats, same data—models show brittle behavior affecting rankings. Prompt configuration is a key dimension for evaluation!🚀

Likes: 35 · Replies: 2 · Reposts: 12

Ariel Gera

@arielgera2

9 months ago

How do LLMs cope with multi-constraint instructions from real users? Not too well, it turns out... So lots of room for improvement! 🦾 Great internship work by Gili Lior 🌟

Likes: 3 · Replies: 0 · Reposts: 2

Ariel Gera

@arielgera2

6 months ago

Can LLMs judge debate speeches? 🤖 And how do the LLMaaJ judgments differ from human annotations? 👨‍⚖️ Great new work by Noy Sternlicht 🌟

Likes: 3 · Replies: 0 · Reposts: 0

Noy Sternlicht

@noysternlicht

3 months ago

🎉 Proud to share that "Debatable Intelligence" has now been accepted to #EMNLP2025 (Main Conference)! noy-sternlicht.github.io/Debatable-Inte… Huge thanks to my amazing collaborators Ariel Gera, Roy Bar Haim, Tom Hope, Noam Slonim 🟢

Likes: 48 · Replies: 2 · Reposts: 13

Ramon Astudillo

@ramonastudill12

2 months ago

The Generative Model Alignment team at IBM Research is looking for interns for next summer! Two candidates for two topics: 🍰 Reinforcement Learning environments for LLMs 🐎 Speculative and non-autoregressive generation for LLMs. Interested/curious? DM / email [email protected]

Likes: 6 · Replies: 0 · Reposts: 4

Ariel Gera

@arielgera2

a month ago

Why I really enjoyed this project: It combines a lot: multimodality + hybrid retrieval + test-time optimization 🤯 At the same time, it is actually quite simple 💡 and helps to achieve more (retrieval quality) with less (compute resources) 🦾 plus Omri Uzan is pretty great

Likes: 6 · Replies: 0 · Reposts: 1