Elron Bandel (@elronbandel)'s Twitter Profile
Elron Bandel

@elronbandel

Research Scientist | @IBMResearch | Ex @biunlp 🦾 The Turing test is how future machines verify that we humans are not involved 🦾

ID: 1236427249429221376

Website: http://www.unitxt.ai
Joined: 07-03-2020 23:03:35

748 Tweets

304 Followers

397 Following

Elron Bandel (@elronbandel):

Did you know that O1/R1 do not have access to their own thoughts from earlier in the conversation?

Is it a well-known fact? For example, if you ask R1 to play a guessing game, it will pick a word, but in the following turns, it will have no idea what it picked. 😱

The same
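
Why this happens, as a minimal sketch: with OpenAI-compatible reasoning APIs the chain of thought comes back in a separate field and is not sent back on later turns, so only the visible reply re-enters the context. The endpoint, model name, and the reasoning_content field below are illustrative assumptions, not any specific provider's documented interface.

```python
# Minimal sketch (assumptions: an OpenAI-compatible client, an illustrative
# endpoint/model name, and a provider-specific "reasoning_content" field).
from openai import OpenAI

client = OpenAI(base_url="https://example-reasoning-api/v1", api_key="sk-...")

history = [{
    "role": "user",
    "content": "Let's play a guessing game: silently pick a word, then just say 'ready'.",
}]

first = client.chat.completions.create(model="reasoning-model", messages=history)
msg = first.choices[0].message

# The chain of thought (where the word was actually "picked") arrives in a
# separate field and is not meant to be echoed back in later requests.
print("hidden reasoning:", getattr(msg, "reasoning_content", None))

# Only the visible reply is appended to the conversation history...
history.append({"role": "assistant", "content": msg.content})
history.append({"role": "user", "content": "Does your word start with 'c'?"})

# ...so on this turn the model has no record of whatever word it picked earlier.
second = client.chat.completions.create(model="reasoning-model", messages=history)
print(second.choices[0].message.content)
```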
Sara Rosenthal (@seirasto):

🌟Want to know more about our MTRAG benchmark? Check out the IBM blog highlighting our work! research.ibm.com/blog/conversat… IBM Research

Leshem Choshen C U @ ICLR 🤖🤗 (@lchoshen):

General intelligence, but they can't read a simple table or figure out that changing the column does impact the content. Massive table benchmark:

Elron Bandel (@elronbandel):

The leaderboard for our new benchmark, in collaboration with Stanford’s HELM and powered by Unitxt! Take a look: top models really struggle! crfm.stanford.edu/helm/torr/late…
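
For context, a hedged sketch of the usual Unitxt flow (load a catalog recipe, generate predictions, score them). The card and template names below are generic catalog examples, not the actual recipe behind the HELM leaderboard linked above, and the layout of the result object varies across unitxt versions.

```python
# Hedged sketch of a typical unitxt recipe -> predictions -> evaluation loop.
# The card/template below are illustrative catalog entries, not the HELM setup.
from unitxt import load_dataset, evaluate

dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
)
test_set = dataset["test"]

# Plug any model in here; this placeholder just returns a fixed label.
predictions = ["entailment" for _ in test_set]

# Score predictions with the metrics declared by the card.
results = evaluate(predictions=predictions, data=test_set)
print(results)  # holds per-instance and aggregated (global) metric scores
```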

Gili Lior (@gililior):

"Summarize this text" out ❌ "Provide a 50-word summary, explaining it to a 5-year-old" in ✅ The way we use LLMs has changed—user instructions are now longer, more nuanced, and packed with constraints. Interested in how LLMs keep up? 🤔 Check out WildIFEval, our new benchmark!

Ariel Gera (@arielgera2):

How do LLMs cope with multi-constraint instructions from real users? Not too well, it turns out... So lots of room for improvement! 🦾 Great internship work by Gili Lior 🌟

Eliya Habba (@eliyahabba):

Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE: a massive (250M!) collection of LLM outputs on different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!

Eran Hirsch (@hirscheran):

🚨 Introducing LAQuer, accepted to #ACL2025 (main conf)!

LAQuer provides more granular attribution for LLM generations: users can highlight any output fact (top) and get attribution to the relevant input snippet (bottom). This reduces the amount of text the user has to read by 2
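
To make "highlight a fact, get the supporting snippet" concrete, here is a naive lexical-overlap stand-in; LAQuer's actual method is described in the paper, and the example sentences below are invented.

```python
# Naive lexical-overlap stand-in for fact-to-source attribution; NOT the LAQuer
# method, just a concrete illustration with invented example sentences.
def attribute_fact(fact: str, source_sentences: list[str]) -> str:
    fact_tokens = set(fact.lower().split())
    # Return the source sentence sharing the most tokens with the highlighted fact.
    return max(source_sentences,
               key=lambda s: len(fact_tokens & set(s.lower().split())))

source = [
    "The study enrolled 120 patients across three hospitals.",
    "Treatment reduced average recovery time from 14 to 9 days.",
    "No serious side effects were reported during the trial.",
]
print(attribute_fact("recovery time dropped to 9 days", source))
```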