Elron Bandel (@elronbandel)'s Twitter Profile
Elron Bandel

@elronbandel

Research Scientist | @IBMResearch | Ex @biunlp 🦾 The Turing test is how future machines verify that we humans are not involved 🦾

ID: 1236427249429221376

Website: http://www.unitxt.ai
Joined: 07-03-2020 23:03:35

748 Tweets

304 Followers

397 Following

Elron Bandel (@elronbandel):

Did you know that O1/R1 do not have access to their own thoughts from earlier in the conversation?

Is it a well-known fact? For example, if you ask R1 to play a guessing game, it will pick a word, but in the following turns, it will have no idea what it picked. 😱

The same
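
Why this happens, as a minimal sketch: with OpenAI-compatible reasoning APIs the chain of thought comes back in a separate field and is not sent back on later turns, so only the visible reply re-enters the context. The endpoint, model name, and the reasoning_content field below are illustrative assumptions, not any specific provider's documented interface.

```python
# Minimal sketch (assumptions: an OpenAI-compatible client, an illustrative
# endpoint/model name, and a provider-specific "reasoning_content" field).
from openai import OpenAI

client = OpenAI(base_url="https://example-reasoning-api/v1", api_key="sk-...")

history = [{
    "role": "user",
    "content": "Let's play a guessing game: silently pick a word, then just say 'ready'.",
}]

first = client.chat.completions.create(model="reasoning-model", messages=history)
msg = first.choices[0].message

# The chain of thought (where the word was actually "picked") arrives in a
# separate field and is not meant to be echoed back in later requests.
print("hidden reasoning:", getattr(msg, "reasoning_content", None))

# Only the visible reply is appended to the conversation history...
history.append({"role": "assistant", "content": msg.content})
history.append({"role": "user", "content": "Does your word start with 'c'?"})

# ...so on this turn the model has no record of whatever word it picked earlier.
second = client.chat.completions.create(model="reasoning-model", messages=history)
print(second.choices[0].message.content)
```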
Sara Rosenthal (@seirasto):

🌟Want to know more about our MTRAG benchmark? Check out the IBM blog highlighting our work! research.ibm.com/blog/conversat… IBM Research

Leshem Choshen C U @ ICLR 🤖🤗 (@lchoshen):

General intelligence, but they can't read a simple table or figure out that changing the column does impact the content. Massive table benchmark:

Elron Bandel (@elronbandel):

The leaderboard for our new benchmark, in collaboration with Stanford’s HELM and powered by Unitxt! Take a look: top models really struggle! crfm.stanford.edu/helm/torr/late…
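
For context, a hedged sketch of the usual Unitxt flow (load a catalog recipe, generate predictions, score them). The card and template names below are generic catalog examples, not the actual recipe behind the HELM leaderboard linked above, and the layout of the result object varies across unitxt versions.

```python
# Hedged sketch of a typical unitxt recipe -> predictions -> evaluation loop.
# The card/template below are illustrative catalog entries, not the HELM setup.
from unitxt import load_dataset, evaluate

dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
)
test_set = dataset["test"]

# Plug any model in here; this placeholder just returns a fixed label.
predictions = ["entailment" for _ in test_set]

# Score predictions with the metrics declared by the card.
results = evaluate(predictions=predictions, data=test_set)
print(results)  # holds per-instance and aggregated (global) metric scores
```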

Gili Lior (@gililior):

"Summarize this text" out ❌ "Provide a 50-word summary, explaining it to a 5-year-old" in ✅ The way we use LLMs has changed—user instructions are now longer, more nuanced, and packed with constraints. Interested in how LLMs keep up? 🤔 Check out WildIFEval, our new benchmark!

Ariel Gera (@arielgera2):

How do LLMs cope with multi-constraint instructions from real users? Not too well, it turns out... So lots of room for improvement! 🦾 Great internship work by Gili Lior 🌟

Eliya Habba (@eliyahabba):

Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE: a massive (250M!) collection of LLM outputs on different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!

Eran Hirsch (@hirscheran):

🚨 Introducing LAQuer, accepted to #ACL2025 (main conf)!

LAQuer provides more granular attribution for LLM generations: users can highlight any output fact (top) and get attribution to the relevant input snippet (bottom). This reduces the amount of text the user has to read by 2
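
To make "highlight a fact, get the supporting snippet" concrete, here is a naive lexical-overlap stand-in; LAQuer's actual method is described in the paper, and the example sentences below are invented.

```python
# Naive lexical-overlap stand-in for fact-to-source attribution; NOT the LAQuer
# method, just a concrete illustration with invented example sentences.
def attribute_fact(fact: str, source_sentences: list[str]) -> str:
    fact_tokens = set(fact.lower().split())
    # Return the source sentence sharing the most tokens with the highlighted fact.
    return max(source_sentences,
               key=lambda s: len(fact_tokens & set(s.lower().split())))

source = [
    "The study enrolled 120 patients across three hospitals.",
    "Treatment reduced average recovery time from 14 to 9 days.",
    "No serious side effects were reported during the trial.",
]
print(attribute_fact("recovery time dropped to 9 days", source))
```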