Gabriel Stanovsky (@gabistanovsky) Twitter Tweets • TwiCopy

Eliya Habba

6 months ago

🕊️ DOVE is a living benchmark! Just pushed major updates: 📊 Dataset expansion: Added ~5700 MMLU examples with Llama-70B, each tested across 100 different prompt variations = 570K new predictions! 📈 Website upgrades: New interactive plots throughout: slab-nlp.github.io/DOVE/

10 likes · 1 reply · 5 reposts

Itay Itzhak

@itay_itzhak_

5 months ago

🚨New paper alert🚨 🧠 Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing? Excited to share our new paper, accepted to CoLM 2025🎉! See thread below 👇 #BiasInAI #LLMs #MachineLearning #NLProc

74 likes · 3 replies · 24 reposts

Daria Lioubashevski

@darialioub

5 months ago

Ever wondered how Transformers refine their top-k predictions over their layers? 📊 Is there an order to the madness? Come find out at my poster presentation tomorrow at ICML Conference 📍East Exhibition Hall E-2512, 11:00-13:30

16 likes · 0 replies · 2 reposts

Itay Itzhak

@itay_itzhak_

4 months ago

In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage! Now hungry for discussing: – LLMs behavior – Interpretability – Biases & Hallucinations – Why eval is so hard (but so fun) Come say hi if that’s your vibe too!

23 likes · 0 replies · 4 reposts

Eliya Habba

@eliyahabba

4 months ago

Presenting my poster: 🕊️ DOVE - A large-scale multi-dimensional predictions dataset towards meaningful LLM evaluation, Monday 18:00 Vienna, #ACL2025. Come chat about LLM evaluation, prompt sensitivity, and our 250M COLLECTION OF MODEL OUTPUTS!

46 likes · 2 replies · 11 reposts

HUJI NLP

@nlphuji

4 months ago

In Vienna for #ACL2025NLP🇦🇹 ? Check out the work from @hujinlp and our collaborators presenting throughout the week. Looking forward to the discussions!

17 likes · 0 replies · 2 reposts

Itay Itzhak

@itay_itzhak_

4 months ago

At #ACL2025 and not sure what to do next? GEM 💎² is the place to be for awesome talks on the future of LLM evaluation. Come hear Gabriel Stanovsky, Eliya Habba, Leshem (Legend) Choshen 🤖🤗 and others rethink what it means to actually evaluate LLMs beyond accuracy and vibes. Thursday @ Hall C!

23 likes · 0 replies · 4 reposts

Eliya Habba

@eliyahabba

4 months ago

Come to GEM^2 tomorrow! #ACL2025 gem-benchmark.com/workshop

9 likes · 0 replies · 1 repost

Sebastian Gehrmann

@sebgehr

4 months ago

This year's GEM workshop is happening *today* starting at 9am in Vienna at #acl2025 in Hall C. I am looking forward to a day of evaluations.

14 likes · 0 replies · 2 reposts

Enrico Santus

@enricosantus

4 months ago

I swear I warned all the romantics in the room — especially after the #Coldplay scandal! 😄🎶 If you were there (or wish you had been), tag yourself and your friends in the comments 👇 Bye bye from the #Gem organizers and speakers! #ACL2025 #ACL2025NLP #GEM2 #LLMs #NLP #Vienna

16 likes · 4 replies · 3 reposts

Sebastian Gehrmann

@sebgehr

4 months ago

And this wraps GEM! Thanks all for attending

15 likes · 0 replies · 1 repost

Verena Rieser

@verena_rieser

4 months ago

How can we evaluate the real world impact of generative AI? Great panel at the GEM2 workshop #ACL2025NLP 🇦🇹

32 likes · 1 reply · 4 reposts

HUJI NLP

@nlphuji

4 months ago

That’s a wrap on #ACL2025 in Vienna! Great to be there with our team.

36 likes · 0 replies · 1 repost

Adi Simhi

@adisimhi

3 months ago

Very pleased that "Trust me I'm Wrong" was accepted to EMNLP 2025 findings! Trust me I'm Wrong shows that LLMs can hallucinate with high certainty even when they know the correct answer! Check our latest work with Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov.

114 likes · 5 replies · 13 reposts

Uri Berger

@uriberger88

3 months ago

Happy to share that our Image Captioning evaluation survey was accepted to TACL! I will be presenting the paper at EMNLP 2025

13 likes · 0 replies · 4 reposts

Noam Dahan

@dahan_noam

3 months ago

Old news: Single-prompt eval is unreliable🤯 New news: PromptSuite🌈 - an easy way to augment your benchmark with thousands of paraphrases ➡️ robust eval, zero sweat! - Works on any dataset! - Python API + web UI Eliya Habba, Gili Lior, Gabriel Stanovsky eliyahabba.github.io/PromptSuite/

58 likes · 2 replies · 14 reposts

Jungsoo Park

@jungsoo___park

2 months ago

What if LLMs can forecast their own scores on unseen benchmarks from just a task description? We are the first to study text description→performance prediction, giving practitioners an early read on outcomes so they can plan what to build—before paying full price 💸

26 likes · 3 replies · 7 reposts

Itay Itzhak

@itay_itzhak_

2 months ago

🚨Spotlight update🚨 Our paper on bias origins in LLMs is a *spotlight* paper with oral presentation at CoLM 2025!✨ Honored to be among just 24 selected and super excited to present and discuss biases and finetuning limits. Who’s joining in Montreal Tuesday morning? 👀

33 likes · 3 replies · 7 reposts

Itay Itzhak

@itay_itzhak_

2 months ago

Had a blast at CoLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!

54 likes · 0 replies · 5 reposts

Eliya Habba

@eliyahabba

2 months ago

Our 🌈 PromptSuite paper has been accepted to #EMNLP2025 🇨🇳 (System Demonstrations)! 🎉 🌈 PromptSuite is a flexible framework for generating thousands of prompt variations per instance - enabling robust, task-agnostic evaluation of LLMs. Noam Dahan, Gili Lior, Gabriel Stanovsky

31 likes · 1 reply · 12 reposts