Gabriel Stanovsky (@gabistanovsky)'s Twitter Profile
Gabriel Stanovsky

@gabistanovsky

Assistant Professor at @CseHuji

ID: 792053594

Website: https://gabrielstanovsky.github.io

Joined: 30-08-2012 17:36:20

242 Tweets

752 Followers

262 Following

Eliya Habba (@eliyahabba)

🕊️ DOVE is a living benchmark! Just pushed major updates: 📊 Dataset expansion: Added ~5700 MMLU examples with Llama-70B, each tested across 100 different prompt variations = 570K new predictions! 📈 Website upgrades: New interactive plots throughout: slab-nlp.github.io/DOVE/

Itay Itzhak (@itay_itzhak_)

🚨New paper alert🚨 🧠 Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing? Excited to share our new paper, accepted to CoLM 2025🎉! See thread below 👇 #BiasInAI #LLMs #MachineLearning #NLProc

Daria Lioubashevski (@darialioub)

Ever wondered how Transformers refine their top-k predictions over their layers? 📊 Is there an order to the madness? Come find out at my poster presentation tomorrow at ICML Conference 📍East Exhibition Hall E-2512, 11:00-13:30

Itay Itzhak (@itay_itzhak_)

In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage! Now hungry to discuss: – LLM behavior – Interpretability – Biases & Hallucinations – Why eval is so hard (but so fun) Come say hi if that’s your vibe too!

Eliya Habba (@eliyahabba)

Presenting my poster: 🕊️ DOVE - A large-scale multi-dimensional predictions dataset towards meaningful LLM evaluation, Monday 18:00 Vienna, #ACL2025 Come chat about LLM evaluation, prompt sensitivity, and our 250M COLLECTION OF MODEL OUTPUTS!

HUJI NLP (@nlphuji)

In Vienna for #ACL2025NLP🇦🇹 ? Check out the work from @hujinlp and our collaborators presenting throughout the week. Looking forward to the discussions!

Itay Itzhak (@itay_itzhak_)

At #ACL2025 and not sure what to do next? GEM 💎² is the place to be for awesome talks on the future of LLM evaluation. Come hear Gabriel Stanovsky, Eliya Habba, Leshem (Legend) Choshen 🤖🤗 and others rethink what it means to actually evaluate LLMs beyond accuracy and vibes. Thursday @ Hall C!

Sebastian Gehrmann (@sebgehr)

This year's GEM workshop is happening *today* starting at 9am in Vienna at #acl2025 in Hall C. I am looking forward to a day of evaluations.

Enrico Santus (@enricosantus)

I swear I warned all the romantics in the room — especially after the #Coldplay scandal! 😄🎶 If you were there (or wish you had been), tag yourself and your friends in the comments 👇 Bye bye from the #Gem organizers and speakers! #ACL2025 #ACL2025NLP #GEM2 #LLMs #NLP #Vienna

Adi Simhi (@adisimhi)

Very pleased that "Trust me I'm Wrong" was accepted to EMNLP 2025 Findings! "Trust me I'm Wrong" shows that LLMs can hallucinate with high certainty even when they know the correct answer! Check our latest work with Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov.

Noam Dahan (@dahan_noam)

Old news: Single-prompt eval is unreliable🤯 New news: PromptSuite🌈 - an easy way to augment your benchmark with thousands of paraphrases ➡️ robust eval, zero sweat! - Works on any dataset! - Python API + web UI Eliya Habba, Gili Lior, Gabriel Stanovsky eliyahabba.github.io/PromptSuite/

Jungsoo Park (@jungsoo___park)

What if LLMs can forecast their own scores on unseen benchmarks from just a task description? We are the first to study text description→performance prediction, giving practitioners an early read on outcomes so they can plan what to build—before paying full price 💸

Itay Itzhak (@itay_itzhak_)

🚨Spotlight update🚨 Our paper on bias origins in LLMs is a *spotlight* paper with oral presentation at CoLM 2025!✨ Honored to be among just 24 selected and super excited to present and discuss biases and finetuning limits. Who’s joining in Montreal Tuesday morning? 👀

Itay Itzhak (@itay_itzhak_)

Had a blast at CoLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!

Eliya Habba (@eliyahabba)

Our 🌈 PromptSuite paper has been accepted to #EMNLP2025 🇨🇳 (System Demonstrations)! 🎉 🌈 PromptSuite is a flexible framework for generating thousands of prompt variations per instance - enabling robust, task-agnostic evaluation of LLMs. Noam Dahan, Gili Lior, Gabriel Stanovsky
