Jessy Li (@jessyjli)'s Twitter Profile
Jessy Li

@jessyjli

Associate Professor @UT_Linguistics, computational linguistics and #NLProc

ID: 202355029

Website: https://jessyli.com
Joined: 13-10-2010 21:31:36

923 Tweets

3.3K Followers

968 Following

Hongli Zhan (@honglizhan)

I’m excited to share that our paper has been accepted at ICML 2025! 🎉🥳🎊 This work was done during my internship at IBM Research, and it wouldn’t have been possible without a top-notch team, Muneeza Azmat, Rå¥å, and Mikhail Yurochkin, and my amazing advisor Jessy Li 👏

Peter West (@peterwesttm)

I’ve been fascinated lately by the question: what kinds of capabilities might base LLMs lose when they are aligned? i.e. where can alignment make models WORSE? I’ve been looking into this with Christopher Potts and here's one piece of the answer: randomness and creativity

Philippe Laban (@philippelaban)

🆕paper: LLMs Get Lost in Multi-Turn Conversation In real life, people don’t speak in perfect prompts. So we simulate multi-turn conversations — less lab-like, more like real use. We find that LLMs get lost in conversation. 👀What does that mean? 🧵1/N 📄arxiv.org/abs/2505.06120

Fei Liu @ #ICLR2025 (@feiliu_nlp)

✨ Our paper #PlanGenLLMs: A Modern Survey of LLM Planning Capabilities (arxiv.org/pdf/2502.11221) is accepted to the #ACL2025 main conference! Huge thanks to the reviewers for the unanimous 4-4-4 reviews and meta score ❤️ Grateful for your thoughtful feedback! #ACL2025 #NLProc

Liyan Tang (@liyantang4)

Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to be verbalized via text CoTs 📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B

Jessy Li (@jessyjli)

Super thrilled that Kanishka Misra 🌊 is going to join UT Linguistics Dept as our newest computational linguistics faculty member -- looking forward to doing great research together! 🧑‍🎓Students: Kanishka is a GREAT mentor -- apply to be his PhD student in the upcoming cycle!!

Sebastian Joseph (@sebajoed)

How good are LLMs at 🔭 scientific computing and visualization 🔭? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵

Jessy Li (@jessyjli)

Is AI ready to play a real role in science? This work with CosmicAI evaluates LLMs targeting the implementation of scientific workflows, and the scientific utility of visualizations from LLM-generated code -- and the answer is not yet, even with the best SOTA models 👇

Asher Zheng (@asher_zheng00)

Language is often strategic, but LLMs tend to play nice. How strategic are they really? Probing into that is key for future safety alignment.🛟 👉Introducing CoBRA🐍, a framework that assesses strategic language. Work with my amazing advisors Jessy Li and David Beaver! 🧵👇

CosmicAI (@cosmicai_inst)

CosmicAI collab: benchmarking the utility of LLMs in astronomy coding workflows & focusing on the key research capability of scientific visualization. Sebastian Joseph, Jessy Li, Murtaza Husain, Greg Durrett, Dr. Stephanie Juneau, paul.torrey, Adam Bolton, Stella Offner, Juan Frias, Niall Gaffney

Jessy Li (@jessyjli)

We have very good frameworks for cooperative dialog… but how about the opposite? Asher Zheng’s new paper takes a game-theoretic view and develops new metrics to quantify non-cooperative language ♟️ Turns out LLMs don’t have the pragmatic capabilities to perceive these…

Jessy Li (@jessyjli)

Check out this new opinion piece from Sebastian and Lily! We have really powerful AI systems now, so what’s the bottleneck preventing the wider adoption of fact checking systems, in high stakes scenarios like medicine? It’s how we define the tasks 👇

Manya Wadhwa (@manyawadhwa1)

Happy to share that EvalAgent has been accepted to #COLM2025 Conference on Language Modeling 🎉🇨🇦 We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks! 📜 arxiv.org/pdf/2504.15219

Hongli Zhan (@honglizhan)

👇Happening this afternoon at 4:30pm! Come meet Mikhail Yurochkin, Rå¥å, and me at East Exhibition Hall #1103. 📍I’m also on the industry job market this coming year! Let’s connect and chat about opportunities in the industry :)
