BerkeleyNLP (@berkeleynlp)'s Twitter Profile

We work on natural language processing, machine learning, linguistics, and deep learning.

ID: 1173334037777141760

Link: http://nlp.cs.berkeley.edu/ · Joined: 15-09-2019 20:33:38

109 Tweets

5.5K Followers

33 Following

Catherine Chen (@cathychen23)'s Twitter Profile Photo

Do brain representations of language depend on whether the inputs are pixels or sounds? Our Communications Biology paper studies this question from the perspective of language timescales. We find that representations are highly similar between modalities! rdcu.be/dACh5 1/8

Katie Kang (@katie_kang_)'s Twitter Profile Photo

We know LLMs hallucinate, but what governs what they dream up? Turns out it’s all about the “unfamiliar” examples they see during finetuning. Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate. arxiv.org/abs/2403.05612 🧵

Jiayi Pan (@jiayi_pirate)'s Twitter Profile Photo

New paper from @Berkeley_AI on Autonomous Evaluation and Refinement of Digital Agents! We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%. arxiv.org/abs/2404.06474 [🧵]

Sanjay Subramanian (@sanjayssub)'s Twitter Profile Photo

Excited to share some recent work: "Pose Priors from Language Models". We show how to use multimodal LMs to improve 3D human pose estimates in situations with physical contact. Joint work w/ Evonne Ng, Lea Müller, Dan Klein (BerkeleyNLP), Shiry Ginosar, Trevor Darrell

Kayo Yin (@kayo_yin)'s Twitter Profile Photo

Spoken languages exhibit communicative efficiency by minimizing speaker+listener effort. What about signed languages? American Sign Language handshapes reflect efficiency pressures - but only in native signs, not signs borrowed from English! #ACL2024 arxiv.org/abs/2406.04024 🧵

Nicholas Tomlin (@nickatomlin)'s Twitter Profile Photo

New preprint! 📰 Can LMs be improved with AlphaGo-style self-play? The classic answer is that self-play only works in certain types of zero-sum games, but we show that it can be effective in cooperative games too. Paper: arxiv.org/abs/2406.18872 Code: github.com/nickatomlin/lm…

Charlie Snell (@sea_snell)'s Twitter Profile Photo

On difficult problems, humans can think longer to improve their decisions. Can we instill a similar capability into LLMs, and does it pay off? In our paper, we find that by optimally scaling test-time compute we can outperform *much* larger models in a FLOPs-matched evaluation.

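The intuition behind spending more test-time compute can be illustrated with best-of-N sampling, the simplest such strategy. This is a hedged sketch only, not the paper's actual method: `generate` and `score` below are hypothetical stand-ins for an LLM sampler and a verifier/reward model.

```python
import random

def generate(prompt, rng):
    # Stand-in for sampling one candidate answer from an LLM.
    # For illustration it just draws a random integer "answer".
    return rng.randint(0, 9)

def score(prompt, answer):
    # Stand-in for a verifier / reward model rating a candidate.
    # For illustration, answers closer to 7 score higher.
    return -abs(answer - 7)

def best_of_n(prompt, n, seed=0):
    """Spend more test-time compute by sampling n candidates
    and keeping the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))
```

With a fixed seed, the n-candidate pool contains the single candidate drawn at n=1, so the best-of-32 answer can never score worse than the best-of-1 answer; larger n trades compute for answer quality, which is the knob the tweet refers to scaling optimally.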
Ruiqi Zhong (@zhongruiqi)'s Twitter Profile Photo

Large mental model update after working on this project: 1. Even when an LLM does not know what's correct, it can still learn to assist humans in finishing the task. 2. Sometimes LLMs are even better than humans at distinguishing what is helpful for humans (!)

Ruiqi Zhong (@zhongruiqi)'s Twitter Profile Photo

A central concern in alignment is that AI systems will "deceive" humans by doing what looks correct to humans but is actually wrong. While a lot of work is motivated by this assumption, we lack empirical evidence. Our work shows systematic evidence that this concern is real.

Ruiqi Zhong (@zhongruiqi)'s Twitter Profile Photo

Graphical models struggle to explain patterns in text & images 😭 LLMs can do this but hallucinate 👿 It’s time to combine their strengths! We define models with natural language parameters, unlocking opportunities in science, business, ML, etc.

Ruiqi Zhong (@zhongruiqi)'s Twitter Profile Photo

Given the rapid progress of LLMs, I feel compelled to present this topic (even if it's not the main focus of my Ph.D. work). I will cover concrete ML problems related to "AI deception" -- undesirable behaviors of AI systems that are hard to catch -- and how to study them.

Kayo Yin (@kayo_yin)'s Twitter Profile Photo

🚨New dataset + challenge #EMNLP2024🚨 We release ASL STEM Wiki: the first signing dataset of STEM articles! 📰 254 Wikipedia articles 📹 ~300 hours of ASL interpretations 👋 New task: automatic sign suggestion to make STEM education more accessible microsoft.com/en-us/research… 🧵

Josh Barua (@baruajosh)'s Twitter Profile Photo

Do LLMs encode knowledge of concept variation across languages? Can they use this knowledge to resolve ambiguity in translation? Our #EMNLP2024 paper finds a big performance gap between closed- and open-weight LLMs, but lexical rules can help transfer knowledge across models! 🧵

Kayo Yin (@kayo_yin)'s Twitter Profile Photo

Cool new dataset for translation ambiguity in 9 language pairs (7 low-resource), and we found LLM-generated descriptions help weaker models resolve ambiguity! Josh Barua will be presenting this at the 2-3:30pm poster session today; come talk to us about multilinguality in LLMs!

Charlie Snell (@sea_snell)'s Twitter Profile Photo

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task? We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵

Kayo Yin (@kayo_yin)'s Twitter Profile Photo

Induction heads are commonly associated with in-context learning, but are they the primary driver of ICL at scale? We find that recently discovered "function vector" heads, which encode the ICL task, are the actual primary drivers of few-shot ICL. arxiv.org/abs/2502.14010 🧵

Lakshya A Agrawal (@lakshyaaagrawal)'s Twitter Profile Photo

🧵Introducing LangProBe: the first benchmark testing where and how composing LLMs into language programs affects cost-quality tradeoffs! We find that, on avg across diverse tasks, smaller models within optimized programs beat calls to larger models at a fraction of the cost.

Ruiqi Zhong (@zhongruiqi)'s Twitter Profile Photo

Finished my dissertation!!! (scalable oversight, link below) Very fortunate to have Jacob Steinhardt and Dan Klein as my advisors! Words can't describe my gratitude, so I used a pic of Frieren w/ her advisor :) Thanks for developing my research mission and teaching me magic

Nicholas Tomlin (@nickatomlin)'s Twitter Profile Photo

I'm incredibly excited to share that I'll be joining TTIC as an assistant professor in Fall 2026! Until then, I'm wrapping up my PhD at Berkeley, and after that I'll be a faculty fellow at NYU Center for Data Science