Uri Shaham (@uri_shaham)'s Twitter Profile
Uri Shaham

@uri_shaham

Research Scientist at Google Research

ID: 1186248407939244033

Joined: 21-10-2019 11:51:34

156 Tweets

421 Followers

243 Following

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Google presents Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Highlights the risk in introducing new factual knowledge through fine-tuning, which leads to hallucinations

arxiv.org/abs/2405.05904
Zorik Gekhman (@zorikgekhman)'s Twitter Profile Photo

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? New preprint!📣

- LLMs struggle to integrate new factual knowledge through fine-tuning
- As the model eventually learns new knowledge, it becomes more prone to hallucinations😵‍💫

📜arxiv.org/pdf/2405.05904
🧵1/12👇

Avi Caciularu (@clu_avi)'s Twitter Profile Photo

🚨 New Paper 🚨
Are current LLMs up to the task of solving *complex* instructions based on content-rich text?
Our new dataset, TACT, sheds some light on this challenge.
How does it work?
arxiv.org/abs/2406.03618
Work by Google AI & Google DeepMind
👇🧵
Mor Geva (@megamor2)'s Twitter Profile Photo

Do you have a "tell" when you are about to lie?

We find that LLMs have “tells” in their internal representations which allow estimating how knowledgeable a model is about an entity 𝘣𝘦𝘧𝘰𝘳𝘦 it generates even a single token.

Paper: arxiv.org/abs/2406.12673… 🧵

Daniela Gottesman
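The "tell" idea — estimating how knowledgeable a model is from its internal representation before any token is generated — can be sketched as a linear probe. The toy data, dimensions, and training loop below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

# Toy sketch: train a linear probe on pre-generation hidden states to
# predict whether the model "knows" an entity (label 1) or not (label 0).
rng = np.random.default_rng(0)
dim = 16

# Hypothetical hidden states: known entities cluster away from unknown ones.
known = rng.normal(loc=1.0, size=(100, dim))
unknown = rng.normal(loc=-1.0, size=(100, dim))
X = np.vstack([known, unknown])
y = np.array([1] * 100 + [0] * 100)

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * np.mean(p - y)

accuracy = np.mean(((1 / (1 + np.exp(-(X @ w + b)))) > 0.5) == y)
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

On real models the probe would read an actual hidden state at the entity's position; here the separable clusters stand in for that signal.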
Maor Ivgi (@maorivg)'s Twitter Profile Photo

1/5 🧠 Excited to share our latest paper focusing on the heart of LLM training: data curation! We train a 7B LLM achieving 64% on 5-shot MMLU, using only 2.6T tokens. The key to this performance? Exceptional data curation. #LLM #DataCuration

Arie Cattan (@ariecattan)'s Twitter Profile Photo

🚨🚨 Check out our new paper for a new ICL method that greatly boosts LLMs in long contexts! 

>>

arxiv.org/abs/2406.13632
omer goldman (@omernlp)'s Twitter Profile Photo

new models have an amazingly long context. but can we actually tell how well they deal with it?
🚨🚨NEW PAPER ALERT🚨🚨
with Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan and Reut Tsarfaty
arXiv: arxiv.org/abs/2407.00402…

1/🧵
Maor Ivgi (@maorivg)'s Twitter Profile Photo

1/7 🚨 What do LLMs do when they are uncertain? We found that the stronger the LLM, the more it hallucinates and the less it loops! This pattern extends to sampling methods and instruction tuning. 🧵👇
Mor Geva, Jonathan Berant, Ori Yoran
Ori Yoran (@oriyoran)'s Twitter Profile Photo

Can AI agents solve realistic, time-consuming web tasks such as “Which gyms near me have fitness classes on the weekend, before 7AM?" We introduce AssistantBench, a benchmark with 214 such tasks. Our new GPT-4 based agent gets just 25% accuracy! assistantbench.github.io

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Google presents CoverBench: A Challenging Benchmark for Complex Claim Verification

Provides a significant challenge to current models with large headroom

arxiv.org/abs/2408.03325
AK (@_akhaliq)'s Twitter Profile Photo

Google announces CoverBench

A Challenging Benchmark for Complex Claim Verification

discuss: huggingface.co/papers/2408.03…

There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that
Alon Jacovi (@alon_jacovi)'s Twitter Profile Photo

New complex reasoning eval set!

CoverBench: Verify whether a claim is correct given a rich context. It requires implicit complex reasoning.

It's efficient (<1k ex), convenient (binary classification), and hard. Take a look!

arxiv.org/abs/2408.03325
huggingface.co/datasets/googl…
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)'s Twitter Profile Photo

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

abs: arxiv.org/abs/2408.11039

New paper from Meta that introduces Transfusion, a recipe for training a model that can seamlessly generate discrete and continuous modalities. The authors pretrain a
Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Meta presents Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

- Can generate images and text on a par with similar scale diffusion models and language models
- Compresses each image to just 16 patches

arxiv.org/abs/2408.11039
Ori Yoran (@oriyoran)'s Twitter Profile Photo

Working on a new web agent? AssistantBench, our benchmark with realistic and challenging web tasks, just got an update:

* Our SeePlanAct Agent with Sonnet 3.5 achieved a new SoTA of 26.4%. 
* We just open sourced our agent.
* Accepted to #EMNLP2024!
Or Honovich (@ohonovich)'s Twitter Profile Photo

Scaling inference compute by repeated sampling boosts coverage (% problems solved), but could this be due to lucky guesses, rather than correct reasoning?

We show that sometimes, guessing beats repeated sampling 🎲

Gal Yona, Omer Levy, roeeaharoni

arxiv.org/abs/2410.15466
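The coverage-vs-guessing point can be illustrated with a small closed-form sketch. The 4-option multiple-choice setting and numbers below are assumptions for illustration, not the paper's experiments:

```python
# Coverage under repeated sampling: the chance that at least one of k
# attempts is correct. Even a model that guesses uniformly on 4-option
# multiple-choice questions reaches high coverage as k grows, so coverage
# alone can't separate correct reasoning from lucky guessing.

def coverage(p_correct: float, k: int) -> float:
    """P(at least one of k independent samples is correct)."""
    return 1 - (1 - p_correct) ** k

# Uniform random guessing on 4 options: p = 0.25 per attempt.
for k in (1, 4, 16):
    print(f"k={k:2d}: random-guess coverage = {coverage(0.25, k):.2f}")
```

With 16 samples, pure guessing already "solves" about 99% of such problems, which is the inflation the tweet warns about.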
Ori Yoran (@oriyoran)'s Twitter Profile Photo

New #ICLR2024 paper! The KoLMogorov Test: can CodeLMs compress data by code generation? The optimal compression for a sequence is the shortest program that generates it. Empirically, LMs struggle even on simple sequences, but can be trained to outperform current methods! 🧵1/7
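The idea of compression by code generation — the shortest program that reproduces a sequence is its best compression — can be shown with a toy sketch. The scoring function and example program below are illustrative assumptions, not the benchmark's actual protocol:

```python
# Toy illustration of compression-as-code (Kolmogorov-style): a sequence
# is compressible if a program shorter than the raw data reproduces it.

def compression_ratio(sequence: str, program: str) -> float:
    """Ratio of program length to raw data length (lower is better)."""
    namespace = {}
    exec(program, namespace)  # the candidate program must define `output`
    assert namespace["output"] == sequence, "program must reproduce the data"
    return len(program) / len(sequence)

# A highly regular sequence: 100 repetitions of "ab".
data = "ab" * 100
prog = 'output = "ab" * 100'

ratio = compression_ratio(data, prog)
print(f"raw: {len(data)} chars, program: {len(prog)} chars, ratio: {ratio:.2f}")
```

A code LM would be asked to produce `prog` given `data`; the tweet's observation is that models struggle even on regular sequences like this one.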

omer goldman (@omernlp)'s Twitter Profile Photo

Wanna check how well a model can share knowledge between languages? Of course you do! 🤩

But can you do it without access to the model’s weights? Now you can with ECLeKTic 🤯
Google AI (@googleai)'s Twitter Profile Photo

Introducing ECLeKTic, a new benchmark for Evaluating Cross-Lingual Knowledge Transfer in LLMs. It uses a closed-book QA task, where models must rely on internal knowledge to answer questions based on information captured only in a single language. More → goo.gle/3Y5TqvZ
Itay Itzhak (@itay_itzhak_)'s Twitter Profile Photo

🚨New paper alert🚨

🧠 Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing?

Excited to share our new paper, accepted to CoLM 2025🎉!
See thread below 👇
#BiasInAI #LLMs #MachineLearning #NLProc