Kevin Wu (@kevinywu) Twitter Tweets • TwiCopy

James Zou

a year ago

What happens when an LLM’s internal knowledge conflicts with retrieved info? In #ClashEval (to appear in #NeurIPS2024) we find OpenAI models have much more retrieval bias than Claude, Gemini or Llama, often deferring to incorrect retrieved info, ignoring LM's correct prior. 1/2

108 likes, 1 reply, 19 reposts

James Zou

@james_y_zou

a year ago

How well do #LLM fine-tuning APIs work? Not well. We created #FineTuneBench to study if we can update #ChatGPT, #Gemini's knowledge or teach it new facts via finetuning APIs. Answer is mostly No. Finetune 4o-mini was the most effective; Gemini less arxiv.org/abs/2411.05059 Users

129 likes, 5 replies, 33 reposts

Alejandro Lozano

@ale9806_

a year ago

Biomedical datasets are often confined to specific domains, missing valuable insights from adjacent fields. To bridge this gap, we present BIOMEDICA: an open-source framework to extract and serialize PMC-OA. 📄Paper: lnkd.in/dUUgA6rR 🌐Website: lnkd.in/dnqZZW4M

145 likes, 13 replies, 54 reposts

Kevin Wu

@kevinywu

10 months ago

Which LLMs work best for medical queries? 🩺✨ Introducing MedArena 🏥—the first chatbot arena just for clinicians ⚕️! 👩‍⚕️👨‍⚕️ Licensed healthcare professionals can submit medical queries and compare + rank responses from two randomly selected LLMs (such as o1, Gemini 🌟, and

17 likes, 1 reply, 8 reposts

Eric Wu

@ericwu93

10 months ago

Want to compare how different LLMs handle your medical queries—for free? Excited to share MedArena 🏥, the first public medical chatbot arena where clinicians can evaluate and rank the latest medical LLMs! 🏆 If you are a licensed healthcare professional, we welcome you to submit

13 likes, 0 replies, 3 reposts

James Zou

@james_y_zou

10 months ago

📢Thrilled to introduce #MedArena medarena.ai 🏥, a platform for clinicians to try frontier #LLMs for free (o3, Gemini, Perplexity + more). Compare and rank medical chatbots and help shape the future of medical AI! 🏆 Current leaderboard👇 Please share #MedTwitter

198 likes, 10 replies, 73 reposts

Eric Topol

@erictopol

10 months ago

Introducing MedArena medarena.ai/app/?__theme=l… for clinicians to ask medical questions and compare results of LLM models James Zou Doximity

70 likes, 3 replies, 22 reposts

David Ouyang, MD

@david_ouyang

9 months ago

Really cool way to evaluate LLMs for medicine in blinded fashion. It really helps build intuition on what works and doesn't. For example, I really like that there is RAG in some models (able to cite literature and reference claims), but I've found when I've clicked into some of

31 likes, 2 replies, 6 reposts

James Zou

@james_y_zou

9 months ago

Interesting that Gemini Flash Thinking has emerged as clinicians' preferred model on #MedArena! 🏅 Clinicians around the world can now use and compare frontier #LLMs for free at medarena.ai/login. #medtwitter

67 likes, 1 reply, 8 reposts

StanfordDBDS

@stanforddbds

8 months ago

Zou Lab launches MedArena, a free platform for clinicians to use and compare frontier LLMs. MedArena is a free platform for clinicians to use and compare how frontier LLMs work on medical queries. Check it out at: medarena.ai/login

5 likes, 0 replies, 3 reposts

Kevin Wu

@kevinywu

8 months ago

Our paper out in Nature Communications! Citing relevant medical sources continues to be a difficult task for LLMs, largely mediated by a "tug-of-war" between model prior and context (are LLMs basing their answer off the source or do they find sources to back up their answer post-hoc?)

5 likes, 0 replies, 1 repost

James Zou

@james_y_zou

8 months ago

We discuss medarena.ai and some interesting initial findings in new Stanford HAI blog. hai.stanford.edu/news/medarena-…

12 likes, 1 reply, 4 reposts

Stanford HAI

@stanfordhai

7 months ago

Current paradigms for evaluating medical LLMs suffer from significant challenges that limit their real-world applications. To address this, scholars introduce a free platform for clinicians to test and compare top-performing LLMs on their medical queries. hai.stanford.edu/news/medarena-…

20 likes, 2 replies, 6 reposts

James Zou

@james_y_zou

6 months ago

Excited to introduce #CollabLLM -- a method to train LLMs to collaborate better w/ humans! Selected as #icml2025 oral (top 1%)🏅 New multi-turn training objective + user simulator👇

51 likes, 6 replies, 7 reposts

James Zou

@james_y_zou

5 months ago

📢New conference where AI is the primary author and reviewer! agents4science.stanford.edu Current venues don't allow AI-written papers, so it's hard to assess the +/- of such works🤔 #Agents4Science solicits papers where AI is the main author w/ human advisors. 💡Initial reviews by

425 likes, 16 replies, 103 reposts

Kevin Wu

@kevinywu

5 months ago

Fine-tuning APIs allow developers to update model weights for frontier models, but can they actually teach models new information? Our study published today in NEJM AI shows that out-of-box SFT with commercial APIs has poor generalizability on medical knowledge. That is to say,

11 likes, 0 replies, 2 reposts

Kevin Wu

@kevinywu

5 months ago

Had a fun time with Paul Yi and Eric Wu talking about medical LLM evaluations!

2 likes, 0 replies, 0 reposts