Kevin Wu (@kevinywu) 's Twitter Profile
Kevin Wu

@kevinywu

PhD Student @ Stanford Biomedical Informatics

ID: 1379118908691476481

Link: http://kevinwu.ai

Joined: 05-04-2021 17:09:10

29 Tweets

163 Followers

66 Following

James Zou (@james_y_zou) 's Twitter Profile Photo

What happens when an LLM’s internal knowledge conflicts with retrieved info? In #ClashEval (to appear in #NeurIPS2024) we find OpenAI models have much more retrieval bias than Claude, Gemini or Llama, often deferring to incorrect retrieved info, ignoring LM's correct prior. 1/2

James Zou (@james_y_zou) 's Twitter Profile Photo

How well do #LLM fine-tuning APIs work? Not well. We created #FineTuneBench to study if we can update #ChatGPT, #Gemini's knowledge or teach it new facts via finetuning APIs. Answer is mostly No. Finetune 4o-mini was the most effective; Gemini less arxiv.org/abs/2411.05059 Users

Alejandro Lozano (@ale9806_) 's Twitter Profile Photo

Biomedical datasets are often confined to specific domains, missing valuable insights from adjacent fields. To bridge this gap, we present BIOMEDICA: an open-source framework to extract and serialize PMC-OA. 📄Paper: lnkd.in/dUUgA6rR 🌐Website: lnkd.in/dnqZZW4M

Kevin Wu (@kevinywu) 's Twitter Profile Photo

Which LLMs work best for medical queries? 🩺✨ Introducing MedArena 🏥—the first chatbot arena just for clinicians ⚕️! 👩‍⚕️👨‍⚕️ Licensed healthcare professionals can submit medical queries and compare + rank responses from two randomly selected LLMs (such as o1, Gemini 🌟, and

Eric Wu (@ericwu93) 's Twitter Profile Photo

Want to compare how different LLMs handle your medical queries—for free? Excited to share MedArena 🏥, the first public medical chatbot arena where clinicians can evaluate and rank the latest medical LLMs! 🏆 If you are a licensed healthcare professional, we welcome you to submit

James Zou (@james_y_zou) 's Twitter Profile Photo

📢Thrilled to introduce #MedArena medarena.ai 🏥, a platform for clinicians to try frontier #LLMs for free (o3, Gemini, Perplexity + more). Compare and rank medical chatbots and help shape the future of medical AI! 🏆 Current leaderboard👇 Please share #MedTwitter

David Ouyang, MD (@david_ouyang) 's Twitter Profile Photo

Really cool way to evaluate LLMs for medicine in blinded fashion. It really helps build intuition on what works and doesn't. For example, I really like that there is RAG in some models (able to cite literature and reference claims), but I've found when I've clicked into some of

James Zou (@james_y_zou) 's Twitter Profile Photo

Interesting that Gemini Flash Thinking has emerged as clinicians' preferred model on #MedArena! 🏅 Clinicians around the world can now use and compare frontier #LLMs for free at medarena.ai/login. #medtwitter

StanfordDBDS (@stanforddbds) 's Twitter Profile Photo

Zou Lab launches MedArena, a free platform for clinicians to use and compare frontier LLMs. MedArena is a free platform for clinicians to use and compare how frontier LLMs work on medical queries. Check it out at: medarena.ai/login

Kevin Wu (@kevinywu) 's Twitter Profile Photo

Our paper out in Nature Communications! Citing relevant medical sources continues to be a difficult task for LLMs, largely mediated by a "tug-of-war" between model prior and context (are LLMs basing their answer off the source or do they find sources to back up their answer post-hoc?)

Stanford HAI (@stanfordhai) 's Twitter Profile Photo

Current paradigms for evaluating medical LLMs suffer from significant challenges that limit their real-world applications. To address this, scholars introduce a free platform for clinicians to test and compare top-performing LLMs on their medical queries. hai.stanford.edu/news/medarena-…

James Zou (@james_y_zou) 's Twitter Profile Photo

Excited to introduce #CollabLLM -- a method to train LLMs to collaborate better w/ humans! Selected as #icml2025 oral (top 1%)🏅 New multi-turn training objective + user simulator👇

James Zou (@james_y_zou) 's Twitter Profile Photo

📢New conference where AI is the primary author and reviewer! agents4science.stanford.edu Current venues don't allow AI-written papers, so it's hard to assess the +/- of such works🤔 #Agents4Science solicits papers where AI is the main author w/ human advisors. 💡Initial reviews by

Kevin Wu (@kevinywu) 's Twitter Profile Photo

Fine-tuning APIs allow developers to update model weights for frontier models, but can they actually teach models new information? Our study published today in NEJM AI shows that out-of-box SFT with commercial APIs has poor generalizability on medical knowledge. That is to say,