Lisa Dunlap (@lisabdunlap)'s Twitter Profile
Lisa Dunlap

@lisabdunlap

messin around with model evals @berkeley_ai and @lmarena_ai

ID: 1453078457978552320

Link: http://lisabdunlap.com · Joined: 26-10-2021 19:18:12

355 Tweets

1.1K Followers

252 Following

Lisa Dunlap (@lisabdunlap):

Can someone do a study on the rates of blue boxes with blue text in websites pre and post Claude? I swear I see this style everywhere now
YichuanWang (@yichuanm):

1/N 🚀 Launching LEANN — the tiniest vector index on Earth!

Fast, accurate, and 100% private RAG on your MacBook.
0% internet. 97% smaller. Semantic search on everything.
Your personal Jarvis, ready to dive into your emails, chats, and more.

🔗 Code: github.com/yichuan-w/LEANN
📄
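The pitch above boils down to embedding-based retrieval run entirely on-device. A toy sketch of that idea (bag-of-words vectors stand in for a real embedding model, and the documents are invented; LEANN's actual index is approximate and far more compact than this brute-force scan):

```python
import numpy as np

def embed(text, vocab):
    # Bag-of-words count vector, L2-normalized (a stand-in for a
    # learned text embedding).
    counts = np.array([text.lower().split().count(w) for w in vocab], float)
    n = np.linalg.norm(counts)
    return counts / n if n else counts

docs = [
    "meeting notes about quarterly budget",
    "email thread on vacation plans",
    "chat log discussing budget cuts",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
index = np.stack([embed(d, vocab) for d in docs])  # the "vector index"

def search(query, k=2):
    # Cosine similarity: rows of `index` are unit vectors, so a dot
    # product ranks documents by semantic (here: lexical) overlap.
    scores = index @ embed(query, vocab)
    return [docs[i] for i in np.argsort(-scores)[:k]]

print(search("budget"))
```

No network call happens anywhere in the loop, which is the "0% internet" property in miniature; the real system replaces both the embedder and the linear scan.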
Lisa Dunlap (@lisabdunlap):

Recently, when working in a somewhat long Cursor chat, it will just randomly delete all the code it has written from that chat, and restoring the checkpoint doesn't bring any of it back. Files it created are just completely empty. Anyone else notice this?

Nick Jiang @ ICLR (@nickhjiang):

What makes LLMs like Grok-4 unique?

We use sparse autoencoders (SAEs) to tackle queries like these and apply them to four data analysis tasks: data diffing, correlations, targeted clustering, and retrieval. By analyzing model outputs, SAEs find novel insights on model behavior!
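As a rough illustration of one of those tasks, "data diffing" via feature-firing frequencies can be sketched as follows (a random-weight SAE stands in for a trained one, and the data, dimensions, and bias are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 16, 64  # overcomplete feature dictionary

# Random-weight SAE standing in for a trained one: a ReLU encoder
# yields sparse, nonnegative feature activations; a tied linear
# decoder (W_enc.T) would reconstruct the input.
W_enc = rng.standard_normal((d_model, d_feat)) / np.sqrt(d_model)
b_enc = -0.5 * np.ones(d_feat)  # negative bias encourages sparsity

def encode(x):
    return np.maximum(x @ W_enc + b_enc, 0.0)

# "Data diffing": compare how often each feature fires on embeddings
# of model A's outputs vs model B's; the largest frequency gaps point
# at candidate behavioral differences between the models.
acts_a = encode(rng.standard_normal((200, d_model)))
acts_b = encode(rng.standard_normal((200, d_model)) + 0.8)  # shifted dist
freq_a = (acts_a > 0).mean(axis=0)
freq_b = (acts_b > 0).mean(axis=0)
top_diff = np.argsort(freq_a - freq_b)[:5]  # features B fires far more often
print(top_diff)
```

In a real pipeline the inputs would be embeddings of model responses and each feature index would come with an interpretable label, so the top-gap features read as a human-legible diff between the two models.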
Jessy Lin (@realjessylin):

🔍 How do we teach an LLM to 𝘮𝘢𝘴𝘵𝘦𝘳 a body of knowledge?

In new work with AI at Meta, we propose Active Reading 📙: a way for models to teach themselves new things by self-studying their training data. Results:

* 𝟔𝟔% on SimpleQA w/ an 8B model by studying the wikipedia
Liana (@lianapatel_):

Interested in building and benchmarking deep research systems?

Excited to introduce DeepScholar-Bench, a live benchmark for generative research synthesis, from our team at Stanford and Berkeley!

🏆Live Leaderboard guestrin-lab.github.io/deepscholar-le…
📚 Paper: arxiv.org/abs/2508.20033
🛠️
XuDong Wang (@xdwang101):

🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models

🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs:

GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75

Code: github.com/HorizonWind200…

1/n
Tsung-Han (Patrick) Wu @ ICLR’25 (@tsunghan_wu):

NeurIPS 2025 ✅ Our generate-verify de-hallucination paper is in!
✔️ DFS-backtracking–like tricks fix VLM hallucinations
✔️ Explicit confidence targets matter (we stressed this before OpenAI’s “Why LMs Hallucinate”)
👉 Check it out: reverse-vlm.github.io

See u all at SD!
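The generate-verify-backtrack pattern the thread describes can be sketched generically (the candidate sentences and the stand-in verifier below are invented for illustration; the paper's actual method lives at reverse-vlm.github.io):

```python
# Draft a caption sentence by sentence; whenever the verifier rejects
# a candidate, back up to the last verified prefix and try the next
# alternative, DFS-style.

CANDIDATES = {
    0: ["A dog sits on grass.", "A dragon sits on grass."],
    1: ["It holds a red sword.", "It holds a red ball."],
}
VERIFIED = {"A dog sits on grass.", "It holds a red ball."}  # stand-in verifier

def verify(sentence):
    return sentence in VERIFIED

def generate(step=0, prefix=()):
    if step == len(CANDIDATES):
        return list(prefix)              # every claim passed verification
    for cand in CANDIDATES[step]:        # try alternatives in order
        if verify(cand):                 # accept only verified claims
            out = generate(step + 1, prefix + (cand,))
            if out is not None:
                return out
        # rejected: fall through, i.e. backtrack and resample
    return None                          # no verified continuation exists

print(generate())
```

The key design choice the pattern captures is that hallucinated claims ("dragon", "sword") never survive into the final output, because acceptance is gated by the verifier rather than by the generator's own confidence.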
Melissa Pan (@melissapan):

Excited to share: MAST has been accepted as a 🌟 NeurIPS D&B Spotlight 🌟

Updates for the community:
- NEW: We open-source 1,000+ multi-agent traces (link in 🧵).
- Lots of exciting use cases are emerging; we’ll be releasing blogs & tutorials to help you get started.
- And … more
LMSYS Org (@lmsysorg):

SGLang now supports deterministic LLM inference! Building on Thinking Machines’ batch-invariant kernels, we integrated deterministic attention & sampling ops into a high-throughput engine - fully compatible with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling.

✅
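The core idea of batch invariance can be shown with a toy reduction (NumPy stands in for the CUDA kernels, and the block sizes are illustrative): fp32 addition is non-associative, so a kernel whose reduction tree depends on batch shape can return different bits for the same request, while a fixed-order kernel cannot.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4096)).astype(np.float32)

def naive_row_sum(row, batch_size):
    # Simulates a throughput-tuned kernel whose reduction tile depends
    # on batch size: the same request can then get numerically
    # different results depending on its batchmates.
    block = 4096 if batch_size == 1 else 256
    return np.float32(sum(np.float32(row[i:i + block].sum())
                          for i in range(0, len(row), block)))

def invariant_row_sum(row, batch_size):
    # Batch-invariant kernel: the reduction order is fixed per row and
    # is never a function of how many requests share the batch.
    return np.float32(sum(np.float32(row[i:i + 256].sum())
                          for i in range(0, len(row), 256)))

drift = [naive_row_sum(r, 1) != naive_row_sum(r, 8) for r in X]
print(sum(drift), "of 8 rows drift under the batch-size-dependent kernel")

solo    = [invariant_row_sum(r, 1) for r in X]
batched = [invariant_row_sum(r, 8) for r in X]
assert solo == batched  # bit-identical regardless of batching
```

Making attention and sampling kernels invariant in this sense is what lets greedy or seeded non-greedy decoding reproduce exactly across different serving loads.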
lmarena.ai (formerly lmsys.org) (@lmarena_ai):

🎉 Re-introducing Categories in Vision Arena! Since we first introduced categories over two years ago (and Vision Arena last year), the AI evaluation landscape has grown rapidly. Categories let us zoom in on model performance for specific areas, from captioning to diagrams. 🧵

Parth Asawa (@pgasawa):

Training our advisors was too hard, so we tried to train black-box models like GPT-5 instead. Check out our work: Advisor Models, a training framework that adapts frontier models behind an API to your specific environment, users, or tasks using a smaller, advisor model (1/n)!

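The advisor pattern can be sketched as a loop where only a small policy over advice strings is trained, while the frontier model stays frozen behind its API (everything below - the advice pool, stub model, and reward - is invented for illustration, not the paper's setup):

```python
# A tiny bandit-style "advisor": it learns which advice string to
# prepend to queries sent to a frozen black-box model, using only the
# task reward. The black-box model's weights are never touched.

ADVICE_POOL = [
    "Answer in one short sentence.",
    "Show step-by-step reasoning before the answer.",
]
weights = [0.0, 0.0]  # advisor "policy": a learned score per advice string

def black_box_model(prompt):
    # Stub for a frontier model behind an API.
    return "long answer" if "step-by-step" in prompt else "short answer"

def task_reward(output):
    # This particular environment happens to prefer short answers.
    return 1.0 if output == "short answer" else 0.0

def pick_advice():
    return max(range(len(ADVICE_POOL)), key=lambda i: weights[i])

for _ in range(5):  # update only the advisor from reward feedback
    i = pick_advice()
    out = black_box_model(ADVICE_POOL[i] + "\n" + "user query")
    weights[i] += 0.1 * (task_reward(out) - 0.5)

print(ADVICE_POOL[pick_advice()])
```

The design point the sketch captures: all adaptation to the environment lives in the small, trainable advisor, so the same frozen API model can be specialized per user or per task.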