Lisa Dunlap (@lisabdunlap)'s Twitter Profile
Lisa Dunlap

@lisabdunlap

messin around with model evals @berkeley_ai and @lmarena_ai

ID: 1453078457978552320

Link: http://lisabdunlap.com · Joined: 26-10-2021 19:18:12

355 Tweets

1.1K Followers

252 Following

Lisa Dunlap (@lisabdunlap)'s Twitter Profile Photo

Can someone do a study on the rates of blue boxes with blue text in websites pre and post Claude? I swear I see this style everywhere now

YichuanWang (@yichuanm)'s Twitter Profile Photo

1/N 🚀 Launching LEANN — the tiniest vector index on Earth!

Fast, accurate, and 100% private RAG on your MacBook.
0% internet. 97% smaller. Semantic search on everything.
Your personal Jarvis, ready to dive into your emails, chats, and more.

🔗 Code: github.com/yichuan-w/LEANN
📄
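For intuition, semantic search over a vector index boils down to embedding texts and ranking by similarity. Here is a minimal brute-force sketch with toy bag-of-words embeddings — purely illustrative, not LEANN's actual learned embeddings or its compact index structure:

```python
import numpy as np

# Toy corpus; LEANN would index real documents (emails, chats, etc.)
# with learned embeddings rather than this bag-of-words stand-in.
docs = [
    "meeting notes about the quarterly budget",
    "recipe for tomato soup",
    "email thread on the budget review meeting",
]

vocab = sorted({w for d in docs for w in d.split()})

def embed(text: str) -> np.ndarray:
    """Bag-of-words vector over the toy vocabulary."""
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=np.float32)

def search(query: str, k: int = 1) -> list[int]:
    """Return indices of the k most cosine-similar documents."""
    q = embed(query)
    mat = np.stack([embed(d) for d in docs])
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
    return list(np.argsort(-sims)[:k])

print(search("budget meeting"))  # ranks the budget-related docs first
```

A real system replaces `embed` with a neural encoder and the linear scan with an approximate-nearest-neighbor index; the 97% size claim presumably comes from how the index stores (or avoids storing) those vectors.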
Lisa Dunlap (@lisabdunlap)'s Twitter Profile Photo

Recently, when using it in a somewhat long Cursor chat, it will just randomly delete all the code it has written from that chat, and restoring the checkpoint doesn't bring any of it back. Files it created are just completely empty. Anyone else notice this?

Nick Jiang @ ICLR (@nickhjiang)'s Twitter Profile Photo

What makes LLMs like Grok-4 unique?

We use sparse autoencoders (SAEs) to tackle queries like these and apply them to four data analysis tasks: data diffing, correlations, targeted clustering, and retrieval. By analyzing model outputs, SAEs find novel insights on model behavior!
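One way to picture the "data diffing" task: encode each model output's activations into sparse features, then rank features by how much their average activation differs between two corpora. A toy sketch below uses a fixed random projection plus ReLU as a stand-in for a trained SAE encoder — the diffing logic is the same, but real SAE weights are learned so the features are sparse and interpretable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained SAE encoder: fixed random projection + ReLU.
d_model, d_feat = 16, 64
W_enc = rng.normal(size=(d_model, d_feat)).astype(np.float32)

def sae_features(acts: np.ndarray) -> np.ndarray:
    """Encode activations into non-negative feature activations."""
    return np.maximum(acts @ W_enc, 0.0)

# Two synthetic "corpora" of activations (e.g. from two different LLMs).
# Corpus B is shifted along one direction, mimicking a behavioral difference.
direction = rng.normal(size=d_model).astype(np.float32)
corpus_a = rng.normal(size=(200, d_model)).astype(np.float32)
corpus_b = corpus_a + 3.0 * direction

# Data diffing: rank features by the gap in mean activation across corpora.
gap = sae_features(corpus_b).mean(axis=0) - sae_features(corpus_a).mean(axis=0)
top = np.argsort(-np.abs(gap))[:5]
print("features most changed between corpora:", top)
```

With a real SAE, the top-gap features come with interpretable labels, which is what turns this ranking into an actual insight about how two models differ.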
Jessy Lin (@realjessylin)'s Twitter Profile Photo

๐Ÿ” How do we teach an LLM to ๐˜ฎ๐˜ข๐˜ด๐˜ต๐˜ฆ๐˜ณ a body of knowledge? In new work with AI at Meta, we propose Active Reading ๐Ÿ“™: a way for models to teach themselves new things by self-studying their training data. Results: * ๐Ÿ”๐Ÿ”% on SimpleQA w/ an 8B model by studying the wikipedia

๐Ÿ” How do we teach an LLM to ๐˜ฎ๐˜ข๐˜ด๐˜ต๐˜ฆ๐˜ณ a body of knowledge?

In new work with <a href="/AIatMeta/">AI at Meta</a>, we propose Active Reading ๐Ÿ“™: a way for models to teach themselves new things by self-studying their training data. Results:

* ๐Ÿ”๐Ÿ”% on SimpleQA w/ an 8B model by studying the wikipedia
Liana (@lianapatel_)'s Twitter Profile Photo

Interested in building and benchmarking deep research systems?

Excited to introduce DeepScholar-Bench, a live benchmark for generative research synthesis, from our team at Stanford and Berkeley!

🏆 Live Leaderboard: guestrin-lab.github.io/deepscholar-le…
📚 Paper: arxiv.org/abs/2508.20033
🛠️
XuDong Wang (@xdwang101)'s Twitter Profile Photo

🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models

🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs:

GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75

Code: github.com/HorizonWind200…

1/n
Tsung-Han (Patrick) Wu @ ICLR'25 (@tsunghan_wu)'s Twitter Profile Photo

NeurIPS 2025 ✅ Our generate-verify de-hallucination paper is in!
✔️ DFS-backtracking-like tricks fix VLM hallucinations
✔️ Explicit confidence targets matter (we stressed this before OpenAI's "Why LMs Hallucinate")
👉 Check it out: reverse-vlm.github.io

See u all at SD!
Melissa Pan (@melissapan)'s Twitter Profile Photo

Excited to share: MAST has been accepted as 🌟 NeurIPS D&B Spotlight 🌟

Updates for the community:
- NEW: We open-source 1,000+ multi-agent traces (link in 🧵).
- Lots of exciting use cases are emerging; we'll be releasing blogs & tutorials to help you get started.
- And … more
LMSYS Org (@lmsysorg)'s Twitter Profile Photo

SGLang now supports deterministic LLM inference! Building on Thinking Machines' batch-invariant kernels, we integrated deterministic attention & sampling ops into a high-throughput engine - fully compatible with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling.

✅
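A quick illustration of why batch-invariant kernels are needed at all: floating-point addition is not associative, so if batch composition changes the order in which a kernel reduces the same values, the result can change bit-for-bit. In float32:

```python
import numpy as np

# Same three values, two reduction orders, two different answers.
a = np.float32(1e8)
b = np.float32(1.0)

left_to_right = (a + b) - a   # 1.0 is absorbed by rounding at 1e8
other_order = (a - a) + b     # 1.0 survives

print(left_to_right, other_order)
```

Batch-invariant kernels pin down the reduction order so an output token's logits don't depend on which other requests happened to share its batch, which is what makes greedy decoding reproducible end to end.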
lmarena.ai (formerly lmsys.org) (@lmarena_ai)'s Twitter Profile Photo

🎉 Re-introducing Categories in Vision Arena! Since we first introduced categories over two years ago (and Vision Arena last year), the AI evaluation landscape has grown rapidly. Categories let us zoom in on model performance for specific areas, from captioning to diagrams. 🧵

Parth Asawa (@pgasawa)'s Twitter Profile Photo

Training our advisors was too hard, so we tried to train black-box models like GPT-5 instead. Check out our work: Advisor Models, a training framework that adapts frontier models behind an API to your specific environment, users, or tasks using a smaller, advisor model (1/n)!
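The idea can be sketched as an outer loop that only ever edits the black-box model's input: a small advisor proposes guidance, the frontier model answers with that guidance prepended, and reward updates the advisor alone. Everything below — the toy "frontier model", the reward, and the advisor's bandit-style policy — is invented for illustration and is not the paper's actual training method:

```python
import random

random.seed(0)

def frontier_model(prompt: str, advice: str) -> str:
    """Frozen black-box model: we can only change its input."""
    if "be concise" in advice:
        return "short answer"
    return "a very long rambling answer to the question"

def reward(output: str) -> float:
    """Toy environment preference: this user wants concise answers."""
    return 1.0 if len(output.split()) <= 3 else 0.0

# "Advisor" with a trivial bandit policy over candidate advice strings:
# sample advice, observe reward, reinforce what worked.
advice_pool = ["be concise", "explain step by step", "use formal tone"]
scores = {a: 0.0 for a in advice_pool}

for _ in range(30):
    advice = random.choice(advice_pool)
    r = reward(frontier_model("user question", advice))
    scores[advice] += r  # only the advisor's policy is updated

best = max(scores, key=scores.get)
print("learned advice:", best)
```

In the real setting the advisor would be a trainable LM producing free-form advice, but the division of labor is the same: the frontier model stays fixed behind its API, and all adaptation happens in the smaller model that shapes its input.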