Aman Arora (@amaarora)'s Twitter Profile
Aman Arora

@amaarora

Building AI Agents 🤖 | Blog: amaarora.github.io | Previously: @weights_biases; @Harrison.ai

ID: 2582562763

Link: http://amaarora.github.io · Joined: 22-06-2014 17:05:12

3.3K Tweets

5.5K Followers

1.1K Following

clem 🤗 (@clementdelangue)

We need better agent evaluations! Glad to have collaborated with Meta Super Intelligence Lab to release Gaia2 and ARE!

GPT5 (high) from OpenAI is leading on execution, search, ambiguity, adaptability and noise.

Kimi-K2 from Kimi.ai is leading open weight.

Full
Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…

Matei Zaharia (@matei_zaharia)

Prompt optimization is becoming a powerful technique for improving AI that can even beat SFT! Here are some of our research results with GEPA at Databricks, in difficult Agent Bricks info extraction tasks. We can match the best models at 90x lower cost, or improve them by ~6%.

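For readers who want the mechanics: below is a toy sketch of the kind of evolutionary prompt-optimization loop GEPA performs. This is not the GEPA or Databricks implementation; `call_llm`, the mutation prompt, and the exact-match metric are all illustrative assumptions.

```python
# Toy evolutionary prompt-optimization loop in the spirit of GEPA.
# NOT the real implementation; call_llm and the metric are placeholders.

def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder: swap in your actual model endpoint."""
    raise NotImplementedError

def evaluate(prompt: str, devset: list[tuple[str, str]]) -> float:
    """Score a candidate prompt by exact-match accuracy on (input, expected) pairs."""
    return sum(call_llm(prompt, x) == y for x, y in devset) / len(devset)

def mutate(prompt: str) -> str:
    """GEPA mutates via LLM reflection on failure traces; this blind
    rewrite is a simplified stand-in."""
    return call_llm("Rewrite this instruction to be clearer and more specific.", prompt)

def optimize(seed_prompt: str, devset, rounds: int = 10) -> str:
    pool = [(evaluate(seed_prompt, devset), seed_prompt)]
    for _ in range(rounds):
        _, parent = max(pool)        # greedily mutate the best candidate so far
        child = mutate(parent)
        pool.append((evaluate(child, devset), child))
    return max(pool)[1]              # best-scoring prompt wins
```

The real optimizer also tracks a Pareto frontier of candidates (the "Genetic-Pareto" in the name) rather than greedily keeping a single best, which is part of where the reported gains come from.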
Kimi.ai (@kimi_moonshot)

Say hi to OK Computer, Kimi's agent mode 🤖🎸 Your AI product & engineering team, all in one.

✨ From chat → multi-page websites, mobile-first designs, editable slides
✨ From up to 1 million rows of data → interactive dashboards
✨ Agency: self-scopes, surveys & designs ✨

Grant Lee (@thisisgrantlee)

Gamma crossed $50M ARR with 28 employees and more cash in the bank than we had raised ($23M)

In hindsight: We got here because we ignored common VC advice.

Examples of glaringly bad advice that you should ignore to save you $10M+ and years of time, like we did for Gamma:
Simon Willison (@simonw)

If you hide the system prompt and tool descriptions for your LLM agent, what you're actually doing is taking the single most detailed set of documentation for your service and deliberately hiding it from your most sophisticated users!

Aman Arora (@amaarora)

I noticed that the Qwen3Guard paper did not mention latency overhead, so I created an endpoint to find out for myself.

TL;DR: ~1s for the 0.6B model and ~1.4s for the 4B model of latency overhead per request!! 😮

Served using Modal - here is the script: github.com/amaarora/scrip…
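
The linked script is truncated above, but the measurement itself is easy to reproduce. A minimal sketch of timing a deployed guard-model endpoint (the URL and payload shape are hypothetical, not taken from the actual script):

```python
# Time per-request latency of a guard-model endpoint.
# ENDPOINT is a made-up placeholder, not the real deployment.
import time
import statistics
import requests

ENDPOINT = "https://example--qwen3guard.modal.run/classify"  # hypothetical URL

def time_request(text: str) -> float:
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json={"text": text}, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - start

time_request("warm-up")  # discard the first call: serverless cold starts inflate it
latencies = [time_request("Is this prompt safe to answer?") for _ in range(20)]
print(f"median: {statistics.median(latencies):.3f}s  "
      f"p95: {sorted(latencies)[int(0.95 * len(latencies))]:.3f}s")
```

One caveat when benchmarking serverless endpoints like Modal's: cold starts and network round-trip time both land in the measured number, so the guard model's own compute is only part of the reported ~1s.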
DeepSeek (@deepseek_ai)

🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model!

✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context.

👉 Now live on App, Web, and API.

💰 API prices cut by 50%+!

1/n

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention

NEW DEEPSEEK MODEL RELEASE AND PAPER

"We introduce DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through
Aran Komatsuzaki (@arankomatsuzaki)

ReasoningBank: memory for self-evolving LLM agents

• Distills strategies from both successes & failures
• Enables agents to learn, reuse, and improve over time
• Outperforms prior memory methods on web & SWE tasks (+34.2% eff., –16% steps)
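
The summary above maps to a small data structure plus two operations. A minimal sketch (not the paper's code; the distillation call and keyword retrieval are simplified stand-ins for the paper's LLM distiller and embedding-based retrieval):

```python
# Minimal ReasoningBank-style memory: distill lessons from trajectories,
# retrieve relevant ones for a new task. Simplified illustration only.
from dataclasses import dataclass, field

def distill(trajectory: str, succeeded: bool) -> str:
    """Placeholder for an LLM call that extracts a transferable lesson."""
    tag = "do" if succeeded else "avoid"
    return f"({tag}) lesson from: {trajectory[:60]}"

@dataclass
class MemoryItem:
    task: str
    strategy: str          # distilled from a success OR a failure
    succeeded: bool

@dataclass
class ReasoningBank:
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, task: str, trajectory: str, succeeded: bool) -> None:
        self.items.append(MemoryItem(task, distill(trajectory, succeeded), succeeded))

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        # Crude keyword overlap; the paper uses semantic similarity.
        words = set(task.lower().split())
        ranked = sorted(self.items, reverse=True,
                        key=lambda m: len(words & set(m.task.lower().split())))
        return [m.strategy for m in ranked[:k]]

bank = ReasoningBank()
bank.add("book a flight", "clicked the wrong date picker before finding the form",
         succeeded=False)
print(bank.retrieve("book a hotel"))   # failure lessons are retrievable too
```

The key point the bullet list makes is that failures are first-class memories: a distilled "avoid" lesson is stored and retrieved the same way as a success strategy.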
Jules (@julesagent)

Jules now has Memory. 🧠 Jules automatically learns your preferences and project conventions over time, applying them to future tasks to get better at working on your code. You always have the ability to edit or turn memory off. The changelog has the details →

Simon Willison (@simonw)

If you've been trying to figure out DSPy - the automatic prompt optimization system - this talk by Drew Breunig is the clearest explanation I've seen yet, with a very useful real-world case study youtube.com/watch?v=I9Ztkg… My notes here: simonwillison.net/2025/Oct/4/dre…
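
For anyone pairing the talk with code, the core DSPy flow is small enough to sketch here (assumes a recent dspy release; the model name and toy trainset are placeholders):

```python
# Minimal DSPy flow: declare a signature, then let an optimizer tune the prompt.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))   # placeholder model choice

# A signature declares inputs/outputs; DSPy writes the actual prompt for you.
qa = dspy.ChainOfThought("question -> answer")

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more labeled examples
]

def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

# The optimizer bootstraps few-shot demos that maximize the metric on trainset.
optimized_qa = dspy.BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)
print(optimized_qa(question="What is 3 + 5?").answer)
```

The point the talk (and Simon's notes) makes is that you never hand-edit the prompt: you edit the metric and the examples, and the optimizer searches prompt space for you.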

Ethan Mollick (@emollick)

This paper shows that you can predict actual purchase intent (90% accuracy) by asking an LLM to impersonate a customer with a demographic profile, giving it a product & having it give its impressions, which another AI rates.

No fine-tuning or training & beats classic ML methods.
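
Reading the summary, the pipeline is two LLM calls: one to role-play the customer, one to rate the resulting impressions. Everything below (prompts, rating scale, model choice) is my guess at the shape of the protocol, not the paper's exact method:

```python
# Two-step purchase-intent sketch: persona impressions, then a rater scores them.
# Prompts, scale, and model are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def impressions(profile: str, product: str) -> str:
    """Step 1: the LLM impersonates a customer and reacts to the product."""
    r = client.chat.completions.create(model=MODEL, messages=[
        {"role": "system", "content": f"You are this customer: {profile}."},
        {"role": "user", "content": f"Give your honest impressions of: {product}"},
    ])
    return r.choices[0].message.content

def rate_intent(text: str) -> int:
    """Step 2: a second call rates the expressed purchase intent 1-5."""
    r = client.chat.completions.create(model=MODEL, messages=[
        {"role": "user", "content": "Rate the purchase intent expressed below from "
         "1 (none) to 5 (certain). Reply with the number only.\n\n" + text},
    ])
    return int(r.choices[0].message.content.strip())

score = rate_intent(impressions("35-year-old urban parent, budget-conscious",
                                "a $400 self-cleaning coffee maker"))
print(score)
```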
wh (@nrehiew_)

New post! This time, about the current state of Long Context Evaluation.

I discuss existing benchmarks, what makes a good long context eval, what's missing from existing ones and introduce a new one - LongCodeEdit :)