Aman Arora (@amaarora)'s Twitter Profile
Aman Arora

@amaarora

Building AI Agents 🤖 | Blog: amaarora.github.io | Previously: @weights_biases; @Harrison.ai

ID: 2582562763

Link: http://amaarora.github.io · Joined: 22-06-2014 17:05:12

3.3K Tweets

5.5K Followers

1.1K Following

clem 🤗 (@clementdelangue)

We need better agent evaluations! Glad to have collaborated with Meta Super Intelligence Lab to release Gaia2 and ARE!

GPT5 (high) from OpenAI is leading on execution, search, ambiguity, adaptability and noise.

Kimi-K2 from Kimi.ai is leading open weight.

Full
Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…

Matei Zaharia (@matei_zaharia)

Prompt optimization is becoming a powerful technique for improving AI that can even beat SFT! Here are some of our research results with GEPA at Databricks, in difficult Agent Bricks info extraction tasks. We can match the best models at 90x lower cost, or improve them by ~6%.

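For readers who want the mechanics: below is a toy sketch of the kind of evolutionary prompt-optimization loop GEPA performs. This is not the GEPA or Databricks implementation; `call_llm`, the mutation prompt, and the exact-match metric are all illustrative assumptions.

```python
# Toy evolutionary prompt-optimization loop in the spirit of GEPA.
# NOT the real implementation; call_llm and the metric are placeholders.

def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder: swap in your actual model endpoint."""
    raise NotImplementedError

def evaluate(prompt: str, devset: list[tuple[str, str]]) -> float:
    """Score a candidate prompt by exact-match accuracy on (input, expected) pairs."""
    return sum(call_llm(prompt, x) == y for x, y in devset) / len(devset)

def mutate(prompt: str) -> str:
    """GEPA mutates via LLM reflection on failure traces; this blind
    rewrite is a simplified stand-in."""
    return call_llm("Rewrite this instruction to be clearer and more specific.", prompt)

def optimize(seed_prompt: str, devset, rounds: int = 10) -> str:
    pool = [(evaluate(seed_prompt, devset), seed_prompt)]
    for _ in range(rounds):
        _, parent = max(pool)        # greedily mutate the best candidate so far
        child = mutate(parent)
        pool.append((evaluate(child, devset), child))
    return max(pool)[1]              # best-scoring prompt wins
```

The real optimizer also tracks a Pareto frontier of candidates (the "Genetic-Pareto" in the name) rather than greedily keeping a single best, which is part of where the reported gains come from.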
Kimi.ai (@kimi_moonshot)

Say hi to OK Computer, Kimi's agent mode 🤖🎸 Your AI product & engineering team, all in one.

✨ From chat → multi-page websites, mobile-first designs, editable slides
✨ From up to 1 million rows of data → interactive dashboards
✨ Agency: self-scopes, surveys & designs ✨

Grant Lee (@thisisgrantlee)

Gamma crossed $50M ARR with 28 employees and more cash in the bank than we had raised ($23M)

In hindsight: We got here because we ignored common VC advice.

Examples of glaringly bad advice that you should ignore to save you $10M+ and years of time, like we did for Gamma:
Simon Willison (@simonw)

If you hide the system prompt and tool descriptions for your LLM agent, what you're actually doing is taking the single most detailed set of documentation for your service and deliberately hiding it from your most sophisticated users!

Aman Arora (@amaarora)

I noticed that the Qwen3Guard paper did not mention latency overhead, so I created an endpoint to find out for myself.

TL;DR: ~1s for the 0.6B model and ~1.4s for the 4B model of latency overhead per request!! 😮

Served using Modal - here is the script: github.com/amaarora/scrip…
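
The linked script is truncated above, but the measurement itself is easy to reproduce. A minimal sketch of timing a deployed guard-model endpoint (the URL and payload shape are hypothetical, not taken from the actual script):

```python
# Time per-request latency of a guard-model endpoint.
# ENDPOINT is a made-up placeholder, not the real deployment.
import time
import statistics
import requests

ENDPOINT = "https://example--qwen3guard.modal.run/classify"  # hypothetical URL

def time_request(text: str) -> float:
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json={"text": text}, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - start

time_request("warm-up")  # discard the first call: serverless cold starts inflate it
latencies = [time_request("Is this prompt safe to answer?") for _ in range(20)]
print(f"median: {statistics.median(latencies):.3f}s  "
      f"p95: {sorted(latencies)[int(0.95 * len(latencies))]:.3f}s")
```

One caveat when benchmarking serverless endpoints like Modal's: cold starts and network round-trip time both land in the measured number, so the guard model's own compute is only part of the reported ~1s.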
DeepSeek (@deepseek_ai)

🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model!

✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context.

👉 Now live on App, Web, and API.

💰 API prices cut by 50%+!

1/n

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention

NEW DEEPSEEK MODEL RELEASE AND PAPER

"We introduce DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through
Aran Komatsuzaki (@arankomatsuzaki)

ReasoningBank: memory for self-evolving LLM agents

• Distills strategies from both successes & failures
• Enables agents to learn, reuse, and improve over time
• Outperforms prior memory methods on web & SWE tasks (+34.2% eff., –16% steps)
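
The summary above maps to a small data structure plus two operations. A minimal sketch (not the paper's code; the distillation call and keyword retrieval are simplified stand-ins for the paper's LLM distiller and embedding-based retrieval):

```python
# Minimal ReasoningBank-style memory: distill lessons from trajectories,
# retrieve relevant ones for a new task. Simplified illustration only.
from dataclasses import dataclass, field

def distill(trajectory: str, succeeded: bool) -> str:
    """Placeholder for an LLM call that extracts a transferable lesson."""
    tag = "do" if succeeded else "avoid"
    return f"({tag}) lesson from: {trajectory[:60]}"

@dataclass
class MemoryItem:
    task: str
    strategy: str          # distilled from a success OR a failure
    succeeded: bool

@dataclass
class ReasoningBank:
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, task: str, trajectory: str, succeeded: bool) -> None:
        self.items.append(MemoryItem(task, distill(trajectory, succeeded), succeeded))

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        # Crude keyword overlap; the paper uses semantic similarity.
        words = set(task.lower().split())
        ranked = sorted(self.items, reverse=True,
                        key=lambda m: len(words & set(m.task.lower().split())))
        return [m.strategy for m in ranked[:k]]

bank = ReasoningBank()
bank.add("book a flight", "clicked the wrong date picker before finding the form",
         succeeded=False)
print(bank.retrieve("book a hotel"))   # failure lessons are retrievable too
```

The key point the bullet list makes is that failures are first-class memories: a distilled "avoid" lesson is stored and retrieved the same way as a success strategy.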
Jules (@julesagent)

Jules now has Memory. 🧠 Jules automatically learns your preferences and project conventions over time, applying them to future tasks to get better at working on your code. You always have the ability to edit or turn memory off. The changelog has the details →

Simon Willison (@simonw)

If you've been trying to figure out DSPy - the automatic prompt optimization system - this talk by Drew Breunig is the clearest explanation I've seen yet, with a very useful real-world case study youtube.com/watch?v=I9Ztkg… My notes here: simonwillison.net/2025/Oct/4/dre…
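
For anyone pairing the talk with code, the core DSPy flow is small enough to sketch here (assumes a recent dspy release; the model name and toy trainset are placeholders):

```python
# Minimal DSPy flow: declare a signature, then let an optimizer tune the prompt.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))   # placeholder model choice

# A signature declares inputs/outputs; DSPy writes the actual prompt for you.
qa = dspy.ChainOfThought("question -> answer")

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more labeled examples
]

def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

# The optimizer bootstraps few-shot demos that maximize the metric on trainset.
optimized_qa = dspy.BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)
print(optimized_qa(question="What is 3 + 5?").answer)
```

The point the talk (and Simon's notes) makes is that you never hand-edit the prompt: you edit the metric and the examples, and the optimizer searches prompt space for you.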

Ethan Mollick (@emollick)

This paper shows that you can predict actual purchase intent (90% accuracy) by asking an LLM to impersonate a customer with a demographic profile, giving it a product & having it give its impressions, which another AI rates.

No fine-tuning or training & beats classic ML methods.
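
Reading the summary, the pipeline is two LLM calls: one to role-play the customer, one to rate the resulting impressions. Everything below (prompts, rating scale, model choice) is my guess at the shape of the protocol, not the paper's exact method:

```python
# Two-step purchase-intent sketch: persona impressions, then a rater scores them.
# Prompts, scale, and model are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def impressions(profile: str, product: str) -> str:
    """Step 1: the LLM impersonates a customer and reacts to the product."""
    r = client.chat.completions.create(model=MODEL, messages=[
        {"role": "system", "content": f"You are this customer: {profile}."},
        {"role": "user", "content": f"Give your honest impressions of: {product}"},
    ])
    return r.choices[0].message.content

def rate_intent(text: str) -> int:
    """Step 2: a second call rates the expressed purchase intent 1-5."""
    r = client.chat.completions.create(model=MODEL, messages=[
        {"role": "user", "content": "Rate the purchase intent expressed below from "
         "1 (none) to 5 (certain). Reply with the number only.\n\n" + text},
    ])
    return int(r.choices[0].message.content.strip())

score = rate_intent(impressions("35-year-old urban parent, budget-conscious",
                                "a $400 self-cleaning coffee maker"))
print(score)
```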
wh (@nrehiew_)

New post! This time, about the current state of Long Context Evaluation.

I discuss existing benchmarks, what makes a good long context eval, what's missing from existing ones and introduce a new one - LongCodeEdit :)