Dheeraj Mekala (@mekaladheeraj)'s Twitter Profile
Dheeraj Mekala

@mekaladheeraj

Ph.D. student at @UCSanDiego. Research Scientist Intern at Llama Research @MetaAI
Previously FAIR, @msftresearch, @AmazonScience, @iitkanpur

Data! Data! Data!

ID: 1003938762382905344

Website: http://dheeraj7596.github.io/ · Joined: 05-06-2018 09:57:08

732 Tweets

1.1K Followers

374 Following

Rohan Paul (@rohanpaul_ai)

LLMs act sub-optimally in decisions due to greediness, frequency bias, and a knowing-doing gap.

A classic Google DeepMind paper.

Shows why LLM agents make poor decisions and how reinforcement learning fine-tuning fixes a chunk of it.

Tic-tac-toe win rate jumps from 15% to 75% after
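For intuition on the RL fine-tuning step, here is a toy REINFORCE loop on tic-tac-toe against a random opponent. This is a sketch only: the paper fine-tunes an LLM policy, while this uses a tabular softmax policy, and the learning rate and episode count are arbitrary; the 15% → 75% figure is the tweet's claim about the paper, not something this toy reproduces.

```python
# Toy REINFORCE on tic-tac-toe vs. a random opponent (illustrative, not the paper's recipe).
import math
import random
from collections import defaultdict

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != 0 and b[i] == b[j] == b[k]:
            return b[i]
    return 0

logits = defaultdict(lambda: [0.0] * 9)   # state -> action preferences

def sample_action(state):
    # Masked softmax over legal moves only.
    legal = [a for a in range(9) if state[a] == 0]
    prefs = logits[state]
    mx = max(prefs[a] for a in legal)
    exps = {a: math.exp(prefs[a] - mx) for a in legal}
    z = sum(exps.values())
    probs = {a: e / z for a, e in exps.items()}
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    return action, probs

def play_episode():
    # Agent is player 1; opponent (player 2) moves uniformly at random.
    board, traj = [0] * 9, []
    while True:
        s = tuple(board)
        a, probs = sample_action(s)
        traj.append((s, a, probs))
        board[a] = 1
        if winner(board) == 1:
            return traj, 1.0
        if 0 not in board:
            return traj, 0.0
        board[random.choice([i for i in range(9) if board[i] == 0])] = 2
        if winner(board) == 2:
            return traj, -1.0

LR, wins = 0.5, 0
for ep in range(1, 20001):
    traj, ret = play_episode()
    wins += ret > 0
    for s, a, probs in traj:
        # REINFORCE: logit += lr * return * d log pi(a|s) / d logit(a')
        for act, p in probs.items():
            logits[s][act] += LR * ret * ((act == a) - p)
    if ep % 5000 == 0:
        print(f"after {ep} episodes: win rate {wins / ep:.2f}")
```

The update is the standard policy gradient: for a softmax policy, d log π(a|s)/d logit(a') = 1[a'=a] − π(a'|s), scaled here by the undiscounted episode return.
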
Clémentine Fourrier 🍊 (@clefourrier)

Wanna upgrade your agent game? With AI at Meta, we're releasing 2 incredibly cool artefacts:

- GAIA 2: assistant evaluation with a twist (new: adaptability, robustness to failure & time sensitivity)
- ARE, an agent research environment to empower all!

huggingface.co/blog/gaia2

Romain Froger (@froger_romain)

Most agent benchmarks assume static, perfect worlds. But real life is asynchronous, noisy, and ambiguous. 🌍

🚀 Meet Gaia2 + ARE: a new benchmark and open-source platform for creating environments and evaluating AI agents in (more) realistic environments.
Thomas Scialom (@thomasscialom)

🚀 ARE: scaling up agent environments and evaluations

Everyone talks about RL envs so we built one we actually use. In the second half of AI, evals & envs are the bottleneck.
Today we OSS it all: Meta Agent Research Environment + GAIA-2 (code, demo, evals).
🔗Links👇
Ulyana Piterbarg (@ulyanapiterbarg)

Proud to have been part of the team behind Gaia2 and ARE!

ARE = a gym/platform for scaling up LLM agent envs for evals & RL
Gaia2 = a new benchmark for hard & practical agent tasks (search, execution, ambiguity, time, noise, & multi-agent)

tinyurl.com/aregaia2
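As a sketch of what "a gym/platform for agent envs" means in practice, here is a minimal, hypothetical environment-plus-agent loop. None of the class or method names below are ARE's actual API; see the link above for the real interface.

```python
# Hypothetical gym-style agent-environment loop; all names are made up for illustration.
from dataclasses import dataclass

@dataclass
class Observation:
    messages: list        # app notifications, user turns, tool results
    done: bool = False

class ToyAgentEnv:
    """Stand-in for an ARE-like environment: apps, events, and a task verifier."""
    def __init__(self, task: str):
        self.task, self.steps = task, 0

    def reset(self) -> Observation:
        self.steps = 0
        return Observation(messages=[f"user: {self.task}"])

    def step(self, action: str):
        self.steps += 1
        solved = action.startswith("send_email")   # toy verifier for the toy task
        done = solved or self.steps >= 5
        return Observation(messages=[f"tool: ok ({action})"], done=done), float(solved)

def dummy_agent(obs: Observation) -> str:
    # A real agent would call an LLM over obs.messages; this one is hard-coded.
    return 'send_email(to="alice", body="report is ready")'

env = ToyAgentEnv("email Alice when the report is ready")
obs, total = env.reset(), 0.0
while not obs.done:
    obs, reward = env.step(dummy_agent(obs))
    total += reward
print("task reward:", total)
```

The same loop serves both uses Ulyana lists: score the reward for evals, or feed it back into training for RL.
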
elvis (@omarsar0)

Very cool work from Meta Superintelligence Lab.

They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments.

Great resource to stress-test agents in environments closer to real apps.

Read on for more:
Grégoire Mialon (@mialon_gregoire)

🏗️ ARE: scaling up agent environments and evaluations

In the LLM+RL era, evals and envs are the bottleneck.
Happy to release Gaia2, an extensible benchmark for agents aiming to reduce the sim2real gap, + ARE, the platform in which Gaia2 is built.
Enjoy evaluating your agents!

👇
Guohao Li (Hiring!) 🐫 (@guohao_li)

Most exciting agent environment release so far! Many people have asked what the difference is between a benchmark and an environment; I guess this work answers part of it. Benchmarks are usually static. Real-world agent environments are usually dynamic and noisy. The event-based,
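To make that benchmark-vs-environment contrast concrete, here is a toy sketch of an event-based environment: events fire on the world's clock, so an agent's decision latency changes what it observes. All names are illustrative, not ARE's API.

```python
# Toy event-driven world: events fire on the world's clock, not the agent's.
import heapq

class EventEnv:
    def __init__(self, scheduled):          # scheduled: [(fire_time, event), ...]
        self.queue = list(scheduled)
        heapq.heapify(self.queue)
        self.clock = 0.0

    def step(self, action: str, latency: float):
        # Advance world time by the agent's decision latency and deliver
        # every event that fired while the agent was "thinking".
        self.clock += latency
        fired = []
        while self.queue and self.queue[0][0] <= self.clock:
            fired.append(heapq.heappop(self.queue)[1])
        return fired

for latency in (0.5, 3.0):                  # a fast agent vs. a slow one
    env = EventEnv([(1.0, "msg: meeting moved up to 3pm"),
                    (2.5, "msg: your cab has arrived")])
    print(f"latency {latency}: events seen ->", env.step("check_inbox()", latency))
```

The fast agent (latency 0.5) sees no new events on its first turn, while the slow one (latency 3.0) finds two queued messages at once and may already have missed its window to act; a static benchmark cannot distinguish the two.
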

Clémentine Fourrier 🍊 (@clefourrier)

Did you see that the Agent Research Environment is MCP-compatible? That means using any MCP tools with any agent is now completely trivial! Check it out! We've used an LLM agent to 1) move a robot arm remotely, 2) conditioned on real-time web search results! :D How-to in thread ^^
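For reference, here is a minimal client loop using the official Python `mcp` SDK, which is roughly what "any MCP tools with any agent" enables. The server command and the tool name below are placeholders, not ARE specifics.

```python
# Minimal MCP client sketch: connect to a tool server, list its tools, call one.
# "server.py" and "web_search" are placeholder names for illustration.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [t.name for t in tools.tools])
            # An agent would choose a tool and arguments from the list; hard-coded here.
            result = await session.call_tool("web_search", arguments={"query": "robot arm"})
            print(result.content)

asyncio.run(main())
```

Because the session exposes tools behind one uniform interface, swapping the robot arm for a search backend (or any other MCP server) changes only the server parameters, not the agent.
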

clem 🤗 (@clementdelangue)

We need better agent evaluations! Glad to have collaborated with Meta Superintelligence Lab to release Gaia2 and ARE!

GPT5 (high) from OpenAI is leading on execution, search, ambiguity, adaptability, and noise.

Kimi-K2 from Kimi.ai is the leading open-weight model.

Full
Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…