Dheeraj Mekala (@mekaladheeraj)'s Twitter Profile
Dheeraj Mekala

@mekaladheeraj

Ph.D. student at @UCSanDiego. Research Scientist Intern at Llama Research @MetaAI
Previously FAIR, @msftresearch, @AmazonScience, @iitkanpur

Data! Data! Data!

ID: 1003938762382905344

Website: http://dheeraj7596.github.io/ · Joined: 05-06-2018 09:57:08

732 Tweets

1.1K Followers

374 Following

Rohan Paul (@rohanpaul_ai)

LLMs act sub-optimally in decisions due to greediness, frequency bias, and a knowing-doing gap.

A classic Google DeepMind paper.

Shows why LLM agents make poor decisions and how reinforcement learning fine-tuning fixes a chunk of it.

Tic-tac-toe win rate jumps from 15% to 75% after
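For intuition on the RL fine-tuning step, here is a toy REINFORCE loop on tic-tac-toe against a random opponent. This is a sketch only: the paper fine-tunes an LLM policy, while this uses a tabular softmax policy, and the learning rate and episode count are arbitrary; the 15% → 75% figure is the tweet's claim about the paper, not something this toy reproduces.

```python
# Toy REINFORCE on tic-tac-toe vs. a random opponent (illustrative, not the paper's recipe).
import math
import random
from collections import defaultdict

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != 0 and b[i] == b[j] == b[k]:
            return b[i]
    return 0

logits = defaultdict(lambda: [0.0] * 9)   # state -> action preferences

def sample_action(state):
    # Masked softmax over legal moves only.
    legal = [a for a in range(9) if state[a] == 0]
    prefs = logits[state]
    mx = max(prefs[a] for a in legal)
    exps = {a: math.exp(prefs[a] - mx) for a in legal}
    z = sum(exps.values())
    probs = {a: e / z for a, e in exps.items()}
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    return action, probs

def play_episode():
    # Agent is player 1; opponent (player 2) moves uniformly at random.
    board, traj = [0] * 9, []
    while True:
        s = tuple(board)
        a, probs = sample_action(s)
        traj.append((s, a, probs))
        board[a] = 1
        if winner(board) == 1:
            return traj, 1.0
        if 0 not in board:
            return traj, 0.0
        board[random.choice([i for i in range(9) if board[i] == 0])] = 2
        if winner(board) == 2:
            return traj, -1.0

LR, wins = 0.5, 0
for ep in range(1, 20001):
    traj, ret = play_episode()
    wins += ret > 0
    for s, a, probs in traj:
        # REINFORCE: logit += lr * return * d log pi(a|s) / d logit(a')
        for act, p in probs.items():
            logits[s][act] += LR * ret * ((act == a) - p)
    if ep % 5000 == 0:
        print(f"after {ep} episodes: win rate {wins / ep:.2f}")
```

The update is the standard policy gradient: for a softmax policy, d log π(a|s)/d logit(a') = 1[a'=a] − π(a'|s), scaled here by the undiscounted episode return.
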
Clémentine Fourrier 🍊 (@clefourrier)

Wanna upgrade your agent game? With AI at Meta, we're releasing 2 incredibly cool artefacts:

- GAIA 2: assistant evaluation with a twist (new: adaptability, robustness to failure & time sensitivity)
- ARE, an agent research environment to empower all!

huggingface.co/blog/gaia2

Romain Froger (@froger_romain)

Most agent benchmarks assume static, perfect worlds. But real life is asynchronous, noisy, and ambiguous. 🌍

🚀 Meet Gaia2 + ARE: a new benchmark and open-source platform for creating environments and evaluating AI agents in (more) realistic environments.
Thomas Scialom (@thomasscialom)

🚀 ARE: scaling up agent environments and evaluations

Everyone talks about RL envs so we built one we actually use. In the second half of AI, evals & envs are the bottleneck.
Today we OSS it all: Meta Agent Research Environment + GAIA-2 (code, demo, evals).
🔗Links👇
Ulyana Piterbarg (@ulyanapiterbarg)

Proud to have been part of the team behind Gaia2 and ARE!

ARE = a gym/platform for scaling up LLM agent envs for evals & RL
Gaia2 = a new benchmark for hard & practical agent tasks (search, execution, ambiguity, time, noise, & multi-agent)

tinyurl.com/aregaia2
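As a sketch of what "a gym/platform for agent envs" means in practice, here is a minimal, hypothetical environment-plus-agent loop. None of the class or method names below are ARE's actual API; see the link above for the real interface.

```python
# Hypothetical gym-style agent-environment loop; all names are made up for illustration.
from dataclasses import dataclass

@dataclass
class Observation:
    messages: list        # app notifications, user turns, tool results
    done: bool = False

class ToyAgentEnv:
    """Stand-in for an ARE-like environment: apps, events, and a task verifier."""
    def __init__(self, task: str):
        self.task, self.steps = task, 0

    def reset(self) -> Observation:
        self.steps = 0
        return Observation(messages=[f"user: {self.task}"])

    def step(self, action: str):
        self.steps += 1
        solved = action.startswith("send_email")   # toy verifier for the toy task
        done = solved or self.steps >= 5
        return Observation(messages=[f"tool: ok ({action})"], done=done), float(solved)

def dummy_agent(obs: Observation) -> str:
    # A real agent would call an LLM over obs.messages; this one is hard-coded.
    return 'send_email(to="alice", body="report is ready")'

env = ToyAgentEnv("email Alice when the report is ready")
obs, total = env.reset(), 0.0
while not obs.done:
    obs, reward = env.step(dummy_agent(obs))
    total += reward
print("task reward:", total)
```

The same loop serves both uses Ulyana lists: score the reward for evals, or feed it back into training for RL.
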
elvis (@omarsar0)

Very cool work from Meta Superintelligence Lab.

They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments.

Great resource to stress-test agents in environments closer to real apps.

Read on for more:
Grégoire Mialon (@mialon_gregoire)

🏗️ ARE: scaling up agent environments and evaluations

In the LLM+RL era, evals and envs are the bottleneck.
Happy to release Gaia2, an extensible benchmark for agents aiming to reduce the sim2real gap, + ARE, the platform in which Gaia2 is built.
Enjoy evaluating your agents!

👇
Guohao Li (Hiring!) 🐫 (@guohao_li)

Most exciting agent environment release so far! Many people have asked what the difference is between a benchmark and an environment; I guess this work answers part of it. Benchmarks are usually static. Real-world agent environments are usually dynamic and noisy. The event-based,
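To make that benchmark-vs-environment contrast concrete, here is a toy sketch of an event-based environment: events fire on the world's clock, so an agent's decision latency changes what it observes. All names are illustrative, not ARE's API.

```python
# Toy event-driven world: events fire on the world's clock, not the agent's.
import heapq

class EventEnv:
    def __init__(self, scheduled):          # scheduled: [(fire_time, event), ...]
        self.queue = list(scheduled)
        heapq.heapify(self.queue)
        self.clock = 0.0

    def step(self, action: str, latency: float):
        # Advance world time by the agent's decision latency and deliver
        # every event that fired while the agent was "thinking".
        self.clock += latency
        fired = []
        while self.queue and self.queue[0][0] <= self.clock:
            fired.append(heapq.heappop(self.queue)[1])
        return fired

for latency in (0.5, 3.0):                  # a fast agent vs. a slow one
    env = EventEnv([(1.0, "msg: meeting moved up to 3pm"),
                    (2.5, "msg: your cab has arrived")])
    print(f"latency {latency}: events seen ->", env.step("check_inbox()", latency))
```

The fast agent (latency 0.5) sees no new events on its first turn, while the slow one (latency 3.0) finds two queued messages at once and may already have missed its window to act; a static benchmark cannot distinguish the two.
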

Clémentine Fourrier 🍊 (@clefourrier)

Did you see that the Agent Research Environment is MCP-compatible? That means using any MCP tools with any agent is now completely trivial! Check it out! We've used an LLM agent to 1) move a robot arm remotely, 2) conditioned on real-time web search results! :D How-to in thread ^^
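For reference, here is a minimal client loop using the official Python `mcp` SDK, which is roughly what "any MCP tools with any agent" enables. The server command and the tool name below are placeholders, not ARE specifics.

```python
# Minimal MCP client sketch: connect to a tool server, list its tools, call one.
# "server.py" and "web_search" are placeholder names for illustration.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [t.name for t in tools.tools])
            # An agent would choose a tool and arguments from the list; hard-coded here.
            result = await session.call_tool("web_search", arguments={"query": "robot arm"})
            print(result.content)

asyncio.run(main())
```

Because the session exposes tools behind one uniform interface, swapping the robot arm for a search backend (or any other MCP server) changes only the server parameters, not the agent.
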

clem 🤗 (@clementdelangue)

We need better agent evaluations! Glad to have collaborated with Meta Superintelligence Lab to release Gaia2 and ARE!

GPT5 (high) from OpenAI is leading on execution, search, ambiguity, adaptability, and noise.

Kimi-K2 from Kimi.ai is the leading open-weight model.

Full
Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…