Bobby (@bobbycxy) 's Twitter Profile
Bobby

@bobbycxy

Researching AI @ASTARsg | @TheAcornAI

ID: 1231231575964012547

http://textarena.ai

22-02-2020 14:57:44

21 Tweets

58 Followers

76 Following

Bobby (@bobbycxy) 's Twitter Profile Photo

My first tweet. And in honor of textarena and León Leshem (Legend) Choshen 🤖🤗 Henry Mao : I just won against GPT 4o mini in SpellingBee-v0 on TextArena! Check it out at textarena.ai api.textarena.ai/shared_img/09a…

Bobby (@bobbycxy) 's Twitter Profile Photo

I just won against GPT 4o mini in Tak-v0 on TextArena! Check it out at textarena.ai api.textarena.ai/shared_img/656…

León (@leonguertler) 's Twitter Profile Photo

Andrej Karpathy Perfect timing, we are just about to publish TextArena. A collection of 57 text-based games (30 in the first release) including single-player, two-player and multi-player games. We tried keeping the interface similar to OpenAI gym, made it very easy to add new games, and created

León (@leonguertler) 's Twitter Profile Photo

Competitive games with a fixed pace provide an excellent evaluation framework for balancing quality and speed in decision-making.

León (@leonguertler) 's Twitter Profile Photo

TextArena is live on arXiv! We present a benchmark of 57+ competitive text-based games to evaluate and train LLMs on agentic behavior — including negotiation, deception, theory of mind and many more. Real-time TrueSkill. Multiplayer support. Human-vs-models. Model-vs-model.

León (@leonguertler) 's Twitter Profile Photo

For the past ~2 months we have been working on training reasoning models on TextArena games. The first paper (introducing what we think is a very promising new paradigm) will hopefully be up later this week / early next; and the second one, focusing on the "scaling laws" of

Kevin Wang (@kevinwang_111) 's Twitter Profile Photo

Excited to announce the Mindgame @NeurIPS Competition is officially LIVE! 🤖 Pit your agents against others in Mafia, Codename, Prisoner’s Dilemma, Stag Hunt, and Colonel Blotto. Sign up now for $500 in compute credits on your initial run! 🔗 Register: mindgamesarena.com

Simon Yu (@simon_ycl) 's Twitter Profile Photo

Exciting to see more work on "Game as Benchmark", which is similar to our idea of TextArena (led by León) for benchmarking models on >60 games. though you can see GM Magnus Carlsen's comments on LLMs chess play 🔥

will brown (@willccbb) 's Twitter Profile Photo

something we've lost in the blogification of research is that citing prior work is often just not done at all, even when said work is quite similar + already broadly adopted (in this case, TextArena). especially sad when it's a big lab steamrolling the efforts of smaller teams