Kyle Montgomery (@kylepmont) Twitter Tweets • TwiCopy

Kyle Montgomery

a year ago

Excited to share our work at #ICLR2025! JudgeBench ⚖️ tests the reliability of LLM-based judges with a focus on objective correctness. JudgeBench converts tough 🧠 datasets in knowledge, reasoning, math & code into labeled response pairs, forcing objective grading over vibes.

4 likes · 0 replies · 1 repost

Kyle Montgomery

@kylepmont

6 months ago

Thrilled to have been a part of this release — looking forward to what’s coming next with rLLM!

3 likes · 0 replies · 2 reposts

Nicholas Crispino

@nrcrispino

5 months ago

Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉 Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess. LLMs now reach experts in math & coding, but can they *reason* in dynamic, multi-step

11 likes · 1 reply · 5 reposts

Chenguang Wang

@chenguangwang

5 months ago

🚀 So excited to have just received a research gift from Google to support our work on AI agents! Huge thanks to Google. 🙌 Come join us, let's build the future of agents together!

5 likes · 0 replies · 3 reposts

rLLM

@rllm_project

4 months ago

🚀 We just released rLLM v0.2.1, packed with several exciting new features! What’s new:
- rLLM SDK (preview): Turn your agents written in any framework (e.g. LangGraph, Strands) into continuous learners.
- Tinker backend: Run serverless RL training with Tinker as the backend.

28 likes · 1 reply · 7 reposts

Dawn Song

@dawnsongtweets

3 months ago

🚨 Excited to announce Agents in the Wild: Safety, Security, and Beyond, our workshop at ICLR 2026 (Apr 26–27, Rio de Janeiro)! AI agents are rapidly deployed in the real world—but safety & security lag behind. Submit your work to help shape this field: 🗓️ Submission deadline:

202 likes · 12 replies · 33 reposts

Chenguang Wang

@chenguangwang

3 months ago

🚀 Happy to receive the Tinker Research Grant from Thinking Machines to support our work on secure AI agents! 🙏 Huge thanks to amazing collaborators: my postdoc advisor Dawn Song (UC Berkeley), my student Jianhong Tu (UC Santa Cruz), and student collaborator Zhun Wang

4 likes · 0 replies · 3 reposts

Chenguang Wang

@chenguangwang

3 months ago

📢 Calling all reviewers! We are looking for reviewers by February 9th for our Agents in the Wild: Safety, Security, and Beyond workshop at ICLR 2026 (April 26-27, Rio)! Sign up to review: forms.gle/LpRnYnL3hQWDpF… 🌟 Featuring amazing speakers and panelists including

6 likes · 0 replies · 4 reposts

Chenguang Wang

@chenguangwang

3 months ago

🚨 Last call for papers! The submission deadline for the ICLR 2026 workshop, Agents in the Wild: Safety, Security, and Beyond, is tomorrow (Feb 5, 2026, AoE) for both regular and short paper tracks! 📝 Submit here: openreview.net/group?id=ICLR.… 🙏 Thanks to Agentic AI Weekly

4 likes · 0 replies · 3 reposts

Sijun Tan

@sijun_tan

2 months ago

Excited to collaborate with Snorkel AI on this project! Our member Manan Roongta led this and showed impressive results, post-training a 4B agent to outperform a frontier model on financial analysis. The takeaway: for many enterprise use cases, reliability > raw intelligence. A

13 likes · 0 replies · 8 reposts