Kyle Montgomery (@kylepmont) 's Twitter Profile
Kyle Montgomery

@kylepmont

PhD student at UC Santa Cruz

ID: 4775783984

18-01-2016 00:00:41

22 Tweets

46 Followers

38 Following

Kyle Montgomery (@kylepmont)

Excited to share our work at #ICLR2025! JudgeBench ⚖️ tests the reliability of LLM-based judges with a focus on objective correctness. JudgeBench converts tough 🧠 datasets in knowledge, reasoning, math & code into labeled response pairs, forcing objective grading over vibes.

Nicholas Crispino (@nrcrispino)

Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉 Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess. LLMs now reach experts in math & coding, but can they *reason* in dynamic, multi-step

Chenguang Wang (@chenguangwang)

🚀So excited to have just received a research gift from Google to support our work on AI agents! Huge thanks to Google. 🙌Come join us, let's build the future of agents together!

rLLM (@rllm_project)

🚀 We just released rLLM v0.2.1 — packed with several exciting new features! What’s new: - rLLM SDK (preview): Turn your agents written in any frameworks (e.g. LangGraph, Strands) into continuous learners. - Tinker backend: Run serverless RL training with Tinker as the backend.

Dawn Song (@dawnsongtweets)

🚨 Excited to announce Agents in the Wild: Safety, Security, and Beyond, our workshop at ICLR 2026 (Apr 26–27, Rio de Janeiro)! AI agents are rapidly deployed in the real world—but safety & security lag behind. Submit your work to help shape this field: 🗓️ Submission deadline:

Chenguang Wang (@chenguangwang)

🚀Happy to receive the Tinker Research Grant from Thinking Machines to support our work on secure AI agents! 🙏Huge thanks to amazing collaborators: my postdoc advisor Dawn Song (UC Berkeley), and my student Jianhong Tu (UC Santa Cruz), and student collaborator Zhun Wang

Chenguang Wang (@chenguangwang)

📢 Calling all reviewers! We are looking for reviewers by February 9th for our Agents in the Wild: Safety, Security, and Beyond workshop at ICLR 2026 (April 26-27, Rio)! Sign up to review: forms.gle/LpRnYnL3hQWDpF… 🌟 Featuring amazing speakers and panelists including

Chenguang Wang (@chenguangwang)

🚨 Last call for papers! The submission deadline for the ICLR 2026 workshop, Agents in the Wild: Safety, Security, and Beyond, is tomorrow (Feb 5, 2026, AoE) for both regular and short paper tracks! 📝 Submit here: openreview.net/group?id=ICLR.… 🙏 Thanks to Agentic AI Weekly

Sijun Tan (@sijun_tan)

Excited to collaborate with Snorkel AI on this project! Our member Manan Roongta led this and showed impressive results post-training a 4B agent to outperform frontier models on financial analysis. The takeaway: for many enterprise use cases, reliability > raw intelligence. A