Xiao Liu (Shaw) (@shawliu12)'s Twitter Profile
Xiao Liu (Shaw)

@shawliu12

PhD @Tsinghua @THUKEG. Developing P-Tuning, ChatGLM, AgentBench, and AutoGLM. 📖 Sharing paper digests on LLMs.

ID: 1318409063324004354

Link: https://github.com/xiao9905 · Joined: 20-10-2020 04:30:29

109 Tweets

524 Followers

168 Following

Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo


Self-play with tree-search helps LLMs learn instruction-following.

SPAR introduces a self-play framework that enhances LLMs' instruction-following by minimizing irrelevant variations during training through tree-search refinement.

🤖 Original Problem:

→ Current
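The tweet describes SPAR's tree-search refinement only at a high level. A minimal, purely illustrative sketch of the general idea, with hypothetical `generate_refinements` and `judge_score` stand-ins for the model playing both roles (in true self-play, the same base model would generate and judge):

```python
def generate_refinements(response, width):
    # Hypothetical stand-in: propose `width` candidate refinements of a
    # response. A real system would sample these from the LLM itself.
    return [f"{response}+r{i}" for i in range(width)]

def judge_score(instruction, response):
    # Hypothetical stand-in: a real judge would be an LLM grading whether
    # `response` satisfies the constraints in `instruction`.
    return (sum(map(ord, instruction + response)) * 2654435761) % 1000

def tree_search_refine(instruction, response, width=3, depth=2):
    """Breadth-limited tree search: repeatedly refine a response, keeping
    only the best-scoring candidate at each depth. Because the accepted
    refinement shares most of its wording with the rejected original, the
    resulting preference pair differs mainly in instruction-following,
    not in irrelevant surface variation."""
    best = response
    for _ in range(depth):
        candidates = generate_refinements(best, width)
        best = max(candidates, key=lambda c: judge_score(instruction, c))
    return best

rejected = "draft answer"
refined = tree_search_refine("summarize in 3 bullets", rejected)
print((rejected, refined))  # (rejected, refined) pair for preference training
```

This is a sketch of the self-play loop's shape, not SPAR's actual pipeline; the paper's judge, search budget, and training objective are not shown in the tweet.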
Cunxiang Wang (@cunxiangwang)'s Twitter Profile Photo

Honored to have been involved throughout in the creation of the new Zhipu GLM, which ranks Top 9 on lmarena. But it would have been better to see the results a few days earlier (before the new Qwen-max) 🤣

Xiao Liu (Shaw) (@shawliu12)'s Twitter Profile Photo

Diving into the world of LLM agents! 🚀 Starting today, I'll share insights from the newest and sharpest papers I read. The agentic AI wave is rising—2025-2026 will be game-changing. Let’s explore, learn, and shape the future together! 🔥 #LLM #AgenticAI

Xiao Liu (Shaw) (@shawliu12)'s Twitter Profile Photo


#Apple uses RL to boost a 3.2B LLM phone-use agent to outperform #OpenAI o1 by 9%

It focuses on the problem of IDAs' poor performance in executing complex tasks, especially in digital environments that require multi-step interactions and state management. It addresses the
Xiao Liu (Shaw) (@shawliu12)'s Twitter Profile Photo


🔥 Top Chinese smartphone maker #Xiaomi unveils ReachAgent, a mobile AI agent framework

🚀 Boosts step-level IoU & accuracy by rethinking how agents handle GUI tasks. Breaking tasks into subtasks + a 2-stage process = smarter, faster results! 🧠📱#AI #LLM #AgenticAI #AGI
Xiao Liu (Shaw) (@shawliu12)'s Twitter Profile Photo


#Meta researchers have unveiled MLGym-Bench, the most comprehensive framework yet for evaluating the intelligence of LLMs in AI research

First-ever ML gym environment spanning CV, NLP, RL & game theory with 13 diverse tasks. Even GPT-4o & Claude-3.5 struggle with true
Casper Hansen (@casper_hansen_)'s Twitter Profile Photo


o3 competitor: GLM 4.5 by Zhipu AI
- hybrid reasoning model (on by default)
- trained on 15T tokens
- 128k context, 96k output tokens
- $0.11 / 1M tokens
- MoE: 355B A32B and 106B A12B

Benchmark details:
- tool calling: 90.6% success rate vs Sonnet’s 89.5% vs Kimi K2 86.2%
-
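The "MoE: 355B A32B" line uses the total/active-parameter notation: 355B parameters in total, with ~32B activated per token. A back-of-the-envelope sketch of why active parameters, not total, drive per-token inference compute (assuming the common rule of thumb of ~2 forward-pass FLOPs per active parameter per token):

```python
def flops_per_token(active_params_b):
    # Rough rule of thumb: ~2 FLOPs per *active* parameter per token
    # for a forward pass. Inputs are in billions of parameters.
    return 2 * active_params_b * 1e9

glm_45 = flops_per_token(32)      # GLM 4.5: 355B total, 32B active
glm_45_air = flops_per_token(12)  # the 106B A12B variant: 12B active
dense_355 = flops_per_token(355)  # hypothetical dense model of equal size

print(f"GLM 4.5 vs dense-355B per-token compute: {glm_45 / dense_355:.1%}")
# → GLM 4.5 vs dense-355B per-token compute: 9.0%
```

In other words, the model carries 355B parameters of capacity while costing roughly as much per token to run as a 32B dense model; memory footprint, of course, still scales with total parameters.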
Sam Paech (@sam_paech)'s Twitter Profile Photo


z.ai's GLM-4.5 gets a very strong result on EQ-Bench & Longform Writing.

In creative writing it's a little further down the pack near Gemma 3 27b & qwen3-235b-a22b.

Its lexical profile clusters nearest to R1-0528.
lmarena.ai (formerly lmsys.org) (@lmarena_ai)'s Twitter Profile Photo


🔥BREAKING: Z.ai’s GLM-4.5 enters the top-5 in Arena!

With 4K+ community votes, it now ranks #5 Overall in the Text Arena - matching DeepSeek-R1 and Kimi-K2 as the top open models.

Huge congrats to the Zai team on this incredible milestone and contribution to the open
Jiayi Weng (@trinkle23897)'s Twitter Profile Photo

Finally... OpenAI has internally talked about releasing an open-source model since 2022, and we got close a few times. Now it's here.

Jiayi Weng (@trinkle23897)'s Twitter Profile Photo

Harmony format is finally open-sourced. I still remember 3 years ago (before the ChatGPT release), Shengjia Zhao, Daniel, and I were brainstorming about the right abstraction for RL training, and that was the starting point of the entire harmony library. github.com/openai/harmony

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)'s Twitter Profile Photo


ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents

"To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale
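The quoted abstract describes orchestrating thousands of parallel desktop environments for online RL. A toy sketch of that rollout-collection pattern, with a dummy environment standing in for a virtual desktop (the real infrastructure would dispatch to VMs over RPC, not threads in one process):

```python
from concurrent.futures import ThreadPoolExecutor
import random

class DummyDesktopEnv:
    """Stand-in for one virtual desktop environment."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def rollout(self, policy, max_steps=8):
        # Collect one trajectory of (observation, action, reward) tuples.
        traj, obs = [], 0
        for _ in range(max_steps):
            action = policy(obs)
            reward = 1.0 if action == obs % 3 else 0.0  # toy reward rule
            traj.append((obs, action, reward))
            obs = self.rng.randrange(10)  # toy next observation
        return traj

def collect_parallel(policy, n_envs=16, workers=8):
    # Fan rollouts out across many environments at once; an RL learner
    # would consume the returned trajectories for a policy update (not shown).
    envs = [DummyDesktopEnv(seed=i) for i in range(n_envs)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: e.rollout(policy), envs))

trajs = collect_parallel(policy=lambda obs: obs % 3)
print(len(trajs), len(trajs[0]))  # → 16 8
```

This only illustrates the fan-out/fan-in shape of online rollout collection; ComputerRL's actual environment API, scale, and learner are described in the paper, not here.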
Alexander Doria (@dorialexander)'s Twitter Profile Photo

Rare research in the open on training for computer use from simulated experiences/desktop. General catch: you need not only RL environments but mid-training environments ("Trajectory Collection with Multiple General LLMs")

Xiao Liu (Shaw) (@shawliu12)'s Twitter Profile Photo

🚨Thrilled to share our latest progress on computer-use agents: ComputerRL, an end-to-end RL method that achieves a 48.1% success rate on the OSWorld benchmark with only a 9B open model, beating OpenAI Operator, Claude Sonnet 4.0, and other previous models for state-of-the-art performance.

DAIR.AI (@dair_ai)'s Twitter Profile Photo

Top AI Papers of The Week (August 18-24):
- ComputerRL
- Beyond GPT-5
- Chain-of-Agents
- Parallel Text Generation
- Retrieval-Augmented Reasoning
- Has GPT-5 Achieved Spatial Intelligence?
- Open Foundations for Computer-Use Agents

Read on for more: