Harbor Framework (@harborframework)'s Twitter Profile
Harbor Framework

@harborframework

ID: 2013765983048183808

Link: https://harborframework.com

Joined: 21-01-2026 00:10:10

30 Tweets

129 Followers

60 Following

Martian (@withmartian)

ARES is built around three pillars:
1. A familiar environment loop, modern concurrency
2. The correct RL training boundary
3. Integration with the OSS task ecosystem (Harbor Framework)

Josh Greaves (@joshua_gre63805)

ARES uses the Harbor task format (Alex Shaw). It comes with SWE-Bench Verified, TerminalBench2, SWESmith, and everything else in the Harbor ecosystem. We're also releasing 1k new JavaScript tasks with Vmax (Augustine Mavor-Parker, Matthew Sargent) to help the ecosystem grow.

Claude (@claudeai)

Opus 4.6 is state-of-the-art on several evaluations including agentic coding, multi-discipline reasoning, knowledge work, and agentic search. We're also shipping new features across Claude in Excel, Claude in PowerPoint, Claude Code, and our API to let Opus 4.6 do even more.

Alex Ratner (@ajratner)

Exciting mention of TBench 2.0 in today's model releases - congrats to Mike A. Merrill, Alex Shaw & team + proud of Snorkel AI's contributions! Benchmarks are just one (limited) measurement tool - but critical guideposts of frontier progress. Much more to build here ahead!

Marco Mascorro (@mascobot)

TerminalBench 2.0 & OSWorld benchmarks are having their moment - they seem to be one of the main focuses of the model benchmark reports, and I'm glad to finally see more focus on what models can do overall on a computer (beyond coding). We haven't even scratched the surface of

Alex Ratner (@ajratner)

Our ability to measure AI has been outpaced by our ability to develop it. This evaluation gap is one of the most critical problems in AI. Today, we’re excited to announce the Open Benchmarks Grants - with a starting $3M commitment from @Snorkel + support from @HuggingFace

Harbor Framework (@harborframework)

We are partnering with Snorkel AI on Open Benchmarks Grants. This is an amazing opportunity to build the next generation of great evals. Come build your benchmark with Harbor and Snorkel!

Viv (@vtrivedy10)

Building Better Coding Agent Harnesses
at LangChain we're thinking hard about the science of harness engineering + open research on what works & doesn't

A quick peek at our deepagents X Terminal Bench 2.0 work, shoutout to Alex Shaw & Harbor (they're great). Broad research

Paul Kuruvilla (@rohitpaulk)

Huge props to Alex Shaw and the folks at terminalbench / Harbor Framework - without Harbor, CCBench would’ve taken us months to ship instead of a week. We actually tried this last year and gave up. Only decided to give it another shot because Harbor was released.

Alex Shaw (@alexgshaw)

Your recurring reminder that Harbor has lots of datasets. Integrate once and get them all for free. harborframework.com/registry

Tyler Griggs (@tyler_griggs_)

SkyRL now implements the Tinker API. Now, training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: novasky-ai.notion.site/skyrl-tinker 🧵

Xiangyi Li 李向一 (@xdotli)

Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
