Harbor Framework (@harborframework) Twitter Tweets • TwiCopy

Martian

3 months ago

ARES is built around three pillars: 1. A familiar environment loop, modern concurrency 2. The correct RL training boundary 3. Integration with the OSS task ecosystem (Harbor Framework)

12 likes · 2 replies · 2 reposts

ARES uses the Harbor task format (Alex Shaw). It comes with SWE-Bench Verified, TerminalBench2, SWESmith, and everything else in the Harbor ecosystem. We're also releasing 1k new JavaScript tasks with Vmax (Augustine Mavor-Parker, Matthew Sargent) to help the ecosystem grow.

15 likes · 1 reply · 3 reposts

Will Myers

@mwilliammyers

3 months ago

Alex Shaw Gotta love free datasets! Harbor is pretty slick.

0 likes · 0 replies · 1 repost

Tyler Griggs

@tyler_griggs_

3 months ago

SkyRL 🤝 Ares 🤝 Harbor

22 likes · 0 replies · 3 reposts

Shriyash Upadhyay

@shriyashku

3 months ago

Joan Cabezas Martian Harbor Framework It's a good framework

0 likes · 0 replies · 1 repost

Claude

@claudeai

3 months ago

Opus 4.6 is state-of-the-art on several evaluations including agentic coding, multi-discipline reasoning, knowledge work, and agentic search. We're also shipping new features across Claude in Excel, Claude in PowerPoint, Claude Code, and our API to let Opus 4.6 do even more.

4.4K likes · 103 replies · 317 reposts

Alex Ratner

@ajratner

3 months ago

Exciting mention of TBench 2.0 in today's model releases - congrats to Mike A. Merrill, Alex Shaw & team + proud of Snorkel AI's contributions! Benchmarks are just one (limited) measurement tool - but critical guideposts of frontier progress. Much more to build here ahead!

40 likes · 0 replies · 9 reposts

Harbor Framework

@harborframework

3 months ago

Terminal-Bench 2.0 (built on Harbor) getting lots of love today!

13 likes · 0 replies · 2 reposts

Harbor Framework

@harborframework

3 months ago

More Harbor format RL training environments! Now on the Harbor registry: harborframework.com/registry/seta-…

18 likes · 0 replies · 3 reposts

Marco Mascorro

@mascobot

3 months ago

TerminalBench 2.0 & OSWorld benchmarks are having their moment - they seem to be one of the main focuses of the model benchmark reports, and I'm glad to finally see more focus on what models can do overall on a computer (beyond coding). We haven't even scratched the surface of

56 likes · 4 replies · 5 reposts

Jacek Migdal

@jakozaur

3 months ago

It is our other eval leveraging the great Harbor Framework by Alex Shaw and Ryan Marten from Laude Institute.

6 likes · 0 replies · 1 repost

Alex Ratner

@ajratner

3 months ago

Our ability to measure AI has been outpaced by our ability to develop it. This evaluation gap is one of the most critical problems in AI. Today, we’re excited to announce the Open Benchmarks Grants - with a starting $3M commitment from @Snorkel + support from @HuggingFace

51 likes · 3 replies · 14 reposts

Harbor Framework

@harborframework

3 months ago

We are partnering with Snorkel AI on Open Benchmarks Grants. This is an amazing opportunity to build the next generation of great evals. Come build your benchmark with Harbor and Snorkel!

19 likes · 2 replies · 3 reposts

Viv

@vtrivedy10

3 months ago

Building Better Coding Agent Harnesses: at LangChain we're thinking hard about the science of harness engineering + open research on what works & doesn't. A quick peek at our deepagents x Terminal Bench 2.0 work, shoutout to Alex Shaw & Harbor (they're great). Broad research

55 likes · 3 replies · 17 reposts

Paul Kuruvilla

@rohitpaulk

3 months ago

Huge props to Alex Shaw and the folks at terminalbench / Harbor Framework. Without Harbor, CCBench would’ve taken us months to ship instead of a week. We actually tried this last year and gave up. We only decided to give it another shot because Harbor was released.

13 likes · 0 replies · 2 reposts

Alex Shaw

@alexgshaw

3 months ago

Ship benchmarks in weeks with Harbor. Congrats Paul Kuruvilla and team!

19 likes · 0 replies · 2 reposts

Alex Shaw

@alexgshaw

3 months ago

Your recurring reminder that Harbor has lots of datasets. Integrate once and get them all for free. harborframework.com/registry

23 likes · 1 reply · 3 reposts

Tyler Griggs

@tyler_griggs_

3 months ago

SkyRL now implements the Tinker API. Now, training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: novasky-ai.notion.site/skyrl-tinker 🧵

160 likes · 3 replies · 40 reposts

Xiangyi Li 李向一

@xdotli

3 months ago

Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇

376 likes · 11 replies · 53 reposts
