Ross Taylor (@rosstaylor90)'s Twitter Profile
Ross Taylor

@rosstaylor90

Ship against the dying of the light. @GenReasoning Prev: reasoning lead @MetaAI, LLaMA 2/3, @paperswithcode co-creator, Galactica LLM lead, cofo Atlas ML (acq)

ID: 524807755

Link: http://rossjtaylor.com
Joined: 14-03-2012 22:51:10

2.2K Tweets

8.8K Followers

1.1K Following

Ross Taylor (@rosstaylor90)

We need a version of the Needle for model releases:

- Day 1 benchmarks (SWE-Bench Verified, MMLU Pro, HLE) - the model predictably looks good; needle points to “good vibes”.
- Day 2 benchmarks (Pelicans, EQBench…and eval bizarro world) - the model underperforms: the needle

Ross Taylor (@rosstaylor90)

Most takes on RL environments are bad.

1. There are hardly any high-quality RL environments and evals available. Most agentic environments and evals are flawed when you look at the details. It’s a crisis: and no one is talking about it because they’re being hoodwinked by labs

Ross Taylor (@rosstaylor90)

This was an LLM wars subtweet, and is both right and wrong in different ways. It’s wrong in the sense that you learn things at scale that you wouldn’t learn otherwise - so, on the contrary, scaling allows you to learn. You don’t want to overoptimise for lessons at smaller scales

Ross Taylor (@rosstaylor90)

Quick hiring call. We’re looking for full stack engineers to join our growing team at General Reasoning (@GenReasoning).

We have more work than hands at the moment - a nice problem to have! - and are working with clients on some groundbreaking projects (the most excited I’ve been since my early LLM

Taco Cohen (@tacocohen)

Last week we found an issue with SWE-Bench, allowing agents to cheat by looking at future commits. Instead of celebrating the SWE-Bench Devs for quickly fixing the issue and being transparent, the HN crowd is dunking on them and drawing wildly inaccurate conclusions about