kelly (@kellyhongsn) Twitter Tweets • TwiCopy

Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI.

We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API

Starting scores - Frontier AI: 0%, Humans: 100%

kelly

@kellyhongsn

2 months ago

weekend in sd

kelly

@kellyhongsn

2 months ago

happening tomorrow: lu.ma/vw17piwl

anton

@atroyn

2 months ago

tomorrow! kelly and i are going live to talk about her research on how models actually use long context in realistic settings. we’ll discuss the motivation behind the work, the ideas that didn’t quite make it, what we think the results mean, and answer audience q’s.

matt palmer

@mattppal

2 months ago

live learning about context with chroma 🙌

Chroma

@trychroma

2 months ago

Go behind the research with kelly and anton 🇺🇸. Our latest technical report: "Context Rot" investigates how model performance grows increasingly unreliable as input length grows.

wh

@nrehiew_

a month ago

Some notes on the Context Rot research. I think there are pretty big/validated implications on several use cases and how we should think about these models.