Rayan Krishnan (@rayankrishnan) Twitter Tweets • TwiCopy

Rayan Krishnan

@rayankrishnan

ceo @_valsai | solve evals, solve intelligence

prev @stanford @PalantirTech

ID: 1117951640940597248

Link: https://www.vals.ai/ · Joined: 16-04-2019 00:43:36

57 Tweets

217 Followers

193 Following

Vals AI

@_valsai

10 months ago

We had the chance to sit down with Thomas Bueler-Faudree and Joe, the leaders behind VecFlow who build the AI copilot, Oliver. They share their perspective on DeepSeek, developing with agents, and the future of legal AI:

Likes: 9 · Replies: 1 · Retweets: 3

Samay

@samaysham

9 months ago

How much labor can AI actually automate? We enriched 16,000+ US worker tasks across 700+ occupations to find out. Introducing the AI Labor Index.

Likes: 202 · Replies: 26 · Retweets: 33

Vals AI

@_valsai

9 months ago

We just released benchmark results for cohere's Command A, and here’s what we found:

1. Ranked 9th/46 on LegalBench (78.8%), outperforming GPT-4o Mini & Claude 3.7 Sonnet 💪🏼🏅

2. Struggled in CorpFin (10th/22, 54.5%) & placed lowest in AIME & GPQA 👀🤔

3. Higher cost vs.

Likes: 3 · Replies: 0 · Retweets: 1

Vals AI

@_valsai

9 months ago

Thank you to all of our friends that came to our Vals AI Game Night. We loved having everyone come hang out ☺️🥳 Between the chattiest game of poker and the Fascists getting Hitler (Ayaan Momin + Brian Serrata) elected in Secret Hitler, we would say that we had quite the

Likes: 8 · Replies: 0 · Retweets: 1

Rez Havaei

@havaeirez

9 months ago

LLMs are being deployed in high-stakes environments—and the potential impact of failure is colossal. A jailbroken AI could leak your customer data, financial records, or enable catastrophically harmful actions. At General Analysis we have compiled the definitive guide to understand

Likes: 69 · Replies: 7 · Retweets: 27

Rayan Krishnan

@rayankrishnan

9 months ago

Really impressed with this model release, sets the new SOTA on our benchmarks. Congrats Google DeepMind

Likes: 1 · Replies: 0 · Retweets: 0

Rayan Krishnan

@rayankrishnan

9 months ago

Is identifying tasks easy for humans which are hard for AI really a relevant North Star for AGI? This seems to be ARC Prize's focus but doing well would imply a higher Turing test pass rate (preventing human-simple questions from discriminating AIs), not show generality or

Likes: 2 · Replies: 1 · Retweets: 0

Vals AI

@_valsai

8 months ago

Grok 3 Beta dominates on our proprietary benchmarks, setting the new SOTA on our Finance, Legal and Tax benchmarks.

Congrats xAI Grok Elon Musk 🚀🚀🚀

We just released the benchmark results for xAI's new models: Grok 3 Beta & Grok 3 Mini Fast Beta (High & Low Reasoning) –

Likes: 1.1K · Replies: 395 · Retweets: 284

Rayan Krishnan

@rayankrishnan

8 months ago

Credit where credit is due

Likes: 24 · Replies: 3 · Retweets: 0

Vals AI

@_valsai

8 months ago

Hello and welcome to all of our new followers! We wanted to formally introduce ourselves and what we are working on. (1/5)

Likes: 13 · Replies: 1 · Retweets: 2

Rayan Krishnan

@rayankrishnan

8 months ago

Surprisingly good model for OAI, but how have we still not established naming best practices... 3.5 --> 4 --> 4o --> 4.5 --> 4.1?? sonnet 3.5 --> sonnet 3.5-latest?? --> sonnet 3.7

Likes: 2 · Replies: 2 · Retweets: 0