Rayan Krishnan (@rayankrishnan) 's Twitter Profile
Rayan Krishnan

@rayankrishnan

ceo @_valsai | solve evals, solve intelligence

prev @stanford @PalantirTech

ID: 1117951640940597248

Link: https://www.vals.ai/

Joined: 16-04-2019 00:43:36

57 Tweets

217 Followers

193 Following

Vals AI (@_valsai) 's Twitter Profile Photo

We had the chance to sit down with Thomas Bueler-Faudree and Joe, the leaders behind VecFlow who build the AI copilot, Oliver. They share their perspective on DeepSeek, developing with agents, and the future of legal AI:

Samay (@samaysham) 's Twitter Profile Photo

How much labor can AI actually automate? We enriched 16,000+ US worker tasks across 700+ occupations to find out. Introducing the AI Labor Index.

Vals AI (@_valsai) 's Twitter Profile Photo

We just released benchmark results for cohere's Command A, and here’s what we found:

1. Ranked 9th/46 on LegalBench (78.8%), outperforming GPT-4o Mini & Claude 3.7 Sonnet 💪🏼🏅

2. Struggled in CorpFin (10th/22, 54.5%) & placed lowest in AIME & GPQA 👀🤔

3. Higher cost vs.
Vals AI (@_valsai) 's Twitter Profile Photo

Thank you to all of our friends that came to our Vals AI Game Night. We loved having everyone come hang out ☺️🥳 Between the chattiest game of poker and the Fascists getting Hitler (Ayaan Momin + Brian Serrata) elected in Secret Hitler, we would say that we had quite the

Rez Havaei (@havaeirez) 's Twitter Profile Photo

LLMs are being deployed in high-stakes environments—and the potential impact of failure is colossal. A jailbroken AI could leak your customer data, financial records, or enable catastrophically harmful actions. At General Analysis we have compiled the definitive guide to understand

Rayan Krishnan (@rayankrishnan) 's Twitter Profile Photo

Is identifying tasks easy for humans which are hard for AI really a relevant North Star for AGI? This seems to be ARC Prize's focus but doing well would imply a higher Turing test pass rate (preventing human-simple questions from discriminating AIs), not show generality or

Vals AI (@_valsai) 's Twitter Profile Photo

Grok 3 Beta dominates on our proprietary benchmarks, setting the new SOTA on our Finance, Legal and Tax benchmarks.

Congrats xAI Grok Elon Musk 🚀🚀🚀

We just released the benchmark results for xAI's new models: Grok 3 Beta & Grok 3 Mini Fast Beta (High & Low Reasoning) –
Vals AI (@_valsai) 's Twitter Profile Photo

Hello and welcome to all of our new followers! We wanted to formally introduce ourselves and what we are working on. (1/5)

Rayan Krishnan (@rayankrishnan) 's Twitter Profile Photo

Surprisingly good model for OAI, but how have we still not established naming best practices... 3.5 --> 4 --> 4o --> 4.5 --> 4.1?? sonnet 3.5 --> sonnet 3.5-latest?? --> sonnet 3.7