Sumuk (@sumukx)'s Twitter Profile
Sumuk

@sumukx

evals @ huggingface 🤗 | working towards a plurality of autonomous intelligent systems

ID: 1704646735950168064

Link: https://sumuk.org · Joined: 21-09-2023 00:00:39

547 Tweets

291 Followers

423 Following

Sumuk (@sumukx)

super excited to see yourbench converge to be the default generative benchmarking / synthetic data creation solution for llms 💛

Clémentine Fourrier 🍊 (@clefourrier)

To make sure your AI agent is not bullshitting you, you need to evaluate its reasoning... but to do so automatically, you need an LLM... 🤔 So how do you evaluate the trace evaluator? With TRAIL, which contains:
- a full taxonomy of agent errors and most frequent failure cases,

Alina Lozovskaya (@ailozovskaya)

How well do LLMs really know about Hugging Face? 🤔

I used Yourbench to create a custom eval set across the HF docs to test 10 models

Next question: what were the hardest questions for the models? Drop your guesses in the comments ⬇️
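
For context on the workflow: a tool in this space turns a document corpus into question-answer pairs, then grades candidate models against them. The sketch below is a conceptual illustration only, not Yourbench's actual interface; the prompts, the `ask` helper, and the model IDs are hypothetical stand-ins.

```python
# Conceptual generate-then-grade eval loop (NOT Yourbench's real API).
# Assumes huggingface_hub is installed and HF_TOKEN is set; model IDs are placeholders.
from huggingface_hub import InferenceClient

GENERATOR = "meta-llama/Llama-3.1-70B-Instruct"  # hypothetical question writer / judge

def ask(model: str, prompt: str) -> str:
    client = InferenceClient(model=model)
    out = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=512
    )
    return out.choices[0].message.content

def build_eval_set(doc_chunks: list[str]) -> list[dict]:
    """Turn each documentation chunk into one grounded question/answer pair."""
    pairs = []
    for chunk in doc_chunks:
        q = ask(GENERATOR, f"Write one hard question answerable only from:\n{chunk}")
        a = ask(GENERATOR, f"Answer using only this text:\n{chunk}\n\nQuestion: {q}")
        pairs.append({"question": q, "reference": a})
    return pairs

def grade(candidate: str, pairs: list[dict]) -> float:
    """Accuracy of a candidate model, judged against the reference answers."""
    correct = 0
    for p in pairs:
        pred = ask(candidate, p["question"])
        verdict = ask(GENERATOR,
                      f"Reference: {p['reference']}\nPrediction: {pred}\n"
                      "Does the prediction agree with the reference? Answer YES or NO.")
        correct += verdict.strip().upper().startswith("YES")
    return correct / len(pairs)
```
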
Sagnik Mukherjee (@saagnikkk)

🚨 Paper Alert: “RL Finetunes Small Subnetworks in Large Language Models”

From DeepSeek V3 Base to DeepSeek R1 Zero, a whopping 86% of parameters were NOT updated during RL training 😮😮
And this isn’t a one-off. The pattern holds across RL algorithms and models.
🧵A Deep Dive
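
The 86% figure is a claim about update sparsity: compare the base and RL-tuned checkpoints parameter by parameter and count entries that never moved. A minimal sketch of that measurement, assuming two same-architecture transformers checkpoints (the model IDs below are placeholders, not the paper's code):

```python
# Sketch: estimate the fraction of parameters an RL run left untouched.
import torch
from transformers import AutoModelForCausalLM

# Placeholder IDs: any base checkpoint and its RL-tuned descendant work.
base = AutoModelForCausalLM.from_pretrained("org/base-model", torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained("org/rl-tuned-model", torch_dtype=torch.bfloat16)

unchanged, total = 0, 0
for (name, p_base), (_, p_tuned) in zip(base.named_parameters(), tuned.named_parameters()):
    # A weight counts as "not updated" if it is exactly identical after RL.
    unchanged += (p_base == p_tuned).sum().item()
    total += p_base.numel()

print(f"{100 * unchanged / total:.1f}% of parameters unchanged by RL")
```
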
Tim Soret (@timsoret)

2 years apart. Again, don't look at the finger when it is pointing at the moon. Many mocked the early results, but it was already profound to witness a machine clumsily hallucinate from its learning. To me, it felt like mocking a child's drawing.

Ryan Marten (@ryanmart3n)

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
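
Assuming the checkpoint lives under the open-thoughts organization on the Hub (the repo ID below is inferred from the announcement's naming, not verified), trying it locally is a standard transformers call:

```python
# Quick local generation with the announced checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "open-thoughts/OpenThinker3-7B"  # assumed Hub repo ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=1024)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```
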
meowbooks (@untitled01ipynb)

i have to subtweet this one for legal reasons but i guess the 300 users that see my memes would understand the context anyhow and will not reveal it in the replies

Lisan al Gaib (@scaling01)

A few more observations after replicating the Tower of Hanoi game with their exact prompts:

- You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff.
- Furthermore the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64K, and
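
The arithmetic behind this objection is easy to check: an optimal Tower of Hanoi solution takes 2^N - 1 moves, so at roughly 10 output tokens per move the transcript alone outgrows any fixed output window. A back-of-envelope sketch (the per-move cost and the output limits come from the tweet; the constant overhead is a guess):

```python
# At what disk count N does an optimal Tower of Hanoi transcript
# exceed a model's output-token limit?
TOKENS_PER_MOVE = 10   # estimate from the tweet
OVERHEAD = 200         # assumed constant for preamble/formatting

def tokens_needed(n_disks: int) -> int:
    moves = 2**n_disks - 1  # optimal solution length
    return moves * TOKENS_PER_MOVE + OVERHEAD

for model, limit in [("Sonnet 3.7", 128_000), ("DeepSeek R1", 64_000)]:
    # Smallest N whose full transcript no longer fits in the output window.
    n = next(n for n in range(1, 40) if tokens_needed(n) > limit)
    print(f"{model}: transcript no longer fits in {limit:,} tokens at N = {n}")
```

Under these assumptions, Sonnet 3.7 runs out of room at N = 14 and DeepSeek R1 at N = 13, well before the puzzle itself gets conceptually harder.
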
Sumuk (@sumukx)

there's something deeply fascinating about deep motor skills that we're failing to replicate properly with machines, but everything else seems to come relatively easily

Sumuk (@sumukx)

if we have human training data, human emotions are necessary for RL

few understand this now, but i suspect we’ll see lots of papers intensifying emotion vectors for improved performance