Alessandro Ercolani (@giuxale)'s Twitter Profile
Alessandro Ercolani

@giuxale

Good, better, best. Never let it rest. 'Til your good is better and your better is best. (St. Jerome and Tim Duncan's mother)

ID: 627366168

Joined: 05-07-2012 10:58:26

384 Tweets

155 Followers

1.1K Following

Philipp Schmid (@_philschmid)

When a new model is released, those are the key metrics I first look at to understand its performance: 👀

- MixEval: A dynamic benchmark evaluating LLMs using real-world user queries and benchmarks, achieving a 0.96 model ranking correlation with Chatbot Arena.
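That 0.96 figure is a rank correlation between two leaderboards. Here is a minimal sketch of how such a number can be computed with Spearman's rho; all scores below are invented for illustration, and MixEval's exact statistic and data may differ:

```python
# Hedged sketch: rank correlation between two model leaderboards.
# All scores below are made up, not MixEval's or Chatbot Arena's data.
from scipy.stats import spearmanr

mixeval_scores = {"model-a": 78.2, "model-b": 71.5, "model-c": 65.0, "model-d": 59.3}
arena_elo = {"model-a": 1250, "model-b": 1210, "model-c": 1150, "model-d": 1120}

models = sorted(mixeval_scores)  # fixed order so the two score lists align
rho, _ = spearmanr([mixeval_scores[m] for m in models],
                   [arena_elo[m] for m in models])
print(f"model ranking correlation: {rho:.2f}")  # 1.00 for identical rankings
```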
Omar Sanseviero (@osanseviero)

Microsoft just silently dropped Florence
👀 Vision model that can tackle many vision tasks (captioning, detection, region proposal, OCR)
🤏 Small models (200M and 800M) with quality comparable to models 100x larger
🔥 MIT licensed
Paper and models: huggingface.co/collections/mi…
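For context, a minimal captioning sketch with the transformers library, assuming the Florence-2 checkpoints from that collection and their task-prompt interface; the model id, placeholder image URL, and `<CAPTION>` prompt follow the public model cards, but treat the details as assumptions:

```python
# Hedged sketch: image captioning with a Florence-2 checkpoint via transformers.
# Model id and the "<CAPTION>" task prompt follow the published model card;
# the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumption: one of the released checkpoints
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
out_ids = model.generate(input_ids=inputs["input_ids"],
                         pixel_values=inputs["pixel_values"],
                         max_new_tokens=64)
# Rough decode; the model card also shows processor.post_process_generation
# for structured outputs on detection/OCR tasks.
print(processor.batch_decode(out_ids, skip_special_tokens=True)[0])
```

Swapping the task prompt (e.g. `<OD>` for detection or `<OCR>`) switches tasks on the same weights, which is what lets one small model cover captioning, detection, and OCR.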

Thomas Wolf (@thom_wolf)

The kyutai fully end-to-end audio model demo of today is a huge deal that many people missed in the room.

Mostly irrelevant are the facts that:
- they come a few weeks after OpenAI ChatGPT-4o
- the demo was less polished than the 4o one (in terms of voice quality, voice

Andrej Karpathy (@karpathy)

I'm playing around with generative AI tools and stitching them together into visual stories. Here I took the first few sentences of Pride and Prejudice and made it into a video.

The gen stack used for this one:
- Anthropic Claude took the first chapter, generated the scenes

Thomas Wolf (@thom_wolf)

There was a super impressive AI competition last week that many people missed in the noise of the AI world. I happen to know several participants, so let me tell you a bit of this story over Sunday morning coffee.

You probably know the Millennium Prize Problems
Andrej Karpathy (@karpathy)

LLM model size competition is intensifying… backwards! My bet is that we'll see models that "think" very well and reliably that are very very small. There is most likely a setting even of GPT-2 parameters for which most people will consider GPT-2 "smart". The reason current

Alessandro Ercolani (@giuxale)

I just published MMLU-PRO-ITA, a new eval for Italian LLMs; the article links to the open-source dataset, the EleutherAI lm-eval PR, and results for Italian LLMs. link.medium.com/Z4HfLHIysLb
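A hedged sketch of how such an eval is typically run with EleutherAI's lm-eval harness once the task lands; the task name `mmlu_pro_ita` and the model id are illustrative assumptions, not necessarily the names used in the PR:

```python
# Hedged sketch: evaluating an HF model on a custom lm-eval task.
# "mmlu_pro_ita" is a hypothetical task name; substitute the one from the PR.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",  # any HF model id
    tasks=["mmlu_pro_ita"],
    batch_size=8,
)
print(results["results"])
```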

Jan P. Harries (@jphme)

5. Didn't see this one before: Meta's post-training pipeline utilizes pairwise-annotated preference data both to train (and use) a Reward Model for early-stage Rejection Sampling and to improve intermediate SFT models with DPO 🤯 - SPIN on steroids!? 😉

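For reference, the DPO side of that pipeline optimizes a simple pairwise objective (Rafailov et al., 2023). A minimal sketch of the loss follows; this is a generic illustration, not Meta's actual code:

```python
# Hedged sketch of the DPO objective used to refine SFT models with
# pairwise preference data. Minimal and illustrative only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is the summed log-prob of a response under a model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probs for one preference pair:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```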
Philipp Schmid (@_philschmid)

Exciting update for AI developers! The Hugging Face Hub is now more natively integrated into Google Cloud Vertex AI Model Garden. Search through thousands of open generative AI models from Hugging Face & deploy them with one click to Vertex AI or GKE. 🤯 What's new: 🔎

Alessandro Ercolani (@giuxale)

I just released the Italian fork of lm-evaluation-harness from EleutherAI for evaluating LLMs on Italian-language tasks. link.medium.com/NQSZ3g2fnNb

Andrej Karpathy (@karpathy)

Remember the llm.c repro of the GPT-2 (124M) training run? It took 45 min on 8xH100. Since then, Keller Jordan (and by now many others) have iterated on that extensively in the new modded-nanogpt repo that achieves the same result, now in only 5 min!
Love this repo 👏 600 LOC
Jiayi Pan (@jiayi_pirate)

We reproduced DeepSeek R1-Zero in the CountDown game, and it just works 

Through RL, the 3B base LM develops self-verification and search abilities all on its own 

You can experience the Aha moment yourself for < $30
Code: github.com/Jiayi-Pan/Tiny…

Here's what we learned 🧵
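The CountDown game suits R1-Zero-style RL because the reward is exactly verifiable: the model must combine the given numbers with arithmetic to hit a target. A minimal sketch of such a verifier follows; names and details are illustrative, not TinyZero's actual code:

```python
# Hedged sketch of a CountDown-style reward check, the kind of verifiable
# signal an RL run can use. Illustrative only.
import ast, operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    """Evaluate an arithmetic expression AST with +, -, *, / only."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """1.0 if expr uses exactly the given numbers and evaluates to target."""
    tree = ast.parse(expr, mode="eval").body
    used = sorted(int(n.value) for n in ast.walk(tree) if isinstance(n, ast.Constant))
    if used != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(safe_eval(tree) - target) < 1e-6 else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

print(countdown_reward("(25 - 1) * (4 + 0)", [25, 1, 4, 0], 96))  # -> 1.0
```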
Alessandro Ercolani (@giuxale)

#MII-LLM just released Propaganda, an open-source framework for evaluating and training #LLMs on political bias and opinions. medium.com/p/propaganda-0…

François Fleuret (@francoisfleuret)

You can similarly predict the task performance of a 1000x larger model.

The blue square on these graphs is the actual performance of Llama3 405B, on the graph fitted on smaller models.

This stuff really works.

arxiv.org/abs/2407.21783

4/4
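The Llama 3 paper describes this as a two-step extrapolation: fit a power law of benchmark negative log-likelihood against training compute on small models, then map predicted NLL to accuracy with a fitted sigmoid. A hedged sketch with invented data points:

```python
# Hedged sketch of the two-step extrapolation described in the Llama 3 paper:
# (1) fit NLL vs. training compute as a power law on small models,
# (2) map predicted NLL to accuracy with a fitted sigmoid.
# All data points below are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

flops = np.array([1e21, 3e21, 1e22, 3e22])  # small-model training compute (FLOPs)
nll = np.array([1.10, 1.02, 0.95, 0.90])    # benchmark negative log-likelihood
acc = np.array([0.38, 0.45, 0.53, 0.58])    # benchmark accuracy

# Step 1: NLL ~ a * C^(-b), fitted as a straight line in log-log space
slope, intercept = np.polyfit(np.log(flops), np.log(nll), 1)

# Step 2: sigmoid mapping from NLL to accuracy
sigmoid = lambda l, k, l0: 1.0 / (1.0 + np.exp(k * (l - l0)))
(k, l0), _ = curve_fit(sigmoid, nll, acc, p0=[10.0, 1.0])

target_compute = 3.8e25  # roughly Llama3 405B training compute, far beyond the fit
pred_nll = np.exp(intercept + slope * np.log(target_compute))
print(f"predicted accuracy at {target_compute:.1e} FLOPs: {sigmoid(pred_nll, k, l0):.2f}")
```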
Toby Kim (@_doyeob_)

Two undergrads. One still in the military. Zero funding. One ridiculous goal: build a TTS model that rivals NotebookLM Podcast, ElevenLabs Studio, and Sesame CSM. Somehow… we pulled it off. Here’s how 👇

Andrej Karpathy (@karpathy)

We're missing (at least one) major paradigm for LLM learning. Not sure what to call it; possibly it has a name - system prompt learning?

Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters but a lot of human
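One way to picture the idea (purely an illustrative sketch, not Karpathy's proposal in code): the model accumulates explicit lessons as editable text in its system prompt instead of in its weights. The `llm()` stub below is a hypothetical stand-in for any chat-completion API:

```python
# Hedged sketch of "system prompt learning": knowledge accrues as editable
# text in the system prompt rather than as weight updates. Illustrative only.

def llm(system_prompt: str, user_msg: str) -> str:
    """Hypothetical stand-in for a real chat-completion API call."""
    return f"(model reply to: {user_msg[:40]}...)"

system_prompt = "You are a problem-solving assistant.\nLearned lessons:\n"

def solve_and_learn(task: str) -> str:
    global system_prompt
    answer = llm(system_prompt, task)
    # Ask the model to distill a reusable lesson, then write it back into
    # the prompt - a parameter-free form of "learning".
    lesson = llm(system_prompt, f"State one general lesson from solving: {task}")
    system_prompt += f"- {lesson}\n"
    return answer

print(solve_and_learn("integrate x*exp(x) by parts"))
```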

Andrej Karpathy (@karpathy)

Transforming human knowledge, sensors and actuators from human-first and human-legible to LLM-first and LLM-legible is a beautiful space with so much potential and so much can be done...

One example I'm obsessed with recently - for every textbook pdf/epub, there is a perfect
Andrej Karpathy (@karpathy)

A few random notes from Claude coding quite a bit over the last few weeks.

Coding workflow. Given the latest lift in LLM coding capability, like many others I rapidly went from about 80% manual+autocomplete coding and 20% agents in November to 80% agent coding and 20% edits+touchups in