Alessandro Ercolani (@giuxale)'s Twitter Profile
Alessandro Ercolani

@giuxale

Good, better, best. Never let it rest. 'Til your good is better and your better is best. (St. Jerome and Tim Duncan's mother)

ID: 627366168

Joined: 05-07-2012 10:58:26

384 Tweets

155 Followers

1.1K Following

Philipp Schmid (@_philschmid)

When a new model is released, those are the key metrics I first look at to understand its performance: 👀

- MixEval: A dynamic benchmark evaluating LLMs using real-world user queries and benchmarks, achieving a 0.96 model ranking correlation with Chatbot Arena.
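That 0.96 figure is a rank correlation between two leaderboards. Here is a minimal sketch of how such a number can be computed with Spearman's rho; all scores below are invented for illustration, and MixEval's exact statistic and data may differ:

```python
# Hedged sketch: rank correlation between two model leaderboards.
# All scores below are made up, not MixEval's or Chatbot Arena's data.
from scipy.stats import spearmanr

mixeval_scores = {"model-a": 78.2, "model-b": 71.5, "model-c": 65.0, "model-d": 59.3}
arena_elo = {"model-a": 1250, "model-b": 1210, "model-c": 1150, "model-d": 1120}

models = sorted(mixeval_scores)  # fixed order so the two score lists align
rho, _ = spearmanr([mixeval_scores[m] for m in models],
                   [arena_elo[m] for m in models])
print(f"model ranking correlation: {rho:.2f}")  # 1.00 for identical rankings
```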
Omar Sanseviero (@osanseviero)

Microsoft just silently dropped Florence
👀 Vision model that can tackle many vision tasks (captioning, detection, region proposal, OCR)
🤏 Small models (200M and 800M) with quality comparable to models 100x larger
🔥 MIT licensed
Paper and models: huggingface.co/collections/mi…
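For context, a minimal captioning sketch with the transformers library, assuming the Florence-2 checkpoints from that collection and their task-prompt interface; the model id, placeholder image URL, and `<CAPTION>` prompt follow the public model cards, but treat the details as assumptions:

```python
# Hedged sketch: image captioning with a Florence-2 checkpoint via transformers.
# Model id and the "<CAPTION>" task prompt follow the published model card;
# the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumption: one of the released checkpoints
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
out_ids = model.generate(input_ids=inputs["input_ids"],
                         pixel_values=inputs["pixel_values"],
                         max_new_tokens=64)
# Rough decode; the model card also shows processor.post_process_generation
# for structured outputs on detection/OCR tasks.
print(processor.batch_decode(out_ids, skip_special_tokens=True)[0])
```

Swapping the task prompt (e.g. `<OD>` for detection or `<OCR>`) switches tasks on the same weights, which is what lets one small model cover captioning, detection, and OCR.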

Thomas Wolf (@thom_wolf)

The kyutai fully end-to-end audio model demo of today is a huge deal that many people missed in the room.

Mostly irrelevant are the facts that:
- they come a few weeks after OpenAI ChatGPT-4o
- the demo was less polished than the 4o one (in terms of voice quality, voice

Andrej Karpathy (@karpathy)

I'm playing around with generative AI tools and stitching them together into visual stories. Here I took the first few sentences of Pride and Prejudice and made it into a video.

The gen stack used for this one:
- Anthropic Claude took the first chapter, generated the scenes

Thomas Wolf (@thom_wolf)

There was a super impressive AI competition last week that many people missed in the noise of the AI world. I happen to know several participants, so let me tell you a bit of this story over Sunday morning coffee.

You probably know the Millennium Prize Problems
Andrej Karpathy (@karpathy)

LLM model size competition is intensifying… backwards! My bet is that we'll see models that "think" very well and reliably that are very very small. There is most likely a setting even of GPT-2 parameters for which most people will consider GPT-2 "smart". The reason current

Alessandro Ercolani (@giuxale)

I just published MMLU-PRO-ITA, a new eval for Italian LLMs; the article links to the open-source dataset, the EleutherAI lm-eval PR, and results for Italian LLMs. link.medium.com/Z4HfLHIysLb
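A hedged sketch of how such an eval is typically run with EleutherAI's lm-eval harness once the task lands; the task name `mmlu_pro_ita` and the model id are illustrative assumptions, not necessarily the names used in the PR:

```python
# Hedged sketch: evaluating an HF model on a custom lm-eval task.
# "mmlu_pro_ita" is a hypothetical task name; substitute the one from the PR.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",  # any HF model id
    tasks=["mmlu_pro_ita"],
    batch_size=8,
)
print(results["results"])
```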

Jan P. Harries (@jphme)

5. Didn't see this one before: Meta's post-training pipeline utilizes pairwise-annotated preference data both to train (and use) a Reward Model for early-stage Rejection Sampling and to improve intermediate SFT models with DPO 🤯 - SPIN on steroids!? 😉

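For reference, the DPO side of that pipeline optimizes a simple pairwise objective (Rafailov et al., 2023). A minimal sketch of the loss follows; this is a generic illustration, not Meta's actual code:

```python
# Hedged sketch of the DPO objective used to refine SFT models with
# pairwise preference data. Minimal and illustrative only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is the summed log-prob of a response under a model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probs for one preference pair:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```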
Philipp Schmid (@_philschmid)

Exciting update for AI developers! The Hugging Face Hub is now more natively integrated into Google Cloud Vertex AI Model Garden. Search through thousands of open generative AI models from Hugging Face & deploy them with one click to Vertex AI or GKE. 🤯 What's new: 🔎

Alessandro Ercolani (@giuxale)

I just released the Italian fork of lm-evaluation-harness from EleutherAI for evaluating LLMs on Italian-language tasks. link.medium.com/NQSZ3g2fnNb

Andrej Karpathy (@karpathy)

Remember the llm.c repro of the GPT-2 (124M) training run? It took 45 min on 8xH100. Since then, Keller Jordan (and by now many others) have iterated on that extensively in the new modded-nanogpt repo that achieves the same result, now in only 5 min!
Love this repo 👏 600 LOC
Jiayi Pan (@jiayi_pirate)

We reproduced DeepSeek R1-Zero in the CountDown game, and it just works 

Through RL, the 3B base LM develops self-verification and search abilities all on its own 

You can experience the Aha moment yourself for < $30
Code: github.com/Jiayi-Pan/Tiny…

Here's what we learned 🧵
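The CountDown game suits R1-Zero-style RL because the reward is exactly verifiable: the model must combine the given numbers with arithmetic to hit a target. A minimal sketch of such a verifier follows; names and details are illustrative, not TinyZero's actual code:

```python
# Hedged sketch of a CountDown-style reward check, the kind of verifiable
# signal an RL run can use. Illustrative only.
import ast, operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    """Evaluate an arithmetic expression AST with +, -, *, / only."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """1.0 if expr uses exactly the given numbers and evaluates to target."""
    tree = ast.parse(expr, mode="eval").body
    used = sorted(int(n.value) for n in ast.walk(tree) if isinstance(n, ast.Constant))
    if used != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(safe_eval(tree) - target) < 1e-6 else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

print(countdown_reward("(25 - 1) * (4 + 0)", [25, 1, 4, 0], 96))  # -> 1.0
```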
Alessandro Ercolani (@giuxale)

#MII-LLM just released Propaganda, an open-source framework for evaluating and training #LLMs on political bias and opinions. medium.com/p/propaganda-0…

François Fleuret (@francoisfleuret)

You can similarly predict the task performance of a 1000x larger model.

The blue square on these graphs is the actual performance of Llama3 405B, on the graph fitted on smaller models.

This stuff really works.

arxiv.org/abs/2407.21783

4/4
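The Llama 3 paper describes this as a two-step extrapolation: fit a power law of benchmark negative log-likelihood against training compute on small models, then map predicted NLL to accuracy with a fitted sigmoid. A hedged sketch with invented data points:

```python
# Hedged sketch of the two-step extrapolation described in the Llama 3 paper:
# (1) fit NLL vs. training compute as a power law on small models,
# (2) map predicted NLL to accuracy with a fitted sigmoid.
# All data points below are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

flops = np.array([1e21, 3e21, 1e22, 3e22])  # small-model training compute (FLOPs)
nll = np.array([1.10, 1.02, 0.95, 0.90])    # benchmark negative log-likelihood
acc = np.array([0.38, 0.45, 0.53, 0.58])    # benchmark accuracy

# Step 1: NLL ~ a * C^(-b), fitted as a straight line in log-log space
slope, intercept = np.polyfit(np.log(flops), np.log(nll), 1)

# Step 2: sigmoid mapping from NLL to accuracy
sigmoid = lambda l, k, l0: 1.0 / (1.0 + np.exp(k * (l - l0)))
(k, l0), _ = curve_fit(sigmoid, nll, acc, p0=[10.0, 1.0])

target_compute = 3.8e25  # roughly Llama3 405B training compute, far beyond the fit
pred_nll = np.exp(intercept + slope * np.log(target_compute))
print(f"predicted accuracy at {target_compute:.1e} FLOPs: {sigmoid(pred_nll, k, l0):.2f}")
```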
Toby Kim (@_doyeob_)

Two undergrads. One still in the military. Zero funding. One ridiculous goal: build a TTS model that rivals NotebookLM Podcast, ElevenLabs Studio, and Sesame CSM. Somehow… we pulled it off. Here’s how 👇

Andrej Karpathy (@karpathy)

We're missing (at least one) major paradigm for LLM learning. Not sure what to call it; possibly it has a name - system prompt learning?

Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters but a lot of human
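One way to picture the idea (purely an illustrative sketch, not Karpathy's proposal in code): the model accumulates explicit lessons as editable text in its system prompt instead of in its weights. The `llm()` stub below is a hypothetical stand-in for any chat-completion API:

```python
# Hedged sketch of "system prompt learning": knowledge accrues as editable
# text in the system prompt rather than as weight updates. Illustrative only.

def llm(system_prompt: str, user_msg: str) -> str:
    """Hypothetical stand-in for a real chat-completion API call."""
    return f"(model reply to: {user_msg[:40]}...)"

system_prompt = "You are a problem-solving assistant.\nLearned lessons:\n"

def solve_and_learn(task: str) -> str:
    global system_prompt
    answer = llm(system_prompt, task)
    # Ask the model to distill a reusable lesson, then write it back into
    # the prompt - a parameter-free form of "learning".
    lesson = llm(system_prompt, f"State one general lesson from solving: {task}")
    system_prompt += f"- {lesson}\n"
    return answer

print(solve_and_learn("integrate x*exp(x) by parts"))
```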

Andrej Karpathy (@karpathy)

Transforming human knowledge, sensors and actuators from human-first and human-legible to LLM-first and LLM-legible is a beautiful space with so much potential and so much can be done...

One example I'm obsessed with recently - for every textbook pdf/epub, there is a perfect
Andrej Karpathy (@karpathy)

A few random notes from Claude coding quite a bit over the last few weeks.

Coding workflow. Given the latest lift in LLM coding capability, like many others I rapidly went from about 80% manual+autocomplete coding and 20% agents in November to 80% agent coding and 20% edits+touchups in