Argilla (@argilla_io)'s Twitter Profile
Argilla

@argilla_io

Making AI data go brrrr (acquired by 🤗 Hugging Face)

ID: 1432630844720562177

Link: https://github.com/argilla-io · Joined: 31-08-2021 09:06:42

1.1K Tweets

4.4K Followers

39 Following

Sara Han (@sdiazlor)'s Twitter Profile Photo

🙅‍♀️ No-code end-to-end example to train your model

1️⃣ Use the Synthetic Data Generator to create your custom dataset

2️⃣ Use AutoTrain to use the generated dataset and train your model

Check it here: huggingface.co/blog/synthetic…
Daniel van Strien (@vanstriendaniel)'s Twitter Profile Photo

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍
Prolific (@prolific)'s Twitter Profile Photo

Creating an RLHF dataset on social reasoning using Argilla (Hugging Face) comes next in our 12 days of studies selection. 🧩

We developed and released our own open-source dataset that the community can use to fine-tune models.

Full story ➡️ prolific.com/resources/how-…

David Berenstein (@davidberenstei)'s Twitter Profile Photo

Fine-tuning ModernBERT for text classification using synthetic data generation. From prompt to model in 3 steps:

1 simple dataset
20 minutes of generating
60 minutes of fine-tuning on my MacBook Pro

Tutorial: buff.ly/4gwxBg7
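The linked tutorial fine-tunes ModernBERT with Hugging Face tooling, which needs model downloads to reproduce. As a self-contained stand-in for the same generate-then-train loop, here is a tiny bag-of-words logistic regression trained on a handful of invented "synthetic" examples — all data and names below are illustrative, not from the tutorial:

```python
import numpy as np

# Toy "synthetic" dataset of (text, label) pairs, as a generator might emit.
# 1 = positive sentiment, 0 = negative. All examples are invented.
DATA = [
    ("great product really love it", 1),
    ("fantastic quality would buy again", 1),
    ("love the fast shipping", 1),
    ("terrible experience never again", 0),
    ("awful quality broke immediately", 0),
    ("hate the slow shipping", 0),
]

def build_vocab(texts):
    """Map every word seen in the corpus to a column index."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Bag-of-words count vector for one document."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

def train(data, epochs=500, lr=0.5):
    """Batch gradient descent on the logistic (log) loss."""
    vocab = build_vocab(t for t, _ in data)
    X = np.stack([vectorize(t, vocab) for t, _ in data])
    y = np.array([label for _, label in data], dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
        grad = p - y                            # dLoss/dlogits for log loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return vocab, w, b

def predict(text, vocab, w, b):
    p = 1.0 / (1.0 + np.exp(-(vectorize(text, vocab) @ w + b)))
    return int(p > 0.5)

vocab, w, b = train(DATA)
print(predict("really love it", vocab, w, b))  # → 1
```

The point is the workflow shape, not the model: swap the toy classifier for ModernBERT plus a real trainer and the generate → train → predict loop is unchanged.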

David Berenstein (@davidberenstei)'s Twitter Profile Photo

Yes! Smol models can beat frontier models, but don't expect miracles.

Consider all costs and gains, like differences in performance and the value of using private, local models and data.

A couple of hours of generating and training on an Apple M1.

Notebook: buff.ly/3DH3bJD

Argilla (@argilla_io)'s Twitter Profile Photo

We're building FineWeb-Edu in many languages and need your help. 

This effort will help the Open-Source AI community close the language gap.

Assamese is 99.4% done, French needs 64 more, Tamil: 216.

Can you help us reach 1,000 annotations?
David Berenstein (@davidberenstei)'s Twitter Profile Photo

High-quality data for fine-tuning language models, for free and at the click of a button!

Prompt and wait for your dataset to be pushed to Argilla or the Hub. Evaluate, review, and fine-tune a model.

Blog: buff.ly/4h6KMUH

Daniel van Strien (@vanstriendaniel)'s Twitter Profile Photo

How Fine is FineWeb2? The community has evaluated the educational quality of over 1,000 examples from FineWeb 2 across 15 languages (and counting!).

tl;dr: The Hugging Face community is amazing, and there's already sufficient data to start building. More in 🧵
David Berenstein (@davidberenstei)'s Twitter Profile Photo

New Year's resolutions:
1) get better at AI
2) train more models
3) work with smaller models
4) save some money
5) .... wait a moment -> smol-course!

Public course, private models, full of potential: buff.ly/3ZCMKX2

David Berenstein (@davidberenstei)'s Twitter Profile Photo

Smol-scale vector search doesn't need a dedicated vector database!

You can simply use the Hugging Face Hub.

code: buff.ly/4jfilpC
built on DuckDB: buff.ly/4jgWAWl
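The linked setup scans parquet datasets on the Hugging Face Hub with DuckDB; the core idea — brute-force cosine similarity over a small table of embeddings — fits in a few lines of plain NumPy. The corpus and vectors below are invented for illustration:

```python
import numpy as np

# Hypothetical corpus with precomputed embeddings. In the linked setup these
# would live in a parquet dataset on the Hub and be scanned with DuckDB;
# here they are in-memory toy vectors.
docs = ["how to fine-tune a model", "vector search basics", "cooking pasta"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])

def cosine_search(query_vec, embeddings, k=2):
    """Return indices of the k most similar rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                 # cosine similarity of each row vs. query
    return np.argsort(-scores)[:k]  # best-first

query = np.array([0.2, 0.95, 0.05])  # pretend this embeds "semantic search"
for i in cosine_search(query, embeddings):
    print(docs[i])  # "vector search basics", then "how to fine-tune a model"
```

At smol scale this exact-scan approach is all you need; a dedicated vector database only pays off once the table no longer fits a single scan.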
Daniel van Strien (@vanstriendaniel)'s Twitter Profile Photo

🎉 50,000+ annotations reached! The FineWeb2-C community is helping build better language models one annotation at a time.

📊 Current stats:
- 115 languages represented
- 419 amazing contributors
- 24 languages with complete datasets

But we're not done yet! 🧵
David Berenstein (@davidberenstei)'s Twitter Profile Photo

You can now use the "Synthetic Data Generator" at a much larger scale with your preferred inference engine: Ollama, vLLM, TGI, and serverless inference! 🔥

install, configure, launch!

examples: buff.ly/3PEivcx
duplicate on HF: buff.ly/40ztiLk
Sara Han (@sdiazlor)'s Twitter Profile Photo

💫 Generate RAG data with the Synthetic Data Generator to improve your RAG system!

1️⃣ Generate from your documents, dataset, or dataset description.
2️⃣ Configure it.
3️⃣ Generate the synthetic dataset.
4️⃣ Fine-tune the retrieval and reranking models.
5️⃣ Build a RAG pipeline.
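The retrieval-plus-reranking pipeline those steps lead to can be sketched as two stages: a fast vector shortlist, then a reranker over it. Everything below (the docs, the vectors, the word-overlap scorer) is a toy stand-in for the fine-tuned retrieval and reranking models the tweet describes:

```python
import numpy as np

docs = [
    "argilla is a data annotation platform",
    "synthetic data can bootstrap rag systems",
    "rerankers reorder retrieved passages",
]
# Pretend embeddings; in the real pipeline a fine-tuned retriever produces these.
doc_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.1, 1.0]])

def retrieve(query_vec, k=2):
    """Stage 1: cosine-similarity shortlist (fast, approximate)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return list(np.argsort(-(d @ q))[:k])

def rerank(query_text, candidates):
    """Stage 2: reorder the shortlist by word overlap.

    A stand-in for a fine-tuned cross-encoder reranker.
    """
    q_words = set(query_text.split())
    overlap = lambda i: len(q_words & set(docs[i].split()))
    return sorted(candidates, key=overlap, reverse=True)

shortlist = retrieve(np.array([0.5, 0.9]))
best = rerank("how do rerankers reorder passages", shortlist)[0]
print(docs[best])  # "rerankers reorder retrieved passages"
```

The split matters because the reranker is too slow to score every document: the retriever narrows the corpus cheaply, the reranker spends its budget only on the shortlist.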

David Berenstein (@davidberenstei)'s Twitter Profile Photo

🔥 The synthetic data for SmolLM and open DeepSeek-R1 relies on this awesome package!

1.2K distilabel datasets on the Hub: buff.ly/3PW46si
reproducible and shareable pipelines
any LLM provider
scale however you want

library: buff.ly/3MXAB8G
Daniel Vila Suero (@dvilasuero)'s Twitter Profile Photo

Open Source AI vibes on the Hugging Face Hub

I'm building a small vibe benchmark you can run with Hugging Face Inference Providers (link in the next message).

IMO, Inference Providers will become one of the most important pieces of the stack for building with AI: 

No lock-in,