Argilla (@argilla_io)'s Twitter Profile
Argilla

@argilla_io

Making AI data go brrrr (acquired by 🤗 Hugging Face)

ID: 1432630844720562177

Link: https://github.com/argilla-io · Joined: 31-08-2021 09:06:42

1.1K Tweets

4.4K Followers

39 Following

Sara Han (@sdiazlor)'s Twitter Profile Photo

🙅‍♀️ No-code end-to-end example to train your model

1️⃣ Use the Synthetic Data Generator to create your custom dataset

2️⃣ Use AutoTrain to use the generated dataset and train your model

Check it here: huggingface.co/blog/synthetic…
Daniel van Strien (@vanstriendaniel)'s Twitter Profile Photo

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍
Prolific (@prolific)'s Twitter Profile Photo

Creating an RLHF dataset on social reasoning using Argilla (Hugging Face) comes next in our 12 days of studies selection. 🧩

We developed and released our own open-source dataset that the community can use to fine-tune models.

Full story ➡️ prolific.com/resources/how-…

David Berenstein (@davidberenstei)'s Twitter Profile Photo

Fine-tuning ModernBERT for text classification using synthetic data generation. From prompt to model in 3 steps:

1 simple dataset
20 minutes of generating
60 minutes of fine-tuning on my MacBook Pro

Tutorial: buff.ly/4gwxBg7
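The linked tutorial fine-tunes ModernBERT with Hugging Face tooling, which needs model downloads to reproduce. As a self-contained stand-in for the same generate-then-train loop, here is a tiny bag-of-words logistic regression trained on a handful of invented "synthetic" examples — all data and names below are illustrative, not from the tutorial:

```python
import numpy as np

# Toy "synthetic" dataset of (text, label) pairs, as a generator might emit.
# 1 = positive sentiment, 0 = negative. All examples are invented.
DATA = [
    ("great product really love it", 1),
    ("fantastic quality would buy again", 1),
    ("love the fast shipping", 1),
    ("terrible experience never again", 0),
    ("awful quality broke immediately", 0),
    ("hate the slow shipping", 0),
]

def build_vocab(texts):
    """Map every word seen in the corpus to a column index."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Bag-of-words count vector for one document."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

def train(data, epochs=500, lr=0.5):
    """Batch gradient descent on the logistic (log) loss."""
    vocab = build_vocab(t for t, _ in data)
    X = np.stack([vectorize(t, vocab) for t, _ in data])
    y = np.array([label for _, label in data], dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
        grad = p - y                            # dLoss/dlogits for log loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return vocab, w, b

def predict(text, vocab, w, b):
    p = 1.0 / (1.0 + np.exp(-(vectorize(text, vocab) @ w + b)))
    return int(p > 0.5)

vocab, w, b = train(DATA)
print(predict("really love it", vocab, w, b))  # → 1
```

The point is the workflow shape, not the model: swap the toy classifier for ModernBERT plus a real trainer and the generate → train → predict loop is unchanged.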

David Berenstein (@davidberenstei)'s Twitter Profile Photo

Yes! Smol models can beat frontier models, but don't expect miracles.

Consider all costs and gains, like differences in performance and the value of using private, local models and data.

A couple of hours of generating and training on an Apple M1.

Notebook: buff.ly/3DH3bJD

Argilla (@argilla_io)'s Twitter Profile Photo

We're building FineWeb-Edu in many languages and need your help. 

This effort will help the Open-Source AI community close the language gap.

Assamese is 99.4% done, French needs 64 more, Tamil: 216.

Can you help us reach 1,000 annotations?
David Berenstein (@davidberenstei)'s Twitter Profile Photo

High-quality data for fine-tuning language models, for free and at the click of a button!

Prompt and wait for your dataset to be pushed to Argilla or the Hub. Evaluate, review, and fine-tune a model.

Blog: buff.ly/4h6KMUH

Daniel van Strien (@vanstriendaniel)'s Twitter Profile Photo

How Fine is FineWeb2? The community has evaluated the educational quality of over 1,000 examples from FineWeb 2 across 15 languages (and counting!).

tl;dr: The Hugging Face community is amazing, and there's already sufficient data to start building. More in 🧵
David Berenstein (@davidberenstei)'s Twitter Profile Photo

New Year's resolutions:
1) get better at AI
2) train more models
3) work with smaller models
4) save some money
5) .... wait a moment -> smol-course!

Public course, private models, full of potential: buff.ly/3ZCMKX2

David Berenstein (@davidberenstei)'s Twitter Profile Photo

Smol-scale vector search doesn't need a dedicated vector database!

You can simply use the Hugging Face Hub.

code: buff.ly/4jfilpC
built on DuckDB: buff.ly/4jgWAWl
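The linked setup scans parquet datasets on the Hugging Face Hub with DuckDB; the core idea — brute-force cosine similarity over a small table of embeddings — fits in a few lines of plain NumPy. The corpus and vectors below are invented for illustration:

```python
import numpy as np

# Hypothetical corpus with precomputed embeddings. In the linked setup these
# would live in a parquet dataset on the Hub and be scanned with DuckDB;
# here they are in-memory toy vectors.
docs = ["how to fine-tune a model", "vector search basics", "cooking pasta"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])

def cosine_search(query_vec, embeddings, k=2):
    """Return indices of the k most similar rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                 # cosine similarity of each row vs. query
    return np.argsort(-scores)[:k]  # best-first

query = np.array([0.2, 0.95, 0.05])  # pretend this embeds "semantic search"
for i in cosine_search(query, embeddings):
    print(docs[i])  # "vector search basics", then "how to fine-tune a model"
```

At smol scale this exact-scan approach is all you need; a dedicated vector database only pays off once the table no longer fits a single scan.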
Daniel van Strien (@vanstriendaniel)'s Twitter Profile Photo

🎉 50,000+ annotations reached! The FineWeb2-C community is helping build better language models one annotation at a time.

📊 Current stats:
- 115 languages represented
- 419 amazing contributors
- 24 languages with complete datasets

But we're not done yet! 🧵
David Berenstein (@davidberenstei)'s Twitter Profile Photo

You can now use the "Synthetic Data Generator" at a much larger scale with your preferred inference engine: Ollama, vLLM, TGI, and serverless inference! 🔥

install, configure, launch!

examples: buff.ly/3PEivcx
duplicate on HF: buff.ly/40ztiLk
Sara Han (@sdiazlor)'s Twitter Profile Photo

💫 Generate RAG data with the Synthetic Data Generator to improve your RAG system!

1️⃣ Generate from your documents, dataset, or dataset description.
2️⃣ Configure it.
3️⃣ Generate the synthetic dataset.
4️⃣ Fine-tune the retrieval and reranking models.
5️⃣ Build a RAG pipeline.
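The retrieval-plus-reranking pipeline those steps lead to can be sketched as two stages: a fast vector shortlist, then a reranker over it. Everything below (the docs, the vectors, the word-overlap scorer) is a toy stand-in for the fine-tuned retrieval and reranking models the tweet describes:

```python
import numpy as np

docs = [
    "argilla is a data annotation platform",
    "synthetic data can bootstrap rag systems",
    "rerankers reorder retrieved passages",
]
# Pretend embeddings; in the real pipeline a fine-tuned retriever produces these.
doc_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.1, 1.0]])

def retrieve(query_vec, k=2):
    """Stage 1: cosine-similarity shortlist (fast, approximate)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return list(np.argsort(-(d @ q))[:k])

def rerank(query_text, candidates):
    """Stage 2: reorder the shortlist by word overlap.

    A stand-in for a fine-tuned cross-encoder reranker.
    """
    q_words = set(query_text.split())
    overlap = lambda i: len(q_words & set(docs[i].split()))
    return sorted(candidates, key=overlap, reverse=True)

shortlist = retrieve(np.array([0.5, 0.9]))
best = rerank("how do rerankers reorder passages", shortlist)[0]
print(docs[best])  # "rerankers reorder retrieved passages"
```

The split matters because the reranker is too slow to score every document: the retriever narrows the corpus cheaply, the reranker spends its budget only on the shortlist.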

David Berenstein (@davidberenstei)'s Twitter Profile Photo

🔥 The synthetic data for SmolLM and open DeepSeek-R1 relies on this awesome package!

1.2K distilabel datasets on the Hub: buff.ly/3PW46si
reproducible and shareable pipelines
any LLM provider
scale however you want

library: buff.ly/3MXAB8G
Daniel Vila Suero (@dvilasuero)'s Twitter Profile Photo

Open Source AI vibes on the Hugging Face Hub

I'm building a small vibe benchmark you can run with Hugging Face Inference Providers (link in the next message).

IMO, Inference Providers will become one of the most important pieces of the stack for building with AI: 

No lock-in,