
XetHub
@xetdata
XetHub enables ML teams to collaborate effectively on massive datasets.
Now part of @HuggingFace 🤗!
ID: 1499078781390110721
https://huggingface.co/xet-team
02-03-2022 17:47:04
62 Tweets
368 Followers
13 Following



Deduplicating evolving datasets is a no-brainer - store differences instead of full versions of each one. But format matters! Here's how appends, modifications, and deletes on Apache Parquet files (~20% of what's stored on Hugging Face Hub) deduplicate. 🧵



Did you know that @Huggingface Hub holds over 29 PB of Git LFS files across datasets, models, and spaces? 📈 That's the equivalent of 64 Common Crawl Foundation downloads - and it's growing every day. So what's inside? 🧵



We're turning Hugging Face Hub's files into content-defined chunks to speed up your workflows! ⚡️ This means:
- 🧠 We store your file as deduplicated chunks
- ⏩ You only upload changed chunks when iterating!
- 🚀 Pulling changes? Only download changed chunks!
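
For readers who want to see the idea in code, here is a minimal Python sketch of content-defined chunking with deduplication. It is not the Hub's actual implementation: the gear-style rolling hash, the size parameters, and the names `chunk` and `upload` are all assumptions chosen for illustration.

```python
import hashlib
import random
from typing import Dict, List

MASK64 = (1 << 64) - 1
# Hypothetical gear table: one pseudo-random 64-bit value per byte value,
# derived from SHA-256 so the table is the same on every run.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]


def chunk(data: bytes, bits: int = 8, min_size: int = 64,
          max_size: int = 1024) -> List[bytes]:
    """Split data where a rolling hash of recent bytes has `bits` zero top bits.

    Boundaries depend on content rather than on absolute offsets, so unchanged
    regions tend to produce the same chunks after an edit elsewhere.
    The default sizes are illustrative assumptions, not production values.
    """
    top_mask = ((1 << bits) - 1) << (64 - bits)
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Each step shifts older bytes toward the top; after 64 steps a byte
        # no longer influences the hash.
        h = ((h << 1) + GEAR[b]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h & top_mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def upload(store: Dict[str, bytes], data: bytes) -> int:
    """Add only unseen chunks to `store`; return how many new bytes were stored."""
    new_bytes = 0
    for piece in chunk(data):
        key = hashlib.sha256(piece).hexdigest()
        if key not in store:
            store[key] = piece
            new_bytes += len(piece)
    return new_bytes


# Usage: store a file, then store a version with 500 bytes appended.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(20_000))
v2 = v1 + bytes(rng.getrandbits(8) for _ in range(500))

store: Dict[str, bytes] = {}
print("first upload, new bytes:", upload(store, v1))
print("after append, new bytes:", upload(store, v2))
```

In this sketch an append re-uploads at most the old trailing chunk plus the appended bytes, because every earlier chunk hashes to a key already in the store. Edits in the middle of a file usually behave similarly once the boundaries line up again, though how quickly that happens depends on the data and the parameters.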