
JosH100
@josh_wills
Engineering at @datologyai; @duckdb enthusiast, ex-@slackhq
ID: 14578294
https://bsky.app/profile/spite.vc 29-04-2008 01:03:11
20,20K Tweets
17,17K Followers
1,1K Following





Tired: Bringing up politics at Thanksgiving
Wired: Bringing up DatologyAI’s new text curation results at Thanksgiving
That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃 🧵





DatologyAI is back: state-of-the-art CLIP model performance using data curation alone 🚀
✅ state-of-the-art ViT-B/32 performance: ImageNet 1k 76.9% vs 74% reported by SigLIP2
✅ 8x training efficiency gains
✅ 2x inference efficiency gains
✅ Public model release
Details in








We teamed up with DatologyAI to build what we believe is the strongest pretraining corpus in the world—and I truly think we nailed it. Their team was absolutely key to the model’s success. We started with ~23T tokens of high-quality data and distilled it down to 6.58T through

congrats on the launch!!! data curation powered by DatologyAI :))


This trend will only continue. Training your own model doesn't need to cost 10s of millions, especially in specialized domains. Better data is a compute multiplier, and DatologyAI's mission is to make this easy, massively reducing the cost and difficulty of training.