JosH100 (@josh_wills)'s Twitter Profile
JosH100

@josh_wills

Engineering at @datologyai; @duckdb enthusiast, ex-@slackhq

ID: 14578294

Link: https://bsky.app/profile/spite.vc · Joined: 29-04-2008 01:03:11

20K Tweets

17K Followers

1K Following

Sarah Catanzaro (@sarahcat21)'s Twitter Profile Photo

So many people are talking about the end of scaling laws this week; but while we might be seeing diminishing returns to scaling pretraining compute there are still highly impactful, empirically validated ways to improve model performance. Data curation is top among these.
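
For context, "diminishing returns" here refers to the power-law shape of compute scaling curves: loss improves by a roughly constant factor per 10x of compute, so absolute gains shrink while cost grows. A toy illustration in Python; the constants are made up for demonstration, not fitted to any published curve:

```python
# Toy illustration of diminishing returns under a power-law scaling curve
# L(C) = a * C**(-b). The constants a and b are illustrative, not fitted.
def loss(compute: float, a: float = 10.0, b: float = 0.05) -> float:
    return a * compute ** (-b)

for c in (1e21, 1e22, 1e23):
    print(f"C = {c:.0e} FLOPs -> L = {loss(c):.3f}")
# Every 10x of compute multiplies loss by the same 10**(-0.05) ~= 0.89,
# so each step buys a smaller absolute improvement at 10x the cost.
```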

Amro (@amrokamal1997)'s Twitter Profile Photo

1/n Today, we are thrilled to introduce @DatologyAI's state-of-the-art Automated Image-Text Data Curation Pipeline. CLIP models trained on datasets produced by our pipeline achieve up to a 43x speedup in training.

Bogdan Gaza (@hurrycane)'s Twitter Profile Photo

High-quality data leads to better models! At DatologyAI, we've made data curation accessible! Our curation pipeline enables training faster (up to 98% less compute), better (up to 13% higher performance), and smaller (>60% fewer parameters) models. datologyai.com/post/datologya… 1/n 🧵
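
Worth noting: the "98% less compute" figure here and the 43x speedup quoted above are two framings of the same arithmetic, since a 43x speedup means reaching the same quality with 1/43 of the compute:

```python
# A 43x training speedup and "~98% less compute" are equivalent claims.
speedup = 43
compute_fraction = 1 / speedup                       # ~0.023 of baseline compute
print(f"compute saved: {1 - compute_fraction:.1%}")  # ~97.7%, i.e. roughly 98% less
```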

Haoli Yin (@haoliyin)'s Twitter Profile Photo

Check out this great thread by one of the big researchers in Data Curation! You might recognize him from SemDeDup 😉
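
For context, SemDeDup (Abbas et al., 2023) prunes semantically near-duplicate training examples: embed each example, cluster the embeddings, then drop items within a cluster that are too similar to one already kept. A minimal sketch of that idea; the cluster count and 0.9 similarity threshold are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings: np.ndarray, n_clusters: int = 10,
             threshold: float = 0.9) -> list[int]:
    """Return indices of examples kept after semantic deduplication."""
    # Normalize rows so dot products are cosine similarities.
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embs)

    keep: list[int] = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = embs[idx] @ embs[idx].T      # pairwise cosine similarity
        removed: set[int] = set()
        for i in range(len(idx)):
            if i in removed:
                continue
            keep.append(int(idx[i]))
            # Mark later cluster members that near-duplicate item i.
            dups = np.where(sims[i, i + 1:] > threshold)[0] + i + 1
            removed.update(dups.tolist())
    return sorted(keep)
```

Clustering first confines the quadratic pairwise-similarity step to each cluster, which is what makes the approach tractable at web-corpus scale.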

Matthew Leavitt (@leavittron)'s Twitter Profile Photo

Tired: Bringing up politics at Thanksgiving
Wired: Bringing up DatologyAI's new text curation results at Thanksgiving

That's right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃 🧵

DatologyAI (@datologyai)'s Twitter Profile Photo

Come on over to our booth to grab some delicious Data Future cookies and pick up a fun DatologyAI-branded fidget cube! You can find us at booth 303, right next to the entrance. We can't wait to see you!! #NeurIPS2024

Bogdan Gaza (@hurrycane)'s Twitter Profile Photo

Definitely a paradigm shift we're still learning to navigate intelligently as an industry. Knowing when and how to use AI tools effectively is becoming essential.

Hamel Husain (@hamelhusain)'s Twitter Profile Photo

Only a matter of time before the job title "AI Scientist" emerges

- Better than most AI Engineers at Evals, statistics, Data Analysis, Error Analysis, A/B testing, etc.
- Better than most Software Engineers at AI Engineering

(😅 I hope we don't need another job title)

Chris Albon (@chrisalbon)'s Twitter Profile Photo

AI Scientist (n.): Person who is better at data science than any AI engineer and better at AI engineering than any data scientist.

Ricardo Monti (@ricardomonti9)'s Twitter Profile Photo

DatologyAI is back: state-of-the-art CLIP model performance using data curation alone 🚀

✅ State-of-the-art ViT-B/32 performance: ImageNet-1k 76.9% vs. 74% reported by SigLIP2
✅ 8x training efficiency gains
✅ 2x inference efficiency gains
✅ Public model release

Details in

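For anyone who wants to try a ViT-B/32 CLIP model on the kind of zero-shot classification that ImageNet-1k scores measure, the open_clip library makes it a few lines. The checkpoint name, image path, and three-class label list below are placeholders, not the model released here:

```python
import torch
import open_clip
from PIL import Image

# Placeholder checkpoint; swap in whichever ViT-B-32 weights you are evaluating.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classes = ["dog", "cat", "goldfish"]  # stand-in for the full ImageNet-1k label set
text = tokenizer([f"a photo of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between image and prompts, softmaxed into probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```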
Matthew Leavitt (@leavittron)'s Twitter Profile Photo

The team absolutely crushed it here. They blew away nearly every CLIP baseline, and matched or exceeded SigLIP2 (which uses a slew of training algorithm improvements) on a number of benchmarks. USING. DATA. CURATION. ONLY. I'm so proud of Ricardo Monti, Haoli Yin,

Luke Merrick (@lukemerrick_)'s Twitter Profile Photo

Although it is no secret that the "secret sauce" underpinning every high-quality model is high-quality training data, I honestly didn't expect to match SigLIP2 without borrowing at least a few of its training tricks. Impressive to see how big the data-quality effect is!

David Crawshaw (@davidcrawshaw)'s Twitter Profile Photo

Based on feedback from my latest blog post, I am not alone in feeling the pain around code review. Before LLMs, it was already not doing what it said on the tin. Now those problems are amplified. Some big question marks here about how we should work.

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

Our customers needed a better base model <10B parameters. We spent the last 5 months building one. I'm delighted to share a preview of our first Arcee Foundation Model: AFM-4.5B-Preview.

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

We teamed up with DatologyAI to build what we believe is the strongest pretraining corpus in the world—and I truly think we nailed it. Their team was absolutely key to the model’s success. We started with ~23T tokens of high-quality data and distilled it down to 6.58T through
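
The thread doesn't spell out how ~23T tokens were distilled down to 6.58T, and the actual pipeline is proprietary, but a common ingredient of this kind of curation is model-based quality filtering: score each document with a cheap classifier and keep only what clears a threshold. A generic sketch of that one step, with an invented Doc type and scorer interface:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class Doc:
    text: str
    n_tokens: int

def quality_filter(
    docs: Iterable[Doc],
    score: Callable[[str], float],  # e.g. a fastText or small-LM quality classifier
    threshold: float = 0.5,         # illustrative cutoff, not a known setting
) -> Iterator[Doc]:
    """Yield only documents whose quality score clears the threshold."""
    for doc in docs:
        if score(doc.text) >= threshold:
            yield doc

# Usage sketch: a real pipeline would stream sharded web data and combine
# many such filters (dedup, heuristics, domain mixing) before pretraining.
corpus = [Doc("clean, informative prose...", 120), Doc("kw spam kw spam", 40)]
kept = list(quality_filter(corpus, score=lambda t: 0.1 if "spam" in t else 0.9))
print(sum(d.n_tokens for d in kept), "tokens kept")
```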

Ari Morcos (@arimorcos)'s Twitter Profile Photo

Congratulations to our friends and partners Arcee.ai on the release of AFM-4.5B! With data powered by DatologyAI, this model outperforms Gemma3-4B and is competitive with Qwen3-4B despite being trained on a fraction of the data.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

This trend will only continue. Training your own model doesn't need to cost 10s of millions, especially in specialized domains. Better data is a compute multiplier, and DatologyAI's mission is to make this easy, massively reducing the cost and difficulty of training.