JosH100 (@josh_wills)'s Twitter Profile
JosH100

@josh_wills

Engineering at @datologyai; @duckdb enthusiast, ex-@slackhq

ID: 14578294

Link: https://bsky.app/profile/spite.vc · Joined: 29-04-2008 01:03:11

20K Tweets

17K Followers

1K Following

Sarah Catanzaro (@sarahcat21)'s Twitter Profile Photo

So many people are talking about the end of scaling laws this week; but while we might be seeing diminishing returns to scaling pretraining compute there are still highly impactful, empirically validated ways to improve model performance. Data curation is top among these.
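
For context, "diminishing returns" here refers to the power-law shape of compute scaling curves: loss improves by a roughly constant factor per 10x of compute, so absolute gains shrink while cost grows. A toy illustration in Python; the constants are made up for demonstration, not fitted to any published curve:

```python
# Toy illustration of diminishing returns under a power-law scaling curve
# L(C) = a * C**(-b). The constants a and b are illustrative, not fitted.
def loss(compute: float, a: float = 10.0, b: float = 0.05) -> float:
    return a * compute ** (-b)

for c in (1e21, 1e22, 1e23):
    print(f"C = {c:.0e} FLOPs -> L = {loss(c):.3f}")
# Every 10x of compute multiplies loss by the same 10**(-0.05) ~= 0.89,
# so each step buys a smaller absolute improvement at 10x the cost.
```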

Amro (@amrokamal1997)'s Twitter Profile Photo

1/n Today, we are thrilled to introduce @DatologyAI's state-of-the-art Automated Image-Text Data Curation Pipeline. CLIP models trained on datasets produced by our pipeline achieve up to a 43x speedup in training.

Bogdan Gaza (@hurrycane)'s Twitter Profile Photo

High-quality data leads to better models! At DatologyAI, we've made data curation accessible! Our curation pipeline enables training faster (up to 98% less compute), better (up to 13% higher performance), and smaller (>60% fewer parameters) models. datologyai.com/post/datologya… 1/n 🧵
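
Worth noting: the "98% less compute" figure here and the 43x speedup quoted above are two framings of the same arithmetic, since a 43x speedup means reaching the same quality with 1/43 of the compute:

```python
# A 43x training speedup and "~98% less compute" are equivalent claims.
speedup = 43
compute_fraction = 1 / speedup                       # ~0.023 of baseline compute
print(f"compute saved: {1 - compute_fraction:.1%}")  # ~97.7%, i.e. roughly 98% less
```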

Haoli Yin (@haoliyin)'s Twitter Profile Photo

Check out this great thread by one of the big researchers in Data Curation! You might recognize him from SemDeDup 😉
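
For context, SemDeDup (Abbas et al., 2023) prunes semantically near-duplicate training examples: embed each example, cluster the embeddings, then drop items within a cluster that are too similar to one already kept. A minimal sketch of that idea; the cluster count and 0.9 similarity threshold are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings: np.ndarray, n_clusters: int = 10,
             threshold: float = 0.9) -> list[int]:
    """Return indices of examples kept after semantic deduplication."""
    # Normalize rows so dot products are cosine similarities.
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embs)

    keep: list[int] = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = embs[idx] @ embs[idx].T      # pairwise cosine similarity
        removed: set[int] = set()
        for i in range(len(idx)):
            if i in removed:
                continue
            keep.append(int(idx[i]))
            # Mark later cluster members that near-duplicate item i.
            dups = np.where(sims[i, i + 1:] > threshold)[0] + i + 1
            removed.update(dups.tolist())
    return sorted(keep)
```

Clustering first confines the quadratic pairwise-similarity step to each cluster, which is what makes the approach tractable at web-corpus scale.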

Matthew Leavitt (@leavittron)'s Twitter Profile Photo

Tired: Bringing up politics at Thanksgiving
Wired: Bringing up DatologyAI's new text curation results at Thanksgiving

That's right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃 🧵

DatologyAI (@datologyai)'s Twitter Profile Photo

Come on over to our booth to grab some delicious Data Future cookies and pick up a fun DatologyAI-branded fidget cube! You can find us at booth 303, right next to the entrance. We can't wait to see you!! #NeurIPS2024

Bogdan Gaza (@hurrycane)'s Twitter Profile Photo

Definitely a paradigm shift we're still learning to navigate intelligently as an industry. Knowing when and how to use AI tools effectively is becoming essential.

Hamel Husain (@hamelhusain)'s Twitter Profile Photo

Only a matter of time before the job title "AI Scientist" emerges

- Better than most AI Engineers at Evals, statistics, Data Analysis, Error Analysis, A/B testing, etc.
- Better than most Software Engineers at AI Engineering

(😅 I hope we don't need another job title)

Chris Albon (@chrisalbon)'s Twitter Profile Photo

AI Scientist (n.): Person who is better at data science than any AI engineer and better at AI engineering than any data scientist.

Ricardo Monti (@ricardomonti9)'s Twitter Profile Photo

DatologyAI is back: state-of-the-art CLIP model performance using data curation alone 🚀

✅ State-of-the-art ViT-B/32 performance: ImageNet-1k 76.9% vs. 74% reported by SigLIP2
✅ 8x training efficiency gains
✅ 2x inference efficiency gains
✅ Public model release

Details in

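For anyone who wants to try a ViT-B/32 CLIP model on the kind of zero-shot classification that ImageNet-1k scores measure, the open_clip library makes it a few lines. The checkpoint name, image path, and three-class label list below are placeholders, not the model released here:

```python
import torch
import open_clip
from PIL import Image

# Placeholder checkpoint; swap in whichever ViT-B-32 weights you are evaluating.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classes = ["dog", "cat", "goldfish"]  # stand-in for the full ImageNet-1k label set
text = tokenizer([f"a photo of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between image and prompts, softmaxed into probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```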
Matthew Leavitt (@leavittron)'s Twitter Profile Photo

The team absolutely crushed it here. They blew away nearly every CLIP baseline, and matched or exceeded SigLIP2 (which uses a slew of training algorithm improvements) on a number of benchmarks. USING. DATA. CURATION. ONLY. I'm so proud of Ricardo Monti, Haoli Yin,

Luke Merrick (@lukemerrick_)'s Twitter Profile Photo

Although it is no secret that the "secret sauce" underpinning every high-quality model is high-quality training data, I honestly didn't expect to match SigLIP2 without borrowing at least a few of its training tricks. Impressive to see how big the data-quality effect is!

David Crawshaw (@davidcrawshaw)'s Twitter Profile Photo

Based on feedback from my latest blog post, I am not alone in feeling the pain around code review. Before LLMs, it was already not doing what it said on the tin. Now those problems are amplified. Some big question marks here about how we should work.

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

Our customers needed a better base model <10B parameters. We spent the last 5 months building one. I'm delighted to share a preview of our first Arcee Foundation Model: AFM-4.5B-Preview.

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

We teamed up with DatologyAI to build what we believe is the strongest pretraining corpus in the world—and I truly think we nailed it. Their team was absolutely key to the model’s success. We started with ~23T tokens of high-quality data and distilled it down to 6.58T through
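
The thread doesn't spell out how ~23T tokens were distilled down to 6.58T, and the actual pipeline is proprietary, but a common ingredient of this kind of curation is model-based quality filtering: score each document with a cheap classifier and keep only what clears a threshold. A generic sketch of that one step, with an invented Doc type and scorer interface:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class Doc:
    text: str
    n_tokens: int

def quality_filter(
    docs: Iterable[Doc],
    score: Callable[[str], float],  # e.g. a fastText or small-LM quality classifier
    threshold: float = 0.5,         # illustrative cutoff, not a known setting
) -> Iterator[Doc]:
    """Yield only documents whose quality score clears the threshold."""
    for doc in docs:
        if score(doc.text) >= threshold:
            yield doc

# Usage sketch: a real pipeline would stream sharded web data and combine
# many such filters (dedup, heuristics, domain mixing) before pretraining.
corpus = [Doc("clean, informative prose...", 120), Doc("kw spam kw spam", 40)]
kept = list(quality_filter(corpus, score=lambda t: 0.1 if "spam" in t else 0.9))
print(sum(d.n_tokens for d in kept), "tokens kept")
```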

Ari Morcos (@arimorcos)'s Twitter Profile Photo

Congratulations to our friends and partners Arcee.ai on the release of AFM-4.5B! With data powered by DatologyAI, this model outperforms Gemma3-4B and is competitive with Qwen3-4B despite being trained on a fraction of the data.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

This trend will only continue. Training your own model doesn't need to cost 10s of millions, especially in specialized domains. Better data is a compute multiplier, and DatologyAI's mission is to make this easy, massively reducing the cost and difficulty of training.