Ari Morcos (@arimorcos)'s Twitter Profile
Ari Morcos

@arimorcos

CEO and Co-founder @datologyai working to make it easy for anyone to make the most of their data. Former: RS @AIatMeta (FAIR), RS @DeepMind, PhD @PiN_Harvard.

ID: 29907525

Link: http://www.datologyai.com · Joined: 09-04-2009 03:18:43

1.1K Tweets

6.6K Followers

1.1K Following

Jiaxin Wen @ICLR2025 (@jiaxinwen22)'s Twitter Profile Photo

Want to clarify some common misunderstandings:
- This paper is about elicitation, not self-improvement.
- We're not adding new skills; humans typically can't teach models anything superhuman during post-training.
- We are most surprised by the reward modeling results. Unlike …

Mark Ibrahim (@marksibrahim)'s Twitter Profile Photo

A good language model should say “I don’t know” by reasoning about the limits of its knowledge. Our new work AbstentionBench carefully measures this overlooked skill in leading models, in an open codebase others can build on!

We find frontier reasoning degrades models’ ability to …

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

Our customers needed a better base model under 10B parameters. We spent the last 5 months building one. I'm delighted to share a preview of our first Arcee Foundation Model: AFM-4.5B-Preview.

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

We teamed up with DatologyAI to build what we believe is the strongest pretraining corpus in the world—and I truly think we nailed it. Their team was absolutely key to the model’s success. We started with ~23T tokens of high-quality data and distilled it down to 6.58T through …

Andrew M. Dai @ ICLR (@iamandrewdai)'s Twitter Profile Photo

It turns out LLM data is more like oil than coal, if you refine it properly. Congratulations to the contributors of the many researcher-years of work!

Ari Morcos (@arimorcos)'s Twitter Profile Photo

Congratulations to our friends and partners Arcee.ai on the release of AFM-4.5B! With data powered by DatologyAI, this model outperforms Gemma3-4B and is competitive with Qwen3-4B despite being trained on a fraction of the data.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

This trend will only continue. Training your own model doesn't need to cost tens of millions, especially in specialized domains. Better data is a compute multiplier, and DatologyAI's mission is to make this easy, massively reducing the cost and difficulty of training.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

Training costs have definitely been going down, but I think there's still a massive barrier when it comes to data, especially if you want to train on proprietary datasets. At DatologyAI, we are laser-focused on changing that.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

This is why automating data curation is not only necessary for scalability reasons (a human obviously can't curate trillions of tokens), but also because humans aren't actually *good* at assessing data quality.
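
As a hedged illustration of what automated data curation can look like, here is a minimal Python sketch of threshold-based quality filtering. The scorer, names, and threshold are hypothetical placeholders; the tweet does not describe DatologyAI's actual pipeline, which would use trained quality models rather than a hand-written rule like this.

    from dataclasses import dataclass

    @dataclass
    class Document:
        text: str
        quality: float = 0.0  # filled in by the automated scorer

    def score_quality(doc: Document) -> float:
        # Placeholder heuristic: penalize repetitive text. Real pipelines
        # use trained classifiers or reference-model perplexity instead.
        words = doc.text.split()
        if not words:
            return 0.0
        return len(set(words)) / len(words)

    def curate(corpus: list[Document], threshold: float = 0.5) -> list[Document]:
        # Keep only documents whose automated score clears the threshold,
        # so no human has to eyeball trillions of tokens.
        for doc in corpus:
            doc.quality = score_quality(doc)
        return [doc for doc in corpus if doc.quality >= threshold]

    docs = [Document("the cat sat on the mat"), Document("buy now buy now buy now")]
    print([d.text for d in curate(docs)])  # the repetitive spam doc is dropped

The point of the sketch is scale: once the scorer is a function, the same judgment applies uniformly across trillions of tokens, which no human review process can match.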

Matthew Leavitt (@leavittron)'s Twitter Profile Photo

It depends on how much you know about what you're using your model for. You want your data to be as similar to your test distribution as possible. In practice, benchmarks are an incomplete description of your true test distribution, so you want to hedge diversity vs. …
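
A minimal sketch of the match-the-test-distribution idea with a diversity hedge, assuming precomputed document embeddings. The function name, the 20% hedge fraction, and the centroid-similarity criterion are illustrative assumptions, not a recipe documented in this thread.

    import numpy as np

    def select_data(train_emb: np.ndarray, target_emb: np.ndarray,
                    n_keep: int, hedge: float = 0.2) -> np.ndarray:
        # Rank training docs by cosine similarity to the centroid of the
        # (benchmark/test) distribution we want to match.
        centroid = target_emb.mean(axis=0)
        sims = train_emb @ centroid / (
            np.linalg.norm(train_emb, axis=1) * np.linalg.norm(centroid) + 1e-8)
        ranked = np.argsort(-sims)
        # Most of the budget goes to distribution-matched docs...
        n_match = int(n_keep * (1 - hedge))
        matched = ranked[:n_match]
        # ...but a hedge fraction is sampled uniformly from the remainder,
        # since benchmarks underdescribe the true test distribution.
        rng = np.random.default_rng(0)
        diverse = rng.choice(ranked[n_match:], size=n_keep - n_match, replace=False)
        return np.concatenate([matched, diverse])

Pushing `hedge` toward 0 trusts your benchmarks completely; raising it buys coverage against the distribution shift the tweet is warning about.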

DatologyAI (@datologyai)'s Twitter Profile Photo

🌞 We're excited to share our "Summer of Data Seminar" series at DatologyAI!

We're hosting weekly sessions with brilliant researchers diving deep into pretraining, data curation, and everything that makes datasets tick.

Are you data-obsessed yet? 🤓

Thread 👇

Chai Discovery (@chaidiscovery)'s Twitter Profile Photo

We’re excited to introduce Chai-2, a major breakthrough in molecular design. Chai-2 enables zero-shot antibody discovery in a 24-well plate, exceeding previous SOTA by >100x. Thread 👇