Ari Morcos (@arimorcos)'s Twitter Profile
Ari Morcos

@arimorcos

CEO and Co-founder @datologyai working to make it easy for anyone to make the most of their data. Former: RS @AIatMeta (FAIR), RS @DeepMind, PhD @PiN_Harvard.

ID: 29907525

Link: http://www.datologyai.com · Joined: 09-04-2009 03:18:43

1.1K Tweets

6.6K Followers

1.1K Following

Jiaxin Wen @ICLR2025 (@jiaxinwen22)'s Twitter Profile Photo

Want to clarify some common misunderstandings:
- This paper is about elicitation, not self-improvement.
- We're not adding new skills; humans typically can't teach models anything superhuman during post-training.
- We are most surprised by the reward modeling results. Unlike …

Mark Ibrahim (@marksibrahim)'s Twitter Profile Photo

A good language model should say “I don’t know” by reasoning about the limits of its knowledge. Our new work AbstentionBench carefully measures this overlooked skill in leading models, in an open codebase others can build on!

We find frontier reasoning degrades models’ ability to …

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

Our customers needed a better base model under 10B parameters. We spent the last 5 months building one. I'm delighted to share a preview of our first Arcee Foundation Model: AFM-4.5B-Preview.

Lucas Atkins (@lucasatkins7)'s Twitter Profile Photo

We teamed up with DatologyAI to build what we believe is the strongest pretraining corpus in the world—and I truly think we nailed it. Their team was absolutely key to the model’s success. We started with ~23T tokens of high-quality data and distilled it down to 6.58T through …

Andrew M. Dai @ ICLR (@iamandrewdai)'s Twitter Profile Photo

It turns out LLM data is more like oil than coal, if you refine it properly. Congratulations to the contributors of the many researcher-years of work!

Ari Morcos (@arimorcos)'s Twitter Profile Photo

Congratulations to our friends and partners Arcee.ai on the release of AFM-4.5B! With data powered by DatologyAI, this model outperforms Gemma3-4B and is competitive with Qwen3-4B despite being trained on a fraction of the data.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

This trend will only continue. Training your own model doesn't need to cost tens of millions, especially in specialized domains. Better data is a compute multiplier, and DatologyAI's mission is to make this easy, massively reducing the cost and difficulty of training.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

Training costs have definitely been going down, but I think there's still a massive barrier when it comes to data, especially if you want to train on proprietary datasets. At DatologyAI, we are laser-focused on changing that.

Ari Morcos (@arimorcos)'s Twitter Profile Photo

This is why automating data curation is not only necessary for scalability reasons (a human obviously can't curate trillions of tokens), but also because humans aren't actually *good* at assessing data quality.
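
As a hedged illustration of what automated data curation can look like, here is a minimal Python sketch of threshold-based quality filtering. The scorer, names, and threshold are hypothetical placeholders; the tweet does not describe DatologyAI's actual pipeline, which would use trained quality models rather than a hand-written rule like this.

    from dataclasses import dataclass

    @dataclass
    class Document:
        text: str
        quality: float = 0.0  # filled in by the automated scorer

    def score_quality(doc: Document) -> float:
        # Placeholder heuristic: penalize repetitive text. Real pipelines
        # use trained classifiers or reference-model perplexity instead.
        words = doc.text.split()
        if not words:
            return 0.0
        return len(set(words)) / len(words)

    def curate(corpus: list[Document], threshold: float = 0.5) -> list[Document]:
        # Keep only documents whose automated score clears the threshold,
        # so no human has to eyeball trillions of tokens.
        for doc in corpus:
            doc.quality = score_quality(doc)
        return [doc for doc in corpus if doc.quality >= threshold]

    docs = [Document("the cat sat on the mat"), Document("buy now buy now buy now")]
    print([d.text for d in curate(docs)])  # the repetitive spam doc is dropped

The point of the sketch is scale: once the scorer is a function, the same judgment applies uniformly across trillions of tokens, which no human review process can match.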

Matthew Leavitt (@leavittron)'s Twitter Profile Photo

It depends on how much you know about what you're using your model for. You want your data to be as similar to your test distribution as possible. In practice, benchmarks are an incomplete description of your true test distribution, so you want to hedge diversity vs. …
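
A minimal sketch of the match-the-test-distribution idea with a diversity hedge, assuming precomputed document embeddings. The function name, the 20% hedge fraction, and the centroid-similarity criterion are illustrative assumptions, not a recipe documented in this thread.

    import numpy as np

    def select_data(train_emb: np.ndarray, target_emb: np.ndarray,
                    n_keep: int, hedge: float = 0.2) -> np.ndarray:
        # Rank training docs by cosine similarity to the centroid of the
        # (benchmark/test) distribution we want to match.
        centroid = target_emb.mean(axis=0)
        sims = train_emb @ centroid / (
            np.linalg.norm(train_emb, axis=1) * np.linalg.norm(centroid) + 1e-8)
        ranked = np.argsort(-sims)
        # Most of the budget goes to distribution-matched docs...
        n_match = int(n_keep * (1 - hedge))
        matched = ranked[:n_match]
        # ...but a hedge fraction is sampled uniformly from the remainder,
        # since benchmarks underdescribe the true test distribution.
        rng = np.random.default_rng(0)
        diverse = rng.choice(ranked[n_match:], size=n_keep - n_match, replace=False)
        return np.concatenate([matched, diverse])

Pushing `hedge` toward 0 trusts your benchmarks completely; raising it buys coverage against the distribution shift the tweet is warning about.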

DatologyAI (@datologyai)'s Twitter Profile Photo

🌞 We're excited to share our "Summer of Data Seminar" series at DatologyAI!

We're hosting weekly sessions with brilliant researchers diving deep into pretraining, data curation, and everything that makes datasets tick.

Are you data-obsessed yet? 🤓

Thread 👇

Chai Discovery (@chaidiscovery)'s Twitter Profile Photo

We’re excited to introduce Chai-2, a major breakthrough in molecular design. Chai-2 enables zero-shot antibody discovery in a 24-well plate, exceeding previous SOTA by >100x. Thread 👇