
Sinclair Wang
@sinclairwang1
PhDing @sjtu1896 #NLProc
Working on Data Engineering for LLMs: MathPile (2023), ProX (2024),
MegaMath (2025)
ID: 1326804636683022338
12-11-2020 08:31:09
1.1K Tweets
1.1K Followers
2.2K Following

Emad Nathan Lambert Processing all of CommonCrawl is about $20-50k [0], plus maybe 10-50k H100 if you wanna do GPU classification [1]. You can extract 1T tokens from PDFs for around $10k [2]. The major expenses are synth data and verifying which of your approaches work [3].


MegaMath has been accepted to Conference on Language Modeling 2025 🥳 Hoping you find our data useful!




Excited to share that our two papers have been accepted to #ICML2025 (ICML Conference)! However, I can't be there in person due to visa issues. What a pity. 🥲 Feel free to check out our poster, either online or in person at the Vancouver Convention Center. Programming Every Example:




When building MegaScience, we learned the hard way: Strong datasets need strong proxy models. Our data was too spicy for small models like Qwen2.5-1.5B & 3B, they just flopped. But once we tried Qwen3-14B and 30B... boom, everything clicked. Kinda terrifying to think: if



Failing on large-scale RL with VeRL? ⚠️ Mixing inference backend (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy, even if they share the same weights! Blog:
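One rough way to see the mismatch described above is to compare the log-probabilities that the two backends assign to the same sampled tokens. The sketch below is an illustration only: the function names and the log-prob values are hypothetical, and it is not taken from the blog or from VeRL. A per-token ratio of 1.0 means both backends agree; anything else means the training backend is effectively updating on samples drawn from a slightly different policy.

```python
import math

def importance_ratios(train_logprobs, rollout_logprobs):
    """Per-token ratio pi_train(token) / pi_rollout(token).

    Both inputs are log-probabilities of the SAME sampled tokens,
    one list from the training backend, one from the inference backend.
    """
    return [math.exp(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]

def mean_abs_logprob_gap(train_logprobs, rollout_logprobs):
    """Average absolute log-prob difference, a simple mismatch summary."""
    gaps = [abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]
    return sum(gaps) / len(gaps)

# Hypothetical numbers, made up for illustration only.
rollout_lp = [-0.10, -1.20, -0.05, -2.30]  # as reported by the inference backend
train_lp = [-0.12, -1.05, -0.05, -2.90]    # same tokens, training backend

ratios = importance_ratios(train_lp, rollout_lp)
gap = mean_abs_logprob_gap(train_lp, rollout_lp)
print([round(r, 3) for r in ratios], round(gap, 3))
```

If the ratios stay close to 1.0 the rollout is effectively on-policy; persistent deviations are the silent off-policy effect the tweet warns about.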


1. npx @qwen-code/[email protected]
2. get 2000 free calls/day via Qwen Chat
quick math: let's suppose avg agentic interaction ≈ 32k context
2000 × 32k ≈ 64 million tokens/day
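The quick math above can be checked in a couple of lines. This is a minimal sketch that assumes "32k" means 32,000 tokens per call and that every one of the 2000 free calls uses the full context; the function name is just illustrative.

```python
def daily_token_budget(calls_per_day: int, tokens_per_call: int) -> int:
    """Upper bound on tokens/day if every call uses the full context."""
    return calls_per_day * tokens_per_call

# Assumption: "32k" is taken as 32,000 tokens (not 32,768).
budget = daily_token_budget(calls_per_day=2000, tokens_per_call=32_000)
print(f"{budget:,} tokens/day")
```

Reading "32k" as 32,768 instead gives about 65.5 million, so the estimate holds either way.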

