Pratyush Maini (@pratyushmaini)'s Twitter Profile
Pratyush Maini

@pratyushmaini

Data Quality x Privacy | PhD student @mldcmu | Founding Member @datologyai | Prev. Comp Sc @iitdelhi

🦋: bsky.app/profile/pratyu…

ID: 1191440736517939200

Website: http://pratyushmaini.github.io

Joined: 04-11-2019 19:43:22

562 Tweets

1.1K Followers

418 Following

Ari Morcos (@arimorcos)

Today, we introduce BeyondWeb, our synthetic data generation approach which significantly outperforms all open synthetic data and is a key component of our curation pipeline. Many talk about “doing” synthetic data, but generating high-quality synthetic data at scale is extremely

Matthew Leavitt (@leavittron)

Very excited to announce BeyondWeb, @datologyAI’s synthetic pretraining data generation paradigm. BeyondWeb is a rephrasing-based approach that substantially outperforms existing public synthetic pretraining data baselines, and is a core part of our curation pipeline.


Vineeth (@vineethdorna)

Big day for the DatologyAI team! We introduce BeyondWeb, scaling synthetic data for trillion-scale pretraining! ✨ Collect more → curate better synthetic → win big ✨Not all synthetic data is created equal → doing it right pays off. ✨ With only targeted synthetic data

Siddharth Joshi (@sjoshi804)

As we hit the limits of real web-scale data, DatologyAI's synthetic data shows how we can leverage the models we've already trained to squeeze even more value out of limited data! Huge props to Pratyush Maini for leading this work 👏🚀

Rishabh Adiga (@rishabhadiga01)

Thrilled to see BeyondWeb launched 🚀 Phenomenal insights and a huge step forward for scaling high-quality synthetic data to trillions of tokens. Amazing work by Pratyush Maini and the DatologyAI team - super excited to be learning from you all!

Sarah Catanzaro (@sarahcat21)

For years, researchers have known that synthetic data is valuable; but not all synthetic data is created equally. Generating high quality synthetic data not only requires specific research expertise it also requires you to deeply understand [your] data. These key learnings are a

Amro (@amrokamal1997)

After months of development, we finally share with the world some hard-earned science behind synthetic data. DatologyAI presents BeyondWeb — a SOTA approach showing how thoughtful synthetic data design can beat strong baselines.

Lucas Atkins (@lucasatkins7)

The last two days have been a whirlwind, and I haven’t had a chance to read this end to end - though I did see an early draft - let alone comment. I’m one of the few people outside DatologyAI fortunate enough to have seen these results firsthand, and everyone can experience