Shashank Rajput (@shashank_r12) 's Twitter Profile
Shashank Rajput

@shashank_r12

LLM Pretraining @DbrxMosaicAI

ID: 1955982469

Link: https://shashankrajput.github.io/ · Joined: 12-10-2013 06:44:48

191 Tweets

761 Followers

599 Following

Jonathan Frankle (@jefrankle) 's Twitter Profile Photo

I guess I should probably include some images, since this is an image generation model. I'm so proud of Cory Stephenson, Landan Seguin, Austin Jacobson, jasmine collins, and our extraordinary collaborators at Shutterstock.

jasmine collins (@jazco) 's Twitter Profile Photo

today we're announcing our Databricks Mosaic Research x Shutterstock partnership, and a new text-to-image diffusion model: ✨ImageAI!!✨ this model is geared towards enterprise use cases and is trained exclusively on shutterstock's trusted data catalog! databricks.com/company/newsro…

Rishab Parthasarathy (@rishab_partha) 's Twitter Profile Photo

We are excited to announce Vid3D, a technique for generating 3D video using only 2D video diffusion models and Gaussian splatting!

Paper: arxiv.org/abs/2406.11196
Github: github.com/rishab-partha/…
Project Page: rishab-partha.github.io/Vid3D

Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities by Finetuning on Synthetic Data

TLDR: FT'ing on randint key-value retrieval tasks improves LLM perf on real retrieval tasks
arxiv.org/abs/2406.19292

Great project led by Zheyang Xiong & Vasilis Papageorgiou

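A minimal sketch of what one such synthetic sample could look like (the prompt format, dictionary size, and integer ranges here are made up for illustration, not taken from the paper):

    import random

    def make_kv_retrieval_sample(num_pairs=32, seed=None):
        """One synthetic example: a dict of random integer key-value pairs
        plus a question asking for the value of a single key."""
        rng = random.Random(seed)
        keys = rng.sample(range(10_000, 100_000), num_pairs)
        pairs = {k: rng.randint(10_000, 99_999) for k in keys}
        query_key = rng.choice(keys)
        prompt = (
            "Below is a dictionary. Answer the question about it.\n"
            f"{pairs}\n"
            f"What is the value associated with key {query_key}?"
        )
        return {"prompt": prompt, "answer": str(pairs[query_key])}

    sample = make_kv_retrieval_sample(seed=0)
    print(sample["prompt"][-80:], "->", sample["answer"])
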
Sasha Doubov (@sashadoubov) 's Twitter Profile Photo

big shoutout to Nikhil and Jacob Portes for spearheading this work on scaling laws that account for inference costs. Come say hi to us at ICML :)))
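
For context, the rule-of-thumb FLOP accounting this line of work builds on: training costs roughly 6·N·D FLOPs (N parameters, D tokens) and inference roughly 2·N FLOPs per generated token, so the compute-optimal model size shifts once lifetime inference demand is counted. A toy illustration with made-up demand numbers, not figures from the paper:

    def lifetime_flops(n_params, train_tokens, inference_tokens):
        """Rule-of-thumb accounting: ~6*N*D to train, ~2*N per inference token."""
        return 6 * n_params * train_tokens, 2 * n_params * inference_tokens

    # With heavy inference demand, the smaller model costs more FLOPs to train
    # here but far less to serve, so its lifetime total comes out lower.
    for n, d in [(70e9, 2e12), (13e9, 12e12)]:
        train, infer = lifetime_flops(n, d, inference_tokens=5e12)
        print(f"N={n:.0e} D={d:.0e}: train={train:.2e} infer={infer:.2e} total={train + infer:.2e}")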

Sasha Doubov (@sashadoubov) 's Twitter Profile Photo

some notes from the paper!
- 405B trained on 15.6T tokens, 3.8e25 flops
- use SFT, rejection sampling and DPO
- annealing is used to judge quality of domain-specific data (s/o dbrx paper)
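
Those numbers line up with the standard compute approximation FLOPs ≈ 6·N·D; a quick sanity check:

    # Sanity check of the quoted training compute: FLOPs ~ 6 * params * tokens
    n_params = 405e9   # 405B parameters
    tokens = 15.6e12   # 15.6T training tokens
    print(f"{6 * n_params * tokens:.2e}")  # -> 3.79e+25, matching the quoted 3.8e25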

Abhay Gupta (@gupta__abhay) 's Twitter Profile Photo

The new Llama-3.1 base models are pretty much the same as the old ones, barring the multilingual + extended context length capabilities. Ran a quick cosine similarity check on projection matrices. Here are some examples from the 8B model

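A minimal sketch of that kind of check (the checkpoint IDs and the "proj" name filter are assumptions, and loading two 8B checkpoints needs a lot of memory):

    import torch
    from transformers import AutoModelForCausalLM

    # Assumed Hugging Face IDs for the old and new base checkpoints.
    old = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16)
    new = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16)

    old_params = dict(old.named_parameters())
    for name, p_new in new.named_parameters():
        if "proj" in name:  # q/k/v/o attention and MLP projection matrices
            sim = torch.nn.functional.cosine_similarity(
                old_params[name].flatten().float(), p_new.flatten().float(), dim=0
            )
            print(f"{name}: {sim.item():.4f}")  # ~1.0 means near-identical weights
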
Shashank Rajput (@shashank_r12) 's Twitter Profile Photo

Performed gradient ascent from a random starting point in SF and ended up at Jones St & Sacramento St. Wondering if this is a global maximum as well, or should I have added some stochasticity lol

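(For the literal version of the joke: vanilla gradient ascent stops at the first local maximum it reaches, and injecting noise into the updates, as in Langevin dynamics or SGD, gives it a chance to hop out. A toy sketch on a bumpy 1D "terrain":)

    import math, random

    def elevation(x):
        # Toy terrain: one broad hill with local bumps (many local maxima).
        return -0.05 * x**2 + math.sin(3 * x)

    def grad(x, eps=1e-5):
        # Numerical derivative of the elevation.
        return (elevation(x + eps) - elevation(x - eps)) / (2 * eps)

    random.seed(0)
    x = random.uniform(-10, 10)
    best_x, best_h = x, elevation(x)
    for _ in range(5000):
        x += 0.05 * grad(x) + random.gauss(0, 0.1)  # the added stochasticity
        if elevation(x) > best_h:
            best_x, best_h = x, elevation(x)
    print(f"best point found: x={best_x:.2f}, elevation={best_h:.2f}")
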
Alex Dimakis (@alexgdimakis) 's Twitter Profile Photo

Excited to launch the first model from our startup: Bespoke Labs. Bespoke-Minicheck-7B is a grounded factuality checker: super lightweight and fast. Outperforms all big foundation models including Claude 3.5 Sonnet, Mistral-Large m2 and GPT 4o, and it's only 7B. Also, I want to

Databricks Mosaic Research (@dbrxmosaicai) 's Twitter Profile Photo

How well do the latest long context LLMs (Llama-3.1-405b, GPT-4o-mini and Claude-3.5-sonnet) perform on RAG? We benchmarked 13 popular OSS and commercial models on context lengths from 2k to 125k, and the results are very interesting! Full post: databricks.com/blog/long-cont…

Eitan Turok (@eitanturok) 's Twitter Profile Photo

Sharing is caring, especially among KV-caches!

Introducing MixAttention, which shares KV-caches between global and sliding window attention. MixAttention has
* ~2.5x less memory consumption
* ~2x faster inference speed
without sacrificing performance.
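
A minimal sketch of the cache-sharing idea (not the actual MixAttention code): one layer writes a KV cache, and a later sliding-window layer reads the same cache instead of storing its own, so KV memory is paid once for both.

    import torch
    import torch.nn.functional as F

    def attend(q, kv_cache, window=None):
        """Single-head causal attention over a (possibly shared) KV cache.
        If `window` is set, each query only sees the last `window` keys."""
        k, v = kv_cache
        scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
        T = scores.shape[-1]
        i = torch.arange(T)[:, None]
        j = torch.arange(T)[None, :]
        allowed = (j <= i) if window is None else (j <= i) & (i - j < window)
        return F.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1) @ v

    T, d = 16, 8
    kv_cache = (torch.randn(T, d), torch.randn(T, d))  # written once by a global layer
    out_global = attend(torch.randn(T, d), kv_cache)             # full causal attention
    out_window = attend(torch.randn(T, d), kv_cache, window=4)   # reuses the same cache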

Databricks Mosaic Research (@dbrxmosaicai) 's Twitter Profile Photo

At Databricks, we want to help customers build more #inference-friendly #llms. With MixAttention architecture, you can maintain model quality while improving inference speed and reducing memory footprint: databricks.com/blog/mixattent…

Shashank Rajput (@shashank_r12) 's Twitter Profile Photo

You can now finetune Llama 3.1 models on 131K context length using our optimized stack that uses Sequence Parallelism for training and Provisioned Throughput for serving!
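
The core idea behind sequence parallelism, in the simplest possible sketch (real implementations also exchange activations between ranks so attention can see the whole sequence):

    import torch

    seq_len, hidden, n_gpus = 131_072, 64, 8  # small hidden size to keep the demo light

    # One very long example; activations for all 131K positions won't fit on
    # one GPU, so the sequence dimension itself is sharded across devices.
    x = torch.randn(1, seq_len, hidden)
    shards = torch.chunk(x, n_gpus, dim=1)
    print([tuple(s.shape) for s in shards])  # 8 shards of (1, 16384, 64)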