Avner May (@avnermay)'s Twitter Profile
Avner May

@avnermay

Staff Research Scientist at together.ai.

Formerly research scientist at Google, postdoc at Stanford, and PhD student at Columbia.

ID: 41314685

Link: https://avnermay.github.io/ | Joined: 20-05-2009 07:16:11

34 Tweets

249 Followers

238 Following

Together AI (@togethercompute):

Excited to announce our new speculative decoding method, Sequoia! Sequoia scales speculative decoding to very large speculation budgets, is robust to different decoding configurations, and can adapt to different hardware. Serve Llama2-70B on a single RTX 4090 at roughly half a second per token.
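
For readers new to the technique, here is a minimal, self-contained sketch of the draft-then-verify loop that speculative decoding is built on. It is a generic illustration with toy stand-in models (`draft_dist`, `target_dist`, and the tiny vocabulary are assumptions, not Together's code); Sequoia's actual contribution, tree-structured speculation sized to the hardware, is not shown here.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def draft_dist(context):
    # Toy "small" model: a nearly uniform guess over the next token.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_dist(context):
    # Toy "large" model: strongly prefers continuing a fixed phrase.
    phrase = ["the", "cat", "sat", "on", "the", "mat"]
    nxt = phrase[len(context) % len(phrase)]
    probs = {t: 0.02 for t in VOCAB}
    probs[nxt] = 1.0 - 0.02 * (len(VOCAB) - 1)
    return probs

def sample(dist):
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # numerical edge case: fall back to the last token

def speculative_step(context, num_draft=4):
    """One draft-then-verify step; returns the tokens actually emitted."""
    # 1) The cheap draft model proposes `num_draft` tokens autoregressively.
    drafted, q_dists, ctx = [], [], list(context)
    for _ in range(num_draft):
        q = draft_dist(ctx)
        tok = sample(q)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) The target model scores every drafted position (a single parallel pass in practice).
    p_dists = [target_dist(list(context) + drafted[:i]) for i in range(num_draft + 1)]

    # 3) Accept each drafted token with prob min(1, p/q); on the first rejection,
    #    resample from the residual distribution max(p - q, 0) and stop.
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if random.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
        else:
            residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
            z = sum(residual.values()) or 1.0
            emitted.append(sample({t: v / z for t, v in residual.items()}))
            return emitted

    # 4) Every draft was accepted, so the target's extra distribution yields one bonus token.
    emitted.append(sample(p_dists[-1]))
    return emitted

if __name__ == "__main__":
    context, generated = ["the"], []
    while len(generated) < 10:
        generated += speculative_step(context + generated)
    print(" ".join(context + generated))
```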

Michael Poli (@michaelpoli6):

📢New research on mechanistic architecture design and scaling laws.

- We perform the largest scaling laws analysis (500+ models, up to 7B) of beyond Transformer architectures to date

- For the first time, we show that architecture performance on a set of isolated token…

Together AI (@togethercompute):

🚀Excited to be recognized by FORTUNE for a second year in their Top 50 AI Startups list! We have come so far in the past year, and a huge thank you goes to the more than 60,000 developers now building on the Together API. Thank you!

Together AI (@togethercompute):

We are thrilled to be a launch partner for Meta Llama 3. Experience Llama 3 now at up to 350 tokens per second for Llama 3 8B and up to 150 tokens per second for Llama 3 70B, running in full FP16 precision on the Together API! 🤯 together.ai/blog/together-…

Together AI (@togethercompute):

We are excited to introduce SpecExec, a speculative decoding method for accelerating inference of offloaded LLMs! SpecExec applies classical ideas from speculative execution to LLM inference, leveraging a powerful draft model to construct a tree of the most likely token…
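
As a rough illustration of the "draft tree" idea (a toy sketch under my own assumptions, not the SpecExec implementation; `draft_next_token_probs` is a placeholder), a best-first expansion over the draft model's probabilities grows the tree toward the most likely continuations, which the target model can then verify in one batched pass:

```python
import heapq

def draft_next_token_probs(prefix):
    # Placeholder for a real draft model; returns {token: probability} for the prefix.
    vocab = ["a", "b", "c"]
    return {t: 1.0 / len(vocab) for t in vocab}

def build_draft_tree(prompt, budget=8):
    """Best-first expansion: always extend the node with the highest cumulative
    probability, so the tree covers the most likely continuations within the budget."""
    frontier = [(-1.0, prompt)]   # max-heap via negated cumulative probability
    tree = []                     # (parent prefix, token, cumulative probability)
    while frontier and len(tree) < budget:
        neg_p, prefix = heapq.heappop(frontier)
        for tok, q in draft_next_token_probs(prefix).items():
            p = -neg_p * q
            tree.append((prefix, tok, p))
            heapq.heappush(frontier, (-p, prefix + (tok,)))
    return tree[:budget]

if __name__ == "__main__":
    for prefix, tok, p in build_draft_tree(("the",), budget=6):
        print(" ".join(prefix), "->", tok, f"(p={p:.3f})")
```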

Together AI (@togethercompute):

Today we are announcing a new inference stack, which provides decoding throughput 4x faster than open-source vLLM. We are also introducing new Together Turbo and Together Lite endpoints that enable performance, quality, and price flexibility so you do not have to compromise.

Sasha Rush (@srush_nlp):

The Mamba in the Llama: arxiv.org/abs/2408.15237 RNNs are neat. Here's a video describing how to make them work really well with little money: youtube.com/watch?v=A5ff8h… (by Junxiong Wang and Daniele Paliotta)

Tri Dao (@tri_dao):

We made distillation and spec decoding work with Mamba (and linear RNNs in general)! Up to 300 tok/sec for 7B🚀. Spec dec is nontrivial because there's no KV cache to backtrack if some tokens aren't accepted, but there's an efficient hardware-aware algorithm to recompute the SSM states.
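
To make the "no KV cache to backtrack" point concrete, here is a minimal sketch (a toy scalar recurrence with illustrative names only, not the actual distilled-Mamba kernels): after verification accepts only part of a draft, the recurrent state has to be recomputed, or checkpointed, from the last verified position, rather than truncated the way a Transformer's KV cache would be.

```python
# Toy scalar SSM: the entire sequence history lives in one recurrent state h,
# so rejected draft tokens cannot simply be "dropped from the cache".
A, B = 0.9, 0.5

def step(h, x):
    return A * h + B * x

def advance(h, tokens):
    # Replay tokens from a known-good state to rebuild the recurrent state.
    for x in tokens:
        h = step(h, x)
    return h

def speculative_advance(h_verified, drafted, num_accepted):
    """After the target model accepts only the first `num_accepted` drafted tokens,
    roll the SSM state forward over just that accepted prefix.

    A Transformer would truncate its KV cache instead; here we recompute
    (or checkpoint per position) because the state is a running summary."""
    return advance(h_verified, drafted[:num_accepted])

if __name__ == "__main__":
    h_after_prompt = 0.0
    drafted_inputs = [1.0, 2.0, 3.0, 4.0]   # toy inputs proposed by the draft model
    # Suppose verification accepted only 2 of the 4 drafted tokens:
    h = speculative_advance(h_after_prompt, drafted_inputs, num_accepted=2)
    print(h)  # state now reflects exactly the accepted prefix and nothing else
```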

Avner May (@avnermay):

Excited to share our latest work, where we show how to distill from a Llama model into a Mamba hybrid, and how to make speculative decoding work with these models! With Junxiong Wang, Daniele Paliotta, Sasha Rush, Tri Dao.

Cartesia (@cartesia_ai):

We’re pumped to see our Chief Scientist, Albert Gu, on the TIME 100 AI list today!🚀 We’re grateful to have Albert leading the SSM revolution here at Cartesia ⚡

Tri Dao (@tri_dao):

Surprisingly, speculative decoding works well not just for small-batch LLM inference but also for large batches and long contexts. Once we understood the compute & memory profile of LLM inference, the new spec dec algorithms fall out naturally.
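
A rough back-of-envelope sketch (all hardware and model numbers below are my own assumptions, not from the thread) shows why: once contexts get long, reading the KV cache, not the weights, dominates memory traffic even at large batch sizes, so decoding stays bandwidth-bound and speculation still has room to help.

```python
GB = 1e9

# Assumed hardware: ~3.3 TB/s of HBM bandwidth (H100-class GPU).
hbm_bandwidth = 3.3e12                      # bytes / second

# Assumed model: Llama-2-70B-like, FP16 weights, GQA with 8 KV heads, 80 layers.
weight_bytes = 70e9 * 2                     # ~140 GB of weights
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2   # (K+V) * layers * kv_heads * head_dim * fp16
                                            # ~0.33 MB per token per sequence

def bytes_per_decode_step(batch, context_len):
    # Each decode step streams the weights once plus every sequence's KV cache.
    return weight_bytes + batch * context_len * kv_bytes_per_token

for batch, ctx in [(1, 4_096), (64, 4_096), (64, 32_768)]:
    total = bytes_per_decode_step(batch, ctx)
    kv_share = batch * ctx * kv_bytes_per_token / total
    print(f"batch={batch:>3} ctx={ctx:>6}: ~{total / GB:5.0f} GB/step, "
          f"KV cache = {kv_share:4.0%} of traffic, "
          f"~{hbm_bandwidth / total:5.1f} steps/s if purely bandwidth-bound")
```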

Beidi Chen (@beidichen):

🥳Promised blogpost+tweet about MagicDec-1.0🪄🪄🪄 2.0 coming soon😉: How can we achieve Lossless, High Throughput, and Low Latency LLM Inference all at once? Seems too good to be true? Introducing MagicDec-1.0🪄, a Speculative Decoding (SD) based technique that can improve…

Together AI (@togethercompute):

AI at Meta 🙌 We love that Llama has gone multimodal! We're excited to partner with AI at Meta to offer free access to the Llama 3.2 11B vision model for developers. Can't wait to see what everyone builds! Try now with our Llama-Vision-Free model endpoint. Sign up here:

Together AI (@togethercompute):

🚀 Big news! We’re thrilled to announce the launch of Llama 3.2 Vision Models & Llama Stack on Together AI. 🎉 Free access to Llama 3.2 Vision Model for developers to build and innovate with open source AI. api.together.ai/playground/cha… ➡️ Learn more in the blog

Avner May (@avnermay):

Excited that Together has just released TogetherChat! TogetherChat lets anyone ask questions of DeepSeek-R1 and other leading open-source models, with an awesome UI and a user experience very similar to ChatGPT or Claude. Check it out at

Albert Gu (@_albertgu):

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
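
As a toy illustration only (this is not the architecture from the paper; the boundary scorer and pooling below are placeholder assumptions), "dynamic chunking" can be pictured as a learned boundary predictor deciding where low-level units end, with each chunk then pooled into a single higher-level representation:

```python
def boundary_scores(byte_seq):
    # Placeholder for a learned boundary predictor; here we simply mark spaces.
    return [1.0 if b == ord(" ") else 0.0 for b in byte_seq]

def dynamic_chunk(byte_seq, threshold=0.5):
    """Split the raw byte stream wherever the boundary score crosses the threshold,
    instead of relying on a fixed, externally trained tokenizer."""
    chunks, current = [], []
    for b, score in zip(byte_seq, boundary_scores(byte_seq)):
        current.append(b)
        if score >= threshold:
            chunks.append(bytes(current))
            current = []
    if current:
        chunks.append(bytes(current))
    return chunks

def pool(chunk):
    # Placeholder for pooling a chunk's byte embeddings into one higher-level vector.
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    text = b"the cat sat on the mat"
    chunks = dynamic_chunk(text)
    print([c.decode() for c in chunks])         # dynamically grouped low-level units
    print([round(pool(c), 1) for c in chunks])  # one higher-level summary per chunk
```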
