Avner May (@avnermay)'s Twitter Profile
Avner May

@avnermay

Staff Research Scientist at together.ai.

Formerly research scientist at Google, postdoc at Stanford, and PhD student at Columbia.

ID: 41314685

Link: https://avnermay.github.io/ | Joined: 20-05-2009 07:16:11

34 Tweets

249 Followers

238 Following

Together AI (@togethercompute):

Excited to announce our new speculative decoding method, Sequoia! Sequoia scales speculative decoding to very large speculation budgets, is robust to different decoding configurations, and can adapt to different hardware. Serve Llama2-70B on a single RTX 4090 at roughly half a second per token.
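
For readers new to the technique, here is a minimal, self-contained sketch of the draft-then-verify loop that speculative decoding is built on. It is a generic illustration with toy stand-in models (`draft_dist`, `target_dist`, and the tiny vocabulary are assumptions, not Together's code); Sequoia's actual contribution, tree-structured speculation sized to the hardware, is not shown here.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def draft_dist(context):
    # Toy "small" model: a nearly uniform guess over the next token.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_dist(context):
    # Toy "large" model: strongly prefers continuing a fixed phrase.
    phrase = ["the", "cat", "sat", "on", "the", "mat"]
    nxt = phrase[len(context) % len(phrase)]
    probs = {t: 0.02 for t in VOCAB}
    probs[nxt] = 1.0 - 0.02 * (len(VOCAB) - 1)
    return probs

def sample(dist):
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # numerical edge case: fall back to the last token

def speculative_step(context, num_draft=4):
    """One draft-then-verify step; returns the tokens actually emitted."""
    # 1) The cheap draft model proposes `num_draft` tokens autoregressively.
    drafted, q_dists, ctx = [], [], list(context)
    for _ in range(num_draft):
        q = draft_dist(ctx)
        tok = sample(q)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) The target model scores every drafted position (a single parallel pass in practice).
    p_dists = [target_dist(list(context) + drafted[:i]) for i in range(num_draft + 1)]

    # 3) Accept each drafted token with prob min(1, p/q); on the first rejection,
    #    resample from the residual distribution max(p - q, 0) and stop.
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if random.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
        else:
            residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
            z = sum(residual.values()) or 1.0
            emitted.append(sample({t: v / z for t, v in residual.items()}))
            return emitted

    # 4) Every draft was accepted, so the target's extra distribution yields one bonus token.
    emitted.append(sample(p_dists[-1]))
    return emitted

if __name__ == "__main__":
    context, generated = ["the"], []
    while len(generated) < 10:
        generated += speculative_step(context + generated)
    print(" ".join(context + generated))
```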

Michael Poli (@michaelpoli6):

📢New research on mechanistic architecture design and scaling laws.

- We perform the largest scaling laws analysis (500+ models, up to 7B) of beyond Transformer architectures to date

- For the first time, we show that architecture performance on a set of isolated token…

Together AI (@togethercompute):

🚀Excited to be recognized by FORTUNE for a second year in their Top 50 AI Startups list! We have come so far in the past year, and a huge thank you goes to the more than 60,000 developers now building on the Together API. Thank you!

Together AI (@togethercompute):

We are thrilled to be a launch partner for Meta Llama 3. Experience Llama 3 now at up to 350 tokens per second for Llama 3 8B and up to 150 tokens per second for Llama 3 70B, running in full FP16 precision on the Together API! 🤯 together.ai/blog/together-…

Together AI (@togethercompute):

We are excited to introduce SpecExec, a speculative decoding method for accelerating inference of offloaded LLMs! SpecExec applies classical ideas from speculative execution to LLM inference, leveraging a powerful draft model to construct a tree of the most likely token…
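
As a rough illustration of the "draft tree" idea (a toy sketch under my own assumptions, not the SpecExec implementation; `draft_next_token_probs` is a placeholder), a best-first expansion over the draft model's probabilities grows the tree toward the most likely continuations, which the target model can then verify in one batched pass:

```python
import heapq

def draft_next_token_probs(prefix):
    # Placeholder for a real draft model; returns {token: probability} for the prefix.
    vocab = ["a", "b", "c"]
    return {t: 1.0 / len(vocab) for t in vocab}

def build_draft_tree(prompt, budget=8):
    """Best-first expansion: always extend the node with the highest cumulative
    probability, so the tree covers the most likely continuations within the budget."""
    frontier = [(-1.0, prompt)]   # max-heap via negated cumulative probability
    tree = []                     # (parent prefix, token, cumulative probability)
    while frontier and len(tree) < budget:
        neg_p, prefix = heapq.heappop(frontier)
        for tok, q in draft_next_token_probs(prefix).items():
            p = -neg_p * q
            tree.append((prefix, tok, p))
            heapq.heappush(frontier, (-p, prefix + (tok,)))
    return tree[:budget]

if __name__ == "__main__":
    for prefix, tok, p in build_draft_tree(("the",), budget=6):
        print(" ".join(prefix), "->", tok, f"(p={p:.3f})")
```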

Together AI (@togethercompute):

Today we are announcing a new inference stack, which provides decoding throughput 4x faster than open-source vLLM. We are also introducing new Together Turbo and Together Lite endpoints that enable performance, quality, and price flexibility so you do not have to compromise.

Sasha Rush (@srush_nlp):

The Mamba in the Llama: arxiv.org/abs/2408.15237 RNNs are neat. Here's a video describing how to make them work really well with little money: youtube.com/watch?v=A5ff8h… (by Junxiong Wang and Daniele Paliotta)

Tri Dao (@tri_dao):

We made distillation and spec decoding work with Mamba (and linear RNNs in general)! Up to 300 tok/sec for 7B🚀. Spec dec is nontrivial because there's no KV cache to backtrack if some tokens aren't accepted, but there's an efficient hardware-aware algorithm to recompute the SSM states.
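
To make the "no KV cache to backtrack" point concrete, here is a minimal sketch (a toy scalar recurrence with illustrative names only, not the actual distilled-Mamba kernels): after verification accepts only part of a draft, the recurrent state has to be recomputed, or checkpointed, from the last verified position, rather than truncated the way a Transformer's KV cache would be.

```python
# Toy scalar SSM: the entire sequence history lives in one recurrent state h,
# so rejected draft tokens cannot simply be "dropped from the cache".
A, B = 0.9, 0.5

def step(h, x):
    return A * h + B * x

def advance(h, tokens):
    # Replay tokens from a known-good state to rebuild the recurrent state.
    for x in tokens:
        h = step(h, x)
    return h

def speculative_advance(h_verified, drafted, num_accepted):
    """After the target model accepts only the first `num_accepted` drafted tokens,
    roll the SSM state forward over just that accepted prefix.

    A Transformer would truncate its KV cache instead; here we recompute
    (or checkpoint per position) because the state is a running summary."""
    return advance(h_verified, drafted[:num_accepted])

if __name__ == "__main__":
    h_after_prompt = 0.0
    drafted_inputs = [1.0, 2.0, 3.0, 4.0]   # toy inputs proposed by the draft model
    # Suppose verification accepted only 2 of the 4 drafted tokens:
    h = speculative_advance(h_after_prompt, drafted_inputs, num_accepted=2)
    print(h)  # state now reflects exactly the accepted prefix and nothing else
```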

Avner May (@avnermay):

Excited to share our latest work, where we show how to distill from a Llama model into a Mamba hybrid, and how to make speculative decoding work with these models! With Junxiong Wang, Daniele Paliotta, Sasha Rush, Tri Dao.

Cartesia (@cartesia_ai):

We’re pumped to see our Chief Scientist, Albert Gu, on the TIME 100 AI list today!🚀 We’re grateful to have Albert leading the SSM revolution here at Cartesia ⚡

Tri Dao (@tri_dao):

Surprisingly, speculative decoding works well not just for small-batch LLM inference but also for large batches and long contexts. Once we understood the compute & memory profile of LLM inference, the new spec dec algorithms fall out naturally.
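
A rough back-of-envelope sketch (all hardware and model numbers below are my own assumptions, not from the thread) shows why: once contexts get long, reading the KV cache, not the weights, dominates memory traffic even at large batch sizes, so decoding stays bandwidth-bound and speculation still has room to help.

```python
GB = 1e9

# Assumed hardware: ~3.3 TB/s of HBM bandwidth (H100-class GPU).
hbm_bandwidth = 3.3e12                      # bytes / second

# Assumed model: Llama-2-70B-like, FP16 weights, GQA with 8 KV heads, 80 layers.
weight_bytes = 70e9 * 2                     # ~140 GB of weights
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2   # (K+V) * layers * kv_heads * head_dim * fp16
                                            # ~0.33 MB per token per sequence

def bytes_per_decode_step(batch, context_len):
    # Each decode step streams the weights once plus every sequence's KV cache.
    return weight_bytes + batch * context_len * kv_bytes_per_token

for batch, ctx in [(1, 4_096), (64, 4_096), (64, 32_768)]:
    total = bytes_per_decode_step(batch, ctx)
    kv_share = batch * ctx * kv_bytes_per_token / total
    print(f"batch={batch:>3} ctx={ctx:>6}: ~{total / GB:5.0f} GB/step, "
          f"KV cache = {kv_share:4.0%} of traffic, "
          f"~{hbm_bandwidth / total:5.1f} steps/s if purely bandwidth-bound")
```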

Beidi Chen (@beidichen):

🥳Promised blogpost+tweet about MagicDec-1.0🪄🪄🪄 2.0 coming soon😉: How can we achieve Lossless, High Throughput, and Low Latency LLM Inference all at once? Seems too good to be true? Introducing MagicDec-1.0🪄, a Speculative Decoding (SD) based technique that can improve…

Together AI (@togethercompute):

AI at Meta 🙌 We love that Llama has gone multimodal! We're excited to partner with AI at Meta to offer free access to the Llama 3.2 11B vision model for developers. Can't wait to see what everyone builds! Try now with our Llama-Vision-Free model endpoint. Sign up here:

Together AI (@togethercompute):

🚀 Big news! We’re thrilled to announce the launch of Llama 3.2 Vision Models & Llama Stack on Together AI. 🎉 Free access to Llama 3.2 Vision Model for developers to build and innovate with open source AI. api.together.ai/playground/cha… ➡️ Learn more in the blog

Avner May (@avnermay):

Excited that Together has just released TogetherChat! TogetherChat lets anyone ask questions of DeepSeek-R1 and other leading open-source models, with an awesome UI and a user experience very similar to ChatGPT or Claude. Check it out at

Albert Gu (@_albertgu):

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
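
As a toy illustration only (this is not the architecture from the paper; the boundary scorer and pooling below are placeholder assumptions), "dynamic chunking" can be pictured as a learned boundary predictor deciding where low-level units end, with each chunk then pooled into a single higher-level representation:

```python
def boundary_scores(byte_seq):
    # Placeholder for a learned boundary predictor; here we simply mark spaces.
    return [1.0 if b == ord(" ") else 0.0 for b in byte_seq]

def dynamic_chunk(byte_seq, threshold=0.5):
    """Split the raw byte stream wherever the boundary score crosses the threshold,
    instead of relying on a fixed, externally trained tokenizer."""
    chunks, current = [], []
    for b, score in zip(byte_seq, boundary_scores(byte_seq)):
        current.append(b)
        if score >= threshold:
            chunks.append(bytes(current))
            current = []
    if current:
        chunks.append(bytes(current))
    return chunks

def pool(chunk):
    # Placeholder for pooling a chunk's byte embeddings into one higher-level vector.
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    text = b"the cat sat on the mat"
    chunks = dynamic_chunk(text)
    print([c.decode() for c in chunks])         # dynamically grouped low-level units
    print([round(pool(c), 1) for c in chunks])  # one higher-level summary per chunk
```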
