Alexey Tumanov (@alsched) 's Twitter Profile
Alexey Tumanov

@alsched

Assistant Professor of Computer Science @gatech_scs @gtcomputing | postdoc @Berkeley_EECS @ucbrise | ML Systems

ID: 1003479212

https://faculty.cc.gatech.edu/~atumanov | Joined: 11-12-2012 06:49:33

243 Tweets

538 Followers

274 Following

Georgia Tech School of Computer Science (@gatech_scs) 's Twitter Profile Photo

Three SCS faculty members were recognized by their students for outstanding teaching and educational impact. Congratulations to Ashutosh Dhekne, Alexey Tumanov, and Umakishore Ramachandran 👏👏👏 blog.ctl.gatech.edu/2024/05/21/spr…

Alexey Tumanov (@alsched) 's Twitter Profile Photo

Really proud of my PhD student's work developing a new mechanism and policy that significantly improves tail latency in Large Language Model (LLM) inference without sacrificing throughput. Already received 10+ citations; the source is OSS and adopted in industry.

Alexey Tumanov (@alsched) 's Twitter Profile Photo

Let's set the standard for the interactive performance of LLMs capturing nuances of user experience. While latency/throughput tension is well known to the Systems community, latency jitter is less explored. Fluidity index & fluid token generation rate more aptly capture LLM perf.
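The fluidity idea can be sketched as a per-token deadline check. This is a hedged illustration, not the paper's actual fluidity-index definition; the 30 ms deadline is an assumed TBT SLO.

```python
# Illustrative sketch: score how "fluid" a token stream feels by
# checking each inter-token gap against a time-between-tokens (TBT)
# deadline. Timestamps are in milliseconds to keep arithmetic exact.

def fluidity_fraction(token_times_ms, tbt_deadline_ms=30):
    """Fraction of inter-token gaps that meet the TBT deadline."""
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    if not gaps:
        return 1.0
    return sum(1 for g in gaps if g <= tbt_deadline_ms) / len(gaps)

# A stream with one 100 ms stall among otherwise steady 30 ms gaps:
print(fluidity_fraction([0, 30, 60, 160, 190]))  # 0.75
```

A metric like this captures jitter that average throughput hides: the stream above averages well under 50 ms/token yet still stutters visibly once.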

Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

⚡ Speed Meets Accuracy: Unlike approximation-based methods, Mnemosyne achieves exact inference, ensuring that the generated output remains precise even when processing 10 million tokens, by effectively combining these parallelization techniques to scale up to hundreds of GPUs.

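One reason exact inference survives sharding is that softmax attention can be split across KV shards and recombined without approximation. A minimal 1-D sketch with toy numbers (not Mnemosyne's implementation, just the standard log-sum-exp merge):

```python
import math

def shard_attention(q, keys, vals):
    """Partial attention over one KV shard: (max_logit, sumexp, output)."""
    logits = [q * k for k in keys]                # toy scalar "dot products"
    m = max(logits)
    w = [math.exp(l - m) for l in logits]         # shifted for stability
    s = sum(w)
    out = sum(wi * v for wi, v in zip(w, vals)) / s
    return m, s, out

def merge_shards(parts):
    """Combine per-shard partial attentions into the exact global result."""
    m = max(p[0] for p in parts)
    num = sum(s * math.exp(mi - m) * o for mi, s, o in parts)
    den = sum(s * math.exp(mi - m) for mi, s, _ in parts)
    return num / den

keys, vals, q = [0.1, 0.5, -0.3, 0.8], [1.0, 2.0, 3.0, 4.0], 0.7
exact = merge_shards([shard_attention(q, keys, vals)])
split = merge_shards([shard_attention(q, keys[:2], vals[:2]),
                      shard_attention(q, keys[2:], vals[2:])])
assert abs(exact - split) < 1e-9   # sharded result matches unsharded exactly
```

Because the merge rescales each shard's partial sum-of-exponentials, the sharded result is bit-for-bit the same attention, not an approximation.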
Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

🔗 Curious to learn more? Dive into our paper to explore the technical details behind Mnemosyne: arxiv.org/abs/2409.17264…. Joint work between Georgia Tech Computing, Microsoft, and UCSD Engineering with the amazing Esha Choukse, Alexey Tumanov, Ram Ramjee, Junda Chen, Íñigo Goiri & Chaojie Zhang!

Alexey Tumanov (@alsched) 's Twitter Profile Photo

First publicly known support for LLM context of up to 10M tokens with high throughput and interactive, production-grade TBT SLOs (30 ms) with Mnemosyne. What would it take to pair program with GenAI on millions of LoC? Or analyze 10/110 hrs of video/audio content? All precisely!

Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

Google has silently but surely developed an edge over OpenAI. Long context processing seems to be the key to Google's AI strategy. NotebookLM is a prime example of what long context processing can unlock. In our latest paper, we talk about how systems can be built to support

Alexey Tumanov (@alsched) 's Twitter Profile Photo

Super-charged technical program this year at ACM SoCC: acmsocc.org/2024/schedule.… Looking forward! Hope to see you there! #socc24

ACM SoCC (@acmsocc) 's Twitter Profile Photo

At SoCC’24, Anastasia Ailamaki from EPFL will give a keynote on how disaggregated memory resources are becoming the norm and how this “new memory wall” affects database system design. This talk will be amazing, make sure to be there!! acmsocc.org/2024/keynotes.…

Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

Sequence pipeline parallelism is being rapidly adopted for extreme long-context inference in industry! Check out our paper on system design for long-context inference for more details: arxiv.org/abs/2409.17264

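The latency win from pipelining prompt chunks follows the classic fill/drain formula. A back-of-the-envelope sketch; the chunk count, stage count, and per-chunk time below are illustrative assumptions, not figures from the paper.

```python
# Toy model: a long prompt is cut into chunks that flow through
# pipeline stages, so different stages work on different chunks at once.

def serial_time(num_chunks, num_stages, chunk_time):
    """No overlap: every chunk runs through every stage sequentially."""
    return num_chunks * num_stages * chunk_time

def pipelined_time(num_chunks, num_stages, chunk_time):
    """Fill/drain formula: after filling, one chunk finishes per tick."""
    return (num_chunks + num_stages - 1) * chunk_time

# 64 chunks over 8 stages: 71 chunk-times pipelined vs 512 serially,
# approaching an 8x (stage-count) speedup once chunks >> stages.
print(serial_time(64, 8, 1), pipelined_time(64, 8, 1))  # 512 71
```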
Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

Super long-context models with context windows spanning millions of tokens are becoming commonplace (Google DeepMind Gemini, xAI Grok 3, Qwen Qwen2.5). But efficiently serving these models is tough, especially alongside short requests. Head-of-Line (HOL) blocking becomes

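The HOL problem is easy to see with a toy FCFS queue; all durations here are made-up illustrations, not measurements.

```python
# Toy FCFS schedule: a short chat request queued behind a single
# long-context prefill inherits the prefill's entire service time
# as queueing delay -- the essence of head-of-line blocking.

def fcfs_start_times(service_times):
    """Start (i.e., queueing wait) time of each request under FCFS."""
    starts, clock = [], 0.0
    for s in service_times:
        starts.append(clock)
        clock += s
    return starts

long_prefill_s, short_chat_s = 120.0, 0.5   # hypothetical durations
waits = fcfs_start_times([long_prefill_s, short_chat_s])
print(waits)  # [0.0, 120.0] -- the half-second request stalls for two minutes
```

Schedulers that chunk or preempt the long prefill let the short request start almost immediately instead.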
Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

Super excited to share another incredible system that we have built over the past two years! Training giant foundation models (like Llama-3 405B) costs a FORTUNE 💰 (millions of dollars)! Optimizing the training "recipe" (parallelism, memory tricks, etc.) is critical but

Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

Maya offers a transparent, accurate, and efficient way to model and optimize large-scale DL training without needing expensive hardware clusters for exploration. A crucial step towards sustainable AI! Read the paper: arxiv.org/abs/2503.20191 Work done with Srihas Yarlagadda, Elton Pinto,
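To give the flavor of estimating training cost analytically rather than on real hardware: a minimal sketch, not Maya's actual model. Every constant below (the ~6·N·D FLOPs-per-token rule of thumb, the peak TFLOPS, the MFU) is an assumption for illustration.

```python
# Rough analytical estimate of one training step's duration from
# model size, batch size, and achievable hardware throughput.

def train_step_time_s(params_b, tokens_per_batch, gpu_tflops, mfu, num_gpus):
    """Estimated step time via the ~6*N*D FLOPs-per-token rule of thumb."""
    flops = 6 * (params_b * 1e9) * tokens_per_batch      # fwd+bwd FLOPs
    cluster_flops = num_gpus * gpu_tflops * 1e12 * mfu   # achieved FLOP/s
    return flops / cluster_flops

# e.g. a 405B-parameter model, a 16M-token batch, 1024 H100-class GPUs
# (assumed 989 peak TFLOPS each) at an assumed 40% MFU:
step = train_step_time_s(405, 16e6, 989, 0.40, 1024)  # roughly a minute and a half
```

A simulator's job is then to predict how knobs like parallelism layout and memory tricks move the MFU term, without renting the cluster to find out.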

Sachit Kuhar (@sachitkuhar) 's Twitter Profile Photo

Full code 🔓 github.com/sachitkuhar/PL… Collaboration with Yash Jain and Alexey Tumanov. (6/6) #EfficientAI #EdgeAI #Quantization #TMLR #AI #GaTech #GeorgiaTech

Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

Interesting work on long-context inference from NVIDIA, where they scale KV parallelism on GB200 NVL72 systems! To learn more about accelerating long-context inference and the trade-offs between different parallelism dimensions, check out our paper, Medha: arxiv.org/abs/2409.17264

Georgia Tech School of Computer Science (@gatech_scs) 's Twitter Profile Photo

Congratulations 👏 to our faculty who were recognized on the Spring 2025 CIOS Honor Roll for their outstanding teaching and educational impact: Assoc. Prof. Alexey Tumanov and Asst. Prof. Jan Van Den Brand!

Amey Agrawal (@agrawalamey12) 's Twitter Profile Photo

After hitting evaluation puzzles like this in our own work, we analyzed patterns across LLM inference papers and identified 8 systematic evaluation issues that can make performance comparisons misleading. We have compiled a practical evaluation checklist to help avoid these