Luis Ceze (@luisceze)'s Twitter Profile
Luis Ceze

@luisceze

computer architect. marveled by biology. professor @uwcse. ceo @OctoAICloud. venture partner @madronaventures.

ID: 139128649

Link: http://homes.cs.washington.edu/~luisceze/ · Joined: 01-05-2010 16:43:33

1.1K Tweets

3.3K Followers

2.2K Following

Luis Ceze (@luisceze)'s Twitter Profile Photo

Our SaaS customers love our full-stack approach to generative AI inference that is reliable, customizable, and efficient. OctoStack offers all of these benefits directly in your environment: ultra-fast inference, model orchestration, and optimization up and down the stack. 🚀🐙

Allie K. Miller (@alliekmiller)'s Twitter Profile Photo

Fine-tuned open-source models are giving the AI giants a run for their money. Matt Shumer, CEO of HyperWrite, and I sat down with @OctoAICloud to talk about the major trends impacting fast-growing AI startups across open source, cost savings, and flexibility. ⏩️ This is

Tiernan Ray (@tiernanraytech)'s Twitter Profile Photo

More political deepfakes exist than you think, according to this AI expert. With so many elections happening globally this year, TrueMedia founder Oren Etzioni hopes the company's deepfake detection tool can help reduce disinformation. Here's how. zdnet.com/article/ai-exp…

Luis Ceze (@luisceze)'s Twitter Profile Photo

Huge achievement by the AI at Meta team on launching the Llama 3.1 models! The quality benchmarks look incredible; our customers are going to be really excited for the whole Llama 3.1 herd. Learn more and try them on @OctoAICloud here: octo.ai/blog/llama-3-1…. 🙏🚀🐙
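A minimal sketch of trying a Llama 3.1 model on OctoAI, assuming the service exposes an OpenAI-compatible chat endpoint; the base URL and model id below are illustrative assumptions, not confirmed values:

from openai import OpenAI

# Assumed OctoAI-compatible endpoint and model id; substitute the real
# values from the OctoAI docs and your own API token.
client = OpenAI(
    base_url="https://text.octoai.run/v1",
    api_key="<OCTOAI_API_TOKEN>",
)
resp = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize the Llama 3.1 release in one sentence."}],
)
print(resp.choices[0].message.content)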

Luis Ceze (@luisceze)'s Twitter Profile Photo

Great to see @OctoAICloud second only to Groq Inc -- given that our service runs on off-the-shelf cloud NVIDIA hardware. It is all about carefully balancing speed, quality, and cost from a whole-system, cross-stack perspective.

Luis Ceze (@luisceze)'s Twitter Profile Photo

Fascinating to read this analysis of how telenovelas have such a deep impact on real-world culture — I'm Brazilian :). As a computer scientist, reading TRIBAL by Michael Morris, a professor at Columbia University, makes me wonder about culture's impact on AI and its co-evolution with human culture.

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

Check out the intra-kernel profiler in flashinfer to visualize the timeline of each SM/warpgroup in the lifecycle of a CUDA persistent kernel: github.com/flashinfer-ai/… You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

Learn more about the latest advances in AI and systems, including LLM serving, efficient attentions, structured outputs, scaling up training, and more topics. Check out #MLSys2025. Accepted papers at mlsys.org/virtual/2025/p… and register today at mlsys.org/Register

Shanli Xing (@0xsling0)'s Twitter Profile Photo

🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary sizes grow. FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0
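To make the sorting-free idea concrete, here is a toy NumPy sketch of rejection sampling for top-p and top-k. It only illustrates why no sort is needed (filter membership reduces to a masked sum or count over the vocabulary); FlashInfer's actual GPU kernels use a more elaborate pivot-based rejection scheme:

import numpy as np

def top_p_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    # Draw from the full categorical distribution, then accept only if the
    # draw lies inside the nucleus. Token i is in the top-p set iff the
    # total mass of strictly more probable tokens is below p -- a reduction,
    # not a sort. Accepted draws follow the renormalized top-p distribution.
    while True:
        i = int(rng.choice(len(probs), p=probs))
        if probs[probs > probs[i]].sum() < p:
            return i

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    # Same rejection idea for top-k: accept iff fewer than k tokens are
    # strictly more probable than the draw.
    while True:
        i = int(rng.choice(len(probs), p=probs))
        if int((probs > probs[i]).sum()) < k:
            return i

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(32))  # toy 32-token vocabulary
print(top_p_sample(probs, 0.9, rng), top_k_sample(probs, 5, rng))

The expected number of rejection rounds for top-p is bounded by 1/p (the nucleus carries at least p of the probability mass), which is why rejection can beat a full vocabulary sort as vocabularies grow.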

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to LMSYS Org’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to

Ying Sheng (@ying11231)'s Twitter Profile Photo

Congrats to Zihao Ye, Tianqi Chen, and Luis Ceze! FlashInfer has been the real power behind various inference frameworks! Hope to see more people joining the community and building their own inference engines on top of it!