Luis Ceze (@luisceze)'s Twitter Profile
Luis Ceze

@luisceze

computer architect. marveled by biology. professor @uwcse. ceo @OctoAICloud. venture partner @madronaventures.

ID: 139128649

Link: http://homes.cs.washington.edu/~luisceze/ · Joined: 01-05-2010 16:43:33

1.1K Tweets

3.3K Followers

2.2K Following

Luis Ceze (@luisceze)'s Twitter Profile Photo

Our SaaS customers love our full-stack approach to generative AI inference that is reliable, customizable, and efficient. OctoStack offers all of these benefits directly in your environment: ultra-fast inference, model orchestration, and optimization up and down the stack. 🚀🐙

Allie K. Miller (@alliekmiller)'s Twitter Profile Photo

Fine-tuned open-source models are giving the AI giants a run for their money. Matt Shumer, CEO of HyperWrite, and I sat down with @OctoAICloud to talk about the major trends impacting fast-growing AI startups across open source, cost savings, and flexibility. ⏩️ This is

Tiernan Ray (@tiernanraytech)'s Twitter Profile Photo

More political deepfakes exist than you think, according to this AI expert. With so many elections happening globally this year, TrueMedia founder Oren Etzioni hopes the company's deepfake detection tool can help reduce disinformation. Here's how. zdnet.com/article/ai-exp…

Luis Ceze (@luisceze)'s Twitter Profile Photo

Huge achievement by the AI at Meta team on launching the Llama 3.1 models! The quality benchmarks look incredible; our customers are going to be really excited for the whole Llama 3.1 herd. Learn more and try them on @OctoAICloud here: octo.ai/blog/llama-3-1…. 🙏🚀🐙
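A minimal sketch of trying a Llama 3.1 model on OctoAI, assuming the service exposes an OpenAI-compatible chat endpoint; the base URL and model id below are illustrative assumptions, not confirmed values:

from openai import OpenAI

# Assumed OctoAI-compatible endpoint and model id; substitute the real
# values from the OctoAI docs and your own API token.
client = OpenAI(
    base_url="https://text.octoai.run/v1",
    api_key="<OCTOAI_API_TOKEN>",
)
resp = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize the Llama 3.1 release in one sentence."}],
)
print(resp.choices[0].message.content)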

Luis Ceze (@luisceze)'s Twitter Profile Photo

Great to see @OctoAICloud second only to Groq Inc -- given that our service runs on off-the-shelf cloud NVIDIA hardware. It is all about carefully balancing speed, quality, and cost from a whole-system, cross-stack perspective.

Luis Ceze (@luisceze)'s Twitter Profile Photo

Fascinating to read this analysis of how telenovelas have such a deep impact on real-world culture — I'm Brazilian :). As a computer scientist, reading TRIBAL by Michael Morris, a professor at Columbia University, makes me wonder about culture's impact on AI and its co-evolution with human culture.

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

Check out the intra-kernel profiler in flashinfer to visualize the timeline of each SM/warpgroup in the lifecycle of a CUDA persistent kernel: github.com/flashinfer-ai/… You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

Learn more about the latest advances in AI and systems, including LLM serving, efficient attentions, structured outputs, scaling up training, and more topics. Check out #MLSys2025. Accepted papers at mlsys.org/virtual/2025/p… and register today at mlsys.org/Register

Shanli Xing (@0xsling0)'s Twitter Profile Photo

🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary sizes grow. FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0
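To make the sorting-free idea concrete, here is a toy NumPy sketch of rejection sampling for top-p and top-k. It only illustrates why no sort is needed (filter membership reduces to a masked sum or count over the vocabulary); FlashInfer's actual GPU kernels use a more elaborate pivot-based rejection scheme:

import numpy as np

def top_p_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    # Draw from the full categorical distribution, then accept only if the
    # draw lies inside the nucleus. Token i is in the top-p set iff the
    # total mass of strictly more probable tokens is below p -- a reduction,
    # not a sort. Accepted draws follow the renormalized top-p distribution.
    while True:
        i = int(rng.choice(len(probs), p=probs))
        if probs[probs > probs[i]].sum() < p:
            return i

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    # Same rejection idea for top-k: accept iff fewer than k tokens are
    # strictly more probable than the draw.
    while True:
        i = int(rng.choice(len(probs), p=probs))
        if int((probs > probs[i]).sum()) < k:
            return i

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(32))  # toy 32-token vocabulary
print(top_p_sample(probs, 0.9, rng), top_k_sample(probs, 5, rng))

The expected number of rejection rounds for top-p is bounded by 1/p (the nucleus carries at least p of the probability mass), which is why rejection can beat a full vocabulary sort as vocabularies grow.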

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to LMSYS Org’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to

Ying Sheng (@ying11231)'s Twitter Profile Photo

Congrats to Zihao Ye, Tianqi Chen, and Luis Ceze! FlashInfer has been the real power behind various inference frameworks! Hope to see more people joining the community and building their own inference engines on top of it!