David Pissarra (@davidpissarra)'s Twitter Profile
David Pissarra

@davidpissarra

PhD Student @NYU_Courant | MSc @istecnico @Tsinghua_Uni | prev: Research Intern @CSDatCMU

ID: 1213860292330848261

Website: http://davidpissarra.com | Joined: 05-01-2020 16:30:52

14 Tweets

114 Followers

133 Following

Charlie Ruan (@charlie_ruan)

Run Mistral AI's 7B model in your browser with WebGPU acceleration! Try it out at webllm.mlc.ai

For native LLM deployment, sliding window attention is particularly helpful for enjoying longer context with lower memory requirements.
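
As a rough illustration of how little code such an in-browser deployment takes, here is a minimal TypeScript sketch using the @mlc-ai/web-llm package; the exact model ID string is an assumption, so check the library's prebuilt model list.

```ts
// Minimal WebLLM sketch: load a quantized Mistral build and chat with it,
// all on the local GPU via WebGPU. The model ID below is an assumption;
// consult the prebuilt model list shipped with @mlc-ai/web-llm.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // First run downloads the weights and compiles WebGPU kernels.
  const engine = await CreateMLCEngine("Mistral-7B-Instruct-v0.2-q4f16_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // OpenAI-style chat completion, executed entirely in the browser.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Why does sliding window attention save memory?" }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```
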
David Pissarra (@davidpissarra)

Run the Mistral-7B-Instruct-v0.2 model on iPhone! It now supports StreamingLLM for endless generation. Try the MLC Chat App via TestFlight: llm.mlc.ai. For native LLM deployment, attention sinks are particularly helpful for longer generation with lower memory requirements.
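
For intuition: StreamingLLM keeps a handful of initial "attention sink" tokens plus a rolling window of recent tokens in the KV cache and evicts everything in between, so memory stays bounded during endless generation. A toy TypeScript sketch of that eviction policy (all names and parameters here are illustrative, not the MLC LLM API):

```ts
// Toy illustration of the StreamingLLM / attention-sink policy: keep the
// first `numSinks` positions plus the most recent `windowSize` positions,
// evict the middle. Illustrative only, not MLC LLM's implementation.
function keptPositions(seqLen: number, numSinks: number, windowSize: number): number[] {
  const kept: number[] = [];
  for (let pos = 0; pos < seqLen; pos++) {
    const isSink = pos < numSinks;               // early tokens are never evicted
    const isRecent = pos >= seqLen - windowSize; // rolling window of recent tokens
    if (isSink || isRecent) kept.push(pos);
  }
  return kept;
}

// With 4 sinks and a 1024-token window, a 6000-token stream keeps only
// 1028 KV entries, and positions 0-3 survive forever:
console.log(keptPositions(6000, 4, 1024).length); // 1028
```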

Guangxuan Xiao (@guangxuan_xiao)

Exciting news: StreamingLLM is now available on iPhone! 🎉 A huge thanks to David Pissarra for his fantastic extension to our work. Can't wait to explore the possibilities with StreamingLLM!

Charlie Ruan (@charlie_ruan)

New WizardMath V1.1 from WizardLM on WebLLM! It took me only ~20 mins to deploy it in the browser with WebGPU acceleration. WebLLM can be an easy way for folks to try new models -- a laptop with Chrome, that's it! We are actively working on WebLLM to make it even better!

Charlie Ruan (@charlie_ruan)

With Chrome v121, you can run webllm.mlc.ai in your Android web browser with WebGPU acceleration, everything locally! Here is a 1x-speed demo running 4-bit quantized Phi-2 on a Samsung S23. Thank you François and Jason Mayes for the support and suggestions!
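
Before loading a model, a page can probe whether the browser actually exposes a usable WebGPU adapter; a minimal sketch (assumes the @webgpu/types typings for TypeScript):

```ts
// Feature-detect WebGPU (shipped in Chrome 121+ on Android) before
// attempting to initialize WebLLM. Requires @webgpu/types for TS.
async function hasWebGPU(): Promise<boolean> {
  if (!("gpu" in navigator)) return false; // API not exposed at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null; // null means no usable GPU backend
}

hasWebGPU().then((ok) =>
  console.log(ok ? "WebGPU ready, WebLLM can run locally" : "No WebGPU available"),
);
```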

Charlie Ruan (@charlie_ruan)

CodeLlama 70B is now on MLC LLM -- local deployment everywhere!

Thanks to JIT compilation, running on different platforms (even w/ multi-GPU) is made easy -- see how M2 Mac (left) and 2 x RTX4090 (right) have almost the same code.

llm.mlc.ai/docs/
huggingface.co/mlc-ai
Ruihang Lai (@ruihanglai)

Run the Gemma model locally on iPhone - we get a blazing-fast 20 tok/s for the 2B model. This shows amazing potential for Gemma fine-tunes on phones, made possible by the new MLC SLM compilation flow by Junru Shao from octoaicloud and many other contributors. github.com/mlc-ai/mlc-llm

Charlie Ruan (@charlie_ruan)

webllm.mlc.ai now adds Gemma from Google DeepMind! The 2B model is perfect for building in-browser agents with WebGPU acceleration -- everything local! Here is a 1x-speed demo of 4-bit quantized gemma-2b-it on a Google Pixel 7 Pro with Chrome.
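
A hedged sketch of what streaming generation from the 2B model looks like with WebLLM's OpenAI-style API; the model ID string is an assumption, so check the current prebuilt list:

```ts
// Stream tokens from a quantized Gemma build in the browser. The model ID
// is an assumption; see WebLLM's prebuilt model list for the exact string.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function demo() {
  const engine = await CreateMLCEngine("gemma-2b-it-q4f16_1-MLC");

  // stream: true yields OpenAI-style chunks as tokens are generated locally.
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Plan my day in three bullet points." }],
    stream: true,
  });

  let text = "";
  for await (const chunk of chunks) {
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  console.log(text);
}

demo();
```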

Charlie Ruan (@charlie_ruan)

Excited to share the WebLLM engine: a high-performance in-browser LLM inference engine! WebLLM offers local GPU acceleration via WebGPU, a fully OpenAI-compatible API, and built-in web worker support to move backend execution off the main thread. Check out the blog post: blog.mlc.ai/2024/06/13/web…
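
The web worker support follows a two-file pattern: a worker script hosts the engine, and the page talks to it through the same OpenAI-style API. A minimal sketch (the model ID is an assumption):

```ts
// worker.ts -- hosts the engine off the UI thread.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```

```ts
// main.ts -- same chat API, but inference now runs inside the worker.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

async function run() {
  const engine = await CreateWebWorkerMLCEngine(
    new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
    "Llama-3-8B-Instruct-q4f16_1-MLC", // assumed model ID; any prebuilt works
  );

  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello from the main thread!" }],
  });
  console.log(reply.choices[0].message.content);
}

run();
```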

Shanli Xing (@0xsling0)

🤔 Can AI optimize the systems it runs on?

🚀 Introducing FlashInfer-Bench, a workflow that makes AI systems self-improving with agents:

- Standardized signature for LLM serving kernels
- Implement kernels with your preferred language
- Benchmark them against real-world serving
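
To make that workflow concrete, here is a purely hypothetical TypeScript sketch of the "standardized signature" idea: every candidate kernel exposes one typed entry point, so a harness can swap implementations and time them. None of these names come from FlashInfer-Bench itself.

```ts
// Hypothetical sketch only -- NOT FlashInfer-Bench's actual API.
// The point: a shared signature lets a harness benchmark any kernel.
interface AttentionKernel {
  name: string;
  // Plain numbers stand in for real tensor handles / device buffers.
  run(batchSize: number, seqLen: number, numHeads: number): Promise<void>;
}

async function benchmark(kernels: AttentionKernel[], iters = 100): Promise<void> {
  for (const k of kernels) {
    const start = performance.now();
    for (let i = 0; i < iters; i++) await k.run(8, 4096, 32);
    const msPerIter = (performance.now() - start) / iters;
    console.log(`${k.name}: ${msPerIter.toFixed(3)} ms/iter`);
  }
}

// A do-nothing baseline just to make the harness runnable end to end.
const baseline: AttentionKernel = { name: "noop-baseline", run: async () => {} };
benchmark([baseline]);
```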