Reyna Abhyankar (@reyna_abhyankar)'s Twitter Profile
Reyna Abhyankar

@reyna_abhyankar

I like computers

ID: 1164413313800785920

Joined: 22-08-2019 05:45:50

12 Tweets

19 Followers

16 Following

Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

Generative LLMs are slow and expensive to serve. Their much smaller, distilled versions are faster and cheaper but achieve suboptimal generative performance. We show it is possible to achieve the best of both worlds. Code: github.com/flexflow/FlexF… Paper: cs.cmu.edu/~zhihaoj2/pape…
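
The "best of both worlds" the tweet points at (SpecInfer) pairs small distilled draft models with the large LLM acting as a verifier. Below is a minimal, hypothetical Python sketch of that draft-and-verify loop; `draft_next` and `target_verify` are toy stand-ins, not FlexFlow's actual API.

```python
# Toy sketch of speculative decoding: a small draft model proposes
# tokens cheaply, and the large target model verifies them in one pass.
# Both model functions below are invented stubs for illustration.

def draft_next(prefix, k=4):
    """Small draft model: cheaply propose the next k tokens (toy stub)."""
    return [f"tok{len(prefix) + i}" for i in range(k)]

def target_verify(prefix, proposed):
    """Large target model: check all proposals in one forward pass
    (toy stub). Returns the accepted run plus one corrected token."""
    accepted = proposed[:2]                 # pretend it agrees on 2 of 4
    correction = f"tok{len(prefix) + len(accepted)}"
    return accepted, correction

def speculative_decode(prompt, rounds=3):
    out = list(prompt)
    for _ in range(rounds):
        proposed = draft_next(out)                    # cheap drafting
        accepted, fix = target_verify(out, proposed)  # one big-model pass
        out += accepted + [fix]                # keep only verified tokens
    return out

print(speculative_decode(["<bos>"]))
```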

Yiying Zhang (@yiying__zhang)'s Twitter Profile Photo

Today, LLMs are constantly being augmented with tools, agents, models, RAG, etc. We built InferCept [ICML'24], the first serving framework designed for augmented LLMs. InferCept sustains a 1.6x-2x higher serving load than SOTA LLM serving systems. #AugLLM mlsys.wuklab.io/posts/infercep…
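
The core question InferCept tackles is what to do with a request's GPU state while decoding pauses for an external tool or agent call. A toy sketch of that keep/swap/recompute decision, with made-up costs and a made-up policy (the paper's actual cost model differs):

```python
# Hypothetical decision for a request "intercepted" mid-decode by a
# tool call: hold its KV cache in GPU memory, swap it to host memory,
# or discard and recompute it on resume. All numbers are illustrative.

def handle_interception(kv_bytes, pause_s, swap_bw=8e9, recompute_s=0.5):
    """Pick what to do with a paused request's KV cache (toy policy)."""
    keep_waste = pause_s                   # GPU memory sits idle for the pause
    swap_waste = 2 * kv_bytes / swap_bw    # copy out now, copy back later
    recompute_waste = recompute_s          # rebuild the KV cache on resume
    options = {"keep": keep_waste, "swap": swap_waste,
               "recompute": recompute_waste}
    return min(options, key=options.get)

# A 2 GB cache paused for 3 s: swapping beats holding GPU memory idle.
print(handle_interception(kv_bytes=2e9, pause_s=3.0))
```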

Yiying Zhang (@yiying__zhang)'s Twitter Profile Photo

LLM prompts are getting longer and increasingly shared with agents, tools, documents, etc. We introduce Preble, the first distributed LLM serving system targeting long and shared prompts. Preble reduces latency by 1.5-14.5x over SOTA serving systems. #LLM mlsys.wuklab.io/posts/preble/
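
One intuition behind serving long shared prompts in a distributed setting is prefix-aware routing: requests that share a prompt prefix should land on the GPU that already cached its KV state. A toy sketch of that idea, with an invented routing policy (Preble's actual scheduler balances more factors):

```python
# Toy prefix-aware router: send requests sharing a prompt prefix to the
# GPU that already cached it, else to the least-loaded GPU.

prefix_owner = {}            # prefix hash -> gpu id that cached it
load = [0, 0, 0, 0]          # outstanding requests per gpu

def route(prompt, prefix_len=64):
    key = hash(prompt[:prefix_len])
    gpu = prefix_owner.get(key)
    if gpu is None:                       # no cached prefix: balance load
        gpu = min(range(len(load)), key=load.__getitem__)
        prefix_owner[key] = gpu           # future requests reuse this KV
    load[gpu] += 1
    return gpu

shared = "You are a helpful agent. " * 10
print(route(shared + "Task A"))
print(route(shared + "Task B"))  # same GPU, shared KV cache reused
```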

Yiying Zhang (@yiying__zhang)'s Twitter Profile Photo

Join us at ICML in Vienna next Thursday, 11:30am-1pm local time (poster session 5), for our poster on InferCept (a serving system for augmented, or compound, AI) at Hall C 4-9 #709. Learn more about InferCept in our newly posted video: youtube.com/watch?v=iOs1b0…

Yiying Zhang (@yiying__zhang)'s Twitter Profile Photo

WukLab's new study reveals that CPU scheduling overhead can dominate LLM inference time, reaching up to 50% in systems like vLLM! Scheduling overhead can no longer be ignored as model forwarding speeds increase and more scheduling tasks get added. #LLM #vLLM #SGLang Read: tinyurl.com/yk4jeaz8
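
A back-of-envelope way to see the effect the study describes is to time the scheduler and the forward pass separately per iteration. The two functions below are stand-ins with invented latencies, not vLLM's real components:

```python
# Measure what fraction of each serving iteration goes to CPU
# scheduling vs. the model forward pass. Latencies are made up to
# mimic a fast model where scheduling is no longer negligible.
import time

def schedule(batch):          # stand-in for CPU-side batch bookkeeping
    time.sleep(0.004)         # pretend scheduling takes 4 ms

def forward(batch):           # stand-in for the GPU forward pass
    time.sleep(0.005)         # fast model: 5 ms per decode step

sched_t = fwd_t = 0.0
for step in range(100):
    t0 = time.perf_counter(); schedule(None); t1 = time.perf_counter()
    forward(None);            t2 = time.perf_counter()
    sched_t += t1 - t0; fwd_t += t2 - t1

print(f"scheduling share: {sched_t / (sched_t + fwd_t):.0%}")  # ~44%
```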

Yiying Zhang (@yiying__zhang)'s Twitter Profile Photo

Struggling with developing high-quality gen-AI apps? Meet Cognify: an open-source tool for automatically optimizing gen-AI workflows. 48% higher generation quality, 9x lower cost, fully compatible with LangChain, DSPy, Python. Read & try Cognify: tinyurl.com/a8b9cdnj #GenseeAI

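Cognify's details are in the linked post; as a rough illustration of what a workflow autotuner of this kind searches over, here is a toy grid search over model choice and prompt style under a quality-cost trade-off. Everything in it (the config space, the evaluate() scores, the scoring weight) is invented for the example.

```python
# Toy autotuner: enumerate workflow configurations and keep the best
# quality-vs-cost point. A real tuner would evaluate each config by
# running the workflow on a validation set.
import itertools

def evaluate(model, style):
    """Stub scoring a (model, prompt style) config; numbers are made up."""
    quality = ({"small": 0.6, "large": 0.8}[model]
               + {"plain": 0.0, "few-shot": 0.1}[style])
    cost = {"small": 1.0, "large": 9.0}[model]
    return quality, cost

best, best_score = None, float("-inf")
for model, style in itertools.product(["small", "large"],
                                      ["plain", "few-shot"]):
    q, c = evaluate(model, style)
    score = q - 0.02 * c                 # trade quality against cost
    if score > best_score:
        best, best_score = (model, style), score

print("best config:", best)
```
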
Yiying Zhang (@yiying__zhang)'s Twitter Profile Photo

Boost your gen-AI workflow's quality by 2.8x with just $5 in 24 minutes! Check out how Cognify autotunes a gen-AI workflow's quality and execution efficiency on a tiny budget in our latest blog post: tinyurl.com/4tyvvdks. Paper: tinyurl.com/3kx2xjn9. Code: tinyurl.com/2tp9bndr.

Yiying Zhang (@yiying__zhang)'s Twitter Profile Photo

Computer-use AI agents (CUAs) are powerful, but way too slow. A 2-minute human task can take a CUA over 20 minutes! At WukLab, we're building faster CUAs. Recently, we created OSWorld-Human, a new benchmark to close the speed gap between humans and machines. Read our full blog