Rohan Paul (@rohanpaul_ai)'s Twitter Profile
Rohan Paul

@rohanpaul_ai

I Build & Write AI stuff.

→ Join my LLM Newsletter - rohanpaul.substack.com

💼 AI Engineer


Link: https://linktr.ee/rohanpaul | Joined: 25-06-2014 22:38:54

17.17K Tweets

30.30K Followers

374 Following



Self-Hosting LLaMA 3.1 70B on RunPod with the vLLM Inference Engine

For a 70B-parameter model, deploying in 16-bit floating-point precision needs ~140GB of memory for the weights alone, while a 4-bit (INT4) quantization brings that down to ~45GB.
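As a sanity check, the weight figures follow directly from parameter count times bytes per parameter. A minimal sketch; the ~28% INT4 overhead factor is an assumption to account for quantization scales/zero-points and layers left unquantized, chosen so the result lands near the ~45GB quoted above:

```python
def weight_memory_gb(n_params: float, bits_per_param: float,
                     overhead: float = 1.0) -> float:
    """Rough weight-only memory footprint in GB."""
    return n_params * bits_per_param / 8 / 1e9 * overhead

# 70B parameters at FP16 (2 bytes each) -> 140.0 GB
print(weight_memory_gb(70e9, 16))
# 70B at INT4 with an assumed ~1.28x overhead -> ~44.8 GB
print(weight_memory_gb(70e9, 4, overhead=1.28))
```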

And on top of the weights you need additional memory for:

- Context, i.e. the KV cache, which grows with batch size and sequence length (estimated in the sketch below)
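The KV cache stores one key and one value vector per layer per token. A rough per-token estimate, assuming LLaMA 3.1 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache:

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# A single 8K-token sequence at FP16 -> ~2.7 GB of cache; vLLM's
# PagedAttention allocates this in fixed-size blocks, so memory for
# many concurrent sequences is shared and reclaimed efficiently.
print(kv_cache_gb(8192))
```

To put it together, a hedged sketch of serving a 4-bit 70B checkpoint through vLLM's Python API; the model id is one public AWQ quantization and an assumption here, as is the 2-GPU tensor-parallel layout (e.g. 2x A100 80GB on RunPod):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # assumed checkpoint
    quantization="awq",
    tensor_parallel_size=2,  # assumed: split across 2 GPUs
    max_model_len=8192,      # caps the per-sequence KV cache reservation
)
outputs = llm.generate(["Explain PagedAttention in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```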