@danielhanchen : OpenAI's OSS model possible breakdown: 1. 1250B MoE 5B active + 20B text only 2. Trained with Float4 maybe Blackwell chips 3. SwiGLU clip (-7,7) like ReLU6 4. 128K context via YaRN from 4K 5. Sliding window 128 + attention sinks 6. Llama/Mixtral arch + biases Details: 1. 120B MoE • TwiCopy

Daniel Han

@danielhanchen

+ Follow

Building @UnslothAI. Finetune train LLMs faster. LLMs bug hunter. OSS package github.com/unslothai/unsl…. YC S24. Prev ML at NVIDIA. Hyperlearn used by NASA.

ID: 717359704226172928

linkhttps://unsloth.ai/ calendar_today05-04-2016 14:34:16

2,2K Tweet

23,23K Takipçi

1,1K Takip Edilen

Daniel Han

@danielhanchen

a month ago

OpenAI's OSS model possible breakdown: 1. 120B MoE 5B active + 20B text only 2. Trained with Float4 maybe Blackwell chips 3. SwiGLU clip (-7,7) like ReLU6 4. 128K context via YaRN from 4K 5. Sliding window 128 + attention sinks 6. Llama/Mixtral arch + biases Details: 1. 120B MoE

thumb_up_off_alt717

chat_bubble_outline24

repeat117

shareShare