Sebastian Raschka (@rasbt)'s Twitter Profile
Sebastian Raschka

@rasbt

Machine learning & AI researcher writing at https://t.co/A0tXWzG1p5. LLM research engineer @LightningAI. Previously stats professor at UW-Madison.

ID:865622395

Link: https://sebastianraschka.com/books/ · Joined: 07-10-2012 02:06:16

15.5K Tweets

266.6K Followers

906 Following

Sebastian Raschka (@rasbt)

Persistent storage, flexible switching between CPU and multi-GPU machines, VSCode, scaling training runs via multi-node jobs ... what's not to love 😊

Sebastian Raschka (@rasbt)

Replying to Andrew Gordon Wilson (@andrewgwils): A long time ago!
I was faculty in a statistics department and always had to clarify to students and colleagues that in deep learning we use the term 'inference' differently, haha.
From my lecture notes: github.com/rasbt/stat479-…
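
A minimal sketch of the two uses of the word (a toy example, not taken from the linked lecture notes):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=5.0, scale=2.0, size=100)

    # Statistical "inference": reason about a population parameter from a sample,
    # e.g., estimate the mean and a rough 95% confidence interval.
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(sample.size)
    print("estimate:", mean, "approx. 95% CI:", (mean - 1.96 * sem, mean + 1.96 * sem))

    # Deep-learning "inference": simply run a (pretend-trained) model forward on new inputs.
    w, b = 2.0, 1.0                      # weights assumed to have been learned already
    x_new = np.array([3.0, 4.0])
    y_pred = w * x_new + b               # prediction step only; no parameters are estimated here
    print("predictions:", y_pred)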

Daniel Han (@danielhanchen)

Replying to biased estimator and Unsloth AI: VRAM consumption for finetuning will not be nice. Also, interestingly, QKV is slower when fused. Fused MLP might only be teensily faster, but unnoticeable since the matrices are large twitter.com/rasbt/status/1…
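
For context, "fused QKV" refers to computing the query, key, and value projections with a single large matrix multiplication instead of three separate ones. A minimal sketch (a generic illustration, not Unsloth's actual code):

    import torch
    import torch.nn as nn

    d_model = 64
    x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)

    # Separate projections: three smaller weight matrices.
    q_proj = nn.Linear(d_model, d_model, bias=False)
    k_proj = nn.Linear(d_model, d_model, bias=False)
    v_proj = nn.Linear(d_model, d_model, bias=False)
    q, k, v = q_proj(x), k_proj(x), v_proj(x)

    # Fused projection: one 3*d_model-wide matmul, then split the result.
    qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
    q_f, k_f, v_f = qkv_proj(x).chunk(3, dim=-1)

    # Both paths produce (2, 10, 64) tensors; the trade-off is kernel-launch
    # overhead and memory layout, not the math itself.
    print(q.shape, q_f.shape)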

Daniel Han (@danielhanchen)

Phi 3 (3.8B) got released! The paper said it was just a Llama arch, but I found some quirks while adding this to Unsloth AI:

1. Sliding window of 2047? Mistral v1 used 4096. So does Phi mini have SWA? (And an odd number?) Max RoPE position is 4096?
2. Upcasted RoPE? Like Gemma?
3. Dynamic…
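
Regarding the sliding-window question in point 1, here is a minimal sketch of what a sliding-window causal attention mask looks like (a generic illustration, not Phi-3's or Unsloth's implementation):

    import torch

    def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        causal = j <= i                          # no attending to future tokens
        in_window = (i - j) < window             # only the last `window` tokens are visible
        return causal & in_window                # True = attention allowed

    mask = sliding_window_causal_mask(seq_len=8, window=4)
    print(mask.int())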

Sebastian Raschka (@rasbt)

Phi-3 has been trained on 'only' 3.3 trillion tokens, roughly 5x fewer than the 15 trillion used for Llama 3.

Phi-3-mini has 'only' 3.8 billion parameters, less than half the size of Llama 3 8B.

Despite being small enough to be deployed on a phone (according to the technical…
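
A quick back-of-the-envelope check of the ratios quoted above:

    llama3_tokens, phi3_tokens = 15e12, 3.3e12
    llama3_8b_params, phi3_mini_params = 8e9, 3.8e9

    print(llama3_tokens / phi3_tokens)           # ~4.5: roughly "5x fewer" training tokens
    print(phi3_mini_params / llama3_8b_params)   # 0.475: less than half the parameter count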

Sebastian Raschka (@rasbt)

I can't believe Microsoft just dropped Phi-3 less than a week after Llama 3: arxiv.org/abs/2404.14219
And it looks good!

Sebastian Raschka (@rasbt)

Just learned that the RedPajama-V2 pretraining dataset is actually 30T tokens. 2x the size used for Llama 3 🤯
github.com/togethercomput…

Sebastian Raschka (@rasbt)

"... do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no ... Thus, despite its recurrent formulation, the 'state' in an SSM is an illusion" 🎤✋🔥 arxiv.org/abs/2404.08819
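
For a concrete sense of what "state tracking" means here, a minimal sketch (a toy example in the spirit of the word problems studied in that line of work, not code from the paper): composing a sequence of permutations, where the correct answer depends on the exact running state at every step rather than on any fixed summary of recent tokens.

    import random
    from itertools import permutations

    random.seed(0)
    elements = list(permutations(range(3)))   # the 6 permutations of 3 items

    def compose(p, q):
        # apply q first, then p
        return tuple(p[q[i]] for i in range(len(q)))

    state = tuple(range(3))                   # start from the identity permutation
    seq = [random.choice(elements) for _ in range(10)]
    for p in seq:
        state = compose(p, state)             # the state must be tracked exactly at every step
    print("final permutation after the sequence:", state)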

Guilherme Penedo (@gui_penedo)

We have just released 🍷 FineWeb: 15 trillion tokens of high-quality web data.
We filtered and deduplicated all CommonCrawl between 2013 and 2024.
Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile, and SlimPajama!
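
A minimal sketch of one ingredient of such pipelines (a toy illustration, not the actual FineWeb code): exact deduplication of documents by hashing their normalized text.

    import hashlib

    docs = [
        "The quick brown fox.",
        "the   quick brown fox.",   # duplicate after normalization
        "A completely different document.",
    ]

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    seen, deduped = set(), []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:       # keep only the first occurrence of each hash
            seen.add(digest)
            deduped.append(doc)

    print(len(docs), "->", len(deduped), "documents after exact dedup")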

Sebastian Raschka (@rasbt)

What a crazy week. And while I am (still) waiting for the Llama 3 paper, here's a little write-up on using and finetuning pretrained transformers! magazine.sebastianraschka.com/p/using-and-fi…

Ahmad Al-Dahle (@Ahmad_Al_Dahle)

It's here! Meet Llama 3, our latest generation of models that is setting a new standard for state-of-the-art performance and efficiency for openly available LLMs.

Key highlights

• 8B and 70B parameter openly available pre-trained and fine-tuned models.
• Trained on more…

Sebastian Raschka (@rasbt)

Super nice and concise slide deck introducing instruction finetuning & alignment for LLMs!
Also love that it starts with a clear definition of the terminology and jargon.
(We often use 'finetuning' to mean instruction finetuning, which can be very confusing coming from classic ML.)
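
As a concrete illustration of the jargon, an "instruction finetuning" sample is typically an (instruction, input, output) record rendered into a single text sequence, and the model is finetuned to produce the response part. A minimal sketch using an Alpaca-style template (a generic example, not material from the slide deck):

    example = {
        "instruction": "Rewrite the sentence in passive voice.",
        "input": "The researcher trained the model.",
        "output": "The model was trained by the researcher.",
    }

    prompt = (
        "Below is an instruction that describes a task, paired with an input.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n"
    )
    target = example["output"]
    print(prompt + target)   # the LLM is finetuned to produce `target` given `prompt`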
