Sebastian Raschka (@rasbt)'s Twitter Profile
Sebastian Raschka

@rasbt

Machine learning & AI researcher writing at https://t.co/A0tXWzG1p5. LLM research engineer @LightningAI. Previously stats professor at UW-Madison.

ID:865622395

Link: https://sebastianraschka.com/books/ · Joined: 07-10-2012 02:06:16

15.5K Tweets

266.6K Followers

906 Following

Sebastian Raschka (@rasbt)

Persistent storage, flexible switching between CPU and multi-GPU machines, VSCode, scaling training runs via multi-node jobs ... what's not to love 😊

Sebastian Raschka (@rasbt)

Replying to Andrew Gordon Wilson (@andrewgwils): A long time ago!
I was faculty in a statistics department and always had to clarify to students and colleagues that in deep learning we use the term 'inference' differently, haha.
From my lecture notes: github.com/rasbt/stat479-…
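
A minimal sketch of the two uses of the word (a toy example, not taken from the linked lecture notes):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=5.0, scale=2.0, size=100)

    # Statistical "inference": reason about a population parameter from a sample,
    # e.g., estimate the mean and a rough 95% confidence interval.
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(sample.size)
    print("estimate:", mean, "approx. 95% CI:", (mean - 1.96 * sem, mean + 1.96 * sem))

    # Deep-learning "inference": simply run a (pretend-trained) model forward on new inputs.
    w, b = 2.0, 1.0                      # weights assumed to have been learned already
    x_new = np.array([3.0, 4.0])
    y_pred = w * x_new + b               # prediction step only; no parameters are estimated here
    print("predictions:", y_pred)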

Daniel Han (@danielhanchen)

Replying to biased estimator and Unsloth AI: VRAM consumption for finetuning will not be nice. Also, interestingly, QKV is slower when fused. Fused MLP might only be teensily faster, but unnoticeable since the matrices are large twitter.com/rasbt/status/1…
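
For context, "fused QKV" refers to computing the query, key, and value projections with a single large matrix multiplication instead of three separate ones. A minimal sketch (a generic illustration, not Unsloth's actual code):

    import torch
    import torch.nn as nn

    d_model = 64
    x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)

    # Separate projections: three smaller weight matrices.
    q_proj = nn.Linear(d_model, d_model, bias=False)
    k_proj = nn.Linear(d_model, d_model, bias=False)
    v_proj = nn.Linear(d_model, d_model, bias=False)
    q, k, v = q_proj(x), k_proj(x), v_proj(x)

    # Fused projection: one 3*d_model-wide matmul, then split the result.
    qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
    q_f, k_f, v_f = qkv_proj(x).chunk(3, dim=-1)

    # Both paths produce (2, 10, 64) tensors; the trade-off is kernel-launch
    # overhead and memory layout, not the math itself.
    print(q.shape, q_f.shape)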

Daniel Han (@danielhanchen)

Phi 3 (3.8B) got released! The paper said it was just a Llama arch, but I found some quirks while adding this to Unsloth AI:

1. Sliding window of 2047? Mistral v1 used 4096. So does Phi mini have SWA? (And an odd number?) Max RoPE position is 4096?
2. Upcasted RoPE? Like Gemma?
3. Dynamic…
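
Regarding the sliding-window question in point 1, here is a minimal sketch of what a sliding-window causal attention mask looks like (a generic illustration, not Phi-3's or Unsloth's implementation):

    import torch

    def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        causal = j <= i                          # no attending to future tokens
        in_window = (i - j) < window             # only the last `window` tokens are visible
        return causal & in_window                # True = attention allowed

    mask = sliding_window_causal_mask(seq_len=8, window=4)
    print(mask.int())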

Sebastian Raschka (@rasbt)

Phi-3 has been trained on 'only' 3.3 trillion tokens, roughly 5x fewer than the 15 trillion used for Llama 3.

Phi-3-mini has 'only' 3.8 billion parameters, less than half the size of Llama 3 8B.

Despite being small enough to be deployed on a phone (according to the technical…
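
A quick back-of-the-envelope check of the ratios quoted above:

    llama3_tokens, phi3_tokens = 15e12, 3.3e12
    llama3_8b_params, phi3_mini_params = 8e9, 3.8e9

    print(llama3_tokens / phi3_tokens)           # ~4.5: roughly "5x fewer" training tokens
    print(phi3_mini_params / llama3_8b_params)   # 0.475: less than half the parameter count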

Sebastian Raschka (@rasbt)

I can't believe Microsoft just dropped Phi-3 less than a week after Llama 3: arxiv.org/abs/2404.14219
And it looks good!

Sebastian Raschka (@rasbt)

Just learned that the RedPajama-V2 pretraining dataset is actually 30T tokens. 2x the size used for Llama 3 🤯
github.com/togethercomput…

Sebastian Raschka (@rasbt)

"... do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no ... Thus, despite its recurrent formulation, the 'state' in an SSM is an illusion" 🎤✋🔥 arxiv.org/abs/2404.08819
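
For a concrete sense of what "state tracking" means here, a minimal sketch (a toy example in the spirit of the word problems studied in that line of work, not code from the paper): composing a sequence of permutations, where the correct answer depends on the exact running state at every step rather than on any fixed summary of recent tokens.

    import random
    from itertools import permutations

    random.seed(0)
    elements = list(permutations(range(3)))   # the 6 permutations of 3 items

    def compose(p, q):
        # apply q first, then p
        return tuple(p[q[i]] for i in range(len(q)))

    state = tuple(range(3))                   # start from the identity permutation
    seq = [random.choice(elements) for _ in range(10)]
    for p in seq:
        state = compose(p, state)             # the state must be tracked exactly at every step
    print("final permutation after the sequence:", state)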

Guilherme Penedo (@gui_penedo)

We have just released 🍷 FineWeb: 15 trillion tokens of high-quality web data.
We filtered and deduplicated all CommonCrawl between 2013 and 2024.
Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile, and SlimPajama!
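
A minimal sketch of one ingredient of such pipelines (a toy illustration, not the actual FineWeb code): exact deduplication of documents by hashing their normalized text.

    import hashlib

    docs = [
        "The quick brown fox.",
        "the   quick brown fox.",   # duplicate after normalization
        "A completely different document.",
    ]

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    seen, deduped = set(), []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:       # keep only the first occurrence of each hash
            seen.add(digest)
            deduped.append(doc)

    print(len(docs), "->", len(deduped), "documents after exact dedup")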

Sebastian Raschka (@rasbt)

What a crazy week. And while I am (still) waiting for the Llama 3 paper, here's a little write-up on using and finetuning pretrained transformers! magazine.sebastianraschka.com/p/using-and-fi…

Ahmad Al-Dahle (@Ahmad_Al_Dahle)

It's here! Meet Llama 3, our latest generation of models that is setting a new standard for state-of-the-art performance and efficiency for openly available LLMs.

Key highlights

• 8B and 70B parameter openly available pre-trained and fine-tuned models.
• Trained on more…

Sebastian Raschka (@rasbt)

Super nice and concise slide deck introducing instruction finetuning & alignment for LLMs!
Also love that it starts with a clear definition of the terminology and jargon.
(We often use 'finetuning' to mean instruction finetuning, which can be very confusing coming from classic ML.)
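
As a concrete illustration of the jargon, an "instruction finetuning" sample is typically an (instruction, input, output) record rendered into a single text sequence, and the model is finetuned to produce the response part. A minimal sketch using an Alpaca-style template (a generic example, not material from the slide deck):

    example = {
        "instruction": "Rewrite the sentence in passive voice.",
        "input": "The researcher trained the model.",
        "output": "The model was trained by the researcher.",
    }

    prompt = (
        "Below is an instruction that describes a task, paired with an input.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n"
    )
    target = example["output"]
    print(prompt + target)   # the LLM is finetuned to produce `target` given `prompt`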
