350M parameters is all you need! ⚡
Revisiting Meta's MobileLLM paper this morning:
> Reaches the same perf as LLaMA-2 7B in API calling, and is competitive at chat
> Train thin and deep networks (instead of wide)
> Grouped Query Attention (even for smaller networks)
> Block-wise weight sharing
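A rough sketch of two of those ingredients, thin-and-deep stacking plus grouped-query attention (the dims and layer count below are just plausible small-model numbers, not necessarily MobileLLM's exact config; block-wise weight sharing would additionally reuse one block's weights across adjacent layers):

```python
# Minimal sketch: a narrow attention block with grouped-query attention (GQA),
# stacked many times ("thin and deep") instead of a few wide layers.
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim=576, n_heads=9, n_kv_heads=3):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)  # fewer K/V heads
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Share each K/V head across a group of query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))

# Thin and deep: many narrow blocks instead of a handful of wide ones.
blocks = nn.ModuleList(GroupedQueryAttention() for _ in range(30))
```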
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- A linear-time RNN whose hidden state is itself a small model, updated by a gradient step on each incoming token, i.e., test-time training
- Achieves better perplexity than Mamba
arxiv.org/abs/2407.04620
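A toy sketch of that test-time-training idea (heavily simplified: the paper's TTT layers use learned projections, a corrupted-view reconstruction loss, and mini-batched updates; the function name and learning rate here are mine):

```python
# Toy TTT-style layer: the "hidden state" W is a linear model that is trained
# on each token as it arrives, so the update cost is linear in sequence length.
import torch

def ttt_layer(tokens, lr=0.1):
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                     # hidden state = a tiny linear model
    outputs = []
    for x in tokens:                          # one gradient step per token
        W = W.detach().requires_grad_(True)
        loss = ((W @ x - x) ** 2).mean()      # self-supervised reconstruction loss
        (grad,) = torch.autograd.grad(loss, W)
        W = W - lr * grad                     # "train" the state at test time
        outputs.append(W @ x)                 # output = updated model applied to input
    return torch.stack(outputs)

out = ttt_layer(torch.randn(16, 32))          # 16 tokens of dimension 32
```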
4 months since we released BitNet b1.58🔥🔥
After compressing LLMs to 1.58 bits, 1-bit LLM inference is no longer memory-bound but compute-bound.
🚀🚀 Today we introduce Q-Sparse, which can significantly speed up LLM computation.
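Roughly, the two ingredients look like this (a simplified sketch, not the papers' exact quantization or sparsification recipes; the matrix size and top-K value are made up):

```python
# b1.58 quantizes weights to {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight);
# Q-Sparse keeps only the top-K activations per token, so the matmul only
# touches a fraction of the weights.
import torch

def ternary_quant(w, eps=1e-8):
    scale = w.abs().mean() + eps
    return (w / scale).round().clamp(-1, 1), scale   # weights in {-1, 0, +1}

def topk_sparsify(x, k):
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x).scatter(-1, idx, 1.0)
    return x * mask                                   # keep only top-K activations

W = torch.randn(4096, 4096)
Wq, s = ternary_quant(W)
x = topk_sparsify(torch.randn(4096), k=1024)
y = (x @ Wq.T) * s        # 1.58-bit weights applied to sparse activations
```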
If this actually replicates/works, this is huge
Lifelong learning, reduced forgetting, etc.
I’ve always had iffy experiences with MoEs, but this is very exciting
How soon will we see the first example of a single person, shepherding a humanoid robot fleet, building a billion-dollar revenue business as the sole employee?
LLM model size competition is intensifying… backwards!
My bet is that we'll see models that "think" very well and reliably that are very very small. There is most likely a setting even of GPT-2 parameters for which most people will consider GPT-2 "smart". The reason current models are so large is because we're still being very wasteful during training - we're asking them to memorize the internet and, remarkably, they do.
Currently it looks like Llama3.1-405b beats gpt-4o in almost all benchmarks (except human_eval and mmlu_social_sciences).
Previously there was a lot of concern that Llama 3 would perform worse, but initial tests show excellent results.
Meanwhile, rumors are growing louder that…
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task in compute) and 87.5% in high-compute mode.
Attention has been the key component for most advances in LLMs, but it can’t scale to long context. Does this mean we need to find an alternative?
Presenting Titans: a new architecture with attention and a meta in-context memory that learns how to memorize at test time. Titans are more effective than Transformers and modern linear RNNs, and can effectively scale to context windows larger than 2M tokens.
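A toy version of that test-time memorization, as I understand it (the surprise/momentum/decay update below is a simplification; the constants and function names are illustrative, not the paper's):

```python
# Toy neural memory updated at test time: "surprise" is the gradient of an
# associative recall loss; momentum accumulates it, and a decay term forgets.
import torch

def update_memory(M, momentum, k, v, lr=0.1, beta=0.9, decay=0.01):
    M = M.detach().requires_grad_(True)
    surprise = ((M @ k - v) ** 2).mean()        # how badly memory recalls v from k
    (grad,) = torch.autograd.grad(surprise, M)
    momentum = beta * momentum - lr * grad      # carry surprise across tokens
    M = (1 - decay) * M.detach() + momentum     # forget a little, then memorize
    return M, momentum

d = 64
M, mom = torch.zeros(d, d), torch.zeros(d, d)
for k, v in zip(torch.randn(128, d), torch.randn(128, d)):   # key/value stream
    M, mom = update_memory(M, mom, k, v)
recalled = M @ torch.randn(d)                   # query the memory at read time
```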
🚀 DeepSeek-R1 is here!
⚡ Performance on par with OpenAI-o1
📖 Fully open-source model & technical report
🏆 MIT licensed: Distill & commercialize freely!
🌐 Website & API are live now! Try DeepThink at chat.deepseek.com today!
🐋 1/n
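Minimal usage sketch, assuming DeepSeek's OpenAI-compatible endpoint and the `deepseek-reasoner` model name from their docs (check the docs for current values):

```python
# Call DeepSeek-R1 through the OpenAI-compatible API (endpoint and model name
# per DeepSeek's documentation; replace the key with your own).
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)
print(resp.choices[0].message.content)
```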
swe-bench verified (real world coding benchmark; github issue input, github PR output) went from 5% solved to 65% in one year. prob near 100% before 2026
This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.
🚀 Meet EvaByte: The best open-source tokenizer-free language model! Our 6.5B byte LM matches modern tokenizer-based LMs with 5x less data & 2x faster decoding, naturally extending to multimodal tasks while fixing tokenization quirks.
💻 Blog: bit.ly/3CjEmTC
🧵 1/9
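What "tokenizer-free" means in practice, sketched below (dimensions are illustrative, not EvaByte's actual architecture): the model consumes raw UTF-8 bytes, so the vocabulary is just the 256 byte values plus a few specials, and there is no BPE merge table to cause quirks.

```python
# Byte-level input: every string maps losslessly to a sequence of UTF-8 bytes.
import torch
import torch.nn as nn

text = "Tokenization quirks? 🤔"
ids = torch.tensor(list(text.encode("utf-8")))            # byte IDs in [0, 255]

embed = nn.Embedding(num_embeddings=256 + 4, embedding_dim=512)  # tiny "vocab"
h = embed(ids)                                             # (num_bytes, 512) LM input
print(len(text), "chars ->", len(ids), "bytes")
```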
Kagi products will always be free of ads and trackers. In fact, Kagi Search will actively down-rank sites with lots of ads and trackers in the results and promote sites with little or no advertising.
An ad-free web is better, safer, more private and user-friendly.
I converted one of my favorite talks I've given over the past year into a blog post.
"On the Tradeoffs of SSMs and Transformers"
(or: tokens are bullshit)
In a few days, we'll release what I believe is the next major advance for architectures.
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence.
Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
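A hand-wavy sketch of what dynamic chunking could look like (my simplification ahead of the release, not the actual architecture; the boundary scorer and mean-pooling below are illustrative):

```python
# Learned, content-dependent chunking: a boundary scorer over byte states
# decides where chunks end, and each chunk is pooled into one vector that
# feeds the next level of the hierarchy.
import torch
import torch.nn as nn

d = 256
byte_states = torch.randn(1, 128, d)               # low-level (byte) representations
boundary_scorer = nn.Linear(d, 1)

probs = torch.sigmoid(boundary_scorer(byte_states)).squeeze(-1)   # (1, 128)
is_boundary = probs > 0.5                           # content-dependent chunk ends
chunk_ids = is_boundary.long().cumsum(dim=-1)       # assign each byte to a chunk

# Mean-pool bytes belonging to the same chunk into one high-level token.
chunks = [byte_states[0, chunk_ids[0] == c].mean(dim=0)
          for c in chunk_ids[0].unique()]
high_level_tokens = torch.stack(chunks)
print(byte_states.shape[1], "bytes ->", high_level_tokens.shape[0], "chunks")
```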
again, the AI labs are obsessed with building reasoning-native language models when they need to be building *memory-native* language models
- this is possible (the techniques exist)
- no one has done it yet (no popular LLM has a built-in memory module)
- door = wide open
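One such existing technique, sketched: an external key-value memory the model writes to and reads from across sessions, with retrieval by cosine similarity (the class name, embedding size, and similarity choice below are mine, not any particular lab's module):

```python
# Minimal external memory: store (embedding, text) pairs, retrieve the nearest
# snippets for a new query, and prepend them to the LM's context.
import torch
import torch.nn.functional as F

class MemoryStore:
    def __init__(self, dim=384):
        self.keys = torch.empty(0, dim)
        self.values = []                          # stored text snippets

    def write(self, key_vec, text):
        self.keys = torch.cat([self.keys, key_vec[None]], dim=0)
        self.values.append(text)

    def read(self, query_vec, k=3):
        if not self.values:
            return []
        sims = F.cosine_similarity(self.keys, query_vec[None].expand_as(self.keys))
        top = sims.topk(min(k, len(self.values))).indices
        return [self.values[int(i)] for i in top]  # prepend these to the context

mem = MemoryStore()
mem.write(torch.randn(384), "User prefers concise answers.")
print(mem.read(torch.randn(384)))
```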