The full winning recipe, documented across our team's papers and talks over the past 5 years:
1) train a huge generalist model
2) fine-tune the model from 1 on domain-specific data (even a little!)
3) distill the model from 2 into a small deployable one, with (a LOT of) patience and consistency
ggwp
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- A linear-time RNN that updates its hidden state by taking a gradient step at each token, i.e., test-time training
- Achieves better perplexity than Mamba
arxiv.org/abs/2407.04620
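A toy sketch of the core idea as I read the abstract (the hidden state is itself a small model, and the recurrence is a gradient step on a self-supervised loss); this is my illustration, not the paper's actual TTT layer:

```python
# Minimal sketch of the test-time-training idea (a toy version, not the paper's
# exact layer): the hidden "state" is a tiny linear model W that is updated by
# one gradient step per token on a self-supervised reconstruction loss.
import torch

def ttt_rnn_step(W, x, lr=0.1):
    """One recurrence step. W: (d, d) hidden state, x: (d,) token embedding."""
    # Self-supervised objective: W should reconstruct x from a corrupted view.
    x_corrupt = x + 0.1 * torch.randn_like(x)
    pred = x_corrupt @ W
    loss = ((pred - x) ** 2).mean()
    # The state update IS a gradient step, i.e. learning at test time.
    grad = torch.autograd.grad(loss, W)[0]
    W_next = W - lr * grad
    output = x @ W_next  # read out with the updated state
    return W_next.detach().requires_grad_(True), output

d = 16
W = torch.zeros(d, d, requires_grad=True)
for x in torch.randn(8, d):  # 8-token toy sequence
    W, y = ttt_rnn_step(W, x)
```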
My new blog post argues from first principles how length normalization in preference learning objectives (e.g., SimPO) can facilitate learning from model-annotated preference data. Check it out! cs.princeton.edu/~smalladi/blog…
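For context, the length-normalized objective SimPO uses (written from memory; see the blog for the actual first-principles argument):

```latex
% SimPO's implicit reward is the average (length-normalized) log-likelihood,
% rather than DPO's summed log-ratio against a reference model:
\[
\mathcal{L}_{\text{SimPO}}
= -\log \sigma\!\Big(
    \tfrac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
  - \tfrac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
  - \gamma
\Big)
\]
```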
Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors
Quantizes neural network layers to sub-bit precision by tiling them with learnable binary vectors: a single tile is reused per layer to fill out the full weight tensor. Impressive results with ViT and Swin-T.
Links below
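A rough sketch of the tile-reuse idea as I read the abstract (not the authors' code; the tile size and training details are my own choices):

```python
# A layer stores one small learnable binary tile plus a per-layer scale; the
# full weight tensor is that tile repeated, so storage is well under 1 bit/weight.
import torch
import torch.nn as nn

class TiledBinaryLinear(nn.Module):
    def __init__(self, in_features, out_features, tile_size=256):
        super().__init__()
        self.shape = (out_features, in_features)
        self.tile = nn.Parameter(torch.randn(tile_size))  # learnable, binarized at use
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        n = self.shape[0] * self.shape[1]
        binary_tile = torch.sign(self.tile)                # {-1, +1} values
        reps = -(-n // self.tile.numel())                  # ceil division
        w = binary_tile.repeat(reps)[:n].view(self.shape)  # reuse the tile to fill the tensor
        # (training would need a straight-through estimator for sign(); omitted here)
        return nn.functional.linear(x, self.scale * w)

layer = TiledBinaryLinear(512, 512, tile_size=256)
y = layer(torch.randn(4, 512))
```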
Correcting the Mythos of KL-Regularization: Direct Alignment without Overparameterization via Chi-squared Preference Optimization
A one-line change to DPO that implements the principle of pessimism to alleviate overoptimization. No models tested. Potential paper there.
Links below
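My reading of the "one-line change", written from memory and worth checking against the paper: DPO's log link gets replaced by a mixed chi-squared + KL link applied to the same likelihood ratios, which penalizes large deviations from the reference model far more aggressively (that's the pessimism):

```latex
% Standard DPO:
\[
\mathcal{L}_{\text{DPO}}
= -\log \sigma\!\Big(
    \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
  - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\Big)
\]
% chi^2-PO (my recollection of the one-line change): swap the log link for
% \phi(z) = z + \log z on the same ratios:
\[
\mathcal{L}_{\chi^2\text{PO}}
= -\log \sigma\!\Big(
    \beta\, \phi\!\Big(\tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}\Big)
  - \beta\, \phi\!\Big(\tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)
\Big),
\qquad \phi(z) = z + \log z .
\]
```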
the disparity between providers serving L3.1 is directly due to quantization and more indirectly due to a misunderstanding of benchmarks. people evaluate their quantization methods, which are all primarily activation outlier mitigation strategies, on benchmarks and (1/2)
- GDM is now leading the AGI race
- Llama3.1 changed everything and Llama4 is the most important model in the world right now in terms of potential impact (short of an "AGI has been achieved internally" announcement)
- real talk, if Character.ai with Noam can't make it on
right now, using best-of-N sampling on L3.1 70B with L3.1 8B as a draft model to get a reasonably high N, plus a reasonably good RM as a judge, would put you in sight of SOTA. It's not SOTA by any stretch of the imagination, but you could see it from there.
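A hedged sketch of that recipe; the two helper functions are stand-ins for your serving stack (speculative decoding with the 8B drafting, an RM scoring completions), not real APIs:

```python
import random

def generate_with_draft(target, draft, prompt, temperature=1.0):
    # stand-in: in practice the 70B is served with the 8B drafting tokens
    # (speculative decoding), so each sample is cheap enough for a high N
    return f"[completion of {prompt!r} at T={temperature}]"

def reward_model_score(prompt, completion):
    # stand-in: in practice this is your RM acting as the judge
    return random.random()

def best_of_n(prompt, n=16):
    candidates = [
        generate_with_draft(
            target="llama-3.1-70b-instruct",  # illustrative model names
            draft="llama-3.1-8b-instruct",
            prompt=prompt,
            temperature=1.0,
        )
        for _ in range(n)
    ]
    # keep the sample the RM likes best
    return max(candidates, key=lambda c: reward_model_score(prompt, c))
```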
Hi! Guess the secret’s out: Introducing Min P, a token sampling method for LLMs. Using < 10 lines of code, we achieve 10-20% better results on GSM8K and GPQA vs Top P at temperature=1.
Interestingly, reasoning on GPQA and creative writing ability *improves* as temperature > 1 🧵
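A minimal sketch of min-p as I understand it (not the authors' reference implementation): keep only tokens whose probability is at least min_p times the top token's probability, renormalize, and sample. The cutoff scales with the model's confidence, which is why it holds up at temperature > 1:

```python
import torch

def min_p_sample(logits, min_p=0.1, temperature=1.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    threshold = min_p * probs.max()                     # confidence-scaled cutoff
    probs = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()                         # renormalize survivors
    return torch.multinomial(probs, num_samples=1)

next_token = min_p_sample(torch.randn(32000), min_p=0.1, temperature=1.5)
```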
Google Maps really be like "hey do you want to take an evening stroll through the cemetery where they reposed the mass graves of thousands of massacred civilians from the Japanese Occupation during the Ghost Month?" - No I do not.
was gonna share my notes on the ADAS paper but xjdr beat me to it and much more succinctly
here are more findings:
if you take a look at the outputs of the meta agent that the authors share, you’ll see the agent is mainly just remixing variants of this pattern: try 3-5
after weeks of intense labor in the eval gulags, i can confidently and unequivocally say that 405B Instruct at BF16 (f32 softmax) with vanilla attention, scaled RoPE and best-of-N sampling (N at ~5) is the best available model for the things i do most often (code, agents, etc).
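What "(f32 softmax)" typically means in practice, as a generic sketch rather than anyone's actual serving code: keep weights and activations in BF16, but upcast the attention logits to float32 for the softmax so the normalization doesn't lose precision:

```python
import torch

def attention_fp32_softmax(q, k, v):
    # q, k, v: (batch, heads, seq, dim) in bf16; causal mask omitted for brevity
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    probs = torch.softmax(scores.float(), dim=-1)   # softmax in fp32
    return torch.matmul(probs.to(v.dtype), v)       # back to bf16 for the value matmul

q = k = v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)
out = attention_fp32_softmax(q, k, v)
```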
When MaxText adds something, you know it's good. I am going to call this a near-confirmation of at least part of Flash's magic being a 1:6 or 1:8 global:local_sliding attention ratio
github.com/google/maxtext…
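To make the ratio concrete, here is my illustration of what a 1:6 global:local_sliding layer pattern would look like (not MaxText's actual config or code; the window size is a placeholder):

```python
def build_attention_pattern(num_layers, ratio=6, window=4096):
    """Every (ratio+1)-th layer is global attention; the rest are local sliding-window."""
    pattern = []
    for i in range(num_layers):
        if (i + 1) % (ratio + 1) == 0:
            pattern.append({"type": "global"})
        else:
            pattern.append({"type": "local_sliding", "window": window})
    return pattern

# e.g. a 42-layer model at 1:6 -> 6 global layers, 36 sliding-window layers
layers = build_attention_pattern(42, ratio=6)
```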
Mistral-Large-Instruct-2407 at FP8 weights, FP8 activations, and FP8 KV cache (E4M3). Should be the optimal way to run it.
huggingface.co/alpindale/Mist…
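Something like this in a vLLM-style stack should do it (argument names from memory, and the repo name below is a hypothetical stand-in for the truncated link above; double-check against your vLLM version's docs):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="alpindale/Mistral-Large-Instruct-2407-FP8",  # placeholder; use the actual repo from the link
    kv_cache_dtype="fp8_e4m3",  # FP8 KV cache in E4M3
    # FP8 weights/activations should be picked up from the checkpoint's quantization config
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```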
This is my current mental model for how o1 works (replace "batch" with "parallel generation" in the last picture)
So far, in the limited testing I have been able to do with the rate limits in place (even for the T5 API), the results have been very consistent with this model.