The full winning recipe, documented across our team's papers and talks over the past 5 years:
1) train a huge generalist model
2) fine-tune the model from 1 on domain-specific data (even a little!)
3) distill the model from 2 into a small deployable one, with (a LOT of) patience and consistency
ggwp
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- A linear-time RNN that updates its hidden state by taking a gradient step at each token, i.e., test-time training
- Achieves better perplexity than Mamba
arxiv.org/abs/2407.04620
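A toy sketch of the core idea as I read the abstract (the hidden state is itself a small model, and the recurrence is a gradient step on a self-supervised loss); this is my illustration, not the paper's actual TTT layer:

```python
# Minimal sketch of the test-time-training idea (a toy version, not the paper's
# exact layer): the hidden "state" is a tiny linear model W that is updated by
# one gradient step per token on a self-supervised reconstruction loss.
import torch

def ttt_rnn_step(W, x, lr=0.1):
    """One recurrence step. W: (d, d) hidden state, x: (d,) token embedding."""
    # Self-supervised objective: W should reconstruct x from a corrupted view.
    x_corrupt = x + 0.1 * torch.randn_like(x)
    pred = x_corrupt @ W
    loss = ((pred - x) ** 2).mean()
    # The state update IS a gradient step, i.e. learning at test time.
    grad = torch.autograd.grad(loss, W)[0]
    W_next = W - lr * grad
    output = x @ W_next  # read out with the updated state
    return W_next.detach().requires_grad_(True), output

d = 16
W = torch.zeros(d, d, requires_grad=True)
for x in torch.randn(8, d):  # 8-token toy sequence
    W, y = ttt_rnn_step(W, x)
```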
My new blog post argues from first principles how length normalization in preference learning objectives (e.g., SimPO) can facilitate learning from model-annotated preference data. Check it out! cs.princeton.edu/~smalladi/blog…
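For context, the length-normalized objective SimPO uses (written from memory; see the blog for the actual first-principles argument):

```latex
% SimPO's implicit reward is the average (length-normalized) log-likelihood,
% rather than DPO's summed log-ratio against a reference model:
\[
\mathcal{L}_{\text{SimPO}}
= -\log \sigma\!\Big(
    \tfrac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
  - \tfrac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
  - \gamma
\Big)
\]
```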
Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors
Quantizes neural network layers to sub-bit precision by tiling them with learnable binary vectors: a single tile is reused per layer to fill out the full weight tensor. Impressive results with ViT and Swin-T.
Links below
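A rough sketch of the tile-reuse idea as I read the abstract (not the authors' code; the tile size and training details are my own choices):

```python
# A layer stores one small learnable binary tile plus a per-layer scale; the
# full weight tensor is that tile repeated, so storage is well under 1 bit/weight.
import torch
import torch.nn as nn

class TiledBinaryLinear(nn.Module):
    def __init__(self, in_features, out_features, tile_size=256):
        super().__init__()
        self.shape = (out_features, in_features)
        self.tile = nn.Parameter(torch.randn(tile_size))  # learnable, binarized at use
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        n = self.shape[0] * self.shape[1]
        binary_tile = torch.sign(self.tile)                # {-1, +1} values
        reps = -(-n // self.tile.numel())                  # ceil division
        w = binary_tile.repeat(reps)[:n].view(self.shape)  # reuse the tile to fill the tensor
        # (training would need a straight-through estimator for sign(); omitted here)
        return nn.functional.linear(x, self.scale * w)

layer = TiledBinaryLinear(512, 512, tile_size=256)
y = layer(torch.randn(4, 512))
```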
Correcting the Mythos of KL-Regularization: Direct Alignment without Overparameterization via Chi-squared Preference Optimization
A one-line change to DPO that implements the principle of pessimism to alleviate overoptimization. No models tested. Potential paper there.
Links below
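My reading of the "one-line change", written from memory and worth checking against the paper: DPO's log link gets replaced by a mixed chi-squared + KL link applied to the same likelihood ratios, which penalizes large deviations from the reference model far more aggressively (that's the pessimism):

```latex
% Standard DPO:
\[
\mathcal{L}_{\text{DPO}}
= -\log \sigma\!\Big(
    \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
  - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\Big)
\]
% chi^2-PO (my recollection of the one-line change): swap the log link for
% \phi(z) = z + \log z on the same ratios:
\[
\mathcal{L}_{\chi^2\text{PO}}
= -\log \sigma\!\Big(
    \beta\, \phi\!\Big(\tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}\Big)
  - \beta\, \phi\!\Big(\tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)
\Big),
\qquad \phi(z) = z + \log z .
\]
```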
the disparity between providers serving L3.1 is directly due to quantization and more indirectly due to a misunderstanding of benchmarks. people evaluate their quantization methods, which are all primarily activation outlier mitigation strategies, on benchmarks and (1/2)
- GDM is now leading the AGI race
- Llama3.1 changed everything and Llama4 is the most important model in the world right now in terms of potential impact (short of an "AGI has been achieved internally" announcement)
- real talk, if Character.ai with Noam can't make it on
right now, using best-of-N sampling on L3.1 70B with L3.1 8B as a draft model to get a reasonably high N, plus a reasonably good RM as a judge, would put you in sight of SOTA. It's not SOTA by any stretch of the imagination, but you could see it from there.
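A hedged sketch of that recipe; the two helper functions are stand-ins for your serving stack (speculative decoding with the 8B drafting, an RM scoring completions), not real APIs:

```python
import random

def generate_with_draft(target, draft, prompt, temperature=1.0):
    # stand-in: in practice the 70B is served with the 8B drafting tokens
    # (speculative decoding), so each sample is cheap enough for a high N
    return f"[completion of {prompt!r} at T={temperature}]"

def reward_model_score(prompt, completion):
    # stand-in: in practice this is your RM acting as the judge
    return random.random()

def best_of_n(prompt, n=16):
    candidates = [
        generate_with_draft(
            target="llama-3.1-70b-instruct",  # illustrative model names
            draft="llama-3.1-8b-instruct",
            prompt=prompt,
            temperature=1.0,
        )
        for _ in range(n)
    ]
    # keep the sample the RM likes best
    return max(candidates, key=lambda c: reward_model_score(prompt, c))
```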
Hi! Guess the secret’s out: Introducing Min P, a token sampling method for LLMs. Using < 10 lines of code, we achieve 10-20% better results on GSM8K and GPQA vs Top P at temperature=1.
Interestingly, reasoning on GPQA and creative writing ability *improves* as temperature > 1 🧵
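A minimal sketch of min-p as I understand it (not the authors' reference implementation): keep only tokens whose probability is at least min_p times the top token's probability, renormalize, and sample. The cutoff scales with the model's confidence, which is why it holds up at temperature > 1:

```python
import torch

def min_p_sample(logits, min_p=0.1, temperature=1.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    threshold = min_p * probs.max()                     # confidence-scaled cutoff
    probs = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()                         # renormalize survivors
    return torch.multinomial(probs, num_samples=1)

next_token = min_p_sample(torch.randn(32000), min_p=0.1, temperature=1.5)
```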
Google Maps really be like "hey do you want to take an evening stroll through the cemetery where they reposed the mass graves of thousands of massacred civilians from the Japanese Occupation during the Ghost Month?" - No I do not.
was gonna share my notes on the ADAS paper but xjdr beat me to it and much more succinctly
here are more findings:
if you take a look at the outputs of the meta agent that the authors share, you’ll see the agent is mainly just remixing variants of this pattern: try 3-5
after weeks of intense labor in the eval gulags, i can confidently and unequivocally say that 405B Instruct at BF16 (f32 softmax) with vanilla attention, scaled RoPE and best-of-N sampling (N at ~5) is the best available model for the things i do most often (code, agents, etc).
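What "(f32 softmax)" typically means in practice, as a generic sketch rather than anyone's actual serving code: keep weights and activations in BF16, but upcast the attention logits to float32 for the softmax so the normalization doesn't lose precision:

```python
import torch

def attention_fp32_softmax(q, k, v):
    # q, k, v: (batch, heads, seq, dim) in bf16; causal mask omitted for brevity
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    probs = torch.softmax(scores.float(), dim=-1)   # softmax in fp32
    return torch.matmul(probs.to(v.dtype), v)       # back to bf16 for the value matmul

q = k = v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)
out = attention_fp32_softmax(q, k, v)
```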
When MaxText adds something, you know it's good. I am going to call this a near-confirmation of at least part of Flash's magic being a 1:6 or 1:8 global:local_sliding attention ratio
github.com/google/maxtext…
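To make the ratio concrete, here is my illustration of what a 1:6 global:local_sliding layer pattern would look like (not MaxText's actual config or code; the window size is a placeholder):

```python
def build_attention_pattern(num_layers, ratio=6, window=4096):
    """Every (ratio+1)-th layer is global attention; the rest are local sliding-window."""
    pattern = []
    for i in range(num_layers):
        if (i + 1) % (ratio + 1) == 0:
            pattern.append({"type": "global"})
        else:
            pattern.append({"type": "local_sliding", "window": window})
    return pattern

# e.g. a 42-layer model at 1:6 -> 6 global layers, 36 sliding-window layers
layers = build_attention_pattern(42, ratio=6)
```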
Mistral-Large-Instruct-2407 at FP8 weights, FP8 activations, and FP8 KV cache (E4M3). Should be the optimal way to run it.
huggingface.co/alpindale/Mist…
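Something like this in a vLLM-style stack should do it (argument names from memory, and the repo name below is a hypothetical stand-in for the truncated link above; double-check against your vLLM version's docs):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="alpindale/Mistral-Large-Instruct-2407-FP8",  # placeholder; use the actual repo from the link
    kv_cache_dtype="fp8_e4m3",  # FP8 KV cache in E4M3
    # FP8 weights/activations should be picked up from the checkpoint's quantization config
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```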
This is my current mental model for how o1 works (replace "batch" with "parallel generation" in the last picture)
So far, in the limited testing I have been able to do with the rate limits in place (even for the T5 API), the results have been very consistent with this model.