Ops Aeterna (@opsaeterna) 's Twitter Profile
Ops Aeterna

@opsaeterna

ID: 1675016405132120064

Joined: 01-07-2023 05:40:33

871 Tweets

517 Followers

851 Following

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

The full winning recipe, documented across our team's papers and talks over the past 5 years: 1) train a huge generalist model 2) fine-tune the model from 1 on domain-specific data (even a little!) 3) distill the model from 2 into a small deployable one with (a LOT of) patience and consistency. ggwp
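
A minimal sketch of step 3 under the usual soft-target distillation formulation; the temperature, loss weighting, and training-loop names here are illustrative assumptions, not something from the tweet.

```python
# Hedged sketch of step 3 (distillation): temperature, loss weighting and the
# loop skeleton are illustrative assumptions, not from the tweet.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (teacher -> student) with the usual CE on labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Training-loop skeleton: teacher is the fine-tuned generalist from step 2,
# student is the small deployable model; both are assumed to exist already.
# for batch in loader:
#     with torch.no_grad():
#         t_logits = teacher(batch["input"])
#     s_logits = student(batch["input"])
#     loss = distill_loss(s_logits, t_logits, batch["labels"])
#     loss.backward(); opt.step(); opt.zero_grad()
```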

Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

- Runs as a linear-time RNN by propagating the gradient to the next step, i.e., test-time training
- Achieves better perplexity than Mamba

arxiv.org/abs/2407.04620
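
A rough sketch of the test-time-training idea the abstract describes: the hidden state is itself a small model updated by a gradient step per token on a self-supervised loss. The linear state, reconstruction loss, and learning rate below are assumptions for illustration, not details from the paper.

```python
# Hedged sketch: the "hidden state" is a linear map W, updated by one gradient
# step per token on a self-supervised reconstruction loss. Loss choice, learning
# rate, and the lack of learned projections are simplifications.
import torch

def ttt_linear_scan(tokens, lr=0.1):
    """tokens: (seq_len, d). Returns outputs of shape (seq_len, d)."""
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                    # hidden state = a small model
    outputs = []
    for x in tokens:
        pred = W @ x                         # inner loss: 0.5 * ||W x - x||^2
        grad = torch.outer(pred - x, x)      # d(loss)/dW
        W = W - lr * grad                    # one test-time training step
        outputs.append(W @ x)                # output rule: apply the updated state
    return torch.stack(outputs)
```
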
Sadhika Malladi (@sadhikamalladi) 's Twitter Profile Photo

My new blog post argues from first principles how length normalization in preference learning objectives (e.g., SimPO) can facilitate learning from model-annotated preference data. Check it out! cs.princeton.edu/~smalladi/blog…
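
For context, a hedged sketch of a length-normalized preference loss in the SimPO style (reward = beta/|y| * log pi(y|x), no reference model); the margin and exact normalization follow my reading of SimPO, not the blog post.

```python
# Hedged sketch of a length-normalized preference objective (SimPO-style).
# beta and gamma values are illustrative assumptions.
import torch
import torch.nn.functional as F

def simpo_style_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
                     beta=2.0, gamma=0.5):
    """logp_*: summed token log-probs per sequence; len_*: token counts."""
    r_w = beta * logp_chosen / len_chosen        # length-normalized reward
    r_l = beta * logp_rejected / len_rejected
    return -F.logsigmoid(r_w - r_l - gamma).mean()
```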

PapersAnon (@papers_anon) 's Twitter Profile Photo

Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors

Quantizes tiles of neural network layers with sequences of bits. Reuses a single tile per layer to represent the full weight tensor. Impressive results with ViT and Swin-T.

Links below
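
A hedged sketch of the tile-reuse idea as summarized above: one small learnable binary tile is repeated to cover a full weight tensor, pushing storage below one bit per weight. Tile length, scaling, and the straight-through estimator are assumptions, not details from the paper.

```python
# Hedged sketch of tile reuse: a single learnable binary tile is repeated to
# fill an entire weight tensor. Tile length, per-layer scale, and the STE are
# my assumptions for illustration.
import torch
import torch.nn as nn

class TiledBitLinear(nn.Module):
    def __init__(self, in_features, out_features, tile_len=256):
        super().__init__()
        self.shape = (out_features, in_features)
        self.tile = nn.Parameter(torch.randn(tile_len))    # latent real-valued tile
        self.scale = nn.Parameter(torch.ones(()))          # per-layer scale

    def forward(self, x):
        n = self.shape[0] * self.shape[1]
        reps = -(-n // self.tile.numel())                  # ceil division
        flat = self.tile.repeat(reps)[:n]                  # one tile covers the whole tensor
        bits = torch.sign(flat)
        bits = flat + (bits - flat).detach()               # straight-through estimator
        W = (self.scale * bits).view(self.shape)
        return x @ W.t()
```
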
PapersAnon (@papers_anon) 's Twitter Profile Photo

Correcting the Mythos of KL-Regularization: Direct Alignment without Overparameterization via Chi-squared Preference Optimization

One-line change to DPO to implement the principle of pessimism to alleviate overoptimization. No models tested. Potential paper there.

Links below
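
For illustration only, my reading of the "one-line change": where DPO scores a response with beta * log(pi/pi_ref), chi-squared preference optimization adds the raw ratio as well, mixing chi-squared and KL regularization to implement pessimism. The sketch below follows that reading and is not code from the paper.

```python
# Hedged sketch of the one-line change as I read it: replace DPO's
# beta * log(pi/pi_ref) with beta * (pi/pi_ref + log(pi/pi_ref)).
import torch
import torch.nn.functional as F

def chipo_style_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """logp_*: summed log-probs of chosen/rejected under policy and reference."""
    def link(logp, logp_ref):
        log_ratio = logp - logp_ref
        # DPO would use beta * log_ratio alone; exp(log_ratio) adds the chi-squared term
        return beta * (torch.exp(log_ratio) + log_ratio)
    return -F.logsigmoid(link(logp_w, logp_ref_w) - link(logp_l, logp_ref_l)).mean()
```
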
Ashwinee Panda (@pandaashwinee) 's Twitter Profile Photo

the disparity between providers serving L3.1 is directly due to quantization and more indirectly due to a misunderstanding of benchmarks. people evaluate their quantization methods, which are all primarily activation outlier mitigation strategies, on benchmarks and (1/2)
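
As an illustration of the "activation outlier mitigation" family mentioned here, a SmoothQuant-style per-channel smoothing sketch; the alpha value and calibration pass are assumptions, and the tweet does not endorse this particular method.

```python
# SmoothQuant-style smoothing as one example of activation outlier mitigation:
# move quantization difficulty from activations to weights via per-channel scales.
import torch

def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scales s; use X/s and W*s so X @ W.T is unchanged."""
    return (act_absmax ** alpha) / (weight_absmax ** (1 - alpha))

# x: (tokens, d_in) calibration activations, W: (d_out, d_in) weights
# s = smooth_scales(x.abs().amax(dim=0), W.abs().amax(dim=0))
# x_smoothed, W_smoothed = x / s, W * s    # then quantize both as usual
```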

xjdr (@_xjdr) 's Twitter Profile Photo

- GDM is now leading the AGI race
- Llama3.1 changed everything and Llama4 is the most important model in the world right now in terms of potential impact (short of AGI has been achieved internally announcements) 
- real talk, if Character.ai with Noam can't make it on
xjdr (@_xjdr) 's Twitter Profile Photo

right now, using best-of-N sampling on L3.1 70B with L3.1 8B as a draft model to get a reasonably high N, plus a reasonably good RM as a judge, would put you in sight of SOTA. It's not SOTA by any stretch of the imagination, but you could see it from there.
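
A hedged sketch of the best-of-N plus reward-model-as-judge part of that setup; the draft-model speculative decoding is only there to make a high N affordable and is not shown. `generate` and `reward_model` are placeholder callables.

```python
# Hedged sketch: sample N candidates and keep the one a reward model prefers.
# `generate` and `reward_model` are placeholders, not a specific API.
import torch

def best_of_n(prompt, generate, reward_model, n=16, temperature=1.0):
    """Return the candidate the reward model scores highest."""
    candidates = [generate(prompt, temperature=temperature) for _ in range(n)]
    scores = torch.tensor([reward_model(prompt, c) for c in candidates])
    return candidates[int(scores.argmax())]
```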

Minh Nhat Nguyen (@menhguin) 's Twitter Profile Photo

Hi! Guess the secret's out: Introducing Min P, a token sampling method for LLMs. Using < 10 lines of code, we achieve 10-20% better results on GSM8K and GPQA vs Top P at temperature=1.

Interestingly, reasoning on GPQA and creative writing ability *improves* as temperature > 1 🧵
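
The tweet says the method fits in under 10 lines; here is a hedged re-implementation from the description (keep tokens whose probability is at least min_p times the top probability), not the authors' code.

```python
# Hedged re-implementation of min-p sampling from the description in the tweet.
import torch

def min_p_sample(logits, min_p=0.1, temperature=1.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    probs = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    probs = probs / probs.sum(dim=-1, keepdim=True)        # renormalize survivors
    return torch.multinomial(probs, num_samples=1)
```
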
Bryan (@bryancsk) 's Twitter Profile Photo

Google Maps really be like "hey do you want to take an evening stroll through the cemetery where they reposed the mass graves of thousands of massacred civilians from the Japanese Occupation during the Ghost Month?" - No I do not.
Aneesh Sachdeva (@therealaneesh) 's Twitter Profile Photo

was gonna share my notes on the ADAS paper but xjdr beat me to it and much more succinctly

here are more findings:

if you take a look at the outputs of the meta agent that the authors share, you’ll see the agent is mainly just remixing variants of this pattern: try 3-5
xjdr (@_xjdr) 's Twitter Profile Photo

after weeks of intense labor in the eval gulags, i can confidently and unequivocally say that 405B instruct at BF16 (f32 softmax) with vanilla attention, scaled rope and best of N sampling (N at ~5) is the best available model for the things i do most often (code, agents, etc).
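
A hedged sketch of the "BF16 (f32 softmax)" detail: keep activations in bfloat16 but upcast the attention logits before the softmax. This is generic vanilla attention, not xjdr's code.

```python
# Hedged sketch: vanilla attention in bf16 with the softmax computed in float32.
import torch
import torch.nn.functional as F

def attention_bf16_f32_softmax(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim) in bfloat16."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    probs = F.softmax(scores.float(), dim=-1)       # softmax in float32
    return probs.to(v.dtype) @ v                    # back to bf16 for the value matmul
```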

xjdr (@_xjdr) 's Twitter Profile Photo

When MaxText adds something, you know it's good. I am going to call this near confirmation of at least part of flash's magic being 1:6 or 1:8 global:local_sliding attention

github.com/google/maxtext…
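
A hedged sketch of what a 1:6 global:local_sliding interleave could look like: one global-attention layer for every six local sliding-window layers. Layer counts, window size, and names are assumptions, not taken from MaxText.

```python
# Hedged sketch of a global:local_sliding layer interleave; all numbers and
# names are illustrative assumptions.
def attention_pattern(num_layers, local_per_global=6, window=4096):
    """Per-layer list of ('global', None) or ('local_sliding', window)."""
    pattern = []
    for i in range(num_layers):
        if (i + 1) % (local_per_global + 1) == 0:
            pattern.append(("global", None))
        else:
            pattern.append(("local_sliding", window))
    return pattern

# attention_pattern(14, local_per_global=6) -> 12 local layers, 2 global layers (1:6)
```
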
Alpin (@alpindale) 's Twitter Profile Photo

Mistral-Large-Instruct-2407 at FP8 weights, FP8 activations, and FP8 KV cache (E4M3). Should be the optimal way to run it. huggingface.co/alpindale/Mist…
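
A hedged sketch of per-tensor E4M3 weight quantization of the kind such a checkpoint implies: scale to the format's max finite value (~448 for e4m3fn), cast, and keep the scale for dequantization. Requires a recent PyTorch with torch.float8_e4m3fn; serving engines handle this internally.

```python
# Hedged sketch of per-tensor FP8 (E4M3) quantization; per-channel scaling and
# activation/KV-cache handling are left out.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_e4m3(w: torch.Tensor):
    scale = w.abs().amax() / E4M3_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale            # dequant: w_fp8.to(torch.float32) * scale
```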

Sadhika Malladi (@sadhikamalladi) 's Twitter Profile Photo

Submit to the Math of Modern Machine Learning (M3L) workshop at NeurIPS 2024! Deadline is Sep 29. sites.google.com/view/m3l-2024/

xjdr (@_xjdr) 's Twitter Profile Photo

This is my current mental model for how o1 works (replace batch for parallel generation in the last picture)

So far, in the limited testing I have been able to do with the rate limits in place (even for the T5 API), the results have been very consistent with this model.