Maximilian Beck (@maxmbeck)'s Twitter Profile
Maximilian Beck

@maxmbeck

PhD Student @ JKU Linz Institute for Machine Learning.

ID: 1401163561322389508

Link: http://maxbeck.ai
Joined: 05-06-2021 13:06:56

110 Tweets

379 Followers

510 Following

TuringPost (@TheTuringPost)'s Twitter Profile Photo

If we scale Long Short-Term Memory (LSTM) to billions of parameters and use modern techniques, can they compete with Transformers?

2 innovations to enhance LSTMs' performance:

- New gating mechanism
- Modified memory structures

And these changes are called xLSTM

🧵
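For a concrete feel of what a new gating mechanism can look like, here is a minimal NumPy sketch of an exponentially-gated recurrent cell with a normalizer state, in the spirit of the xLSTM idea but omitting the paper's stabilization tricks and block structure; all names and shapes are illustrative.

```python
import numpy as np

def exp_gated_cell(xs, W_z, W_i, W_f, W_o):
    """Toy exponentially-gated recurrent cell (illustrative, not the paper's exact cell).

    xs: sequence of inputs, shape (T, d_in); W_*: projections, shape (d_hidden, d_in).
    """
    d_hidden = W_z.shape[0]
    c = np.zeros(d_hidden)   # cell state
    n = np.zeros(d_hidden)   # normalizer state keeps the exponential gate bounded
    hs = []
    for x in xs:
        z = np.tanh(W_z @ x)                      # candidate cell input
        i = np.exp(W_i @ x)                       # exponential input gate (can exceed 1)
        f = 1.0 / (1.0 + np.exp(-(W_f @ x)))      # sigmoid forget gate
        o = 1.0 / (1.0 + np.exp(-(W_o @ x)))      # sigmoid output gate
        c = f * c + i * z                         # gated cell update
        n = f * n + i                             # track total gate mass for normalization
        hs.append(o * (c / np.maximum(n, 1e-6)))  # normalized hidden state
    return np.stack(hs)

# Usage on a random toy sequence
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 8, 16
Ws = [rng.normal(scale=0.1, size=(d_h, d_in)) for _ in range(4)]
print(exp_gated_cell(rng.normal(size=(T, d_in)), *Ws).shape)  # (5, 16)
```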

Claus (@ClausHofm)'s Twitter Profile Photo

Really excited to announce Energy-based Hopfield Boosting. 🎉

Hopfield Boosting advances the state-of-the-art in OOD detection. 🚀
Our new energy function allows a model to focus on hard training examples close to the decision boundary between in-distribution and OOD samples. 🧵
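As rough intuition (not the paper's exact formulation), a modern-Hopfield-style energy scores a query by how strongly it is attracted to a set of stored patterns, and an OOD score can contrast the energy with respect to in-distribution patterns against the energy with respect to auxiliary outlier patterns. A hedged NumPy sketch with illustrative names:

```python
import numpy as np
from scipy.special import logsumexp

def hopfield_energy(query, patterns, beta=1.0):
    """Modern-Hopfield-style energy of a query w.r.t. stored patterns (lower = closer)."""
    # -lse(beta, X q) is dominated by the best-matching stored pattern
    return -logsumexp(beta * patterns @ query) / beta + 0.5 * query @ query

def ood_score(query, id_patterns, aux_patterns, beta=1.0):
    """Higher score = more out-of-distribution (illustrative contrast of two energies)."""
    return hopfield_energy(query, id_patterns, beta) - hopfield_energy(query, aux_patterns, beta)
```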

Valeriy M., PhD, MBA, CQF (@predict_addict)'s Twitter Profile Photo

'xLSTM: Extended Long Short-Term Memory' by Maximilian Beck, Korbinian Poeppel, Markus Spanring, Andreas Auer, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter and co-authors is signaling a potential shift in the landscape of natural language processing technologies.

As

Tri Dao (@tri_dao)'s Twitter Profile Photo

Glad to see folks studying this problem. The paper suggests that (their impl. of) FlashAttention has higher numerical error than baseline. However, FlashAttention in BF16/FP16 will often have *lower* error than baseline if the impl. does internal softmax rescaling in FP32
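The point about accumulation precision is easy to check in isolation: computing the softmax and matmul in FP32 while keeping inputs in FP16 typically lands closer to an FP64 reference than a fully-FP16 computation. A small NumPy sketch (illustrative, not the FlashAttention kernel itself):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(64, 64)).astype(np.float16) * 8   # attention logits
values = rng.normal(size=(64, 64)).astype(np.float16)

def softmax_matmul(s, v, dtype):
    """Softmax over the last axis followed by a matmul, all in the given dtype."""
    s = s.astype(dtype)
    p = np.exp(s - s.max(axis=-1, keepdims=True))   # numerically stable softmax
    p = p / p.sum(axis=-1, keepdims=True)
    return p @ v.astype(dtype)

ref = softmax_matmul(scores, values, np.float64)     # high-precision reference
err_fp16 = np.abs(softmax_matmul(scores, values, np.float16) - ref).max()
err_fp32 = np.abs(softmax_matmul(scores, values, np.float32) - ref).max()
print(err_fp16, err_fp32)   # the FP32 computation is usually much closer to the reference
```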

RWKV (@RWKV_AI)'s Twitter Profile Photo

🦅 Eagle & 🐦 Finch

The RWKV v5 and v6 architecture paper is here
arxiv.org/abs/2404.05892

Both improve over RWKV-4, scaled up to 7.5B and 3.1B parameter multilingual models respectively

Open-source code, weights, and dataset
Apache 2 licensed, under the Linux Foundation

Soham De (@sohamde_)'s Twitter Profile Photo

Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params!

arxiv.org/abs/2402.19427

My co-authors have already posted about our amazing results, so here's a 🧵on how we got there!
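Structurally, "mixing RNN layers with Local Attention" means the temporal-mixing sublayer alternates between a recurrence and a windowed attention as you go up the stack. A schematic sketch of that interleaving; the block names and the 2:1 pattern here are illustrative rather than the exact Griffin recipe:

```python
# Schematic only: how a hybrid stack can interleave recurrent and local-attention
# temporal-mixing layers. The block names are placeholders.
def build_hybrid_stack(depth, recurrences_per_attention=2):
    layers = []
    for i in range(depth):
        if (i + 1) % (recurrences_per_attention + 1) == 0:
            layers.append("LocalAttentionBlock")   # attends only within a sliding window
        else:
            layers.append("RecurrentBlock")        # gated recurrence over the sequence
        layers.append("MLPBlock")                  # channel mixing after each temporal-mixing layer
    return layers

print(build_hybrid_stack(6))  # R, R, A, R, R, A pattern (plus MLPs)
```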

Samuel L Smith (@SamuelMLSmith)'s Twitter Profile Photo

Announcing RecurrentGemma!
github.com/google-deepmin…

- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences
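For intuition on the higher sampling throughput: a gated linear recurrence keeps the update linear in the hidden state with an input-dependent decay, so generation carries a constant-size state instead of a growing KV cache. A minimal NumPy sketch of such a recurrence, not the exact RG-LRU used in Griffin/RecurrentGemma:

```python
import numpy as np

def gated_linear_recurrence(xs, W_a, W_x):
    """h_t = a_t * h_{t-1} + (1 - a_t) * (W_x x_t), with a_t in (0, 1) from a sigmoid gate."""
    h = np.zeros(W_x.shape[0])
    out = []
    for x in xs:
        a = 1.0 / (1.0 + np.exp(-(W_a @ x)))   # input-dependent decay gate
        h = a * h + (1.0 - a) * (W_x @ x)      # linear in h: no softmax over past tokens
        out.append(h)
    return np.stack(out)
```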

Jonathan Frankle (@jefrankle)'s Twitter Profile Photo

Meet DBRX, a new SOTA open LLM from Databricks. It's a 132B MoE with 36B active params trained from scratch on 12T tokens. It sets a new bar on all the standard benchmarks, and - as an MoE - inference is blazingly fast. Simply put, it's the model your data has been waiting for.
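The 132B-total vs 36B-active split is what mixture-of-experts routing buys you: every token runs through only the top-k experts, so per-token compute scales with active parameters rather than total parameters. A toy top-k router in NumPy; the expert count and shapes are illustrative, not DBRX's exact configuration:

```python
import numpy as np

def moe_layer(x, router_W, experts, k=4):
    """Route a token through the top-k of len(experts) experts; experts are (W, b) pairs."""
    logits = router_W @ x
    top = np.argsort(logits)[-k:]                  # indices of the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                    # softmax over the selected experts only
    return sum(g * (W @ x + b) for g, (W, b) in zip(gates, (experts[i] for i in top)))

# Only k experts' parameters are touched per token -> "active" params << total params.
```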

Sakana AI (@SakanaAILabs)'s Twitter Profile Photo

Introducing Evolutionary Model Merge: A new approach bringing us closer to automating foundation model development. We use evolution to find great ways of combining open-source models, building new powerful foundation models with user-specified abilities!

sakana.ai/evolutionary-m…
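In the simplest "merging in parameter space" setting, the thing being evolved is a set of mixing coefficients over the source models' weights, scored by a downstream metric. A heavily simplified sketch of that loop; the fitness function and the whole-model granularity are placeholders, not Sakana's actual method (which also merges in data-flow space):

```python
import numpy as np

def merge(models, weights):
    """Weighted average of parameter dicts; `models` is a list of {name: ndarray}."""
    w = np.asarray(weights) / np.sum(weights)
    return {k: sum(wi * m[k] for wi, m in zip(w, models)) for k in models[0]}

def evolve_merge(models, fitness, generations=50, pop=16, rng=np.random.default_rng(0)):
    """Toy hill-climbing evolution over non-negative mixing coefficients."""
    best_w, best_fit = np.ones(len(models)), -np.inf
    for _ in range(generations):
        candidates = np.abs(best_w + rng.normal(scale=0.1, size=(pop, len(models))))
        for w in candidates:
            f = fitness(merge(models, w))        # e.g. accuracy on a small validation set
            if f > best_fit:
                best_w, best_fit = w, f
    return merge(models, best_w), best_w
```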

Günter Klambauer (@gklambauer)'s Twitter Profile Photo

GNN-VPA: A Variance-Preserving Aggregation Strategy for Graph Neural Networks

A simple change in the aggregation function helps GNNs (a) improve signal propagation in message passing and (b) at the same time keep maximal 1-WL expressive power!

Paper: arxiv.org/abs/2403.04747
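The intuition behind variance-preserving aggregation: summing n i.i.d. messages scales the variance by n, averaging scales it by 1/n, while dividing the sum by sqrt(n) keeps the variance roughly constant regardless of node degree, so signals neither blow up nor wash out as they propagate. A minimal sketch of the three aggregators (illustrative; see the paper for the exact formulation):

```python
import numpy as np

def aggregate(messages, mode):
    """messages: (n_neighbors, d) array of incoming messages for one node."""
    n = len(messages)
    if mode == "sum":   return messages.sum(axis=0)                # variance grows ~ n
    if mode == "mean":  return messages.mean(axis=0)               # variance shrinks ~ 1/n
    if mode == "vpa":   return messages.sum(axis=0) / np.sqrt(n)   # variance stays ~ constant

rng = np.random.default_rng(0)
for n in (4, 64, 1024):
    msgs = rng.normal(size=(n, 256))
    print(n, {m: float(np.var(aggregate(msgs, m))) for m in ("sum", "mean", "vpa")})
```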

Tianle Cai (@tianle_cai)'s Twitter Profile Photo

Everyone is talking about the incredible LLM inference speed that Groq Inc chips achieve, but few notice its cost, especially if you want to replace your GPU/TPU stack with it. Long story short, it requires hundreds of chips to serve a single LLM since each chip only has a ~200MB
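The back-of-the-envelope math behind that claim: with only a couple hundred megabytes of on-chip memory per chip and no large off-chip DRAM, the whole model has to be spread across chips. A rough, hedged calculation (the model size is an assumption for illustration, and this ignores activations, KV cache, and any duplication):

```python
# Rough estimate: chips needed just to hold the weights on-chip.
params = 70e9            # e.g. a 70B-parameter model (assumption)
bytes_per_param = 2      # FP16/BF16 weights
sram_per_chip = 200e6    # ~200 MB of on-chip memory per chip, as the tweet notes

chips = params * bytes_per_param / sram_per_chip
print(f"~{chips:.0f} chips")   # ~700 chips for the weights alone under these assumptions
```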
