Maximilian Beck (@maxmbeck)'s Twitter Profile
Maximilian Beck

@maxmbeck

PhD Student @ JKU Linz Institute for Machine Learning.

ID: 1401163561322389508

Link: http://maxbeck.ai
Joined: 05-06-2021 13:06:56

110 Tweets

379 Followers

510 Following

TuringPost (@TheTuringPost)'s Twitter Profile Photo

If we scale Long Short-Term Memory (LSTM) to billions of parameters and use modern techniques, can they compete with Transformers?

2 innovations to enhance LSTMs' performance:

- New gating mechanism
- Modified memory structures

And these changes are called xLSTM

🧵
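For a concrete feel of what a new gating mechanism can look like, here is a minimal NumPy sketch of an exponentially-gated recurrent cell with a normalizer state, in the spirit of the xLSTM idea but omitting the paper's stabilization tricks and block structure; all names and shapes are illustrative.

```python
import numpy as np

def exp_gated_cell(xs, W_z, W_i, W_f, W_o):
    """Toy exponentially-gated recurrent cell (illustrative, not the paper's exact cell).

    xs: sequence of inputs, shape (T, d_in); W_*: projections, shape (d_hidden, d_in).
    """
    d_hidden = W_z.shape[0]
    c = np.zeros(d_hidden)   # cell state
    n = np.zeros(d_hidden)   # normalizer state keeps the exponential gate bounded
    hs = []
    for x in xs:
        z = np.tanh(W_z @ x)                      # candidate cell input
        i = np.exp(W_i @ x)                       # exponential input gate (can exceed 1)
        f = 1.0 / (1.0 + np.exp(-(W_f @ x)))      # sigmoid forget gate
        o = 1.0 / (1.0 + np.exp(-(W_o @ x)))      # sigmoid output gate
        c = f * c + i * z                         # gated cell update
        n = f * n + i                             # track total gate mass for normalization
        hs.append(o * (c / np.maximum(n, 1e-6)))  # normalized hidden state
    return np.stack(hs)

# Usage on a random toy sequence
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 8, 16
Ws = [rng.normal(scale=0.1, size=(d_h, d_in)) for _ in range(4)]
print(exp_gated_cell(rng.normal(size=(T, d_in)), *Ws).shape)  # (5, 16)
```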

Claus (@ClausHofm)'s Twitter Profile Photo

Really excited to announce Energy-based Hopfield Boosting. 🎉

Hopfield Boosting advances the state-of-the-art in OOD detection. 🚀
Our new energy function allows a model to focus on hard training examples close to the decision boundary between in-distribution and OOD samples. 🧵
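As rough intuition (not the paper's exact formulation), a modern-Hopfield-style energy scores a query by how strongly it is attracted to a set of stored patterns, and an OOD score can contrast the energy with respect to in-distribution patterns against the energy with respect to auxiliary outlier patterns. A hedged NumPy sketch with illustrative names:

```python
import numpy as np
from scipy.special import logsumexp

def hopfield_energy(query, patterns, beta=1.0):
    """Modern-Hopfield-style energy of a query w.r.t. stored patterns (lower = closer)."""
    # -lse(beta, X q) is dominated by the best-matching stored pattern
    return -logsumexp(beta * patterns @ query) / beta + 0.5 * query @ query

def ood_score(query, id_patterns, aux_patterns, beta=1.0):
    """Higher score = more out-of-distribution (illustrative contrast of two energies)."""
    return hopfield_energy(query, id_patterns, beta) - hopfield_energy(query, aux_patterns, beta)
```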

Valeriy M., PhD, MBA, CQF (@predict_addict)'s Twitter Profile Photo

'xLSTM: Extended Long Short-Term Memory' by Maximilian Beck, Korbinian Poeppel, Markus Spanring, Andreas Auer, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter and co-authors is signaling a potential shift in the landscape of natural language processing technologies.

As

Tri Dao (@tri_dao)'s Twitter Profile Photo

Glad to see folks studying this problem. The paper suggests that (their impl. of) FlashAttention has higher numerical error than baseline. However, FlashAttention in BF16/FP16 will often have *lower* error than baseline if the impl. does internal softmax rescaling in FP32
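The point about accumulation precision is easy to check in isolation: computing the softmax and matmul in FP32 while keeping inputs in FP16 typically lands closer to an FP64 reference than a fully-FP16 computation. A small NumPy sketch (illustrative, not the FlashAttention kernel itself):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(64, 64)).astype(np.float16) * 8   # attention logits
values = rng.normal(size=(64, 64)).astype(np.float16)

def softmax_matmul(s, v, dtype):
    """Softmax over the last axis followed by a matmul, all in the given dtype."""
    s = s.astype(dtype)
    p = np.exp(s - s.max(axis=-1, keepdims=True))   # numerically stable softmax
    p = p / p.sum(axis=-1, keepdims=True)
    return p @ v.astype(dtype)

ref = softmax_matmul(scores, values, np.float64)     # high-precision reference
err_fp16 = np.abs(softmax_matmul(scores, values, np.float16) - ref).max()
err_fp32 = np.abs(softmax_matmul(scores, values, np.float32) - ref).max()
print(err_fp16, err_fp32)   # the FP32 computation is usually much closer to the reference
```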

RWKV (@RWKV_AI)'s Twitter Profile Photo

🦅 Eagle & 🐦 Finch

The RWKV v5 and v6 architecture paper is here
arxiv.org/abs/2404.05892

Both improve over RWKV-4, scaled up to 7.5B and 3.1B parameter multilingual models respectively

Open-source code, weights, and dataset
Apache 2 licensed, under the Linux Foundation

Soham De (@sohamde_)'s Twitter Profile Photo

Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params!

arxiv.org/abs/2402.19427

My co-authors have already posted about our amazing results, so here's a 🧵on how we got there!
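Structurally, "mixing RNN layers with Local Attention" means the temporal-mixing sublayer alternates between a recurrence and a windowed attention as you go up the stack. A schematic sketch of that interleaving; the block names and the 2:1 pattern here are illustrative rather than the exact Griffin recipe:

```python
# Schematic only: how a hybrid stack can interleave recurrent and local-attention
# temporal-mixing layers. The block names are placeholders.
def build_hybrid_stack(depth, recurrences_per_attention=2):
    layers = []
    for i in range(depth):
        if (i + 1) % (recurrences_per_attention + 1) == 0:
            layers.append("LocalAttentionBlock")   # attends only within a sliding window
        else:
            layers.append("RecurrentBlock")        # gated recurrence over the sequence
        layers.append("MLPBlock")                  # channel mixing after each temporal-mixing layer
    return layers

print(build_hybrid_stack(6))  # R, R, A, R, R, A pattern (plus MLPs)
```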

Samuel L Smith (@SamuelMLSmith)'s Twitter Profile Photo

Announcing RecurrentGemma!
github.com/google-deepmin…

- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences
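For intuition on the higher sampling throughput: a gated linear recurrence keeps the update linear in the hidden state with an input-dependent decay, so generation carries a constant-size state instead of a growing KV cache. A minimal NumPy sketch of such a recurrence, not the exact RG-LRU used in Griffin/RecurrentGemma:

```python
import numpy as np

def gated_linear_recurrence(xs, W_a, W_x):
    """h_t = a_t * h_{t-1} + (1 - a_t) * (W_x x_t), with a_t in (0, 1) from a sigmoid gate."""
    h = np.zeros(W_x.shape[0])
    out = []
    for x in xs:
        a = 1.0 / (1.0 + np.exp(-(W_a @ x)))   # input-dependent decay gate
        h = a * h + (1.0 - a) * (W_x @ x)      # linear in h: no softmax over past tokens
        out.append(h)
    return np.stack(out)
```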

Jonathan Frankle (@jefrankle)'s Twitter Profile Photo

Meet DBRX, a new SOTA open LLM from Databricks. It's a 132B MoE with 36B active params trained from scratch on 12T tokens. It sets a new bar on all the standard benchmarks, and - as an MoE - inference is blazingly fast. Simply put, it's the model your data has been waiting for.
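The 132B-total vs 36B-active split is what mixture-of-experts routing buys you: every token runs through only the top-k experts, so per-token compute scales with active parameters rather than total parameters. A toy top-k router in NumPy; the expert count and shapes are illustrative, not DBRX's exact configuration:

```python
import numpy as np

def moe_layer(x, router_W, experts, k=4):
    """Route a token through the top-k of len(experts) experts; experts are (W, b) pairs."""
    logits = router_W @ x
    top = np.argsort(logits)[-k:]                  # indices of the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                    # softmax over the selected experts only
    return sum(g * (W @ x + b) for g, (W, b) in zip(gates, (experts[i] for i in top)))

# Only k experts' parameters are touched per token -> "active" params << total params.
```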

Sakana AI (@SakanaAILabs)'s Twitter Profile Photo

Introducing Evolutionary Model Merge: A new approach bringing us closer to automating foundation model development. We use evolution to find great ways of combining open-source models, building new powerful foundation models with user-specified abilities!

sakana.ai/evolutionary-m…
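In the simplest "merging in parameter space" setting, the thing being evolved is a set of mixing coefficients over the source models' weights, scored by a downstream metric. A heavily simplified sketch of that loop; the fitness function and the whole-model granularity are placeholders, not Sakana's actual method (which also merges in data-flow space):

```python
import numpy as np

def merge(models, weights):
    """Weighted average of parameter dicts; `models` is a list of {name: ndarray}."""
    w = np.asarray(weights) / np.sum(weights)
    return {k: sum(wi * m[k] for wi, m in zip(w, models)) for k in models[0]}

def evolve_merge(models, fitness, generations=50, pop=16, rng=np.random.default_rng(0)):
    """Toy hill-climbing evolution over non-negative mixing coefficients."""
    best_w, best_fit = np.ones(len(models)), -np.inf
    for _ in range(generations):
        candidates = np.abs(best_w + rng.normal(scale=0.1, size=(pop, len(models))))
        for w in candidates:
            f = fitness(merge(models, w))        # e.g. accuracy on a small validation set
            if f > best_fit:
                best_w, best_fit = w, f
    return merge(models, best_w), best_w
```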

Günter Klambauer (@gklambauer)'s Twitter Profile Photo

GNN-VPA: A Variance-Preserving Aggregation Strategy for Graph Neural Networks

A simple change in the aggregation function helps GNNs (a) improve signal propagation in message passing and (b) at the same time keep maximal 1-WL expressive power!

Paper: arxiv.org/abs/2403.04747
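The intuition behind variance-preserving aggregation: summing n i.i.d. messages scales the variance by n, averaging scales it by 1/n, while dividing the sum by sqrt(n) keeps the variance roughly constant regardless of node degree, so signals neither blow up nor wash out as they propagate. A minimal sketch of the three aggregators (illustrative; see the paper for the exact formulation):

```python
import numpy as np

def aggregate(messages, mode):
    """messages: (n_neighbors, d) array of incoming messages for one node."""
    n = len(messages)
    if mode == "sum":   return messages.sum(axis=0)                # variance grows ~ n
    if mode == "mean":  return messages.mean(axis=0)               # variance shrinks ~ 1/n
    if mode == "vpa":   return messages.sum(axis=0) / np.sqrt(n)   # variance stays ~ constant

rng = np.random.default_rng(0)
for n in (4, 64, 1024):
    msgs = rng.normal(size=(n, 256))
    print(n, {m: float(np.var(aggregate(msgs, m))) for m in ("sum", "mean", "vpa")})
```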

Tianle Cai (@tianle_cai)'s Twitter Profile Photo

Everyone is talking about the incredible LLM inference speed that Groq Inc chips achieve, but few notice its cost, especially if you want to replace your GPU/TPU stack with it. Long story short, it requires hundreds of chips to serve a single LLM since each chip only has a ~200MB
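The back-of-the-envelope math behind that claim: with only a couple hundred megabytes of on-chip memory per chip and no large off-chip DRAM, the whole model has to be spread across chips. A rough, hedged calculation (the model size is an assumption for illustration, and this ignores activations, KV cache, and any duplication):

```python
# Rough estimate: chips needed just to hold the weights on-chip.
params = 70e9            # e.g. a 70B-parameter model (assumption)
bytes_per_param = 2      # FP16/BF16 weights
sram_per_chip = 200e6    # ~200 MB of on-chip memory per chip, as the tweet notes

chips = params * bytes_per_param / sram_per_chip
print(f"~{chips:.0f} chips")   # ~700 chips for the weights alone under these assumptions
```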
