Julien Siems (@julien_siems)'s Twitter Profile
Julien Siems

@julien_siems

PhD student advised by Frank Hutter working on in-context learning. Previously Machine Learning Researcher @ Merantix Momentum, Research Intern @ AWS.

ID: 1552026847160008704

Link: https://juliensiems.github.io | Joined: 26-07-2022 20:23:46

71 Tweets

267 Followers

628 Following

Maximilian Beck (@maxmbeck)'s Twitter Profile Photo

📢🔔I am excited to share the details on our optimized xLSTM architecture for our xLSTM 7B model!🚨

We optimized the architecture with two goals in mind:

- Efficiency (in Training and Inference)
- Stability

🧵(1/7)
Maximilian Beck (@maxmbeck)'s Twitter Profile Photo

Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper🧑‍🔧

We introduce

⚡️Tiled Flash Linear Attention (TFLA), ⚡️

A new kernel algorithm for the mLSTM and other Linear Attention variants with Gating.

We find TFLA is really fast!

🧵(1/11)
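For background: the mLSTM and related gated linear-attention layers keep a matrix-valued recurrent state that is updated with a gated rank-1 outer product per token. Below is a minimal sequential reference sketch of that recurrence, with an illustrative scalar-gate parameterization rather than the exact mLSTM formulation; a kernel like TFLA computes the same outputs chunkwise over tiles instead of running this loop.

```python
import torch

def gated_linear_attention(q, k, v, f, i):
    """Sequential reference for a gated linear-attention recurrence.

    q, k, v: (T, d) query/key/value streams; f, i: (T,) forget/input gates.
    A fused tiled kernel (e.g. TFLA) would produce the same outputs without
    materializing this step-by-step loop.
    """
    T, d = q.shape
    C = q.new_zeros(d, d)                              # matrix-valued recurrent state
    out = q.new_zeros(T, d)
    for t in range(T):
        C = f[t] * C + i[t] * torch.outer(v[t], k[t])  # gated rank-1 state update
        out[t] = C @ q[t]                              # read the state out with the query
    return out
```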
Riccardo Grazzi @ ICLR 2025 (@riccardograzzi)'s Twitter Profile Photo

Julien Siems leloy! Jyo Pari In our DeltaProduct work we also add a bit of theory to DeltaNet, showing that it can solve dihedral groups, which are the groups of symmetries of regular polygons, with only two layers. This includes S3 (symmetries of the equilateral triangle).

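Some background on why such state tracking is possible: in the delta-rule view, DeltaNet's per-token state transition is a generalized Householder matrix. A minimal sketch of one common form of the update (conventions vary across papers; names here are illustrative):

```python
import torch

def deltanet_step(S, k, v, beta):
    """One delta-rule update: S <- S @ (I - beta * k k^T) + beta * v k^T.

    The transition matrix (I - beta * k k^T) is a generalized Householder
    transformation; for beta = 2 and a unit-norm k it is an exact reflection.
    """
    return S - beta * torch.outer(S @ k, k) + beta * torch.outer(v, k)
```

Since dihedral groups are generated by reflections (a product of two reflections is a rotation), composing such transitions across tokens and layers is what lets the model represent dihedral-group elements.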
Xiaolong Wang (@xiaolonw)'s Twitter Profile Photo

Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated
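The tweet is cut off here, but the core idea is that the recurrent state is itself the weights of a small learner, updated at test time by gradient descent on a self-supervised loss. A toy sketch of that idea, assuming a linear inner model and a reconstruction loss (not the paper's exact parameterization):

```python
import torch

def ttt_step(W, x, lr=0.1):
    """One test-time-training step: the hidden 'state' is the weight matrix W
    of a tiny linear model, updated by a gradient step on a self-supervised
    reconstruction loss before producing the output for this token."""
    pred = W @ x
    grad = torch.outer(pred - x, x)  # gradient of 0.5 * ||W x - x||^2 w.r.t. W
    W = W - lr * grad                # updating the state = learning at test time
    return W, W @ x                  # new state and this token's output
```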

BlinkDL (@blinkdl_ai)'s Twitter Profile Photo

RWKV7-G1 "GooseOne" 🪿 1.5B release: pure RNN (attention-free) reasoning model, comparable with Qwen3 1.7B and fully multilingual. Chat demo & download on RWKV.com Larger G1 training in progress.
Songlin Yang (@songlinyang4)'s Twitter Profile Photo

📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381

Ali Behrouz (@behrouz_ali)'s Twitter Profile Photo

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
Julien Siems (@julien_siems)'s Twitter Profile Photo

⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank (see the sketch below) sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet.
- Improved scaling analysis
And more!
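As background on the second bullet, "effective rank" is commonly measured as the exponential of the entropy of a matrix's normalized singular values; a sketch of that standard definition (the paper's exact measure may differ):

```python
import torch

def effective_rank(S, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), where p is the distribution
    of normalized singular values of the state matrix S."""
    sv = torch.linalg.svdvals(S)
    p = sv / (sv.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)
```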
Riccardo Grazzi @ ICLR 2025 (@riccardograzzi)'s Twitter Profile Photo

📖 (1/n) DeltaProduct's theory got an update!

1) For any nₕ>1 (# of Householders), only 3 layers are needed to solve all group word problems (including S5). DeltaNet and RWKV-7 use 4.
2) For any nₕ, Gated DeltaProduct can recognize any regular language
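For context on nₕ: DeltaProduct takes nₕ delta-rule micro-steps per token, so each token's state transition is a product of nₕ generalized Householder matrices. A minimal sketch (names illustrative):

```python
import torch

def deltaproduct_step(S, ks, vs, betas):
    """DeltaProduct token update: n_h delta-rule micro-steps, making the
    state transition a product of n_h generalized Householder matrices.
    With a single (k, v, beta) triple this reduces to DeltaNet's update."""
    for k, v, beta in zip(ks, vs, betas):  # n_h (key, value, beta) triples
        S = S - beta * torch.outer(S @ k, k) + beta * torch.outer(v, k)
    return S
```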