Julien Siems (@julien_siems)'s Twitter Profile
Julien Siems

@julien_siems

PhD student advised by Frank Hutter working on in-context learning. Previously Machine Learning Researcher @ Merantix Momentum, Research Intern @ AWS.

ID: 1552026847160008704

Link: https://juliensiems.github.io | Joined: 26-07-2022 20:23:46

71 Tweets

267 Followers

628 Following

Maximilian Beck (@maxmbeck)'s Twitter Profile Photo

📢🔔I am excited to share the details on our optimized xLSTM architecture for our xLSTM 7B model!🚨

We optimized the architecture with two goals in mind:

- Efficiency (in Training and Inference)
- Stability

🧵(1/7)
Maximilian Beck (@maxmbeck)'s Twitter Profile Photo

Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper🧑‍🔧

We introduce

⚡️Tiled Flash Linear Attention (TFLA), ⚡️

A new kernel algorithm for the mLSTM and other Linear Attention variants with Gating.

We find TFLA is really fast!

🧵(1/11)
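For background: the mLSTM and related gated linear-attention layers keep a matrix-valued recurrent state that is updated with a gated rank-1 outer product per token. Below is a minimal sequential reference sketch of that recurrence, with an illustrative scalar-gate parameterization rather than the exact mLSTM formulation; a kernel like TFLA computes the same outputs chunkwise over tiles instead of running this loop.

```python
import torch

def gated_linear_attention(q, k, v, f, i):
    """Sequential reference for a gated linear-attention recurrence.

    q, k, v: (T, d) query/key/value streams; f, i: (T,) forget/input gates.
    A fused tiled kernel (e.g. TFLA) would produce the same outputs without
    materializing this step-by-step loop.
    """
    T, d = q.shape
    C = q.new_zeros(d, d)                              # matrix-valued recurrent state
    out = q.new_zeros(T, d)
    for t in range(T):
        C = f[t] * C + i[t] * torch.outer(v[t], k[t])  # gated rank-1 state update
        out[t] = C @ q[t]                              # read the state out with the query
    return out
```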
Riccardo Grazzi @ ICLR 2025 (@riccardograzzi)'s Twitter Profile Photo

Julien Siems leloy! Jyo Pari In our DeltaProduct work we also add a bit of theory to DeltaNet, showing that it can solve dihedral groups, which are the groups of symmetries of regular polygons, with only two layers. This includes S3 (symmetries of the equilateral triangle).

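Some background on why such state tracking is possible: in the delta-rule view, DeltaNet's per-token state transition is a generalized Householder matrix. A minimal sketch of one common form of the update (conventions vary across papers; names here are illustrative):

```python
import torch

def deltanet_step(S, k, v, beta):
    """One delta-rule update: S <- S @ (I - beta * k k^T) + beta * v k^T.

    The transition matrix (I - beta * k k^T) is a generalized Householder
    transformation; for beta = 2 and a unit-norm k it is an exact reflection.
    """
    return S - beta * torch.outer(S @ k, k) + beta * torch.outer(v, k)
```

Since dihedral groups are generated by reflections (a product of two reflections is a rotation), composing such transitions across tokens and layers is what lets the model represent dihedral-group elements.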
Xiaolong Wang (@xiaolonw)'s Twitter Profile Photo

Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated
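The tweet is cut off here, but the core idea is that the recurrent state is itself the weights of a small learner, updated at test time by gradient descent on a self-supervised loss. A toy sketch of that idea, assuming a linear inner model and a reconstruction loss (not the paper's exact parameterization):

```python
import torch

def ttt_step(W, x, lr=0.1):
    """One test-time-training step: the hidden 'state' is the weight matrix W
    of a tiny linear model, updated by a gradient step on a self-supervised
    reconstruction loss before producing the output for this token."""
    pred = W @ x
    grad = torch.outer(pred - x, x)  # gradient of 0.5 * ||W x - x||^2 w.r.t. W
    W = W - lr * grad                # updating the state = learning at test time
    return W, W @ x                  # new state and this token's output
```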

BlinkDL (@blinkdl_ai)'s Twitter Profile Photo

RWKV7-G1 "GooseOne" 🪿 1.5B release: pure RNN (attention-free) reasoning model, comparable with Qwen3 1.7B and fully multilingual. Chat demo & download on RWKV.com Larger G1 training in progress.
Songlin Yang (@songlinyang4)'s Twitter Profile Photo

📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381

Ali Behrouz (@behrouz_ali)'s Twitter Profile Photo

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
Julien Siems (@julien_siems)'s Twitter Profile Photo

⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank (see the sketch below) sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet.
- Improved scaling analysis
And more!
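As background on the second bullet, "effective rank" is commonly measured as the exponential of the entropy of a matrix's normalized singular values; a sketch of that standard definition (the paper's exact measure may differ):

```python
import torch

def effective_rank(S, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), where p is the distribution
    of normalized singular values of the state matrix S."""
    sv = torch.linalg.svdvals(S)
    p = sv / (sv.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)
```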
Riccardo Grazzi @ ICLR 2025 (@riccardograzzi)'s Twitter Profile Photo

📖 (1/n) DeltaProduct's theory got an update!

1) For any nₕ>1 (# of Householders), only 3 layers are needed to solve all group word problems (including S5). DeltaNet and RWKV-7 use 4.
2) For any nₕ, Gated DeltaProduct can recognize any regular language
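For context on nₕ: DeltaProduct takes nₕ delta-rule micro-steps per token, so each token's state transition is a product of nₕ generalized Householder matrices. A minimal sketch (names illustrative):

```python
import torch

def deltaproduct_step(S, ks, vs, betas):
    """DeltaProduct token update: n_h delta-rule micro-steps, making the
    state transition a product of n_h generalized Householder matrices.
    With a single (k, v, beta) triple this reduces to DeltaNet's update."""
    for k, v, beta in zip(ks, vs, betas):  # n_h (key, value, beta) triples
        S = S - beta * torch.outer(S @ k, k) + beta * torch.outer(v, k)
    return S
```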