Ali Behrouz (@behrouz_ali)'s Twitter Profile
Ali Behrouz

@behrouz_ali

Intern @Google, Ph.D. Student @Cornell_CS, interested in machine learning.

ID: 1611553104532762624

Link: https://abehrouz.github.io/ · Joined: 07-01-2023 02:39:47

111 Tweets

4.4K Followers

1.1K Following

Ali Behrouz (@behrouz_ali):

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
TuringPost (@theturingpost):

Last week, Google dropped a paper on ATLAS, a new architecture that reimagines how models learn and use memory.

Unfortunately, it flew under everyone’s radar - but it shouldn’t have! So what's Atlas bringing to the table?

▪️ Active memory via Google’s so-called Omega rule. It
leloy! (@leloykun):

Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration

Hi all, I'm bacc. I have a lot to talk about, but let's start with this fun side-project.

Here I'll talk about novel (?) ways to compute:
1. Spectral Clipping (discussed in Rohan's
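
Below is a minimal NumPy sketch of the ingredients, assuming the classic cubic Newton-Schulz variant; it is not the thread's implementation, which focuses on making this fast, numerically stable, and auto-differentiable without an explicit SVD.

```python
import numpy as np

def newton_schulz_polar(W, steps=20):
    """Approximate the orthogonal polar factor U @ V.T of W = U S V.T with the
    cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X. Converges when the
    singular values of the initial X lie in (0, sqrt(3)), so we pre-scale by the
    Frobenius norm."""
    X = W / np.linalg.norm(W)                  # all singular values now <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X                                   # ~ msign(W) = U V^T

def spectral_clip_svd(W, beta):
    """Reference implementation: cap every singular value of W at beta via SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(S, beta)) @ Vt

# quick sanity check
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
UVt = newton_schulz_polar(W)
print(np.allclose(UVt.T @ UVt, np.eye(32), atol=1e-3))             # columns ~orthonormal
print(np.linalg.norm(spectral_clip_svd(W, 1.0), 2) <= 1.0 + 1e-6)  # spectral norm capped at 1
```
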
Yingheng Wang (@yingheng_wang):

❓ Are LLMs actually problem solvers or just good at regurgitating facts?

🚨New Benchmark Alert! We built HeuriGym to benchmark if LLMs can craft real heuristics for real-world hard combinatorial optimization problems.

🛞 We’re open-sourcing it all:
✅ 9 problems
✅ Iterative
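
For a concrete sense of the kind of artifact such a benchmark would score, here is a hypothetical example (not from HeuriGym itself): an LLM-proposed greedy heuristic for set cover, plus a tiny harness that checks feasibility and reports solution cost.

```python
from typing import FrozenSet, List, Set

def greedy_set_cover(universe: Set[int], subsets: List[FrozenSet[int]]) -> List[int]:
    """A heuristic of the kind an LLM might propose: repeatedly pick the subset
    that covers the most still-uncovered elements (illustrative example)."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("instance is infeasible")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# toy harness: feasibility check + score (number of subsets used, lower is better)
universe = set(range(10))
subsets = [frozenset({0, 1, 2, 3}), frozenset({3, 4, 5}),
           frozenset({5, 6, 7, 8, 9}), frozenset({0, 9})]
solution = greedy_set_cover(universe, subsets)
assert set().union(*(subsets[i] for i in solution)) == universe
print("subsets used:", len(solution))
```
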
Reza Bayat (@reza_byt):

📄 New Paper Alert! ✨

🚀Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput

Across 135M–1.7B params, MoR carves a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2x throughput.
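
As a hedged sketch of the idea suggested by the name (one shared block applied recursively, with a router assigning each token its own recursion depth), here is a toy PyTorch module; the class and hard routing scheme are illustrative, not the paper's architecture, and a real implementation would route for efficiency instead of computing the block for every token at every step.

```python
import torch
import torch.nn as nn

class MixtureOfRecursions(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): a single shared
    transformer block is applied recursively, and a tiny router decides, per token,
    how many recursion steps that token receives."""
    def __init__(self, d_model=256, n_heads=4, max_recursions=3):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.router = nn.Linear(d_model, max_recursions)  # scores for depths 1..R
        self.max_recursions = max_recursions

    def forward(self, x):                           # x: (batch, seq, d_model)
        # hard argmax routing for clarity (non-differentiable as written)
        depth = self.router(x).argmax(dim=-1) + 1   # (batch, seq), values in 1..R
        h = x
        for step in range(1, self.max_recursions + 1):
            updated = self.shared_block(h)
            active = (depth >= step).unsqueeze(-1)  # tokens still recursing
            h = torch.where(active, updated, h)     # inactive tokens keep their state
        return h

x = torch.randn(2, 16, 256)
print(MixtureOfRecursions()(x).shape)   # torch.Size([2, 16, 256])
```
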
Vahab Mirrokni (@mirrokni):

Proud to announce an official Gold Medal at #IMO2025🥇

The IMO committee has certified the result from our general-purpose Gemini system—a landmark moment for our team and for the future of AI reasoning.

deepmind.google/discover/blog/… (1/n) Highlights in thread:

Ali Behrouz (@behrouz_ali):

Everyone is talking about reviewers who don't engage or provide low-quality reviews. While harmful, I don't see that as the biggest threat to the peer review system. As both an author and reviewer, I'm seeing zero-sum debates where a reviewer puts their full effort into rejecting

Gabriel Mongaras (@gmongaras):

Threw a paper I've been working on onto ArXiv. Trying to get a little closer to understanding why softmax in attention works so well compared to other activation functions. arxiv.org/abs/2507.23632
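
To make the question concrete, here is a small sketch (not the paper's setup or analysis) that swaps the activation applied to the attention scores; softmax normalizes each row into a distribution, while alternatives like ReLU or sigmoid do not.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, activation="softmax"):
    """Single-head attention with a swappable activation on the score matrix.
    Illustrates the question being studied, not the paper's analysis."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (..., seq, seq)
    if activation == "softmax":
        weights = F.softmax(scores, dim=-1)                 # rows sum to 1
    elif activation == "relu":
        weights = F.relu(scores)                            # unnormalized, can be all-zero
    elif activation == "sigmoid":
        weights = torch.sigmoid(scores)                     # bounded, rows don't sum to 1
    else:
        raise ValueError(activation)
    return weights @ v

q = k = v = torch.randn(1, 8, 32)
for act in ("softmax", "relu", "sigmoid"):
    print(act, attention(q, k, v, act).shape)
```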

DeepLearning.AI (@deeplearningai):

Google researchers introduced ATLAS, a transformer-like language model architecture. ATLAS replaces attention with a trainable memory module and processes inputs up to 10 million tokens. 

The team trained a 1.3 billion-parameter model on FineWeb, updating only the memory module
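
As a hedged sketch of the general pattern behind such trainable memory modules (in the spirit of Titans/ATLAS, not the paper's Omega rule or exact parameterization), here is a toy memory: an MLP updated at test time by gradient steps on a key-to-value reconstruction loss.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Hedged sketch of a test-time-trained memory: an MLP M is updated online so
    that M(key_t) ~ value_t. Simplified illustration, not the paper's update rule."""
    def __init__(self, d=64, lr=0.1):
        super().__init__()
        self.memory = nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))
        self.lr = lr

    def write(self, key, value):
        # one gradient step on the reconstruction loss ||M(key) - value||^2
        loss = ((self.memory(key) - value) ** 2).mean()
        grads = torch.autograd.grad(loss, list(self.memory.parameters()))
        with torch.no_grad():
            for p, g in zip(self.memory.parameters(), grads):
                p -= self.lr * g

    def read(self, query):
        with torch.no_grad():
            return self.memory(query)

mem = NeuralMemory(lr=0.5)
k, v = torch.randn(4, 64), torch.randn(4, 64)
print("before:", ((mem.read(k) - v) ** 2).mean().item())
for _ in range(200):
    mem.write(k, v)                       # memorize the key->value associations in-weights
print("after: ", ((mem.read(k) - v) ** 2).mean().item())   # reconstruction error shrinks
```
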
Hamed Mahdavi (@hamedmahdavi93):

This paper from Penn State researchers literally blew my mind🤯🤯 Just kidding, I am excited to share our work on model merging! We leveraged the connection between Adam optimizer's second moments and curvature information /
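
A hedged sketch of what curvature-weighted merging can look like, using Adam's second-moment estimates as a per-parameter curvature proxy; the function and weighting below are illustrative, not the paper's estimator.

```python
import torch

def curvature_weighted_merge(params, second_moments, eps=1e-8):
    """Hedged sketch of curvature-aware model merging: average each parameter across
    models, weighting by a curvature proxy taken from Adam's second-moment estimates.

    params:         list of dicts  name -> tensor        (one dict per model)
    second_moments: list of dicts  name -> tensor        (Adam exp_avg_sq per model)
    """
    merged = {}
    for name in params[0]:
        weights = [v[name].sqrt() + eps for v in second_moments]   # curvature proxies
        total = sum(weights)
        merged[name] = sum(w * p[name] for w, p in zip(weights, params)) / total
    return merged

# toy usage with two hypothetical fine-tuned models
p1 = {"w": torch.tensor([1.0, 2.0])}
p2 = {"w": torch.tensor([3.0, 0.0])}
v1 = {"w": torch.tensor([4.0, 0.01])}   # model 1 has high curvature on w[0]
v2 = {"w": torch.tensor([0.01, 4.0])}   # model 2 has high curvature on w[1]
print(curvature_weighted_merge([p1, p2], [v1, v2]))
# each coordinate is dominated by the model with higher curvature there
```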

Yunhao Fang (@fangyunhao_x):

Transformers hit the memory wall, RNNs hit the forgetting wall.

TLDR: We introduce Artificial Hippocampus Networks (AHNs), a lightweight add-on (<0.5% extra parameters) that compresses infinite context into a fixed-size memory for efficient long-context modeling.
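
Here is a hedged sketch of the general pattern such add-ons follow (exact attention over a recent window plus a small recurrent state that absorbs evicted tokens); the module below is an illustration, not the paper's AHN.

```python
import torch
import torch.nn as nn

class CompressiveMemoryBlock(nn.Module):
    """Hedged sketch of the pattern AHN-style add-ons follow (not the paper's module):
    exact attention over a recent sliding window, plus a small GRU that compresses
    tokens evicted from the window into a fixed-size memory vector."""
    def __init__(self, d=128, window=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.compressor = nn.GRUCell(d, d)           # lightweight add-on memory
        self.window = window

    def forward(self, x):                             # x: (batch, seq, d)
        B, T, D = x.shape
        memory = torch.zeros(B, D, device=x.device)   # fixed-size long-term state
        # compress every token that has already slid out of the attention window
        for t in range(max(0, T - self.window)):
            memory = self.compressor(x[:, t], memory)
        recent = x[:, -self.window:]                  # exact attention over the window
        ctx = torch.cat([memory.unsqueeze(1), recent], dim=1)
        out, _ = self.attn(recent, ctx, ctx)          # queries attend to memory + window
        return out

x = torch.randn(2, 256, 128)
print(CompressiveMemoryBlock()(x).shape)              # torch.Size([2, 64, 128])
```
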
himanshu dubey (@himanshustwts):

Author of Titans and Atlas from DeepMind is the upcoming guest on Ground Zero pod!

it was quite a chat w him on the recent progress with Titans, Atlas and their real adoption.
Tilde (@tilderesearch):

Modern optimizers can struggle with unstable training. Building off of Manifold Muon, we explore more lenient mechanisms for constraining the geometry of a neural network's weights directly through their Gram matrix 🧠 A 🧵… ~1/6~
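
One lenient way to constrain weight geometry through the Gram matrix is a soft penalty pulling WᵀW toward a target (identity below, i.e. soft orthogonality); this is an illustrative sketch, not Tilde's actual mechanism.

```python
import torch

def gram_penalty(W, target=None):
    """Hedged sketch: penalize the deviation of the Gram matrix W^T W from a target
    (identity here, i.e. a soft orthogonality constraint on the weights)."""
    gram = W.T @ W
    if target is None:
        target = torch.eye(W.shape[1], device=W.device)
    return ((gram - target) ** 2).mean()

# in practice this penalty would be added to the task loss; shown alone for illustration
W = torch.nn.Parameter(torch.randn(128, 64) / 128 ** 0.5)
opt = torch.optim.SGD([W], lr=0.1)
for step in range(200):
    loss = gram_penalty(W)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(gram_penalty(W).item())   # deviation of W^T W from identity shrinks toward 0
```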