Konrad Staniszewski (@cstankonrad)'s Twitter Profile
Konrad Staniszewski

@cstankonrad

PhD student at @UniWarszawski. Working on long-context LLMs.

ID: 1640064109096861699

Joined: 26-03-2023 18:54:46

28 Tweets

109 Followers

60 Following

Szymon Antoniak (@simontwice2)

✨ Introducing 🍹 Mixture of Tokens 🍹, a stable alternative to existing Mixture of Experts techniques for LLMs, providing significantly more stable training 📈. You can check out our initial results at llm-random.github.io/posts/mixture_… 🧵 (1/n)
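
The tweet does not spell out the mechanism. As a loose, hypothetical sketch of the general idea the name suggests (each expert processes a learned soft mixture of a group of tokens, instead of tokens being routed discretely), here is a minimal module; the shapes, names, and redistribution step are assumptions rather than the authors' implementation.

```python
# Hypothetical "soft token mixing" sketch; not the implementation from llm-random.
import torch
import torch.nn as nn

class MixtureOfTokensSketch(nn.Module):
    def __init__(self, d_model=512, n_experts=4, d_ff=2048):
        super().__init__()
        self.controller = nn.Linear(d_model, n_experts)   # per-token mixing weights
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, group):                              # group: [group_size, d_model]
        w = torch.softmax(self.controller(group), dim=0)   # normalize over the group, per expert
        mixed = w.T @ group                                 # [n_experts, d_model]: one mixed token per expert
        processed = torch.stack([expert(mixed[i]) for i, expert in enumerate(self.experts)])
        return w @ processed                                # redistribute expert outputs back to the tokens

out = MixtureOfTokensSketch()(torch.randn(8, 512))
print(out.shape)  # torch.Size([8, 512])
```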

Sebastian Jaszczur (@s_jaszczur)

Introducing 🔥 MoE-Mamba 🔥, combining two exciting LLM techniques, Mixture of Experts and State Space Models. It matches Mamba’s performance in 2.2x fewer training steps 🚀. More in this short thread 👇
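
The thread gives the headline result rather than the layout. Below is a hypothetical sketch of one way to combine the two techniques, alternating an SSM-style block with an MoE feed-forward layer; the Mamba block is a stand-in placeholder and the MoE layer is a generic top-1 router, not the authors' design.

```python
# Structural sketch only: a real Mamba block is a selective state-space layer,
# and the MoE layer here is a generic "switch"-style top-1 routed feed-forward.
import torch
import torch.nn as nn

class MambaBlockPlaceholder(nn.Module):
    """Stand-in for a real Mamba (selective SSM) block."""
    def __init__(self, d_model):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: [batch, seq, d_model]
        return x + self.mix(x)

class SwitchMoE(nn.Module):
    """Generic top-1 routed feed-forward layer (one expert per token)."""
    def __init__(self, d_model, d_ff=2048, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):
        gates = torch.softmax(self.router(x), dim=-1)        # [batch, seq, n_experts]
        top_gate, top_idx = gates.max(dim=-1)                 # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return x + out

def moe_mamba_stack(d_model=512, n_pairs=4):
    """Alternate SSM-style blocks with MoE feed-forward blocks."""
    layers = []
    for _ in range(n_pairs):
        layers += [MambaBlockPlaceholder(d_model), SwitchMoE(d_model)]
    return nn.Sequential(*layers)

x = torch.randn(2, 16, 512)
print(moe_mamba_stack()(x).shape)  # torch.Size([2, 16, 512])
```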

Bartłomiej Cupiał (@cupiabart)

🚀Excited to share our latest work on fine-tuning RL models! By integrating fine-tuning with knowledge retention methods, we've achieved SOTA🔥in NetHack🎮, with scores surpassing 10K points, doubling the previous record. A detailed thread coming soon! ✨
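
The detailed thread is promised later; as a generic, hypothetical illustration of what a "knowledge retention" term can look like during fine-tuning (an auxiliary KL penalty pulling the fine-tuned policy toward the pretrained one), here is a minimal sketch. It is not necessarily the method used in this work.

```python
# Generic illustration only, not the paper's recipe: penalize divergence of the
# fine-tuned policy from the pretrained policy to retain pretrained knowledge.
import torch
import torch.nn.functional as F

def retention_kl(new_logits: torch.Tensor, pretrained_logits: torch.Tensor) -> torch.Tensor:
    """KL(pretrained || fine-tuned) over the action distribution, averaged over the batch."""
    return F.kl_div(
        F.log_softmax(new_logits, dim=-1),         # log-probs of the fine-tuned policy
        F.log_softmax(pretrained_logits, dim=-1),  # target given as log-probs
        log_target=True,
        reduction="batchmean",
    )

# total_loss = rl_loss + retention_coef * retention_kl(new_logits, pretrained_logits)
new, old = torch.randn(32, 23), torch.randn(32, 23)  # 23 discrete actions, purely illustrative
print(retention_kl(new, old).item())
```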

Piotr Nawrot (@p_nawrot)

The memory in Transformers grows linearly with the sequence length at inference time.

In SSMs it is constant, but often at the expense of performance.

We introduce Dynamic Memory Compression (DMC), where we retrofit LLMs to compress their KV cache while preserving performance.
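
A back-of-envelope calculation of why that linear growth matters; the model dimensions below are illustrative (roughly 7B-scale) and are not taken from the DMC paper.

```python
# KV cache size for a decoder-only Transformer: keys + values, one pair per layer,
# each of shape [seq_len, n_kv_heads, head_dim]. Config numbers are illustrative.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"seq_len={seq_len:>7}: ~{gib:.1f} GiB per sequence")
# The cache grows linearly with seq_len; compressing it by a factor c
# (as DMC aims to do) shrinks this figure by roughly the same factor.
```
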
AI at Meta (@aiatmeta)

Introducing Meta Llama 3: the most capable openly available LLM to date. Today we’re releasing 8B & 70B models that deliver on new capabilities such as improved reasoning and set a new state-of-the-art for models of their sizes. Today's release includes the first two Llama 3

Gracjan Góral (@gracjan_goral)

🚨New research alert!🚨 Ever wondered if AI can imagine how the world looks through others’ eyes? In our latest preprint: researchgate.net/publication/38…, we tested Vision Language Models (VLMs) on this ability. And the results aren’t promising!

Piotr Nawrot (@p_nawrot)

Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs.

We performed the most comprehensive study on training-free sparse attention to date.

Here is what we found:
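
The findings themselves are in the linked thread. As a rough, hypothetical example of what a training-free sparse attention pattern looks like in code, the sketch below keeps only the top-k highest-scoring keys per query; it is one simple pattern, not necessarily any of the variants the study evaluates.

```python
# Minimal training-free sparse attention sketch: each query attends only to its
# top-k keys (after causal masking). Illustrative, not the study's exact methods.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    """q, k, v: [batch, heads, seq, head_dim]; causal, keeps k_keep keys per query."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5                   # [B, H, Sq, Sk]
    causal = torch.ones_like(scores, dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    k_keep = min(k_keep, scores.size(-1))
    thresh = scores.topk(k_keep, dim=-1).values[..., -1:]       # k-th largest score per query
    scores = scores.masked_fill(scores < thresh, float("-inf")) # drop everything below it
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 256, 64)
print(topk_sparse_attention(q, k, v, k_keep=32).shape)  # torch.Size([1, 8, 256, 64])
```
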
Michal Nauman (@mic_nau)

We wondered if off-policy RL could transfer to real robots on par with on-policy PPO. Turns out it works surprisingly well! We also find that, like on-policy methods, off-policy RL can leverage massively parallel simulation for even better performance 🤖

Edoardo Ponti (@pontiedoardo)

🚀 By *learning* to compress the KV cache in Transformer LLMs, we can generate more tokens for the same compute budget. 

This unlocks *inference-time hyper-scaling*

For the same runtime or memory load, we can boost LLM accuracy by pushing reasoning even further!
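
A toy calculation of the trade-off described above (all numbers are made-up assumptions, not figures from the paper): if the per-token KV cache footprint shrinks by a factor r, roughly r times more generated tokens fit in the same memory budget, which is what lets reasoning run longer at the same cost.

```python
# Illustrative arithmetic only; the budget and per-token cost are assumptions.
budget_gib = 16.0        # memory reserved for the KV cache during decoding (assumption)
gib_per_token = 0.0005   # uncompressed KV cache per generated token (assumption)

for compression_ratio in (1, 2, 4, 8):
    max_tokens = budget_gib / (gib_per_token / compression_ratio)
    print(f"compression {compression_ratio}x -> ~{int(max_tokens):,} tokens of generation")
```
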
Alicja Ziarko (@ziarkoalicja)

Can complex reasoning emerge directly from learned representations? In our new work, we study representations that capture both perceptual and temporal structure, enabling agents to reason without explicit planning. princeton-rl.github.io/CRTR/

Michał Zawalski (@mizawalski)

How do you know whether an LLM is solving the benchmark or has just memorized the test?

We propose CoDeC (Contamination Detection via Context), a lightweight and highly accurate method to find out.

The key insight is simple but powerful. 🧵 (1/N)
Piotr Nawrot (@p_nawrot)

We'll present "Inference-Time Hyper-Scaling with KV Cache Compression" at both NeurIPS and EurIPS. We believe that future advances in AI will require model efficiency, and this work is another step in that direction.

Save the date!
-San Diego, Thur 11:00
-Copenhagen, Thur 10:30