Konrad Staniszewski (@cstankonrad)'s Twitter Profile
Konrad Staniszewski

@cstankonrad

PhD student at @UniWarszawski. Working on long-context LLMs.

ID: 1640064109096861699

Joined: 26-03-2023 18:54:46

28 Tweets

109 Followers

60 Following

Szymon Antoniak (@simontwice2)

✨ Introducing 🍹 Mixture of Tokens 🍹, a stable alternative to existing Mixture of Experts techniques for LLMs, providing significantly more stable training 📈. You can check out our initial results at llm-random.github.io/posts/mixture_… 🧵 (1/n)
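
The tweet does not spell out the mechanism. As a loose, hypothetical sketch of the general idea the name suggests (each expert processes a learned soft mixture of a group of tokens, instead of tokens being routed discretely), here is a minimal module; the shapes, names, and redistribution step are assumptions rather than the authors' implementation.

```python
# Hypothetical "soft token mixing" sketch; not the implementation from llm-random.
import torch
import torch.nn as nn

class MixtureOfTokensSketch(nn.Module):
    def __init__(self, d_model=512, n_experts=4, d_ff=2048):
        super().__init__()
        self.controller = nn.Linear(d_model, n_experts)   # per-token mixing weights
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, group):                              # group: [group_size, d_model]
        w = torch.softmax(self.controller(group), dim=0)   # normalize over the group, per expert
        mixed = w.T @ group                                 # [n_experts, d_model]: one mixed token per expert
        processed = torch.stack([expert(mixed[i]) for i, expert in enumerate(self.experts)])
        return w @ processed                                # redistribute expert outputs back to the tokens

out = MixtureOfTokensSketch()(torch.randn(8, 512))
print(out.shape)  # torch.Size([8, 512])
```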

Sebastian Jaszczur (@s_jaszczur)

Introducing 🔥 MoE-Mamba 🔥, combining two exciting LLM techniques, Mixture of Experts and State Space Models. It matches Mamba’s performance in 2.2x fewer training steps 🚀. More in this short thread 👇
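
The thread gives the headline result rather than the layout. Below is a hypothetical sketch of one way to combine the two techniques, alternating an SSM-style block with an MoE feed-forward layer; the Mamba block is a stand-in placeholder and the MoE layer is a generic top-1 router, not the authors' design.

```python
# Structural sketch only: a real Mamba block is a selective state-space layer,
# and the MoE layer here is a generic "switch"-style top-1 routed feed-forward.
import torch
import torch.nn as nn

class MambaBlockPlaceholder(nn.Module):
    """Stand-in for a real Mamba (selective SSM) block."""
    def __init__(self, d_model):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: [batch, seq, d_model]
        return x + self.mix(x)

class SwitchMoE(nn.Module):
    """Generic top-1 routed feed-forward layer (one expert per token)."""
    def __init__(self, d_model, d_ff=2048, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):
        gates = torch.softmax(self.router(x), dim=-1)        # [batch, seq, n_experts]
        top_gate, top_idx = gates.max(dim=-1)                 # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return x + out

def moe_mamba_stack(d_model=512, n_pairs=4):
    """Alternate SSM-style blocks with MoE feed-forward blocks."""
    layers = []
    for _ in range(n_pairs):
        layers += [MambaBlockPlaceholder(d_model), SwitchMoE(d_model)]
    return nn.Sequential(*layers)

x = torch.randn(2, 16, 512)
print(moe_mamba_stack()(x).shape)  # torch.Size([2, 16, 512])
```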

Bartłomiej Cupiał (@cupiabart)

🚀Excited to share our latest work on fine-tuning RL models! By integrating fine-tuning with knowledge retention methods, we've achieved SOTA🔥in NetHack🎮, with scores surpassing 10K points, doubling the previous record. A detailed thread coming soon! ✨
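
The detailed thread is promised later; as a generic, hypothetical illustration of what a "knowledge retention" term can look like during fine-tuning (an auxiliary KL penalty pulling the fine-tuned policy toward the pretrained one), here is a minimal sketch. It is not necessarily the method used in this work.

```python
# Generic illustration only, not the paper's recipe: penalize divergence of the
# fine-tuned policy from the pretrained policy to retain pretrained knowledge.
import torch
import torch.nn.functional as F

def retention_kl(new_logits: torch.Tensor, pretrained_logits: torch.Tensor) -> torch.Tensor:
    """KL(pretrained || fine-tuned) over the action distribution, averaged over the batch."""
    return F.kl_div(
        F.log_softmax(new_logits, dim=-1),         # log-probs of the fine-tuned policy
        F.log_softmax(pretrained_logits, dim=-1),  # target given as log-probs
        log_target=True,
        reduction="batchmean",
    )

# total_loss = rl_loss + retention_coef * retention_kl(new_logits, pretrained_logits)
new, old = torch.randn(32, 23), torch.randn(32, 23)  # 23 discrete actions, purely illustrative
print(retention_kl(new, old).item())
```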

Piotr Nawrot (@p_nawrot)

The memory in Transformers grows linearly with the sequence length at inference time.

In SSMs it is constant, but often at the expense of performance.

We introduce Dynamic Memory Compression (DMC), where we retrofit LLMs to compress their KV cache while preserving performance.
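
A back-of-envelope calculation of why that linear growth matters; the model dimensions below are illustrative (roughly 7B-scale) and are not taken from the DMC paper.

```python
# KV cache size for a decoder-only Transformer: keys + values, one pair per layer,
# each of shape [seq_len, n_kv_heads, head_dim]. Config numbers are illustrative.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"seq_len={seq_len:>7}: ~{gib:.1f} GiB per sequence")
# The cache grows linearly with seq_len; compressing it by a factor c
# (as DMC aims to do) shrinks this figure by roughly the same factor.
```
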
AI at Meta (@aiatmeta)

Introducing Meta Llama 3: the most capable openly available LLM to date. Today we’re releasing 8B & 70B models that deliver on new capabilities such as improved reasoning and set a new state-of-the-art for models of their sizes. Today's release includes the first two Llama 3

Gracjan Góral (@gracjan_goral)

🚨New research alert!🚨 Ever wondered if AI can imagine how the world looks through others’ eyes? In our latest preprint: researchgate.net/publication/38…, we tested Vision Language Models (VLMs) on this ability. And the results aren’t promising!

Piotr Nawrot (@p_nawrot)

Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs.

We performed the most comprehensive study on training-free sparse attention to date.

Here is what we found:
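
The findings themselves are in the linked thread. As a rough, hypothetical example of what a training-free sparse attention pattern looks like in code, the sketch below keeps only the top-k highest-scoring keys per query; it is one simple pattern, not necessarily any of the variants the study evaluates.

```python
# Minimal training-free sparse attention sketch: each query attends only to its
# top-k keys (after causal masking). Illustrative, not the study's exact methods.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    """q, k, v: [batch, heads, seq, head_dim]; causal, keeps k_keep keys per query."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5                   # [B, H, Sq, Sk]
    causal = torch.ones_like(scores, dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    k_keep = min(k_keep, scores.size(-1))
    thresh = scores.topk(k_keep, dim=-1).values[..., -1:]       # k-th largest score per query
    scores = scores.masked_fill(scores < thresh, float("-inf")) # drop everything below it
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 256, 64)
print(topk_sparse_attention(q, k, v, k_keep=32).shape)  # torch.Size([1, 8, 256, 64])
```
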
Michal Nauman (@mic_nau)

We wondered if off-policy RL could transfer to real robots on par with on-policy PPO. Turns out it works surprisingly well! We also find that, like on-policy methods, off-policy RL can leverage massively parallel simulation for even better performance 🤖

Edoardo Ponti (@pontiedoardo)

🚀 By *learning* to compress the KV cache in Transformer LLMs, we can generate more tokens for the same compute budget. 

This unlocks *inference-time hyper-scaling*

For the same runtime or memory load, we can boost LLM accuracy by pushing reasoning even further!
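
A toy calculation of the trade-off described above (all numbers are made-up assumptions, not figures from the paper): if the per-token KV cache footprint shrinks by a factor r, roughly r times more generated tokens fit in the same memory budget, which is what lets reasoning run longer at the same cost.

```python
# Illustrative arithmetic only; the budget and per-token cost are assumptions.
budget_gib = 16.0        # memory reserved for the KV cache during decoding (assumption)
gib_per_token = 0.0005   # uncompressed KV cache per generated token (assumption)

for compression_ratio in (1, 2, 4, 8):
    max_tokens = budget_gib / (gib_per_token / compression_ratio)
    print(f"compression {compression_ratio}x -> ~{int(max_tokens):,} tokens of generation")
```
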
Alicja Ziarko (@ziarkoalicja)

Can complex reasoning emerge directly from learned representations? In our new work, we study representations that capture both perceptual and temporal structure, enabling agents to reason without explicit planning. princeton-rl.github.io/CRTR/

Michał Zawalski (@mizawalski)

How do you know whether an LLM is solving the benchmark or has just memorized the test?

We propose CoDeC (Contamination Detection via Context), a lightweight and highly accurate method to find out.

The key insight is simple but powerful. 🧵 (1/N)
Piotr Nawrot (@p_nawrot)

We'll present "Inference-Time Hyper-Scaling with KV Cache Compression" at both NeurIPS and EurIPS. We believe that future advances in AI will require model efficiency, and this work is another step in that direction.

Save the date!
-San Diego, Thur 11:00
-Copenhagen, Thur 10:30