khalid (@k_saifullaah) 's Twitter Profile
khalid

@k_saifullaah

cs phd student @umdcs. prev: ai research @adobe, @deltanalytics fellow 🇧🇩

ID: 3117611679

Link: https://scholar.google.com/citations?user=NNEbBIQAAAAJ&hl=en
Joined: 26-03-2015 08:08:54

6.6K Tweets

3.3K Followers

1.1K Following

Michael Saxon (@m2saxon) 's Twitter Profile Photo

🚨😱Obligatory job market announcement post‼️🤯

I'm searching for faculty positions/postdocs in multimodal/multilingual NLP and generative AI!

I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat!

Website in bio, papers in🧵
Abhimanyu Hans (@ahans30) 's Twitter Profile Photo

poster sent for print 😮‍💨

are you concerned your prod LLM might regurgitate exact training data to your users?

join me and my co-authors at #NeurIPS2024 on wednesday's 1st poster session & learn how goldfish loss can help you.

eager to meet friends from past and future!
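The goldfish loss idea can be sketched in a few lines: exclude a pseudorandom subset of token positions from the next-token loss so the model can never learn a training passage verbatim end-to-end. The hash-of-local-context rule, the 4-token window, and the drop rate k below are illustrative assumptions, not the paper's exact recipe.

```python
import hashlib

def goldfish_mask(tokens, k=4):
    """Sketch of a goldfish-style loss mask: a position is dropped from the
    next-token loss whenever a hash of its local context falls in bucket 0,
    so roughly 1/k of positions never contribute gradient. The context
    window size and k=4 are illustrative choices, not the paper's settings."""
    mask = []
    for i in range(len(tokens)):
        ctx = ",".join(map(str, tokens[max(0, i - 3):i + 1]))
        h = int(hashlib.md5(ctx.encode()).hexdigest(), 16)
        mask.append(h % k != 0)  # True -> position contributes to the loss
    return mask
```

Hashing the context (rather than coin-flipping per step) makes the mask deterministic, so repeated copies of the same passage always have the same tokens hidden from the loss.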
Mahesh Sathiamoorthy (@madiator) 's Twitter Profile Photo

We are happy to announce Curator, an open-source library designed to streamline synthetic data generation! High-quality synthetic data generation is essential in training and evaluating LLMs/agents/RAG pipelines these days, but tooling around this is still entirely lacking! So…

Tom Goldstein (@tomgoldsteincs) 's Twitter Profile Photo

Let’s sanity check DeepSeek’s claim to train on 2048 GPUs for under 2 months, for a cost of $5.6M. It sort of checks out and sort of doesn't. The v3 model is an MoE with 37B (out of 671B) active parameters. Let's compare to the cost of a 34B dense model. 🧵 (1/4)
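The headline cost lends itself to a quick back-of-envelope check; the $2/GPU-hour rental rate and 56-day ("under 2 months") run length below are my assumptions, not figures from the thread.

```python
# Back-of-envelope check of the claimed training cost.
# Assumed inputs: ~$2/GPU-hour rental and a 56-day run.
gpus, days, usd_per_gpu_hour = 2048, 56, 2.0

gpu_hours = gpus * days * 24          # total GPU-hours consumed
cost_usd = gpu_hours * usd_per_gpu_hour  # ~$5.5M, near the claimed $5.6M
```

Under these assumptions the arithmetic lands close to the stated $5.6M, which is the "sort of checks out" part; the thread's caveat is about what the number does and doesn't include.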

Aryaman Arora (@aryaman2020) 's Twitter Profile Photo

new paper! 🫡

we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering.

we find that:
🥇prompting and finetuning are still best
🥈supervised interp methods are effective
😮SAEs lag behind
Ashwinee Panda (@pandaashwinee) 's Twitter Profile Photo

1 week to submit to our Sparsity workshop at ICLR 2026! That means SAEs, Sparse models, KV cache compression, quantization, pruning; we want to bring together folks from different sub areas to share ideas on how to make LLMs smaller, faster, and better!

Micah Goldblum (@micahgoldblum) 's Twitter Profile Photo

Here’s an easy trick for improving the performance of gradient-boosted decision trees like XGBoost allowing them to read text column headers and to benefit from massive pretraining: replace the first tree with an LLM or TabPFN! 🧵 1/9

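The "replace the first tree" trick can be sketched without a real boosting library: treat a pretrained model's predictions as the initial margin, then fit the remaining weak learners to residuals as usual. `weak_fit`, the learning rate, and round count here are placeholders for illustration, not XGBoost's actual API (which would take the pretrained predictions via an initial margin).

```python
def boosted_from_base(y, base_pred, weak_fit, n_rounds=3, lr=0.5):
    """Boosting sketch: start from a pretrained model's predictions (the
    'first tree' in the trick above), then repeatedly fit a weak learner
    to the residuals and add its scaled corrections."""
    pred = list(base_pred)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        corr = weak_fit(resid)  # weak learner's per-example corrections
        pred = [pi + lr * ci for pi, ci in zip(pred, corr)]
    return pred
```

With a real library the same effect comes from supplying the LLM/TabPFN predictions as the boosting chain's starting point, so every subsequent tree only models what the pretrained model got wrong.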
Abhimanyu Hans (@ahans30) 's Twitter Profile Photo

the real problem is that current retrievers are not instruction-friendly. that's why you google keywords likely to be found in the answers/docs rather than try to "prompt" your query. or you know... just ask your LLM

Tom Goldstein (@tomgoldsteincs) 's Twitter Profile Photo

New open source reasoning model!

Huginn-3.5B reasons implicitly in latent space 🧠

Unlike O1 and R1, latent reasoning doesn’t need special chain-of-thought training data, and doesn't produce extra CoT tokens at test time.

We trained on 800B tokens 👇
Jonas Geiping (@jonasgeiping) 's Twitter Profile Photo

Ok, so I can finally talk about this! 

We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.

The model has an internal latent space in which it can adaptively spend more compute to think longer. 

I think the tech report ...🐦‍⬛
Luca Soldaini ✈️ ICLR 25 (@soldni) 's Twitter Profile Photo

we did a thing!! Refreshed OLMoE models, now running locally on your phone with brand new, open-source iOS app 🥳 We think we are gonna see more on-device AI in 2025, and we wanted to offer a simple way for everyone to prototype with it! (please be kind with video 😬)

Sean McLeish (@seanmcleish) 's Twitter Profile Photo

Introducing the Gemstones💎. 22 models ranging from 50M to 2B parameters, spanning 11 widths and 18 depths trained for 350B tokens of Dolma to allow for a more detailed analysis of scaling laws. 1/n

Ashwinee Panda (@pandaashwinee) 's Twitter Profile Photo

people are talking about whether scaling laws are broken or pretraining is saturating. so what does that even mean? consider the loss curves from our recent gemstones paper. as we add larger models, the convex hull doesn’t flatten out on this log-log plot. that's good!

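The "convex hull" here is the lower envelope traced over many models' loss curves on a log-log compute-vs-loss plot; if it keeps sloping down as bigger models are added, scaling isn't saturating. A minimal sketch of extracting that envelope as a Pareto frontier over (compute, loss) points (my simplification, not the paper's code):

```python
def lower_envelope(points):
    """Lower-left Pareto frontier of (compute, loss) points: keep each point
    that achieves a strictly lower loss than everything cheaper than it.
    Sketch of the envelope a scaling-law fit is drawn through."""
    frontier, best = [], float("inf")
    for compute, loss in sorted(points):
        if loss < best:
            frontier.append((compute, loss))
            best = loss
    return frontier
```

Fitting a line to the frontier points in log-log space then gives the scaling-law slope; a flattening frontier would show up as that slope decaying toward zero.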
Pedro Sandoval (@psandovalsegura) 's Twitter Profile Photo

Attention sinks in LLMs are weird. There’s ~20% of heads that don’t seem to do anything.

Do these heads matter? Turns out that if we get rid of them, benchmark scores don’t change.
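The ablation described can be sketched as zeroing the dropped heads' contributions before the per-head outputs are summed back into the residual stream. Real implementations mask inside the attention projection; this stripped-down list-of-vectors version is only illustrative.

```python
def ablate_heads(head_outputs, dead_heads):
    """Sum per-head attention outputs while skipping ablated heads.
    head_outputs: one vector (list of floats) per head; dead_heads: set of
    head indices to remove, as in the benchmark-score experiment above."""
    dim = len(head_outputs[0])
    total = [0.0] * dim
    for h, out in enumerate(head_outputs):
        if h in dead_heads:
            continue  # ablated head contributes nothing to the residual
        total = [t + o for t, o in zip(total, out)]
    return total
```

The claim in the thread is then that choosing `dead_heads` to be the ~20% of apparently inert sink-like heads leaves benchmark scores essentially unchanged.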
Omar Khattab (@lateinteraction) 's Twitter Profile Photo

So many things in the run-up to DSPy 3. Here's a first, EXPERIMENTAL one: 🚨We're releasing dspy.GRPO, an online RL optimizer for DSPy programs. Your DSPy code as-is can be dspy.GRPO'ed. Yes, even compound multi-module programs. Led by Noah Ziems, Lakshya A Agrawal, and dilara.

jack morris (@jxmnop) 's Twitter Profile Photo

new paper from our work at Meta!

**GPT-style language models memorize 3.6 bits per param**

we compute capacity by measuring total bits memorized, using some theory from Shannon (1953)

shockingly, the memorization-datasize curves look like this:
      ___________
  /
/

(🧵)
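Taking the 3.6 bits/param figure at face value gives a quick back-of-envelope capacity estimate (my arithmetic, not the paper's):

```python
def memorized_capacity_mb(params, bits_per_param=3.6):
    """Rough memorization capacity implied by a bits-per-parameter figure:
    convert total memorized bits to megabytes."""
    return params * bits_per_param / 8 / 1e6  # bits -> bytes -> MB

# e.g. a 1B-parameter model under this estimate: ~450 MB of training data
cap = memorized_capacity_mb(1_000_000_000)
```

The plateau in the ASCII plot above is consistent with this reading: once the dataset exceeds the model's bit budget, total memorization stops growing with data size.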