khalid (@k_saifullaah) 's Twitter Profile
khalid

@k_saifullaah

cs phd student @umdcs. prev: ai research @adobe, @deltanalytics fellow 🇧🇩

ID: 3117611679

Link: https://scholar.google.com/citations?user=NNEbBIQAAAAJ&hl=en
Joined: 26-03-2015 08:08:54

6.6K Tweets

3.3K Followers

1.1K Following

Michael Saxon (@m2saxon) 's Twitter Profile Photo

🚨😱Obligatory job market announcement post‼️🤯

I'm searching for faculty positions/postdocs in multimodal/multilingual NLP and generative AI!

I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat!

Website in bio, papers in🧵
Abhimanyu Hans (@ahans30) 's Twitter Profile Photo

poster sent for print 😮‍💨

are you concerned your prod LLM might regurgitate exact training data to your users?

join me and my co-authors at #NeurIPS2024 on wednesday's 1st poster session & learn how goldfish loss can help you.

eager to meet friends from past and future!
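The goldfish loss idea can be sketched in a few lines: exclude a pseudorandom subset of token positions from the next-token loss so the model can never learn a training passage verbatim end-to-end. The hash-of-local-context rule, the 4-token window, and the drop rate k below are illustrative assumptions, not the paper's exact recipe.

```python
import hashlib

def goldfish_mask(tokens, k=4):
    """Sketch of a goldfish-style loss mask: a position is dropped from the
    next-token loss whenever a hash of its local context falls in bucket 0,
    so roughly 1/k of positions never contribute gradient. The context
    window size and k=4 are illustrative choices, not the paper's settings."""
    mask = []
    for i in range(len(tokens)):
        ctx = ",".join(map(str, tokens[max(0, i - 3):i + 1]))
        h = int(hashlib.md5(ctx.encode()).hexdigest(), 16)
        mask.append(h % k != 0)  # True -> position contributes to the loss
    return mask
```

Hashing the context (rather than coin-flipping per step) makes the mask deterministic, so repeated copies of the same passage always have the same tokens hidden from the loss.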
Mahesh Sathiamoorthy (@madiator) 's Twitter Profile Photo

We are happy to announce Curator, an open-source library designed to streamline synthetic data generation! High-quality synthetic data generation is essential in training and evaluating LLMs/agents/RAG pipelines these days, but tooling around this is still entirely lacking! So…

Tom Goldstein (@tomgoldsteincs) 's Twitter Profile Photo

Let’s sanity check DeepSeek’s claim to train on 2048 GPUs for under 2 months, for a cost of $5.6M. It sort of checks out and sort of doesn't. The v3 model is an MoE with 37B (out of 671B) active parameters. Let's compare to the cost of a 34B dense model. 🧵 (1/4)
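The headline cost lends itself to a quick back-of-envelope check; the $2/GPU-hour rental rate and 56-day ("under 2 months") run length below are my assumptions, not figures from the thread.

```python
# Back-of-envelope check of the claimed training cost.
# Assumed inputs: ~$2/GPU-hour rental and a 56-day run.
gpus, days, usd_per_gpu_hour = 2048, 56, 2.0

gpu_hours = gpus * days * 24          # total GPU-hours consumed
cost_usd = gpu_hours * usd_per_gpu_hour  # ~$5.5M, near the claimed $5.6M
```

Under these assumptions the arithmetic lands close to the stated $5.6M, which is the "sort of checks out" part; the thread's caveat is about what the number does and doesn't include.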

Aryaman Arora (@aryaman2020) 's Twitter Profile Photo

new paper! 🫡

we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering.

we find that:
🥇prompting and finetuning are still best
🥈supervised interp methods are effective
😮SAEs lag behind
Ashwinee Panda (@pandaashwinee) 's Twitter Profile Photo

1 week to submit to our Sparsity workshop at ICLR 2026! That means SAEs, Sparse models, KV cache compression, quantization, pruning; we want to bring together folks from different sub areas to share ideas on how to make LLMs smaller, faster, and better!

Micah Goldblum (@micahgoldblum) 's Twitter Profile Photo

Here’s an easy trick for improving the performance of gradient-boosted decision trees like XGBoost allowing them to read text column headers and to benefit from massive pretraining: replace the first tree with an LLM or TabPFN! 🧵 1/9

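The "replace the first tree" trick can be sketched without a real boosting library: treat a pretrained model's predictions as the initial margin, then fit the remaining weak learners to residuals as usual. `weak_fit`, the learning rate, and round count here are placeholders for illustration, not XGBoost's actual API (which would take the pretrained predictions via an initial margin).

```python
def boosted_from_base(y, base_pred, weak_fit, n_rounds=3, lr=0.5):
    """Boosting sketch: start from a pretrained model's predictions (the
    'first tree' in the trick above), then repeatedly fit a weak learner
    to the residuals and add its scaled corrections."""
    pred = list(base_pred)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        corr = weak_fit(resid)  # weak learner's per-example corrections
        pred = [pi + lr * ci for pi, ci in zip(pred, corr)]
    return pred
```

With a real library the same effect comes from supplying the LLM/TabPFN predictions as the boosting chain's starting point, so every subsequent tree only models what the pretrained model got wrong.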
Abhimanyu Hans (@ahans30) 's Twitter Profile Photo

the real problem is that current retrievers are not instruction-friendly. that's why you google keywords likely to be found in the answers/docs rather than try to "prompt" your query. or you know... just ask your LLM

Tom Goldstein (@tomgoldsteincs) 's Twitter Profile Photo

New open source reasoning model!

Huginn-3.5B reasons implicitly in latent space 🧠

Unlike O1 and R1, latent reasoning doesn’t need special chain-of-thought training data, and doesn't produce extra CoT tokens at test time.

We trained on 800B tokens 👇
Jonas Geiping (@jonasgeiping) 's Twitter Profile Photo

Ok, so I can finally talk about this! 

We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.

The model has an internal latent space in which it can adaptively spend more compute to think longer. 

I think the tech report ...🐦‍⬛
Luca Soldaini ✈️ ICLR 25 (@soldni) 's Twitter Profile Photo

we did a thing!! Refreshed OLMoE models, now running locally on your phone with brand new, open-source iOS app 🥳 We think we are gonna see more on-device AI in 2025, and we wanted to offer a simple way for everyone to prototype with it! (please be kind with video 😬)

Sean McLeish (@seanmcleish) 's Twitter Profile Photo

Introducing the Gemstones💎. 22 models ranging from 50M to 2B parameters, spanning 11 widths and 18 depths trained for 350B tokens of Dolma to allow for a more detailed analysis of scaling laws. 1/n

Ashwinee Panda (@pandaashwinee) 's Twitter Profile Photo

people are talking about whether scaling laws are broken or pretraining is saturating. so what does that even mean? consider the loss curves from our recent gemstones paper. as we add larger models, the convex hull doesn’t flatten out on this log-log plot. that's good!

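The "convex hull" here is the lower envelope traced over many models' loss curves on a log-log compute-vs-loss plot; if it keeps sloping down as bigger models are added, scaling isn't saturating. A minimal sketch of extracting that envelope as a Pareto frontier over (compute, loss) points (my simplification, not the paper's code):

```python
def lower_envelope(points):
    """Lower-left Pareto frontier of (compute, loss) points: keep each point
    that achieves a strictly lower loss than everything cheaper than it.
    Sketch of the envelope a scaling-law fit is drawn through."""
    frontier, best = [], float("inf")
    for compute, loss in sorted(points):
        if loss < best:
            frontier.append((compute, loss))
            best = loss
    return frontier
```

Fitting a line to the frontier points in log-log space then gives the scaling-law slope; a flattening frontier would show up as that slope decaying toward zero.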
Pedro Sandoval (@psandovalsegura) 's Twitter Profile Photo

Attention sinks in LLMs are weird. There’s ~20% of heads that don’t seem to do anything.

Do these heads matter? Turns out that if we get rid of them, benchmark scores don’t change.
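The ablation described can be sketched as zeroing the dropped heads' contributions before the per-head outputs are summed back into the residual stream. Real implementations mask inside the attention projection; this stripped-down list-of-vectors version is only illustrative.

```python
def ablate_heads(head_outputs, dead_heads):
    """Sum per-head attention outputs while skipping ablated heads.
    head_outputs: one vector (list of floats) per head; dead_heads: set of
    head indices to remove, as in the benchmark-score experiment above."""
    dim = len(head_outputs[0])
    total = [0.0] * dim
    for h, out in enumerate(head_outputs):
        if h in dead_heads:
            continue  # ablated head contributes nothing to the residual
        total = [t + o for t, o in zip(total, out)]
    return total
```

The claim in the thread is then that choosing `dead_heads` to be the ~20% of apparently inert sink-like heads leaves benchmark scores essentially unchanged.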
Omar Khattab (@lateinteraction) 's Twitter Profile Photo

So many things in the run-up to DSPy 3. Here's a first, EXPERIMENTAL one: 🚨We're releasing dspy.GRPO, an online RL optimizer for DSPy programs. Your DSPy code as-is can be dspy.GRPO'ed. Yes, even compound multi-module programs. Led by Noah Ziems, Lakshya A Agrawal, and dilara.

jack morris (@jxmnop) 's Twitter Profile Photo

new paper from our work at Meta!

**GPT-style language models memorize 3.6 bits per param**

we compute capacity by measuring total bits memorized, using some theory from Shannon (1953)

shockingly, the memorization-datasize curves look like this:
      ___________
  /
/

(🧵)
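Taking the 3.6 bits/param figure at face value gives a quick back-of-envelope capacity estimate (my arithmetic, not the paper's):

```python
def memorized_capacity_mb(params, bits_per_param=3.6):
    """Rough memorization capacity implied by a bits-per-parameter figure:
    convert total memorized bits to megabytes."""
    return params * bits_per_param / 8 / 1e6  # bits -> bytes -> MB

# e.g. a 1B-parameter model under this estimate: ~450 MB of training data
cap = memorized_capacity_mb(1_000_000_000)
```

The plateau in the ASCII plot above is consistent with this reading: once the dataset exceeds the model's bit budget, total memorization stops growing with data size.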