Sachin Kumar (@shocheen)'s Twitter Profile
Sachin Kumar

@shocheen

Assistant Professor at @OhioStateCSE. Hiring Ph.D. students (Fall '25).

Previous: @allen_ai, @UWNLP, @LTICMU. He/Him 🏳️‍🌈

ID: 267680298

Website: http://shocheen.com | Joined: 17-03-2011 10:39:58

424 Tweets

1.1K Followers

690 Following

Alisa Liu (@alisawuffles):

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
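
For intuition, here is a minimal, self-contained sketch of the idea, not the SuperBPE implementation from the paper (the function name, merge schedule, and greedy application are simplifying assumptions): a toy BPE trainer whose only change is dropping the whitespace constraint, so frequent multi-word strings can become single tokens.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int, superword: bool = True):
    """Toy byte-pair-encoding trainer over a character sequence.

    With superword=False, a merge is rejected if the resulting token would
    contain an internal space (the usual whitespace-pretokenization rule).
    With superword=True, that rule is dropped, so frequent multi-word strings
    such as " of course" can become single tokens.
    """
    seq = list(corpus)  # spaces are treated as ordinary symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))

        def allowed(pair):
            merged = pair[0] + pair[1]
            # A leading space is fine (GPT-2-style " word" tokens); an
            # internal space means the token would span a word boundary.
            return superword or " " not in merged[1:]

        candidates = [(count, pair) for pair, count in pairs.items() if allowed(pair)]
        if not candidates:
            break
        _, best = max(candidates)
        merges.append(best)

        # Greedily apply the chosen merge across the whole sequence.
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                new_seq.append(seq[i] + seq[i + 1])
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return merges, seq

if __name__ == "__main__":
    text = "of course it works of course it does of course"
    _, tokens = train_bpe(text, num_merges=40, superword=True)
    print(tokens)  # frequent spans like "of course" can surface as single tokens
```

The real tokenizer is trained at scale and interacts with pretraining and inference efficiency; this sketch only shows where the whitespace restriction enters a BPE trainer.
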
Valentin Hofmann (@vjhofmann):

Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens. Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! 🚀 Details 👇

Abhilasha Ravichander (@lasha_nlp):

Want to know what training data has been memorized by models like GPT-4?

We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models,

without requiring access to
🙅‍♀️ Model weights
🙅‍♀️ Training data
🙅‍♀️ Token probabilities 🧵1/5
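
The tweet does not spell out how the probes work, so the snippet below is only a generic illustration of black-box memorization testing, not the paper's information-guided method: a cloze-style check that masks rare spans in a candidate passage and asks the model, via any plain text-completion callable passed in as `generate_fn` (an assumption here), to restore them exactly, using no weights, corpus access, or token probabilities.

```python
from typing import Callable, List

def cloze_memorization_probe(passage: str,
                             targets: List[str],
                             generate_fn: Callable[[str], str]) -> float:
    """Generic black-box cloze probe (illustrative only).

    For each rare target span in the passage, mask it and ask the model to
    restore the exact original words. A high exact-recovery rate on
    low-frequency spans (names, numbers) suggests the passage was seen
    during training, using text completions alone.
    """
    hits = 0
    for target in targets:
        masked = passage.replace(target, "[MASK]", 1)
        prompt = ("Fill in the [MASK] in the passage below with the exact "
                  "original words. Answer with only the missing words.\n\n" + masked)
        answer = generate_fn(prompt).strip()
        hits += int(answer.lower() == target.lower())
    return hits / len(targets)

# Example with a stand-in "model" that always answers incorrectly:
rate = cloze_memorization_probe(
    "Call me Ishmael. Some years ago, never mind how long precisely...",
    targets=["Ishmael"],
    generate_fn=lambda prompt: "John",
)
print(rate)  # 0.0
```
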
Patrick Da Silva (@patrickqdasilva):

We report many aggregated results in our paper, and invite researchers to comb through the extensive results in our repository to build intuitions about model variance.
Our paper: arxiv.org/abs/2504.04635
Code, Data, Results, and Figures for all LMs: github.com/patqdasilva/st… (9/10)

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025):

🚨 NEW WORKSHOP ALERT 🚨
We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025! 🎉 Submissions are open for work on tokenization across all areas of machine learning.
📅 Submission deadline: May 30, 2025
🔗 tokenization-workshop.github.io

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025):

There has been a lot of chatter about tokenization for LLMs over the last few months, but tokenization goes beyond text-based models. It's time we bring the NLP and ML communities together to explore this foundational topic. Let's talk about tokenization at TokShop!

Oreva Ahia (@orevaahia):

Working on tokenization in any modality: text, audio, images, or videos? Submit your paper to our Tokenization Workshop at #ICML2025!

Valentin Hofmann (@vjhofmann):

Delighted there will finally be a workshop devoted to tokenization - a critical topic for LLMs and beyond! 🎉 Join us for the inaugural edition of TokShop at #ICML2025 in Vancouver this summer! 🤗

Tuhin Chakrabarty (@tuhinchakr):

Unlike math/code, writing lacks verifiable rewards, so all we get is slop. To solve this, we train reward models on expert edits that largely beat SOTA #LLMs on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.

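The tweet does not say which test-time procedure is used; one standard way to apply a reward model at inference is best-of-n reranking, sketched below with placeholder `sample_fn` and `reward_fn` callables (both assumptions, not the paper's interfaces).

```python
from typing import Callable, List

def best_of_n(prompt: str,
              sample_fn: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    """Best-of-n reranking: draw n candidate drafts and keep the one the
    reward model scores highest, with no retraining of the generator."""
    candidates: List[str] = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda draft: reward_fn(prompt, draft))
```

Swapping a writing-quality reward model in for `reward_fn` would bias generations toward expert-edited style at test time.
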
Chan Young Park (@chan_young_park):

🚀 Excited to share our #NAACL2025 paper on Language Model Personalization! arxiv.org/abs/2410.16027
Current RLHF methods often overlook *whose* preferences are being optimized. This can cause conflicting signals and models that mainly cater to the “average” or most dominant users
Chan Young Park (@chan_young_park):

While I'm on X to share my paper, I also have a life update: I'll be joining the School of Information at UT Austin as an assistant professor starting Fall 2026! Excited for this next chapter, and to keep working on teaching computers to better understand language and humans (+ now teaching humans too).

Sachin Kumar (@shocheen):

I will be at #NAACL2025 next week to talk about this paper. Much work on personalizing LLMs focuses on explicit preferences, norms, and values, either directly optimized for or specified in the prompts/instructions. In this work, we study implicit preferences that may not…

Sanchaita Hazra (@hsanchaita):

Very excited for a new #ICML2025 position paper accepted as oral w/ Bodhisattwa Majumder & Tuhin Chakrabarty! 😎

What are the longitudinal harms of AI development?

We use economic theories to highlight AI’s intertemporal impacts on livelihoods & its role in deepening labor-market inequality.
Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025):

📣 Call for Papers: TokShop @ ICML 2025
TokShop explores tokenization across all data modalities. Topics include subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025):

Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬 Why bother with rebuttal when the perfect venue is right around the corner! Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! 🚀