Joshua Susskind (@jmsusskind)'s Twitter Profile
Joshua Susskind

@jmsusskind

ID: 1932661597450964992

Joined: 11-06-2025 04:50:26

4 Tweets

13 Followers

12 Following

Vimal Thilak🦉🐒 (@aggieinca)'s Twitter Profile Photo

🚨 Machine Learning Research Internship opportunity in Apple MLR! We are looking for a PhD research intern with a strong interest in world modeling, planning or learning video representations for planning and/or reasoning. If interested, apply by sending an email to me at

Vimal Thilak🦉🐒 (@aggieinca)'s Twitter Profile Photo

Xianhang Li has a thread on work conducted during his internship. I'm very happy to see this project out in the open! Please check it out. We love video-based learning ;)

Michael Kirchhof (@mkirchhof_)'s Twitter Profile Photo

Many treat uncertainty as just a number. At Apple, we're rethinking this: LLMs should output strings that reveal all the information in their internal distributions. We find reasoning, SFT, and CoT can't do it, yet. To get there, we introduce the SelfReflect benchmark.

arxiv.org/abs/2505.20295
Jorge Colon (@jorgeconsulting)'s Twitter Profile Photo

Dude. MLX team really went all out. How is it that you can get almost 2.5x the performance of the same quant of a model just by using the MLX version of it?

Granite 4 H Tiny 4bit GGUF = 47 tokens/sec
Granite 4 H Tiny 4bit MLX w/ DWQ = 115 tokens/sec

This is on my M3 Max
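The "almost 2.5x" claim checks out arithmetically. A minimal sketch using only the throughput numbers reported in the tweet (model names and figures are the tweet's, not independently measured):

```python
# Throughput figures as reported in the tweet (M3 Max, same 4-bit quantization)
gguf_tps = 47       # Granite 4 H Tiny, 4-bit GGUF (tokens/sec, llama.cpp-style runtime)
mlx_dwq_tps = 115   # Granite 4 H Tiny, 4-bit MLX with DWQ (tokens/sec)

speedup = mlx_dwq_tps / gguf_tps
print(f"MLX + DWQ speedup: {speedup:.2f}x")  # prints "MLX + DWQ speedup: 2.45x"
```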
Huangjie Zheng (@undergroundjeg)'s Twitter Profile Photo

We’re excited to share our new paper: Continuously-Augmented Discrete Diffusion (CADD) — a simple yet effective way to bridge discrete and continuous diffusion models on discrete data, such as language modeling. [1/n] 

Paper: arxiv.org/abs/2510.01329
Awni Hannun (@awnihannun)'s Twitter Profile Photo

I love this line of research from my colleagues at Apple:

Augmenting a language model with a hierarchical memory makes perfect sense for several reasons:

- Intuitively the memory parameters should be accessed much less frequently than the weights responsible for reasoning. You
Joshua Susskind (@jmsusskind)'s Twitter Profile Photo

When Eran Malach joined our Apple research group he immediately got to work on long context and length generalization. This led to beautiful results showing that state space models (SSMs) with tool calling have compelling generalization abilities in long-form generation!

Joshua Susskind (@jmsusskind)'s Twitter Profile Photo

Check out RepTok, which represents each image as a single continuous latent token, and leverages pre-trained SSL encoders for highly efficient generative model training. This work was led by our excellent LMU collaborators with a couple of us from Apple research!

Albert Gu (@_albertgu)'s Twitter Profile Photo

I really like this research direction! For a long time, I've been talking about the "brain vs. database" analogy of SSMs vs Transformers. An extension of this that I've mentioned offhand a few times is that I think that the tradeoffs change when we start thinking about building

Awni Hannun (@awnihannun)'s Twitter Profile Photo

I always thought the decline in fundamental AI research funding would happen because AI didn’t generate enough value to be worth the cost. But it seems like it’s happening because it generated too much value. And the race to capture that value is taking priority. Just

Zhe Gan (@zhegan4)'s Twitter Profile Photo

🎁🎁 We release Pico-Banana-400K, a large-scale, high-quality image editing dataset distilled from Nano-Banana across 35 editing types.

🔗 Data link: github.com/apple/pico-ban…

🔗 Paper link: arxiv.org/abs/2510.19808

It includes 258K single-turn image editing data, 72K multi-turn
Joshua Susskind (@jmsusskind)'s Twitter Profile Photo

Check out this opening if you are looking for a research scientist position and have relevant interest and experience. Miguel Angel Bautista is an inspiring and highly creative research scientist, and the MLR team is one of a kind!

Alexander Toshev (@alexttoshev)'s Twitter Profile Photo

If you are excited about Multimodal and Agentic Reasoning with Foundation Models, Apple ML Research has openings for Researchers, Engineers, and Interns in this area. Consider applying through the links below or feel free to send a message for more information. - Machine

Alexander Toshev (@alexttoshev)'s Twitter Profile Photo

Another great collaboration advancing Computer Use Agents here at Apple. We investigate unifying UI interactions with tool use by synthesizing appropriate data and using RL on OSWorld. This paper is a nice behind-the-scenes peek into building an agentic system.

Preetum Nakkiran (@preetumnakkiran)'s Twitter Profile Photo

LLMs are notorious for "hallucinating": producing confident-sounding answers that are entirely wrong. But with the right definitions, we can extract a semantic notion of "confidence" from LLMs, and this confidence turns out to be calibrated out-of-the-box in many settings (!)

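For readers unfamiliar with what "calibrated" means here: a model is calibrated if, among answers it gives with confidence p, roughly a fraction p are correct. A minimal sketch of the standard expected calibration error (ECE) metric, with hypothetical toy data; this is the generic binned metric, not necessarily the paper's exact protocol:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf = 1.0 into the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece

# Hypothetical toy data: the model answers with confidence 0.9 four times,
# and three of those answers are right, so the gap is |0.75 - 0.9| = 0.15
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A calibrated model drives this gap toward zero; the tweet's claim is that a suitably defined semantic confidence already behaves this way in many settings.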
Preetum Nakkiran (@preetumnakkiran)'s Twitter Profile Photo

If you liked our calibration paper and want to work with me & our team, please apply to this PhD internship (6 months, in our Paris office):
Miguel Angel Bautista (@itsbautistam)'s Twitter Profile Photo

I'm looking for TMLR reviewers for a submission on geometry-aware molecular generation that focuses on order-agnostic autoregressive modeling. If you have expertise in geometric deep learning or generative AI for drug discovery, please let me know! 🧬

Peter Gray (@peteryugray)'s Twitter Profile Photo

MLX + the Neural Accelerators in the M5 GPU = up to 4x faster LLM inference 🚀 machinelearning.apple.com/research/explo…

Jiatao Gu (@thoma_gu)'s Twitter Profile Photo

When exploring BOOT (arxiv.org/abs/2306.05544) we tried to study this data-free distillation problem — it is possible to distill without real data, but the performance lagged behind data-based methods. Happy to see data-free distillation can work so impressively well! Congrats!