ByteDance released Tar 1.5B and 7B: image-text-in, image-text-out models 👏
They use an image tokenizer unified with the text vocabulary, and de-tokenize with either of two decoders (an autoregressive LLM or a diffusion model)
The model is actually a full LLM (Qwen2); the tokenizer maps images into the same token space the LLM reads and writes 🤯 (rough sketch below)
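A minimal sketch of that flow, assuming hypothetical names (Token, tokenize_image, the two decoders) rather than ByteDance's actual API:

```python
# Illustrative sketch of a unified image-text pipeline in the spirit of Tar:
# images become tokens in the LLM's vocabulary, one LLM generates mixed
# text/image tokens, and either of two decoders turns image tokens back
# into pixels. All names and numbers here are made up for illustration.
from dataclasses import dataclass

@dataclass
class Token:
    id: int
    is_image: bool  # image tokens live in the same vocab as text tokens

def tokenize_image(pixels: bytes) -> list[Token]:
    """Stand-in for the text-aligned image tokenizer."""
    return [Token(id=100_000 + b, is_image=True) for b in pixels]

def llm_generate(prompt: list[Token]) -> list[Token]:
    """Stand-in for the Qwen2-based LLM predicting mixed tokens."""
    return prompt + [Token(id=100_042, is_image=True)]

def detokenize_autoregressive(tokens: list[Token]) -> bytes:
    """Decoder option 1: fast autoregressive de-tokenizer."""
    return bytes(t.id % 256 for t in tokens if t.is_image)

def detokenize_diffusion(tokens: list[Token]) -> bytes:
    """Decoder option 2: diffusion de-tokenizer over the same tokens."""
    return bytes(t.id % 256 for t in tokens if t.is_image)

if __name__ == "__main__":
    out = llm_generate(tokenize_image(b"\x01\x02\x03\x04"))
    # The key point: both decoders accept the exact same image tokens.
    print(detokenize_autoregressive(out), detokenize_diffusion(out))
```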
I just tried out Vidu AI, and the Reference-to-Video feature is top-notch
You can control everything in your scene and keep character consistency, all in one place!
Thread on how to use it, including prompts 👇
Skywork-R1V3 🔥 New multimodal reasoning model by Skywork
huggingface.co/collections/Sk…
✨ 38B - MIT license
✨ RL-boosted fine-tuning
✨ Entropy of critical reasoning tokens as a key indicator (see the sketch after this list)
✨ SOTA results on multimodal reasoning benchmarks
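To make the entropy bullet concrete, here is a minimal sketch of measuring per-token entropy and flagging "critical" reasoning tokens; the vocabulary, probabilities, and critical-token set are assumptions for illustration, not Skywork's actual recipe:

```python
# Shannon entropy of next-token distributions: a peaked distribution on a
# pivotal token (e.g. "therefore") reads as confident reasoning, a flat one
# as uncertainty. All distributions below are fabricated for illustration.
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Fake distributions over a tiny 4-token vocabulary at three decoding steps.
steps = {
    "the":       [0.25, 0.25, 0.25, 0.25],  # uniform -> max entropy (~1.386)
    "therefore": [0.90, 0.05, 0.03, 0.02],  # peaked -> low entropy
    "answer":    [0.70, 0.20, 0.05, 0.05],
}

critical = {"therefore", "answer"}  # hypothetical critical reasoning tokens

for tok, probs in steps.items():
    flag = " (critical)" if tok in critical else ""
    print(f"{tok!r}: H = {token_entropy(probs):.3f} nats{flag}")
```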
It’s so over.
Higgsfield just dropped Soul ID, a consistent character model built to lock your face, style, and emotion in every shot.
Train once and get a full pack of styled, cinematic versions of you.
This is what real personalization looks like.
8 wild examples: