Alexey Gritsenko (@agritsenko) 's Twitter Profile
Alexey Gritsenko

@agritsenko

Research @ Google Brain
// opinions my own

ID: 135633621

Joined: 21-04-2010 21:32:54

33 Tweets

356 Followers

147 Following

Emiel Hoogeboom (@emiel_hoogeboom) 's Twitter Profile Photo

Generate data in a random order? A masked language model as a generative model? All this and more in "Autoregressive Diffusion Models" with Alexey Gritsenko @BastingsJasmijn Ben Poole Rianne van den Berg Tim Salimans. For details see arxiv.org/abs/2110.02037. Some explanations below...
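As a rough illustration of the idea in the tweet, here is a minimal sketch of order-agnostic generation with a masked model: positions are filled one at a time in a uniformly random order, each sampled from the model's per-position distribution. The `masked_model` below is a dummy stand-in, not the paper's architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, MASK = 10, 8, -1  # toy vocabulary size, sequence length, mask id

def masked_model(tokens):
    """Stand-in for a masked model: returns a (LENGTH, VOCAB) matrix of
    per-position token probabilities given a partially masked sequence."""
    logits = rng.normal(size=(LENGTH, VOCAB))                 # dummy logits
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

def generate():
    """Order-agnostic generation: fill positions one at a time in a
    uniformly random order, sampling each token from the model."""
    tokens = np.full(LENGTH, MASK)
    for pos in rng.permutation(LENGTH):                       # random generation order
        probs = masked_model(tokens)
        tokens[pos] = rng.choice(VOCAB, p=probs[pos])         # commit the sampled token
    return tokens

print(generate())
```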

Niels Rogge (@nielsrogge) 's Twitter Profile Photo

OWL-ViT by Google AI is now available in Hugging Face Transformers. The model is a minimal extension of CLIP for zero-shot object detection given text queries. 🤯 🥳 It has impressive generalization capabilities and is a great first step for open-vocabulary object detection! (1/2)
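For reference, a minimal usage sketch of the Transformers integration mentioned above, assuming the public `google/owlvit-base-patch32` checkpoint and a hypothetical local image file; it reads the raw detection outputs directly, since post-processing helpers vary across library versions.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("cat.jpg")                       # hypothetical local RGB image
texts = [["a photo of a cat", "a photo of a dog"]]  # free-form text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Raw outputs: per-query classification logits and predicted boxes
# (normalized; see the model docs for the exact box format).
scores = outputs.logits.sigmoid()    # (batch, num_patches, num_queries)
boxes = outputs.pred_boxes           # (batch, num_patches, 4)
best = scores.max(-1).values.argmax(-1)
print("highest-scoring box:", boxes[0, best[0]].tolist())
```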

Dumitru Erhan (@doomie) 's Twitter Profile Photo

1/ Today we are excited to introduce Phenaki: phenaki.github.io, short-link-to-paper, a model for generating videos from text, with prompts that can change over time, and that can generate videos as long as multiple minutes!

Alexey Gritsenko (@agritsenko) 's Twitter Profile Photo

Really exciting progress on scaling Vision Transformers. Turns out being clever about applying lessons learned in language modelling leads to exceptional results in vision as well!

Piotr Padlewski (@piotrpadlewski) 's Twitter Profile Photo

Do you want to accelerate your vision model without losing quality? NaViT takes images of arbitrary resolutions and aspect ratios - no more resizing to square at a constant resolution. One cool implication is that you can control the compute/quality tradeoff by resizing.
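A toy sketch (not NaViT's actual code) of why resolution becomes a compute knob: with per-image patchification and no forced square resize, the token count, and hence the cost, scales with the resolution you choose.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an image into non-overlapping patch tokens without forcing
    a square, fixed-size input; token count scales with resolution."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch          # drop the ragged border for simplicity
    grid = image[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

full = np.zeros((480, 640, 3))   # native aspect ratio, full resolution
half = np.zeros((240, 320, 3))   # same image resized down: ~4x fewer tokens
print(patchify(full).shape, patchify(half).shape)  # (1200, 768) (300, 768)
```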

Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن (@ibomohsin) 's Twitter Profile Photo

Check out NaViT, a native resolution ViT for all aspect ratios, enhancing training efficiency & performance. By preserving aspect ratios, it improves fairness-signal annotation, useful where metrics like group calibration are noise-sensitive. NaViT helps overcome such challenges.

Mostafa Dehghani (@m__dehghani) 's Twitter Profile Photo

1/ Excited to share "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution". NaViT breaks away from the CNN-designed input and modeling pipeline, sets a new course for ViTs, and opens up exciting possibilities in their development. arxiv.org/abs/2307.06304

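To make the "pack" half of Patch n' Pack concrete, here is a toy sketch (not the paper's implementation) that greedily packs variable-length per-image patch sequences into fixed-length rows, keeping per-token image ids so attention can be restricted to tokens from the same image. It assumes every individual image fits within the sequence length.

```python
import numpy as np

def pack(token_lists, seq_len, dim):
    """Greedily pack variable-length per-image patch sequences into
    fixed-length rows; per-token image ids allow masking attention
    so tokens only attend within their own image."""
    rows, ids = [], []
    row, row_id = np.zeros((0, dim)), np.zeros((0,), dtype=int)
    for img_idx, toks in enumerate(token_lists):
        if len(row) + len(toks) > seq_len:           # current row is full: pad and emit
            rows.append(np.pad(row, ((0, seq_len - len(row)), (0, 0))))
            ids.append(np.pad(row_id, (0, seq_len - len(row_id)), constant_values=-1))
            row, row_id = np.zeros((0, dim)), np.zeros((0,), dtype=int)
        row = np.concatenate([row, toks])
        row_id = np.concatenate([row_id, np.full(len(toks), img_idx)])
    rows.append(np.pad(row, ((0, seq_len - len(row)), (0, 0))))        # flush the last row
    ids.append(np.pad(row_id, (0, seq_len - len(row_id)), constant_values=-1))
    return np.stack(rows), np.stack(ids)

# Three images of different native resolutions -> different token counts.
tokens = [np.ones((n, 768)) for n in (300, 192, 450)]
packed, image_ids = pack(tokens, seq_len=512, dim=768)
print(packed.shape, image_ids.shape)  # (2, 512, 768) (2, 512)
```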
Joan Puigcerver (@joapuipe) 's Twitter Profile Photo

We explore several ways to accelerate training in the paper: token dropping is a nice trick to reduce the training cost of ViT models (or get better models for the same cost). With NaViT one can take it several steps further!
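A minimal sketch of random token dropping at training time; this illustrates the general trick, not the exact scheme used in the paper.

```python
import numpy as np

def drop_tokens(tokens, keep_ratio, rng):
    """Randomly keep a fraction of patch tokens during training to cut
    the per-image cost; at evaluation time all tokens are kept."""
    n = tokens.shape[0]
    keep = max(1, int(round(n * keep_ratio)))
    idx = np.sort(rng.choice(n, size=keep, replace=False))  # preserve patch order
    return tokens[idx], idx  # idx is needed to gather matching position embeddings

rng = np.random.default_rng(0)
tokens = np.ones((1200, 768))                 # e.g. a 480x640 image at patch size 16
kept, idx = drop_tokens(tokens, keep_ratio=0.5, rng=rng)
print(kept.shape)                             # (600, 768): roughly half the training cost
```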

Matthias Minderer (@mjlm3) 's Twitter Profile Photo

We just open-sourced OWL-ViT v2, our improved open-vocabulary object detector that uses self-training to reach >40% zero-shot LVIS APr. Check out the paper, code, and pretrained checkpoints: arxiv.org/abs/2306.09683 github.com/google-researc…. With Alexey Gritsenko and Neil Houlsby.

Alexey Gritsenko (@agritsenko) 's Twitter Profile Photo

Really nice work from my collaborators. I am particularly excited to see positive transfer of fine-grained information from image-level pre-training to instance-level tasks such as object detection. With SPARC, OWL-ViT improves on LVIS and COCO, which is super encouraging.

Matthias Minderer (@mjlm3) 's Twitter Profile Photo

AK I added OWL-ViT v2 to the plot. A single OWLv2 B/16 model, finetuned on O365+VG, covers all speed/accuracy combinations: simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683

<a href="/_akhaliq/">AK</a> I added OWL-ViT v2 to the plot. A single OWLv2 B/16 model, finetuned on O365+VG, covers all speed/accuracy combinations: Simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683
Xiaohua Zhai (@xiaohuazhai) 's Twitter Profile Photo

Introducing SigLIP2: now trained with additional captioning and self-supervised losses!

Stronger everywhere: 
- multilingual
- cls. / ret.
- localization
- ocr
- captioning / vqa

Try it out, backward compatible!

Models: github.com/google-researc…

Paper: arxiv.org/abs/2502.14786
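SigLIP2's added captioning and self-supervised losses are not reproduced here, but as background, below is a minimal NumPy sketch of the pairwise sigmoid contrastive loss the SigLIP family is built on: each image/text pair is scored as an independent binary problem, with a learned temperature `t` and bias `b` (normalization details simplified).

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid contrastive loss: every image/text pair is an
    independent binary problem (positive on the diagonal, negative off it),
    so no softmax over the whole batch is needed."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b                         # (batch, batch) similarities
    labels = 2 * np.eye(len(img)) - 1                    # +1 for matched pairs, -1 otherwise
    return np.mean(np.log1p(np.exp(-labels * logits)))   # -log sigmoid(labels * logits)

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
print(siglip_loss(img, txt, t=10.0, b=-10.0))            # t and b are learned in practice
```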