Alexey Gritsenko (@agritsenko) 's Twitter Profile
Alexey Gritsenko

@agritsenko

Research @ Google Brain
// opinions my own

ID: 135633621

Joined: 21-04-2010 21:32:54

33 Tweets

356 Followers

147 Following

Emiel Hoogeboom (@emiel_hoogeboom) 's Twitter Profile Photo

Generate data in a random order? A masked language model as a generative model? All this and more in "Autoregressive Diffusion Models" with Alexey Gritsenko @BastingsJasmijn Ben Poole Rianne van den Berg Tim Salimans. For details see arxiv.org/abs/2110.02037. Some explanations below...
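As a rough illustration of the idea in the tweet, here is a minimal sketch of order-agnostic generation with a masked model: positions are filled one at a time in a uniformly random order, each sampled from the model's per-position distribution. The `masked_model` below is a dummy stand-in, not the paper's architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, MASK = 10, 8, -1  # toy vocabulary size, sequence length, mask id

def masked_model(tokens):
    """Stand-in for a masked model: returns a (LENGTH, VOCAB) matrix of
    per-position token probabilities given a partially masked sequence."""
    logits = rng.normal(size=(LENGTH, VOCAB))                 # dummy logits
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

def generate():
    """Order-agnostic generation: fill positions one at a time in a
    uniformly random order, sampling each token from the model."""
    tokens = np.full(LENGTH, MASK)
    for pos in rng.permutation(LENGTH):                       # random generation order
        probs = masked_model(tokens)
        tokens[pos] = rng.choice(VOCAB, p=probs[pos])         # commit the sampled token
    return tokens

print(generate())
```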

Niels Rogge (@nielsrogge) 's Twitter Profile Photo

OWL-ViT by Google AI is now available in Hugging Face Transformers. The model is a minimal extension of CLIP for zero-shot object detection given text queries. 🤯 🥳 It has impressive generalization capabilities and is a great first step for open-vocabulary object detection! (1/2)
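For reference, a minimal usage sketch of the Transformers integration mentioned above, assuming the public `google/owlvit-base-patch32` checkpoint and a hypothetical local image file; it reads the raw detection outputs directly, since post-processing helpers vary across library versions.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("cat.jpg")                       # hypothetical local RGB image
texts = [["a photo of a cat", "a photo of a dog"]]  # free-form text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Raw outputs: per-query classification logits and predicted boxes
# (normalized; see the model docs for the exact box format).
scores = outputs.logits.sigmoid()    # (batch, num_patches, num_queries)
boxes = outputs.pred_boxes           # (batch, num_patches, 4)
best = scores.max(-1).values.argmax(-1)
print("highest-scoring box:", boxes[0, best[0]].tolist())
```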

Dumitru Erhan (@doomie) 's Twitter Profile Photo

1/ Today we are excited to introduce Phenaki: phenaki.github.io, short-link-to-paper, a model for generating videos from text, with prompts that can change over time, and that can generate videos as long as multiple minutes!

Alexey Gritsenko (@agritsenko) 's Twitter Profile Photo

Really exciting progress on scaling Vision Transformers. Turns out being clever about applying lessons learned in language modelling leads to exceptional results in vision as well!

Piotr Padlewski (@piotrpadlewski) 's Twitter Profile Photo

Do you want to accelerate your vision model without losing quality? NaViT takes images of arbitrary resolutions and aspect ratios - no more resizing to square at a constant resolution. One cool implication is that you can control the compute/quality tradeoff by resizing.
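A toy sketch (not NaViT's actual code) of why resolution becomes a compute knob: with per-image patchification and no forced square resize, the token count, and hence the cost, scales with the resolution you choose.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an image into non-overlapping patch tokens without forcing
    a square, fixed-size input; token count scales with resolution."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch          # drop the ragged border for simplicity
    grid = image[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

full = np.zeros((480, 640, 3))   # native aspect ratio, full resolution
half = np.zeros((240, 320, 3))   # same image resized down: ~4x fewer tokens
print(patchify(full).shape, patchify(half).shape)  # (1200, 768) (300, 768)
```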

Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن (@ibomohsin) 's Twitter Profile Photo

Check out NaViT, a native resolution ViT for all aspect ratios, enhancing training efficiency & performance. By preserving aspect ratios, it improves fairness-signal annotation, useful where metrics like group calibration are noise-sensitive. NaViT helps overcome such challenges.

Mostafa Dehghani (@m__dehghani) 's Twitter Profile Photo

1/ Excited to share "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution". NaViT breaks away from the CNN-designed input and modeling pipeline, sets a new course for ViTs, and opens up exciting possibilities in their development. arxiv.org/abs/2307.06304

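To make the "pack" half of Patch n' Pack concrete, here is a toy sketch (not the paper's implementation) that greedily packs variable-length per-image patch sequences into fixed-length rows, keeping per-token image ids so attention can be restricted to tokens from the same image. It assumes every individual image fits within the sequence length.

```python
import numpy as np

def pack(token_lists, seq_len, dim):
    """Greedily pack variable-length per-image patch sequences into
    fixed-length rows; per-token image ids allow masking attention
    so tokens only attend within their own image."""
    rows, ids = [], []
    row, row_id = np.zeros((0, dim)), np.zeros((0,), dtype=int)
    for img_idx, toks in enumerate(token_lists):
        if len(row) + len(toks) > seq_len:           # current row is full: pad and emit
            rows.append(np.pad(row, ((0, seq_len - len(row)), (0, 0))))
            ids.append(np.pad(row_id, (0, seq_len - len(row_id)), constant_values=-1))
            row, row_id = np.zeros((0, dim)), np.zeros((0,), dtype=int)
        row = np.concatenate([row, toks])
        row_id = np.concatenate([row_id, np.full(len(toks), img_idx)])
    rows.append(np.pad(row, ((0, seq_len - len(row)), (0, 0))))        # flush the last row
    ids.append(np.pad(row_id, (0, seq_len - len(row_id)), constant_values=-1))
    return np.stack(rows), np.stack(ids)

# Three images of different native resolutions -> different token counts.
tokens = [np.ones((n, 768)) for n in (300, 192, 450)]
packed, image_ids = pack(tokens, seq_len=512, dim=768)
print(packed.shape, image_ids.shape)  # (2, 512, 768) (2, 512)
```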
Joan Puigcerver (@joapuipe) 's Twitter Profile Photo

We explore several ways to accelerate training in the paper: token dropping is a nice trick to reduce the training cost of ViT models (or get better models for the same cost). With NaViT one can take it several steps further!
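A minimal sketch of random token dropping at training time; this illustrates the general trick, not the exact scheme used in the paper.

```python
import numpy as np

def drop_tokens(tokens, keep_ratio, rng):
    """Randomly keep a fraction of patch tokens during training to cut
    the per-image cost; at evaluation time all tokens are kept."""
    n = tokens.shape[0]
    keep = max(1, int(round(n * keep_ratio)))
    idx = np.sort(rng.choice(n, size=keep, replace=False))  # preserve patch order
    return tokens[idx], idx  # idx is needed to gather matching position embeddings

rng = np.random.default_rng(0)
tokens = np.ones((1200, 768))                 # e.g. a 480x640 image at patch size 16
kept, idx = drop_tokens(tokens, keep_ratio=0.5, rng=rng)
print(kept.shape)                             # (600, 768): roughly half the training cost
```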

Matthias Minderer (@mjlm3) 's Twitter Profile Photo

We just open-sourced OWL-ViT v2, our improved open-vocabulary object detector that uses self-training to reach >40% zero-shot LVIS APr. Check out the paper, code, and pretrained checkpoints: arxiv.org/abs/2306.09683 github.com/google-researc…. With Alexey Gritsenko and Neil Houlsby.

Alexey Gritsenko (@agritsenko) 's Twitter Profile Photo

Really nice work from my collaborators. I am particularly excited to see positive transfer of fine-grained information from image-level pre-training to instance-level tasks such as object detection. With SPARC, OWL-ViT improves on LVIS and COCO, which is super encouraging.

Matthias Minderer (@mjlm3) 's Twitter Profile Photo

AK I added OWL-ViT v2 to the plot. A single OWLv2 B/16 model, finetuned on O365+VG, covers all speed/accuracy combinations: simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683

<a href="/_akhaliq/">AK</a> I added OWL-ViT v2 to the plot. A single OWLv2 B/16 model, finetuned on O365+VG, covers all speed/accuracy combinations: Simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683
Xiaohua Zhai (@xiaohuazhai) 's Twitter Profile Photo

Introducing SigLIP2: now trained with additional captioning and self-supervised losses!

Stronger everywhere: 
- multilingual
- cls. / ret.
- localization
- ocr
- captioning / vqa

Try it out, backward compatible!

Models: github.com/google-researc…

Paper: arxiv.org/abs/2502.14786
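SigLIP2's added captioning and self-supervised losses are not reproduced here, but as background, below is a minimal NumPy sketch of the pairwise sigmoid contrastive loss the SigLIP family is built on: each image/text pair is scored as an independent binary problem, with a learned temperature `t` and bias `b` (normalization details simplified).

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid contrastive loss: every image/text pair is an
    independent binary problem (positive on the diagonal, negative off it),
    so no softmax over the whole batch is needed."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b                         # (batch, batch) similarities
    labels = 2 * np.eye(len(img)) - 1                    # +1 for matched pairs, -1 otherwise
    return np.mean(np.log1p(np.exp(-labels * logits)))   # -log sigmoid(labels * logits)

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
print(siglip_loss(img, txt, t=10.0, b=-10.0))            # t and b are learned in practice
```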