Anh Thai (@ngailapdi) Twitter Tweets • TwiCopy

Stefan Stojanov

8 months ago

Extracting structure that’s implicitly learned by video foundation models _without_ relying on labeled data is a fundamental challenge. What’s a better place to start than extracting motion? Temporal correspondence is a key building block of perception. Check out our paper!

Likes: 34 · Replies: 1 · Reposts: 7

Bolin Lai

@bryanislucky

7 months ago

📢#CVPR2025 Introducing InstaManip, a novel multimodal autoregressive model for few-shot image editing. 🎯InstaManip can learn a new image editing operation from textual and visual guidance via in-context learning, and apply it to new query images. [1/8] bolinlai.github.io/projects/Insta…

Likes: 10 · Replies: 1 · Reposts: 4

Jia-Bin Huang

@jbhuang0604

6 months ago

Exploration is key for robots to generalize, especially in open-ended environments with vague goals and sparse rewards. BUT, how do we go beyond random poking? Wouldn't it be great to have a robot that explores an environment just like a kid? Introducing Imagine, Verify,

Likes: 164 · Replies: 3 · Reposts: 31

Shangchen Zhou

@shangchenzhou

4 months ago

With #ObjectClear, you can now remove any objects, along with their shadows and reflections, from your images in just a few clicks or strokes! 👉Try our demo (click version): huggingface.co/spaces/jixin01… Big thanks to AK Adina Yakup!

Likes: 118 · Replies: 5 · Reposts: 19

Zhenjun Zhao

@zhenjun_zhao

3 months ago

Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi

tl;dr: CLIP->SLAM3R; CLIP+DINO+CG3D->2D-3D fused descriptor

arxiv.org/abs/2507.22052

Likes: 72 · Replies: 0 · Reposts: 15

Zhenjun Zhao

@zhenjun_zhao

3 months ago

Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park tl;dr:

Likes: 54 · Replies: 1 · Reposts: 8

Zhenjun Zhao

@zhenjun_zhao

3 months ago

A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding tl;dr: in title arxiv.org/abs/2508.09977

Likes: 45 · Replies: 0 · Reposts: 7

Rana Hanocka

@ranahanocka

3 months ago

We’ve been building something we’re 𝑟𝑒𝑎𝑙𝑙𝑦 excited about – LL3M: LLM-powered agents that turn text into editable 3D assets. LL3M models shapes as interpretable Blender code, making geometry, appearance, and style easy to modify. 🔗 threedle.github.io/ll3m 1/

Likes: 384 · Replies: 10 · Reposts: 64

Xingang Pan

@xingangp

3 months ago

Introducing 𝗦𝗧𝗿𝗲𝗮𝗺𝟯𝗥, a new 3D geometric foundation model for efficient 3D reconstruction from streaming input. Similar to LLMs, STream3R uses causal attention during training and KVCache at inference. No need to worry about post-alignment or reconstructing from scratch.
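
For readers unfamiliar with the LLM analogy in the tweet above, here is a toy sketch of the general idea: with causal attention, each token attends only to itself and earlier tokens, so at inference the past keys and values can be cached and each new input processed once. This is not STream3R's code. The function names, the `KVCache` class, and the tiny 2-D vectors are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def causal_attention(queries, keys, values):
    """Training-style pass: token t sees only tokens 0..t (the causal mask)."""
    return [
        attend(q, keys[: t + 1], values[: t + 1])
        for t, q in enumerate(queries)
    ]


class KVCache:
    """Inference-style pass: store past keys/values, add one token per step."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Hypothetical toy sequence of three tokens with 2-D features.
queries = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.9]]
keys = [[0.2, 0.1], [-0.3, 0.7], [0.6, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

full = causal_attention(queries, keys, values)

cache = KVCache()
streamed = [cache.step(q, k, v) for q, k, v in zip(queries, keys, values)]
```

In this sketch `streamed` matches `full` token by token, which is why a causally trained model can consume a stream incrementally instead of recomputing attention over the whole sequence each time a new input arrives.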

Likes: 318 · Replies: 5 · Reposts: 58

Zhenjun Zhao

@zhenjun_zhao

3 months ago

GSFix3D: Diffusion-Guided Repair of Novel Views in Gaussian Splatting

Jiaxin Wei, Stefan Leutenegger, Simon Schaefer

tl;dr: fuse mesh and 3DGS->rendered images->pretrained diffusion model+random mask augmentation->removes artifacts+inpainting+completion

arxiv.org/abs/2508.14717

Likes: 26 · Replies: 0 · Reposts: 4

Kwang Moo Yi

@kwangmoo_yi

3 months ago

Wei et al., "GSFix3D: Diffusion-Guided Repair of Novel Views in Gaussian Splatting" Fine-tune a diffusion model to fix missing holes and such in 3DGS reconstructions. Similar to other works that do this, but interestingly, it uses meshes alongside 3DGS to remove floaters, etc.

Likes: 59 · Replies: 1 · Reposts: 10

Fei-Fei Li

@drfeifei

3 months ago

A picture now is worth more than a thousand words in genAI; it can be turned into a full 3D world! And you can stroll in this garden endlessly long, it will still be there.

Likes: 2.2K · Replies: 123 · Reposts: 262

MrNeRF

@janusch_patas

a month ago

Human3R: Everyone Everywhere All at Once Note: I recorded the video from the interactive demo on their project page (linked in the comment below). Abstract (excerpt): Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scenes ("everywhere"), and

Likes: 646 · Replies: 13 · Reposts: 95

Michael Niemeyer

@mi_niemeyer

a month ago

How do we reconstruct a 3D scene from photos with varying exposures? Standard methods often fail, leaving you with blown-out colors or disturbing shadows. We're excited to introduce Neural Exposure Fields (NExF), our new work accepted at #NeurIPS2025! 🧵

Likes: 1.1K · Replies: 11 · Reposts: 112

Tenny Yin

@tennyyin

a month ago

Does VGGT offer an edge over DINO in spatial tasks? New research shows that visual-only features (DINO) outperform visual-geometry features (VGGT) here!

Likes: 252 · Replies: 4 · Reposts: 27

Xiaoyang Wu

@xiaoyangwu_

15 days ago

Introducing Concerto 🎶 Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations. What is it: Concerto is a self-supervised Point Transformer V3 that jointly learns from 2D and 3D modalities, producing rich spatial representations. It can take both point clouds and

Likes: 155 · Replies: 3 · Reposts: 30