Tsun-Yi Yang 楊存毅 🇹🇼🏳️‍🌈 (@shamangary)'s Twitter Profile
Tsun-Yi Yang 楊存毅 🇹🇼🏳️‍🌈

@shamangary

A proud Taiwanese boy. @RobinAI_UK LLM research engineer. Ex-Meta. PhD in computer vision at National Taiwan University (NTU)

ID: 1537755079

Joined: 22-06-2013 02:05:34

1.1K Tweets

471 Followers

656 Following

Zhenjun Zhao (@zhenjun_zhao)

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

You Shen, Zhipeng Zhang, Yansong Qu, Liujuan Cao

tl;dr: token merging → VGGT without dense global attention

arxiv.org/abs/2509.02560
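
FastVGGT's exact merging rule is in the paper, but the underlying ToMe-style token-merging step it builds on is simple enough to sketch. A minimal, illustrative PyTorch version (the function name and the plain-average merge are my assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Bipartite soft matching: merge the r most redundant tokens into
    their nearest neighbours, shrinking N tokens to N - r."""
    a, b = x[::2], x[1::2]                        # split tokens into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # cosine sim [Na, Nb]
    best_sim, best_idx = sim.max(dim=-1)          # best partner in b for each a-token
    order = best_sim.argsort(descending=True)
    merged, kept = order[:r], order[r:]           # r most similar a-tokens get merged
    b = b.clone()
    for i in merged:                              # fold each merged token into its partner
        b[best_idx[i]] = (b[best_idx[i]] + a[i]) / 2
    return torch.cat([a[kept], b], dim=0)

x = torch.randn(1370, 768)                        # one image's worth of ViT tokens
print(merge_tokens(x, r=512).shape)               # torch.Size([858, 768])
```

Applied between attention blocks, dropping the token count from thousands to hundreds cuts the quadratic attention cost, which is where a training-free speedup of this kind comes from.
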
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex)

sorry Meta "superintelligence" lab but Andrew Zhao et al. did this better and you don't cite them. Actually, many have (e.g. Bo Liu (Benjamin Liu)), whether joint maximization or minimax, verifiers or RMs. What kang of a breakthrough? Good job evaluating on Vicuna tho, peak 2023 llamacore

Wenhu Chen (@wenhuchen)

Ever wonder what's really happening when we use RL to teach LLMs to reason? 🤔 The process is full of mysteries.
🤯 What causes those sudden "aha moments" in training?
📏 Why does better reasoning often lead to longer answers ("length-scaling")?
📉 Why does token entropy often…
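
The "token entropy" in question is just the Shannon entropy of the policy's next-token distribution, averaged over generated tokens. A minimal sketch of how it is typically logged during RL training (shapes and names are my own):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq_len, vocab] from the policy over its own rollouts.
    Returns the average per-token entropy H = -sum_v p_v * log p_v."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

# falling entropy => the policy is getting more deterministic as RL sharpens it
print(mean_token_entropy(torch.randn(2, 16, 32000)))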

alphaXiv (@askalphaxiv)

Why Do Multimodal LLMs (MLLM) Struggle with Spatial Understanding?

This research shows that MLLMs’ spatial struggles aren’t from data scarcity, but from architecture. Spatial ability relies on the vision encoder’s positional cues, so a redesign like prompt targeting is needed.
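
The "positional cues" claim is easy to see in a vanilla ViT front end: self-attention is permutation-invariant, so every bit of spatial layout the encoder has comes from the positional embedding added to the patch tokens. A standard sketch (generic ViT patch embedding, not the paper's architecture):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT patch embedding: remove the pos_embed sum below and the
    patch tokens become an unordered bag with no spatial information."""
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                                  # x: [B, 3, H, W]
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # [B, N, dim]
        return tokens + self.pos_embed                     # the only spatial cue

print(PatchEmbed()(torch.randn(1, 3, 224, 224)).shape)     # [1, 196, 768]
```
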
DRONEFORGE (@thedroneforge)

< Image as an IMU: Turning Motion Blur into a Velocity Sensor >

In a new paper, researchers flip the script on motion blur. Instead of a problem to be fixed, they treat it as a rich signal for estimating a camera's instantaneous 6-DoF velocity.

From a single blurred image, their…
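
The core intuition is dimensional: a blur streak is velocity integrated over the exposure. For the special case of a purely rotating camera this collapses to one line (a back-of-envelope sketch, not the paper's 6-DoF estimator):

```python
def angular_speed_from_blur(streak_px: float, focal_px: float, exposure_s: float) -> float:
    """A point streaks roughly focal_px * omega * t pixels during an exposure
    of t seconds, so omega ≈ streak / (focal * t), in rad/s."""
    return streak_px / (focal_px * exposure_s)

# a 12 px streak with a 600 px focal length over a 10 ms exposure -> 2.0 rad/s
print(angular_speed_from_blur(12.0, 600.0, 0.010))
```
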
Zhenjun Zhao (@zhenjun_zhao)

SLAM-Former: Putting SLAM into One Transformer

Yijun Yuan, Zhuoguang Chen, Kenan Li, Weibang Wang, Hang Zhao

tl;dr: the frontend and the backend promote each other within one transformer

arxiv.org/abs/2509.16909
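
A hedged sketch of what the tl;dr could look like as control flow: one shared transformer runs in an incremental "frontend" mode and a periodic global "backend" mode, each conditioning the other. All names and the scheduling policy here are my assumptions, not the paper's code:

```python
def slam_loop(frames, model, window=8, backend_every=4):
    """`model` is one transformer exposed in two modes (assumed API)."""
    keyframes, kf_poses = [], []
    for frame in frames:
        # frontend: track the incoming frame against a local keyframe window
        pose, is_keyframe = model(keyframes[-window:], frame, mode="frontend")
        if is_keyframe:
            keyframes.append(frame)
            kf_poses.append(pose)
            # backend: periodically re-estimate all keyframe poses globally;
            # the refined map then conditions subsequent frontend tracking
            if len(keyframes) % backend_every == 0:
                kf_poses = model(keyframes, mode="backend")
    return kf_poses
```
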
Lucas Beyer (bl16) (@giffmana)

I think this project could be one of those "why have we ever done this differently?!" kind of moments. Instead of doing code training by just predicting the next token in the source file, interleave that with interpreter state, which also has to be predicted! Devil's in the…
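
The recipe is easy to prototype in Python: `sys.settrace` hands you the interpreter's local-variable state at every executed line, so you can emit training text that interleaves source with state. A toy sketch (the serialization format is my invention):

```python
import inspect
import sys
import textwrap

def interleave_with_state(fn, *args):
    """Run fn and interleave each executed source line with the local-variable
    state just before it runs (a 'line' trace event fires pre-execution)."""
    events = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    src = textwrap.dedent(inspect.getsource(fn)).splitlines()
    first = fn.__code__.co_firstlineno
    out = []
    for lineno, state in events:
        out.append(f"# state: {state}")
        out.append(src[lineno - first])
    return "\n".join(out)

def f(n):
    total = 0
    for i in range(n):
        total += i
    return total

print(interleave_with_state(f, 3))
```

The real project presumably generates this supervision at scale in its data pipeline; the point is just that the signal is nearly free to produce.
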

Kwang Moo Yi (@kwangmoo_yi)

Yang et al., "Dense Semantic Matching with VGGT Prior"

Train a decoding head for semantic segmentation, with sparse GT supervision and cycle consistency → dense non-rigid warping. Using a foundation model trained for "matching" for sure works better than "any" foundation model.
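
The cycle-consistency term is the generic one: a point matched A→B and then B→A should land back where it started. A minimal dense version with nearest-pixel lookup (the paper likely uses differentiable sampling; this is just the idea):

```python
import torch

def cycle_loss(match_ab: torch.Tensor, match_ba: torch.Tensor) -> torch.Tensor:
    """match_ab, match_ba: [H, W, 2] dense correspondences as absolute (x, y)
    target coordinates. Penalize how far the A->B->A round trip drifts."""
    H, W, _ = match_ab.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()        # identity coords in A
    tgt = match_ab.round().long()                       # nearest pixel in B
    x = tgt[..., 0].clamp(0, W - 1)
    y = tgt[..., 1].clamp(0, H - 1)
    back = match_ba[y, x]                               # where B sends us back in A
    return (back - grid).norm(dim=-1).mean()

H, W = 32, 32
ident = torch.stack(torch.meshgrid(torch.arange(H), torch.arange(W),
                                   indexing="ij")[::-1], dim=-1).float()
print(cycle_loss(ident, ident))                         # identity matches -> 0.0
```
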
martin_casado (@martin_casado)

Total insanity. This is using an adaptive LOD scheme in sparksjs (not merged yet). The entire scene has 16 million splats and this is real-time navigation ... 😱😱
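
The generic trick behind schemes like this: pick each splat's detail level from its projected screen size, so distant splats collapse into a few coarse ones. A hedged sketch of the selection rule only, not SparkJS's actual implementation:

```python
import math

def lod_level(splat_radius: float, dist: float, focal_px: float, max_level: int = 6) -> int:
    """0 = finest level. Sub-pixel splats get the coarsest level; each level
    up roughly halves the screen-space detail a splat is expected to carry."""
    radius_px = focal_px * splat_radius / max(dist, 1e-6)  # projected size in pixels
    if radius_px <= 1.0:
        return max_level
    return max(0, max_level - 1 - int(math.log2(radius_px)))

print(lod_level(0.05, dist=2.0, focal_px=800))    # close-up splat -> fine level
print(lod_level(0.05, dist=200.0, focal_px=800))  # distant splat -> coarsest level
```

Combined with frustum culling, this is what keeps a 16M-splat scene interactive: only a fraction of the splats are ever rasterized at full detail.
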

Wenhu Chen (@wenhuchen)

What’s preventing us from training open-source image editing models like Nano-Banana or Seedream?
The main barrier is the lack of high-quality training data for image editing. Most existing image editing datasets are synthesized using weak reward models or poor quality…
Thomas Fel (@napoolar)

🕳️🐇Into the Rabbit Hull – Part II

Continuing our interpretation of DINOv2, the second part of our study concerns the geometry of concepts and the synthesis of our findings toward a new representational phenomenology: the Minkowski Representation Hypothesis
Min-Hung (Steve) Chen (@cmhungsteven)

Current Vision-Language Models completely struggle with complex 4D dynamics. We fixed that. 🤯 🚨 Introducing 4D-RGPT: distilling perceptual knowledge directly into LLMs for precise space & time reasoning. 🎉 Excited to share that our NVIDIA AI work has been accepted to #CVPR2026!