Kai Zhang (@kaizhang9546)'s Twitter Profile

Research Scientist at Adobe pondering 3D's role in the AGI story. Opinions are my own.

ID: 1346885093361471489

Joined: 06-01-2021 18:23:38

115 Tweets

770 Followers

259 Following

AI at Meta (@aiatmeta):

πŸ“ New from FAIR: An Introduction to Vision-Language Modeling. Vision-language models (VLMs) are an area of research that holds a lot of potential to change our interactions with technology, however there are many challenges in building these types of models. Together with a set

πŸ“ New from FAIR: An Introduction to Vision-Language Modeling.

Vision-language models (VLMs) are an area of research that holds a lot of potential to change our interactions with technology; however, there are many challenges in building these types of models. Together with a set
Ruiqi Gao (@ruiqigao):

1-step distillation for diffusion remains challenging, sometimes vulnerable to mode collapse.

🤯 Check out our new work: EM Distillation (EMD) to tackle this! Competitive results on ImageNet 64x64, 128x128 and Stable Diffusion.

arxiv.org/abs/2405.16852

Led by the brilliant Sirui Xie
Rohan Paul (@rohanpaul_ai):

Nice paper surveying Multimodal AI Architectures -- with a comprehensive taxonomy and analysis of their pros/cons & applications in any-to-any modality model development

📌 Comprehensive Taxonomy: First work to explicitly identify and categorize four broad
MrNeRF (@janusch_patas):

Relighting Any Object via Diffusion with Neural Gaffer
Paper: arxiv.org/abs/2406.07520
Project: neural-gaffer.github.io
- End-to-end 2D relighting diffusion model that accurately relights any object in a single image under various unseen lighting conditions.
- Supports other

Rohan Paul (@rohanpaul_ai):

Today, along with 4 other models, AI at Meta released Chameleon: 7B & 34B language models.

This is based on AI at Meta's brilliant paper released in May 2024.

"Chameleon: Mixed-Modal Early-Fusion Foundation Models" 🔥

👨‍🔧 The Problem this paper solves:

Chameleon tackles the key
Alex Dimakis (@alexgdimakis):

This paper seems very interesting: say you train an LLM to play chess using only transcripts of games from players up to 1000 Elo. Is it possible that the model plays better than 1000 Elo (i.e., "transcends" the training data's performance)? It seems you get something from nothing,
lmsys.org (@lmsysorg):

Exciting news: Chatbot Arena now supports image uploads 📸 Challenge GPT-4o, Gemini, Claude, and LLaVA with your toughest questions. Plot-to-code, VQA, storytelling, you name it. Let's get creative and have fun! Leaderboard coming soon. Credits to builders Christopher Chou

Gene Chou (@gene_ch0u):

Introducing MegaScenes, a scene-level dataset containing 100K SfM reconstructions and 2M images with open content licenses. We validate its effectiveness by training large-scale, generalizable models on the task of novel view synthesis. (1/N)
Project page: megascenes.github.io

Pedro Cuenca (@pcuenq):

Optimized Depth Anything V2 for Apple Neural Engine is out! It's a huge step up from V1. Here's the small Core ML version running on my iPhone (right), compared with the previous version (left). Amazed by the fine details!

Rohan Paul (@rohanpaul_ai):

Give 'vision' capability to all of your local LLMs using the power of the Open Interpreter tool 🤯 Check out this video by killian, a really 'WOW' example. Open Interpreter is a fully open-source tool that lets LLMs run code locally (Python, JavaScript, Shell, and more) is

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr):

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

abs: arxiv.org/abs/2406.13121
code: github.com/google-deepmin…

New paper from Google DeepMind introducing the LOFT benchmark. LOFT consists of 6 long-context task categories spanning retrieval, multi-hop
James Yeung (@jamesyeung18):

πŸ“½οΈ8 new videos I made on Runway GEN-3 with prompts for each video in this post. Which one is your favourite? (No. 7 is my favourite) πŸ‘‡ 1. A drone shot of a police car travelling on a swirling road covered with snow at midnight, cinematic, very dark environment, roads only

Andrej Karpathy (@karpathy):

I feel like I have to once again pull out this figure. These 32x32 texture patches were state-of-the-art image generation in 2017 (7 years ago). What does it look like for Gen-3 and friends to look similarly silly 7 years from now?
Jon Barron (@jon_barron):

The legendary Ross Girshick just posted his CVPR workshop slides about the 1.5 decades he spent ~solving object detection as it relates to the ongoing LLM singularity. Excellent read, highly recommended. drive.google.com/file/d/1VodGlj…

Sara Rojas Martinez (@sarisro):

Exciting news! 🎉 My paper got accepted at #ECCV2024! Huge thanks to my Adobe and KAUST collaborators!
💌 DATENeRF: Depth-Aware Text-based Editing of NeRFs 💌
Sara Rojas Martinez, Julien Philip, Kai Zhang, Sai Bi, Fujun Luan, Bernard Ghanem, Kalyan Sunkavalli
datenerf.github.io/DATENeRF/

Haider. (@slow_developer):

OpenAI Co-founder Andrej Karpathy explains the new computing paradigm: "We're entering a new computing paradigm with large language models acting like CPUs, using tokens instead of bytes, and having a context window instead of RAM. This is the Large Language Model OS (LMOS)"

Jonathan Granskog (@jongranskog):

I suspect the traditional 3D pipeline for offline rendering will over time be replaced by generative models guided largely by primitive 3D scenes and generated parts. Most of the control can be achieved in 3D but fidelity comes from consistent 2D generation.

NAVER LABS Europe (@naverlabseurope):

The wait is over 📢 MAST3R is out! DUSt3R + dense local feature maps & metric depth. 1st on the #MapFreeReloc leaderboard, and it can handle 1000s of images 😀!!
Blog: shorturl.at/9JTH2
Code: github.com/naver/mast3r
Paper: arxiv.org/abs/2406.09756