Xiaochuang Han
@xiaochuanghan
PhD student at the University of Washington
ID: 4916685123
http://xhan77.github.io
Joined: 16-02-2016 00:47:56
93 Tweets
567 Followers
730 Following
Can machine unlearning make language models forget their training data? We show: yes, but at the cost of privacy and utility. Current unlearning scales poorly with the size of the data to be forgotten and can't handle sequential unlearning requests. Paper:
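The tweet doesn't name a specific method, so here is a minimal sketch of one common unlearning baseline (gradient ascent on the forget set), not necessarily the approach evaluated in the paper. The model choice, forget set, and learning rate are hypothetical placeholders.

```python
# Sketch of gradient-ascent unlearning: maximize the LM loss on the data
# to be forgotten. Model, data, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

forget_texts = ["<text the model should forget>"]  # hypothetical forget set
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for text in forget_texts:
    batch = tokenizer(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    # Gradient ASCENT: negate the loss so the optimizer pushes the model
    # to assign LOWER likelihood to the forget set.
    (-out.loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note the failure modes the tweet flags: each forgotten example costs extra updates (poor scaling with forget-set size), and repeated requests keep degrading the model (no graceful sequential unlearning).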
What do BPE tokenizers reveal about their training data? We develop an attack that uncovers the training data mixtures of commercial LLM tokenizers (incl. GPT-4o), using their ordered merge lists! Co-first author: Jonathan Hayase. arxiv.org/abs/2407.16607
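A simplified sketch of the intuition, assuming access to a tokenizer's ordered merge list and samples from candidate data categories. The paper's actual method solves an optimization over pair frequencies; this toy version just scores how well each candidate corpus alone predicts the earliest merges, since BPE greedily merges the most frequent pairs first.

```python
# Toy version of the merge-list attack intuition: a corpus that contributed
# more training data should better predict the tokenizer's earliest merges.
from collections import Counter

def pair_counts(corpus_words):
    """Count adjacent character pairs, BPE-style, over a list of words."""
    counts = Counter()
    for word in corpus_words:
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += 1
    return counts

def score_candidate(merge_list, corpus_words, top_k=1000):
    """Fraction of the tokenizer's earliest merges that rank among the
    corpus's own most frequent pairs -- a crude proxy for mixture weight."""
    top_pairs = {p for p, _ in pair_counts(corpus_words).most_common(top_k)}
    early_merges = merge_list[:top_k]
    return sum(m in top_pairs for m in early_merges) / len(early_merges)

# Hypothetical inputs: a target merge list and two candidate corpora.
merges = [("t", "h"), ("i", "n"), ("e", "r")]  # placeholder merge list
print(score_candidate(merges, ["the", "thin", "other"]))  # English-like
print(score_candidate(merges, ["def", "for", "while"]))   # code-like
```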
One of the most outside-the-box uses of LLMs I have seen. Interesting work from Xiaochuang Han.
Huge congrats to Oreva Ahia and Shangbin Feng for winning awards at #ACL2024! DialectBench: Best Social Impact Paper Award (arxiv.org/abs/2403.11009). Don't Hallucinate, Abstain: Area Chair Award (QA track) and Outstanding Paper Award (arxiv.org/abs/2402.00367).
Excited to share our latest work: Transfusion! A new multi-modal generative training recipe that combines language modeling and image diffusion in a single transformer! Huge shoutout to Chunting Zhou, Omer Levy, Michi Yasunaga, Arun Babu, Kushal Tirumala, and other collaborators.
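A toy sketch of the joint objective the tweet describes: one shared transformer trained with both a next-token language-modeling loss on text and a diffusion (noise-prediction) loss on image patches. All shapes, modules, and the noising schedule below are illustrative placeholders, not the paper's architecture.

```python
# Toy Transfusion-style step: text tokens and noisy image patches share one
# transformer; text gets a cross-entropy loss, images get a diffusion loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab, text_len = 256, 1000, 8
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)  # shared transformer
text_embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab)   # predicts the next text token
patch_in = nn.Linear(16, d_model)     # projects noisy image patches in
noise_head = nn.Linear(d_model, 16)   # predicts the noise added to patches

text = torch.randint(0, vocab, (2, text_len))  # (batch, text_len)
patches = torch.randn(2, 4, 16)                # (batch, n_patches, patch_dim)

# Diffusion side: corrupt the image patches at a random timestep.
noise = torch.randn_like(patches)
t = torch.rand(2, 1, 1)                        # toy timestep in [0, 1]
noisy = (1 - t) * patches + t * noise

# One interleaved sequence: text tokens followed by noisy image patches.
h = backbone(torch.cat([text_embed(text), patch_in(noisy)], dim=1))

lm_loss = F.cross_entropy(                      # next-token prediction
    lm_head(h[:, : text_len - 1]).reshape(-1, vocab),
    text[:, 1:].reshape(-1),
)
diff_loss = F.mse_loss(noise_head(h[:, text_len:]), noise)
loss = lm_loss + diff_loss                      # joint objective
```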
Check out JPEG-LM, a fun idea led by Xiaochuang Han -- we generate images simply by training an LM on raw JPEG bytes and show that it outperforms much more complicated VQ models, especially on rare inputs.
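A minimal sketch of the JPEG-LM idea as described in the tweet: treat the raw bytes of a JPEG file as a token sequence and train an ordinary causal LM over a 256-symbol vocabulary, with no VQ codebook or image-specific encoder. The model size, sequence length, and file path are placeholders.

```python
# Byte-level causal LM over raw JPEG bytes: tokenization is just f.read().
import torch
import torch.nn as nn

class ByteLM(nn.Module):
    def __init__(self, d_model=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)  # one token per byte value
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 256)

    def forward(self, byte_ids):
        # Causal mask so each byte only attends to earlier bytes.
        mask = nn.Transformer.generate_square_subsequent_mask(byte_ids.size(1))
        return self.head(self.blocks(self.embed(byte_ids), mask=mask))

# "Tokenize" an image by reading its JPEG bytes directly.
with open("example.jpg", "rb") as f:             # placeholder path
    byte_ids = torch.tensor(list(f.read()))[None, :512]

model = ByteLM()
logits = model(byte_ids[:, :-1])                 # predict each next byte
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 256), byte_ids[:, 1:].reshape(-1)
)
```

Generation under this setup is just autoregressive byte sampling until a valid JPEG end-of-image marker, then decoding the bytes with any standard JPEG reader.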