Zijun Wu (@zijunwu88) 's Twitter Profile
Zijun Wu

@zijunwu88

PhD student @AmiiThinks @ualberta / previously AS @AWS

ID: 1708562995004182528

Link: https://khalilbalaree.github.io/ · Joined: 01-10-2023 19:22:31

23 Tweets

354 Followers

619 Following

jack morris (@jxmnop) 's Twitter Profile Photo

fun research idea: Latent chain-of-thought / Latent scratchpad

it's well-known that language models perform better when they generate intermediate reasoning tokens through some sort of 'scratchpad'. 

but there's no reason scratchpad tokens need to be human-readable. in fact,
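
One way to make the idea concrete (a minimal sketch, assuming a HuggingFace causal LM; the model name, the four latent steps, and the "feed the last hidden state back as the next input embedding" rule are illustrative choices, not an established recipe):

```python
# Minimal sketch of a latent scratchpad: instead of sampling readable tokens,
# feed the model's own last hidden state back in as a continuous input for a
# few steps, then decode a visible answer token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Q: I have 3 apples and buy 2 more. How many apples do I have? A:"
ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)            # (1, seq_len, d_model)

with torch.no_grad():
    for _ in range(4):                                 # 4 latent, non-readable steps
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        latent = out.hidden_states[-1][:, -1:, :]      # last position's hidden state
        embeds = torch.cat([embeds, latent], dim=1)    # append it as the next "token"
    out = model(inputs_embeds=embeds)
    answer_id = out.logits[:, -1, :].argmax(dim=-1)    # first visible answer token

print(tok.decode(answer_id))
```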
darren (@darrenangle) 's Twitter Profile Photo

LLM papers be like: ClearPrompt: Saying What You Mean Very Clearly Instead of Not Very Clearly Boosts Performance Up To 99% TotallyLegitBench: Models Other Than Ours Perform Poorly At An Eval We Invented LookAtData: We Looked At Our Data Before Training Our Model On It

Zijun Wu (@zijunwu88) 's Twitter Profile Photo

So excited for my paper to be accepted by ICLR 2024 #ICLR2024 ! In this paper, we explored a zero-shot method transferring the continuous prompt induced on one LM to the others.  We had some interesting findings, please refer to our paper for more details!
openreview.net/forum?id=26Xph…
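
For intuition, a rough sketch of one way a continuous prompt could be carried from one LM to another when they share a vocabulary (the relative-representation trick, temperature, and model names below are illustrative assumptions, not necessarily the paper's exact procedure):

```python
# Sketch: express each soft-prompt vector relative to the source model's token
# embeddings, then rebuild it from the target model's token embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

src = AutoModelForCausalLM.from_pretrained("gpt2")
tgt = AutoModelForCausalLM.from_pretrained("distilgpt2")     # shares GPT-2's vocab

E_src = src.get_input_embeddings().weight                    # (V, d_src)
E_tgt = tgt.get_input_embeddings().weight                    # (V, d_tgt)

soft_prompt_src = torch.randn(20, E_src.shape[1])            # stand-in for a tuned prompt

# similarity of each prompt vector to every source-vocabulary embedding
rel = F.softmax(soft_prompt_src @ E_src.T / 0.1, dim=-1)     # (20, V)

# reconstruct the prompt in the target model's embedding space
soft_prompt_tgt = rel @ E_tgt                                 # (20, d_tgt)
```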
Zijun Wu (@zijunwu88) 's Twitter Profile Photo

Inspired by this, we found that task semantics exist in the tuned prompt embeddings and are transferable between different LMs: openreview.net/pdf?id=26Xphug…

Benjamin Minixhofer (@bminixhofer) 's Twitter Profile Photo

Introducing Zero-Shot Tokenizer Transfer (ZeTT) ⚡

ZeTT frees language models from their tokenizer, allowing you to use any model with any tokenizer, with little or no extra training.

Super excited to (finally!) share the first project of my PhD🧵
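
For context, a common heuristic baseline for swapping tokenizers (not ZeTT's learned hypernetwork) is to initialize each new token's embedding from the old embeddings of its pieces under the old tokenizer; a minimal, slow-but-simple sketch with illustrative model and tokenizer names:

```python
# Heuristic baseline: new token embedding = mean of the old-tokenizer piece
# embeddings of that token's surface string. Illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
old_tok = AutoTokenizer.from_pretrained("gpt2")
new_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # a different tokenizer

old_emb = model.get_input_embeddings().weight                  # (V_old, d)
new_emb = torch.zeros(len(new_tok), old_emb.shape[1])

with torch.no_grad():
    for token, new_id in new_tok.get_vocab().items():
        surface = new_tok.convert_tokens_to_string([token])
        old_ids = old_tok(surface, add_special_tokens=False).input_ids
        if old_ids:
            new_emb[new_id] = old_emb[old_ids].mean(dim=0)

model.resize_token_embeddings(len(new_tok))                    # switch to the new vocab size
model.get_input_embeddings().weight.data.copy_(new_emb)
```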
David Samuel (@davidsamuelcz) 's Twitter Profile Photo

We propose a simple inference technique that turns a pretrained masked language model into an autoregressive model that can generate text without any further training. With this, we can apply DeBERTa to any kind of 0/1/few-shot task.

2/6
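
The core trick can be sketched in a few lines (using roberta-base here purely for simplicity, and plain greedy filling rather than the paper's exact inference scheme for DeBERTa):

```python
# Use a masked LM as a left-to-right generator: append a mask token, fill it
# in greedily, repeat. Illustration of the idea only.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

text = "The capital of France is"
for _ in range(5):                                              # generate five tokens
    enc = tok(text, return_tensors="pt").input_ids               # <s> ... </s>
    mask = torch.tensor([[tok.mask_token_id]])
    ids = torch.cat([enc[:, :-1], mask, enc[:, -1:]], dim=1)     # ... <mask> </s>
    with torch.no_grad():
        logits = mlm(ids).logits
    next_id = int(logits[0, -2, :].argmax())                     # prediction at <mask>
    text += tok.decode([next_id])
print(text)
```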
Sophia Yang, Ph.D. (@sophiamyang) 's Twitter Profile Photo

Great paper summarizing the prompt techniques - The Prompt Report. 

- 58 text-only prompting techniques including zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition techniques. Few-shot CoT performs the best among the few techniques they
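
For reference, the kind of few-shot chain-of-thought prompt the report compares looks roughly like this (the exemplars are made up for illustration):

```python
# Illustrative few-shot CoT prompt template; exemplars are invented.
few_shot_cot = """\
Q: A pen costs $2 and a notebook costs $3. How much do 2 pens and 1 notebook cost?
A: 2 pens cost 2 * $2 = $4. One notebook costs $3. Total: $4 + $3 = $7. The answer is 7.

Q: There are 15 trees. Workers plant 6 more. How many trees are there now?
A: 15 + 6 = 21. The answer is 21.

Q: {question}
A:"""
```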
Sean Welleck (@wellecks) 's Twitter Profile Photo

What do nucleus sampling, tree-of-thought, and PagedAttention have in common?

They're all part of our new survey: "From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models"

arxiv.org/abs/2406.16838
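
One concrete example from the decoding end of that spectrum, a minimal nucleus (top-p) sampling sketch:

```python
# Nucleus (top-p) sampling: sample only from the smallest set of tokens whose
# cumulative probability exceeds p.
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    keep = cum - sorted_probs < p            # keep tokens until mass p is reached
    kept_probs = sorted_probs[keep]
    kept_ids = sorted_ids[keep]
    idx = torch.multinomial(kept_probs / kept_probs.sum(), 1)
    return int(kept_ids[idx])

next_token = nucleus_sample(torch.randn(50257), p=0.9)   # toy logits over a GPT-2-sized vocab
```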
Vaishnavh Nagarajan (@_vaishnavh) 's Twitter Profile Photo

Looking forward to presenting our #ICML paper advocating multi-token prediction and correcting what it really means to say "next-token prediction cannot do what humans do" --- which is often argued poorly.

Gregor Bachmann and I just updated the camera ready version on arxiv.
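
As a rough sketch of what multi-token prediction usually looks like in practice (a shared trunk with k output heads, head i supervised by the token i+1 positions ahead; the plain linear heads and loss below are illustrative, not necessarily the paper's setup):

```python
# k output heads over a shared hidden state; head i predicts the token i+1 steps ahead.
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(k))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> (k, batch, seq, vocab)
        return torch.stack([head(hidden) for head in self.heads])

def multi_token_loss(logits_k: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # targets: (batch, seq) token ids; head i is supervised i+1 steps ahead
    loss = 0.0
    for i, logits in enumerate(logits_k):
        shift = i + 1
        pred = logits[:, :-shift, :]
        gold = targets[:, shift:]
        loss = loss + nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), gold.reshape(-1))
    return loss / len(logits_k)
```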
Lilian Weng (@lilianweng) 's Twitter Profile Photo

Wrote about extrinsic hallucinations during the July 4th break. lilianweng.github.io/posts/2024-07-… Here is what ChatGPT suggested as a fun tweet for the blog: 🚀 Dive into the wild world of AI hallucinations! 🤖 Discover how LLMs can conjure up some seriously creative (and sometimes

Zeyuan Allen-Zhu (@zeyuanallenzhu) 's Twitter Profile Photo

If you're attending ICML 2024, join my 2-hour tutorial on Monday July 22 to explore the Physics of Language Model - all 6 parts. Visit: physics.allen-zhu.com and it will be live-streamed on Zoom. BONUS: this is the premiere of Part 2.1 + 2.2, don't miss out!  #ICML2024 #MetaAI
Yongchang Hao (@yongchanghao) 's Twitter Profile Photo

I am attending #ICML2024 this year to present Flora, in which I will talk about how we achieved memory saving with gradient compression and enabled pre-training significantly larger models. Come join us in Poster Session 4 (Jul 24 Wed, afternoon), Hall C 4-9 #2706.
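
For intuition about how gradient compression can shrink optimizer memory (a conceptual sketch only; the random projection, rank, and momentum update below are illustrative and not Flora's exact algorithm):

```python
# Keep optimizer state in a randomly projected low-rank space and project back
# up when applying the update; the projection is re-derived from a seed instead
# of being stored.
import torch

def compressed_momentum_step(weight, grad, state, rank=8, beta=0.9, lr=1e-3, seed=0):
    m, n = grad.shape
    gen = torch.Generator().manual_seed(seed)
    P = torch.randn(n, rank, generator=gen) / rank ** 0.5   # (n, rank) projection

    compressed = grad @ P                                    # (m, rank)
    state.mul_(beta).add_(compressed, alpha=1 - beta)        # low-rank momentum
    update = state @ P.T                                     # decompress to (m, n)
    weight.data.add_(update, alpha=-lr)

W = torch.randn(256, 512, requires_grad=True)
momentum = torch.zeros(256, 8)                               # rank-8 state vs. a 256x512 buffer
# inside a training loop, after loss.backward():
# compressed_momentum_step(W, W.grad, momentum)
```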
Yuntian Deng (@yuntiandeng) 's Twitter Profile Photo

We trained GPT2 to predict the product of two numbers up to 🌟20🌟 digits w/o intermediate reasoning steps, surpassing our previous 15-digit demo! How does a 12-layer LM solve 20-digit multiplication w/o CoT?🤯 Try our demo: huggingface.co/spaces/yuntian… Paper: bit.ly/internalize_st…

Aakash Kumar Nain (@a_k_nain) 's Twitter Profile Photo

I went through the Llama-3 technical report (92 pages!). The report is very detailed, and it will be hard to describe everything in a single tweet, but I will try to summarize it in the best possible way. Here we go...

Overview
- Standard dense Transformer with minor changes
-
Zijun Wu (@zijunwu88) 's Twitter Profile Photo

Replying to Alexander | AI Operations and jack morris: Thanks for mentioning our work! 🙌 It offers a fresh perspective on prompt transferability in the non-discrete setting. We hope it can inspire more research in this direction.