Harold Benoit (@harold_matmul)'s Twitter Profile
Harold Benoit

@harold_matmul

Another day of being a researcher in theory but an engineer in practice | tech staff @LiquidAI_

ID: 1777370251673448452

Link: https://notes.haroldbenoit.com/ · Joined: 08-04-2024 16:18:05

72 Tweets

288 Followers

210 Following

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

You could just do MatFormer training + Matryoshka quantization to get a whole family of inference checkpoints (at multiple sizes and quantization levels) from a single full-precision checkpoint
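A minimal sketch of what the MatFormer half of that recipe could look like: read a smaller nested sub-model straight out of the full-precision FFN weights, then quantize each slice at several bit-widths, Matryoshka-style. The slice_ffn helper and the ratio knob below are purely illustrative, not any real codebase.

import torch
import torch.nn as nn

def slice_ffn(up: nn.Linear, down: nn.Linear, ratio: float):
    # MatFormer-style extraction (sketch): keep only the first `ratio` fraction
    # of the FFN hidden units, so a smaller sub-model can be read straight out
    # of the full-precision checkpoint.
    hidden = int(up.out_features * ratio)
    sub_up = nn.Linear(up.in_features, hidden, bias=up.bias is not None)
    sub_down = nn.Linear(hidden, down.out_features, bias=down.bias is not None)
    with torch.no_grad():
        sub_up.weight.copy_(up.weight[:hidden])
        sub_down.weight.copy_(down.weight[:, :hidden])
        if up.bias is not None:
            sub_up.bias.copy_(up.bias[:hidden])
        if down.bias is not None:
            sub_down.bias.copy_(down.bias)
    return sub_up, sub_down

# One full-precision checkpoint -> several nested sizes, each of which could
# then be quantized at multiple bit-widths (the Matryoshka quantization part).
full_up, full_down = nn.Linear(1024, 4096), nn.Linear(4096, 1024)
half_up, half_down = slice_ffn(full_up, full_down, ratio=0.5)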

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Very happy to release those models to the community. Lots of care went into the architecture design.

1. It has low cache requirements (the gated convolutions only require a cache size of batch_size x 3 x d_model)
2. It has fewer FLOPs than a standard transformer, even at short
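A minimal sketch of how a gated convolution can get away with a batch_size x 3 x d_model cache, assuming a depthwise causal conv with kernel size 3 so decoding only ever needs the last three projected inputs per channel. The sigmoid gating placement is a guess, not the released architecture.

import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    # Sketch: depthwise causal conv (kernel size 3) plus a sigmoid gate.
    # At decode time the only state is the last 3 projected inputs per channel,
    # i.e. a cache of shape (batch_size, 3, d_model).
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.in_proj = nn.Linear(d_model, 2 * d_model)            # value + gate
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              groups=d_model, padding=0)          # depthwise
        self.out_proj = nn.Linear(d_model, d_model)

    def step(self, x: torch.Tensor, cache: torch.Tensor):
        # x: (batch, d_model) for one token; cache: (batch, kernel_size, d_model)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        cache = torch.cat([cache[:, 1:], v.unsqueeze(1)], dim=1)  # roll the window
        y = self.conv(cache.transpose(1, 2)).squeeze(-1)          # (batch, d_model)
        return self.out_proj(torch.sigmoid(g) * y), cache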

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

You could have a functional RL + inference engine in ~2.5k LOC.
1. RL2 (github.com/ChenmienTan/RL2)
2. nano-vllm (github.com/GeeeekExplorer…)

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

VLM learn to "see" by aligning the representation from the image encoder to the text representation from the LLM, using image-text pairs. Has anyone tried to make a VLM learn to "see" purely through exploration/RL? A setup could be to 1. feed the encoded image (with or

VLM learn to "see" by aligning the representation from the image encoder to the text representation from the LLM, using image-text pairs.

Has anyone tried to make a VLM learn to "see" purely through exploration/RL? 

A setup could be to 
1. feed the encoded image (with or
Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Mixture-of-experts has this nice property of being optimal w.r.t. scaling:
1. compute
2. suffering to write the infra to do training, quantization & inference

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Training: 
282+1=284 in bf16, lfg some regularization for my training.

Inference:  
Leaving torch.backends.cudnn.allow_tf32 = True, too bad bruv, that's a 20 point drop on AIME
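The arithmetic checks out: bf16 keeps only 8 significand bits, so between 256 and 512 representable values are 2 apart and 282 + 1 rounds half-to-even up to 284. A quick demo, with the actual PyTorch TF32 switches the tweet refers to:

import torch

# bf16 has 8 significand bits, so between 256 and 512 the spacing between
# representable values is 2. 282 + 1 = 283 lands exactly halfway between
# 282 and 284, and round-to-nearest-even picks 284.
x = torch.tensor(282.0, dtype=torch.bfloat16)
print(x + 1)  # tensor(284., dtype=torch.bfloat16)

# Inference side of the joke: leaving TF32 on silently lowers matmul/conv
# precision; turn it off if full fp32 accuracy matters.
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False
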
Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Really keen on adaptive sparse architectures for multi-modal models. They make a lot of sense for reducing the difficulties of training large MM models (training instabilities induced by modality competition, e.g. logit growth). They also just scale better.

MoMA is a good example of
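A toy sketch of the modality-aware sparse idea (not MoMA's actual implementation): give each modality its own expert pool so image and text tokens stop competing for the same FFN parameters. All names below are illustrative.

import torch
import torch.nn as nn

def ffn(d: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

class ModalityAwareMoE(nn.Module):
    # Sketch: separate expert pools and routers per modality, top-1 routing.
    def __init__(self, d_model: int, experts_per_modality: int = 2):
        super().__init__()
        self.experts = nn.ModuleDict({
            m: nn.ModuleList(ffn(d_model) for _ in range(experts_per_modality))
            for m in ("image", "text")})
        self.routers = nn.ModuleDict({
            m: nn.Linear(d_model, experts_per_modality) for m in ("image", "text")})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # x: (num_tokens, d_model), all tokens of a single modality.
        probs = self.routers[modality](x).softmax(dim=-1)
        top1 = probs.argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts[modality]):
            mask = top1 == i
            if mask.any():
                # weight each expert's output by its routing probability
                out[mask] = expert(x[mask]) * probs[mask, i:i + 1]
        return out
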
Harold Benoit (@harold_matmul)'s Twitter Profile Photo

A structured version of this, e.g. Matryoshka / nested dropout, would let you train end-to-end a knob to trade off recall against memory usage. We could even offload the unused KV cache to RAM and bring it back later if better recall is needed.
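A back-of-the-envelope sketch of that knob, assuming (hypothetically) that nested-dropout-style training has ordered the KV head dimensions by importance, so the leading slice is the one worth keeping on the GPU:

import torch

class NestedKVCache:
    # Sketch: keep only the leading `keep_ratio` fraction of the KV head
    # dimensions on the GPU and park the rest in CPU RAM until better recall
    # is worth the transfer cost.
    def __init__(self, k: torch.Tensor, v: torch.Tensor, keep_ratio: float):
        # k, v: (batch, heads, seq_len, head_dim), leading dims most important
        d = int(k.shape[-1] * keep_ratio)
        self.hot_k, self.hot_v = k[..., :d].cuda(), v[..., :d].cuda()
        self.cold_k = k[..., d:].cpu().pin_memory()
        self.cold_v = v[..., d:].cpu().pin_memory()

    def low_memory(self):
        # Cheap mode: attention runs on the truncated KV dimensions only.
        return self.hot_k, self.hot_v

    def full_recall(self):
        # Pull the offloaded tail back from RAM when recall matters more.
        k = torch.cat([self.hot_k, self.cold_k.cuda(non_blocking=True)], dim=-1)
        v = torch.cat([self.hot_v, self.cold_v.cuda(non_blocking=True)], dim=-1)
        return k, v

In practice the query projection would have to be truncated to the same leading dimensions for the attention scores to stay consistent.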