Harold Benoit (@harold_matmul)'s Twitter Profile
Harold Benoit

@harold_matmul

Another day of being a researcher in theory but an engineer in practice | tech staff @LiquidAI_

ID: 1777370251673448452

Link: https://notes.haroldbenoit.com/ · Joined: 08-04-2024 16:18:05

72 Tweets

288 Followers

210 Following

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

You could just do MatFormer training + Matryoshka quantization to get a whole family of inference checkpoints (at multiple sizes and quantization levels) from a single full-precision checkpoint
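A minimal sketch of what the MatFormer half of that recipe could look like: read a smaller nested sub-model straight out of the full-precision FFN weights, then quantize each slice at several bit-widths, Matryoshka-style. The slice_ffn helper and the ratio knob below are purely illustrative, not any real codebase.

import torch
import torch.nn as nn

def slice_ffn(up: nn.Linear, down: nn.Linear, ratio: float):
    # MatFormer-style extraction (sketch): keep only the first `ratio` fraction
    # of the FFN hidden units, so a smaller sub-model can be read straight out
    # of the full-precision checkpoint.
    hidden = int(up.out_features * ratio)
    sub_up = nn.Linear(up.in_features, hidden, bias=up.bias is not None)
    sub_down = nn.Linear(hidden, down.out_features, bias=down.bias is not None)
    with torch.no_grad():
        sub_up.weight.copy_(up.weight[:hidden])
        sub_down.weight.copy_(down.weight[:, :hidden])
        if up.bias is not None:
            sub_up.bias.copy_(up.bias[:hidden])
        if down.bias is not None:
            sub_down.bias.copy_(down.bias)
    return sub_up, sub_down

# One full-precision checkpoint -> several nested sizes, each of which could
# then be quantized at multiple bit-widths (the Matryoshka quantization part).
full_up, full_down = nn.Linear(1024, 4096), nn.Linear(4096, 1024)
half_up, half_down = slice_ffn(full_up, full_down, ratio=0.5)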

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Very happy to release those models to the community. Lots of care went into the architecture design.

1. It has low cache requirements (the gated convolutions only require a cache size of batch_size x 3 x d_model)
2. It has fewer FLOPs than a standard transformer, even at short
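A minimal sketch of how a gated convolution can get away with a batch_size x 3 x d_model cache, assuming a depthwise causal conv with kernel size 3 so decoding only ever needs the last three projected inputs per channel. The sigmoid gating placement is a guess, not the released architecture.

import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    # Sketch: depthwise causal conv (kernel size 3) plus a sigmoid gate.
    # At decode time the only state is the last 3 projected inputs per channel,
    # i.e. a cache of shape (batch_size, 3, d_model).
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.in_proj = nn.Linear(d_model, 2 * d_model)            # value + gate
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              groups=d_model, padding=0)          # depthwise
        self.out_proj = nn.Linear(d_model, d_model)

    def step(self, x: torch.Tensor, cache: torch.Tensor):
        # x: (batch, d_model) for one token; cache: (batch, kernel_size, d_model)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        cache = torch.cat([cache[:, 1:], v.unsqueeze(1)], dim=1)  # roll the window
        y = self.conv(cache.transpose(1, 2)).squeeze(-1)          # (batch, d_model)
        return self.out_proj(torch.sigmoid(g) * y), cache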

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

You could have a functional RL + inference engine in ~2.5k LOC.
1. RL2 (github.com/ChenmienTan/RL2)
2. nano-vllm (github.com/GeeeekExplorer…)

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

VLM learn to "see" by aligning the representation from the image encoder to the text representation from the LLM, using image-text pairs. Has anyone tried to make a VLM learn to "see" purely through exploration/RL? A setup could be to 1. feed the encoded image (with or

VLM learn to "see" by aligning the representation from the image encoder to the text representation from the LLM, using image-text pairs.

Has anyone tried to make a VLM learn to "see" purely through exploration/RL? 

A setup could be to 
1. feed the encoded image (with or
Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Mixture-of-experts has this nice property of being optimal w.r.t. scaling:
1. compute
2. suffering to write the infra to do training, quantization & inference

Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Training: 
282+1=284 in bf16, lfg some regularization for my training.

Inference:  
Leaving torch.backends.cudnn.allow_tf32 = True, too bad bruv, that's a 20 point drop on AIME
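The arithmetic checks out: bf16 keeps only 8 significand bits, so between 256 and 512 representable values are 2 apart and 282 + 1 rounds half-to-even up to 284. A quick demo, with the actual PyTorch TF32 switches the tweet refers to:

import torch

# bf16 has 8 significand bits, so between 256 and 512 the spacing between
# representable values is 2. 282 + 1 = 283 lands exactly halfway between
# 282 and 284, and round-to-nearest-even picks 284.
x = torch.tensor(282.0, dtype=torch.bfloat16)
print(x + 1)  # tensor(284., dtype=torch.bfloat16)

# Inference side of the joke: leaving TF32 on silently lowers matmul/conv
# precision; turn it off if full fp32 accuracy matters.
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False
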
Harold Benoit (@harold_matmul)'s Twitter Profile Photo

Really keen on adaptive sparse architectures for multi-modal models. They make a lot of sense for reducing the difficulties of training large MM models (training instabilities induced by modality competition, e.g. logit growth). They also just scale better.

MoMA is a good example of
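A toy sketch of the modality-aware sparse idea (not MoMA's actual implementation): give each modality its own expert pool so image and text tokens stop competing for the same FFN parameters. All names below are illustrative.

import torch
import torch.nn as nn

def ffn(d: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

class ModalityAwareMoE(nn.Module):
    # Sketch: separate expert pools and routers per modality, top-1 routing.
    def __init__(self, d_model: int, experts_per_modality: int = 2):
        super().__init__()
        self.experts = nn.ModuleDict({
            m: nn.ModuleList(ffn(d_model) for _ in range(experts_per_modality))
            for m in ("image", "text")})
        self.routers = nn.ModuleDict({
            m: nn.Linear(d_model, experts_per_modality) for m in ("image", "text")})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # x: (num_tokens, d_model), all tokens of a single modality.
        probs = self.routers[modality](x).softmax(dim=-1)
        top1 = probs.argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts[modality]):
            mask = top1 == i
            if mask.any():
                # weight each expert's output by its routing probability
                out[mask] = expert(x[mask]) * probs[mask, i:i + 1]
        return out
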
Harold Benoit (@harold_matmul)'s Twitter Profile Photo

A structured version of this, e.g. Matryoshka / nested dropout, would let you train end-to-end a knob to trade off recall against memory usage. We could even offload the unused KV cache to RAM and bring it back later if better recall is needed.
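A back-of-the-envelope sketch of that knob, assuming (hypothetically) that nested-dropout-style training has ordered the KV head dimensions by importance, so the leading slice is the one worth keeping on the GPU:

import torch

class NestedKVCache:
    # Sketch: keep only the leading `keep_ratio` fraction of the KV head
    # dimensions on the GPU and park the rest in CPU RAM until better recall
    # is worth the transfer cost.
    def __init__(self, k: torch.Tensor, v: torch.Tensor, keep_ratio: float):
        # k, v: (batch, heads, seq_len, head_dim), leading dims most important
        d = int(k.shape[-1] * keep_ratio)
        self.hot_k, self.hot_v = k[..., :d].cuda(), v[..., :d].cuda()
        self.cold_k = k[..., d:].cpu().pin_memory()
        self.cold_v = v[..., d:].cpu().pin_memory()

    def low_memory(self):
        # Cheap mode: attention runs on the truncated KV dimensions only.
        return self.hot_k, self.hot_v

    def full_recall(self):
        # Pull the offloaded tail back from RAM when recall matters more.
        k = torch.cat([self.hot_k, self.cold_k.cuda(non_blocking=True)], dim=-1)
        v = torch.cat([self.hot_v, self.cold_v.cuda(non_blocking=True)], dim=-1)
        return k, v

In practice the query projection would have to be truncated to the same leading dimensions for the attention scores to stay consistent.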