Francesco Bertolotti (@f14bertolotti) 's Twitter Profile
Francesco Bertolotti

@f14bertolotti

Postdoctoral researcher at the University of Milan

ID: 1448917158025744420

Link: http://f14-bertolotti.github.io · Joined: 15-10-2021 07:43:15

117 Tweets

370 Followers

120 Following

Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

I have been studying a little of torch.distributed to set up a fully sharded data parallel example without tools like accelerate or deepspeed. This is a write-up of my current notes. Let me know if you find them helpful!  

🔗f14-bertolotti.github.io/posts/02-09-25…
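The sharding idea behind those notes can be illustrated without torch.distributed at all. Below is a minimal pure-Python sketch (helper names are hypothetical; real FSDP runs one process per rank and uses collectives like dist.all_gather): each rank stores only its slice of the flat parameter vector and reassembles the full set on demand.

```python
# Pure-Python sketch of the FSDP idea: each rank owns one shard of a
# flat parameter vector and "all-gathers" the full vector when needed.

def shard(params, rank, world_size):
    """Return the slice of `params` owned by `rank` (ceil-divided split)."""
    per_rank = (len(params) + world_size - 1) // world_size
    return params[rank * per_rank:(rank + 1) * per_rank]

def all_gather(shards):
    """Reassemble the full parameter vector from every rank's shard."""
    return [p for s in shards for p in s]

world_size = 4
params = list(range(10))  # stand-in for a flat weight tensor

# Each rank holds roughly 1/world_size of the parameters...
shards = [shard(params, r, world_size) for r in range(world_size)]
# ...and materializes the full set only around forward/backward.
full = all_gather(shards)
assert full == params
```

In real FSDP the gathered parameters are freed again after each forward/backward pass, which is where the memory savings come from.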
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

In a new technical report, researchers introduce two brain-inspired LLMs via continued pre-training. Thanks to a spiking scheme with 69.15% sparsity, they run ultra-fast.

🔗arxiv.org/pdf/2509.05276
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

In this paper, the authors show that vision transformers activate semantically coherent features even in the presence of noise. When these features get activated, the model often hallucinates.

Is this the root cause of hallucinations?

🔗arxiv.org/abs/2509.06938
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

From today's arXiv. The authors propose a loss aggregation for GRPO that generalizes that of Dr. GRPO. The aggregation coefficients are obtained by solving a constrained convex problem.

🔗arxiv.org/abs/2509.07558
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

This is a new 100-page RL for LLM literature review. It appears fairly complete. It also covers static/dynamic data and frameworks. And it has some nice figures!

🔗arxiv.org/abs/2509.08827
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

This is an application of GFlowNets to LLM RL training. Instead of directly maximizing the reward as in GRPO or PPO, the authors use the GFlow objective. They also had to deal with a few issues, but the end result seems pretty good.

🔗arxiv.org/abs/2509.15207
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

From today's arXiv: LATTS. A decoding strategy where the next token is sampled from the product of a reward model's distribution (correctness) and the model's own distribution (language coherence). They measure quality as the curve of accuracy over the number of generated tokens.

🔗arxiv.org/abs/2509.20368
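The decoding rule can be sketched as a renormalized product of the two next-token distributions. A toy sketch with a made-up 3-token vocabulary (not the paper's implementation):

```python
def product_decode(p_model, p_reward):
    """Distribution proportional to p_model * p_reward, renormalized."""
    prod = [pm * pr for pm, pr in zip(p_model, p_reward)]
    z = sum(prod)
    return [p / z for p in prod]

# Toy vocabulary of 3 tokens: the LM prefers token 0,
# the reward model prefers token 2; the product balances both.
p_model = [0.6, 0.3, 0.1]
p_reward = [0.2, 0.2, 0.6]
p_next = product_decode(p_model, p_reward)
# p_next ≈ [0.5, 0.25, 0.25]
```

The next token would then be sampled (or greedily picked) from `p_next`.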
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

This paper is an iteration on the rather controversial HRM paper. The authors point out an additional parallel between HRMs and diffusion models, and show that adaptive computation time can also be useful at evaluation.

🔗arxiv.org/pdf/2510.00355
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

A few drawbacks of diffusion language models. This paper provides both practical and theoretical perspectives on the problems specific to non-autoregressive models.

🔗arxiv.org/abs/2510.03289
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

A verifiable sparse-attention approach for inference. The probability that vAttention strays more than ϵ from standard SDPA is less than δ, and you can tune ϵ and δ to trade performance against accuracy.

🔗arxiv.org/pdf/2510.05688
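The (ϵ, δ) guarantee can be illustrated with a toy Monte-Carlo check (entirely made-up data and perturbations, not the paper's method): count how often an approximation strays more than ϵ from the exact output, and compare that empirical failure rate against δ.

```python
import random

def within_tolerance(exact, approx, eps):
    """True if the max absolute deviation is at most eps."""
    return max(abs(e - a) for e, a in zip(exact, approx)) <= eps

random.seed(0)
eps, delta, trials = 0.1, 0.05, 1000
failures = 0
for _ in range(trials):
    exact = [random.random() for _ in range(8)]  # stand-in for SDPA output
    # toy "sparse" approximation: small bounded perturbation of the exact output
    approx = [x + random.uniform(-0.05, 0.05) for x in exact]
    if not within_tolerance(exact, approx, eps):
        failures += 1

# the empirical failure rate plays the role of the paper's δ bound
assert failures / trials <= delta
```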
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

Interesting chunk-based approach to LLM RL that goes like this:
- Get a prompt.
- The model generates a chunk.
- Carry over the chunk, concatenated to the prompt.
- Repeat the generation.

This keeps the context small while still allowing for RL. Cool work!!

🔗arxiv.org/abs/2510.06557
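The loop above can be sketched with a stub generator (`generate_chunk` and `chunked_rollout` are hypothetical stand-ins, not the paper's code):

```python
def generate_chunk(context, step):
    """Hypothetical stand-in for one chunked model generation."""
    return f"<chunk{step}>"

def chunked_rollout(prompt, num_chunks, carry=1):
    """Keep only the last `carry` chunks in context, not the full history."""
    chunks = []
    for step in range(num_chunks):
        context = prompt + "".join(chunks[-carry:])  # context stays small
        chunks.append(generate_chunk(context, step))
    return "".join(chunks)

out = chunked_rollout("solve:", num_chunks=3)
# out == "<chunk0><chunk1><chunk2>"
```

The point is that the context passed to the model is bounded by the prompt plus `carry` chunks, regardless of how long the full rollout grows.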
Francesco Bertolotti (@f14bertolotti) 's Twitter Profile Photo

In this paper, the authors show that base models already contain thinking capabilities, which can be elicited using steering vectors obtained from their fine-tuned counterparts.

🔗arxiv.org/abs/2510.07364
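A common way to build such a steering vector is the difference of mean activations between the fine-tuned and base models on the same inputs, added to the base model's hidden state at inference. A toy sketch with made-up numbers (helper names hypothetical; not the paper's code):

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(acts_finetuned, acts_base):
    """Difference of mean activations: fine-tuned minus base."""
    mf, mb = mean(acts_finetuned), mean(acts_base)
    return [f - b for f, b in zip(mf, mb)]

def steer(hidden, vec, alpha=1.0):
    """Add the scaled steering vector to a base-model hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

base = [[0.0, 1.0], [0.0, 3.0]]    # toy base-model activations
tuned = [[1.0, 1.0], [1.0, 3.0]]   # toy fine-tuned activations
vec = steering_vector(tuned, base)  # -> [1.0, 0.0]
steered = steer([0.5, 0.5], vec)    # -> [1.5, 0.5]
```

The scale `alpha` controls how strongly the base model is pushed toward the fine-tuned behavior.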