wilam yang (@wilambatch)'s Twitter Profile
wilam yang

@wilambatch

ID: 808553300811362305

Joined: 13-12-2016 06:05:03

126 Tweets

64 Followers

2.2K Following

马东锡 NLP 🇸🇪 (@dongxi_nlp)

「 Data Contamination, Qwen2.5 」

The data-contamination problem in the Qwen2.5 series has been confirmed: the model had already seen the benchmark questions during pretraining.

A few months ago, several LLM Reasoning + RL papers found that extremely weak or even random rewards were enough to significantly improve the Qwen series' math-reasoning ability.

This raised the question of whether Qwen models had already seen the benchmark questions during the pretraining stage.
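A common way to probe for this kind of contamination is an n-gram overlap check between benchmark items and pretraining documents. Below is a minimal sketch, assuming you can stream corpus documents as strings; the 13-gram window and the helper names are my choices, not anything from the thread.

```python
from typing import Iterable, List, Set

def ngrams(text: str, n: int = 13) -> Set[str]:
    """Word-level n-grams of `text` (13-grams are a common choice
    for contamination checks)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark: List[str],
                       corpus_docs: Iterable[str],
                       n: int = 13) -> float:
    """Fraction of benchmark items that share at least one n-gram
    with any pretraining document."""
    bench_grams = [ngrams(q, n) for q in benchmark]
    hit = [False] * len(benchmark)
    for doc in corpus_docs:              # stream; don't load everything
        doc_grams = ngrams(doc, n)
        for i, grams in enumerate(bench_grams):
            if not hit[i] and grams & doc_grams:
                hit[i] = True
    return sum(hit) / max(len(benchmark), 1)

# Hypothetical usage: benchmark questions vs. one corpus shard.
# rate = contamination_rate(math_questions, iter_corpus_shard("shard-0000"))
```
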
ginobefun (@hongming731)

Understand complexity first, then pursue simplicity

People often say "keep it simple," but most people's approach gets it backwards. They start from simplicity and keep piling on features without a global view. The end result is a patched-together "Frankenstein" product: seemingly clean components awkwardly forced together, barely held up by stopgap fixes and wishful thinking.

Dimitris Papailiopoulos (@dimitrispapail)

Why is in-context learning more efficient than SFT? Because conditioning a distribution is easier than learning a conditional distribution.
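
One way to formalize the distinction (my notation, not the tweet's): in-context learning evaluates the frozen pretrained conditional at a prompt augmented with demonstrations, while SFT has to move the parameters themselves.

```latex
\[
\underbrace{p_{\theta}\left(y \mid x,\ (x_1,y_1),\dots,(x_k,y_k)\right)}_{\text{ICL: condition a fixed } p_{\theta}}
\qquad\text{vs.}\qquad
\underbrace{\theta' = \arg\min_{\theta}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[-\log p_{\theta}(y \mid x)\right]}_{\text{SFT: learn a new conditional}}
\]
```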

wilam yang (@wilambatch)

Or put another way, it's a kind of conditional probability: an event has unlimited possibilities, but once you bring together the related people, things, and events from its history, the possibilities concentrate into just a few cases.
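
In information-theoretic terms (my gloss, not the tweet's), conditioning on relevant context can only reduce uncertainty on average:

```latex
\[
H(Y \mid X) \;=\; H(Y) - I(X;Y) \;\le\; H(Y)
\]
% Conditioning on relevant history X concentrates the distribution
% over outcomes Y into a handful of likely cases.
```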

wilam yang (@wilambatch)

When a model's activation values fluctuate too much, training collapses, so norm layers are added for stability. Intuitively, though, large fluctuations are what carry more useful information. Is there a way to keep enough variance while still training stably?
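
For context, here is a minimal PyTorch sketch of RMSNorm, one of the stabilizers the question alludes to: it rescales each activation vector to unit RMS, and a learnable per-channel gain lets the network recover useful scale afterwards. This is a sketch of the standard technique, not any specific model's code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean-centering, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale so each vector along the last dim has unit RMS;
        # the gain can reintroduce per-channel scale the norm removed.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```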

wilam yang (@wilambatch)

Diffusion models generalize entirely through interpolation, composition, and substitution. The key point is that the neural network has memorized basic visual concepts; given a prompt, it retrieves those concepts and composes them. For now, that retrieval ability still can't be controlled precisely.
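
To make the interpolation idea concrete (my illustration, not the tweet's), here is spherical interpolation between two concept embeddings; `embed` in the usage comment is a hypothetical encoder.

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between embeddings a and b, t in [0, 1]."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))  # angle between them
    if omega < 1e-6:                                  # nearly parallel
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Hypothetical: blend two concept embeddings halfway, then feed the
# result to the model's conditioning pathway.
# mixed = slerp(embed("cat"), embed("dog"), t=0.5)
```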

wilam yang (@wilambatch)

The path to simplicity should be: understand the real-world complexity, then use the right logic and methods to decompose and abstract it well, and finally build simple interactions on top of that abstraction. The surface looks simple, while the real complexity is hidden in the lower layers by the abstraction. Requirements these days are complex and ever-changing, so the ability to build good abstractions is essential.
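
As a toy illustration of that layering (my example, not the tweet's): the messy steps live in lower-level helpers, and the surface API stays a single call.

```python
# Low-level layer: the messy reality, decomposed into focused pieces.
def _parse(raw: str) -> list[str]:
    return [tok for tok in raw.split(",") if tok.strip()]

def _validate(tokens: list[str]) -> list[str]:
    return [tok.strip().lower() for tok in tokens]

def _store(tokens: list[str]) -> int:
    # stand-in for persistence; returns how many items were "saved"
    return len(tokens)

# Surface layer: one simple interaction built on the abstractions above.
def import_tags(raw: str) -> int:
    """All the caller sees: pass a string, get back how many tags landed."""
    return _store(_validate(_parse(raw)))

assert import_tags("ml, NLP,  , vision") == 3
```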

Sergio Calvo Ordoñez (@s_calvoordonez)

We'd love our flow-based generative models to learn the optimal transport from noise to data... but they rarely do ❌. Mini-batch Optimal Transport methods aim to fix this — but they're costly and require large batch sizes to work well... Can we approximate this behaviour
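
For reference, here is a minimal sketch of the mini-batch optimal-transport idea the tweet describes (not the authors' proposed method): within each batch, match noise samples to data samples by solving a small assignment problem, which is exactly what gets expensive as the batch grows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairing(noise: np.ndarray, data: np.ndarray):
    """Pair each noise sample with a data sample via the optimal
    assignment under squared Euclidean cost. Shapes: (B, D) each."""
    # Pairwise squared distances: cost[i, j] = ||noise[i] - data[j]||^2
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)  # O(B^3): costly for large B
    return noise[rows], data[cols]

# Usage sketch: pair a batch, then train flow matching on straight
# paths between the matched endpoints.
x0, x1 = minibatch_ot_pairing(np.random.randn(64, 8), np.random.randn(64, 8))
```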

Dimitri von Rütte (@dvruette)

gpt-oss is probably the most standard MoE transformer that ever was. Couple of details worth noting:
- Uses attention sinks (a.k.a. registers)
- Sliding window attention in every second layer
- YaRN context window extension
- RMSNorm without biases
- No QK norm, no attn. softcap
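
As a concrete illustration of two items on that list (my sketch, not code from the release, and gpt-oss's exact sink formulation may differ): a causal sliding-window mask in which a few leading "sink" positions remain visible to every query.

```python
import torch

def make_mask(seq_len: int, window: int = 128, num_sinks: int = 1) -> torch.Tensor:
    """Boolean attention mask: True = query may attend to key.
    Causal + sliding window, with the first `num_sinks` positions
    (attention sinks) always attendable. Window and sink count here
    are placeholder values."""
    i = torch.arange(seq_len)[:, None]   # query positions
    j = torch.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    in_window = (i - j) < window
    sink = j < num_sinks                 # sinks never fall out of scope
    return causal & (in_window | sink)

mask = make_mask(seq_len=16, window=4)
print(mask.int())
```
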
wilam yang (@wilambatch)

For trained models, including diffusion models, generalization roughly comes from interpolation and composition. With only the raw signal as guidance, the generalized outputs may include high-quality results alongside low-quality ones. To steer generation toward more high-quality outputs, you need some more abstract signals as additional guidance, or else numerical constraints.
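
One standard sampling-time way to add such a stronger conditioning signal is classifier-free guidance (a general technique, not something the tweet prescribes): blend the denoiser's conditional and unconditional predictions with a guidance weight. The `model` signature below is a stand-in.

```python
import torch

def cfg_predict(model, x_t, t, cond, guidance_scale: float = 7.5):
    """Classifier-free guidance: push the denoiser's prediction
    toward the conditional direction by `guidance_scale`."""
    eps_uncond = model(x_t, t, cond=None)   # unconditional prediction
    eps_cond = model(x_t, t, cond=cond)     # conditioned on the prompt
    # scale = 1 recovers the plain conditional model; larger values
    # trade diversity for prompt adherence (a numerical constraint).
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a stand-in denoiser (a real model plays the same role):
toy = lambda x, t, cond: x * (0.5 if cond is None else 1.0)
out = cfg_predict(toy, torch.randn(2, 4), t=0, cond="a cat")
```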

TuringPost (@theturingpost)

GRPO vs GSPO, or DeepSeek vs Qwen - a workflow breakdown of the main Chinese reinforcement learning algorithms

➡️ Group Relative Policy Optimization (GRPO): Learning by comparison

GRPO is tailored for reasoning-heavy tasks where relative quality matters more than absolute
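
For reference, a minimal sketch of GRPO's defining step, group-relative advantage estimation (following the published formulation; the variable names are mine): sample several responses per prompt and score each one against its own group, with no learned critic.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, group_size), one scalar reward per sampled
    response. GRPO normalizes each response's reward against its own
    group, so no value network is needed as a baseline."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# E.g. 2 prompts x 4 sampled responses each:
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                              [0.2, 0.9, 0.4, 0.5]]))
```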

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

RLVR/RLHF libraries:
• verl - ByteDance
• TRL - HuggingFace
• slime - Zhipu AI
• prime-rl - Prime Intellect
• ROLL - Alibaba
• Nemo-RL - NVIDIA
• AReaL - Ant Research
• SkyRL - UC Berkeley
• open-instruct - Allen AI
• torchtune - PyTorch

Any I am missing? Which do you

ℏεsam (@hesamation)

I think I found a based Substack on low-level GPU programming by accident. 

He has some extensive articles on CUDA programming, building LLM inference engines, looking inside GPUs and much more. 

even the name is cool: "From Scratch". bro.
Aman Chadha (@i_amanchadha)

🧠 [Primer] Model Compression • compress.aman.ai

- Model compression techniques make it possible to run powerful AI models efficiently on edge devices by reducing memory, compute, and energy/power demands without severely sacrificing accuracy.
- This primer explores
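
As a taste of one technique in that family (a generic sketch of symmetric post-training int8 quantization, not the primer's own code): map float32 weights to int8 with a per-tensor scale, cutting memory roughly 4x at a small accuracy cost.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor post-training quantization to int8.
    Returns the quantized weights and the scale for dequantization."""
    scale = np.abs(w).max() / 127.0          # map max |w| to the int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale      # approximate reconstruction

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()    # small quantization error
```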