We'd love our flow-based generative models to learn the optimal transport from noise to data... but they rarely do ❌.
Mini-batch Optimal Transport methods aim to fix this — but they're costly and require large batch sizes to work well. Can we approximate this behaviour?
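The core trick in mini-batch OT is to re-pair noise and data samples *within a batch* so the coupling is closer to optimal transport before training on it. A minimal 1-D sketch (my own illustration, not any specific paper's code): under squared cost, the 1-D OT coupling is just the monotone pairing, so sorting both mini-batches gives the exact in-batch OT plan.

```python
import random

def minibatch_ot_pairing_1d(noise, data):
    """In 1D, the optimal transport coupling under squared Euclidean cost
    is the monotone (sorted) pairing, so mini-batch OT reduces to sorting
    the noise batch and the data batch and zipping them together."""
    return list(zip(sorted(noise), sorted(data)))

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(8)]   # samples from the prior
data = [random.gauss(3.0, 1.0) for _ in range(8)]    # samples from the "data"
pairs = minibatch_ot_pairing_1d(noise, data)

# The sorted pairing never costs more than the arbitrary (index-wise) pairing.
cost_ot = sum((a - b) ** 2 for a, b in pairs)
cost_naive = sum((a - b) ** 2 for a, b in zip(noise, data))
```

In higher dimensions the in-batch plan requires solving an assignment problem (e.g. the Hungarian algorithm), which is where the cost and the large-batch requirement come from.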
gpt-oss is probably the most standard MoE transformer that ever was. A couple of details worth noting:
- Uses attention sinks (a.k.a. registers)
- Sliding window attention in every second layer
- YaRN context window extension
- RMSNorm without biases
- No QK norm, no attn. softcap
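Of the details above, RMSNorm without biases is the easiest to show concretely. A minimal sketch (my own, not the gpt-oss source): normalize by the root-mean-square of the activations, then apply a learned per-channel scale — no mean subtraction, no bias term.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of the activations and
    apply a learned scale. Unlike LayerNorm there is no mean subtraction,
    and (as noted for gpt-oss) no bias term."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

out = rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

With a unit scale, the output has mean-square ≈ 1 regardless of the input's magnitude.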
GRPO vs GSPO, or DeepSeek vs Qwen - a workflow breakdown of the main Chinese reinforcement learning algorithms
➡️ Group Relative Policy Optimization (GRPO): Learning by comparison
GRPO is tailored for reasoning-heavy tasks where relative quality matters more than absolute quality.
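The "learning by comparison" part can be sketched in a few lines: GRPO samples a group of completions per prompt and normalizes each completion's reward by the group's mean and standard deviation, so advantages come from within-group comparison instead of a learned value network (a simplified sketch of the advantage step only, not a full training loop):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: normalize each sampled
    completion's reward by the mean and std of its own group, so the
    policy is updated from comparisons within the group rather than
    from a separate critic."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = (var + eps) ** 0.5
    return [(r - mean) / std for r in rewards]

# One group of 4 completions for the same prompt, with verifiable 0/1 rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group's mean get positive advantage, the rest get negative — which is exactly the "relative, not absolute" behaviour described above.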
RLVR/RLHF libraries:
• verl - ByteDance
• TRL - HuggingFace
• slime - Zhipu AI
• prime-rl - Prime Intellect
• ROLL - Alibaba
• Nemo-RL - NVIDIA
• AReaL - Ant Research
• SkyRL - UC Berkeley
• open-instruct - Allen AI
• torchtune - PyTorch
Any I am missing? Which do you use?
I think I found a based Substack on low-level GPU programming by accident.
He has some extensive articles on CUDA programming, building LLM inference engines, looking inside GPUs and much more.
even the name is cool: "From Scratch". bro.
🧠 [Primer] Model Compression • compress.aman.ai
- Model compression techniques make it possible to run powerful AI models efficiently on edge devices by reducing memory, compute, and energy/power demands without severely sacrificing accuracy.
- This primer explores
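To make the "reducing memory without severely sacrificing accuracy" point concrete, here is a minimal sketch of one classic compression technique — symmetric per-tensor int8 quantization (my own toy illustration, not code from the primer):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]
    using a single scale, a 4x memory reduction vs float32."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid div-by-zero
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights; error per weight <= scale / 2."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

The round-trip error is bounded by half the quantization step, which is why accuracy often survives compression surprisingly well.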