Notes:
- Two models: R1-Zero (V3-Base + pure RL, no SFT) and R1 (SFT on CoT data sampled from R1-Zero -> RL for reasoning -> SFT on general data -> RL for alignment + reasoning); pipeline sketched after this list
- Six distilled models: SFT on R1-generated reasoning traces applied to Qwen and Llama bases. Distillation outperforms running RL directly on those smaller base models; the authors note that applying RL on top of the distilled models would likely improve them further (see the distillation sketch below)
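A minimal, purely schematic sketch of the staged pipeline above. All function names and stage labels are placeholders for illustration, not DeepSeek's actual code or datasets; the point is only the ordering of the stages.

```python
# Hypothetical sketch of the two training recipes; sft/rl are placeholders.

def sft(model: str, data: str) -> str:
    """Supervised fine-tuning stage (placeholder)."""
    return f"{model} -> SFT[{data}]"

def rl(model: str, objective: str) -> str:
    """Reinforcement-learning stage (placeholder)."""
    return f"{model} -> RL[{objective}]"

base = "V3-Base"

# R1-Zero: pure RL on the base model, no SFT anywhere.
r1_zero = rl(base, "reasoning")

# R1: four alternating stages.
m = sft(base, "CoT cold-start data sampled from R1-Zero")
m = rl(m, "reasoning")
m = sft(m, "general data")
r1 = rl(m, "alignment + reasoning")

print(r1_zero)
print(r1)
```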
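And a matching sketch of the distillation step: the six students get plain SFT on traces sampled from R1, with no RL applied to the students themselves. Again, every name here is a placeholder; the Qwen/Llama size list reflects the paper's released checkpoints.

```python
# Hypothetical distillation sketch: students are trained by SFT only.

def sample_traces(teacher: str, prompts: list[str]) -> list[str]:
    """Generate CoT completions from the teacher (placeholder)."""
    return [f"{teacher}({p})" for p in prompts]

def sft(student: str, traces: list[str]) -> str:
    """Supervised fine-tuning on teacher traces (placeholder)."""
    return f"{student} -> SFT[{len(traces)} traces]"

prompts = ["reasoning prompt A", "reasoning prompt B"]
traces = sample_traces("R1", prompts)

# Six student bases across the Qwen and Llama families.
students = ["Qwen-1.5B", "Qwen-7B", "Qwen-14B", "Qwen-32B",
            "Llama-8B", "Llama-70B"]
distilled = [sft(s, traces) for s in students]
print(distilled)
```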