@ynniv
2025-05-12 20:30:20
Yesterday it was 48% with 57k RLHF pairs. Today it's 50% with zero. This is notable because the previous paper tried using less data and found that more data was necessary:
"Training data from math train 7.5k to Open-Reasoner-
Zero 57k, we observe a consistent increase in both training reward and response length for training and evaluation set, indicating that data scale plays a crucial role in training performance."
This leads me to conclude that, with zero pairs, the previous result was close to 0%. Maybe that isn't strictly true, but a jump from roughly 0% to 50% should be more predictive than a 2% change.