Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: forgetting one fact (e.g., a U.S. president or an API call) does not “average out” by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1→0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0→1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and instruction tuning yield moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) applying RL/SFT to instruction-tuned models is sensitive to data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale, enabling progress toward generally capable AI systems.
We define forgetting as items that are answered correctly before a post-training stage but incorrectly afterward (1→0 transitions), and backward transfer as items that are answered incorrectly before but correctly after post-training (0→1 transitions). A further complication is that most knowledge-intensive LLM evaluation benchmarks are multiple-choice. Random guessing inflates accuracy and can create illusory transitions: an apparent 1→0 may simply be a lucky guess that later becomes an incorrect answer, even when the underlying knowledge did not change; likewise for 0→1 transitions. When the answer is chosen from only a few options (e.g., four), random guessing can account for a substantial share of observed transitions, distorting both level and trend estimates of forgetting. A principled metric should therefore correct for chance: our chance-adjusted variants subtract the expected contribution of random guessing from the pre- and post-training accuracies.
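To make the bookkeeping concrete, here is a minimal sketch of the sample-wise transition counts, plus one common form of chance correction (subtracting the guessing baseline g = 1/k for k answer options and rescaling). The function names and the exact rescaling are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def transition_rates(pre_correct, post_correct):
    """Sample-wise 1->0 (forgetting) and 0->1 (backward transfer) rates
    between pre- and post-training per-item correctness vectors."""
    pre = np.asarray(pre_correct, dtype=bool)
    post = np.asarray(post_correct, dtype=bool)
    forgetting = np.mean(pre & ~post)         # correct before, incorrect after
    backward_transfer = np.mean(~pre & post)  # incorrect before, correct after
    return float(forgetting), float(backward_transfer)

def chance_adjust(accuracy, num_options=4):
    """One common chance correction for k-way multiple choice:
    subtract the guessing baseline g = 1/k, then rescale to [0, 1].
    (Illustrative; the paper's exact adjustment may differ.)"""
    g = 1.0 / num_options
    return max(0.0, (accuracy - g) / (1.0 - g))

# Toy example: 5 items scored before and after post-training.
pre  = [1, 1, 0, 0, 1]
post = [1, 0, 0, 1, 1]
f, b = transition_rates(pre, post)  # f = 0.2 (one 1->0), b = 0.2 (one 0->1)
```

The point of the sample-wise view is visible even in the toy example: aggregate accuracy is 0.6 both before and after, so a task average would report “no change” while one item was forgotten and another gained.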
Recent work shows that offline model merging can combine capabilities from multiple models. Unlike classical continual learning, it requires neither the original training data nor the ability to resume training, making it practical in resource-constrained settings.
Setup We evaluate Exponential Moving Average (EMA) merging; in the two-checkpoint case this reduces to linear interpolation (LERP). Prior large-scale studies find such simple schemes effective for continual learning with foundation models. Our experiments compare LERP and SLERP (spherical linear interpolation) across OpenThinker-7B, OpenThinker3-7B, and Qwen2.5-Coder-7B, together with their base checkpoints.
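For reference, a minimal sketch of the two merging schemes on PyTorch state dicts, assuming both checkpoints share identical keys and shapes; SLERP is applied per flattened tensor. This is an illustrative implementation under those assumptions, not the paper's merging code.

```python
import torch

def lerp(sd_a, sd_b, alpha=0.5):
    """Linear interpolation of two state dicts: (1 - alpha) * A + alpha * B."""
    return {k: torch.lerp(sd_a[k].float(), sd_b[k].float(), alpha).to(sd_a[k].dtype)
            for k in sd_a}

def slerp(sd_a, sd_b, alpha=0.5, eps=1e-8):
    """Spherical linear interpolation, applied per-tensor on flattened weights.
    Falls back to LERP when the two tensors are nearly colinear."""
    merged = {}
    for k in sd_a:
        a = sd_a[k].float().flatten()
        b = sd_b[k].float().flatten()
        cos = torch.dot(a, b) / (a.norm() * b.norm() + eps)
        omega = torch.arccos(cos.clamp(-1.0 + eps, 1.0 - eps))  # angle between A and B
        if omega < 1e-4:  # nearly colinear: SLERP degenerates to LERP
            out = torch.lerp(a, b, alpha)
        else:
            sin_omega = torch.sin(omega)
            out = (torch.sin((1 - alpha) * omega) / sin_omega) * a \
                + (torch.sin(alpha * omega) / sin_omega) * b
        merged[k] = out.reshape(sd_a[k].shape).to(sd_a[k].dtype)
    return merged

# Usage: merge a fine-tuned checkpoint back toward its base model.
# merged = slerp(model_ft.state_dict(), model_base.state_dict(), alpha=0.5)
# model_ft.load_state_dict(merged)
```

LERP averages weights in the ambient parameter space, while SLERP interpolates along the arc between the two weight vectors, preserving their norms more faithfully when the checkpoints point in different directions.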
@misc{harmon2025postforgetting,
  title={Mapping Post-Training Forgetting in Language Models at Scale},
  author={Jackson Harmon and Andreas Hochlehnert and Matthias Bethge and Ameya Prabhu},
  year={2025},
  eprint={2510.17776},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.17776},
}