Mapping Post-Training Forgetting in Language Models at Scale

Tübingen AI Center, University of Tübingen
Main figure

Abstract

Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: forgetting one fact (e.g., a U.S. president or an API call) does not “average out” by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1→0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0→1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and instruction tuning yield moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) applying RL/SFT to instruction-tuned models is sensitive to data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale, enabling progress towards generally capable AI systems.

High-Level Findings


Domain-Continual Pretraining: induces low-to-moderate forgetting across most categories; backward transfer is limited. Scaling model size marginally decreases forgetting.

Instruction-Tuning and SFT/RL from Base Models: yield low-to-moderate forgetting but moderate-to-large backward transfer, particularly in the Math and Logic categories, across model families; forgetting tends to decrease with increasing model scale. Reasoning training yields similar forgetting to, and larger backward transfer than, instruction tuning.

SFT/RL Reasoning Post-Training from Instruct Models: exhibits data-scale-dependent behaviour. In the low-data regime, it yields low forgetting and backward transfer; in the high-data regime, no dominant factor robustly describes the forgetting and backward-transfer dynamics.

Model Merging: does not reliably mitigate forgetting across post-training pipelines (yet).

Metrics

We define forgetting as items that are answered correctly before a post-training stage but incorrectly afterward (the 1→0 transitions), and backward transfer as items that are answered incorrectly before but correctly after post-training (the 0→1 transitions). A further complication is that most knowledge-intensive LLM evaluation benchmarks are multiple-choice. Random guessing inflates accuracy and can create illusory transitions: an apparent 1→0 may simply be a lucky guess that later becomes an incorrect answer, even when the underlying knowledge did not change; likewise for 0→1 transitions. When the answer is among only a few options (e.g., four), random guessing can account for a substantial share of observed transitions, distorting both level and trend estimates of forgetting. Thus, a principled metric should:

  • (i) resolve outcomes at the item level, and
  • (ii) explicitly correct for chance.

We do this by estimating the probability of a chance-correct answer x from the model's accuracy ā, under an independence assumption, and then subtracting it from the sample-level answers, thereby yielding the estimated true accuracy ā_true.

Figure 1: Decomposition of the observed accuracy ā into the estimated true accuracy ā_true and the guessing contribution x; the remaining 1−ā corresponds to incorrect answers.
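
As a concrete illustration, the sketch below counts sample-wise 1→0 and 0→1 transitions and applies a chance adjustment. The correction shown assumes the standard formula ā_true = (ā − 1/k)/(1 − 1/k) for k answer options; it is a simplified stand-in for the estimator described above, not our exact implementation.

```python
import numpy as np

def transition_rates(before, after):
    """Sample-wise forgetting and backward transfer.

    `before` and `after` are arrays of per-item correctness (0/1) on the
    same benchmark, evaluated before and after a post-training stage.
    Forgetting is the fraction of 1->0 transitions; backward transfer is
    the fraction of 0->1 transitions.
    """
    before = np.asarray(before, dtype=bool)
    after = np.asarray(after, dtype=bool)
    forgetting = np.mean(before & ~after)
    backward_transfer = np.mean(~before & after)
    return float(forgetting), float(backward_transfer)

def chance_adjusted_accuracy(acc, num_options):
    """Standard guessing correction (an assumption of this sketch, not
    necessarily the exact estimator used in the paper):
    a_true = (a - 1/k) / (1 - 1/k) for k options, clipped to [0, 1]."""
    g = 1.0 / num_options
    return float(np.clip((acc - g) / (1.0 - g), 0.0, 1.0))

# Toy example on a 4-option multiple-choice benchmark
before = [1, 1, 0, 0, 1, 0]
after = [1, 0, 1, 0, 1, 1]
f, bt = transition_rates(before, after)
print(f"forgetting={f:.2f}, backward transfer={bt:.2f}")
print(f"chance-adjusted pre-accuracy={chance_adjusted_accuracy(np.mean(before), 4):.2f}")
```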

Experiments

Interactive Plots: Categories can be hidden or displayed by clicking their labels. Zoom by clicking and dragging while the zoom tool is selected; rotate the plot by clicking and dragging an axis.

Domain Continual-Pretraining

Forgetting (left) and back-transfer (right) incurred by domain continual-pretraining. Low-to-moderate forgetting across categories and model families with limited backward transfer; scaling model size marginally reduces forgetting.

Instruction-Tuning

Forgetting (left) and back-transfer (right) incurred by instruction tuning. There is low-to-moderate forgetting across most categories with moderate backward transfer in Math and Logic. Forgetting improves (decreases) marginally with increasing model scale, whereas backward transfer tends to decrease.

Reasoning Models from Base Models

Forgetting (left) and back-transfer (right) incurred by reasoning training from a base model. Yields generally low-to-moderate forgetting but large backward-transfer gains in the Math and Logic categories across model families; both effects improve with increasing model scale. Reasoning training yields lower forgetting and larger backward transfer than instruction tuning.

Reasoning Models from Instruction-Tuned Models (Low Data)

Forgetting (left) and back-transfer (right) incurred by reasoning training from an instruction-tuned model on small amounts of data. Yields low forgetting and backward transfer. We also evaluate on behavioral benchmarks for this category (e.g., safety).

Reasoning Models from Instruction-Tuned Models (High Data)

Forgetting (left) and back-transfer (right) incurred by reasoning training from an instruction-tuned model on large amounts of data. No dominant factor robustly describes the forgetting and backward-transfer dynamics. We also evaluate on behavioral benchmarks for this category (e.g., safety).

Model Merging for Mitigation?

Recent work shows that offline model merging can combine capabilities from multiple models. Unlike classical continual learning, it requires neither the original training data nor the ability to resume training, which makes it practical in resource-constrained settings.

Setup: We evaluate Exponential Moving Average (EMA) merging; in the two-checkpoint case this reduces to linear interpolation (LERP). Prior large-scale studies find these simple schemes effective for continual learning with foundation models. Our experiments compare LERP and SLERP across OpenThinker-7B, OpenThinker3-7B, and Qwen2.5-Coder-7B, together with their base checkpoints.
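
For illustration, the sketch below shows how the two merging schemes can be implemented over a pair of checkpoints' parameter tensors. It is a simplified stand-in for our merging code: SLERP is applied per flattened tensor with a LERP fallback for near-parallel directions, and the model identifiers in the commented usage are only indicative.

```python
import torch

def lerp_merge(state_a, state_b, alpha):
    """Linear interpolation (the two-checkpoint case of EMA merging):
    theta = (1 - alpha) * theta_a + alpha * theta_b."""
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

def slerp_merge(state_a, state_b, alpha, eps=1e-8):
    """Spherical linear interpolation, applied per parameter tensor
    (a common simplification); falls back to LERP when the two tensors
    point in nearly the same direction."""
    merged = {}
    for k in state_a:
        a = state_a[k].flatten().float()
        b = state_b[k].flatten().float()
        cos = torch.dot(a, b) / (a.norm() * b.norm() + eps)
        omega = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        if omega.abs() < 1e-4:  # nearly parallel: plain LERP is numerically stable
            out = (1 - alpha) * a + alpha * b
        else:
            out = (torch.sin((1 - alpha) * omega) * a
                   + torch.sin(alpha * omega) * b) / torch.sin(omega)
        merged[k] = out.reshape(state_a[k].shape).to(state_a[k].dtype)
    return merged

# Usage sketch (identifiers are illustrative):
# model_a = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# model_b = AutoModelForCausalLM.from_pretrained("open-thoughts/OpenThinker-7B")
# merged = lerp_merge(model_a.state_dict(), model_b.state_dict(), alpha=0.8)
# model_a.load_state_dict(merged)
```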

Experiments

Interactive Plots: Categories can be hidden or displayed by clicking their labels. Zoom by clicking and dragging while the zoom tool is selected; rotate the plot by clicking and dragging an axis.

Failure Case: Qwen2.5 and Qwen2.5 Coder Merge

Forgetting (left) and back-transfer (right) relative to the Qwen2.5 Base model.

Forgetting (left) and back-transfer (right) relative to the Qwen2.5 Coder model. In both cases large forgetting is observed.

Failure Case: Qwen2.5 Instruct and OpenThinker3 Merge

Forgetting (left) and back-transfer (right) relative to the Qwen2.5 Instruct model.

Forgetting (left) and back-transfer (right) relative to the OpenThinker3 model. Sample-level inspection shows that the merged model often repeats words and phrases without producing an answer.

Moderate Success Case: Qwen2.5 Instruct and OpenThinker Merge

Forgetting (left) and back-transfer (right) relative to the Qwen2.5 Instruct model. We see marginal overall improvements for Linear (0.8).

Forgetting (left) and back-transfer (right) relative to the OpenThinker model. We see marginal overall improvements for Linear (0.2) and Linear (0.8).

BibTeX

@misc{harmon2025postforgetting,
  title={Mapping Post-Training Forgetting in Language Models at Scale},
  author={Jackson Harmon and Andreas Hochlehnert and Matthias Bethge and Ameya Prabhu},
  year={2025},
  eprint={2510.17776},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.17776},
}