Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: forgetting one fact (e.g., a U.S. president or an API call) does not “average out” by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1→0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0→1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and instruction tuning yield moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) applying RL/SFT to instruction-tuned models is sensitive to data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale, enabling progress toward generally capable AI systems.
We define forgetting as items that are answered correctly before a post-training stage but incorrectly afterward (1→0 transitions), and backward transfer as items that are answered incorrectly before but correctly after post-training (0→1 transitions). A further complication is that most knowledge-intensive LLM evaluation benchmarks are multiple-choice. Random guessing inflates accuracy and can create illusory transitions: an apparent 1→0 may simply be a lucky guess that later becomes an incorrect answer, even when the underlying knowledge did not change; likewise for 0→1 transitions. When the answer is chosen from only a few options (e.g., four), random guessing can account for a substantial share of observed transitions, distorting both level and trend estimates of forgetting. A principled metric should therefore correct for chance: our chance-adjusted variants subtract the expected contribution of random guessing from the pre- and post-training accuracies.
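To make the bookkeeping concrete, here is a minimal sketch of the sample-wise transition counts, plus one common form of chance correction (subtracting the guessing baseline g = 1/k for k answer options and rescaling). The function names and the exact rescaling are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def transition_rates(pre_correct, post_correct):
    """Sample-wise 1->0 (forgetting) and 0->1 (backward transfer) rates
    between pre- and post-training per-item correctness vectors."""
    pre = np.asarray(pre_correct, dtype=bool)
    post = np.asarray(post_correct, dtype=bool)
    forgetting = np.mean(pre & ~post)         # correct before, incorrect after
    backward_transfer = np.mean(~pre & post)  # incorrect before, correct after
    return float(forgetting), float(backward_transfer)

def chance_adjust(accuracy, num_options=4):
    """One common chance correction for k-way multiple choice:
    subtract the guessing baseline g = 1/k, then rescale to [0, 1].
    (Illustrative; the paper's exact adjustment may differ.)"""
    g = 1.0 / num_options
    return max(0.0, (accuracy - g) / (1.0 - g))

# Toy example: 5 items scored before and after post-training.
pre  = [1, 1, 0, 0, 1]
post = [1, 0, 0, 1, 1]
f, b = transition_rates(pre, post)  # f = 0.2 (one 1->0), b = 0.2 (one 0->1)
```

The point of the sample-wise view is visible even in the toy example: aggregate accuracy is 0.6 both before and after, so a task average would report “no change” while one item was forgotten and another gained.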
Recent work shows that offline model merging can combine capabilities from multiple models. Unlike classical continual learning, it requires neither the original training data nor the ability to resume training, making it practical in resource-constrained settings.
Setup We evaluate Exponential Moving Average (EMA) merging; in the two-checkpoint case this reduces to linear interpolation (LERP). Prior large-scale studies find such simple schemes effective for continual learning with foundation models. Our experiments compare LERP and SLERP (spherical linear interpolation) across OpenThinker-7B, OpenThinker3-7B, and Qwen2.5-Coder-7B, together with their base checkpoints.
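For reference, a minimal sketch of the two merging schemes on PyTorch state dicts, assuming both checkpoints share identical keys and shapes; SLERP is applied per flattened tensor. This is an illustrative implementation under those assumptions, not the paper's merging code.

```python
import torch

def lerp(sd_a, sd_b, alpha=0.5):
    """Linear interpolation of two state dicts: (1 - alpha) * A + alpha * B."""
    return {k: torch.lerp(sd_a[k].float(), sd_b[k].float(), alpha).to(sd_a[k].dtype)
            for k in sd_a}

def slerp(sd_a, sd_b, alpha=0.5, eps=1e-8):
    """Spherical linear interpolation, applied per-tensor on flattened weights.
    Falls back to LERP when the two tensors are nearly colinear."""
    merged = {}
    for k in sd_a:
        a = sd_a[k].float().flatten()
        b = sd_b[k].float().flatten()
        cos = torch.dot(a, b) / (a.norm() * b.norm() + eps)
        omega = torch.arccos(cos.clamp(-1.0 + eps, 1.0 - eps))  # angle between A and B
        if omega < 1e-4:  # nearly colinear: SLERP degenerates to LERP
            out = torch.lerp(a, b, alpha)
        else:
            sin_omega = torch.sin(omega)
            out = (torch.sin((1 - alpha) * omega) / sin_omega) * a \
                + (torch.sin(alpha * omega) / sin_omega) * b
        merged[k] = out.reshape(sd_a[k].shape).to(sd_a[k].dtype)
    return merged

# Usage: merge a fine-tuned checkpoint back toward its base model.
# merged = slerp(model_ft.state_dict(), model_base.state_dict(), alpha=0.5)
# model_ft.load_state_dict(merged)
```

LERP averages weights in the ambient parameter space, while SLERP interpolates along the arc between the two weight vectors, preserving their norms more faithfully when the checkpoints point in different directions.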
@misc{harmon2025postforgetting,
  title={Mapping Post-Training Forgetting in Language Models at Scale},
  author={Jackson Harmon and Andreas Hochlehnert and Matthias Bethge and Ameya Prabhu},
  year={2025},
  eprint={2510.17776},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.17776},
}