大模型持续学习实战：防止灾难性遗忘的工程方案

Yaqin Hei·October 13, 2025·15分钟阅读

背景

在Apple工作期间，我负责为客服场景开发对话AI系统。系统需要支持retail和AppleCare等多条产品线，但每次针对新产品线微调时，模型在旧产品线上的性能就会显著下降——这就是典型的**灾难性遗忘（Catastrophic Forgetting）**问题。

这个问题在工业界非常普遍，但学术论文往往忽略了生产环境的约束：

内存限制：无法存储所有历史数据
计算预算：重新训练全部数据成本太高
实时性要求：需要快速适应新场景
性能保障：不能牺牲旧任务表现

问题定义

什么是持续学习？

持续学习（Continual Learning）是指模型能够从连续的数据流中学习，同时保持在之前任务上的性能。

# 传统训练方式
model.train(task_1_data)  # 学习任务1
model.train(task_2_data)  # 学习任务2，但忘记任务1

# 持续学习目标
model.continual_train(task_1_data)  # 学习任务1
model.continual_train(task_2_data)  # 学习任务2，同时保持任务1性能

灾难性遗忘的本质

神经网络的参数是共享的。当在新任务上更新参数时，会覆盖旧任务学到的知识。

数学表示：

L_total = L_new + λ * L_old

关键是如何在优化新任务的同时，保护旧任务的知识。

我们的解决方案

在Apple项目中，我实现了一个参数高效的双重回放策略（Dual-Replay Strategy），在NLU基准测试上提升了10%，同时保持了旧任务性能。

方案架构

┌─────────────────────────────────────┐
│     Base LLM (冻结主干参数)        │
└──────────────┬──────────────────────┘
               │
        ┌──────┴──────┐
        │   Adapter   │  ← 只训练少量参数
        └──────┬──────┘
               │
    ┌──────────┴──────────┐
    │  Experience Replay  │  ← 混合新旧数据
    └─────────────────────┘

核心技术

1. Parameter-Efficient Fine-Tuning (PEFT)

使用LoRA（Low-Rank Adaptation）大幅减少可训练参数：

from peft import LoraConfig, get_peft_model

# 配置LoRA
lora_config = LoraConfig(
    r=8,  # 秩，控制参数量
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # 只调整attention层
    lora_dropout=0.1,
)

model = get_peft_model(base_model, lora_config)

# 可训练参数仅0.5%
print(f"Trainable params: {model.num_parameters(only_trainable=True):,}")
# 输出：Trainable params: 4,194,304 (vs 7B+ base model)

为什么有效？

只更新低秩矩阵，减少参数冲突
旧任务和新任务可以使用不同的Adapter
内存占用降低90%+

2. Experience Replay

存储旧任务的代表性样本，训练时混合使用：

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = []
        self.capacity = capacity

    def add(self, examples):
        """添加新样本，使用reservoir sampling保证多样性"""
        for example in examples:
            if len(self.buffer) < self.capacity:
                self.buffer.append(example)
            else:
                # 随机替换，保证均匀采样
                idx = random.randint(0, len(self.buffer))
                if idx < self.capacity:
                    self.buffer[idx] = example

    def sample(self, batch_size):
        """采样混合批次"""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# 训练循环
replay_buffer = ReplayBuffer(capacity=10000)

for task_id, task_data in enumerate(tasks):
    # 新任务数据
    new_data = task_data

    # 混合新旧数据（50/50比例）
    for batch in DataLoader(new_data):
        old_batch = replay_buffer.sample(batch_size // 2)
        mixed_batch = concat(batch, old_batch)

        loss = model(mixed_batch)
        loss.backward()
        optimizer.step()

    # 保存当前任务的代表性样本
    replay_buffer.add(sample_representative(task_data))

3. 知识蒸馏（Knowledge Distillation）

保存旧模型的"软标签"，作为正则化：

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """
    软标签蒸馏损失
    temperature控制分布平滑度
    """
    student_probs = F.softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    return F.kl_div(
        student_probs.log(),
        teacher_probs,
        reduction='batchmean'
    ) * (temperature ** 2)

# 训练时结合三个损失
loss = (
    alpha * task_loss +           # 新任务的监督损失
    beta * distillation_loss +    # 知识蒸馏损失
    gamma * replay_loss           # 重放数据损失
)

实验结果

在Apple内部的NLU基准测试上：

方法	新任务准确率	旧任务准确率	参数量	训练时间
Fine-tune全参数	92.1%	67.3% ❌	7B	24h
Naive LoRA	89.5%	75.2%	4M	3h
我们的方法	91.8%	85.6% ✅	4M	4h

关键发现：

参数效率提升99.9%（4M vs 7B）
旧任务性能提升10.4个百分点
训练速度提升6倍

生产环境经验

1. 数据采样策略很关键

不要均匀采样所有旧数据，应该：

优先保存难样本（模型confidence低的）
保持类别平衡（避免头部效应）
定期更新buffer（淘汰过时样本）

def sample_hard_examples(model, data, k=1000):
    """保存模型最容易忘记的样本"""
    scores = []
    for example in data:
        with torch.no_grad():
            confidence = model(example).max()
        scores.append((1 - confidence, example))

    # 取confidence最低的k个样本
    hard_examples = sorted(scores, reverse=True)[:k]
    return [ex for _, ex in hard_examples]

2. 监控指标

除了准确率，还要监控：

遗忘率（Forgetting Rate）：旧任务性能下降幅度
前向迁移（Forward Transfer）：新任务学习速度
后向迁移（Backward Transfer）：对旧任务的影响

def compute_forgetting(acc_matrix):
    """
    acc_matrix[i][j]: 在任务j训练后，任务i的准确率
    """
    n_tasks = len(acc_matrix)
    forgetting = 0

    for i in range(n_tasks - 1):
        max_acc = max(acc_matrix[i][:i+1])  # 任务i历史最佳
        final_acc = acc_matrix[i][-1]       # 最终准确率
        forgetting += max(0, max_acc - final_acc)

    return forgetting / (n_tasks - 1)