FoSSIL: A Unified Framework for Continual Semantic Segmentation in 2D and 3D Domains


Abstract

Continual semantic segmentation remains an underexplored area in both 2D and 3D domains. The problem becomes particularly challenging when classes and domains evolve over time and incremental classes arrive with only a few labeled samples. In this setting, the model must simultaneously address catastrophic forgetting of old classes, overfitting due to the limited labeled data for new classes, and domain shifts arising from changes in the data distribution. Existing methods fail to address these real-world constraints simultaneously. We introduce the FoSSIL framework, which integrates guided noise injection and prototype-guided pseudo-label refinement (PLR) to enhance continual learning across class-incremental (CIL), domain-incremental (DIL), and few-shot scenarios. Guided noise injection perturbs parameters with saturated or overfitted gradients more strongly, while perturbing parameters with large or rapidly changing gradients less. This preserves critical weights and lets less critical parameters explore alternative solutions in the parameter space, mitigating forgetting, reducing overfitting, and improving robustness to domain shifts. For incremental classes with unlabeled data, PLR enables semi-supervised learning by refining pseudo-labels and filtering out incorrect high-confidence predictions, ensuring reliable supervision for incremental classes. Together, these components work synergistically to enhance stability, generalization, and continual learning across all learning regimes.


Overview

Class-incremental (CI), few-shot (FS), and domain-incremental (DI) constraints each lead to significantly reduced Dice scores compared to the unconstrained baseline (“no constraint”) on a 3D U-Net model.

Performance of various backbones on the FoSSIL benchmark with partially frozen or fully unfrozen weights. Session 1 denotes the first incremental session following the common base session; in Session 1, performance drops regardless of whether the model weights are partially frozen or fully unfrozen.

FoSSIL Framework

Problem Definition

We formalize the multi-constrained continual learning problem for semantic segmentation as a sequence of sessions \( \mathcal{S} = \{\mathcal{S}_0, \mathcal{S}_1, \dots, \mathcal{S}_T\} \) where each session \( \mathcal{S}_t \) has a semantic class space \( \mathcal{C}_t \) and domain distribution \( \mathcal{D}_t \). Each session has a dataset \( \mathbb{D}_t = \{(x_i, y_i)\}_{i=1}^{N_t} \) with input images \( x_i \in \mathcal{X} \) and pixel-wise labels \( y_i \in \mathcal{Y}_t \subseteq \mathcal{C}_t \).

Classes and Domains across Sessions

Few-shot Learning

Incremental sessions (\( t \neq 0 \)) have few-shot labeled data \( N_t \ll N_0 \), with \( N_t = K \cdot |\mathcal{C}_t| \), where \( K \in \{5,10,20,30\} \). Each session may also access unlabeled data \( \mathcal{U}_t = \{x_j^{(u)}\}_{j=1}^{M_t} \), where \( M_t \gg N_t \).

Exemplar-Free Prototype Replay

For each class \( c \in \mathcal{C}_t \), a compact prototype is extracted from the feature space using the model's feature extractor \( \phi \). For each sample \( (x_i, y_i) \), the embedding is \( E_i = \phi(x_i) \), and \( \mathcal{F}_c^{(i)} \) denotes the set of feature vectors in \( E_i \) belonging to class \( c \). The per-sample prototype is the mean of these features:

\( p_c^{(i)} = \frac{1}{|\mathcal{F}_c^{(i)}|} \sum_{\mathbf{f} \in \mathcal{F}_c^{(i)}} \mathbf{f} \)

The class prototype averages the normalized per-sample prototypes over \( NS_c \), the set of samples containing class \( c \):

\( P_c = \frac{1}{|NS_c|} \sum_{j \in NS_c} \frac{p_c^{(j)}}{\|p_c^{(j)}\|_2} \)

The prototype replay loss is:

\( \mathcal{L}_{\text{proto}} = \sum_{P_c \in \mathcal{P}_t} \mathcal{L}_{\text{CE}}(F(P_c), c) \)
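The prototype extraction above can be sketched in a few lines. A minimal NumPy version, assuming per-sample feature maps of shape (D, H, W) and boolean class masks (the shapes and names are illustrative, not the repository's actual API):

```python
import numpy as np

def class_prototype(embeddings, masks):
    """Compute a class prototype P_c from per-sample embeddings.

    embeddings: list of (D, H, W) feature maps E_i = phi(x_i)
    masks:      list of (H, W) boolean masks selecting pixels of class c
    Returns the average of L2-normalized per-sample prototypes.
    """
    per_sample = []
    for E, m in zip(embeddings, masks):
        feats = E[:, m]                            # (D, |F_c^{(i)}|) features of class c
        if feats.size == 0:                        # sample does not contain class c
            continue
        p = feats.mean(axis=1)                     # p_c^{(i)}: mean class feature
        per_sample.append(p / np.linalg.norm(p))   # normalize before averaging
    if not per_sample:
        return None                                # class c absent from all samples
    return np.mean(per_sample, axis=0)             # P_c: average over NS_c
```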

Guided Noise Injection

Weights are perturbed according to gradient-based guidance. Let \( G_{ij} \) denote the gradient magnitude associated with weight \( \mathbb{W}_{ij} \); its inverse is computed, normalized, and used to scale element-wise Gaussian noise:

\( G_{ij}^{-1} = \frac{1}{G_{ij} + \epsilon} \)

\( \tilde{G}_{ij}^{-1} = \frac{1 + G_{ij}^{-1} - \min(G^{-1})}{1 + \max(G^{-1}) - \min(G^{-1})} \)

\( \tilde{\mathbb{W}} = \mathbb{W} + \tilde{G}^{-1} \odot \mathcal{N}(0, I) \)
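A minimal NumPy sketch of this update, assuming \( G \) holds per-parameter gradient magnitudes accumulated during training (variable names are illustrative):

```python
import numpy as np

def guided_noise_injection(W, G, eps=1e-8, rng=None):
    """Perturb weights W with noise scaled inversely to gradient magnitude G.

    Parameters with small (saturated or overfitted) gradients receive larger
    noise; parameters with large gradients are barely perturbed, preserving
    critical weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    G_inv = 1.0 / (G + eps)                                   # G_ij^{-1}
    scale = (1.0 + G_inv - G_inv.min()) / (1.0 + G_inv.max() - G_inv.min())
    return W + scale * rng.standard_normal(W.shape)           # element-wise guided noise
```

Note that when all gradients are equal, the normalized scale collapses to 1 and the update reduces to plain Gaussian perturbation.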

Prototype-Guided Pseudo-Label Refinement (PLR)

PLR uses class prototypes to refine pseudo-labels for unlabeled data \( x_j^{(u)} \), ensuring reliable supervision for incremental classes. Student and teacher networks generate pseudo-labels:

\( \hat{y}_s, \mathcal{F}_s = M_s(x_j^{(u)}), \quad \hat{y}_t, \mathcal{F}_t = M_t(x_j^{(u)}) \)

Pseudo-labels are retained only if confidence and feature similarity with prototypes meet thresholds:

\( \text{valid}(p,q) = (\text{conf}(p,q) > \tau_{\text{conf}}) \land (\text{sim}(p,q) > \tau_{\text{sim}}) \)

The consistency loss is computed over the set of validated pixels \( \mathcal{V} \):

\( \mathcal{L}_{\text{consistency}} = \frac{1}{|\mathcal{V}|} \sum_{(p,q) \in \mathcal{V}} \|\hat{y}_s(p,q) - \hat{y}_t(p,q)\|_2^2 \)
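A simplified NumPy sketch of the validity check and consistency loss, assuming (C, H, W) softmax outputs from the student and teacher, (D, H, W) pixel features, and L2-normalized prototypes of shape (C, D); the threshold values shown are illustrative, not the paper's:

```python
import numpy as np

def plr_consistency(student_prob, teacher_prob, feats, prototypes,
                    tau_conf=0.9, tau_sim=0.7):
    """Prototype-guided pseudo-label filtering plus consistency loss (sketch).

    student_prob, teacher_prob: (C, H, W) softmax maps from M_s / M_t
    feats: (D, H, W) pixel features; prototypes: (C, D), L2-normalized
    """
    conf = teacher_prob.max(axis=0)                 # per-pixel confidence
    labels = teacher_prob.argmax(axis=0)            # hard pseudo-labels
    # cosine similarity between each pixel feature and its pseudo-label prototype
    f = feats / (np.linalg.norm(feats, axis=0, keepdims=True) + 1e-8)
    sim = np.einsum('dhw,hwd->hw', f, prototypes[labels])
    valid = (conf > tau_conf) & (sim > tau_sim)     # valid(p, q)
    if not valid.any():
        return 0.0
    diff = student_prob - teacher_prob              # y_s - y_t per pixel
    return float((diff[:, valid] ** 2).sum(axis=0).mean())  # mean over V
```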


Dataset Details

FoSSIL Datasets

Dataset Description Download
Med FoSSIL-Disjoint Disjoint classes & medical domains across sessions Zenodo
Med FoSSIL-Mixed Classes & domains may reappear; multi-domain per session Zenodo
Med Semi-Supervised-FoSSIL Incremental classes with unlabeled data Zenodo
Natural-FoSSIL Multiple driving domains; class-incremental & FSCIL Zenodo
Semi-Supervised Natural-FoSSIL Incremental classes with unlabeled data Zenodo
Detection Multi-domain COCO-O dataset for detection Google Drive

Incremental Sessions

Setting Session 0 (Base) Session 1 Session 2 Session 3 Session 4 Session 5
Med FoSSIL-Disjoint 15 (TS) 5 (AMOS) 6 (BCV) 4 (MOTS) 3 (BraTS) 4 (VerSe)
Med FoSSIL-Mixed 10 (AMOS) 8 (BCV, MOTS) 6 (TS, AMOS) 4 (MOTS, TS) 7 (BraTS, VerSe) --
Med SS-FoSSIL 15 (TS) 5 (AMOS) 6 (BCV) 4 (MOTS) 3 (BraTS) 4 (VerSe)
Natural-FoSSIL 10 (BDD) 5 (IDD) 5 (BDD, IDD) -- -- --
SS Natural-FoSSIL 10 (BDD) 2 (Cityscapes) 2 (IDD) 3 (IDD) -- --

Results

Performance of baselines on Med FoSSIL-Disjoint benchmark (3 sessions)

Results are reported as Dice coefficients (0–1).
PD (Performance Drop rate) = ((Session 0 − Session 2) / Session 0) × 100.

Method Session 0 Session 1 Session 2 PD (↓)
PIFS 0.700 0.129 0.078 88.9
NC-FSCIL 0.394 0.077 0.081 79.4
CLIP-CT 0.475 0.186 0.141 70.3
MiB 0.700 0.271 0.096 86.3
MDIL 0.779 0.115 0.097 87.6
C-FSCIL 0.787 0.334 0.297 62.3
SoftNet 0.820 0.305 0.146 82.2
GAPS 0.700 0.334 0.253 63.9
FSCIL-SS 0.700 0.115 0.089 87.3
Subspace 0.257 0.054 0.040 84.4
Gen-Replay 0.700 0.076 0.102 85.4
FeCAM 0.700 0.048 0.042 94.0
FACT 0.357 0.071 0.028 92.2
MAML 0.700 0.001 0.059 91.6
MAML + Reg. 0.700 0.001 0.062 91.1
MTL 0.700 0.079 0.088 87.4
UnSupCL 0.700 0.039 0.088 87.4
SupCL 0.700 0.058 0.042 94.0
UnSupCL-HNM 0.700 0.035 0.068 90.3
FoSSIL (U-Net) 0.736 0.460 0.398 45.9
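As a quick consistency check, the PD column can be recomputed from the Session 0 and Session 2 scores:

```python
def pd_rate(session0, session_final):
    """Performance Drop rate: relative drop from the base to the final session."""
    return (session0 - session_final) / session0 * 100

# FoSSIL (U-Net) and PIFS rows from the table above:
print(round(pd_rate(0.736, 0.398), 1))  # 45.9
print(round(pd_rate(0.700, 0.078), 1))  # 88.9
```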

Performance on Natural-FoSSIL benchmark

All values are reported as mIoU (0–100).
PD (Performance Drop rate) = ((Session 0 − Session 2) / Session 0) × 100.

Method Session 0 Session 1 Session 2 PD (↓)
DeepLab Vanilla 47.76 2.18 3.86 91.9
GAPS 47.76 23.42 16.68 65.1
MiB 47.76 2.50 2.37 95.0
MDIL 48.54 1.59 3.02 93.8
SAM Vanilla 66.03 2.6 30.81 53.3
FoSSIL (SAM) 66.03 3.2 31.22 52.7

Performance of recent prototype replay-based and semi-supervised methods

Results reported as Dice coefficients (0–1).

Benchmark dataset / Method Session 0 Session 1
Med FoSSIL-Disjoint (U-Net)
Saving100x 0.700 0.072
Adaptive Prototype 0.700 0.044
FoSSIL 0.736 0.460
Med Semi-Supervised-FoSSIL (MedFormer)
CSL 0.659 0.040
FoSSIL 0.640 0.431

Performance of FoSSIL on the Med FoSSIL-Disjoint benchmark and its variant, Med Semi-Supervised-FoSSIL (which includes additional unlabeled data), evaluated across incremental sessions (TS (Base) → AMOS → BCV → MOTS → BraTS → VerSe).

Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as Dice coefficients (0–1). The reported results confirm that unlabeled data helps boost performance.

SS refers to Semi-Supervised, and HM refers to Harmonic Mean. SS FoSSIL denotes the performance of FoSSIL on the Med Semi-Supervised-FoSSIL benchmark.

Method AMOS BCV MOTS BraTS VerSe
Seen New HM Seen New HM Seen New HM Seen New HM Seen New HM
FoSSIL 0.610 0.074 0.132 0.477 0.069 0.120 0.382 0.180 0.245 0.043 0.198 0.071 0.367 0.119 0.180
SS FoSSIL 0.706 0.099 0.174 0.561 0.065 0.116 0.444 0.218 0.292 0.042 0.216 0.070 0.391 0.182 0.248
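The HM columns are the harmonic mean of the Seen and New scores, which can be verified directly:

```python
def harmonic_mean(seen, new):
    """Harmonic mean of Seen and New performance (HM columns above)."""
    return 2 * seen * new / (seen + new)

# FoSSIL on AMOS: Seen 0.610, New 0.074 -> HM 0.132
print(round(harmonic_mean(0.610, 0.074), 3))  # 0.132
```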

Performance of FoSSIL on the Semi-Supervised Natural-FoSSIL benchmark (which includes additional unlabeled data), evaluated across incremental sessions (BDD100K (Base) → Cityscapes → IDD → IDD (with different classes from the previous session)).

Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as mIoU (0–100). The results show that pseudo-label refinement (PLR) enhances the performance of FoSSIL compared to FoSSIL without pseudo-label refinement (w/o PLR).

HM refers to Harmonic Mean.

Method Cityscapes IDD IDD
Seen New HM Seen New HM Seen New HM
FoSSIL + GAPS 27.33 29.98 28.59 26.00 49.82 34.17 28.88 24.23 26.35
FoSSIL + GAPS w/o PLR 25.62 27.88 26.70 23.46 41.50 29.97 27.54 15.74 20.03

Performance of FoSSIL (MedFormer) on the Med FoSSIL-Mixed benchmark, evaluated across incremental sessions (Session 0 (Base) → Session 1 → Session 2 → Session 3 → Session 4).

Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as Dice coefficients (0–1). The results show that guided noise injection (GNI) enhances the performance of FoSSIL compared to FoSSIL without guided noise injection (w/o GNI).

HM refers to Harmonic Mean.

Method Session 1 Session 2 Session 3 Session 4
Seen New HM Seen New HM Seen New HM Seen New HM
FoSSIL 0.534 0.159 0.245 0.398 0.076 0.128 0.329 0.046 0.081 0.289 0.038 0.067
FoSSIL w/o GNI 0.105 0.010 0.018 0.069 0.015 0.025 0.061 0.034 0.043 0.099 0.008 0.014

Baselines

Class-Incremental Semantic Segmentation

These methods cannot handle few-shot learning or domain shifts.

Domain-Incremental Learning

Maintaining a separate decoder per domain limits scalability, and these methods cannot handle few-shot incremental classes.

Few-shot Class-Incremental Learning

Although effective for class-incremental and few-shot learning, these methods are not robust to domain shifts.

Semi-Supervised Based Methods

These methods cannot jointly address domain shifts, class-incremental learning, and few-shot learning.

Others (Representation Learning, Few-shot Learning, Meta-Learning, Active Learning, Domain-shift)