Continual semantic segmentation remains underexplored in both 2D and 3D domains. The problem becomes particularly challenging when classes and domains evolve over time and incremental classes arrive with only a few labeled samples. In this setting, the model must simultaneously address catastrophic forgetting of old classes, overfitting caused by the limited labeled data for new classes, and domain shifts arising from changes in the data distribution. Existing methods fail to address all of these real-world constraints at once. We introduce the FoSSIL framework, which integrates guided noise injection (GNI) and prototype-guided pseudo-label refinement (PLR) to enhance continual learning across class-incremental (CIL), domain-incremental (DIL), and few-shot scenarios. GNI perturbs parameters with saturated or overfitted gradients more strongly and parameters with large or rapidly changing gradients less; this preserves critical weights while allowing less critical parameters to explore alternative solutions in the parameter space, mitigating forgetting, reducing overfitting, and improving robustness to domain shifts. For incremental classes with unlabeled data, PLR enables semi-supervised learning by refining pseudo-labels and filtering out incorrect high-confidence predictions, providing reliable supervision for incremental classes. Together, these components work synergistically to enhance stability, generalization, and continual learning across all learning regimes.
We formalize the multi-constrained continual learning problem for semantic segmentation as a sequence of sessions \( \mathcal{S} = \{\mathcal{S}_0, \mathcal{S}_1, \dots, \mathcal{S}_T\} \), where each session \( \mathcal{S}_t \) has a semantic class space \( \mathcal{C}_t \) and a domain distribution \( \mathcal{D}_t \). Each session provides a dataset \( \mathbb{D}_t = \{(x_i, y_i)\}_{i=1}^{N_t} \) of input images \( x_i \in \mathcal{X} \) and pixel-wise label maps \( y_i \) over the label space \( \mathcal{Y}_t \subseteq \mathcal{C}_t \).
Incremental sessions (\( t \neq 0 \)) have few-shot labeled data \( N_t \ll N_0 \), with \( N_t = K \cdot |\mathcal{C}_t| \), where \( K \in \{5,10,20,30\} \). Each session may also access unlabeled data \( \mathcal{U}_t = \{x_j^{(u)}\}_{j=1}^{M_t} \), where \( M_t \gg N_t \).
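As a concrete illustration of this setup, the sketch below organizes one session's few-shot labeled set and optional unlabeled pool in Python. The `SessionConfig` container and its field names are ours for illustration only and do not correspond to any released code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionConfig:
    """Illustrative container for one session S_t of the continual sequence."""
    session_id: int                  # t; t = 0 is the base session
    classes: List[int]               # semantic class space C_t
    shots_per_class: int             # K in {5, 10, 20, 30}; unused for the base session
    labeled_paths: List[str] = field(default_factory=list)    # N_t = K * |C_t| samples for t > 0
    unlabeled_paths: List[str] = field(default_factory=list)  # optional pool U_t with M_t >> N_t

    @property
    def num_labeled(self) -> int:
        return len(self.labeled_paths)
```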
For each class \( c \in \mathcal{C}_t \), a compact prototype is extracted from the feature space using the model's feature extractor \( \phi \). For each sample \( (x_i, y_i) \), the embedding is \( E_i = \phi(x_i) \), and \( \mathcal{F}_c^{(i)} \) denotes the set of feature vectors in \( E_i \) at pixels labeled \( c \). The per-image and session-level prototypes are computed as:
\( p_c^{(i)} = \frac{1}{|\mathcal{F}_c^{(i)}|} \sum_{\mathbf{f} \in \mathcal{F}_c^{(i)}} \mathbf{f} \)
\( P_c = \frac{1}{|NS_c|} \sum_{j \in NS_c} \frac{p_c^{(j)}}{\|p_c^{(j)}\|_2} \), where \( NS_c \) is the set of samples containing class \( c \).
The prototype replay loss replays the stored prototypes \( \mathcal{P}_t \) through the classifier \( F \):
\( \mathcal{L}_{\text{proto}} = \sum_{P_c \in \mathcal{P}_t} \mathcal{L}_{\text{CE}}(F(P_c), c) \)
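A minimal PyTorch sketch of the prototype computation and replay loss above is given below, assuming labels have been resized to the feature resolution. The helper names (`class_prototypes`, `aggregate_prototype`, `prototype_replay_loss`) are ours, and the classifier head \( F \) is represented by any module mapping feature vectors to class logits.

```python
import torch
import torch.nn.functional as F


def class_prototypes(features, labels, class_ids):
    """Per-image prototypes p_c^{(i)}: mean feature over pixels labeled c.

    features: (C, H, W) embedding E_i = phi(x_i); labels: (H, W) pixel labels,
    assumed resized to the feature resolution.
    """
    protos = {}
    for c in class_ids:
        mask = labels == c
        if mask.any():
            protos[c] = features[:, mask].mean(dim=1)   # p_c^{(i)}
    return protos


def aggregate_prototype(per_image_protos):
    """Session-level prototype P_c: mean of the L2-normalised per-image prototypes."""
    normalised = [F.normalize(p, dim=0) for p in per_image_protos]
    return torch.stack(normalised).mean(dim=0)


def prototype_replay_loss(classifier, prototypes):
    """Cross-entropy of the classifier head applied to the stored prototypes P_c.

    Assumes class ids index the classifier's output logits directly.
    """
    feats = torch.stack([prototypes[c] for c in sorted(prototypes)])
    targets = torch.tensor(sorted(prototypes), dtype=torch.long)
    return F.cross_entropy(classifier(feats), targets)
```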
Weights are perturbed according to gradient-based guidance, where \( G_{ij} \) is the gradient magnitude associated with weight \( \mathbb{W}_{ij} \) and \( \epsilon \) avoids division by zero:
\( G_{ij}^{-1} = \frac{1}{G_{ij} + \epsilon} \)
\( \tilde{G}_{ij}^{-1} = \frac{1 + G_{ij}^{-1} - \min(G^{-1})}{1 + \max(G^{-1}) - \min(G^{-1})} \)
\( \tilde{\mathbb{W}} = \mathbb{W} + \tilde{G}^{-1} \odot \mathcal{N}(0, I) \)
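The PyTorch sketch below applies these three steps in place. It assumes `grad_magnitudes` holds an accumulated gradient-magnitude tensor per parameter and that the min/max normalisation is computed per parameter tensor; the function name and these assumptions are ours.

```python
import torch


@torch.no_grad()
def guided_noise_injection(model, grad_magnitudes, eps=1e-8):
    """Perturb each weight with Gaussian noise scaled by its normalised inverse gradient magnitude.

    grad_magnitudes: dict mapping parameter name -> tensor G (same shape as the
    parameter) of accumulated gradient magnitudes. Weights with small gradients
    (saturated / overfitted) receive larger perturbations; weights with large or
    rapidly changing gradients are left largely intact.
    """
    for name, param in model.named_parameters():
        G = grad_magnitudes.get(name)
        if G is None:
            continue
        G_inv = 1.0 / (G + eps)                                      # G^{-1}
        g_min, g_max = G_inv.min(), G_inv.max()                      # per-tensor min/max (assumption)
        G_inv_norm = (1.0 + G_inv - g_min) / (1.0 + g_max - g_min)   # shifted min-max normalisation
        param.add_(G_inv_norm * torch.randn_like(param))             # W + G~^{-1} ⊙ N(0, I)
```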
PLR uses class prototypes to refine pseudo-labels for the unlabeled data \( x_j^{(u)} \), ensuring reliable supervision for incremental classes. The student network \( M_{\text{stu}} \) and the teacher network \( M_{\text{tea}} \) produce predictions and per-pixel features:
\( \hat{y}_s, \mathcal{F}_s = M_{\text{stu}}(x_j^{(u)}), \quad \hat{y}_t, \mathcal{F}_t = M_{\text{tea}}(x_j^{(u)}) \)
Pseudo-labels are retained only at pixels \( (p,q) \) where the prediction confidence and the feature similarity to the corresponding class prototype both exceed their thresholds:
\( \text{valid}(p,q) = (\text{conf}(p,q) > \tau_{\text{conf}}) \land (\text{sim}(p,q) > \tau_{\text{sim}}) \)
The consistency loss is computed over the set \( \mathcal{V} \) of validated pixels:
\( \mathcal{L}_{\text{consistency}} = \frac{1}{|\mathcal{V}|} \sum_{(p,q) \in \mathcal{V}} \|\hat{y}_s(p,q) - \hat{y}_t(p,q)\|_2^2 \)
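A possible PyTorch realisation of the validation rule and consistency loss is sketched below. The teacher call with `return_features=True`, the threshold defaults, and the use of cosine similarity between normalised features and prototypes are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def validate_pixels(teacher, x_u, prototypes, tau_conf=0.9, tau_sim=0.7):
    """Boolean mask of pixels whose teacher pseudo-label is confident and prototype-consistent.

    prototypes: (num_classes, D) L2-normalised class prototypes P_c; the teacher
    is assumed to return per-pixel features with the same channel dimension D.
    """
    logits_t, feats_t = teacher(x_u, return_features=True)       # hypothetical teacher API
    probs_t = logits_t.softmax(dim=1)                             # (B, num_classes, H, W)
    conf, pseudo = probs_t.max(dim=1)                             # confidence and hard pseudo-label
    feats_t = F.normalize(feats_t, dim=1)                         # (B, D, H, W)
    sim = torch.einsum('bdhw,kd->bkhw', feats_t, prototypes)      # similarity to every prototype
    sim_to_label = sim.gather(1, pseudo.unsqueeze(1)).squeeze(1)  # similarity to the pseudo-label's class
    return probs_t, (conf > tau_conf) & (sim_to_label > tau_sim)


def consistency_loss(probs_s, probs_t, valid):
    """Squared L2 distance between student and teacher predictions, averaged over valid pixels."""
    if not valid.any():
        return probs_s.new_zeros(())
    per_pixel = (probs_s - probs_t).pow(2).sum(dim=1)             # (B, H, W)
    return per_pixel[valid].mean()
```

In training, `consistency_loss` would typically be evaluated on the student's softmax output for the same unlabeled batch and added to the supervised segmentation loss with a weighting factor.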
Dataset | Description | Download |
---|---|---|
Med FoSSIL-Disjoint | Disjoint classes & medical domains across sessions | Zenodo |
Med FoSSIL-Mixed | Classes & domains may reappear; multi-domain per session | Zenodo |
Med Semi-Supervised-FoSSIL | Incremental classes with unlabeled data | Zenodo |
Natural-FoSSIL | Multiple driving domains; class-incremental & FSCIL | Zenodo |
Semi-Supervised Natural-FoSSIL | Incremental classes with unlabeled data | Zenodo |
Detection | Multi-domain COCO-O dataset for detection | Google Drive |
Setting | Session 0 (Base) | Session 1 | Session 2 | Session 3 | Session 4 | Session 5 |
---|---|---|---|---|---|---|
Med FoSSIL-Disjoint | 15 (TS) | 5 (AMOS) | 6 (BCV) | 4 (MOTS) | 3 (BraTS) | 4 (VerSe) |
Med FoSSIL-Mixed | 10 (AMOS) | 8 (BCV, MOTS) | 6 (TS, AMOS) | 4 (MOTS, TS) | 7 (BraTS, VerSe) | -- |
Med SS-FoSSIL | 15 (TS) | 5 (AMOS) | 6 (BCV) | 4 (MOTS) | 3 (BraTS) | 4 (VerSe) |
Natural-FoSSIL | 10 (BDD) | 5 (IDD) | 5 (BDD, IDD) | -- | -- | -- |
SS Natural-FoSSIL | 10 (BDD) | 2 (Cityscapes) | 2 (IDD) | 3 (IDD) | -- | -- |
Results are reported as Dice coefficients (0–1).
PD (Performance Drop rate) = ((Session 0 − Session 2) / Session 0) × 100.
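For example, a method that scores 0.736 at Session 0 and 0.398 at Session 2 (the FoSSIL (U-Net) row below) has PD = ((0.736 − 0.398) / 0.736) × 100 ≈ 45.9.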
Method | Session 0 | Session 1 | Session 2 | PD (↓) |
---|---|---|---|---|
PIFS | 0.700 | 0.129 | 0.078 | 88.9 |
NC-FSCIL | 0.394 | 0.077 | 0.081 | 79.4 |
CLIP-CT | 0.475 | 0.186 | 0.141 | 70.3 |
MiB | 0.700 | 0.271 | 0.096 | 86.3 |
MDIL | 0.779 | 0.115 | 0.097 | 87.6 |
C-FSCIL | 0.787 | 0.334 | 0.297 | 62.3 |
SoftNet | 0.820 | 0.305 | 0.146 | 82.2 |
GAPS | 0.700 | 0.334 | 0.253 | 63.9 |
FSCIL-SS | 0.700 | 0.115 | 0.089 | 87.3 |
Subspace | 0.257 | 0.054 | 0.040 | 84.4 |
Gen-Replay | 0.700 | 0.076 | 0.102 | 85.4 |
FeCAM | 0.700 | 0.048 | 0.042 | 94.0 |
FACT | 0.357 | 0.071 | 0.028 | 92.2 |
MAML | 0.700 | 0.001 | 0.059 | 91.6 |
MAML + Reg. | 0.700 | 0.001 | 0.062 | 91.1 |
MTL | 0.700 | 0.079 | 0.088 | 87.4 |
UnSupCL | 0.700 | 0.039 | 0.088 | 87.4 |
SupCL | 0.700 | 0.058 | 0.042 | 94.0 |
UnSupCL-HNM | 0.700 | 0.035 | 0.068 | 90.3 |
FoSSIL (U-Net) | 0.736 | 0.460 | 0.398 | 45.9 |
All values are reported as mIoU (0–100).
PD (Performance Drop rate) = ((Session 0 − Session 2) / Session 0) × 100.
Method | Session 0 | Session 1 | Session 2 | PD (↓) |
---|---|---|---|---|
DeepLab Vanilla | 47.76 | 2.18 | 3.86 | 91.9 |
GAPS | 47.76 | 23.42 | 16.68 | 65.1 |
MiB | 47.76 | 2.50 | 2.37 | 95.0 |
MDIL | 48.54 | 1.59 | 3.02 | 93.8 |
SAM Vanilla | 66.00 | 32.60 | 30.81 | 53.3 |
FoSSIL (SAM) | 66.00 | 33.20 | 31.22 | 52.7 |
Results reported as Dice coefficients (0–1).
Benchmark dataset / Method | Session 0 | Session 1 |
---|---|---|
Med FoSSIL-Disjoint (U-Net) | | |
Saving100x | 0.700 | 0.072 |
Adaptive Prototype | 0.700 | 0.044 |
FoSSIL | 0.736 | 0.460 |
Med Semi-Supervised-FoSSIL (MedFormer) | | |
CSL | 0.659 | 0.040 |
FoSSIL | 0.640 | 0.431 |
Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as Dice coefficients (0–1). The results confirm that leveraging unlabeled data boosts performance.
SS refers to Semi-Supervised, and HM refers to Harmonic Mean. SS FoSSIL denotes the performance of FoSSIL on the Med Semi-Supervised-FoSSIL benchmark.
Method | AMOS Seen | AMOS New | AMOS HM | BCV Seen | BCV New | BCV HM | MOTS Seen | MOTS New | MOTS HM | BraTS Seen | BraTS New | BraTS HM | VerSe Seen | VerSe New | VerSe HM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FoSSIL | 0.610 | 0.074 | 0.132 | 0.477 | 0.069 | 0.120 | 0.382 | 0.180 | 0.245 | 0.043 | 0.198 | 0.071 | 0.367 | 0.119 | 0.180 |
SS FoSSIL | 0.706 | 0.099 | 0.174 | 0.561 | 0.065 | 0.116 | 0.444 | 0.218 | 0.292 | 0.042 | 0.216 | 0.070 | 0.391 | 0.182 | 0.248 |
Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as mIoU (0–100). The results show that pseudo-label refinement (PLR) enhances the performance of FoSSIL compared to FoSSIL without pseudo-label refinement (w/o PLR).
HM refers to Harmonic Mean.
Method | Cityscapes (Session 1) Seen | Cityscapes (Session 1) New | Cityscapes (Session 1) HM | IDD (Session 2) Seen | IDD (Session 2) New | IDD (Session 2) HM | IDD (Session 3) Seen | IDD (Session 3) New | IDD (Session 3) HM |
---|---|---|---|---|---|---|---|---|---|
FoSSIL + GAPS | 27.33 | 29.98 | 28.59 | 26.00 | 49.82 | 34.17 | 28.88 | 24.23 | 26.35 |
FoSSIL + GAPS w/o PLR | 25.62 | 27.88 | 26.70 | 23.46 | 41.50 | 29.97 | 27.54 | 15.74 | 20.03 |
Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as Dice coefficients (0–1). The results show that guided noise injection (GNI) enhances the performance of FoSSIL compared to FoSSIL without guided noise injection (w/o GNI).
HM refers to Harmonic Mean.
Method | Session 1 Seen | Session 1 New | Session 1 HM | Session 2 Seen | Session 2 New | Session 2 HM | Session 3 Seen | Session 3 New | Session 3 HM | Session 4 Seen | Session 4 New | Session 4 HM |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FoSSIL | 0.534 | 0.159 | 0.245 | 0.398 | 0.076 | 0.128 | 0.329 | 0.046 | 0.081 | 0.289 | 0.038 | 0.067 |
FoSSIL w/o GNI | 0.105 | 0.010 | 0.018 | 0.069 | 0.015 | 0.025 | 0.061 | 0.034 | 0.043 | 0.099 | 0.008 | 0.014 |
These methods cannot handle few-shot learning or domain shifts.
A separate decoder per domain limits scalability, and these methods cannot handle few-shot incremental classes.
Although effective for class-incremental and few-shot learning, these methods are not robust to domain shifts.
These methods cannot jointly address domain shifts, class-incremental learning, and few-shot learning.