Continual semantic segmentation remains underexplored in both 2D and 3D domains. The problem becomes particularly challenging when classes and domains evolve over time and incremental classes arrive with only a few labeled samples. In this setting, the model must simultaneously address catastrophic forgetting of old classes, overfitting caused by the limited labeled data for new classes, and domain shifts arising from changes in the data distribution. Existing methods fail to address all of these real-world constraints at once. We introduce the FoSSIL framework, which integrates guided noise injection (GNI) and prototype-guided pseudo-label refinement (PLR) to enhance continual learning across class-incremental (CIL), domain-incremental (DIL), and few-shot scenarios. GNI perturbs parameters with saturated or overfitted gradients more strongly and parameters with large or rapidly changing gradients less; this preserves critical weights while allowing less critical parameters to explore alternative solutions in the parameter space, mitigating forgetting, reducing overfitting, and improving robustness to domain shifts. For incremental classes with unlabeled data, PLR enables semi-supervised learning by refining pseudo-labels and filtering out incorrect high-confidence predictions, providing reliable supervision for incremental classes. Together, these components work synergistically to enhance stability, generalization, and continual learning across all learning regimes.
We formalize the multi-constrained continual learning problem for semantic segmentation as a sequence of sessions \( \mathcal{S} = \{\mathcal{S}_0, \mathcal{S}_1, \dots, \mathcal{S}_T\} \), where each session \( \mathcal{S}_t \) has a semantic class space \( \mathcal{C}_t \) and a domain distribution \( \mathcal{D}_t \). Each session provides a dataset \( \mathbb{D}_t = \{(x_i, y_i)\}_{i=1}^{N_t} \) of input images \( x_i \in \mathcal{X} \) and pixel-wise label maps \( y_i \) over the label space \( \mathcal{Y}_t \subseteq \mathcal{C}_t \).
Incremental sessions (\( t \neq 0 \)) have few-shot labeled data \( N_t \ll N_0 \), with \( N_t = K \cdot |\mathcal{C}_t| \), where \( K \in \{5,10,20,30\} \). Each session may also access unlabeled data \( \mathcal{U}_t = \{x_j^{(u)}\}_{j=1}^{M_t} \), where \( M_t \gg N_t \).
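As a concrete illustration of this setup, the sketch below organizes one session's few-shot labeled set and optional unlabeled pool in Python. The `SessionConfig` container and its field names are ours for illustration only and do not correspond to any released code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionConfig:
    """Illustrative container for one session S_t of the continual sequence."""
    session_id: int                  # t; t = 0 is the base session
    classes: List[int]               # semantic class space C_t
    shots_per_class: int             # K in {5, 10, 20, 30}; unused for the base session
    labeled_paths: List[str] = field(default_factory=list)    # N_t = K * |C_t| samples for t > 0
    unlabeled_paths: List[str] = field(default_factory=list)  # optional pool U_t with M_t >> N_t

    @property
    def num_labeled(self) -> int:
        return len(self.labeled_paths)
```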
For each class \( c \in \mathcal{C}_t \), a compact prototype is extracted from the feature space using the model's feature extractor \( \phi \). For each sample \( (x_i, y_i) \), the embedding is \( E_i = \phi(x_i) \), and \( \mathcal{F}_c^{(i)} \) denotes the set of feature vectors in \( E_i \) at pixels labeled \( c \). The per-image and session-level prototypes are computed as:
\( p_c^{(i)} = \frac{1}{|\mathcal{F}_c^{(i)}|} \sum_{\mathbf{f} \in \mathcal{F}_c^{(i)}} \mathbf{f} \)
\( P_c = \frac{1}{|NS_c|} \sum_{j \in NS_c} \frac{p_c^{(j)}}{\|p_c^{(j)}\|_2} \), where \( NS_c \) is the set of samples containing class \( c \).
The prototype replay loss replays the stored prototypes \( \mathcal{P}_t \) through the classifier \( F \):
\( \mathcal{L}_{\text{proto}} = \sum_{P_c \in \mathcal{P}_t} \mathcal{L}_{\text{CE}}(F(P_c), c) \)
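A minimal PyTorch sketch of the prototype computation and replay loss above is given below, assuming labels have been resized to the feature resolution. The helper names (`class_prototypes`, `aggregate_prototype`, `prototype_replay_loss`) are ours, and the classifier head \( F \) is represented by any module mapping feature vectors to class logits.

```python
import torch
import torch.nn.functional as F


def class_prototypes(features, labels, class_ids):
    """Per-image prototypes p_c^{(i)}: mean feature over pixels labeled c.

    features: (C, H, W) embedding E_i = phi(x_i); labels: (H, W) pixel labels,
    assumed resized to the feature resolution.
    """
    protos = {}
    for c in class_ids:
        mask = labels == c
        if mask.any():
            protos[c] = features[:, mask].mean(dim=1)   # p_c^{(i)}
    return protos


def aggregate_prototype(per_image_protos):
    """Session-level prototype P_c: mean of the L2-normalised per-image prototypes."""
    normalised = [F.normalize(p, dim=0) for p in per_image_protos]
    return torch.stack(normalised).mean(dim=0)


def prototype_replay_loss(classifier, prototypes):
    """Cross-entropy of the classifier head applied to the stored prototypes P_c.

    Assumes class ids index the classifier's output logits directly.
    """
    feats = torch.stack([prototypes[c] for c in sorted(prototypes)])
    targets = torch.tensor(sorted(prototypes), dtype=torch.long)
    return F.cross_entropy(classifier(feats), targets)
```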
Weights are perturbed according to gradient-based guidance, where \( G_{ij} \) is the gradient magnitude associated with weight \( \mathbb{W}_{ij} \) and \( \epsilon \) avoids division by zero:
\( G_{ij}^{-1} = \frac{1}{G_{ij} + \epsilon} \)
\( \tilde{G}_{ij}^{-1} = \frac{1 + G_{ij}^{-1} - \min(G^{-1})}{1 + \max(G^{-1}) - \min(G^{-1})} \)
\( \tilde{\mathbb{W}} = \mathbb{W} + \tilde{G}^{-1} \odot \mathcal{N}(0, I) \)
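The PyTorch sketch below applies these three steps in place. It assumes `grad_magnitudes` holds an accumulated gradient-magnitude tensor per parameter and that the min/max normalisation is computed per parameter tensor; the function name and these assumptions are ours.

```python
import torch


@torch.no_grad()
def guided_noise_injection(model, grad_magnitudes, eps=1e-8):
    """Perturb each weight with Gaussian noise scaled by its normalised inverse gradient magnitude.

    grad_magnitudes: dict mapping parameter name -> tensor G (same shape as the
    parameter) of accumulated gradient magnitudes. Weights with small gradients
    (saturated / overfitted) receive larger perturbations; weights with large or
    rapidly changing gradients are left largely intact.
    """
    for name, param in model.named_parameters():
        G = grad_magnitudes.get(name)
        if G is None:
            continue
        G_inv = 1.0 / (G + eps)                                      # G^{-1}
        g_min, g_max = G_inv.min(), G_inv.max()                      # per-tensor min/max (assumption)
        G_inv_norm = (1.0 + G_inv - g_min) / (1.0 + g_max - g_min)   # shifted min-max normalisation
        param.add_(G_inv_norm * torch.randn_like(param))             # W + G~^{-1} ⊙ N(0, I)
```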
PLR uses class prototypes to refine pseudo-labels for the unlabeled data \( x_j^{(u)} \), ensuring reliable supervision for incremental classes. The student network \( M_{\text{stu}} \) and the teacher network \( M_{\text{tea}} \) produce predictions and per-pixel features:
\( \hat{y}_s, \mathcal{F}_s = M_{\text{stu}}(x_j^{(u)}), \quad \hat{y}_t, \mathcal{F}_t = M_{\text{tea}}(x_j^{(u)}) \)
Pseudo-labels are retained only at pixels \( (p,q) \) where the prediction confidence and the feature similarity to the corresponding class prototype both exceed their thresholds:
\( \text{valid}(p,q) = (\text{conf}(p,q) > \tau_{\text{conf}}) \land (\text{sim}(p,q) > \tau_{\text{sim}}) \)
The consistency loss is computed over the set \( \mathcal{V} \) of validated pixels:
\( \mathcal{L}_{\text{consistency}} = \frac{1}{|\mathcal{V}|} \sum_{(p,q) \in \mathcal{V}} \|\hat{y}_s(p,q) - \hat{y}_t(p,q)\|_2^2 \)
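A possible PyTorch realisation of the validation rule and consistency loss is sketched below. The teacher call with `return_features=True`, the threshold defaults, and the use of cosine similarity between normalised features and prototypes are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def validate_pixels(teacher, x_u, prototypes, tau_conf=0.9, tau_sim=0.7):
    """Boolean mask of pixels whose teacher pseudo-label is confident and prototype-consistent.

    prototypes: (num_classes, D) L2-normalised class prototypes P_c; the teacher
    is assumed to return per-pixel features with the same channel dimension D.
    """
    logits_t, feats_t = teacher(x_u, return_features=True)       # hypothetical teacher API
    probs_t = logits_t.softmax(dim=1)                             # (B, num_classes, H, W)
    conf, pseudo = probs_t.max(dim=1)                             # confidence and hard pseudo-label
    feats_t = F.normalize(feats_t, dim=1)                         # (B, D, H, W)
    sim = torch.einsum('bdhw,kd->bkhw', feats_t, prototypes)      # similarity to every prototype
    sim_to_label = sim.gather(1, pseudo.unsqueeze(1)).squeeze(1)  # similarity to the pseudo-label's class
    return probs_t, (conf > tau_conf) & (sim_to_label > tau_sim)


def consistency_loss(probs_s, probs_t, valid):
    """Squared L2 distance between student and teacher predictions, averaged over valid pixels."""
    if not valid.any():
        return probs_s.new_zeros(())
    per_pixel = (probs_s - probs_t).pow(2).sum(dim=1)             # (B, H, W)
    return per_pixel[valid].mean()
```

In training, `consistency_loss` would typically be evaluated on the student's softmax output for the same unlabeled batch and added to the supervised segmentation loss with a weighting factor.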
Dataset | Description | Download |
---|---|---|
Med FoSSIL-Disjoint | Disjoint classes & medical domains across sessions | Zenodo |
Med FoSSIL-Mixed | Classes & domains may reappear; multi-domain per session | Zenodo |
Med Semi-Supervised-FoSSIL | Incremental classes with unlabeled data | Zenodo |
Natural-FoSSIL | Multiple driving domains; class-incremental & FSCIL | Zenodo |
Semi-Supervised Natural-FoSSIL | Incremental classes with unlabeled data | Zenodo |
Detection | Multi-domain COCO-O dataset for detection | Google Drive |
Setting | Session 0 (Base) | Session 1 | Session 2 | Session 3 | Session 4 | Session 5 |
---|---|---|---|---|---|---|
Med FoSSIL-Disjoint | 15 (TS) | 5 (AMOS) | 6 (BCV) | 4 (MOTS) | 3 (BraTS) | 4 (VerSe) |
Med FoSSIL-Mixed | 10 (AMOS) | 8 (BCV, MOTS) | 6 (TS, AMOS) | 4 (MOTS, TS) | 7 (BraTS, VerSe) | -- |
Med SS-FoSSIL | 15 (TS) | 5 (AMOS) | 6 (BCV) | 4 (MOTS) | 3 (BraTS) | 4 (VerSe) |
Natural-FoSSIL | 10 (BDD) | 5 (IDD) | 5 (BDD, IDD) | -- | -- | -- |
SS Natural-FoSSIL | 10 (BDD) | 2 (Cityscapes) | 2 (IDD) | 3 (IDD) | -- | -- |
Results are reported as Dice coefficients (0–1).
PD (Performance Drop rate) = ((Session 0 − Session 2) / Session 0) × 100.
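For example, a method that scores 0.736 at Session 0 and 0.398 at Session 2 (the FoSSIL (U-Net) row below) has PD = ((0.736 − 0.398) / 0.736) × 100 ≈ 45.9.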
Method | Session 0 | Session 1 | Session 2 | PD (↓) |
---|---|---|---|---|
PIFS | 0.700 | 0.129 | 0.078 | 88.9 |
NC-FSCIL | 0.394 | 0.077 | 0.081 | 79.4 |
CLIP-CT | 0.475 | 0.186 | 0.141 | 70.3 |
MiB | 0.700 | 0.271 | 0.096 | 86.3 |
MDIL | 0.779 | 0.115 | 0.097 | 87.6 |
C-FSCIL | 0.787 | 0.334 | 0.297 | 62.3 |
SoftNet | 0.820 | 0.305 | 0.146 | 82.2 |
GAPS | 0.700 | 0.334 | 0.253 | 63.9 |
FSCIL-SS | 0.700 | 0.115 | 0.089 | 87.3 |
Subspace | 0.257 | 0.054 | 0.040 | 84.4 |
Gen-Replay | 0.700 | 0.076 | 0.102 | 85.4 |
FeCAM | 0.700 | 0.048 | 0.042 | 94.0 |
FACT | 0.357 | 0.071 | 0.028 | 92.2 |
MAML | 0.700 | 0.001 | 0.059 | 91.6 |
MAML + Reg. | 0.700 | 0.001 | 0.062 | 91.1 |
MTL | 0.700 | 0.079 | 0.088 | 87.4 |
UnSupCL | 0.700 | 0.039 | 0.088 | 87.4 |
SupCL | 0.700 | 0.058 | 0.042 | 94.0 |
UnSupCL-HNM | 0.700 | 0.035 | 0.068 | 90.3 |
FoSSIL (U-Net) | 0.736 | 0.460 | 0.398 | 45.9 |
All values are reported as mIoU (0–100).
PD (Performance Drop rate) = ((Session 0 − Session 2) / Session 0) × 100.
Method | Session 0 | Session 1 | Session 2 | PD (↓) |
---|---|---|---|---|
DeepLab Vanilla | 47.76 | 2.18 | 3.86 | 91.9 |
GAPS | 47.76 | 23.42 | 16.68 | 65.1 |
MiB | 47.76 | 2.50 | 2.37 | 95.0 |
MDIL | 48.54 | 1.59 | 3.02 | 93.8 |
SAM Vanilla | 66.00 | 32.60 | 30.81 | 53.3 |
FoSSIL (SAM) | 66.00 | 33.20 | 31.22 | 52.7 |
Results reported as Dice coefficients (0–1).
Benchmark dataset / Method | Session 0 | Session 1 |
---|---|---|
Med FoSSIL-Disjoint (U-Net) | | |
Saving100x | 0.700 | 0.072 |
Adaptive Prototype | 0.700 | 0.044 |
FoSSIL | 0.736 | 0.460 |
Med Semi-Supervised-FoSSIL (MedFormer) | | |
CSL | 0.659 | 0.040 |
FoSSIL | 0.640 | 0.431 |
Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as Dice coefficients (0–1). The results confirm that leveraging unlabeled data boosts performance.
SS refers to Semi-Supervised, and HM refers to Harmonic Mean. SS FoSSIL denotes the performance of FoSSIL on the Med Semi-Supervised-FoSSIL benchmark.
Method | AMOS Seen | AMOS New | AMOS HM | BCV Seen | BCV New | BCV HM | MOTS Seen | MOTS New | MOTS HM | BraTS Seen | BraTS New | BraTS HM | VerSe Seen | VerSe New | VerSe HM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FoSSIL | 0.610 | 0.074 | 0.132 | 0.477 | 0.069 | 0.120 | 0.382 | 0.180 | 0.245 | 0.043 | 0.198 | 0.071 | 0.367 | 0.119 | 0.180 |
SS FoSSIL | 0.706 | 0.099 | 0.174 | 0.561 | 0.065 | 0.116 | 0.444 | 0.218 | 0.292 | 0.042 | 0.216 | 0.070 | 0.391 | 0.182 | 0.248 |
Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as mIoU (0–100). The results show that pseudo-label refinement (PLR) enhances the performance of FoSSIL compared to FoSSIL without pseudo-label refinement (w/o PLR).
HM refers to Harmonic Mean.
Method | Cityscapes (Session 1) Seen | Cityscapes (Session 1) New | Cityscapes (Session 1) HM | IDD (Session 2) Seen | IDD (Session 2) New | IDD (Session 2) HM | IDD (Session 3) Seen | IDD (Session 3) New | IDD (Session 3) HM |
---|---|---|---|---|---|---|---|---|---|
FoSSIL + GAPS | 27.33 | 29.98 | 28.59 | 26.00 | 49.82 | 34.17 | 28.88 | 24.23 | 26.35 |
FoSSIL + GAPS w/o PLR | 25.62 | 27.88 | 26.70 | 23.46 | 41.50 | 29.97 | 27.54 | 15.74 | 20.03 |
Seen refers to the average performance on classes the model has encountered in previous sessions, while New refers to the average performance on classes introduced in the current session. Results are reported as Dice coefficients (0–1). The results show that guided noise injection (GNI) enhances the performance of FoSSIL compared to FoSSIL without guided noise injection (w/o GNI).
HM refers to Harmonic Mean.
Method | Session 1 Seen | Session 1 New | Session 1 HM | Session 2 Seen | Session 2 New | Session 2 HM | Session 3 Seen | Session 3 New | Session 3 HM | Session 4 Seen | Session 4 New | Session 4 HM |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FoSSIL | 0.534 | 0.159 | 0.245 | 0.398 | 0.076 | 0.128 | 0.329 | 0.046 | 0.081 | 0.289 | 0.038 | 0.067 |
FoSSIL w/o GNI | 0.105 | 0.010 | 0.018 | 0.069 | 0.015 | 0.025 | 0.061 | 0.034 | 0.043 | 0.099 | 0.008 | 0.014 |
These methods cannot handle few-shot learning or domain shifts.
A separate decoder per domain limits scalability, and these methods cannot handle few-shot incremental classes.
Although effective for class-incremental and few-shot learning, these methods are not robust to domain shifts.
These methods cannot jointly address domain shifts, class-incremental learning, and few-shot learning.