Theoretical Analysis of Guided Noise Injection (GNI)

1. Definitions, Notation, and Assumptions

Definition 1 (Weight vector)
Let \(\mathbb{W} \in \mathbb{R}^p\) denote the flattened vector containing all model parameters, where each component is indexed by \(i \in \{1, \dots, p\}\), and \(w_i\) denotes the \(i\)-th parameter.
Definition 2 (Gradient buffer)
At the current optimization step, let \[ g_i = \frac{\partial \mathcal{L}}{\partial w_i} \] denote the gradient of the loss \(\mathcal{L}\) with respect to \(w_i\), and define the elementwise squared gradient: \[ G_i = g_i^2. \]
Definition 3 (Inverse gradient modulator (normalized))
Introduce a stable inverse-gradient modulator: \[ G_{\mathrm{inv},i} = \frac{1}{G_i + \varepsilon}, \quad \varepsilon > 0, \] and normalize to a bounded range to control noise magnitude: \[ \tilde{G}_{i} = \frac{1 + G_{\mathrm{inv},i} - \min_j G_{\mathrm{inv},j}}{1 + \max_j G_{\mathrm{inv},j} - \min_j G_{\mathrm{inv},j}} \in (0,1]. \]
Definition 4 (Guided noise operator)
Let \(\xi = (\xi_1,\dots,\xi_p)\) be i.i.d. standard Gaussian noise, \(\xi_i \sim \mathcal{N}(0,1)\). Define, \[ \mathcal{U}(\mathbb{W}) = \mathbb{W} + \tilde{G} \odot \xi, \] i.e., componentwise: \[ \widetilde{w}_i = w_i + \tilde{G}_i \, \xi_i. \]
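For concreteness, the following minimal NumPy sketch applies Definitions 2-4; the function name `guided_noise`, the global scale argument `alpha` (cf. the note in Section 2), and the toy gradient values are illustrative, not a reference implementation.

```python
import numpy as np

def guided_noise(w, g, eps=1e-8, alpha=1.0, rng=None):
    """Minimal sketch of the GNI operator (Definitions 2-4).

    w     : flattened weight vector, shape (p,)
    g     : gradient of the loss w.r.t. w, shape (p,)
    eps   : the stabilizer epsilon > 0 of Definition 3
    alpha : optional global noise scale (see the note in Section 2)
    """
    rng = np.random.default_rng() if rng is None else rng
    G = g ** 2                                 # elementwise squared gradient
    G_inv = 1.0 / (G + eps)                    # stable inverse-gradient modulator
    G_tilde = (1.0 + G_inv - G_inv.min()) / (1.0 + G_inv.max() - G_inv.min())
    xi = rng.standard_normal(w.shape)          # xi_i ~ N(0, 1), i.i.d.
    return w + alpha * G_tilde * xi, G_tilde   # U(W) = W + G_tilde ⊙ xi

# Toy usage: coordinates with larger gradients receive less noise.
w = np.zeros(4)
g = np.array([10.0, 1.0, 0.1, 0.0])
w_tilde, G_tilde = guided_noise(w, g, rng=np.random.default_rng(0))
print(G_tilde)   # increases as |g_i| decreases; all entries lie in (0, 1]
```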
Assumption 1 (Smoothness)
\(\mathcal{L}: \mathbb{R}^p \to \mathbb{R}\) is twice continuously differentiable in a neighborhood of \(\mathbb{W}\). Let \(g=(g_1,\dots,g_p)\) be the gradient vector and \(H = \nabla^2 \mathcal{L}(\mathbb{W})\) the Hessian.

2. Expected Perturbed Loss

Consider the Taylor expansion of \(\mathcal{L}\) at \(\mathbb{W}\) with increment \(\Delta \mathbb{W} = \tilde{G} \odot \xi\):

\[ \mathcal{L}(\mathbb{W} + \Delta \mathbb{W}) = \mathcal{L}(\mathbb{W}) + g^\top \Delta \mathbb{W} + \frac{1}{2} \Delta \mathbb{W}^\top H \Delta \mathbb{W} + R_3(\Delta \mathbb{W}), \]

where \(R_3(\Delta \mathbb{W}) = O(\|\Delta \mathbb{W}\|^3)\).

Linear term expectation:

\[ \mathbb{E}_\xi[g^\top \Delta \mathbb{W}] = \sum_i g_i \tilde{G}_i \mathbb{E}[\xi_i] = 0, \]

since \(\mathbb{E}[\xi_i] = 0\).

Quadratic term expectation:

\[ \frac{1}{2} \mathbb{E}_\xi[\Delta \mathbb{W}^\top H \Delta \mathbb{W}] = \frac{1}{2} \sum_{i,j} H_{ij} \tilde{G}_i \tilde{G}_j \mathbb{E}[\xi_i \xi_j] = \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2, \]

since \(\mathbb{E}[\xi_i \xi_j] = 0\) for \(i\neq j\) and \(\mathbb{E}[\xi_i^2] = 1\), only the diagonal terms survive.

Higher order term:

\[ |R_3(\Delta \mathbb{W})| \le C \|\Delta \mathbb{W}\|^3, \quad \mathbb{E}[|R_3|] = O(\mathbb{E}\|\Delta \mathbb{W}\|^3) = O(\max_i \tilde{G}_i^3), \]

for some constant \(C\). Since \(\tilde{G}_i \in (0,1]\), the higher order term is bounded even for small \(g_i^2\).

Note: For additional control of noise magnitude during training, one can scale the modulators by a factor \(\alpha \in (0,1]\): \[ \tilde{G}_i \;\longrightarrow\; \alpha \, \tilde{G}_i. \] With \(\Delta\mathbb{W}=\alpha\tilde{G}\odot\xi\) and \(\xi\sim\mathcal{N}(0,I_p)\) this yields \[ |R_3(\alpha\tilde{G}\odot\xi)| \le C(\alpha\max_i\tilde{G}_i)^3\|\xi\|^3, \] and thus, taking expectations, \[ \mathbb{E}\big[|R_3(\alpha\tilde{G}\odot\xi)|\big] \le C(\alpha\max_i\tilde{G}_i)^3\mathbb{E}\|\xi\|^3 = O\!\big((\alpha\max_i\tilde{G}_i)^3\big). \] The factor \(\mathbb{E}\|\xi\|^3\) is finite and depends only on the number of parameters \(p\). This ensures that the higher-order contributions remain bounded and controllable, while providing finer control over noise injection.

Thus, expected perturbed loss:

\[ \mathbb{E}_\xi[\mathcal{L}(\mathcal{U}(\mathbb{W}))] = \mathcal{L}(\mathbb{W}) + \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2 + O(\max_i \tilde{G}_i^3) \]
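As a sanity check, the expansion can be verified numerically on a toy quadratic loss, where the remainder \(R_3\) vanishes identically and the second-order formula is exact; all values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
A = rng.standard_normal((p, p))
H = A @ A.T                                   # a PSD Hessian for the toy quadratic
b = rng.standard_normal(p)
loss = lambda v: 0.5 * v @ H @ v + b @ v      # L(v); here R_3 = 0 identically

w = rng.standard_normal(p)
g = H @ w + b                                 # exact gradient at w
G_inv = 1.0 / (g**2 + 1e-8)
G_tilde = (1.0 + G_inv - G_inv.min()) / (1.0 + G_inv.max() - G_inv.min())

# Monte Carlo estimate of E_xi[L(W + G_tilde ⊙ xi)] vs. the second-order formula
mc = np.mean([loss(w + G_tilde * rng.standard_normal(p)) for _ in range(200_000)])
predicted = loss(w) + 0.5 * np.sum(np.diag(H) * G_tilde**2)
print(mc, predicted)                          # agree up to Monte Carlo error
```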

2.1 Interpretation and Theoretical Motivation

To second order, the expected increase in loss caused by GNI is \(\frac{1}{2}\sum_{i} H_{ii}\tilde{G}_i^2\): each coordinate contributes the product of its local curvature \(H_{ii}\) and its injected noise variance \(\tilde{G}_i^2\). Because \(\tilde{G}_i\) is large precisely where \(g_i^2\) is small, noise concentrates on low-sensitivity parameters, the linear term vanishes in expectation, and the remainder stays bounded. The next section makes the curvature dependence precise through the Hessian spectrum.

3. Spectral Viewpoint: Hessian Eigen-structure

Let \(H = U \Lambda U^\top\) be the eigendecomposition of the Hessian, where \(U\) is orthogonal (\(U^\top U = I_p\)) and \(\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_p)\). Then each diagonal entry satisfies,

\[ H_{ii} = \sum_{r=1}^p \lambda_r u_{ir}^2. \]

The expected second-order contribution with \(\tilde{G}\) becomes,

\[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{i=1}^p \sum_{r=1}^p \lambda_r u_{ir}^2 \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \sum_{i=1}^p (u_{ir} \tilde{G}_i)^2. \]

Define \(v^{(r)} \in \mathbb{R}^p\) with components \(v_i^{(r)} = u_{ir} \tilde{G}_i\). Then,

\[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \|v^{(r)}\|_2^2 = \sum_{r=1}^p \lambda_r \|U_{:,r} \odot \tilde{G}\|_2^2 \]
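This identity is straightforward to confirm numerically; the snippet below uses a random symmetric Hessian and arbitrary stand-in modulators (both illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.standard_normal((p, p))
H = (A + A.T) / 2                    # any symmetric Hessian
G_tilde = rng.uniform(0.1, 1.0, p)   # stand-in for the normalized modulators

lam, U = np.linalg.eigh(H)           # H = U diag(lam) U^T

lhs = np.sum(np.diag(H) * G_tilde**2)
rhs = sum(lam[r] * np.sum((U[:, r] * G_tilde)**2) for r in range(p))
print(np.isclose(lhs, rhs))          # True: sum_i H_ii G_i^2 = sum_r lam_r ||U_:,r ⊙ G||^2
```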

3.1 Interpretation

GNI implements anisotropic noise injection: each Hessian eigenvalue \(\lambda_r\) is weighted by \(\|U_{:,r} \odot \tilde{G}\|_2^2\), the squared \(\ell_2\)-norm of \(\tilde{G}\) along the \(r\)-th eigendirection. When \(\tilde{G}\) is large on coordinates aligned with low-curvature directions (small \(\lambda_r\)), the noise contribution \(\lambda_r \|U_{:,r} \odot \tilde{G}\|_2^2\) remains controlled by the small eigenvalue; conversely, high-curvature directions (large \(\lambda_r\)) are automatically protected because \(\tilde{G}\) is small there. GNI therefore concentrates noise along flat directions, where the loss impact is minimal, while shielding sensitive directions. This encourages flatter minima, which are associated with improved generalization, and the higher-order terms remain bounded even when gradients vanish.

PAC-Bayes Analysis of Guided Noise Injection

4. Setup and Assumptions

We interpret the guided noise injection (GNI),

\[ \widetilde{\mathbb{W}} = \mathbb{W} + \tilde{G} \odot \xi, \qquad \xi \sim \mathcal{N}(0, I_p), \]

as defining a stochastic posterior over parameters, where \(\mathbb{W} \in \mathbb{R}^p\) denotes the current model weights. Each coordinate \(w_i\) is perturbed by scaled Gaussian noise \(\tilde{G}_i \xi_i\), with normalized noise magnitude,

\[ \tilde{G}_i = \frac{1 + G_{\mathrm{inv},i} - \min_j G_{\mathrm{inv},j}}{1 + \max_j G_{\mathrm{inv},j} - \min_j G_{\mathrm{inv},j}} \in (0,1], \qquad G_{\mathrm{inv},i} = \frac{1}{(\nabla_{w_i} \mathcal{L})^2 + \varepsilon} \]

Assumptions:

  1. The posterior is the diagonal Gaussian induced by GNI, \(q(\theta) = \mathcal{N}(\mathbb{W}, \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2))\); the normalization guarantees \(\tilde{G}_i \in (0,1]\), so all posterior variances are strictly positive.
  2. The prior is an isotropic Gaussian \(p(\theta) = \mathcal{N}(0, \tau^2 I_p)\) with scale \(\tau > 0\) fixed independently of the training data.
  3. The loss is bounded (e.g., in \([0,1]\)) so that the PAC-Bayes bound in Equation (9) applies.

5. KL Divergence Between Posterior and Prior

For general multivariate Gaussians \(q = \mathcal{N}(\mu_q, \Sigma_q)\) and \(p = \mathcal{N}(\mu_p, \Sigma_p)\), the KL divergence is:

\[ \mathrm{KL}(q\|p) = \frac{1}{2} \Big[ \mathrm{tr}(\Sigma_p^{-1} \Sigma_q) + (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) - p + \ln \frac{\det \Sigma_p}{\det \Sigma_q} \Big]. \]

For the GNI posterior \(q(\theta) = \mathcal{N}(\mathbb{W}, \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2))\) and isotropic prior \(p(\theta) = \mathcal{N}(0, \tau^2 I_p)\), we have:

Trace term: \(\mathrm{tr}(\Sigma_p^{-1} \Sigma_q)\)

Since \(\Sigma_p = \tau^2 I_p\) and \(\Sigma_q = \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)\):

\[ \Sigma_p^{-1} \Sigma_q = (\tau^2 I_p)^{-1} \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2) = \frac{1}{\tau^2} \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2). \]

The trace of a diagonal matrix is the sum of its diagonal elements, so

\[ \mathrm{tr}(\Sigma_p^{-1} \Sigma_q) = \sum_{i=1}^p \frac{\tilde{G}_i^2}{\tau^2}. \]

Mean difference term: \((\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q)\)

\[ (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) = (0 - \mathbb{W})^\top (\tau^{-2} I_p) (0 - \mathbb{W}) = \frac{1}{\tau^2} \sum_{i=1}^p w_i^2. \]

This term measures how far the posterior mean \(\mathbb{W}\) is from the prior mean \(0\), scaled by the prior variance.

Log-determinant term: \(\ln \frac{\det \Sigma_p}{\det \Sigma_q}\)

The determinant of a matrix is the product of its eigenvalues. For a diagonal matrix, the eigenvalues are the diagonal entries:

\[ \det(\mathrm{diag}(d_1, \dots, d_p)) = \prod_{i=1}^p d_i. \]

Hence,

\[ \det \Sigma_p = \det(\tau^2 I_p) = (\tau^2)^p, \quad \det \Sigma_q = \det(\mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)) = \prod_{i=1}^p \tilde{G}_i^2. \]

Then the log-determinant term becomes,

\[ \ln \frac{\det \Sigma_p}{\det \Sigma_q} = \ln \frac{\tau^{2p}}{\prod_{i=1}^p \tilde{G}_i^2} = \sum_{i=1}^p \ln \frac{\tau^2}{\tilde{G}_i^2}. \]

This term penalizes how much the posterior volume differs from the prior. If posterior variance is smaller than prior, the log term is positive.

Combining these, the KL divergence reduces to:

\[ \mathrm{KL}(q\|p) = \frac{1}{2} \sum_{i=1}^{p} \Big[ \frac{\tilde{G}_i^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\tilde{G}_i^2}{\tau^2} \Big]. \tag{1} \]
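Equation (1) can be cross-checked against the general diagonal-Gaussian KL formula; the helper names and sampled values below are illustrative.

```python
import numpy as np

def kl_gni(w, G_tilde, tau):
    """Equation (1): KL between q = N(W, diag(G_tilde^2)) and p = N(0, tau^2 I)."""
    x = G_tilde**2 / tau**2
    return 0.5 * np.sum(x + w**2 / tau**2 - 1.0 - np.log(x))

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """General diagonal-Gaussian KL, for cross-checking the closed form."""
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q)**2 / var_p
                        - 1.0 + np.log(var_p / var_q))

rng = np.random.default_rng(3)
p, tau = 8, 0.5
w = rng.standard_normal(p)
G_tilde = rng.uniform(0.1, 1.0, p)
print(kl_gni(w, G_tilde, tau))
print(kl_diag_gauss(w, G_tilde**2, np.zeros(p), np.full(p, tau**2)))  # identical
```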

6. Interpretation of the KL Divergence Term

Each parameter \(w_i\) contributes to the KL divergence:

\[ \mathrm{KL}_i = \frac{1}{2} \Big[ \frac{\tilde{G}_i^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\tilde{G}_i^2}{\tau^2} \Big]. \]

Define,

\[ x_i = \frac{\tilde{G}_i^2}{\tau^2}, \qquad f(x) = \frac{1}{2}(x - \ln x - 1), \]

so that,

\[ \mathrm{KL}_i = f(x_i) + \frac{w_i^2}{2\tau^2}. \tag{2} \]

Low-gradient parameters:

Parameters with small gradients have large \(\tilde{G}_i\), injecting more noise. This increases posterior variance in low-sensitivity directions, promoting exploration and regularization, allowing these parameters to safely explore flat regions without significantly affecting expected loss. The KL contribution is influenced both by the noise scale \(\tilde{G}_i^2/\tau^2\) and the weight magnitude \(w_i^2/\tau^2\). Since \(\tilde{G}_i \in (0,1]\), choosing \(\tau \in [\tilde{G}_{\min}, \tilde{G}_{\max}]\) ensures \(x_i\) remains bounded, preventing the KL from growing excessively.

High-gradient parameters:

Parameters with large gradients have small \(\tilde{G}_i\), suppressing noise. The KL term is then dominated by \(w_i^2/\tau^2\), reflecting that the posterior is tightly concentrated around the current weight, which prevents drift of these important weights. The normalization in GNI ensures \(\tilde{G}_i > 0\), preventing the log term from diverging. Intuitively, high-gradient parameters remain stable.

Thus, the KL contribution reflects both the posterior spread (\(\tilde{G}_i\)) and the weight magnitude (\(w_i\)), providing a principled trade-off between stochastic regularization and posterior concentration. This provides a rigorous, anisotropic regularization mechanism that adapts to the local geometry of the loss landscape.

Theorem 1 (PAC-Bayes Bound Improvement via GNI)
Let \(p(\theta) = \mathcal{N}(0, \tau^2 I_p)\) be an isotropic Gaussian prior with scale \(0 < \tau \le \min_i \tilde{G}_i\), and let \(q_{\mathrm{ani}}(\theta)\) denote the anisotropic GNI posterior and \(q_{\mathrm{iso}}(\theta)\) the isotropic posterior whose per-coordinate variance dominates the GNI variances, \(\sigma^2 = \max_i \tilde{G}_i^2\). Assume the Hessian of the loss at \(\mathbb{W}\) is positive semidefinite (e.g., \(\mathbb{W}\) lies near a local minimum). Then, the anisotropic posterior induced by Guided Noise Injection (GNI) achieves a tighter PAC-Bayes bound than the isotropic posterior: \[ \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{ani}}(\theta), p(\theta)) \le \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{iso}}(\theta), p(\theta)) \]
Proof:

We define two posterior distributions over parameters \(\theta\), both centered at the current weights:

\[ q_{\mathrm{ani}}(\theta) = \mathcal{N}\big(\mathbb{W}, \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)\big), \qquad q_{\mathrm{iso}}(\theta) = \mathcal{N}(\mathbb{W}, \sigma^2 I_p), \quad \sigma^2 = \max_i \tilde{G}_i^2. \]

The prior is isotropic Gaussian:

\[ p(\theta) = \mathcal{N}(0, \tau^2 I_p), \quad 0 < \tau \le \min_i \tilde{G}_i. \]

Using Equation (1), for the isotropic posterior with variance \(\sigma^2\):

\[ \mathrm{KL}(q_{\mathrm{iso}} \| p) = \frac{1}{2} \sum_{i=1}^p \left[ \frac{\sigma^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\sigma^2}{\tau^2} \right] = \frac{1}{2} \left[ \frac{p \sigma^2}{\tau^2} + \frac{\|\mathbb{W}\|_2^2}{\tau^2} - p - p \ln \frac{\sigma^2}{\tau^2} \right]. \tag{3} \]

Similar to Equation (2),

\[ \mathrm{KL}^{\mathrm{iso}}_i = f(y_i) + \frac{w_i^2}{2\tau^2}. \tag{4} \]

where,

\[ y_i = \frac{\sigma^2}{\tau^2} \quad \forall i. \]

Using a second-order Taylor expansion around \(\mathbb{W}\):

\[ \mathcal{L}(\theta) \approx \mathcal{L}(\mathbb{W}) + (\theta - \mathbb{W})^\top \nabla \mathcal{L}(\mathbb{W}) + \frac{1}{2} (\theta - \mathbb{W})^\top H (\theta - \mathbb{W}), \]

where,

\[ H = \nabla^2_{\mathbb{W}} \mathcal{L}(\mathbb{W}) = U \Lambda U^\top, \quad \Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_p), \quad U^\top U = I_p. \]

Taking expectation under \(q(\theta)\) and using \(\mathbb{E}_{q}[\theta - \mathbb{W}] = 0\):

\[ \mathbb{E}_{\theta \sim q}[\mathcal{L}(\theta)] \approx \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_q). \tag{5} \]

For anisotropic posterior:

\[ \mathbb{E}_{q_{\mathrm{ani}}}[\mathcal{L}(\theta)] - \mathcal{L}(\mathbb{W}) \approx \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{ani}}) = \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2, \] \[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{i=1}^p \sum_{r=1}^p \lambda_r u_{ir}^2 \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \sum_{i=1}^p (u_{ir} \tilde{G}_i)^2 = \sum_{r=1}^p \lambda_r \| U_{:,r} \odot \tilde{G} \|_2^2 \]

For isotropic posterior:

\[ \mathbb{E}_{q_{\mathrm{iso}}}[\mathcal{L}(\theta)] - \mathcal{L}(\mathbb{W}) \approx \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{iso}}) = \frac{1}{2} \sum_{i=1}^p H_{ii} \sigma^2 = \frac{1}{2} \sigma^2 \mathrm{tr}(H), \] \[ \sigma^2 \mathrm{tr}(H) = \sigma^2 \sum_{i=1}^p H_{ii} = \sigma^2 \sum_{r=1}^p \lambda_r. \]

Low-curvature directions (\(\lambda_r\) small) are allowed larger variance \(\tilde{G}_i \in (0,1]\), contributing little to \(\sum_{r=1}^p \lambda_r \| U_{:,r} \odot \tilde{G} \|_2^2\), while high-curvature directions (\(\lambda_r\) large) have small \(\tilde{G}_i\), preventing a large expected loss increase. Isotropic variance \(\sigma^2\) does not adapt to curvature, so high-curvature directions may receive excessive noise. Quantitatively, positive semidefiniteness of \(H\) gives \(H_{ii} \ge 0\) for every \(i\), and since \(\tilde{G}_i^2 \le \sigma^2 = \max_j \tilde{G}_j^2\) coordinatewise, it follows that \(\mathrm{tr}(H \Sigma_{\mathrm{ani}}) \le \mathrm{tr}(H \Sigma_{\mathrm{iso}})\).

Hence, we obtain the inequality

\[ \mathbb{E}_{\theta \sim q_{\mathrm{ani}}}[\mathcal{L}(\theta)] = \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{ani}}) \le \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{iso}}) = \mathbb{E}_{\theta \sim q_{\mathrm{iso}}}[\mathcal{L}(\theta)]. \tag{6} \]

For the KL term, recall from Equation (2) that coordinate \(i\) contributes \(f(x_i)\) with \(x_i = \tilde{G}_i^2/\tau^2\) and \(f(x) = \frac{1}{2}(x - \ln x - 1)\). This \(f\) is convex, attains its minimum \(f(1)=0\) at \(x=1\), and is increasing on \([1,\infty)\). Since \(\tau \le \min_i \tilde{G}_i\), every ratio satisfies \(x_i \ge 1\), and \(x_i \le \sigma^2/\tau^2\) because \(\sigma^2 = \max_j \tilde{G}_j^2\). Monotonicity of \(f\) on \([1,\infty)\) therefore gives \(f(x_i) \le f(\sigma^2/\tau^2)\) for each \(i\), and summing over coordinates,

\[ \sum_{i=1}^p f\Big(\frac{\tilde{G}_i^2}{\tau^2}\Big) \le p f\Big(\frac{\sigma^2}{\tau^2}\Big). \tag{7} \]
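A quick numerical check of inequality (7) under these conditions (\(\tau \le \min_i \tilde{G}_i\), \(\sigma^2 = \max_i \tilde{G}_i^2\); the sampled modulators are illustrative):

```python
import numpy as np

f = lambda x: 0.5 * (x - np.log(x) - 1.0)   # per-coordinate KL shape, Eq. (2)

rng = np.random.default_rng(4)
G_tilde = rng.uniform(0.1, 1.0, 10)
tau = G_tilde.min()                         # prior scale tau <= min_i G_tilde_i
sigma2 = G_tilde.max()**2                   # dominating isotropic variance

lhs = np.sum(f(G_tilde**2 / tau**2))
rhs = len(G_tilde) * f(sigma2 / tau**2)
print(lhs <= rhs)                           # True: Eq. (7) holds coordinatewise
```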

The mean term \(\sum_i w_i^2/(2\tau^2)\) is identical for both posteriors, since both are centered at the same weights \(\mathbb{W}\). Therefore, from Equations (2) and (4):

\[ \mathrm{KL}(q_{\mathrm{ani}} \| p) \le \mathrm{KL}(q_{\mathrm{iso}} \| p). \tag{8} \]

For a loss bounded in \([0,1]\), dataset size \(N\), and confidence level \(\delta \in (0,1)\), with probability at least \(1-\delta\) over the draw of the training set, the PAC-Bayes bound states:

\[ \mathbb{E}_{\theta \sim q}[\mathcal{L}_{\mathrm{test}}(\theta)] \le \mathbb{E}_{\theta \sim q}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q\|p) + \ln\frac{N}{\delta}}{2(N-1)}}. \tag{9} \]

Combining (6) and (8):

\[ \mathbb{E}_{q_{\mathrm{ani}}}[\mathcal{L}_{\mathrm{test}}] \le \mathbb{E}_{q_{\mathrm{ani}}}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q_{\mathrm{ani}}\|p)+\ln(N/\delta)}{2(N-1)}} \le \mathbb{E}_{q_{\mathrm{iso}}}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q_{\mathrm{iso}}\|p)+\ln(N/\delta)}{2(N-1)}}, \] \[ \implies \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{ani}}(\theta), p(\theta)) \le \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{iso}}(\theta), p(\theta)). \]

Thus, the anisotropic posterior induced by GNI achieves a tighter PAC-Bayes bound than an isotropic posterior whose per-coordinate variance dominates the GNI noise scales.
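To illustrate Equation (9) end to end, the sketch below compares the complexity terms of the two posteriors; the sizes \(N\), \(\delta\), and the sampled weights and modulators are hypothetical.

```python
import numpy as np

def pac_complexity(kl, N=50_000, delta=0.05):
    """Complexity term of Equation (9): sqrt((KL + ln(N/delta)) / (2(N-1)))."""
    return np.sqrt((kl + np.log(N / delta)) / (2 * (N - 1)))

rng = np.random.default_rng(6)
p = 1000
w = 0.05 * rng.standard_normal(p)
G_tilde = rng.uniform(0.2, 1.0, p)
tau = G_tilde.min()                       # prior scale tau <= min_i G_tilde_i
sigma2 = G_tilde.max()**2                 # dominating isotropic variance

def kl_diag(var_q):
    """Equation (1) with per-coordinate posterior variances var_q."""
    x = var_q / tau**2
    return 0.5 * np.sum(x + w**2 / tau**2 - 1.0 - np.log(x))

kl_ani = kl_diag(G_tilde**2)
kl_iso = kl_diag(np.full(p, sigma2))
print(pac_complexity(kl_ani) <= pac_complexity(kl_iso))   # True, matching Eq. (8)
```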

Theoretical Analysis of Prototype-Guided Pseudo-Label Refinement (PRL)

7. Setup

Consider a student-teacher semi-supervised segmentation framework:

  1. a student network \(f_\theta\) trained on a labeled set \(\mathcal{D}_l\) and an unlabeled set \(\mathcal{D}_u\), typically with \(|\mathcal{D}_u| \gg |\mathcal{D}_l|\);
  2. a teacher network \(T\), maintained as an exponential moving average (EMA) of the student, which generates pseudo-labels on \(\mathcal{D}_u\) as additional supervision for the student.

Let \(\epsilon_0\) denote the base error of the student on labeled data. For unlabeled data, let \(\epsilon_s(t)\) and \(\epsilon_t(t)\) denote the fractions of incorrect pseudo-labels for the student and teacher, respectively, at iteration \(t\).

8. Pseudo-Label Error Dynamics without Pseudo-Label Refinement

We begin by formulating the coupled dynamics between the student and teacher networks in the absence of pseudo-label refinement (PRL). Let \(\epsilon_s(t)\) and \(\epsilon_t(t)\) denote the expected pseudo-label errors (i.e., the fractions of incorrect pseudo-labels) produced by the student and teacher, respectively, at iteration \(t\). The student updates its parameters using a mixture of ground-truth labeled data and teacher-generated pseudo-labels:

\[ \epsilon_s(t+1) = (1-\gamma)\,\epsilon_0 + \gamma\,\epsilon_t(t), \tag{10} \]

where \(\epsilon_0\) represents the intrinsic baseline error arising from labeled supervision, and \(\gamma \in [0,1]\) controls the relative contribution of pseudo-labeled data in training. Intuitively, \(\gamma=0\) corresponds to purely supervised learning based solely on labeled data, whereas \(\gamma \to 1\) denotes a regime where training is dominated by pseudo-labels generated by the teacher.

The teacher, on the other hand, is an exponentially moving average (EMA) of the student:

\[ \epsilon_t(t+1) = \alpha \epsilon_t(t) + (1-\alpha) \epsilon_s(t+1), \tag{11} \]

where \(\alpha \in [0,1)\) is the EMA decay constant. This coupling forms a delayed feedback loop: the teacher smooths over the temporal trajectory of student errors.

Substituting Equation (10) into Equation (11) gives:

\[ \epsilon_t(t+1) = \alpha \epsilon_t(t) + (1-\alpha)\big[(1-\gamma)\epsilon_0 + \gamma \epsilon_t(t)\big] = [\alpha + (1-\alpha)\gamma] \epsilon_t(t) + (1-\alpha)(1-\gamma)\epsilon_0. \]

This establishes a non-homogeneous linear recurrence relation describing how teacher error evolves in conjunction with student learning dynamics.

Let us define:

\[ \lambda = \alpha + (1-\alpha)\gamma, \]

The update then simplifies to:

\[ \epsilon_t(t+1) = \lambda \epsilon_t(t) + (1-\alpha)(1-\gamma)\epsilon_0 \]

The general solution is:

\[ \epsilon_t(t) = \lambda^t \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \sum_{k=0}^{t-1}\lambda^k, \]

where \(\epsilon_t(0)\) denotes the initial teacher pseudo-label error at iteration \(t=0\).

The term \(\lambda^t \epsilon_t(0)\) captures the contribution of past errors, exponentially weighted over time, while the summation term captures the accumulation of newly introduced errors from labeled supervision.
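The closed form can be cross-checked against direct iteration of Equations (10)-(11); the parameter values below are illustrative.

```python
def teacher_error(t, eps_t0, eps0, alpha, gamma):
    """Closed form: eps_t(t) = lam^t * eps_t(0) + (1-a)(1-g) * eps0 * sum_k lam^k."""
    lam = alpha + (1 - alpha) * gamma
    geom = t if lam == 1 else (1 - lam**t) / (1 - lam)
    return lam**t * eps_t0 + (1 - alpha) * (1 - gamma) * eps0 * geom

# Direct iteration of the coupled student-teacher recurrence.
alpha, gamma, eps0, eps_t = 0.99, 0.5, 0.1, 0.5
for _ in range(200):
    eps_s = (1 - gamma) * eps0 + gamma * eps_t     # Eq. (10): student update
    eps_t = alpha * eps_t + (1 - alpha) * eps_s    # Eq. (11): teacher EMA
print(eps_t, teacher_error(200, 0.5, eps0, alpha, gamma))  # agree to precision
```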

We analyze three characteristic regimes of \(\lambda\):

(a) \(\lambda < 1\):
The geometric series converges: \[ \sum_{k=0}^{t-1}\lambda^k = \frac{1-\lambda^t}{1-\lambda} \xrightarrow[t\to\infty]{} \frac{1}{1-\lambda}. \] Therefore, the asymptotic value of the teacher error is: \[ \epsilon_\infty = \frac{(1-\alpha)(1-\gamma)\epsilon_0}{1-\lambda}. \tag{12} \] Substituting \(\lambda = \alpha + (1-\alpha)\gamma\), we find: \[ \epsilon_\infty = \frac{(1-\alpha)(1-\gamma)\epsilon_0}{(1-\alpha)(1-\gamma)} = \epsilon_0. \] The teacher asymptotically retains the same error as if trained purely on ground-truth data.

(b) \(\lambda = 1\):
For the generic recurrence, the summation grows linearly with time: \[ \sum_{k=0}^{t-1}\lambda^k = t. \] The error therefore exhibits linear drift: \[ \epsilon_t(t) = \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \cdot t. \] Errors are neither damped nor amplified exponentially. Note that under the parameterization \(\lambda = \alpha + (1-\alpha)\gamma\) with \(\alpha < 1\), \(\lambda = 1\) forces \(\gamma = 1\), so the drift coefficient \((1-\alpha)(1-\gamma)\epsilon_0\) vanishes and the error simply remains frozen at its initial value, never correcting.

(c) \(\lambda > 1\):
The geometric series diverges exponentially: \[ \epsilon_t(t) = \lambda^t \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \frac{\lambda^t - 1}{\lambda - 1}. \] Here, \(\epsilon_t\) grows exponentially, producing confirmation bias: erroneous pseudo-labels reinforce the teacher, which in turn produces even more erroneous labels. While the linear mixture in Equation (10) keeps \(\lambda \le 1\) for \(\gamma \le 1\), this regime models amplification effects beyond that idealization, typical when reliance on unlabeled data is very strong (\(\gamma\) pushed toward, or effectively beyond, 1).

For bounded error propagation, we require \(\lambda < 1\), which simplifies to \(\gamma < 1\). However, in real semi-supervised settings \(|\mathcal{D}_u| \gg |\mathcal{D}_l|\), driving the effective \(\gamma\) toward 1; \(\lambda\) then approaches 1, the convergence time constant \(1/(1-\lambda)\) blows up, and pseudo-label drift and error amplification dominate any practical training horizon. This theoretical observation directly justifies pseudo-label filtering and refinement strategies that keep \(\lambda\) well below 1 in practice.
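A short simulation of the recurrence illustrates regimes (a) and (c); the parameter values, and in particular the hand-set \(\lambda > 1\), are illustrative only.

```python
def simulate(lam, c, eps_t0, steps):
    """Iterate eps_t(t+1) = lam * eps_t(t) + c and return the final error."""
    eps = eps_t0
    for _ in range(steps):
        eps = lam * eps + c
    return eps

eps0, alpha = 0.1, 0.99

# Regime (a): gamma < 1 gives lam < 1 and convergence back to eps0.
gamma = 0.5
lam = alpha + (1 - alpha) * gamma                   # 0.995 < 1
c = (1 - alpha) * (1 - gamma) * eps0
print(simulate(lam, c, eps_t0=0.5, steps=5000))     # ~0.1 = eps0

# Regime (c), illustrative only: lam > 1 amplifies errors exponentially
# (confirmation bias); lam and c are set by hand here.
print(simulate(1.002, 0.0005, eps_t0=0.5, steps=5000))  # blows up
```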

9. Prototype-Guided Pseudo-Label Refinement

Let \(z = f_\theta(x) \in \mathbb{R}^d\) denote the feature embedding of an unlabeled sample \(x\), where \(f_\theta\) is the student model with parameters \(\theta\). Let \(\mu_c \in \mathbb{R}^d\) be the prototype (mean feature) for class \(c\), computed from the labeled set \(\mathcal{D}_l^{(c)}\) of \(N_c\) examples of class \(c\):

\[ \mu_c = \frac{1}{N_c} \sum_{x_i \in \mathcal{D}_l^{(c)}} f_\theta(x_i), \]

A pseudo-label \(\hat{y}\) for input \(x\) is accepted only if both the teacher confidence and the prototype similarity exceed their thresholds:

\[ p_T(\hat{y}\mid x) \ge \tau_{\mathrm{conf}}, \qquad s(x,\hat{y}) = \frac{\langle f_\theta(x), \mu_{\hat{y}} \rangle}{\|f_\theta(x)\|\|\mu_{\hat{y}}\|} \ge \tau_{\mathrm{sim}}, \]

where \(p_T(\hat{y}\mid x)\) is the predicted probability of class \(\hat{y}\) under the teacher model \(T\), \(\mu_{\hat{y}} \in \mathbb{R}^d\) is the prototype (mean feature) for class \(\hat{y}\), and \(\tau_{\mathrm{conf}}\) and \(\tau_{\mathrm{sim}}\) are empirically determined thresholds for confidence and similarity, respectively.
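A minimal sketch of the acceptance rule on toy 2-D features follows; the helper names, thresholds, and synthetic data are illustrative, not the method's implementation.

```python
import numpy as np

def class_prototypes(feats, labels, num_classes):
    """mu_c: mean feature of labeled samples in class c (Section 9)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

def accept_pseudo_labels(feats_u, probs_u, prototypes, tau_conf=0.9, tau_sim=0.7):
    """Accept a pseudo-label when teacher confidence AND prototype cosine
    similarity both exceed their thresholds."""
    y_hat = probs_u.argmax(axis=1)                 # teacher pseudo-labels
    conf = probs_u.max(axis=1)                     # p_T(y_hat | x)
    mu = prototypes[y_hat]                         # prototype of predicted class
    sim = np.sum(feats_u * mu, axis=1) / (
        np.linalg.norm(feats_u, axis=1) * np.linalg.norm(mu, axis=1))
    keep = (conf >= tau_conf) & (sim >= tau_sim)
    return y_hat, keep

# Toy usage: two classes in 2-D feature space.
rng = np.random.default_rng(5)
feats_l = np.vstack([rng.normal([2, 0], 0.1, (20, 2)),
                     rng.normal([0, 2], 0.1, (20, 2))])
labels_l = np.array([0] * 20 + [1] * 20)
protos = class_prototypes(feats_l, labels_l, 2)
feats_u = rng.normal([2, 0], 0.5, (10, 2))
probs_u = np.tile([0.95, 0.05], (10, 1))
y_hat, keep = accept_pseudo_labels(feats_u, probs_u, protos)
print(y_hat, keep.mean())   # keep.mean() estimates the coverage f
```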

This induces two measurable statistics over \(\mathcal{D}_u\):

\[ f = \Pr\big[\text{pseudo-label accepted}\big], \qquad \rho = \Pr\big[\hat{y} = y \mid \text{accepted}\big], \]

i.e., \(f\) is the coverage (the fraction of pseudo-labels passing both criteria) and \(\rho\) is the precision (the fraction of accepted pseudo-labels that are correct).

Note: Since the true labels \(y\) are not available for unlabeled data, \(\rho\) serves as a conceptual measure of the expected correctness of accepted pseudo-labels.

Accounting for filtering, the effective contribution of pseudo-labeled samples is:

\[ \gamma_{\mathrm{eff}} = f \gamma, \]

where \(f\) is the fraction of pseudo-labels that pass the acceptance criteria.

9.1 Pseudo-Label Error Dynamics under Prototype Refinement

The student update, accounting for filtering with precision \(\rho\), can be expressed as:

\[ \epsilon_s(t+1) = (1-f\gamma)\epsilon_0 + f\gamma(1-\rho)\epsilon_t(t). \]

Here \((1-\rho)\) captures the proportion of incorrect pseudo-labels that survive prototype filtering. When \(\rho=1\), every accepted pseudo-label is correct and pseudo-labels contribute no error, so \(\epsilon_s(t+1) = (1-f\gamma)\epsilon_0\); when \(\rho=0\), filtering confers no benefit and the dynamics reduce to the unrefined recurrence of Section 8 (with \(\gamma\) replaced by \(f\gamma\)).

The teacher parameters are updated through an exponential moving average:

\[ \epsilon_t(t+1) = \alpha\epsilon_t(t) + (1-\alpha)\epsilon_s(t+1) = \big[\alpha + (1-\alpha)f\gamma(1-\rho)\big]\epsilon_t(t) + (1-\alpha)(1-f\gamma)\epsilon_0, \]

where \(\alpha\) is the decay rate of the EMA. Defining the effective recurrence coefficient:

\[ \lambda_{\mathrm{eff}} = \alpha + (1-\alpha)f\gamma(1-\rho), \]

For bounded error propagation (stability), we require \(\lambda_{\mathrm{eff}} < 1\), which yields:

\[ f\gamma(1-\rho) < 1, \] \[ \Rightarrow \quad \rho > 1 - \frac{1}{f\gamma}. \tag{13} \]

Equation (13) highlights that when pseudo-labels are highly reliable (high precision \(\rho\)), the student can safely use a larger fraction of pseudo-labeled data (\(f\gamma\)) without destabilizing training. Conversely, if pseudo-labels are less reliable (low \(\rho\)), only a smaller fraction of pseudo-labeled data should be used to prevent error amplification.
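The effect of precision on the recurrence coefficient is easy to tabulate; the settings below are hypothetical.

```python
# Effective recurrence coefficient lam_eff = alpha + (1 - alpha) * f*gamma * (1 - rho)
# for a few illustrative (hypothetical) settings.
alpha = 0.99
for f_gamma, rho in [(0.9, 0.0), (0.9, 0.5), (0.9, 0.9)]:
    lam_eff = alpha + (1 - alpha) * f_gamma * (1 - rho)
    print(f"f*gamma={f_gamma}, rho={rho}: lam_eff={lam_eff:.5f}")
# Higher precision rho shrinks lam_eff (faster convergence) and enlarges the
# stable budget region f*gamma < 1/(1 - rho) of Equation (13).
```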

Provided that \(\lambda_{\mathrm{eff}} < 1\), Equation (12) shows that the teacher error asymptotically converges to:

\[ \epsilon_\infty = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{1-\lambda_{\mathrm{eff}}} = \frac{(1-f\gamma)\epsilon_0}{1-f\gamma(1-\rho)}. \tag{14} \]

Equation (14) quantifies how PRL influences long-term model performance.
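Evaluating Equation (14) for a few illustrative values makes the monotonicity in \(\rho\) concrete (all numbers hypothetical).

```python
# Asymptotic teacher error under PRL (Equation (14)) for illustrative values.
eps0, f_gamma = 0.1, 0.9
for rho in (0.0, 0.5, 0.9, 1.0):
    eps_inf = (1 - f_gamma) * eps0 / (1 - f_gamma * (1 - rho))
    print(f"rho={rho:.1f}: eps_inf={eps_inf:.4f}")
# rho=0.0 recovers eps_inf = eps0 = 0.1 (no benefit), while rho -> 1 drives
# eps_inf down to (1 - f*gamma) * eps0 = 0.01, matching Theorem 2.
```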

Theorem 2 (Prototype-Guided Pseudo-Label Refinement)
Let \(\epsilon_0>0\) denote the baseline error on labeled data (the error attained under purely supervised training). In a student-teacher semi-supervised framework with pseudo-label weight \(\gamma \in (0,1)\), EMA decay \(\alpha \in [0,1)\), and prototype-guided refinement with coverage \(f \in (0,1]\) and precision \(\rho \in [0,1]\), the asymptotic teacher error with PRL is \[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-f\gamma)\,\epsilon_0}{1 - f\gamma(1-\rho)}. \] Under the stability condition \(\lambda_{\mathrm{eff}} := \alpha + (1-\alpha) f \gamma (1-\rho) < 1\), the following hold:
  1. (Reduced asymptotic error) \(\epsilon_\infty^{\mathrm{PRL}} \le \epsilon_0\), with strict inequality if \(f\gamma>0\) and \(\rho>0\).
  2. (Monotonicity in precision) \(\frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} < 0\), so higher precision strictly reduces asymptotic error.
  3. (Safe pseudo-label budget) Stable error propagation is maintained for all \(f\gamma < \frac{1}{1-\rho}\), allowing larger pseudo-label usage at higher precision.
Proof:

From the analysis in Section 9.1, the teacher error evolves as a first-order linear recurrence:

\[ \epsilon_t(t+1) = \lambda_{\mathrm{eff}} \epsilon_t(t) + (1-\alpha)(1-f\gamma)\epsilon_0, \]

where,

\[ \lambda_{\mathrm{eff}} = \alpha + (1-\alpha) f \gamma (1-\rho). \]

For \(|\lambda_{\mathrm{eff}}|<1\), solving the recurrence yields the asymptotic error:

\[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{1-\lambda_{\mathrm{eff}}}. \]

Substituting \(\lambda_{\mathrm{eff}}\) and simplifying the denominator:

\[ 1 - \lambda_{\mathrm{eff}} = 1 - \alpha - (1-\alpha)f\gamma(1-\rho) = (1-\alpha)[1 - f\gamma(1-\rho)], \]

we obtain:

\[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{(1-\alpha)[1 - f\gamma(1-\rho)]} = \frac{(1-f\gamma)\epsilon_0}{1 - f\gamma(1-\rho)}. \]

Part 1: Reduced Asymptotic Error

Define the error ratio:

\[ R := \frac{\epsilon_\infty^{\mathrm{PRL}}}{\epsilon_0} = \frac{1-f\gamma}{1 - f\gamma(1-\rho)}. \]

Rewriting the denominator:

\[ 1 - f\gamma(1-\rho) = 1 - f\gamma + f\gamma\rho. \]

Case 1: If \(f\gamma = 0\) (no pseudo-labels used) or \(\rho = 0\) (all accepted pseudo-labels incorrect):

\[ 1 - f\gamma(1-\rho) = 1 - f\gamma \implies R = 1 \implies \epsilon_\infty^{\mathrm{PRL}} = \epsilon_0. \]

Case 2: If \(f\gamma > 0\) and \(\rho > 0\):

\[ f\gamma\rho > 0 \implies 1 - f\gamma + f\gamma\rho > 1 - f\gamma. \]

Under the stability condition, both numerator and denominator are positive, so:

\[ R = \frac{1-f\gamma}{1-f\gamma+f\gamma\rho} < 1 \implies \epsilon_\infty^{\mathrm{PRL}} < \epsilon_0 = \epsilon_\infty^{\mathrm{noPRL}}. \]

This establishes Part (1). \(\square\)

Part 2: Monotonicity in Precision

Taking the derivative of \(\epsilon_\infty^{\mathrm{PRL}}\) with respect to \(\rho\):

\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} = \frac{\partial}{\partial \rho}\left[\frac{(1-f\gamma)\epsilon_0}{1-f\gamma(1-\rho)}\right]. \]

Let \(u = 1 - f\gamma(1-\rho)\), so \(\frac{\partial u}{\partial \rho} = f\gamma\). Applying the chain rule:

\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} = (1-f\gamma)\epsilon_0 \cdot \frac{\partial}{\partial \rho}\left[\frac{1}{u}\right] = (1-f\gamma)\epsilon_0 \cdot \left(-\frac{1}{u^2}\right) \cdot f\gamma = -\frac{(1-f\gamma)f\gamma\epsilon_0}{[1-f\gamma(1-\rho)]^2}. \]

Since all factors are positive under stability (\(f\gamma \in (0,1)\), \(\epsilon_0 > 0\), and \(1-f\gamma(1-\rho) > 0\) from \(\lambda_{\mathrm{eff}} < 1\)), we have:

\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} < 0. \]

Higher precision strictly reduces asymptotic error, establishing Part (2). \(\square\)

Part 3: Safe Pseudo-Label Budget

The stability condition requires \(\lambda_{\mathrm{eff}} < 1\):

\[ \alpha + (1-\alpha)f\gamma(1-\rho) < 1 \] \[ (1-\alpha)f\gamma(1-\rho) < 1 - \alpha \] \[ f\gamma(1-\rho) < 1 \] \[ f\gamma < \frac{1}{1-\rho}. \]

This bound quantifies the maximum effective pseudo-label usage that maintains bounded error propagation. Higher precision \(\rho\) allows proportionally larger pseudo-label budgets \(f\gamma\) while maintaining identical stability guarantees. This establishes Part (3). \(\square\)

Prototype-guided pseudo-label refinement guarantees that the asymptotic teacher error satisfies \(\epsilon_\infty^{\mathrm{PRL}} \le \epsilon_0\), ensuring that filtered pseudo-labels do not degrade performance. The asymptotic error decreases monotonically with pseudo-label precision, and the effective pseudo-label budget \(f\gamma\) can be safely increased under the stability constraint: \[ f\gamma < \frac{1}{1-\rho}. \] This provides a quantitative criterion for safe pseudo-label utilization, showing that investing in pseudo-label quality via prototype-based filtering simultaneously reduces long-term error and enlarges the permissible pseudo-label budget, thereby expanding robust training regimes for semi-supervised learning.