Consider the Taylor expansion of \(\mathcal{L}\) at \(\mathbb{W}\) with increment \(\Delta \mathbb{W} = \tilde{G} \odot \xi\):
\[ \mathcal{L}(\mathbb{W} + \Delta \mathbb{W}) = \mathcal{L}(\mathbb{W}) + g^\top \Delta \mathbb{W} + \frac{1}{2} \Delta \mathbb{W}^\top H \Delta \mathbb{W} + R_3(\Delta \mathbb{W}), \]where \(R_3(\Delta \mathbb{W}) = O(\|\Delta \mathbb{W}\|^3)\).
Taking the expectation over \(\xi\) term by term, the linear term vanishes,
\[ \mathbb{E}_\xi\big[g^\top \Delta \mathbb{W}\big] = \sum_{i=1}^p g_i \tilde{G}_i \, \mathbb{E}[\xi_i] = 0, \]since \(\mathbb{E}[\xi_i] = 0\). The quadratic term reduces to the diagonal of the Hessian,
\[ \mathbb{E}_\xi\big[\Delta \mathbb{W}^\top H \Delta \mathbb{W}\big] = \sum_{i=1}^p \sum_{j=1}^p H_{ij} \tilde{G}_i \tilde{G}_j \, \mathbb{E}[\xi_i \xi_j] = \sum_{i=1}^p H_{ii} \tilde{G}_i^2, \]because \(\mathbb{E}[\xi_i \xi_j] = 0\) for \(i\neq j\) and \(\mathbb{E}[\xi_i^2] = 1\). The remainder is bounded in expectation,
\[ \big|\mathbb{E}_\xi[R_3(\Delta \mathbb{W})]\big| \le C \max_i \tilde{G}_i^3, \]for some constant \(C\). Since \(\tilde{G}_i \in (0,1]\), the higher-order term is bounded even for small \(g_i^2\).
Thus, the expected perturbed loss is:
\[ \mathbb{E}_\xi[\mathcal{L}(\mathcal{U}(\mathbb{W}))] = \mathcal{L}(\mathbb{W}) + \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2 + O(\max_i \tilde{G}_i^3) \]Let \(H = U \Lambda U^\top\) be the eigendecomposition of the Hessian, where \(U\) is orthonormal and \(\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_p)\). Then each diagonal entry satisfies,
\[ H_{ii} = \sum_{r=1}^p \lambda_r u_{ir}^2. \]The expected second-order contribution with \(\tilde{G}\) becomes,
\[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{i=1}^p \sum_{r=1}^p \lambda_r u_{ir}^2 \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \sum_{i=1}^p (u_{ir} \tilde{G}_i)^2. \]Define \(v^{(r)} \in \mathbb{R}^p\) with components \(v_i^{(r)} = u_{ir} \tilde{G}_i\). Then,
\[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \|v^{(r)}\|_2^2 = \sum_{r=1}^p \lambda_r \|U_{:,r} \odot \tilde{G}\|_2^2. \]GNI thus implements anisotropic noise injection: each Hessian eigenvalue \(\lambda_r\) is weighted by \(\|U_{:,r} \odot \tilde{G}\|_2^2\), the squared \(\ell_2\)-norm of \(\tilde{G}\) along the \(r\)-th eigendirection. When \(\tilde{G}\) is large on coordinates aligned with low-curvature directions (small \(\lambda_r\)), the noise contribution \(\lambda_r \|U_{:,r} \odot \tilde{G}\|_2^2\) remains controlled by the small eigenvalue; conversely, high-curvature directions (large \(\lambda_r\)) are automatically protected because \(\tilde{G}\) is small there. This concentrates noise along flat directions, where it has minimal impact on the loss, while protecting sensitive directions; it encourages flatter minima that improve generalization, and the higher-order terms remain bounded even when gradients vanish.
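The diagonal-to-eigendirection identity above can be checked numerically; the sketch below uses a random symmetric matrix as a Hessian surrogate (all values are illustrative):

```python
import numpy as np

# Check: sum_i H_ii * G_i^2 = sum_r lambda_r * ||U[:, r] * G||^2,
# with H = U diag(lambda) U^T a symmetric Hessian surrogate.
rng = np.random.default_rng(0)
p = 6
A = rng.standard_normal((p, p))
H = (A + A.T) / 2                      # symmetric matrix
G = rng.uniform(0.1, 1.0, size=p)      # noise scales G_i in (0, 1]

lhs = np.sum(np.diag(H) * G**2)        # diagonal-weighted sum

lam, U = np.linalg.eigh(H)             # eigendecomposition
rhs = np.sum(lam * np.sum((U * G[:, None])**2, axis=0))

assert np.isclose(lhs, rhs)
```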
We interpret the guided noise injection (GNI),
\[ \widetilde{\mathbb{W}} = \mathbb{W} + \tilde{G} \odot \xi, \qquad \xi \sim \mathcal{N}(0, I_p), \]as defining a stochastic posterior over parameters, where \(\mathbb{W} \in \mathbb{R}^p\) denotes the current model weights. Each coordinate \(w_i\) is perturbed by scaled Gaussian noise \(\tilde{G}_i \xi_i\), with normalized noise magnitude,
\[ \tilde{G}_i = \frac{1 + G_{\mathrm{inv},i} - \min_j G_{\mathrm{inv},j}}{1 + \max_j G_{\mathrm{inv},j} - \min_j G_{\mathrm{inv},j}} \in (0,1], \qquad G_{\mathrm{inv},i} = \frac{1}{(\nabla_{w_i} \mathcal{L})^2 + \varepsilon}. \]
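A minimal sketch of this normalization (the gradient vector below is an illustrative stand-in):

```python
import numpy as np

# Normalized noise magnitude G_tilde from the GNI definition above.
def noise_scale(grads, eps=1e-8):
    g_inv = 1.0 / (grads**2 + eps)     # inverse squared-gradient G_inv
    g_min, g_max = g_inv.min(), g_inv.max()
    return (1.0 + g_inv - g_min) / (1.0 + g_max - g_min)

grads = np.array([1e-4, 0.1, 1.0, 10.0])
g_tilde = noise_scale(grads)

# Every entry lies in (0, 1], and the smallest-gradient coordinate
# receives the largest noise scale (exactly 1).
assert np.all(g_tilde > 0) and np.all(g_tilde <= 1)
assert g_tilde[0] == g_tilde.max() == 1.0
```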
For general multivariate Gaussians \(q = \mathcal{N}(\mu_q, \Sigma_q)\) and \(p = \mathcal{N}(\mu_p, \Sigma_p)\), the KL divergence is:
\[ \mathrm{KL}(q\|p) = \frac{1}{2} \Big[ \mathrm{tr}(\Sigma_p^{-1} \Sigma_q) + (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) - p + \ln \frac{\det \Sigma_p}{\det \Sigma_q} \Big]. \]For the GNI posterior \(q(\theta) = \mathcal{N}(\mathbb{W}, \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2))\) and isotropic prior \(p(\theta) = \mathcal{N}(0, \tau^2 I_p)\), we have:
Since \(\Sigma_p = \tau^2 I_p\) and \(\Sigma_q = \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)\):
\[ \Sigma_p^{-1} \Sigma_q = (\tau^2 I_p)^{-1} \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2) = \frac{1}{\tau^2} \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2). \]The trace of a diagonal matrix is the sum of its diagonal elements, so
\[ \mathrm{tr}(\Sigma_p^{-1} \Sigma_q) = \sum_{i=1}^p \frac{\tilde{G}_i^2}{\tau^2}. \]This term measures the total posterior variance relative to the prior variance. The quadratic mean term evaluates to
\[ (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) = \frac{\|\mathbb{W}\|_2^2}{\tau^2} = \sum_{i=1}^p \frac{w_i^2}{\tau^2}, \]which measures how far the posterior mean \(\mathbb{W}\) is from the prior mean \(0\), scaled by the prior variance.
The determinant of a matrix is the product of its eigenvalues. For a diagonal matrix, the eigenvalues are the diagonal entries:
\[ \det(\mathrm{diag}(d_1, \dots, d_p)) = \prod_{i=1}^p d_i. \]Hence,
\[ \det \Sigma_p = \det(\tau^2 I_p) = (\tau^2)^p, \quad \det \Sigma_q = \det(\mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)) = \prod_{i=1}^p \tilde{G}_i^2. \]Then the log-determinant term becomes,
\[ \ln \frac{\det \Sigma_p}{\det \Sigma_q} = \ln \frac{\tau^{2p}}{\prod_{i=1}^p \tilde{G}_i^2} = \sum_{i=1}^p \ln \frac{\tau^2}{\tilde{G}_i^2}. \]This term penalizes how much the posterior volume differs from the prior. If posterior variance is smaller than prior, the log term is positive.
Combining these, the KL divergence reduces to:
\[ \mathrm{KL}(q\|p) = \frac{1}{2} \sum_{i=1}^{p} \Big[ \frac{\tilde{G}_i^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\tilde{G}_i^2}{\tau^2} \Big]. \tag{1} \]Each parameter \(w_i\) contributes to the KL divergence:
\[ \mathrm{KL}_i = \frac{1}{2} \Big[ \frac{\tilde{G}_i^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\tilde{G}_i^2}{\tau^2} \Big]. \]Define,
\[ x_i = \frac{\tilde{G}_i^2}{\tau^2}, \qquad f(x) = \frac{1}{2}(x - \ln x - 1), \]so that,
\[ \mathrm{KL}_i = f(x_i) + \frac{w_i^2}{2\tau^2}. \tag{2} \]Parameters with small gradients have large \(\tilde{G}_i\), injecting more noise. This increases posterior variance in low-sensitivity directions, promoting exploration and regularization, allowing these parameters to safely explore flat regions without significantly affecting expected loss. The KL contribution is influenced both by the noise scale \(\tilde{G}_i^2/\tau^2\) and the weight magnitude \(w_i^2/\tau^2\). Since \(\tilde{G}_i \in (0,1]\), choosing \(\tau \in [\tilde{G}_{\min}, \tilde{G}_{\max}]\) ensures \(x_i\) remains bounded, preventing the KL from growing excessively.
Parameters with large gradients have small \(\tilde{G}_i\), suppressing noise. The KL term is then dominated by \(w_i^2/\tau^2\), reflecting that the posterior is tightly concentrated around the current weight, which prevents drift of these important weight parameters. The normalization in GNI ensures \(\tilde{G}_i > 0\), preventing the log term from diverging. Intuitively, high-gradient parameters remain stable.
Thus, the KL contribution reflects both the posterior spread (\(\tilde{G}_i\)) and the weight magnitude (\(w_i\)), providing a principled trade-off between stochastic regularization and posterior concentration. This provides a rigorous, anisotropic regularization mechanism that adapts to the local geometry of the loss landscape.
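As a numeric sanity check, the closed form in Equation (1) can be compared against the general Gaussian KL expression (a sketch with illustrative values):

```python
import numpy as np

# Verify Equation (1) against the general formula
# KL = 0.5 * [trace + mean term - p + log-det ratio].
rng = np.random.default_rng(1)
p = 4
w = rng.standard_normal(p)             # posterior mean W
g2 = rng.uniform(0.1, 1.0, p)**2       # posterior variances G_i^2
tau2 = 0.5                             # prior variance tau^2

# Closed form (Equation 1)
kl_closed = 0.5 * np.sum(g2/tau2 + w**2/tau2 - 1 - np.log(g2/tau2))

# General Gaussian KL for diagonal Sigma_q and isotropic Sigma_p
kl_general = 0.5 * (np.sum(g2)/tau2 + np.dot(w, w)/tau2 - p
                    + np.sum(np.log(tau2/g2)))

assert np.isclose(kl_closed, kl_general)
```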
We define two posterior distributions over parameters \(\theta\): the anisotropic GNI posterior \(q_{\mathrm{ani}}(\theta) = \mathcal{N}(\mathbb{W}, \Sigma_{\mathrm{ani}})\) with \(\Sigma_{\mathrm{ani}} = \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)\), and an isotropic posterior \(q_{\mathrm{iso}}(\theta) = \mathcal{N}(\mathbb{W}, \Sigma_{\mathrm{iso}})\) with \(\Sigma_{\mathrm{iso}} = \sigma^2 I_p\), where \(\sigma^2 = \frac{1}{p}\sum_{i=1}^p \tilde{G}_i^2\) so that the two posteriors have matched total variance.
The prior is isotropic Gaussian:
\[ p(\theta) = \mathcal{N}(0, \tau^2 I_p), \quad \tau>0. \]Using Equation (1), for the isotropic posterior with variance \(\sigma^2\):
\[ \mathrm{KL}(q_{\mathrm{iso}} \| p) = \frac{1}{2} \sum_{i=1}^p \left[ \frac{\sigma^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\sigma^2}{\tau^2} \right] = \frac{1}{2} \left[ \frac{p \sigma^2}{\tau^2} + \frac{\|\mathbb{W}\|_2^2}{\tau^2} - p - p \ln \frac{\sigma^2}{\tau^2} \right]. \tag{3} \]Similar to Equation (2),
\[ \mathrm{KL}^{\mathrm{iso}}_i = f(y_i) + \frac{w_i^2}{2\tau^2}. \tag{4} \]where,
\[ y_i = \frac{\sigma^2}{\tau^2} \quad \forall i. \]Using a second-order Taylor expansion around \(\mathbb{W}\):
\[ \mathcal{L}(\theta) \approx \mathcal{L}(\mathbb{W}) + (\theta - \mathbb{W})^\top \nabla \mathcal{L}(\mathbb{W}) + \frac{1}{2} (\theta - \mathbb{W})^\top H (\theta - \mathbb{W}), \]where,
\[ H = \nabla^2_{\mathbb{W}} \mathcal{L}(\mathbb{W}) = U \Lambda U^\top, \quad \Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_p), \quad U^\top U = I_p. \]Taking expectation under \(q(\theta)\) and using \(\mathbb{E}_{q}[\theta - \mathbb{W}] = 0\):
\[ \mathbb{E}_{\theta \sim q}[\mathcal{L}(\theta)] \approx \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_q). \tag{5} \]For anisotropic posterior:
\[ \mathbb{E}_{q_{\mathrm{ani}}}[\mathcal{L}(\theta)] - \mathcal{L}(\mathbb{W}) \approx \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{ani}}) = \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2, \] \[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{i=1}^p \sum_{r=1}^p \lambda_r u_{ir}^2 \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \sum_{i=1}^p (u_{ir} \tilde{G}_i)^2 = \sum_{r=1}^p \lambda_r \| U_{:,r} \odot \tilde{G} \|_2^2 \]For isotropic posterior:
\[ \mathbb{E}_{q_{\mathrm{iso}}}[\mathcal{L}(\theta)] - \mathcal{L}(\mathbb{W}) \approx \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{iso}}) = \frac{1}{2} \sum_{i=1}^p H_{ii} \sigma^2 = \frac{1}{2} \sigma^2 \mathrm{tr}(H), \] \[ \sigma^2 \mathrm{tr}(H) = \sigma^2 \sum_{i=1}^p H_{ii} = \sigma^2 \sum_{r=1}^p \lambda_r. \]Low-curvature directions (\(\lambda_r\) small) are allowed larger variance \(\tilde{G}_i \in (0,1]\), contributing minimally to \(\sum_{r=1}^p \lambda_r \| U_{:,r} \odot \tilde{G} \|_2^2\). High-curvature directions (\(\lambda_r\) large) have small \(\tilde{G}_i\), preventing large expected loss increase. Isotropic variance \(\sigma^2\) does not adapt to curvature; high-curvature directions may receive excessive noise.
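This comparison can be illustrated numerically. The sketch below assumes, as the text argues, that \(\tilde{G}\) is small exactly where curvature is large, and matches \(\sigma^2\) to the mean of the \(\tilde{G}_i^2\) (all numbers are toy choices):

```python
import numpy as np

# Anisotropic vs isotropic second-order terms under the assumption that
# noise scales G anti-correlate with the diagonal curvatures H_ii.
H_diag = np.array([10.0, 5.0, 0.5, 0.1])   # toy curvatures H_ii
G = np.array([0.05, 0.1, 0.8, 1.0])        # small noise where curvature is high
sigma2 = np.mean(G**2)                     # matched total variance

ani = 0.5 * np.sum(H_diag * G**2)          # (1/2) tr(H Sigma_ani)
iso = 0.5 * sigma2 * np.sum(H_diag)        # (1/2) sigma^2 tr(H)

assert ani < iso                           # anisotropic noise costs less loss
```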
Hence, we obtain the inequality
\[ \mathbb{E}_{\theta \sim q_{\mathrm{ani}}}[\mathcal{L}(\theta)] = \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{ani}}) \le \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{iso}}) = \mathbb{E}_{\theta \sim q_{\mathrm{iso}}}[\mathcal{L}(\theta)]. \tag{6} \]For the KL divergence terms, the function \(f(x) = \frac{1}{2}(x - \ln x - 1)\) from Equation (2) satisfies:
\[ \frac{1}{p} \sum_{i=1}^p f\Big(\frac{\tilde{G}_i^2}{\tau^2}\Big) \le f\Big(\frac{1}{p} \sum_{i=1}^p \frac{\tilde{G}_i^2}{\tau^2} \Big) = f\Big(\frac{\sigma^2}{\tau^2}\Big), \]by Jensen's inequality. Hence,
\[ \sum_{i=1}^p f\Big(\frac{\tilde{G}_i^2}{\tau^2}\Big) \le p f\Big(\frac{\sigma^2}{\tau^2}\Big). \tag{7} \]The mean term \(\sum_i w_i^2/(2\tau^2)\) is identical for both posteriors (weight normalization ensures boundedness). Therefore, from Equations (2) and (4):
\[ \mathrm{KL}(q_{\mathrm{ani}} \| p) \le \mathrm{KL}(q_{\mathrm{iso}} \| p). \tag{8} \]For dataset size \(N\) and confidence level \(\delta \in (0,1)\), with probability at least \(1-\delta\), the PAC-Bayes bound states:
\[ \mathbb{E}_{\theta \sim q}[\mathcal{L}_{\mathrm{test}}(\theta)] \le \mathbb{E}_{\theta \sim q}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q\|p) + \ln\frac{N}{\delta}}{2(N-1)}}. \tag{9} \]Combining (6) and (8):
\[ \mathbb{E}_{q_{\mathrm{ani}}}[\mathcal{L}_{\mathrm{test}}] \le \mathbb{E}_{q_{\mathrm{ani}}}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q_{\mathrm{ani}}\|p)+\ln(N/\delta)}{2(N-1)}} \le \mathbb{E}_{q_{\mathrm{iso}}}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q_{\mathrm{iso}}\|p)+\ln(N/\delta)}{2(N-1)}}, \] \[ \implies \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{ani}}(\theta), p(\theta)) \le \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{iso}}(\theta), p(\theta)). \]Thus, the anisotropic posterior induced by GNI achieves a tighter PAC-Bayes bound than the isotropic posterior with matched total variance.
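To make the role of the KL term in the bound (9) concrete, a small numeric evaluation (the empirical losses, KL values, \(N\), and \(\delta\) are illustrative stand-ins):

```python
import math

# PAC-Bayes bound (9): empirical loss plus a complexity penalty.
def pac_bayes_bound(emp_loss, kl, N, delta):
    return emp_loss + math.sqrt((kl + math.log(N / delta)) / (2 * (N - 1)))

# For the same empirical loss, a smaller KL gives a tighter bound,
# which is the mechanism exploited in the comparison above.
b_small_kl = pac_bayes_bound(0.05, kl=10.0, N=10000, delta=0.05)
b_large_kl = pac_bayes_bound(0.05, kl=100.0, N=10000, delta=0.05)
assert b_small_kl < b_large_kl
```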
Consider a student-teacher semi-supervised segmentation framework.
We begin by formulating the coupled dynamics between the student and teacher networks in the absence of pseudo-label refinement (PRL). Let \(\epsilon_s(t)\) and \(\epsilon_t(t)\) denote the expected pseudo-label errors (i.e., the fractions of incorrect pseudo-labels) produced by the student and teacher, respectively, at iteration \(t\). The student updates its parameters using a mixture of ground-truth labeled data and teacher-generated pseudo-labels:
\[ \epsilon_s(t+1) = (1-\gamma)\,\epsilon_0 + \gamma\,\epsilon_t(t), \tag{10} \]where \(\epsilon_0\) represents the intrinsic baseline error arising from labeled supervision, and \(\gamma \in [0,1]\) controls the relative contribution of pseudo-labeled data in training. Intuitively, \(\gamma=0\) corresponds to purely supervised learning based solely on labeled data, whereas \(\gamma \to 1\) denotes a regime where training is dominated by pseudo-labels generated by the teacher.
The teacher, on the other hand, is an exponentially moving average (EMA) of the student:
\[ \epsilon_t(t+1) = \alpha \epsilon_t(t) + (1-\alpha) \epsilon_s(t+1), \tag{11} \]where \(\alpha \in [0,1)\) is the EMA decay constant. This coupling forms a delayed feedback loop: the teacher smooths over the temporal trajectory of student errors.
Substituting Equation (10) into Equation (11) gives:
\[ \epsilon_t(t+1) = \alpha \epsilon_t(t) + (1-\alpha)\big[(1-\gamma)\epsilon_0 + \gamma \epsilon_t(t)\big] = [\alpha + (1-\alpha)\gamma] \epsilon_t(t) + (1-\alpha)(1-\gamma)\epsilon_0. \]This establishes a non-homogeneous linear recurrence relation describing how teacher error evolves in conjunction with student learning dynamics.
Let us define:
\[ \lambda = \alpha + (1-\alpha)\gamma, \]The update then simplifies to:
\[ \epsilon_t(t+1) = \lambda \epsilon_t(t) + (1-\alpha)(1-\gamma)\epsilon_0 \]The general solution is:
\[ \epsilon_t(t) = \lambda^t \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \sum_{k=0}^{t-1}\lambda^k, \]where \(\epsilon_t(0)\) denotes the initial teacher pseudo-label error at iteration \(t=0\).
The term \(\lambda^t \epsilon_t(0)\) captures the contribution of past errors, exponentially weighted over time, while the summation term captures the accumulation of newly introduced errors from labeled supervision.
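The recurrence and its closed-form solution can be checked numerically (toy parameter values; here \(\lambda < 1\), matching case (a) below, so the iterates approach \(\epsilon_0\)):

```python
# Iterate epsilon_t(t+1) = lam * epsilon_t(t) + c and compare with
# the closed form lam^t * e(0) + c * (1 - lam^t) / (1 - lam).
alpha, gamma, eps0 = 0.99, 0.5, 0.1
lam = alpha + (1 - alpha) * gamma          # recurrence coefficient
c = (1 - alpha) * (1 - gamma) * eps0       # inhomogeneous term
e_init = 0.4                               # initial teacher error epsilon_t(0)

e, T = e_init, 200
for _ in range(T):
    e = lam * e + c

closed = lam**T * e_init + c * (1 - lam**T) / (1 - lam)
assert abs(e - closed) < 1e-9

# Long-run limit equals eps0, consistent with Equation (12) for lam < 1.
for _ in range(20000):
    e = lam * e + c
assert abs(e - eps0) < 1e-6
```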
We analyze three characteristic regimes of \(\lambda\):
(a) \(\lambda < 1\):
The geometric series converges:
\[
\sum_{k=0}^{t-1}\lambda^k = \frac{1-\lambda^t}{1-\lambda} \xrightarrow[t\to\infty]{} \frac{1}{1-\lambda}.
\]
Therefore, the asymptotic value of the teacher error is:
\[
\epsilon_\infty = \frac{(1-\alpha)(1-\gamma)\epsilon_0}{1-\lambda}.
\tag{12}
\]
Substituting \(\lambda = \alpha + (1-\alpha)\gamma\), we find:
\[
\epsilon_\infty = \frac{(1-\alpha)(1-\gamma)\epsilon_0}{(1-\alpha)(1-\gamma)} = \epsilon_0.
\]
The teacher asymptotically retains the same error as if trained purely on ground-truth data.
(b) \(\lambda = 1\):
The summation becomes divergent and grows linearly with time:
\[
\sum_{k=0}^{t-1}\lambda^k = t.
\]
The error therefore exhibits linear drift:
\[
\epsilon_t(t) = \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \cdot t.
\]
Errors are neither damped nor amplified exponentially, but drift slowly.
(c) \(\lambda > 1\):
The geometric series diverges exponentially:
\[
\epsilon_t(t) = \lambda^t \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \frac{\lambda^t - 1}{\lambda - 1}.
\]
Here, \(\epsilon_t\) grows exponentially, leading to confirmation bias: erroneous pseudo-labels reinforce the teacher, which in turn produces even more erroneous labels. Such behavior is typical when \(\gamma\) is high (strong reliance on unlabeled data).
For bounded error propagation, we require \(\lambda < 1\), which simplifies to \(\gamma < 1\). However, in real semi-supervised settings, \(|\mathcal{D}_u| \gg |\mathcal{D}_l|\), leading to an effective \(\gamma\) close to 1. This makes pseudo-label drift and error amplification inevitable without refinement mechanisms. This theoretical observation provides a direct justification for pseudo-label filtering and refinement strategies to maintain \(\lambda < 1\) in practice.
Let \(z = f_\theta(x) \in \mathbb{R}^d\) denote the feature embedding of an unlabeled sample \(x\), where \(f_\theta\) is the student model with parameters \(\theta\). Let \(\mu_c \in \mathbb{R}^d\) be the prototype (mean feature) for class \(c\), computed from the labeled set \(\mathcal{D}_l^{(c)}\) of \(N_c\) examples of class \(c\):
\[ \mu_c = \frac{1}{N_c} \sum_{x_i \in \mathcal{D}_l^{(c)}} f_\theta(x_i), \]where \(\mathcal{D}_l^{(c)}\) is the set of labeled samples belonging to class \(c\), \(N_c = |\mathcal{D}_l^{(c)}|\) is the number of labeled samples in class \(c\). A pseudo-label \(\hat{y}\) for input \(x\) is accepted if both the model confidence and similarity satisfy thresholds:
\[ p_T(\hat{y}\mid x) \ge \tau_{\mathrm{conf}}, \qquad s(x,\hat{y}) = \frac{\langle f_\theta(x), \mu_{\hat{y}} \rangle}{\|f_\theta(x)\|\|\mu_{\hat{y}}\|} \ge \tau_{\mathrm{sim}}, \]where \(p_T(\hat{y}\mid x)\) is the predicted probability of class \(\hat{y}\) from the teacher model \(T\), \(\mu_{\hat{y}} \in \mathbb{R}^d\) is the prototype (mean feature) for class \(\hat{y}\), \(\tau_{\mathrm{conf}}\) and \(\tau_{\mathrm{sim}}\) are empirically determined thresholds for confidence and similarity, respectively.
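A minimal sketch of this dual-threshold acceptance test (the features, prototypes, teacher probabilities, and threshold values are illustrative stand-ins):

```python
import numpy as np

# Accept a pseudo-label only if teacher confidence and cosine similarity
# to the class prototype both clear their thresholds.
def accept(z, probs, prototypes, tau_conf=0.9, tau_sim=0.7):
    y_hat = int(np.argmax(probs))                  # teacher pseudo-label
    mu = prototypes[y_hat]
    sim = np.dot(z, mu) / (np.linalg.norm(z) * np.linalg.norm(mu))
    return (probs[y_hat] >= tau_conf) and (sim >= tau_sim), y_hat

prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])    # class-mean features
ok, y_hat = accept(np.array([0.9, 0.1]), np.array([0.95, 0.05]), prototypes)
assert ok and y_hat == 0
```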
This induces two measurable statistics over \(\mathcal{D}_u\): the acceptance fraction \(f\), i.e., the proportion of pseudo-labels that pass both thresholds, and the precision \(\rho\), i.e., the proportion of accepted pseudo-labels that are correct (\(\hat{y} = y\)).
Note: Since the true labels \(y\) are not available for unlabeled data, \(\rho\) serves as a conceptual measure of the expected correctness of accepted pseudo-labels.
Accounting for filtering, the effective contribution of pseudo-labeled samples is:
\[ \gamma_{\mathrm{eff}} = f \gamma, \]where \(f\) is the fraction of pseudo-labels that pass the acceptance criteria.
The student update, accounting for filtering with precision \(\rho\), can be expressed as:
\[ \epsilon_s(t+1) = (1-f\gamma)\epsilon_0 + f\gamma(1-\rho)\epsilon_t(t). \]Here \((1-\rho)\) captures the proportion of incorrect pseudo-labels that survive prototype filtering. When \(\rho=1\), all accepted pseudo-labels are correct and the student fully benefits from unlabeled supervision; when \(\rho=0\), every accepted pseudo-label is incorrect and the dynamics reduce to the unfiltered recurrence of Equation (10) with \(\gamma\) replaced by \(f\gamma\). Only when \(f=0\) does the student ignore pseudo-labels entirely, reducing to a purely supervised learner.
The teacher parameters are updated through an exponential moving average:
\[ \epsilon_t(t+1) = \alpha\epsilon_t(t) + (1-\alpha)\epsilon_s(t+1) = \big[\alpha + (1-\alpha)f\gamma(1-\rho)\big]\epsilon_t(t) + (1-\alpha)(1-f\gamma)\epsilon_0, \]where \(\alpha\) is the decay rate of the EMA. Defining the effective recurrence coefficient:
\[ \lambda_{\mathrm{eff}} = \alpha + (1-\alpha)f\gamma(1-\rho). \]For bounded error propagation (stability), we require \(\lambda_{\mathrm{eff}} < 1\), which yields:
\[ f\gamma(1-\rho) < 1, \] \[ \Rightarrow \quad \rho > 1 - \frac{1}{f\gamma}. \tag{13} \]Equation (13) highlights that when pseudo-labels are highly reliable (high precision \(\rho\)), the student can safely use a larger fraction of pseudo-labeled data (\(f\gamma\)) without destabilizing training. Conversely, if pseudo-labels are less reliable (low \(\rho\)), only a smaller fraction of pseudo-labeled data should be used to prevent error amplification.
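A quick numeric check of the stability condition and the precision bound (13); the values of \(\alpha\), \(f\), \(\gamma\), and \(\rho\) are illustrative:

```python
# lambda_eff < 1 is equivalent to f * gamma * (1 - rho) < 1.
alpha = 0.99

def lam_eff(f, gamma, rho):
    return alpha + (1 - alpha) * f * gamma * (1 - rho)

# Marginal case: full pseudo-label reliance with zero precision
# gives lambda_eff = 1 (no damping).
assert abs(lam_eff(1.0, 1.0, 0.0) - 1.0) < 1e-12

# Any positive precision restores stability.
assert lam_eff(1.0, 1.0, 0.2) < 1.0

# The two formulations of the condition agree.
f, gamma, rho = 1.0, 1.0, 0.2
assert (lam_eff(f, gamma, rho) < 1) == (f * gamma * (1 - rho) < 1)
```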
Provided that \(\lambda_{\mathrm{eff}} < 1\), Equation (12) shows that the teacher error asymptotically converges to:
\[ \epsilon_\infty = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{1-\lambda_{\mathrm{eff}}} = \frac{(1-f\gamma)\epsilon_0}{1-f\gamma(1-\rho)}. \tag{14} \]Equation (14) quantifies how PLR influences long-term model performance.
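Equation (14) can be verified by iterating the filtered recurrence directly (toy parameter values):

```python
# Iterate the filtered teacher-error recurrence to its fixed point and
# compare with the closed-form asymptotic error of Equation (14).
alpha, f, gamma, rho, eps0 = 0.9, 0.9, 0.9, 0.8, 0.1
lam_eff = alpha + (1 - alpha) * f * gamma * (1 - rho)
assert lam_eff < 1                         # stability holds for these values

e = 0.5                                    # arbitrary initial teacher error
for _ in range(5000):
    e = lam_eff * e + (1 - alpha) * (1 - f * gamma) * eps0

eps_inf = (1 - f * gamma) * eps0 / (1 - f * gamma * (1 - rho))
assert abs(e - eps_inf) < 1e-10
assert eps_inf < eps0                      # refinement improves on eps0
```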
From the analysis in Section 9.1, the teacher error evolves as a first-order linear recurrence:
\[ \epsilon_t(t+1) = \lambda_{\mathrm{eff}} \epsilon_t(t) + (1-\alpha)(1-f\gamma)\epsilon_0, \]where,
\[ \lambda_{\mathrm{eff}} = \alpha + (1-\alpha) f \gamma (1-\rho). \]For \(|\lambda_{\mathrm{eff}}|<1\), solving the recurrence yields the asymptotic error:
\[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{1-\lambda_{\mathrm{eff}}}. \]Substituting \(\lambda_{\mathrm{eff}}\) and simplifying the denominator:
\[ 1 - \lambda_{\mathrm{eff}} = 1 - \alpha - (1-\alpha)f\gamma(1-\rho) = (1-\alpha)[1 - f\gamma(1-\rho)], \]we obtain:
\[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{(1-\alpha)[1 - f\gamma(1-\rho)]} = \frac{(1-f\gamma)\epsilon_0}{1 - f\gamma(1-\rho)}. \]Define the error ratio:
\[ R := \frac{\epsilon_\infty^{\mathrm{PRL}}}{\epsilon_0} = \frac{1-f\gamma}{1 - f\gamma(1-\rho)}. \]Rewriting the denominator:
\[ 1 - f\gamma(1-\rho) = 1 - f\gamma + f\gamma\rho. \]Case 1: If \(f\gamma = 0\) (no pseudo-labels used) or \(\rho = 0\) (all accepted pseudo-labels incorrect):
\[ 1 - f\gamma(1-\rho) = 1 - f\gamma \implies R = 1 \implies \epsilon_\infty^{\mathrm{PRL}} = \epsilon_0. \]Case 2: If \(f\gamma > 0\) and \(\rho > 0\):
\[ f\gamma\rho > 0 \implies 1 - f\gamma + f\gamma\rho > 1 - f\gamma. \]Under the stability condition, both numerator and denominator are positive, so:
\[ R = \frac{1-f\gamma}{1-f\gamma+f\gamma\rho} < 1 \implies \epsilon_\infty^{\mathrm{PRL}} < \epsilon_0 = \epsilon_\infty^{\mathrm{noPRL}}. \]This establishes Part (1). \(\square\)
Taking the derivative of \(\epsilon_\infty^{\mathrm{PRL}}\) with respect to \(\rho\):
\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} = \frac{\partial}{\partial \rho}\left[\frac{(1-f\gamma)\epsilon_0}{1-f\gamma(1-\rho)}\right]. \]Let \(u = 1 - f\gamma(1-\rho)\), so \(\frac{\partial u}{\partial \rho} = f\gamma\). Applying the chain rule:
\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} = (1-f\gamma)\epsilon_0 \cdot \frac{\partial}{\partial \rho}\left[\frac{1}{u}\right] = (1-f\gamma)\epsilon_0 \cdot \left(-\frac{1}{u^2}\right) \cdot f\gamma = -\frac{(1-f\gamma)f\gamma\epsilon_0}{[1-f\gamma(1-\rho)]^2}. \]Since all factors are positive under stability (\(f\gamma \in (0,1)\), \(\epsilon_0 > 0\), and \(1-f\gamma(1-\rho) > 0\) from \(\lambda_{\mathrm{eff}} < 1\)), we have:
\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} < 0. \]Higher precision strictly reduces asymptotic error, establishing Part (2). \(\square\)
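The strict monotonicity in \(\rho\) can also be confirmed numerically (toy values for \(f\), \(\gamma\), \(\epsilon_0\)):

```python
# eps_inf^PRL as a function of rho, checked to be strictly decreasing.
f, gamma, eps0 = 0.8, 0.9, 0.2

def eps_inf(rho):
    return (1 - f * gamma) * eps0 / (1 - f * gamma * (1 - rho))

rhos = [0.1, 0.3, 0.5, 0.7, 0.9]
vals = [eps_inf(r) for r in rhos]
assert all(a > b for a, b in zip(vals, vals[1:]))   # monotone decreasing
```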
The stability condition requires \(\lambda_{\mathrm{eff}} < 1\):
\[ \alpha + (1-\alpha)f\gamma(1-\rho) < 1 \] \[ (1-\alpha)f\gamma(1-\rho) < 1 - \alpha \] \[ f\gamma(1-\rho) < 1 \] \[ f\gamma < \frac{1}{1-\rho} \qquad (\rho < 1). \]This bound quantifies the maximum effective pseudo-label usage that maintains bounded error propagation. Higher precision \(\rho\) allows proportionally larger pseudo-label budgets \(f\gamma\) while maintaining identical stability guarantees. This establishes Part (3). \(\square\)
Prototype-guided pseudo-label refinement guarantees that the asymptotic teacher error satisfies \(\epsilon_\infty^{\mathrm{PRL}} \le \epsilon_0\), ensuring that filtered pseudo-labels do not degrade performance. The asymptotic error decreases monotonically with pseudo-label precision, and the effective pseudo-label budget \(f\gamma\) can be safely increased under the stability constraint: \[ f\gamma < \frac{1}{1-\rho}. \] This provides a quantitative criterion for safe pseudo-label utilization, showing that investing in pseudo-label quality via prototype-based filtering simultaneously reduces long-term error and enlarges the permissible pseudo-label budget, thereby expanding robust training regimes for semi-supervised learning.