Theoretical Analysis of Guided Noise Injection (GNI)

1. Definitions, Notation, and Assumptions

Definition 1 (Weight vector)
Let \(\mathbb{W} \in \mathbb{R}^p\) denote the flattened vector containing all model parameters, where each component is indexed by \(i \in \{1, \dots, p\}\), and \(w_i\) denotes the \(i\)-th parameter.
Definition 2 (Gradient buffer)
At the current optimization step, let \[ g_i = \frac{\partial \mathcal{L}}{\partial w_i} \] denote the gradient of the loss \(\mathcal{L}\) with respect to \(w_i\), and define the elementwise squared gradient: \[ G_i = g_i^2. \]
Definition 3 (Inverse gradient modulator (normalized))
Introduce a stable inverse-gradient modulator: \[ G_{\mathrm{inv},i} = \frac{1}{G_i + \varepsilon}, \quad \varepsilon > 0, \] and normalize to a bounded range to control noise magnitude: \[ \tilde{G}_{i} = \frac{1 + G_{\mathrm{inv},i} - \min_j G_{\mathrm{inv},j}}{1 + \max_j G_{\mathrm{inv},j} - \min_j G_{\mathrm{inv},j}} \in (0,1]. \]
Definition 4 (Guided noise operator)
Let \(\xi = (\xi_1,\dots,\xi_p)\) be i.i.d. standard Gaussian noise, \(\xi_i \sim \mathcal{N}(0,1)\). Define, \[ \mathcal{U}(\mathbb{W}) = \mathbb{W} + \tilde{G} \odot \xi, \] i.e., componentwise: \[ \widetilde{w}_i = w_i + \tilde{G}_i \, \xi_i. \]
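For concreteness, the following minimal NumPy sketch applies Definitions 2-4; the function name `guided_noise`, the global scale argument `alpha` (cf. the note in Section 2), and the toy gradient values are illustrative, not a reference implementation.

```python
import numpy as np

def guided_noise(w, g, eps=1e-8, alpha=1.0, rng=None):
    """Minimal sketch of the GNI operator (Definitions 2-4).

    w     : flattened weight vector, shape (p,)
    g     : gradient of the loss w.r.t. w, shape (p,)
    eps   : the stabilizer epsilon > 0 of Definition 3
    alpha : optional global noise scale (see the note in Section 2)
    """
    rng = np.random.default_rng() if rng is None else rng
    G = g ** 2                                 # elementwise squared gradient
    G_inv = 1.0 / (G + eps)                    # stable inverse-gradient modulator
    G_tilde = (1.0 + G_inv - G_inv.min()) / (1.0 + G_inv.max() - G_inv.min())
    xi = rng.standard_normal(w.shape)          # xi_i ~ N(0, 1), i.i.d.
    return w + alpha * G_tilde * xi, G_tilde   # U(W) = W + G_tilde ⊙ xi

# Toy usage: coordinates with larger gradients receive less noise.
w = np.zeros(4)
g = np.array([10.0, 1.0, 0.1, 0.0])
w_tilde, G_tilde = guided_noise(w, g, rng=np.random.default_rng(0))
print(G_tilde)   # increases as |g_i| decreases; all entries lie in (0, 1]
```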
Assumption 1 (Smoothness)
\(\mathcal{L}: \mathbb{R}^p \to \mathbb{R}\) is twice continuously differentiable in a neighborhood of \(\mathbb{W}\). Let \(g=(g_1,\dots,g_p)\) be the gradient vector and \(H = \nabla^2 \mathcal{L}(\mathbb{W})\) the Hessian.

2. Expected Perturbed Loss

Consider the Taylor expansion of \(\mathcal{L}\) at \(\mathbb{W}\) with increment \(\Delta \mathbb{W} = \tilde{G} \odot \xi\):

\[ \mathcal{L}(\mathbb{W} + \Delta \mathbb{W}) = \mathcal{L}(\mathbb{W}) + g^\top \Delta \mathbb{W} + \frac{1}{2} \Delta \mathbb{W}^\top H \Delta \mathbb{W} + R_3(\Delta \mathbb{W}), \]

where \(R_3(\Delta \mathbb{W}) = O(\|\Delta \mathbb{W}\|^3)\).

Linear term expectation:

\[ \mathbb{E}_\xi[g^\top \Delta \mathbb{W}] = \sum_i g_i \tilde{G}_i \mathbb{E}[\xi_i] = 0, \]

since \(\mathbb{E}[\xi_i] = 0\).

Quadratic term expectation:

\[ \frac{1}{2} \mathbb{E}_\xi[\Delta \mathbb{W}^\top H \Delta \mathbb{W}] = \frac{1}{2} \sum_{i,j} H_{ij} \tilde{G}_i \tilde{G}_j \mathbb{E}[\xi_i \xi_j] = \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2, \]

since \(\mathbb{E}[\xi_i \xi_j] = 0\) for \(i\neq j\) and \(\mathbb{E}[\xi_i^2] = 1\), only the diagonal terms survive.

Higher order term:

\[ |R_3(\Delta \mathbb{W})| \le C \|\Delta \mathbb{W}\|^3, \quad \mathbb{E}[|R_3|] = O(\mathbb{E}\|\Delta \mathbb{W}\|^3) = O(\max_i \tilde{G}_i^3), \]

for some constant \(C\). Since \(\tilde{G}_i \in (0,1]\), the higher order term is bounded even for small \(g_i^2\).

Note: For additional control of noise magnitude during training, one can scale the modulators by a factor \(\alpha \in (0,1]\): \[ \tilde{G}_i \;\longrightarrow\; \alpha \, \tilde{G}_i. \] With \(\Delta\mathbb{W}=\alpha\tilde{G}\odot\xi\) and \(\xi\sim\mathcal{N}(0,I_p)\) this yields \[ |R_3(\alpha\tilde{G}\odot\xi)| \le C(\alpha\max_i\tilde{G}_i)^3\|\xi\|^3, \] and thus, taking expectations, \[ \mathbb{E}\big[|R_3(\alpha\tilde{G}\odot\xi)|\big] \le C(\alpha\max_i\tilde{G}_i)^3\mathbb{E}\|\xi\|^3 = O\!\big((\alpha\max_i\tilde{G}_i)^3\big). \] The factor \(\mathbb{E}\|\xi\|^3\) is finite and depends only on the number of parameters \(p\). This ensures that the higher-order contributions remain bounded and controllable, while providing finer control over noise injection.

Thus, expected perturbed loss:

\[ \mathbb{E}_\xi[\mathcal{L}(\mathcal{U}(\mathbb{W}))] = \mathcal{L}(\mathbb{W}) + \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2 + O(\max_i \tilde{G}_i^3) \]
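As a sanity check, the expansion can be verified numerically on a toy quadratic loss, where the remainder \(R_3\) vanishes identically and the second-order formula is exact; all values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
A = rng.standard_normal((p, p))
H = A @ A.T                                   # a PSD Hessian for the toy quadratic
b = rng.standard_normal(p)
loss = lambda v: 0.5 * v @ H @ v + b @ v      # L(v); here R_3 = 0 identically

w = rng.standard_normal(p)
g = H @ w + b                                 # exact gradient at w
G_inv = 1.0 / (g**2 + 1e-8)
G_tilde = (1.0 + G_inv - G_inv.min()) / (1.0 + G_inv.max() - G_inv.min())

# Monte Carlo estimate of E_xi[L(W + G_tilde ⊙ xi)] vs. the second-order formula
mc = np.mean([loss(w + G_tilde * rng.standard_normal(p)) for _ in range(200_000)])
predicted = loss(w) + 0.5 * np.sum(np.diag(H) * G_tilde**2)
print(mc, predicted)                          # agree up to Monte Carlo error
```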

2.1 Interpretation and Theoretical Motivation

To second order, the expected increase in loss caused by GNI is \(\frac{1}{2}\sum_{i} H_{ii}\tilde{G}_i^2\): each coordinate contributes the product of its local curvature \(H_{ii}\) and its injected noise variance \(\tilde{G}_i^2\). Because \(\tilde{G}_i\) is large precisely where \(g_i^2\) is small, noise concentrates on low-sensitivity parameters, the linear term vanishes in expectation, and the remainder stays bounded. The next section makes the curvature dependence precise through the Hessian spectrum.

3. Spectral Viewpoint: Hessian Eigen-structure

Let \(H = U \Lambda U^\top\) be the eigendecomposition of the Hessian, where \(U\) is orthogonal (\(U^\top U = I_p\)) and \(\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_p)\). Then each diagonal entry satisfies,

\[ H_{ii} = \sum_{r=1}^p \lambda_r u_{ir}^2. \]

The expected second-order contribution with \(\tilde{G}\) becomes,

\[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{i=1}^p \sum_{r=1}^p \lambda_r u_{ir}^2 \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \sum_{i=1}^p (u_{ir} \tilde{G}_i)^2. \]

Define \(v^{(r)} \in \mathbb{R}^p\) with components \(v_i^{(r)} = u_{ir} \tilde{G}_i\). Then,

\[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \|v^{(r)}\|_2^2 = \sum_{r=1}^p \lambda_r \|U_{:,r} \odot \tilde{G}\|_2^2 \]
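This identity is straightforward to confirm numerically; the snippet below uses a random symmetric Hessian and arbitrary stand-in modulators (both illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.standard_normal((p, p))
H = (A + A.T) / 2                    # any symmetric Hessian
G_tilde = rng.uniform(0.1, 1.0, p)   # stand-in for the normalized modulators

lam, U = np.linalg.eigh(H)           # H = U diag(lam) U^T

lhs = np.sum(np.diag(H) * G_tilde**2)
rhs = sum(lam[r] * np.sum((U[:, r] * G_tilde)**2) for r in range(p))
print(np.isclose(lhs, rhs))          # True: sum_i H_ii G_i^2 = sum_r lam_r ||U_:,r ⊙ G||^2
```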

3.1 Interpretation

GNI implements anisotropic noise injection: each Hessian eigenvalue \(\lambda_r\) is weighted by \(\|U_{:,r} \odot \tilde{G}\|_2^2\), the squared \(\ell_2\)-norm of \(\tilde{G}\) along the \(r\)-th eigendirection. When \(\tilde{G}\) is large on coordinates aligned with low-curvature directions (small \(\lambda_r\)), the noise contribution \(\lambda_r \|U_{:,r} \odot \tilde{G}\|_2^2\) remains controlled by the small eigenvalue; conversely, high-curvature directions (large \(\lambda_r\)) are automatically protected because \(\tilde{G}\) is small there. GNI therefore concentrates noise along flat directions, where the loss impact is minimal, while shielding sensitive directions. This encourages flatter minima, which are associated with improved generalization, and the higher-order terms remain bounded even when gradients vanish.

PAC-Bayes Analysis of Guided Noise Injection

4. Setup and Assumptions

We interpret the guided noise injection (GNI),

\[ \widetilde{\mathbb{W}} = \mathbb{W} + \tilde{G} \odot \xi, \qquad \xi \sim \mathcal{N}(0, I_p), \]

as defining a stochastic posterior over parameters, where \(\mathbb{W} \in \mathbb{R}^p\) denotes the current model weights. Each coordinate \(w_i\) is perturbed by scaled Gaussian noise \(\tilde{G}_i \xi_i\), with normalized noise magnitude,

\[ \tilde{G}_i = \frac{1 + G_{\mathrm{inv},i} - \min_j G_{\mathrm{inv},j}}{1 + \max_j G_{\mathrm{inv},j} - \min_j G_{\mathrm{inv},j}} \in (0,1], \qquad G_{\mathrm{inv},i} = \frac{1}{(\nabla_{w_i} \mathcal{L})^2 + \varepsilon} \]

Assumptions:

  1. The posterior is the diagonal Gaussian induced by GNI, \(q(\theta) = \mathcal{N}(\mathbb{W}, \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2))\); the normalization guarantees \(\tilde{G}_i \in (0,1]\), so all posterior variances are strictly positive.
  2. The prior is an isotropic Gaussian \(p(\theta) = \mathcal{N}(0, \tau^2 I_p)\) with scale \(\tau > 0\) fixed independently of the training data.
  3. The loss is bounded (e.g., in \([0,1]\)) so that the PAC-Bayes bound in Equation (9) applies.

5. KL Divergence Between Posterior and Prior

For general multivariate Gaussians \(q = \mathcal{N}(\mu_q, \Sigma_q)\) and \(p = \mathcal{N}(\mu_p, \Sigma_p)\), the KL divergence is:

\[ \mathrm{KL}(q\|p) = \frac{1}{2} \Big[ \mathrm{tr}(\Sigma_p^{-1} \Sigma_q) + (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) - p + \ln \frac{\det \Sigma_p}{\det \Sigma_q} \Big]. \]

For the GNI posterior \(q(\theta) = \mathcal{N}(\mathbb{W}, \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2))\) and isotropic prior \(p(\theta) = \mathcal{N}(0, \tau^2 I_p)\), we have:

Trace term: \(\mathrm{tr}(\Sigma_p^{-1} \Sigma_q)\)

Since \(\Sigma_p = \tau^2 I_p\) and \(\Sigma_q = \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)\):

\[ \Sigma_p^{-1} \Sigma_q = (\tau^2 I_p)^{-1} \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2) = \frac{1}{\tau^2} \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2). \]

The trace of a diagonal matrix is the sum of its diagonal elements, so

\[ \mathrm{tr}(\Sigma_p^{-1} \Sigma_q) = \sum_{i=1}^p \frac{\tilde{G}_i^2}{\tau^2}. \]

Mean difference term: \((\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q)\)

\[ (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) = (0 - \mathbb{W})^\top (\tau^{-2} I_p) (0 - \mathbb{W}) = \frac{1}{\tau^2} \sum_{i=1}^p w_i^2. \]

This term measures how far the posterior mean \(\mathbb{W}\) is from the prior mean \(0\), scaled by the prior variance.

Log-determinant term: \(\ln \frac{\det \Sigma_p}{\det \Sigma_q}\)

The determinant of a matrix is the product of its eigenvalues. For a diagonal matrix, the eigenvalues are the diagonal entries:

\[ \det(\mathrm{diag}(d_1, \dots, d_p)) = \prod_{i=1}^p d_i. \]

Hence,

\[ \det \Sigma_p = \det(\tau^2 I_p) = (\tau^2)^p, \quad \det \Sigma_q = \det(\mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)) = \prod_{i=1}^p \tilde{G}_i^2. \]

Then the log-determinant term becomes,

\[ \ln \frac{\det \Sigma_p}{\det \Sigma_q} = \ln \frac{\tau^{2p}}{\prod_{i=1}^p \tilde{G}_i^2} = \sum_{i=1}^p \ln \frac{\tau^2}{\tilde{G}_i^2}. \]

This term penalizes how much the posterior volume differs from the prior. If posterior variance is smaller than prior, the log term is positive.

Combining these, the KL divergence reduces to:

\[ \mathrm{KL}(q\|p) = \frac{1}{2} \sum_{i=1}^{p} \Big[ \frac{\tilde{G}_i^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\tilde{G}_i^2}{\tau^2} \Big]. \tag{1} \]
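Equation (1) can be cross-checked against the general diagonal-Gaussian KL formula; the helper names and sampled values below are illustrative.

```python
import numpy as np

def kl_gni(w, G_tilde, tau):
    """Equation (1): KL between q = N(W, diag(G_tilde^2)) and p = N(0, tau^2 I)."""
    x = G_tilde**2 / tau**2
    return 0.5 * np.sum(x + w**2 / tau**2 - 1.0 - np.log(x))

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """General diagonal-Gaussian KL, for cross-checking the closed form."""
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q)**2 / var_p
                        - 1.0 + np.log(var_p / var_q))

rng = np.random.default_rng(3)
p, tau = 8, 0.5
w = rng.standard_normal(p)
G_tilde = rng.uniform(0.1, 1.0, p)
print(kl_gni(w, G_tilde, tau))
print(kl_diag_gauss(w, G_tilde**2, np.zeros(p), np.full(p, tau**2)))  # identical
```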

6. Interpretation of the KL Divergence Term

Each parameter \(w_i\) contributes to the KL divergence:

\[ \mathrm{KL}_i = \frac{1}{2} \Big[ \frac{\tilde{G}_i^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\tilde{G}_i^2}{\tau^2} \Big]. \]

Define,

\[ x_i = \frac{\tilde{G}_i^2}{\tau^2}, \qquad f(x) = \frac{1}{2}(x - \ln x - 1), \]

so that,

\[ \mathrm{KL}_i = f(x_i) + \frac{w_i^2}{2\tau^2}. \tag{2} \]

Low-gradient parameters:

Parameters with small gradients have large \(\tilde{G}_i\), injecting more noise. This increases posterior variance in low-sensitivity directions, promoting exploration and regularization, allowing these parameters to safely explore flat regions without significantly affecting expected loss. The KL contribution is influenced both by the noise scale \(\tilde{G}_i^2/\tau^2\) and the weight magnitude \(w_i^2/\tau^2\). Since \(\tilde{G}_i \in (0,1]\), choosing \(\tau \in [\tilde{G}_{\min}, \tilde{G}_{\max}]\) ensures \(x_i\) remains bounded, preventing the KL from growing excessively.

High-gradient parameters:

Parameters with large gradients have small \(\tilde{G}_i\), suppressing noise. The KL term is then dominated by \(w_i^2/\tau^2\), reflecting that the posterior is tightly concentrated around the current weight, which prevents drift of these important weights. The normalization in GNI ensures \(\tilde{G}_i > 0\), preventing the log term from diverging. Intuitively, high-gradient parameters remain stable.

Thus, the KL contribution reflects both the posterior spread (\(\tilde{G}_i\)) and the weight magnitude (\(w_i\)), providing a principled trade-off between stochastic regularization and posterior concentration. This provides a rigorous, anisotropic regularization mechanism that adapts to the local geometry of the loss landscape.

Theorem 1 (PAC-Bayes Bound Improvement via GNI)
Let \(p(\theta) = \mathcal{N}(0, \tau^2 I_p)\) be an isotropic Gaussian prior with scale \(0 < \tau \le \min_i \tilde{G}_i\), and let \(q_{\mathrm{ani}}(\theta)\) denote the anisotropic GNI posterior and \(q_{\mathrm{iso}}(\theta)\) the isotropic posterior whose per-coordinate variance dominates the GNI variances, \(\sigma^2 = \max_i \tilde{G}_i^2\). Assume the Hessian of the loss at \(\mathbb{W}\) is positive semidefinite (e.g., \(\mathbb{W}\) lies near a local minimum). Then, the anisotropic posterior induced by Guided Noise Injection (GNI) achieves a tighter PAC-Bayes bound than the isotropic posterior: \[ \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{ani}}(\theta), p(\theta)) \le \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{iso}}(\theta), p(\theta)) \]
Proof:

We define two posterior distributions over parameters \(\theta\), both centered at the current weights:

\[ q_{\mathrm{ani}}(\theta) = \mathcal{N}\big(\mathbb{W}, \mathrm{diag}(\tilde{G}_1^2, \dots, \tilde{G}_p^2)\big), \qquad q_{\mathrm{iso}}(\theta) = \mathcal{N}(\mathbb{W}, \sigma^2 I_p), \quad \sigma^2 = \max_i \tilde{G}_i^2. \]

The prior is isotropic Gaussian:

\[ p(\theta) = \mathcal{N}(0, \tau^2 I_p), \quad 0 < \tau \le \min_i \tilde{G}_i. \]

Using Equation (1), for the isotropic posterior with variance \(\sigma^2\):

\[ \mathrm{KL}(q_{\mathrm{iso}} \| p) = \frac{1}{2} \sum_{i=1}^p \left[ \frac{\sigma^2}{\tau^2} + \frac{w_i^2}{\tau^2} - 1 - \ln \frac{\sigma^2}{\tau^2} \right] = \frac{1}{2} \left[ \frac{p \sigma^2}{\tau^2} + \frac{\|\mathbb{W}\|_2^2}{\tau^2} - p - p \ln \frac{\sigma^2}{\tau^2} \right]. \tag{3} \]

Similar to Equation (2),

\[ \mathrm{KL}^{\mathrm{iso}}_i = f(y_i) + \frac{w_i^2}{2\tau^2}. \tag{4} \]

where,

\[ y_i = \frac{\sigma^2}{\tau^2} \quad \forall i. \]

Using a second-order Taylor expansion around \(\mathbb{W}\):

\[ \mathcal{L}(\theta) \approx \mathcal{L}(\mathbb{W}) + (\theta - \mathbb{W})^\top \nabla \mathcal{L}(\mathbb{W}) + \frac{1}{2} (\theta - \mathbb{W})^\top H (\theta - \mathbb{W}), \]

where,

\[ H = \nabla^2_{\mathbb{W}} \mathcal{L}(\mathbb{W}) = U \Lambda U^\top, \quad \Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_p), \quad U^\top U = I_p. \]

Taking expectation under \(q(\theta)\) and using \(\mathbb{E}_{q}[\theta - \mathbb{W}] = 0\):

\[ \mathbb{E}_{\theta \sim q}[\mathcal{L}(\theta)] \approx \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_q). \tag{5} \]

For anisotropic posterior:

\[ \mathbb{E}_{q_{\mathrm{ani}}}[\mathcal{L}(\theta)] - \mathcal{L}(\mathbb{W}) \approx \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{ani}}) = \frac{1}{2} \sum_{i=1}^p H_{ii} \tilde{G}_i^2, \] \[ \sum_{i=1}^p H_{ii} \tilde{G}_i^2 = \sum_{i=1}^p \sum_{r=1}^p \lambda_r u_{ir}^2 \tilde{G}_i^2 = \sum_{r=1}^p \lambda_r \sum_{i=1}^p (u_{ir} \tilde{G}_i)^2 = \sum_{r=1}^p \lambda_r \| U_{:,r} \odot \tilde{G} \|_2^2 \]

For isotropic posterior:

\[ \mathbb{E}_{q_{\mathrm{iso}}}[\mathcal{L}(\theta)] - \mathcal{L}(\mathbb{W}) \approx \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{iso}}) = \frac{1}{2} \sum_{i=1}^p H_{ii} \sigma^2 = \frac{1}{2} \sigma^2 \mathrm{tr}(H), \] \[ \sigma^2 \mathrm{tr}(H) = \sigma^2 \sum_{i=1}^p H_{ii} = \sigma^2 \sum_{r=1}^p \lambda_r. \]

Low-curvature directions (\(\lambda_r\) small) are allowed larger variance \(\tilde{G}_i \in (0,1]\), contributing little to \(\sum_{r=1}^p \lambda_r \| U_{:,r} \odot \tilde{G} \|_2^2\), while high-curvature directions (\(\lambda_r\) large) have small \(\tilde{G}_i\), preventing a large expected loss increase. Isotropic variance \(\sigma^2\) does not adapt to curvature, so high-curvature directions may receive excessive noise. Quantitatively, positive semidefiniteness of \(H\) gives \(H_{ii} \ge 0\) for every \(i\), and since \(\tilde{G}_i^2 \le \sigma^2 = \max_j \tilde{G}_j^2\) coordinatewise, it follows that \(\mathrm{tr}(H \Sigma_{\mathrm{ani}}) \le \mathrm{tr}(H \Sigma_{\mathrm{iso}})\).

Hence, we obtain the inequality

\[ \mathbb{E}_{\theta \sim q_{\mathrm{ani}}}[\mathcal{L}(\theta)] = \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{ani}}) \le \mathcal{L}(\mathbb{W}) + \frac{1}{2} \mathrm{tr}(H \Sigma_{\mathrm{iso}}) = \mathbb{E}_{\theta \sim q_{\mathrm{iso}}}[\mathcal{L}(\theta)]. \tag{6} \]

For the KL term, recall from Equation (2) that coordinate \(i\) contributes \(f(x_i)\) with \(x_i = \tilde{G}_i^2/\tau^2\) and \(f(x) = \frac{1}{2}(x - \ln x - 1)\). This \(f\) is convex, attains its minimum \(f(1)=0\) at \(x=1\), and is increasing on \([1,\infty)\). Since \(\tau \le \min_i \tilde{G}_i\), every ratio satisfies \(x_i \ge 1\), and \(x_i \le \sigma^2/\tau^2\) because \(\sigma^2 = \max_j \tilde{G}_j^2\). Monotonicity of \(f\) on \([1,\infty)\) therefore gives \(f(x_i) \le f(\sigma^2/\tau^2)\) for each \(i\), and summing over coordinates,

\[ \sum_{i=1}^p f\Big(\frac{\tilde{G}_i^2}{\tau^2}\Big) \le p f\Big(\frac{\sigma^2}{\tau^2}\Big). \tag{7} \]
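A quick numerical check of inequality (7) under these conditions (\(\tau \le \min_i \tilde{G}_i\), \(\sigma^2 = \max_i \tilde{G}_i^2\); the sampled modulators are illustrative):

```python
import numpy as np

f = lambda x: 0.5 * (x - np.log(x) - 1.0)   # per-coordinate KL shape, Eq. (2)

rng = np.random.default_rng(4)
G_tilde = rng.uniform(0.1, 1.0, 10)
tau = G_tilde.min()                         # prior scale tau <= min_i G_tilde_i
sigma2 = G_tilde.max()**2                   # dominating isotropic variance

lhs = np.sum(f(G_tilde**2 / tau**2))
rhs = len(G_tilde) * f(sigma2 / tau**2)
print(lhs <= rhs)                           # True: Eq. (7) holds coordinatewise
```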

The mean term \(\sum_i w_i^2/(2\tau^2)\) is identical for both posteriors, since both are centered at the same weights \(\mathbb{W}\). Therefore, from Equations (2) and (4):

\[ \mathrm{KL}(q_{\mathrm{ani}} \| p) \le \mathrm{KL}(q_{\mathrm{iso}} \| p). \tag{8} \]

For a loss bounded in \([0,1]\), dataset size \(N\), and confidence level \(\delta \in (0,1)\), with probability at least \(1-\delta\) over the draw of the training set, the PAC-Bayes bound states:

\[ \mathbb{E}_{\theta \sim q}[\mathcal{L}_{\mathrm{test}}(\theta)] \le \mathbb{E}_{\theta \sim q}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q\|p) + \ln\frac{N}{\delta}}{2(N-1)}}. \tag{9} \]

Combining (6) and (8):

\[ \mathbb{E}_{q_{\mathrm{ani}}}[\mathcal{L}_{\mathrm{test}}] \le \mathbb{E}_{q_{\mathrm{ani}}}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q_{\mathrm{ani}}\|p)+\ln(N/\delta)}{2(N-1)}} \le \mathbb{E}_{q_{\mathrm{iso}}}[\widehat{\mathcal{L}}(\theta)] + \sqrt{\frac{\mathrm{KL}(q_{\mathrm{iso}}\|p)+\ln(N/\delta)}{2(N-1)}}, \] \[ \implies \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{ani}}(\theta), p(\theta)) \le \mathcal{L}_{\mathrm{PAC}}(q_{\mathrm{iso}}(\theta), p(\theta)). \]

Thus, the anisotropic posterior induced by GNI achieves a tighter PAC-Bayes bound than an isotropic posterior whose per-coordinate variance dominates the GNI noise scales.
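To illustrate Equation (9) end to end, the sketch below compares the complexity terms of the two posteriors; the sizes \(N\), \(\delta\), and the sampled weights and modulators are hypothetical.

```python
import numpy as np

def pac_complexity(kl, N=50_000, delta=0.05):
    """Complexity term of Equation (9): sqrt((KL + ln(N/delta)) / (2(N-1)))."""
    return np.sqrt((kl + np.log(N / delta)) / (2 * (N - 1)))

rng = np.random.default_rng(6)
p = 1000
w = 0.05 * rng.standard_normal(p)
G_tilde = rng.uniform(0.2, 1.0, p)
tau = G_tilde.min()                       # prior scale tau <= min_i G_tilde_i
sigma2 = G_tilde.max()**2                 # dominating isotropic variance

def kl_diag(var_q):
    """Equation (1) with per-coordinate posterior variances var_q."""
    x = var_q / tau**2
    return 0.5 * np.sum(x + w**2 / tau**2 - 1.0 - np.log(x))

kl_ani = kl_diag(G_tilde**2)
kl_iso = kl_diag(np.full(p, sigma2))
print(pac_complexity(kl_ani) <= pac_complexity(kl_iso))   # True, matching Eq. (8)
```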

Theoretical Analysis of Prototype-Guided Pseudo-Label Refinement (PRL)

7. Setup

Consider a student-teacher semi-supervised segmentation framework:

  1. a student network \(f_\theta\) trained on a labeled set \(\mathcal{D}_l\) and an unlabeled set \(\mathcal{D}_u\), typically with \(|\mathcal{D}_u| \gg |\mathcal{D}_l|\);
  2. a teacher network \(T\), maintained as an exponential moving average (EMA) of the student, which generates pseudo-labels on \(\mathcal{D}_u\) as additional supervision for the student.

Let \(\epsilon_0\) denote the base error of the student on labeled data. For unlabeled data, let \(\epsilon_s(t)\) and \(\epsilon_t(t)\) denote the fractions of incorrect pseudo-labels for the student and teacher, respectively, at iteration \(t\).

8. Pseudo-Label Error Dynamics without Pseudo-Label Refinement

We begin by formulating the coupled dynamics between the student and teacher networks in the absence of pseudo-label refinement (PRL). Let \(\epsilon_s(t)\) and \(\epsilon_t(t)\) denote the expected pseudo-label errors (i.e., the fractions of incorrect pseudo-labels) produced by the student and teacher, respectively, at iteration \(t\). The student updates its parameters using a mixture of ground-truth labeled data and teacher-generated pseudo-labels:

\[ \epsilon_s(t+1) = (1-\gamma)\,\epsilon_0 + \gamma\,\epsilon_t(t), \tag{10} \]

where \(\epsilon_0\) represents the intrinsic baseline error arising from labeled supervision, and \(\gamma \in [0,1]\) controls the relative contribution of pseudo-labeled data in training. Intuitively, \(\gamma=0\) corresponds to purely supervised learning based solely on labeled data, whereas \(\gamma \to 1\) denotes a regime where training is dominated by pseudo-labels generated by the teacher.

The teacher, on the other hand, is an exponentially moving average (EMA) of the student:

\[ \epsilon_t(t+1) = \alpha \epsilon_t(t) + (1-\alpha) \epsilon_s(t+1), \tag{11} \]

where \(\alpha \in [0,1)\) is the EMA decay constant. This coupling forms a delayed feedback loop: the teacher smooths over the temporal trajectory of student errors.

Substituting Equation (10) into Equation (11) gives:

\[ \epsilon_t(t+1) = \alpha \epsilon_t(t) + (1-\alpha)\big[(1-\gamma)\epsilon_0 + \gamma \epsilon_t(t)\big] = [\alpha + (1-\alpha)\gamma] \epsilon_t(t) + (1-\alpha)(1-\gamma)\epsilon_0. \]

This establishes a non-homogeneous linear recurrence relation describing how teacher error evolves in conjunction with student learning dynamics.

Let us define:

\[ \lambda = \alpha + (1-\alpha)\gamma, \]

The update then simplifies to:

\[ \epsilon_t(t+1) = \lambda \epsilon_t(t) + (1-\alpha)(1-\gamma)\epsilon_0 \]

The general solution is:

\[ \epsilon_t(t) = \lambda^t \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \sum_{k=0}^{t-1}\lambda^k, \]

where \(\epsilon_t(0)\) denotes the initial teacher pseudo-label error at iteration \(t=0\).

The term \(\lambda^t \epsilon_t(0)\) captures the contribution of past errors, exponentially weighted over time, while the summation term captures the accumulation of newly introduced errors from labeled supervision.
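The closed form can be cross-checked against direct iteration of Equations (10)-(11); the parameter values below are illustrative.

```python
def teacher_error(t, eps_t0, eps0, alpha, gamma):
    """Closed form: eps_t(t) = lam^t * eps_t(0) + (1-a)(1-g) * eps0 * sum_k lam^k."""
    lam = alpha + (1 - alpha) * gamma
    geom = t if lam == 1 else (1 - lam**t) / (1 - lam)
    return lam**t * eps_t0 + (1 - alpha) * (1 - gamma) * eps0 * geom

# Direct iteration of the coupled student-teacher recurrence.
alpha, gamma, eps0, eps_t = 0.99, 0.5, 0.1, 0.5
for _ in range(200):
    eps_s = (1 - gamma) * eps0 + gamma * eps_t     # Eq. (10): student update
    eps_t = alpha * eps_t + (1 - alpha) * eps_s    # Eq. (11): teacher EMA
print(eps_t, teacher_error(200, 0.5, eps0, alpha, gamma))  # agree to precision
```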

We analyze three characteristic regimes of \(\lambda\):

(a) \(\lambda < 1\):
The geometric series converges: \[ \sum_{k=0}^{t-1}\lambda^k = \frac{1-\lambda^t}{1-\lambda} \xrightarrow[t\to\infty]{} \frac{1}{1-\lambda}. \] Therefore, the asymptotic value of the teacher error is: \[ \epsilon_\infty = \frac{(1-\alpha)(1-\gamma)\epsilon_0}{1-\lambda}. \tag{12} \] Substituting \(\lambda = \alpha + (1-\alpha)\gamma\), we find: \[ \epsilon_\infty = \frac{(1-\alpha)(1-\gamma)\epsilon_0}{(1-\alpha)(1-\gamma)} = \epsilon_0. \] The teacher asymptotically retains the same error as if trained purely on ground-truth data.

(b) \(\lambda = 1\):
For the generic recurrence, the summation grows linearly with time: \[ \sum_{k=0}^{t-1}\lambda^k = t. \] The error therefore exhibits linear drift: \[ \epsilon_t(t) = \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \cdot t. \] Errors are neither damped nor amplified exponentially. Note that under the parameterization \(\lambda = \alpha + (1-\alpha)\gamma\) with \(\alpha < 1\), \(\lambda = 1\) forces \(\gamma = 1\), so the drift coefficient \((1-\alpha)(1-\gamma)\epsilon_0\) vanishes and the error simply remains frozen at its initial value, never correcting.

(c) \(\lambda > 1\):
The geometric series diverges exponentially: \[ \epsilon_t(t) = \lambda^t \epsilon_t(0) + (1-\alpha)(1-\gamma)\epsilon_0 \frac{\lambda^t - 1}{\lambda - 1}. \] Here, \(\epsilon_t\) grows exponentially, producing confirmation bias: erroneous pseudo-labels reinforce the teacher, which in turn produces even more erroneous labels. While the linear mixture in Equation (10) keeps \(\lambda \le 1\) for \(\gamma \le 1\), this regime models amplification effects beyond that idealization, typical when reliance on unlabeled data is very strong (\(\gamma\) pushed toward, or effectively beyond, 1).

For bounded error propagation, we require \(\lambda < 1\), which simplifies to \(\gamma < 1\). However, in real semi-supervised settings \(|\mathcal{D}_u| \gg |\mathcal{D}_l|\), driving the effective \(\gamma\) toward 1; \(\lambda\) then approaches 1, the convergence time constant \(1/(1-\lambda)\) blows up, and pseudo-label drift and error amplification dominate any practical training horizon. This theoretical observation directly justifies pseudo-label filtering and refinement strategies that keep \(\lambda\) well below 1 in practice.
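A short simulation of the recurrence illustrates regimes (a) and (c); the parameter values, and in particular the hand-set \(\lambda > 1\), are illustrative only.

```python
def simulate(lam, c, eps_t0, steps):
    """Iterate eps_t(t+1) = lam * eps_t(t) + c and return the final error."""
    eps = eps_t0
    for _ in range(steps):
        eps = lam * eps + c
    return eps

eps0, alpha = 0.1, 0.99

# Regime (a): gamma < 1 gives lam < 1 and convergence back to eps0.
gamma = 0.5
lam = alpha + (1 - alpha) * gamma                   # 0.995 < 1
c = (1 - alpha) * (1 - gamma) * eps0
print(simulate(lam, c, eps_t0=0.5, steps=5000))     # ~0.1 = eps0

# Regime (c), illustrative only: lam > 1 amplifies errors exponentially
# (confirmation bias); lam and c are set by hand here.
print(simulate(1.002, 0.0005, eps_t0=0.5, steps=5000))  # blows up
```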

9. Prototype-Guided Pseudo-Label Refinement

Let \(z = f_\theta(x) \in \mathbb{R}^d\) denote the feature embedding of an unlabeled sample \(x\), where \(f_\theta\) is the student model with parameters \(\theta\). Let \(\mu_c \in \mathbb{R}^d\) be the prototype (mean feature) for class \(c\), computed from the labeled set \(\mathcal{D}_l^{(c)}\) of \(N_c\) examples of class \(c\):

\[ \mu_c = \frac{1}{N_c} \sum_{x_i \in \mathcal{D}_l^{(c)}} f_\theta(x_i), \]

A pseudo-label \(\hat{y}\) for input \(x\) is accepted only if both the teacher confidence and the prototype similarity exceed their thresholds:

\[ p_T(\hat{y}\mid x) \ge \tau_{\mathrm{conf}}, \qquad s(x,\hat{y}) = \frac{\langle f_\theta(x), \mu_{\hat{y}} \rangle}{\|f_\theta(x)\|\|\mu_{\hat{y}}\|} \ge \tau_{\mathrm{sim}}, \]

where \(p_T(\hat{y}\mid x)\) is the predicted probability of class \(\hat{y}\) under the teacher model \(T\), \(\mu_{\hat{y}} \in \mathbb{R}^d\) is the prototype (mean feature) for class \(\hat{y}\), and \(\tau_{\mathrm{conf}}\) and \(\tau_{\mathrm{sim}}\) are empirically determined thresholds for confidence and similarity, respectively.
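A minimal sketch of the acceptance rule on toy 2-D features follows; the helper names, thresholds, and synthetic data are illustrative, not the method's implementation.

```python
import numpy as np

def class_prototypes(feats, labels, num_classes):
    """mu_c: mean feature of labeled samples in class c (Section 9)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

def accept_pseudo_labels(feats_u, probs_u, prototypes, tau_conf=0.9, tau_sim=0.7):
    """Accept a pseudo-label when teacher confidence AND prototype cosine
    similarity both exceed their thresholds."""
    y_hat = probs_u.argmax(axis=1)                 # teacher pseudo-labels
    conf = probs_u.max(axis=1)                     # p_T(y_hat | x)
    mu = prototypes[y_hat]                         # prototype of predicted class
    sim = np.sum(feats_u * mu, axis=1) / (
        np.linalg.norm(feats_u, axis=1) * np.linalg.norm(mu, axis=1))
    keep = (conf >= tau_conf) & (sim >= tau_sim)
    return y_hat, keep

# Toy usage: two classes in 2-D feature space.
rng = np.random.default_rng(5)
feats_l = np.vstack([rng.normal([2, 0], 0.1, (20, 2)),
                     rng.normal([0, 2], 0.1, (20, 2))])
labels_l = np.array([0] * 20 + [1] * 20)
protos = class_prototypes(feats_l, labels_l, 2)
feats_u = rng.normal([2, 0], 0.5, (10, 2))
probs_u = np.tile([0.95, 0.05], (10, 1))
y_hat, keep = accept_pseudo_labels(feats_u, probs_u, protos)
print(y_hat, keep.mean())   # keep.mean() estimates the coverage f
```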

This induces two measurable statistics over \(\mathcal{D}_u\):

\[ f = \Pr\big[\text{pseudo-label accepted}\big], \qquad \rho = \Pr\big[\hat{y} = y \mid \text{accepted}\big], \]

i.e., \(f\) is the coverage (the fraction of pseudo-labels passing both criteria) and \(\rho\) is the precision (the fraction of accepted pseudo-labels that are correct).

Note: Since the true labels \(y\) are not available for unlabeled data, \(\rho\) serves as a conceptual measure of the expected correctness of accepted pseudo-labels.

Accounting for filtering, the effective contribution of pseudo-labeled samples is:

\[ \gamma_{\mathrm{eff}} = f \gamma, \]

where \(f\) is the fraction of pseudo-labels that pass the acceptance criteria.

9.1 Pseudo-Label Error Dynamics under Prototype Refinement

The student update, accounting for filtering with precision \(\rho\), can be expressed as:

\[ \epsilon_s(t+1) = (1-f\gamma)\epsilon_0 + f\gamma(1-\rho)\epsilon_t(t). \]

Here \((1-\rho)\) captures the proportion of incorrect pseudo-labels that survive prototype filtering. When \(\rho=1\), every accepted pseudo-label is correct and pseudo-labels contribute no error, so \(\epsilon_s(t+1) = (1-f\gamma)\epsilon_0\); when \(\rho=0\), filtering confers no benefit and the dynamics reduce to the unrefined recurrence of Section 8 (with \(\gamma\) replaced by \(f\gamma\)).

The teacher parameters are updated through an exponential moving average:

\[ \epsilon_t(t+1) = \alpha\epsilon_t(t) + (1-\alpha)\epsilon_s(t+1) = \big[\alpha + (1-\alpha)f\gamma(1-\rho)\big]\epsilon_t(t) + (1-\alpha)(1-f\gamma)\epsilon_0, \]

where \(\alpha\) is the decay rate of the EMA. Defining the effective recurrence coefficient:

\[ \lambda_{\mathrm{eff}} = \alpha + (1-\alpha)f\gamma(1-\rho), \]

For bounded error propagation (stability), we require \(\lambda_{\mathrm{eff}} < 1\), which yields:

\[ f\gamma(1-\rho) < 1, \] \[ \Rightarrow \quad \rho > 1 - \frac{1}{f\gamma}. \tag{13} \]

Equation (13) highlights that when pseudo-labels are highly reliable (high precision \(\rho\)), the student can safely use a larger fraction of pseudo-labeled data (\(f\gamma\)) without destabilizing training. Conversely, if pseudo-labels are less reliable (low \(\rho\)), only a smaller fraction of pseudo-labeled data should be used to prevent error amplification.
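The effect of precision on the recurrence coefficient is easy to tabulate; the settings below are hypothetical.

```python
# Effective recurrence coefficient lam_eff = alpha + (1 - alpha) * f*gamma * (1 - rho)
# for a few illustrative (hypothetical) settings.
alpha = 0.99
for f_gamma, rho in [(0.9, 0.0), (0.9, 0.5), (0.9, 0.9)]:
    lam_eff = alpha + (1 - alpha) * f_gamma * (1 - rho)
    print(f"f*gamma={f_gamma}, rho={rho}: lam_eff={lam_eff:.5f}")
# Higher precision rho shrinks lam_eff (faster convergence) and enlarges the
# stable budget region f*gamma < 1/(1 - rho) of Equation (13).
```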

Provided that \(\lambda_{\mathrm{eff}} < 1\), Equation (12) shows that the teacher error asymptotically converges to:

\[ \epsilon_\infty = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{1-\lambda_{\mathrm{eff}}} = \frac{(1-f\gamma)\epsilon_0}{1-f\gamma(1-\rho)}. \tag{14} \]

Equation (14) quantifies how PRL influences long-term model performance.
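Evaluating Equation (14) for a few illustrative values makes the monotonicity in \(\rho\) concrete (all numbers hypothetical).

```python
# Asymptotic teacher error under PRL (Equation (14)) for illustrative values.
eps0, f_gamma = 0.1, 0.9
for rho in (0.0, 0.5, 0.9, 1.0):
    eps_inf = (1 - f_gamma) * eps0 / (1 - f_gamma * (1 - rho))
    print(f"rho={rho:.1f}: eps_inf={eps_inf:.4f}")
# rho=0.0 recovers eps_inf = eps0 = 0.1 (no benefit), while rho -> 1 drives
# eps_inf down to (1 - f*gamma) * eps0 = 0.01, matching Theorem 2.
```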

Theorem 2 (Prototype-Guided Pseudo-Label Refinement)
Let \(\epsilon_0>0\) denote the baseline error on labeled data (the error attained under purely supervised training). In a student-teacher semi-supervised framework with pseudo-label weight \(\gamma \in (0,1)\), EMA decay \(\alpha \in [0,1)\), and prototype-guided refinement with coverage \(f \in (0,1]\) and precision \(\rho \in [0,1]\), the asymptotic teacher error with PRL is \[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-f\gamma)\,\epsilon_0}{1 - f\gamma(1-\rho)}. \] Under the stability condition \(\lambda_{\mathrm{eff}} := \alpha + (1-\alpha) f \gamma (1-\rho) < 1\), the following hold:
  1. (Reduced asymptotic error) \(\epsilon_\infty^{\mathrm{PRL}} \le \epsilon_0\), with strict inequality if \(f\gamma>0\) and \(\rho>0\).
  2. (Monotonicity in precision) \(\frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} < 0\), so higher precision strictly reduces asymptotic error.
  3. (Safe pseudo-label budget) Stable error propagation is maintained for all \(f\gamma < \frac{1}{1-\rho}\), allowing larger pseudo-label usage at higher precision.
Proof:

From the analysis in Section 9.1, the teacher error evolves as a first-order linear recurrence:

\[ \epsilon_t(t+1) = \lambda_{\mathrm{eff}} \epsilon_t(t) + (1-\alpha)(1-f\gamma)\epsilon_0, \]

where,

\[ \lambda_{\mathrm{eff}} = \alpha + (1-\alpha) f \gamma (1-\rho). \]

For \(|\lambda_{\mathrm{eff}}|<1\), solving the recurrence yields the asymptotic error:

\[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{1-\lambda_{\mathrm{eff}}}. \]

Substituting \(\lambda_{\mathrm{eff}}\) and simplifying the denominator:

\[ 1 - \lambda_{\mathrm{eff}} = 1 - \alpha - (1-\alpha)f\gamma(1-\rho) = (1-\alpha)[1 - f\gamma(1-\rho)], \]

we obtain:

\[ \epsilon_\infty^{\mathrm{PRL}} = \frac{(1-\alpha)(1-f\gamma)\epsilon_0}{(1-\alpha)[1 - f\gamma(1-\rho)]} = \frac{(1-f\gamma)\epsilon_0}{1 - f\gamma(1-\rho)}. \]

Part 1: Reduced Asymptotic Error

Define the error ratio:

\[ R := \frac{\epsilon_\infty^{\mathrm{PRL}}}{\epsilon_0} = \frac{1-f\gamma}{1 - f\gamma(1-\rho)}. \]

Rewriting the denominator:

\[ 1 - f\gamma(1-\rho) = 1 - f\gamma + f\gamma\rho. \]

Case 1: If \(f\gamma = 0\) (no pseudo-labels used) or \(\rho = 0\) (all accepted pseudo-labels incorrect):

\[ 1 - f\gamma(1-\rho) = 1 - f\gamma \implies R = 1 \implies \epsilon_\infty^{\mathrm{PRL}} = \epsilon_0. \]

Case 2: If \(f\gamma > 0\) and \(\rho > 0\):

\[ f\gamma\rho > 0 \implies 1 - f\gamma + f\gamma\rho > 1 - f\gamma. \]

Under the stability condition, both numerator and denominator are positive, so:

\[ R = \frac{1-f\gamma}{1-f\gamma+f\gamma\rho} < 1 \implies \epsilon_\infty^{\mathrm{PRL}} < \epsilon_0 = \epsilon_\infty^{\mathrm{noPRL}}. \]

This establishes Part (1). \(\square\)

Part 2: Monotonicity in Precision

Taking the derivative of \(\epsilon_\infty^{\mathrm{PRL}}\) with respect to \(\rho\):

\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} = \frac{\partial}{\partial \rho}\left[\frac{(1-f\gamma)\epsilon_0}{1-f\gamma(1-\rho)}\right]. \]

Let \(u = 1 - f\gamma(1-\rho)\), so \(\frac{\partial u}{\partial \rho} = f\gamma\). Applying the chain rule:

\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} = (1-f\gamma)\epsilon_0 \cdot \frac{\partial}{\partial \rho}\left[\frac{1}{u}\right] = (1-f\gamma)\epsilon_0 \cdot \left(-\frac{1}{u^2}\right) \cdot f\gamma = -\frac{(1-f\gamma)f\gamma\epsilon_0}{[1-f\gamma(1-\rho)]^2}. \]

Since all factors are positive under stability (\(f\gamma \in (0,1)\), \(\epsilon_0 > 0\), and \(1-f\gamma(1-\rho) > 0\) from \(\lambda_{\mathrm{eff}} < 1\)), we have:

\[ \frac{\partial \epsilon_\infty^{\mathrm{PRL}}}{\partial \rho} < 0. \]

Higher precision strictly reduces asymptotic error, establishing Part (2). \(\square\)

Part 3: Safe Pseudo-Label Budget

The stability condition requires \(\lambda_{\mathrm{eff}} < 1\):

\[ \alpha + (1-\alpha)f\gamma(1-\rho) < 1 \] \[ (1-\alpha)f\gamma(1-\rho) < 1 - \alpha \] \[ f\gamma(1-\rho) < 1 \] \[ f\gamma < \frac{1}{1-\rho}. \]

This bound quantifies the maximum effective pseudo-label usage that maintains bounded error propagation. Higher precision \(\rho\) allows proportionally larger pseudo-label budgets \(f\gamma\) while maintaining identical stability guarantees. This establishes Part (3). \(\square\)

Prototype-guided pseudo-label refinement guarantees that the asymptotic teacher error satisfies \(\epsilon_\infty^{\mathrm{PRL}} \le \epsilon_0\), ensuring that filtered pseudo-labels do not degrade performance. The asymptotic error decreases monotonically with pseudo-label precision, and the effective pseudo-label budget \(f\gamma\) can be safely increased under the stability constraint: \[ f\gamma < \frac{1}{1-\rho}. \] This provides a quantitative criterion for safe pseudo-label utilization, showing that investing in pseudo-label quality via prototype-based filtering simultaneously reduces long-term error and enlarges the permissible pseudo-label budget, thereby expanding robust training regimes for semi-supervised learning.