EigenScore: OOD Detection using Posterior Covariance in Diffusion Models

The first OOD detection framework to leverage the denoising posterior covariance in diffusion models


Shirin Shoushtari, Yi Wang, Xiao Shi, M. Salman Asif, Ulugbek S. Kamilov

WashU & UC Riverside

Paper (arXiv) · Code

Histogram

Figure 1. We compare negative log-likelihood (NLL), score norm \(\sqrt{\sum_t \|\epsilon_\theta(\mathbf{x}_t, t)\|_2^2}\), score derivative norm \(\sqrt{\sum_t \|\partial_t \epsilon_\theta(\mathbf{x}_t, t)\|_2^2}\), and the eigenvalue sum (ours) \(\sum_{t,k}\lambda_k^t(\mathbf{x}_t)\) as OOD detection statistics. Top row: in the near-OOD task C10 (InD) vs. C100 (OOD), NLL and the score-based metrics fail to separate the two distributions, showing substantial overlap. Bottom row: for C10 (InD) vs. SVHN (OOD), the ordering of these metrics inverts: the score and derivative norms assign lower values to OOD than to InD samples, making fixed thresholds unreliable. In both settings, our eigenvalue-based metric achieves clear separation and consistently assigns higher scores to OOD samples.


Abstract


Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection. We propose EigenScore, a new OOD detection method that leverages the eigenvalue spectrum of the posterior covariance induced by a diffusion model. We argue that the posterior covariance provides a consistent signal of distribution shift: OOD inputs induce a larger trace and larger leading eigenvalues, yielding a clear spectral signature. We further provide analysis explicitly linking posterior covariance to distribution mismatch, establishing it as a reliable signal for OOD detection. To ensure tractability, we adopt a Jacobian-free subspace iteration method that estimates the leading eigenvalues using only forward evaluations of the denoiser. Empirically, EigenScore achieves state-of-the-art performance, with up to 5% AUROC improvement over the best baseline. Notably, it remains robust in near-OOD settings such as CIFAR-10 vs. CIFAR-100, where existing diffusion-based methods often fail.
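The Jacobian-free estimation mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a denoiser callable `f` at a fixed timestep (here a stand-in linear map with a known spectrum) and approximates Jacobian-vector products by finite differences, so only forward evaluations of the denoiser are needed.

```python
import numpy as np

def denoiser(x):
    # Hypothetical stand-in for a diffusion denoiser at a fixed timestep.
    # A linear map is used so the true eigenvalues are known for checking.
    A = np.diag(np.linspace(1.0, 0.1, x.size))
    return A @ x

def top_eigs_subspace(f, x, k=3, iters=30, eps=1e-3, seed=0):
    """Estimate the k leading eigenvalues of the Jacobian of f at x via
    subspace iteration, using the finite-difference JVP
    J v ~ (f(x + eps*v) - f(x)) / eps (no explicit Jacobian needed)."""
    rng = np.random.default_rng(seed)
    d = x.size
    V = np.linalg.qr(rng.standard_normal((d, k)))[0]  # random orthonormal basis
    fx = f(x)
    jvp = lambda v: (f(x + eps * v) - fx) / eps
    for _ in range(iters):
        # Apply the Jacobian to each basis vector, then re-orthonormalize.
        JV = np.stack([jvp(V[:, j]) for j in range(k)], axis=1)
        V, _ = np.linalg.qr(JV)
    # Rayleigh quotients on the converged subspace give the eigenvalue estimates.
    JV = np.stack([jvp(V[:, j]) for j in range(k)], axis=1)
    lam = np.diag(V.T @ JV)
    return np.sort(lam)[::-1]

x = np.ones(8)
print(top_eigs_subspace(denoiser, x, k=3))  # approx. the 3 largest eigenvalues
```

For the linear stand-in the finite difference is exact, so the estimates converge to the top of the spectrum; for a real denoiser, `eps` trades off truncation against numerical error.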

InD CelebA vs. OOD CIFAR-100

Banner

Figure 2. Denoised outputs (left), corresponding uncertainty maps (first principal component) (middle), and violin plots of the three largest eigenvalues for the CelebA dataset (right). Top: clean CelebA image and its noisy variants for varying t. Middle: InD model (trained on CelebA) applied to CelebA inputs. Bottom: OOD model (trained on C100) applied to the same inputs. InD models yield sharp reconstructions and localized uncertainty with smaller leading eigenvalues, whereas OOD models produce blurrier outputs, diffuse uncertainty, and inflated eigenvalues, highlighting the eigenvalue spectrum as an indicator of distribution shift.

OOD Detection Algorithm with EigenScore


Banner

Algorithm 1. Given the EigenScore feature matrix M(x), we first estimate the mean and standard deviation of each feature column on the training data. The validation set is then used to select the optimal timesteps and aggregation method. At test time, EigenScore features are extracted according to the chosen configuration and standardized using the parameters obtained from the training phase. The final detection score for each sample is the sum of its standardized feature values across all columns.
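The standardize-and-sum step of Algorithm 1 can be sketched as below. This is a minimal illustration under stated assumptions: rows of M(x) are samples and columns are eigenvalue features; the validation-based selection of timesteps and aggregation method is omitted, and the feature matrices are synthetic stand-ins.

```python
import numpy as np

def fit_standardizer(M_train):
    """Training phase: per-column mean and std of the EigenScore
    feature matrix, estimated on InD training data."""
    mu = M_train.mean(axis=0)
    sigma = M_train.std(axis=0) + 1e-8  # guard against zero variance
    return mu, sigma

def detection_score(M_test, mu, sigma):
    """Testing phase: standardize each feature column with the training
    statistics, then sum the z-scores across columns. Higher scores
    indicate OOD, since eigenvalues inflate under distribution shift."""
    Z = (M_test - mu) / sigma
    return Z.sum(axis=1)

# Hypothetical feature matrices: OOD eigenvalue features are inflated.
rng = np.random.default_rng(0)
M_train = rng.normal(1.0, 0.2, size=(500, 4))
M_ood = rng.normal(2.0, 0.2, size=(100, 4))

mu, sigma = fit_standardizer(M_train)
print(detection_score(M_ood, mu, sigma).mean())  # clearly positive for OOD
```

Thresholding this summed z-score (calibrated on held-out InD data) then yields the final OOD decision.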

Main OOD Detection Results

Banner

Table 1. Main OOD detection results (AUROC). Comparison of EigenScore with likelihood-based, reconstruction-based, and diffusion-based baselines across multiple InD–OOD dataset pairings (CelebA, C10, C100, SVHN). The best and second-best methods are highlighted. EigenScore achieves the best average performance and is either best or second best in most settings.

Near-OOD Detection Results

Banner

Table 2. Near-OOD detection results (AUROC). We evaluate on semantically related datasets, including C10 vs. C100 and C10 vs. TinyImageNet, which are particularly challenging due to the low-level statistics shared between InD and OOD samples. The best and second-best methods are highlighted. EigenScore achieves the best average performance across both tasks, with a clear margin over prior diffusion-based approaches.

Bibtex


@article{shoushtari2025eigenscoreooddetectionusing,
  title={EigenScore: OOD Detection using Covariance in Diffusion Models},
  author={Shirin Shoushtari and Yi Wang and Xiao Shi and M. Salman Asif and Ulugbek S. Kamilov},
  year={2025},
  eprint={2510.07206},
  archivePrefix={arXiv},
}