Segment Anything with Robust Uncertainty-Accuracy Correlation

Zhou, Hongyou; Toussaint, Marc; Shao, Ling; Ye, Zihan

Hongyou Zhou¹, Marc Toussaint¹, Ling Shao², Zihan Ye²,✉

¹Learning and Intelligent Systems, TU Berlin
²UCAS-Terminus AI Lab, University of Chinese Academy of Sciences

ICML 2026

RUAC overview: SAM2 vs RUAC mask quality and uncertainty across out-of-domain test images.

Vanilla SAM2 returns one confidence score per mask, and that score becomes unreliable under domain shift. RUAC adds a Bayesian decoder and adversarial style and deformation training so that pixel-wise uncertainty stays aligned with segmentation errors.

Abstract

Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC): a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation.

Method

RUAC framework: Bayesian mask decoder, style and deformation adversarial networks, GRL-based min-max training.

RUAC formulates training as a min-max game between two attackers and the segmentation model. The Style Adversarial Network builds an object graph from ground-truth masks and visual features, then refines per-object style statistics via GCN layers to generate semantically coherent stylized images. The Deformation Network predicts a dense offset field from SAM2 features to produce geometric perturbations. Both attackers train via Gradient Reversal Layers, enabling end-to-end optimization without a PGD-style inner loop. The Bayesian Mask Decoder uses dual-granularity Weibull distributions over image tokens (local, boundary-aware) and mask tokens (global, semantic) to model pixel-wise uncertainty, optimizing for uncertainty-accuracy alignment under these bio-inspired perturbations. Training also includes a clean branch (not shown) that maintains in-domain performance.
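The role of the Gradient Reversal Layer in this min-max game can be illustrated with a toy sketch. It is not the RUAC implementation: the attacker here is a single scalar shift `delta` rather than a style or deformation network, the model is a single scalar weight `w`, and the names `grl_forward`, `grl_backward`, `train_step`, `lam`, and `budget` are hypothetical. The point it shows is that one backward pass suffices: the layer is the identity in the forward direction and flips the gradient sign in the backward direction, so the same descent update minimizes the loss for the model and maximizes it for the attacker.

```python
def grl_forward(value):
    """Gradient reversal, forward direction: identity."""
    return value


def grl_backward(grad, lam=1.0):
    """Gradient reversal, backward direction: negate and scale the gradient."""
    return -lam * grad


def train_step(w, delta, x, y, lr=0.05, lam=1.0, budget=0.5):
    """One joint update of a toy model (w) and a toy attacker (delta).

    Loss: (w * (x + delta) - y) ** 2, with gradients derived by hand.
    All parameters are updated with plain gradient descent; the reversal
    layer turns the attacker's descent step into an ascent step.
    """
    # Attacker perturbs the input; the perturbed input passes through the GRL.
    x_adv = grl_forward(x + delta)
    residual = w * x_adv - y
    loss = residual ** 2

    # Gradient for the model weight (ordinary descent direction).
    grad_w = 2.0 * residual * x_adv
    # Gradient arriving at the GRL output, then reversed for the attacker.
    grad_x_adv = 2.0 * residual * w
    grad_delta = grl_backward(grad_x_adv, lam)

    w_new = w - lr * grad_w
    delta_new = delta - lr * grad_delta
    # Assumption: keep the perturbation inside a fixed budget.
    delta_new = max(-budget, min(budget, delta_new))
    return w_new, delta_new, loss


if __name__ == "__main__":
    w, delta = 0.5, 0.0
    for step in range(5):
        w, delta, loss = train_step(w, delta, x=1.0, y=2.0)
        print(f"step={step} loss={loss:.4f} w={w:.4f} delta={delta:.4f}")
```

In an autodiff framework the same behavior would normally be expressed as a custom backward rule instead of hand-derived gradients; the clamp on `delta` is an added assumption to keep the toy attacker bounded.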

Results

Segmentation quality on out-of-domain inputs

Qualitative segmentation comparison on Cityscapes, IBD aerial buildings, mixed industrial objects, and hand-object interaction.

From top: scene (Cityscapes), scientific (IBD aerial imagery), object (mixed industrial), and egocentric (hand-object). RUAC produces more complete masks on fine boundaries, densely packed structures, and partially occluded objects.

Confidence and uncertainty maps

SAM2 produces confidence only (no uncertainty estimate). Bayes-SAM2 produces uncertainty, but it collapses under domain shift. RUAC's uncertainty concentrates along ambiguous boundaries while leaving high-confidence interiors clean, indicating better calibration.
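One simple way to quantify whether an uncertainty map "highlights erroneous pixels" is to correlate per-pixel uncertainty with a per-pixel error indicator. The sketch below assumes Pearson correlation as the measure, which is an illustrative choice and not necessarily the metric used in the paper; the function names and the toy masks are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def uncertainty_error_correlation(pred_mask, gt_mask, uncertainty):
    """Correlate per-pixel uncertainty with per-pixel misclassification.

    pred_mask, gt_mask: 2D lists of 0/1 labels.
    uncertainty: 2D list of non-negative floats with the same shape.
    """
    errors = [
        float(p != g)
        for pred_row, gt_row in zip(pred_mask, gt_mask)
        for p, g in zip(pred_row, gt_row)
    ]
    flat_uncertainty = [u for row in uncertainty for u in row]
    return pearson(flat_uncertainty, errors)


if __name__ == "__main__":
    pred = [[1, 1, 0], [1, 0, 0]]
    gt = [[1, 1, 1], [1, 0, 0]]  # one boundary pixel is wrong
    aligned = [[0.1, 0.1, 0.9], [0.1, 0.1, 0.1]]   # high only at the error
    inverted = [[0.9, 0.9, 0.1], [0.9, 0.9, 0.9]]  # low exactly at the error
    print(uncertainty_error_correlation(pred, gt, aligned))
    print(uncertainty_error_correlation(pred, gt, inverted))
```

Under this measure, a map that is high only where the prediction is wrong scores close to 1, while a map that is confident exactly at the error scores negative. A single mask-level score cannot be evaluated this way at all, because it carries no per-pixel variation.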

BibTeX

@inproceedings{ruac2026,
  title     = {Segment Anything with Robust Uncertainty-Accuracy Correlation},
  author    = {Zhou, Hongyou and Toussaint, Marc and Shao, Ling and Ye, Zihan},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}
