Robust Onion: Peeling Open Vocab Object Detectors Under Noise

Interpretable AI Explainable AI Robustness Noise Opening Black-box VLMs Analysis Benchmark
1University of Central Florida
*Equal Contribution
ECCV'26

Model Robustness Ranking 📋

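| Model | HQ AP | Pixelation AP | Pixelation RR | Pixelation WAR | Motion Blur AP | Motion Blur RR | Motion Blur WAR | Turbulence AP | Turbulence RR | Turbulence WAR | Average Robustness | Average WAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GLEE-Plus-pretrain-Stage 1 | 44.00 | 36.17 | 0.82 | 0.82 | 36.89 | 0.84 | 0.84 | 28.28 | 0.64 | 0.64 | 0.77 | 0.77 |
| GLEE-Pro-pretrain-Stage 1 | 50.83 | 44.73 | 0.88 | 0.88 | 39.89 | 0.78 | 0.78 | 27.30 | 0.54 | 0.54 | 0.73 | 0.73 |
| GLEE-Pro-joint-Stage 2 | 61.96 | 52.82 | 0.85 | 0.85 | 46.34 | 0.75 | 0.75 | 32.50 | 0.52 | 0.52 | 0.71 | 0.71 |
| GLEE-Pro-scaleup-Stage 3 | 61.71 | 51.80 | 0.84 | 0.84 | 45.28 | 0.73 | 0.73 | 31.50 | 0.51 | 0.51 | 0.69 | 0.69 |
| MM-GDINO-L | 53.00 | 37.60 | 0.71 | 0.71 | 38.45 | 0.73 | 0.73 | 28.45 | 0.54 | 0.54 | 0.66 | 0.66 |
| GLEE-Plus-joint-Stage 2 | 60.44 | 41.88 | 0.69 | 0.69 | 44.10 | 0.73 | 0.73 | 32.89 | 0.54 | 0.54 | 0.66 | 0.66 |
| GLEE-Plus-scaleup-Stage 3 | 60.34 | 42.20 | 0.70 | 0.70 | 42.84 | 0.71 | 0.71 | 31.73 | 0.53 | 0.53 | 0.65 | 0.65 |
| FIBER-B*-RefCOCOg | 22.70 | 15.60 | 0.69 | 0.66 | 17.22 | 0.76 | 0.72 | 13.06 | 0.58 | 0.55 | 0.67 | 0.64 |
| MM-GDINO-L* - ALL | 60.30 | 41.70 | 0.69 | 0.69 | 42.75 | 0.71 | 0.71 | 31.80 | 0.53 | 0.53 | 0.64 | 0.64 |
| MM-GDINO-B (O_G_V) | 52.50 | 35.70 | 0.68 | 0.68 | 38.15 | 0.73 | 0.73 | 27.25 | 0.52 | 0.52 | 0.64 | 0.64 |
| GLIP-L [7] | 51.23 | 34.20 | 0.67 | 0.67 | 36.06 | 0.70 | 0.70 | 26.36 | 0.51 | 0.51 | 0.63 | 0.63 |
| MM-GDINO-B* - ALL | 59.50 | 40.10 | 0.67 | 0.67 | 42.05 | 0.71 | 0.71 | 29.90 | 0.50 | 0.50 | 0.63 | 0.63 |
| FIBER-B*-LVIS-FT | 50.70 | 31.70 | 0.63 | 0.63 | 35.60 | 0.70 | 0.70 | 25.19 | 0.50 | 0.50 | 0.61 | 0.61 |
| FIBER-B*-COCO-FT | 58.40 | 36.60 | 0.63 | 0.63 | 39.66 | 0.68 | 0.68 | 28.40 | 0.49 | 0.49 | 0.60 | 0.60 |
| FIBER-B | 49.47 | 30.80 | 0.62 | 0.62 | 33.53 | 0.68 | 0.68 | 23.09 | 0.47 | 0.47 | 0.59 | 0.59 |
| RCx4 Fully80-COCO-FT | 88.77 | 56.86 | 0.64 | 0.64 | 62.05 | 0.70 | 0.70 | 36.98 | 0.42 | 0.42 | 0.59 | 0.59 |
| RCx4-COCO-FT | 80.00 | 49.80 | 0.62 | 0.62 | 53.91 | 0.67 | 0.67 | 34.33 | 0.43 | 0.43 | 0.58 | 0.58 |
| FIBER-B*-RefCOCO+ | 18.00 | 12.30 | 0.68 | 0.59 | 13.38 | 0.74 | 0.64 | 9.74 | 0.54 | 0.46 | 0.66 | 0.56 |
| MM-GDINO-T (O_G_GR) | 50.50 | 30.60 | 0.61 | 0.61 | 33.00 | 0.65 | 0.65 | 19.20 | 0.38 | 0.38 | 0.55 | 0.55 |
| MM-GDINO-T (O_G_GR_V) | 50.40 | 30.10 | 0.60 | 0.60 | 33.13 | 0.66 | 0.66 | 19.20 | 0.38 | 0.38 | 0.55 | 0.55 |
| FIBER-B*-RefCOCO | 15.50 | 10.90 | 0.70 | 0.54 | 12.37 | 0.80 | 0.61 | 9.86 | 0.64 | 0.49 | 0.71 | 0.54 |
| RCx4-LVIS-FT | 82.70 | 47.80 | 0.58 | 0.58 | 53.39 | 0.64 | 0.64 | 33.49 | 0.40 | 0.40 | 0.54 | 0.54 |
| GDINO-T Swin-T (O_G_CAP4) | 48.50 | 29.30 | 0.60 | 0.60 | 30.90 | 0.64 | 0.64 | 18.00 | 0.37 | 0.37 | 0.54 | 0.54 |
| RCx4 Fully123-LVIS-FT | 82.38 | 47.55 | 0.58 | 0.58 | 53.43 | 0.64 | 0.64 | 32.20 | 0.39 | 0.39 | 0.54 | 0.54 |
| MM-GDINO-T (O_G) | 50.40 | 29.20 | 0.58 | 0.58 | 32.70 | 0.65 | 0.65 | 18.90 | 0.38 | 0.37 | 0.53 | 0.53 |
| MM-GDINO-T (O_G_V) | 50.60 | 29.30 | 0.58 | 0.58 | 32.55 | 0.64 | 0.64 | 19.00 | 0.38 | 0.38 | 0.53 | 0.53 |
| GLIP-T [5] | 46.60 | 26.20 | 0.56 | 0.56 | 29.55 | 0.63 | 0.63 | 18.05 | 0.39 | 0.39 | 0.53 | 0.53 |
| GLIP-T (C) | 46.70 | 25.70 | 0.55 | 0.55 | 29.92 | 0.64 | 0.64 | 18.14 | 0.39 | 0.39 | 0.53 | 0.53 |
| RC-COCO-FT | 75.30 | 41.60 | 0.55 | 0.55 | 46.47 | 0.62 | 0.62 | 27.84 | 0.37 | 0.37 | 0.51 | 0.51 |
| GLIP-T (B) | 44.90 | 22.30 | 0.50 | 0.50 | 27.57 | 0.61 | 0.61 | 15.81 | 0.35 | 0.35 | 0.49 | 0.49 |
| RC-LVIS-FT | 80.00 | 43.00 | 0.54 | 0.54 | 47.58 | 0.59 | 0.59 | 25.90 | 0.32 | 0.32 | 0.49 | 0.49 |
| GLEE-Lite-pretrain-Stage 1 | 42.59 | 23.00 | 0.54 | 0.54 | 26.39 | 0.62 | 0.62 | 12.33 | 0.29 | 0.29 | 0.48 | 0.48 |
| GLEE-Lite-joint-Stage 2 | 54.96 | 28.39 | 0.52 | 0.52 | 32.48 | 0.59 | 0.59 | 14.72 | 0.27 | 0.27 | 0.46 | 0.46 |
| GLIP-T (A) | 42.90 | 20.20 | 0.47 | 0.47 | 25.40 | 0.59 | 0.59 | 12.79 | 0.30 | 0.30 | 0.45 | 0.45 |
| RegionCLIP R50x4 (RCx4) | 62.40 | 28.90 | 0.46 | 0.46 | 39.67 | 0.62 | 0.62 | 17.32 | 0.27 | 0.27 | 0.45 | 0.45 |
| GLEE-Lite-scaleup-Stage 3 | 53.70 | 26.90 | 0.50 | 0.50 | 31.20 | 0.58 | 0.58 | 14.40 | 0.27 | 0.27 | 0.45 | 0.45 |
| YOLO-Worldv2-XL-640 | 47.50 | 12.60 | 0.27 | 0.27 | 29.90 | 0.63 | 0.63 | 14.30 | 0.30 | 0.30 | 0.40 | 0.40 |
| YOLO-Worldv2-L (CLIP-L)🔥 -640 | 46.00 | 11.70 | 0.25 | 0.25 | 28.20 | 0.61 | 0.61 | 12.90 | 0.28 | 0.28 | 0.38 | 0.38 |
| YOLO-Worldv2-X-640 | 46.70 | 9.60 | 0.21 | 0.21 | 29.50 | 0.63 | 0.63 | 14.50 | 0.31 | 0.31 | 0.38 | 0.38 |
| YOLO-Worldv2-L-640 | 45.40 | 10.10 | 0.22 | 0.22 | 27.40 | 0.60 | 0.60 | 14.50 | 0.32 | 0.32 | 0.38 | 0.38 |
| YOLO-Worldv2-L-640-LITE | 45.10 | 8.50 | 0.19 | 0.19 | 28.20 | 0.63 | 0.63 | 12.50 | 0.28 | 0.28 | 0.36 | 0.36 |
| YOLO-Worldv2-M-640 | 42.80 | 9.20 | 0.21 | 0.21 | 24.40 | 0.57 | 0.57 | 11.30 | 0.26 | 0.26 | 0.35 | 0.35 |
| RegionCLIP R50 (RC) | 58.18 | 20.10 | 0.35 | 0.35 | 28.82 | 0.48 | 0.48 | 8.22 | 0.14 | 0.14 | 0.32 | 0.32 |
| YOLO-Worldv2-S-640 | 37.50 | 5.30 | 0.14 | 0.14 | 20.30 | 0.54 | 0.54 | 6.80 | 0.18 | 0.18 | 0.29 | 0.29 |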
Assuming random accuracy to be 0, we apply the WAR metric (alpha=60, Pathak et al., 2025) to offset abnormally high robustness scores from models like FIBER-B, whose clean accuracy is close to random. Random predictions remain random after noise, so the drop is minimal and the robustness score is inflated. Models are ranked by average WAR across the Pixelation, Motion Blur, and Turbulence perturbations. The higher the color intensity, the greater the robustness [0-1] against the respective noise.
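
The RR columns in the table are numerically consistent with a simple ratio of noisy AP to HQ AP, and the Average Robustness column with the mean of the three RR values. A minimal sketch of that arithmetic, assuming this definition (the function names are illustrative; WAR is not reproduced here, since its formula is defined in Pathak et al., 2025 and not on this page):

```python
def relative_robustness(ap_clean, ap_noisy):
    """Ratio of AP under noise to AP on clean (HQ) images (assumed RR definition)."""
    if ap_clean <= 0:
        raise ValueError("clean AP must be positive")
    return ap_noisy / ap_clean


def average_robustness(ap_clean, noisy_aps):
    """Mean RR over a dict of {perturbation name: noisy AP}."""
    scores = [relative_robustness(ap_clean, ap) for ap in noisy_aps.values()]
    return sum(scores) / len(scores)


# GLEE-Plus-pretrain-Stage 1 row from the ranking table (HQ AP = 44.00).
glee_plus = {"Pixelation": 36.17, "Motion Blur": 36.89, "Turbulence": 28.28}
print(round(relative_robustness(44.00, glee_plus["Pixelation"]), 2))  # 0.82
print(round(average_robustness(44.00, glee_plus), 2))  # 0.77
```

This also shows why a model with near-random clean AP can score well on RR: a small denominator with a similarly small numerator still yields a ratio close to 1.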


Teaser Figure
Effect of Noise Degradation: Performance of GLIP (top) & MM-GDINO (bottom) on COCO under turbulence, pixelation, and motion blur. The figure shows the impact of real-world noise on the object detection performance of these models.

Abstract 📖

The impact of real-world noise on Open Vocabulary Object Detectors (OV-ODs) remains poorly understood due to their architectural complexity. We present our comprehensive analysis Robust Onion, an empirical study that uses controlled synthetic visual degradations to peel OV-ODs layer-by-layer, revealing how, why, and where robustness degrades, systematically analyzing feature collapse. Our findings reveal that models with similar vision backbones exhibit comparable robustness, driven by similar feature collapse at similar layers, while factors such as pretraining strategy, architectural nuances, and caption supervision contribute little. Robustness is primarily governed by the image domain rather than annotations, explaining the similar robustness impact on COCO and LVIS, and why datasets like ODinW-13 can give an inflated impression of robustness due to large, isolated objects. Finally, we validate our insights by improving robustness on real-world BDD-100K and other autonomous driving datasets via our NN & TK0 approach, using 96× fewer trainable parameters than end-to-end training, while also explaining reported robustness observations from prior works.

Setup 🔨

Architecture
Analysis of VLM architecture for robustness: We analyze the VLM architecture layer by layer (peeling layers) to understand the impact of real-world noise on object detection performance. Sec. 4.1 (WHERE: Model-Based Analysis) analyzes overall detector size, backbone, pretraining, and fine-tuning. Sec. 4.2 (WHY: Insight into Transformers) peels the backbone, enhancer, and fusion to localize feature collapse layer by layer. Sec. 4.3 (WHAT: Robustness as a function of Dataset) studies image-domain factors: object size, count, occlusion, and class. Sec. 4.4 (WHERE: Captions & Prompt Engineering) evaluates caption expressiveness, prompt rewording, and superclass grouping. Image modified from GLIP.

Figure panels: No Collapse (left), Complete Collapse (right)
Synthetic COCO & Real-World BDD-100K feature collapse: Noisy image features (orange) are compared against clean HQ image features (blue). (Left) Minimal feature collapse relative to HQ features, seen for synthetic noises such as motion blur and fog. (Right) Complete feature collapse relative to HQ features, seen for synthetic noises such as turbulence, rain, and ISO noise. As the severity of the synthetic noise increases, feature collapse goes from minimal to complete.

Sec. 4.1 WHERE: Model-Based Analysis 🥸

(a) Size and robustness correlation: Strong positive correlation between robustness and overall detector size. Fine-tuned models are marked separately.
(b) Similar robustness for similar backbones: Depth is the deciding factor, with overall size playing a minimal role.
(c) Weak correlation between robustness and pretraining: Robustness remains consistent regardless of pretraining dataset size.
(d) No clear correlation between fine-tuning and robustness: RegionCLIP improves, FIBER-B suffers.
  • Model size positively correlates with robustness (a), with the backbone being the deciding factor.
  • Similar backbones have similar robustness (b); other design choices such as architectural design, neck, vision-text fusion network, losses, and pretraining strategies or data play hardly any role.
  • The overall trend is ResNet < Swin-T (12 blocks, 27M) < Swin-B (12 blocks, 87M) ≤ Swin-L (24 blocks, 195M) ≃ EVA-02 (24 blocks, 303M).
  • Robustness is not simply learned from more data (c). Models trained on very different amounts of data can have similar robustness. Hence, robustness techniques should emphasize noise-robust training rather than relying on pretraining and fine-tuning.
  • Fine-tuning shows no clear correlation with robustness (d): while it helps RegionCLIP, it hurts FIBER-B.
  • Swin-B (87M) can be a strong, lighter alternative to EVA-02 (303M) because their robustness is similar.
  • COCO and LVIS (same images, different annotations) impart similar robustness, i.e. the image domain matters more than the annotations.
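
To make the size trend in (a) concrete, the sketch below computes a Pearson correlation between backbone parameter count and Average WAR for three models. The pairing is an illustrative assumption: the parameter counts are the backbone sizes listed above, and the WAR values are read from the ranking table for GLIP-T [5] (Swin-T), MM-GDINO-L (Swin-L), and GLEE-Pro-joint-Stage 2 (EVA-02). Three points are far too few for a statistical claim; the code only shows how such a correlation can be computed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


backbone_params_m = [27, 195, 303]  # Swin-T, Swin-L, EVA-02 (millions of parameters)
average_war = [0.53, 0.66, 0.71]    # GLIP-T [5], MM-GDINO-L, GLEE-Pro-joint-Stage 2

r = pearson(backbone_params_m, average_war)
print(f"Pearson r = {r:.2f}")
```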

Sec. 4.2 WHY: Insight into Transformers (GLIP, MM-GDINO & GLEE) 🤯

Figure panels: Pixelation, Motion Blur, and Turbulence (three panels each)
Layerwise Robustness Analysis: Per-layer feature collapse for Swin-T (GLIP, 12 blocks), Swin-L (MM-GDINO-L, 24 blocks), and EVA-02 (GLEE-Pro-joint, 24 blocks) is analyzed across 3 types of perturbation. Robustness is measured by the overlap between features at severity 5 (dark patches) and severity 0 (lighter patches); this overlap appears at the fusion layer. All models show a similar trend of feature collapse as depth increases.
  • Early layers are more vulnerable to noise, as indicated by distinct clusters (lumps) of features at shallow layers.
  • At the same depth (2 and 4 blocks, i.e. layer #1 and layer #2), feature collapse (lumps) is similar across all models, despite architectural differences.
  • Model depth, not necessarily size, determines feature collapse: in MM-GDINO-L (24 blocks), the collapse at layer #3 (22nd block) and layer #4 (24th block) is similar to that of GLEE-Pro-joint (24 blocks), whereas it differs considerably from Swin-T (12 blocks) at layer #3 (10th block) and layer #4 (12th block).
  • The feature enhancer does not contribute to robustness.
  • The fusion layer exchanges information across the spatial tokens of all layers (192 × 192, 96 × 96, 48 × 48, 24 × 24), inducing robustness in the last layer (24 × 24), as evidenced by the significant overlap between features at severity 5 and severity 0 for all models.

Sec. 4.3 WHAT: Robustness as a function of Dataset 😨

(a) Number of objects: Models are very robust when the number of objects in an image is < 3, with unpredictable jumps beyond 25 objects, likely due to very few samples in that range.
(b) Dataset-dependent robustness: COCO and LVIS show similar robustness because of their similar image domain, whereas ODinW-13 (mostly large, single objects) is more robust than COCO & LVIS under pixelation.
(c) Classwise robustness: Some COCO classes are more robust (shade of blue) than others, with moderate correlation with average object size (dot size). Classes are grouped into frequency bins for GLIP-T.
  • Larger objects (≥ 96 × 96 size) are significantly more robust to noise.
  • All detectors are highly robust when there are only a few objects to be detected (≤ 3) (a). As the number of objects per image rises above 3, robustness starts to drop.
  • Datasets like ODinW-13, dominated by single, large objects, are significantly more robust than LVIS & COCO, which have a higher proportion of smaller objects and more objects on average.
  • COCO and LVIS exhibit nearly identical robustness despite vastly different label spaces, because they share the same underlying images (b).
  • Class-wise robustness correlates moderately with average object size, but shows almost no correlation with class frequency (c). This in turn partially explains why LVIS (long-tail distribution) has similar robustness to COCO.
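
The 96 × 96 threshold for large objects matches the COCO convention of bucketing objects by box area. A minimal sketch, assuming the usual COCO small/medium/large cut-offs at 32 × 32 and 96 × 96 pixels (the helper names are hypothetical and the boxes are made-up examples):

```python
def size_bucket(width, height):
    """Bucket an object by box area using COCO-style 32^2 / 96^2 thresholds."""
    area = width * height
    if area >= 96 * 96:
        return "large"
    if area >= 32 * 32:
        return "medium"
    return "small"


def bucket_counts(boxes):
    """Count objects per size bucket for a list of (width, height) boxes."""
    counts = {"small": 0, "medium": 0, "large": 0}
    for width, height in boxes:
        counts[size_bucket(width, height)] += 1
    return counts


print(bucket_counts([(20, 20), (50, 40), (120, 100), (96, 96)]))
# {'small': 1, 'medium': 1, 'large': 2}
```

Comparing robustness per bucket, and per number of objects in the image, is one way to check the object-size and object-count effects on a new dataset.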

Sec. 4.4 WHERE: Expressiveness of Captions & Prompt Engineering 🤔

(a) Limited impact of training-caption descriptiveness on robustness: Despite training on captions with different degrees of expressiveness (RefCOCOg is the most descriptive), robustness varies only slightly (FIBER-B).
(b) Evaluation with pixelation-aware captions: Evaluation on Flickr30k with test captions modified to include textual context of pixelation (using an LLM, 'LR captions') has minimal impact on the robustness of GLIP variants.
(c) Robustness for fine-grained vs. superclass captions: Similar robustness for ODinW-13 fine-grained vs. superclass captions under pixelation (GLIP-T). Language is not the primary driver of robustness.
  • Caption expressiveness during training has minimal impact on robustness: models fine-tuned on simple (RefCOCO), appearance-based (RefCOCO+), or highly descriptive captions (RefCOCOg) show nearly identical robustness under noise, indicating that robustness depends fundamentally on visual features rather than semantics (a).
  • Injecting perturbation context into test prompts (captions) does not meaningfully impact robustness, showing that language cannot compensate for corrupted visual representations caused by noise (b).
  • Rewording class labels or grouping classes into supercategories does not meaningfully impact robustness (c).
  • Efforts to improve OV-OD robustness should prioritize the vision backbone and feature learning, rather than relying on caption richness, prompt tuning, or linguistic augmentation.

Sec. 5.1 Validating Model Design on Real-World Datasets
TK0 & NN 🔧

Our analysis in Sec. 4.1 & 4.2 highlights three key design principles: (1) the vision backbone is the primary determinant of robustness, (2) shallow layers are most adversely affected by noise, and (3) cross-exchanging information across backbone layers improves feature overlap across severities. We validate these insights with two lightweight modules added on top of a frozen GLIP-T trained on BDD-100K.

  • TK0 (Trainable Spatial Tokens): We extend LR-TK0 (Pathak et al., 2025), originally designed for low-resolution image classification, to hierarchical vision transformers (Swin) for object detection. We drop the expensive teacher-student distillation and retain only the cost-efficient trainable tokens, applied solely to the visual backbone while keeping the entire OV-OD frozen. At every layer (especially shallow), we insert a fixed 32×32 set of trainable spatial tokens, interpolated to match the layer's spatial resolution. This enables (i) flexible hierarchy, adapting to varying H×W across layers, and (ii) lower overhead (+5.7%) compared to the fixed-token design (+22.5%) for a 600×600 Swin-T input.
  • NN (Cross-Layer Non-Local Block): A simple trainable non-local block, implemented as a single-head self-attention module, where Query/Key/Value are formed by concatenating spatial tokens from all backbone layers. This enables tokens to share spatial information across layers — mirroring the role of the fusion transformer with far fewer parameters.
  • NN & TK0 (Combined): Combining both insights yields robustness comparable to end-to-end fine-tuning, while using only 2.41M trainable parameters — 96× fewer than E2E. It even surpasses E2E and Fuse baselines on DAWN and Foggy Cityscapes.
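
The two modules can be illustrated with a dependency-free sketch. It is not the authors' implementation: it uses a tiny single-channel token grid, bilinear interpolation with aligned corners, element-wise addition of tokens to features, identity Query/Key/Value projections, a shared channel dimension across layers, and a residual connection, all of which are assumptions made for illustration. A real version would use a tensor library with learned parameters and multi-channel features.

```python
import math


def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D grid of floats with bilinear interpolation (corners aligned)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def add_spatial_tokens(features, tokens):
    """TK0-style step: resize a fixed token grid to H x W and add it (assumed additive)."""
    resized = bilinear_resize(tokens, len(features), len(features[0]))
    return [[f + t for f, t in zip(f_row, t_row)]
            for f_row, t_row in zip(features, resized)]


def cross_layer_attention(layer_tokens):
    """NN-style step: single-head self-attention over tokens from all layers."""
    tokens = [tok for layer in layer_tokens for tok in layer]  # concatenate layers
    scale = 1.0 / math.sqrt(len(tokens[0]))
    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        mixed = [sum(e / total * value[d] for e, value in zip(exps, tokens))
                 for d in range(len(query))]
        outputs.append([q + m for q, m in zip(query, mixed)])  # residual (assumed)
    return outputs


tokens = [[0.0, 1.0], [2.0, 3.0]]                  # stands in for the 32x32 token grid
features = [[1.0] * 3 for _ in range(3)]           # stands in for one layer's feature map
print(add_spatial_tokens(features, tokens)[1][1])  # 2.5
```

Because the token grid is resized per layer, the same fixed set of tokens can serve feature maps of any H × W, which is the flexibility the TK0 description relies on.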

Setup: GLIP-T is trained on BDD-100K and zero-shot evaluated on DAWN, Foggy Cityscapes, Virtual KITTI 2, and COCO (to monitor preservation of the original VLM zero-shot ability). Baselines: E2E (full end-to-end fine-tuning) and Fuse (only the fusion transformer is trained).

Real-World Validation on Driving Datasets 🚗

Zero-shot evaluation of BDD-100K-trained GLIP-T on DAWN, Foggy Cityscapes (Fog City), Virtual KITTI 2, and COCO. E2E and Fuse are training baselines (fine-tuning pretrained weights). PC=Partly Cloudy; O=Overcast; S=Snow; R=Rain; F=Fog; Sa=Sand; All=Overall across all categories. The second row per model shows the change (Δ) relative to vanilla zero-shot.

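| Model | BDD-100K (Training) PC | BDD-100K S | BDD-100K R | BDD-100K F | BDD-100K O | BDD-100K All | DAWN F | DAWN R | DAWN S | DAWN Sa | DAWN All | Fog City | Virtual KITTI 2 F | Virtual KITTI 2 O | Virtual KITTI 2 R | Virtual KITTI 2 All | COCO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 43.4 | 43.7 | 41.7 | 46.6 | 47.4 | 31.4 | 32.6 | 40.5 | 38.5 | 49.8 | 33.8 | 21.4 | 15.6 | 11.2 | 11.5 | 20.9 | 46.6 |
| E2E | 71.3 | 72.7 | 71.0 | 69.5 | 77.4 | 66.4 | 31.6 | 36.4 | 38.5 | 45.2 | 36.0 | 26.9 | 16.7 | 12.0 | 12.6 | 31.6 | 3.0 |
| Δ E2E | 27.9 | 29.0 | 29.3 | 22.9 | 30.0 | 35.0 | -1.0 | -4.1 | 0.0 | -4.6 | 2.2 | 5.5 | 1.1 | 0.8 | 1.1 | 10.7 | -43.6 |
| Fuse | 69.4 | 72.1 | 70.3 | 67.7 | 76.6 | 65.1 | 30.0 | 35.9 | 36.3 | 42.5 | 35.6 | 26.6 | 17.2 | 12.3 | 12.7 | 29.8 | 3.8 |
| Δ Fuse | 26.0 | 28.4 | 28.6 | 21.1 | 29.2 | 33.7 | -2.6 | -4.6 | -2.2 | -7.3 | 1.8 | 5.2 | 1.6 | 1.1 | 1.2 | 8.9 | -42.8 |
| TK0 | 61.5 | 63.0 | 61.0 | 59.2 | 68.4 | 45.1 | 35.3 | 42.3 | 40.6 | 50.1 | 32.5 | 25.5 | 16.9 | 12.2 | 12.6 | 26.1 | 43.2 |
| Δ TK0 | 18.1 | 19.3 | 19.3 | 12.6 | 21.0 | 13.7 | 2.7 | 1.8 | 2.1 | 0.3 | -1.3 | 4.1 | 1.3 | 1.0 | 1.1 | 5.2 | -3.4 |
| NN | 61.8 | 64.0 | 61.6 | 63.0 | 68.5 | 45.0 | 33.3 | 41.6 | 38.3 | 49.7 | 33.4 | 25.8 | 16.9 | 12.0 | 12.4 | 28.7 | 22.1 |
| Δ NN | 18.4 | 20.3 | 19.9 | 16.4 | 21.1 | 13.6 | 0.7 | 1.1 | -0.2 | -0.1 | -0.4 | 4.4 | 1.3 | 0.8 | 0.9 | 7.8 | -24.5 |
| NN & TK0 | 66.8 | 68.6 | 66.1 | 65.5 | 73.1 | 55.2 | 34.0 | 41.3 | 39.8 | 49.5 | 36.1 | 28.2 | 17.3 | 12.4 | 13.1 | 29.9 | 33.4 |
| Δ NN & TK0 | 23.4 | 24.9 | 24.4 | 18.9 | 25.7 | 23.8 | 1.4 | 0.8 | 1.3 | -0.3 | 2.3 | 6.8 | 1.7 | 1.2 | 1.6 | 9.0 | -13.2 |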
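
The second row for each model in the table above is the per-column difference from the vanilla zero-shot row. A minimal sketch that recomputes it for the overall columns (the dictionary layout and function name are illustrative; the values are copied from the table):

```python
def change_vs_zero_shot(model_scores, zero_shot_scores):
    """Per-column difference from the zero-shot baseline, rounded to one decimal."""
    return {name: round(model_scores[name] - zero_shot_scores[name], 1)
            for name in zero_shot_scores}


# "All" (overall) columns copied from the table, plus Fog City and COCO.
zero_shot = {"BDD-100K": 31.4, "DAWN": 33.8, "Fog City": 21.4,
             "Virtual KITTI 2": 20.9, "COCO": 46.6}
e2e = {"BDD-100K": 66.4, "DAWN": 36.0, "Fog City": 26.9,
       "Virtual KITTI 2": 31.6, "COCO": 3.0}
nn_tk0 = {"BDD-100K": 55.2, "DAWN": 36.1, "Fog City": 28.2,
          "Virtual KITTI 2": 29.9, "COCO": 33.4}

print(change_vs_zero_shot(e2e, zero_shot)["COCO"])     # -43.6
print(change_vs_zero_shot(nn_tk0, zero_shot)["COCO"])  # -13.2
```

The COCO column makes the trade-off visible: E2E loses most of the original zero-shot ability, while NN & TK0 retains considerably more of it.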

BibTeX 🙏

@inproceedings{robust_onion,
      title           ={Robust Onion: Peeling Open Vocab Object Detectors Under Noise},
      author          ={Priyank Pathak and Mukilan Karuppasamy and Aaditya Baranwal and Shruti Vyas and Yogesh S Rawat},
      booktitle       ={The 19th European Conference on Computer Vision (ECCV)},
      year            ={2026},
      month           ={September},
      url             ={},
      }