
Visual-language foundation models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on large-scale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored.
We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FMs across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets.
Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher-resolution models are less robust against LR.
Our analysis further reveals that models make semantically reasonable predictions at LR, and that the lack of fine-grained details in the input adversely impacts the model's initial layers more than its deeper layers. We use these insights to introduce a simple strategy, LR-TK0, which enhances the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low-resolution inputs across several datasets, as well as its generalization across backbones and other approaches.
Problem A) Misleading high robustness: If a model performs poorly on a challenging dataset, i.e., its accuracy is close to random predictions, then downsampling will likely preserve these near-random predictions with a minimal drop in accuracy, yielding an abnormally high robustness score.
▷ Solution: Improved Relative Robustness: We propose zeroing out robustness near random predictions. \( \epsilon \) measures how far the model's accuracy is from random predictions: \( \epsilon = A_{HQ} - A_{random} \).
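A minimal sketch of this correction (the cutoff value and the exact form of the relative-robustness score below are illustrative assumptions, not necessarily the paper's exact formulation):

```python
def improved_relative_robustness(acc_lr, acc_hq, acc_random, eps_cutoff=0.05):
    """Relative robustness, zeroed out when the model is close to random predictions.

    acc_lr     : top-1 accuracy on low-resolution inputs
    acc_hq     : top-1 accuracy on the original high-quality inputs
    acc_random : chance-level accuracy, e.g. 1 / num_classes
    eps_cutoff : illustrative threshold below which robustness is set to zero
    """
    eps = acc_hq - acc_random          # how far the model is from random predictions
    if eps <= eps_cutoff:              # near-random model: robustness is not meaningful
        return 0.0
    return acc_lr / acc_hq             # plain relative robustness otherwise
```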
Problem B) SAR overlooks datasets: When comparing models, their robustness scores are averaged across datasets with each dataset given a weight of 1 (SAR). Ideally, the model rankings after averaging should stay consistent with the individual dataset rankings. In practice, model rankings on datasets like ImageNet overshadow the rankings on datasets like ImageNet-A and EuroSAT, which behave differently. The final comparison therefore effectively excludes such datasets, as if they were not present (left, below).
▷ Solution: Weighted Aggregated Robustness: We propose adjusting the dataset weights so that the model rankings after aggregation reflect each dataset fairly. The weights are optimized such that the Spearman correlation between the model rankings after the weighted average and the individual dataset rankings is maximized (right, below).
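A hedged sketch of this weight search (the uniform initialization, the optimizer choice, and averaging the per-dataset correlations are assumptions; `robustness` is a models × datasets matrix of per-dataset robustness scores):

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.optimize import minimize

def fit_dataset_weights(robustness):
    """Find dataset weights maximizing the mean Spearman correlation between the
    weighted-aggregated model ranking and each dataset's own model ranking.

    robustness : array of shape (num_models, num_datasets)
    """
    _, n_datasets = robustness.shape

    def neg_mean_spearman(w):
        w = np.abs(w) / np.abs(w).sum()      # keep weights positive and normalized
        aggregated = robustness @ w          # weighted aggregated robustness per model
        corrs = [spearmanr(aggregated, robustness[:, d])[0] for d in range(n_datasets)]
        return -np.mean(corrs)

    w0 = np.ones(n_datasets) / n_datasets    # start from the uniform weighting (SAR)
    result = minimize(neg_mean_spearman, w0, method="Nelder-Mead")
    return np.abs(result.x) / np.abs(result.x).sum()
```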
Our technique freezes the existing model weights and trains learnable "LR Tokens" (left) via the self-supervised technique "LR-TK0" (right) on a synthetic dataset, without any annotations or labels.
▷ LR tokens are added to the frozen spatial patches (white) after patch generation and before each frozen transformer block, with the class token used as the final feature.
▷ LR-TK0: Multi-scale training (only one scale shown for simplicity). The teacher (w/o LR tokens) generates \( f^T_{HR} \) (HR), while the student (w/ LR tokens) generates both \( f^S_{HR} \) and \( f^S_{LR} \).
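A simplified PyTorch-style sketch of the idea, assuming a timm-style ViT backbone (attributes `patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`) and an L2 feature-distillation loss; the exact token placement and loss are illustrative assumptions rather than the actual LR-TK0 implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LRTokenWrapper(nn.Module):
    """Adds learnable 'LR tokens' to the spatial patch tokens of a frozen ViT."""
    def __init__(self, vit):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():
            p.requires_grad = False                      # pre-trained weights stay frozen
        num_patches, dim = self.vit.patch_embed.num_patches, self.vit.embed_dim
        # one additive token set after patch generation + one before each transformer block
        self.lr_tokens = nn.ParameterList([
            nn.Parameter(torch.zeros(1, num_patches, dim))
            for _ in range(len(self.vit.blocks) + 1)
        ])

    def forward(self, x):
        patches = self.vit.patch_embed(x) + self.lr_tokens[0]   # LR tokens after patch generation
        cls = self.vit.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, patches], dim=1) + self.vit.pos_embed
        for blk, lr_tok in zip(self.vit.blocks, self.lr_tokens[1:]):
            tokens = torch.cat([tokens[:, :1], tokens[:, 1:] + lr_tok], dim=1)
            tokens = blk(tokens)                                 # frozen transformer block
        return self.vit.norm(tokens)[:, 0]                       # class token as the final feature


def lr_tk0_loss(student, teacher, hr_images, lr_images):
    """Self-supervised distillation: the frozen teacher (no LR tokens) on HR images
    supervises the student (with LR tokens) on both HR and LR views of the same image."""
    with torch.no_grad():
        f_t_hr = teacher(hr_images)              # f^T_HR
    f_s_hr = student(hr_images)                  # f^S_HR
    f_s_lr = student(lr_images)                  # f^S_LR
    return F.mse_loss(f_s_hr, f_t_hr) + F.mse_loss(f_s_lr, f_t_hr)
```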
We use the diffusion model PixArt-α to generate synthetic HR images from 7,000 captions randomly sampled from Conceptual Captions.
CLAIM: If 7,000 (or fewer) concepts/captions can consistently enhance model performance across 15 datasets, it suggests that the model is likely learning the relationship between HR and LR features rather than exploiting shortcuts. This is supported by greater improvements at LR (16×16) than at HR (128×128). If the model were somehow cheating the zero-shot evaluation using diffusion-generated images, we would expect similar or better improvements at high resolutions as well.
Samples generated using the captions randomly sampled from Conceptual Captions.
Multiple images per caption generated via different seeds.
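A hedged sketch of this generation step using the Hugging Face `diffusers` PixArt-α pipeline; the checkpoint name, caption file, output paths, and the number of seeds per caption are illustrative assumptions:

```python
import os
import torch
from diffusers import PixArtAlphaPipeline

# Load a PixArt-α text-to-image pipeline (checkpoint choice is an assumption).
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-512x512", torch_dtype=torch.float16
).to("cuda")

# Captions randomly sampled from Conceptual Captions, one per line (hypothetical file).
with open("cc_captions_7k.txt") as f:
    captions = [line.strip() for line in f if line.strip()]

os.makedirs("synthetic_hr", exist_ok=True)

# Generate multiple synthetic HR images per caption by varying the random seed.
for idx, caption in enumerate(captions):
    for seed in range(4):                                   # images per caption; 4 is illustrative
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(caption, generator=generator).images[0]
        image.save(f"synthetic_hr/{idx:05d}_{seed}.png")
```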
LR-TK0 improvement on foundation models. Abbreviations: EVA-B/16: EVA-02-CLIP-B/16, Meta-B/16: MetaCLIP-ViT-B/16 (2.5B), OC-B/16: OpenCLIP-ViT-B/16. Higher numbers indicate better performance.
Baseline vs. LR-TK0: Top-1 accuracy improvement for EVA-CLIP-B/16 at 16×16 on each dataset for zero-shot classification.
Left: #Images/Caption: Robustness vs. size of the diffusion-generated dataset. Robustness can be improved even with 2K captions instead of the 7K captions used in the paper. Right: Baseline vs. LR-TK0: Top-1 accuracy for EVA-CLIP-B/16 at 16×16. LR-TK0 not only improves the accuracy of all EVA variants but also yields larger gains as more LR tokens are added.
@inproceedings{
pathak2025lrfm,
title={{LR}0.{FM}: {LOW}-{RESOLUTION} {ZERO}-{SHOT} {CLASSIFICATION} {BENCHMARK} {FOR} {FOUNDATION} {MODELS}},
author={Priyank Pathak and Shyam Marjit and Shruti Vyas and Yogesh S Rawat},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=AsFxRSLtqR}
}