
Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color—specifically foreground and background colors—as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce S2A self-attention, a novel self-attention mechanism that prevents information leakage between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. Without relying on additional supervision, we improve the baseline Top-1 accuracy by 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and by 1.0% on CCVID and 2.5% on MeVID for video-based ReID. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID.
Traditional transformer-based ReID models split the RGB input into spatial patches and pass them through transformer layers. The class token is used as the ReID Token at inference and is trained with a triplet loss and an identity classifier. We introduce one additional class token, the Color Token, which learns a color embedding via MSE regression on color histograms. We then disentangle the Color Token from the ReID Token using a cosine loss.
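To make the token setup concrete, below is a minimal PyTorch sketch of the three objectives just described. The module and tensor names (id_classifier, color_head, color_hist) and the equal loss weighting are illustrative assumptions, not the paper's exact recipe.

# Minimal sketch of the CSCI training objectives (assumed names and weights).
import torch.nn.functional as F

def csci_losses(reid_tok, color_tok, color_hist, labels, id_classifier, color_head):
    # reid_tok, color_tok: (B, D) class tokens from the transformer
    # color_hist:          (B, H) color histogram computed from the raw frame
    # labels:              (B,)   identity labels
    # Identity supervision on the ReID Token (triplet term omitted for brevity).
    id_loss = F.cross_entropy(id_classifier(reid_tok), labels)

    # "Color See": regress the frame's color histogram from the Color Token.
    color_loss = F.mse_loss(color_head(color_tok), color_hist)

    # "Color Ignore": penalize cosine similarity between the two tokens
    # (one common form of a cosine disentanglement loss; exact form assumed).
    cos_loss = F.cosine_similarity(reid_tok, color_tok, dim=-1).abs().mean()

    # Equal weighting of the three terms is an assumption for illustration.
    return id_loss + color_loss + cos_loss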
Traditional Self-Attention
shares information across all tokens, leaking information between the ReID (biometrics) and Color (appearance bias) tokens. This is an example of 100% overlap between biometrics and appearance bias.
Masked Self-Attention
does not allow information sharing between the ReID (biometrics) and Color (appearance bias) tokens; however, the two tokens still influence each other's attention weights on the spatial tokens. This is an example of 0% overlap between biometrics and appearance bias, but with the tokens still influencing each other's weights.
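As a rough illustration (not the released code), the mask below blocks direct ReID/Color attention under an assumed token layout of index 0 = ReID Token, index 1 = Color Token, and the remaining indices = spatial patch tokens.

# Sketch of a masked self-attention mask. True = attend, False = block.
import torch

def reid_color_mask(num_tokens: int) -> torch.Tensor:
    mask = torch.ones(num_tokens, num_tokens, dtype=torch.bool)
    mask[0, 1] = False  # ReID query cannot attend to the Color key
    mask[1, 0] = False  # Color query cannot attend to the ReID key
    return mask

# Usage: scores = scores.masked_fill(~reid_color_mask(N), float("-inf")) before
# the softmax. Direct ReID <-> Color attention is blocked, but, as noted above,
# the two class tokens can still influence each other's weights on the spatial tokens.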
S2A Self-Attention (Ours ☺️)
performs self-attention in two steps, so the Color token (appearance bias) no longer influences the weights of the ReID token (biometrics) and vice versa, exactly like Masked Self-Attention, and again an example of 0% overlap between biometrics and appearance bias. By adjusting the weights used to average the two sets of spatial tokens, one aware of biometrics and the other aware of appearance bias, we can control which signal gets more weight (biometrics or appearance bias). The current hyperparameter gives equal weight to both (1/2 each).
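The sketch below illustrates the two-step idea under the same assumed token layout (0 = ReID, 1 = Color, rest = spatial), using single-head attention on already-projected q, k, v; the 0.5/0.5 mixing is the equal-weight setting mentioned above, and the rest is an assumption rather than the released implementation.

# Illustrative two-step (S2A-style) attention sketch; q, k, v have shape (B, N, D).
import torch
import torch.nn.functional as F

def s2a_attention(q, k, v, alpha: float = 0.5):
    spatial = list(range(2, q.shape[1]))
    idx_r = [0] + spatial  # step 1: ReID token + spatial tokens only
    idx_c = [1] + spatial  # step 2: Color token + spatial tokens only

    out_r = F.scaled_dot_product_attention(q[:, idx_r], k[:, idx_r], v[:, idx_r])
    out_c = F.scaled_dot_product_attention(q[:, idx_c], k[:, idx_c], v[:, idx_c])

    # Neither class token ever appears in the other's softmax, so neither can
    # influence the other's weights. The two spatial outputs are averaged with
    # weight alpha (biometrics-aware) vs. 1 - alpha (appearance-bias-aware);
    # alpha = 0.5 is the equal-weight setting mentioned above.
    out = torch.empty_like(q)
    out[:, 0] = out_r[:, 0]   # ReID token output
    out[:, 1] = out_c[:, 0]   # Color token output
    out[:, 2:] = alpha * out_r[:, 1:] + (1 - alpha) * out_c[:, 1:]
    return out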
► An alternative to our S2A self-attention would be to use two transformers, one for ReID and the other for appearance bias, which is computationally impractical for deployment; many ReID works currently deploy exactly this as two ResNets or two transformers, one for biometrics and the other for appearance bias (e.g., diffusion models for clothes, LLMs for clothing descriptions).
► Another alternative to S2A self-attention would be to simply leak information between biometrics and appearance bias by sharing the backbone, most famously done by CAL for CC-ReID. In transformers, that corresponds to "Traditional Self-Attention".
► An alternative to using color would be to use "traditional" integer clothing annotations instead. However, colors are more expressive than integer clothing labels.
► Another alternative would be LLM-based fine-grained clothing descriptions, which are computationally infeasible: a fine-grained description must be generated per frame for video, as clothing may change across the video.
The same clothing label but different K-means clusters of color embeddings indicates that the model takes illumination and other environmental factors into account. Depending on how you see it, this can be a limitation ("noisy clothing labels") or a strength: the color embeddings capture what is happening at that exact moment, i.e., they are "adaptive".
Different clothes but similar K-means clusters of color embeddings indicate the true limitation of colors: colors are computed over the overall frame, which may lead the model to relate different images that merely share similar foreground and background colors. Possible solution: horizontal/vertical splits of colors, or finer-grained colors?
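A minimal sketch of the kind of analysis behind these observations: cluster the learned color embeddings with K-means and compare the clusters against clothing labels. The array names (color_embs, clothes_ids), the number of clusters, and the agreement metric are illustrative assumptions.

# Sketch: compare K-means clusters of color embeddings with clothing labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_vs_clothes(color_embs: np.ndarray, clothes_ids: np.ndarray, k: int = 20):
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(color_embs)
    # High agreement: color clusters track clothing labels.
    # Low agreement: illumination/background effects, or the failure case above.
    return adjusted_rand_score(clothes_ids, clusters), clusters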
@InProceedings{Pathak_2025_ICCV,
author = {Pathak, Priyank and Rawat, Yogesh S},
title = {Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
}