
Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color—specifically foreground and background colors—as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce S2A self-attention, a novel self-attention mechanism that prevents information leakage between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. Without relying on additional supervision, we improve the baseline Top-1 accuracy by 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and by 1.0% on CCVID and 2.5% on MeVID for video-based ReID. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID.
Traditional transformer-based ReID models split the RGB input into spatial patches and pass them through transformer layers. The class token is used as the ReID Token at inference and is trained with a triplet loss and an identity classifier. We introduce one additional class token, the Color Token, which learns a color embedding via MSE regression on color histograms. We then disentangle the Color Token from the ReID Token using a cosine loss.
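To make the token setup concrete, below is a minimal PyTorch sketch of the three objectives just described. The module and tensor names (id_classifier, color_head, color_hist) and the equal loss weighting are illustrative assumptions, not the paper's exact recipe.

# Minimal sketch of the CSCI training objectives (assumed names and weights).
import torch.nn.functional as F

def csci_losses(reid_tok, color_tok, color_hist, labels, id_classifier, color_head):
    # reid_tok, color_tok: (B, D) class tokens from the transformer
    # color_hist:          (B, H) color histogram computed from the raw frame
    # labels:              (B,)   identity labels
    # Identity supervision on the ReID Token (triplet term omitted for brevity).
    id_loss = F.cross_entropy(id_classifier(reid_tok), labels)

    # "Color See": regress the frame's color histogram from the Color Token.
    color_loss = F.mse_loss(color_head(color_tok), color_hist)

    # "Color Ignore": penalize cosine similarity between the two tokens
    # (one common form of a cosine disentanglement loss; exact form assumed).
    cos_loss = F.cosine_similarity(reid_tok, color_tok, dim=-1).abs().mean()

    # Equal weighting of the three terms is an assumption for illustration.
    return id_loss + color_loss + cos_loss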
Traditional Self-Attention
shares information across all tokens, leaking information between the ReID (biometrics) and Color (appearance bias) tokens. This is an example of 100% overlap between biometrics and appearance bias.
Masked Self-Attention
does not allow information sharing between the ReID (biometrics) and Color (appearance bias) tokens; however, the two tokens still influence each other's attention weights on the spatial tokens. This is an example of 0% overlap between biometrics and appearance bias, but with the tokens still influencing each other's weights.
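As a rough illustration (not the released code), the mask below blocks direct ReID/Color attention under an assumed token layout of index 0 = ReID Token, index 1 = Color Token, and the remaining indices = spatial patch tokens.

# Sketch of a masked self-attention mask. True = attend, False = block.
import torch

def reid_color_mask(num_tokens: int) -> torch.Tensor:
    mask = torch.ones(num_tokens, num_tokens, dtype=torch.bool)
    mask[0, 1] = False  # ReID query cannot attend to the Color key
    mask[1, 0] = False  # Color query cannot attend to the ReID key
    return mask

# Usage: scores = scores.masked_fill(~reid_color_mask(N), float("-inf")) before
# the softmax. Direct ReID <-> Color attention is blocked, but, as noted above,
# the two class tokens can still influence each other's weights on the spatial tokens.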
S2A Self-Attention (Ours ☺️)
performs self-attention in two steps, so the Color token (appearance bias) no longer influences the weights of the ReID token (biometrics) and vice versa, exactly like Masked Self-Attention, and again an example of 0% overlap between biometrics and appearance bias. By adjusting the weights used to average the two sets of spatial tokens, one aware of biometrics and the other aware of appearance bias, we can control which signal gets more weight (biometrics or appearance bias). The current hyperparameter gives equal weight to both (1/2 each).
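The sketch below illustrates the two-step idea under the same assumed token layout (0 = ReID, 1 = Color, rest = spatial), using single-head attention on already-projected q, k, v; the 0.5/0.5 mixing is the equal-weight setting mentioned above, and the rest is an assumption rather than the released implementation.

# Illustrative two-step (S2A-style) attention sketch; q, k, v have shape (B, N, D).
import torch
import torch.nn.functional as F

def s2a_attention(q, k, v, alpha: float = 0.5):
    spatial = list(range(2, q.shape[1]))
    idx_r = [0] + spatial  # step 1: ReID token + spatial tokens only
    idx_c = [1] + spatial  # step 2: Color token + spatial tokens only

    out_r = F.scaled_dot_product_attention(q[:, idx_r], k[:, idx_r], v[:, idx_r])
    out_c = F.scaled_dot_product_attention(q[:, idx_c], k[:, idx_c], v[:, idx_c])

    # Neither class token ever appears in the other's softmax, so neither can
    # influence the other's weights. The two spatial outputs are averaged with
    # weight alpha (biometrics-aware) vs. 1 - alpha (appearance-bias-aware);
    # alpha = 0.5 is the equal-weight setting mentioned above.
    out = torch.empty_like(q)
    out[:, 0] = out_r[:, 0]   # ReID token output
    out[:, 1] = out_c[:, 0]   # Color token output
    out[:, 2:] = alpha * out_r[:, 1:] + (1 - alpha) * out_c[:, 1:]
    return out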
► An alternative to our S2A self-attention would be to use two transformers, one for ReID and the other for appearance bias, which is computationally impractical for deployment; many ReID works currently deploy exactly this as two ResNets or two transformers, one for biometrics and the other for appearance bias (e.g., diffusion models for clothes, LLMs for clothing descriptions).
► Another alternative to S2A self-attention would be to simply leak information between biometrics and appearance bias by sharing the backbone, most famously done by CAL for CC-ReID. In transformers, that corresponds to "Traditional Self-Attention".
► An alternative to using color would be to use "traditional" integer clothing annotations instead. However, colors are more expressive than integer clothing labels.
► Another alternative would be LLM-based fine-grained clothing descriptions, which are computationally infeasible: a fine-grained description must be generated per frame for video, as clothing may change across the video.
The same clothing label but different K-means clusters of color embeddings indicates that the model takes illumination and other environmental factors into account. Depending on how you see it, this can be a limitation ("noisy clothing labels") or a strength: the color embeddings capture what is happening at that exact moment, i.e., they are "adaptive".
Different clothes but similar K-means clusters of color embeddings indicate the true limitation of colors: colors are computed over the overall frame, which may lead the model to relate different images that merely share similar foreground and background colors. Possible solution: horizontal/vertical splits of colors, or finer-grained colors?
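A minimal sketch of the kind of analysis behind these observations: cluster the learned color embeddings with K-means and compare the clusters against clothing labels. The array names (color_embs, clothes_ids), the number of clusters, and the agreement metric are illustrative assumptions.

# Sketch: compare K-means clusters of color embeddings with clothing labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_vs_clothes(color_embs: np.ndarray, clothes_ids: np.ndarray, k: int = 20):
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(color_embs)
    # High agreement: color clusters track clothing labels.
    # Low agreement: illumination/background effects, or the failure case above.
    return adjusted_rand_score(clothes_ids, clusters), clusters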
@InProceedings{Pathak_2025_ICCV,
author = {Pathak, Priyank and Rawat, Yogesh S},
title = {Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
}