SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models

* Equally contributing first authors, Equally contributing second authors
University of Central Florida

Motivated by the advances of Large Multimodal Models (LMMs) on vision-language tasks and by their limitations in understanding and generating unbiased responses, we present the Stereotype Bias Benchmark (SB-Bench). This benchmark enables researchers to conduct more effective bias assessments, ultimately contributing to improved stereotype debiasing in AI models. By addressing the limitations of previous benchmarks, SB-Bench paves the way for fairer and more inclusive LMMs, ensuring that these powerful AI systems serve diverse communities equitably.

Figure: (Left) The image presents a scenario where a family is selecting a babysitter between a university student and a transgender individual. Notably, all LMMs exhibit bias by consistently favoring the university student as the more trustworthy choice. These responses highlight how LMMs reinforce societal stereotypes, underscoring the need for improved bias evaluation and mitigation strategies. (Right) The SB-Bench includes nine diverse domains and 60 sub-domains to rigorously assess the performance of LMMs in visually grounded stereotypical scenarios. SB-Bench comprises over 7.5k questions on carefully curated non-synthetic images.



Abstract

Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, addressing and mitigating inherent biases related to stereotypes, harmful generations, and ambiguous assumptions in real-world scenarios has become essential. However, existing datasets evaluating stereotype biases in LMMs often lack diversity and rely on synthetic images, leaving a gap in bias evaluation for real-world visual contexts. To address this gap, we introduce the Stereotype Bias Benchmark (SB-Bench), the most comprehensive framework to date for assessing stereotype biases across nine diverse categories with non-synthetic images. SB-Bench rigorously evaluates LMMs through carefully curated, visually grounded scenarios, challenging them to reason accurately about visual stereotypes. It offers a robust evaluation framework featuring real-world visual samples, image variations, and multiple-choice question formats. By introducing visually grounded queries that isolate visual biases from textual ones, SB-Bench enables a precise and nuanced assessment of a model’s reasoning capabilities across varying levels of difficulty. Through rigorous testing of state-of-the-art open-source and closed-source LMMs, SB-Bench provides a systematic approach to assessing stereotype biases in LMMs across key social dimensions. This benchmark represents a significant step toward fostering fairness in AI systems and reducing harmful biases, laying the groundwork for more equitable and socially responsible LMMs. Our code and dataset are publicly available.

SB-Bench provides a more rigorous and standardized evaluation framework for next-generation multimodal LMMs.

Main contributions:
  1. Stereotype Bias Benchmark (SB-Bench): We introduce SB-Bench, a diverse multiple-choice benchmark featuring 7,500 non-synthetic visual samples spanning nine categories and 60 sub-categories of social biases, providing a more accurate reflection of real-world contexts.
  2. Visually Grounded Scenarios: SB-Bench is meticulously designed to introduce visually grounded scenarios, explicitly separating visual biases from textual biases. This enables a focused and precise evaluation of visual stereotypes in LMMs.
  3. Comprehensive Evaluation: We benchmark both open-source and closed-source LMMs, along with their various scale variants, on SB-Bench. Our analysis highlights critical challenges and provides actionable insights for developing more equitable and fair multimodal models.

SB-Bench Dataset Overview

Table: Comparison of various LMM evaluation benchmarks with a focus on stereotype bias. Our approach is one of only three that assess nine bias types; unlike B-AVIBench, it is based on real images, and unlike the open-ended BiasDora, it is easy to evaluate thanks to its multiple-choice design. Question types are classified as 'ITM' (Image-Text Matching), 'OE' (Open-Ended), or 'MCQ' (Multiple-Choice).

SB-Bench comprises nine social bias categories.

Table: Bias Types: Examples from the nine bias categories. The source that identifies each bias is reported.



Dataset Collection and Verification Pipeline

Figure: SB-Bench pipeline: We start with a text-based bias evaluation question for a stereotype, consisting of a descriptive text context detailing the scene and a bias-probing question. A visual query generator then transforms this context into a search-friendly query, retrieving real-world images from the web. The retrieved images are filtered using CLIP to ensure relevance. The visual information remover anonymizes textual references to prevent explicit leakage. The resulting text is paired with the selected visual content and the bias-probing question to create the multimodal bias evaluation benchmark.
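For concreteness, below is a minimal sketch of how the CLIP relevance-filtering step could be implemented with the Hugging Face transformers library; the model checkpoint, similarity threshold, and function names are illustrative assumptions rather than the exact pipeline code.

```python
# Minimal sketch of CLIP-based relevance filtering for web-retrieved images.
# The checkpoint, threshold, and helper names are assumptions, not the
# authors' exact implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_relevant(image_paths, query, threshold=0.25):
    """Keep images whose CLIP image-text similarity to `query` exceeds `threshold`."""
    kept = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
            # Cosine similarity between the CLIP image and text embeddings.
            sim = torch.nn.functional.cosine_similarity(
                outputs.image_embeds, outputs.text_embeds
            ).item()
        if sim >= threshold:
            kept.append((path, sim))
    # Rank the surviving images by similarity, most relevant first.
    return sorted(kept, key=lambda x: x[1], reverse=True)
```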

Figure: Paired Images Pipeline: For dual-image queries, the Dual Query Generator creates two separate queries, each independently sent to a web search. We then retrieve the top 5 images per query and generate 25 paired combinations by stitching images side by side. This approach differentiates dual-image queries from single-image queries, ensuring diverse pairwise representations.
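The side-by-side stitching step for dual-image queries can be sketched as follows, assuming Pillow for image handling; the resizing strategy and function names are illustrative assumptions.

```python
# Sketch of building up to 25 side-by-side pairs from two sets of 5 retrieved
# images, following the paired-image pipeline described above.
from itertools import product
from PIL import Image

def stitch_side_by_side(left_path, right_path, height=512):
    """Resize both images to a common height and paste them onto one canvas."""
    left = Image.open(left_path).convert("RGB")
    right = Image.open(right_path).convert("RGB")
    left = left.resize((int(left.width * height / left.height), height))
    right = right.resize((int(right.width * height / right.height), height))
    canvas = Image.new("RGB", (left.width + right.width, height), "white")
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    return canvas

def build_pairs(query_a_images, query_b_images):
    """Cross-product of the top-5 results per query -> up to 25 stitched pairs."""
    return [
        stitch_side_by_side(a, b)
        for a, b in product(query_a_images[:5], query_b_images[:5])
    ]
```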

Data Statistics

Data statistics of SB-Bench, showing the nine diverse social bias categories and 60 sub-domains. Our dataset contains over 7.5K high-quality multiple-choice pairs in total.

Figure: A detailed breakdown of categories from SB-Bench. Each pie chart represents a specific bias category, displaying its sub-categories and their distribution by percent.

Figure: We present qualitative examples from both an open-source model, InternVL-2-8B, and a closed-source model, GPT-4o-mini, showcasing their performance on success cases (first row) and failure cases (second row) across various stereotype categories in SB-Bench. For failure cases, we also provide the reasoning generated by both LMMs to explain their outputs. For instance, the Socio-Economic Status example mentions two common professions, a chemical engineer and a line cook, and asks who performed better in school. The Descriptive Text Context is ambiguous and does not point to any attested bias. However, the model's implicit assumption that a chemical engineer's profession requires a stronger academic foundation than a line cook's leads it to generate a biased answer. These examples highlight the models' gaps in generating unbiased answers in ambiguous scenarios and their tendency to fall back on inherent biases.



Experimental results on SB-Bench

We present our evaluations of nine recent state-of-the-art LMMs in the sections below. We also highlight several key observations and analyses, and show how simple prompting techniques can help improve LMM performance.
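As a concrete illustration of the evaluation protocol, the sketch below shows how a single SB-Bench item (image, descriptive context, bias-probing question, and answer options) might be posed to a closed-source LMM through the OpenAI chat API; the prompt template and model name are assumptions, not the exact prompts used in our experiments.

```python
# Hedged sketch: posing one multiple-choice SB-Bench item to a closed-source LMM
# via the OpenAI chat API. Prompt wording and model name are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_mcq(image_path, context, question, options, model="gpt-4o-mini"):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    letters = "ABCDE"[: len(options)]
    prompt = (
        f"{context}\n\n{question}\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
        + "\nAnswer with a single letter."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()
```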

Performance of Open- and Closed-Source LMMs on SB-Bench

In the bar plot below and in the main table, we present results for both open-source and closed-source models on SB-Bench.

Evaluation of various LMMs on SB-Bench shows that proprietary models such as GPT-4o and Gemini-1.5-Flash outperform open-source models in fairness scores. Among open-source models, InternVL2 stands out as the best at generating non-stereotypical responses.


Main findings and Qualitative Results

We benchmark nine state-of-the-art open- and closed-source LMMs on SB-Bench, evaluating their performance across different model families and scales. Our analysis highlights performance gaps and biases, providing insights for fairer multimodal models. SB-Bench consists of 7,500 MCQs. To mitigate evaluation bias, we use a two-fold ablation strategy: analyzing the correlation between MCQ and open-ended responses, and randomizing the order of the multiple-choice options, as sketched below.
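A minimal sketch of the option-randomization step follows; the data format, seeding, and example options are illustrative assumptions.

```python
# Minimal sketch of the option-shuffling ablation: permute the answer choices
# and track where the original correct answer lands.
import random

def shuffle_options(options, answer_index, seed=None):
    """Return (shuffled_options, new_answer_index) for one MCQ item."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)

# Illustrative options loosely based on the Socio-Economic Status example above;
# the actual answer options in SB-Bench may differ.
opts = ["the chemical engineer", "the line cook", "cannot be determined"]
shuffled, new_idx = shuffle_options(opts, answer_index=2, seed=0)
```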

1) Overall Results. The overall results show that closed-source models like GPT-4o (10.79% bias score) and Gemini-1.5-Flash exhibit significantly lower bias than open-source models, where the best open-source model, InternVL-2-8B, has a 62% bias score. GPT-4o demonstrates fairness across bias categories, with 23.99% bias in Age and 12.05% in Disability. Among open models, Qwen2-VL-7B-Instruct, Phi-3.5-Vision, and LLaVA-OneVision perform competitively, while Molmo-7B has the highest bias (90.92%). InternVL-2 excels in Race/Ethnicity (42% accuracy) but struggles in Age (81% bias score).

2) Uncovering Implicit Biases in LMMs' Reasoning. Our analysis of LMMs' reasoning on SB-Bench reveals implicit biases in ambiguous contexts, where models often rely on stereotypes. For example, GPT-4o-mini exhibits bias in the Religion category by associating Buddhism with altruism in a charity-related question. The overall consistency between MCQ accuracy and open-ended explanations is 92.4%, highlighting a strong correlation but also exposing gaps in stereotype mitigation. The figure below quantifies these biases, emphasizing the need for better fairness strategies in LMMs.

3) Assessing bias in LMMs across modalities. Our analysis reveals that adding visual input amplifies bias in LMMs compared to their base LLMs. InternVL2 shows a 22% higher overall bias than InternLM2, with increases of 36% in Nationality, 29% in Religion, and 26% in Socio-Economic Status (SES). Phi-3's bias rises by 15.72%, while Qwen2 and LLaMA-3.2 exhibit smaller increases of 6.38% and 9.80%, respectively. These findings highlight the need for SB-Bench to evaluate and mitigate biases effectively in vision-language models.


4) Impact of model scale on stereotype biases. Larger LMMs generally exhibit improved fairness. GPT-4o shows a 19.1% reduction in bias compared to GPT-4o-mini, while LLaVA-OneVision's bias drops by 10% when scaling from 7B to 72B. Qwen2-VL improves the most, reducing bias by 35.9% from 7B to 72B. InternVL-2 also shows a 35.1% fairness boost from 4B to 40B. Bias reduction varies by category: Sexual Orientation bias drops from 90% (2B) to 10% (40B), while Age, Race/Ethnicity, and Religion decrease more gradually.

5) Stability of Benchmark. SB-Bench demonstrates stability across evaluation conditions. To mitigate selection bias in multiple-choice VQAs, we randomized answer choices, resulting in a standard deviation of ±2.12% for Qwen2-VL-7B and ±0.50% for InternVL-2-8B. Similarly, for stitched-image evaluations in categories like Nationality and Religion, results remained consistent, with a deviation of ±0.27% for Qwen2-VL-7B and ±1.83% for InternVL-2-8B. These findings validate the benchmark’s robustness and reliability.


                 Qwen2-VL   InternVL2
Average Bias     69.38%     62.00%
Option Shuffle   ±0.27%     ±1.83%
Image Shuffle    ±2.12%     ±0.50%

Table: We evaluate the standard deviation for Qwen2-VL-7B and InternVL2-8B models on randomized multiple-choice orders and shuffled images in the paired image setting. Both models exhibit low variability and are consistent.
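For illustration, the sketch below shows one way the per-model bias score and the shuffle standard deviations in the table above could be aggregated; treating the bias score as the percentage of stereotype-aligned answers is an assumption for this sketch, and the exact metric is defined in the main paper.

```python
# Hedged sketch of aggregating predictions into a bias score and computing the
# stability of that score across shuffled evaluation runs.
from statistics import mean, pstdev

def bias_score(predictions, stereotyped_choices):
    """Percentage of items answered with the stereotype-aligned option (assumed metric)."""
    hits = sum(p == s for p, s in zip(predictions, stereotyped_choices))
    return 100.0 * hits / len(predictions)

def shuffle_stability(scores_per_run):
    """Mean and standard deviation of bias scores across shuffled runs."""
    return mean(scores_per_run), pstdev(scores_per_run)

# Example: three option-shuffled runs of the same model.
runs = [62.3, 61.8, 62.0]
avg, std = shuffle_stability(runs)  # avg ≈ 62.0, std ≈ 0.2
```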


Qualitative examples from GPT-4o on our SB-Bench dataset

1) Qualitative examples from categories: We present some qualitative examples of our various social bias categories.


Conclusion

In this paper, we introduce SB-Bench, a novel benchmark designed to evaluate stereotype biases in Large Multimodal Models (LMMs) through visually grounded contexts. SB-Bench comprises over 7.5k non-synthetic multiple-choice questions across nine domains and 60 sub-domains. We conduct an empirical analysis of nine LMMs, including scaled variants from four model families (LLaVA-OneVision, InternVL2, Qwen2-VL, and GPT-4o), uncovering significant performance disparities. Notably, the best open-source model, InternVL2-8B, lags behind the proprietary GPT-4o by 51.21% in fairness scores. Our findings reveal that LMMs exhibit the highest bias in categories such as Nationality, Age, and Appearance, while performing relatively better on Race/Ethnicity and Gender Identity. Additionally, fairness improves with model scaling, yet bias remains more pronounced in LMMs compared to their corresponding LLM counterparts. This work underscores the limitations of state-of-the-art LMMs in mitigating social biases, highlighting key areas for future improvement.


For additional details about SB-Bench evaluation and experimental results, please refer to our main paper. Thank you!

BibTeX

@article{narnaware2025sb,
      title={SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models},
      author={Narnaware, Vishal and Vayani, Ashmal and Gupta, Rohit and Sirnam, Swetha and Shah, Mubarak},
      journal={arXiv preprint arXiv:2502.08779},
      year={2025}
    }