Auditing Image-based NSFW Classifiers for Content Filtering

This paper examines NSFW (Not Safe For Work) image classifiers for content filtering. Through an audit of three prevalent NSFW classifiers, we analyze the relationship between NSFW predictions and three demographic factors: gender, skin-tone, and age. Our study reveals that images of women are misclassified as NSFW disproportionately more often than images of men, even when they depict common daily-life activities. Additionally, we find that NSFW classifiers tend to mispredict images of people with lighter skin-tones and images depicting younger people. We explore the causes of such mispredictions by analyzing explanatory pixel maps, which reveal some of the reasons behind the misclassifications. The implications of our findings become particularly salient when considering the application of filters based on NSFW classifiers, which we identify as having a direct impact on image datasets, computer vision models, generative AI, user experience, and artistic creativity. In summary, we hope our study brings attention to the inherent biases within NSFW classifiers and underscores the importance of addressing these issues to ensure fair and equitable outcomes in content filtering.


INTRODUCTION
Datasets are an integral part of the development and optimization of machine learning products. They serve various purposes, from model training and parameter selection to performance evaluation and benchmarking against other models. Specifically in the field of computer vision, the emergence of large open-source annotated datasets like ImageNet [15], MSCOCO [37], and OpenImages [35] facilitated the advancement of deep learning models that rely heavily on extensive data [25,34,44]. In recent years, there has been a surge in the collection of multimodal datasets comprising image and text pairs to meet the escalating demand for vast amounts of data. Whereas some data collections have been made publicly available for anyone to use and scrutinize, such as Google Conceptual Captions [11,54], RedCaps [16], or LAION [51,52], others remain confidential and opaque. Examples of the latter include ALIGN [30], ALT200M [28], and the datasets used for training large multimodal models such as CLIP [44], DALL-E [45,46], Parti [69], or Imagen [50]. In either case, large multimodal datasets are collected through automated web crawling, which enables the aggregation of billions of samples. Nevertheless, as the scale increases, so do the data-related challenges, including issues of representation, consent, and the presence of toxic content [7-9].
With respect to toxic, offensive, or abusive content, a common approach to prevent undesirable samples from becoming part of a dataset is to apply filters during the data collection process. Common filters applied to images include restrictions on their format (e.g., only jpg or png), size (e.g., more than 5 kilobytes), aspect ratio (e.g., maximum ratio of larger to smaller dimension of 2.5), license (e.g., only Creative Commons), provenance (e.g., only images hosted on Flickr), or content (e.g., images not flagged by an NSFW classifier). In this work, we are interested in auditing the use of NSFW classifiers for content filtering and analyzing their implications for datasets and models.
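As a rough illustration of the metadata filters listed above, the following sketch applies the example thresholds from the text (jpg or png only, more than 5 kilobytes, aspect ratio of at most 2.5). The function name, its parameters, and the convention that 1 kilobyte equals 1,000 bytes are assumptions made for illustration; real collection pipelines differ in which filters they apply and how.

```python
from pathlib import PurePath

# Example thresholds taken from the text; all names here are hypothetical.
ALLOWED_EXTENSIONS = {".jpg", ".png"}
MIN_SIZE_BYTES = 5_000      # "more than 5 kilobytes" (assuming 1 kB = 1,000 bytes)
MAX_ASPECT_RATIO = 2.5      # larger dimension divided by smaller dimension


def passes_basic_filters(filename: str, size_bytes: int, width: int, height: int) -> bool:
    """Return True if an image passes the format, size, and aspect-ratio filters."""
    # Format filter: compare the file extension case-insensitively.
    if PurePath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return False
    # Size filter: the file must be strictly larger than the minimum size.
    if size_bytes <= MIN_SIZE_BYTES:
        return False
    # Guard against degenerate dimensions before dividing.
    if min(width, height) <= 0:
        return False
    # Aspect-ratio filter: larger side over smaller side.
    return max(width, height) / min(width, height) <= MAX_ASPECT_RATIO
```

A content filter based on an NSFW classifier would be one more predicate in this chain, which is why its false positives translate directly into discarded samples.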
NSFW (Not Safe for Work) is an Internet acronym used to flag content as inappropriate, usually due to its sexual, violent, or otherwise offensive nature. An NSFW classifier is a machine learning model specifically designed to identify whether a sample, be it an image, a video, or a piece of text, falls into the NSFW category. By using NSFW classifiers for content filtering, sexual, violent, or otherwise offensive content can ideally be identified and removed. While NSFW or toxicity detectors have been extensively studied in the natural language domain for text inputs [3,13,20,42,43], the efficacy of image-only NSFW classifiers and their limitations are not well studied. The opaqueness of their training process, stemming from the nature of NSFW images, makes benchmarking these types of classifiers challenging. As far as we know, no study has yet examined image-based NSFW classifiers in depth, particularly with respect to the correlations between the prediction of NSFW content and demographic factors such as gender, race, or age. We argue that such correlations can inadvertently perpetuate and amplify prejudices within the filtered content, raising questions about the ethical implications of automated content filtering.
In this paper, we conduct a comprehensive examination of three image-based NSFW classifiers recently used for filtering content in multimodal datasets and computer vision models [16,49,52]. The three classifiers use different architectures, with one using standard convolutional neural networks (CNNs) and the other two relying on multimodal CLIP embeddings [44]. As training is conducted by different individuals and institutions, we assume that the three classifiers are trained on different datasets, although few specific training details are available. We analyze the False Positive Rate (FPR) of each classifier on two evaluation datasets [21,71] that contain people but are free of NSFW images, i.e., all the images are Safe for Work (SFW), and compare it against a control dataset without images of humans [2]. Then, we investigate differences in the FPR across perceived gender, skin-tone, and age. Our findings reveal a concerning trend: images of women are misclassified as NSFW at a disproportionately higher rate than images of men. This discrepancy not only underscores the limitations of existing NSFW classifiers but can also amplify the already pronounced representational gap between men and women in digital content [17]. We also find discrepancies in the FPR according to skin-tone, with images of lighter skin-tone people exhibiting higher FPR, and age, with images of younger people exhibiting higher FPR. An in-depth analysis with explainable artificial intelligence (AI) techniques reveals that some of the pixels contributing most to the misclassification of images as NSFW in all three classifiers are those associated with female faces.
We conclude the paper by examining the repercussions of gender bias in NSFW classifiers for content filtering, raising discussions about its effects on image datasets, computer vision models, generative AI, user experience, and artistic creativity. We hope that our work raises awareness and promotes discussion regarding the limitations of content filtering algorithms. By shedding light on the complex interplay between visual cues, biases, and the challenges associated with effectively mitigating explicit content in multimodal datasets, we aim to stimulate further exploration in this domain.

RELATED WORK

Toxicity in Image Datasets
In computer vision and machine learning research, the choice of training datasets plays a crucial role in shaping the performance and ethical considerations of models. Unfortunately, several widely used image datasets have been found to contain toxic and problematic content, ranging from offensive imagery to explicit and non-consensual material. Take, for instance, ImageNet [15], the image classification dataset that facilitated the emergence and popularity of convolutional neural networks [34]. Despite efforts to curate its labels by removing 1,593 out of 2,832 inappropriate categories from the WordNet [19] person sub-tree [68], subsequent scrutiny by Birhane and Prabhu [8] revealed the persistence of non-consensual and explicit content. In the same work, Birhane and Prabhu [8] uncovered that the Tiny Images dataset [62], containing 80 million low-resolution images sourced from Internet search engines for image classification tasks, included derogatory terms as labels and offensive visual content, leading to its official withdrawal [61]. Similar trends have been observed in large multimodal datasets, such as LAION-400M [52], a dataset of 400 million text-image pairs derived from web page alt-text and used to train generative AI models such as Stable Diffusion [49]. LAION-400M, analyzed by Birhane et al. [9], retained problematic image and text pairs depicting rape, racism, and explicit content. Moreover, a recent study [7] shows that dataset scale exacerbates hateful content. Along these lines, LAION-5B [51], the latest and largest iteration of the LAION datasets with 5 billion text-image pairs, was recently taken down after the identification of thousands of instances of suspected child sexual abuse material [60]. Other popular multimodal datasets, such as the widely used MSCOCO [37] and Google Conceptual Captions [54], both envisioned for training image captioning models, have been flagged for unbalanced representations in terms of gender and skin-tone [21,71]. Overall, scrutinizing these datasets highlights the challenges in ensuring ethical, safe, and non-toxic samples in training datasets.

NSFW Classifiers for Content Filtering
Manually removing toxicity from large image datasets requires a significant amount of resources. Additionally, visually inspecting millions of images to check whether the depicted content is potentially harmful has been found to have detrimental effects on the mental health of annotators [14,57,58]. As a result, some authors choose to formally withdraw datasets upon discovering inappropriate content [51,62]. An alternative approach involves implementing NSFW classifiers to detect and remove explicit or inappropriate content automatically. NSFW classifiers can take the form of various architectures, from CNNs [25,34,55] to multimodal approaches that combine text and image information [44]. Moreover, the rise of image generation models [45,46] has amplified the risk of producing toxic images. In response, some image generation models [49] now incorporate NSFW classifiers to filter outputs that may be considered toxic or inappropriate.

METHODOLOGY
Our audit of image-based NSFW classifiers consists of evaluating three different models on two evaluation datasets and a control dataset. Specifically, we evaluate their performance according to the perceived gender, skin-tone, and age of the people in the images and study the disparities in misclassification rates associated with these demographic factors.

NSFW Classifiers under Evaluation
We analyze three NSFW classifiers that have recently been used for filtering inappropriate content from either datasets or AI-generated images. We select these three NSFW classifiers because of their presence in state-of-the-art computer vision research, which indicates that they have moved beyond theoretical frameworks and are actively integrated into practical applications.
The selected NSFW classifiers, namely NSFW-CNN, CLIP-classifier, and CLIP-distance, and their main characteristics are summarized in Table 1. NSFW-CNN extracts embeddings from images with a CNN [59], whereas CLIP-classifier and CLIP-distance use pre-trained CLIP embeddings [44]. From the CLIP embeddings, CLIP-classifier predicts NSFW content with a three-layer fully connected (FC) network trained on the LAION-5B dataset [51], while CLIP-distance does not require training and classifies images as NSFW if the cosine similarity between their embeddings and a set of pre-defined concepts is above a threshold. The specific technical details for each method are provided below:
NSFW-CNN An InceptionV3 [59] CNN model from [36] trained to classify images into five categories: sexy, neutral, porn, hentai, and drawings. Given an input image, NSFW-CNN outputs a score from 0 to 1 for each category. If the score corresponding to porn is higher than 0.7, the image is classified as NSFW. The model is trained end-to-end with data collected from an NSFW data scraper [33], although the number of training samples is not specified. This model has been used in Birhane and Prabhu [8] for abusive content detection and in the RedCaps dataset [16] for content filtering.
CLIP-classifier A CLIP image encoder [44] followed by a three-layer FC classifier. The CLIP encoder is a frozen ViT L/14 network [12], and the FC classifier is trained on a subset of the LAION-5B dataset [51] (details on how the images in the subset were chosen are not specified by their authors). The CLIP-classifier outputs a single score from 0 to 1 representing the confidence of the image being NSFW, with 1 being NSFW. If the score is higher than 0.7, the image is classified as NSFW. The LAION organization (https://laion.ai) supplied the model alongside the LAION-5B dataset. However, the model was not used for filtering content in LAION-5B. Instead, it was offered to assist users in filtering the data at their discretion.
CLIP-distance A CLIP image encoder [44] followed by a distance computation. The CLIP encoder is also a frozen ViT L/14 network [12] that converts an input image to an image embedding. The cosine similarity between the image embedding and each of 17 precomputed text embeddings, each representing an NSFW concept, is computed. If the similarity between the image embedding and any of the precomputed text embeddings is over a set threshold, the image is classified as NSFW. Note that the 17 specific concepts have not been revealed. This model, which does not require training, can be found inside Stable Diffusion v1.5 [49] by HuggingFace [63] as a safety checker that detects whether a generated image is NSFW and, if so, returns a blacked-out image instead of the generated one. A similar approach is used in the LAION-400M dataset [52] for content filtering.
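The three decision rules described above can be sketched as follows. This is a minimal illustration, not the classifiers' actual code: the function names are hypothetical, the embeddings are plain lists of floats standing in for CLIP outputs, and the CLIP-distance threshold is left as a parameter because the text does not state its value. Embeddings are assumed to be non-zero vectors.

```python
import math
from typing import Mapping, Sequence

SCORE_THRESHOLD = 0.7  # value stated in the text for NSFW-CNN and CLIP-classifier


def nsfw_cnn_flag(category_scores: Mapping[str, float]) -> bool:
    """NSFW-CNN rule: only the 'porn' score is compared against the threshold."""
    return category_scores["porn"] > SCORE_THRESHOLD


def clip_classifier_flag(nsfw_score: float) -> bool:
    """CLIP-classifier rule: a single confidence score compared against the threshold."""
    return nsfw_score > SCORE_THRESHOLD


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_distance_flag(
    image_embedding: Sequence[float],
    concept_embeddings: Sequence[Sequence[float]],
    threshold: float,
) -> bool:
    """CLIP-distance rule: flag if ANY concept is similar enough to the image."""
    return any(
        cosine_similarity(image_embedding, concept) > threshold
        for concept in concept_embeddings
    )
```

Because CLIP-distance flags an image as soon as a single concept exceeds the threshold, every additional concept can only add flagged images, never remove them.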

Evaluation Datasets
We evaluate the three NSFW classifiers on two annotated subsets of popular image datasets: the GCC dataset [54] with PHASE annotations [21] and the MSCOCO dataset [37] with Zhao et al.'s annotations [71]. The details of each dataset are provided below:

Control Dataset
Additionally, we use a control dataset with images without people: the PASS dataset [2].
PASS The Pictures without humAns for Self-Supervision (PASS) dataset [2] is a collection of about 1.4 million unlabeled images that do not include any pictures of humans. It is designed to prevent issues with privacy, data protection, and ethics. We use PASS as a control benchmark to examine how each of the NSFW models performs when given images that do not contain people. To compare results fairly in terms of scale, we use a random subset of 11,685 images.

Experimental Details
We run our experiments on Python 3.11.5 with PyTorch 2.1 [40] and TensorFlow 2.14.0 [1] on a single GeForce RTX 3070 GPU. We do not re-train any of the three NSFW classifiers, but use them off-the-shelf as provided by their authors. Input images are resized to 299 × 299 pixels for NSFW-CNN and 224 × 224 pixels for both CLIP-classifier and CLIP-distance.

IMAGE-BASED NSFW CLASSIFIERS AUDIT
We conduct our audit in four phases. In the initial phase (Section 4.1), we benchmark the three NSFW classifiers by comparing their performance across the evaluation and control datasets. In the second phase (Sections 4.2, 4.3, and 4.4), our focus shifts towards analyzing the relationship between demographic attributes and NSFW predictions: Section 4.2 focuses on gender, Section 4.3 on skin-tone, and Section 4.4 on age. In the third phase (Section 4.5), we investigate the specific image regions triggering the NSFW classifiers by exploring pixel importance maps generated with explainable AI tools. In our final analysis (Section 4.6), we discuss the implications of the relationship between demographics and NSFW misclassification rates.

Comparative Evaluation
First, we compare the performance of the three NSFW classifiers on the two evaluation datasets and the control dataset. Specifically, the performance of each NSFW classifier is measured as the False Positive Rate (FPR). Given an image $x$ as input, an NSFW classifier $f$, which outputs a confidence value $f(x)$ indicating how likely $x$ is to be NSFW, predicts whether the image is NSFW or not as
$$\hat{y} = \begin{cases} 1 & \text{if } f(x) > h, \\ 0 & \text{otherwise,} \end{cases}$$
with $\hat{y} = 1$ if the image is predicted as NSFW, and $\hat{y} = 0$ otherwise, where $h$ is a predefined threshold. As all the samples in the evaluation and the control datasets are safe for work (SFW), their ground truth label, $y$, is always 0. The FPR is computed as the number of incorrectly predicted NSFW images over dataset $\mathcal{D}$ (either control, GCC, or MSCOCO) as
$$\mathrm{FPR} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \hat{y},$$
where $|\mathcal{D}|$ gives the number of images in $\mathcal{D}$.
Note that a low FPR is not always desirable, especially if achieved at the expense of a high False Negative Rate (FNR), which may lead to the classification of numerous inappropriate images as safe. Nevertheless, FNR is not computed due to a lack of properly annotated NSFW datasets. Thus, our analysis centers on comparing the performance of the NSFW classifiers on the evaluation datasets with their behavior on the control set. For completeness, we also report the average confidence score of the NSFW classifiers, given as
$$\bar{f} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} f(x).$$
Results are shown in Table 3. On the control dataset, CLIP-classifier achieves the lowest FPR, with only a single image misclassified as NSFW, while NSFW-CNN and CLIP-distance misclassify 10 and 131 images, respectively. The FPRs for NSFW-CNN and CLIP-distance on MSCOCO are similar to those on the control set, but they increase substantially on the GCC dataset. The GCC evaluation set seems to contain images that are generally more challenging for all the models to classify. Among the three datasets, GCC has the most lenient collection method, potentially resulting in a higher frequency of NSFW-like images. Finally, the CLIP-classifier performance is substantially different between the two evaluation sets featuring people, MSCOCO and GCC, and the control set without people. The FPR increasing from 0.009% on the control dataset to 7.509% on the GCC dataset indicates a strong correlation between images of people and NSFW content within the internal representations of this model.
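The two metrics used in this comparison can be sketched in a few lines. This is an illustrative implementation under the assumption that every image in the dataset is SFW, so any positive prediction is a false positive; the function names are hypothetical, and the FPR is returned as a percentage to match how the values are quoted in the text.

```python
from typing import Sequence


def predict_nsfw(confidence: float, threshold: float) -> int:
    """Return 1 if the confidence exceeds the threshold (predicted NSFW), else 0."""
    return 1 if confidence > threshold else 0


def false_positive_rate(confidences: Sequence[float], threshold: float) -> float:
    """FPR in percent over a dataset in which every image is SFW."""
    if not confidences:
        raise ValueError("the dataset must contain at least one image")
    flagged = sum(predict_nsfw(c, threshold) for c in confidences)
    return 100.0 * flagged / len(confidences)


def average_confidence(confidences: Sequence[float]) -> float:
    """Mean classifier confidence over the dataset."""
    if not confidences:
        raise ValueError("the dataset must contain at least one image")
    return sum(confidences) / len(confidences)
```

As a sanity check on scale, a single flagged image in a control subset of 11,685 images gives 100 × 1 / 11,685 ≈ 0.0086%, which is consistent with the 0.009 quoted for CLIP-classifier on the control dataset.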

Gender Examination
Next, we examine how the perceived gender in input images influences the predictions made by the NSFW classifiers. As both the GCC and MSCOCO datasets are unbalanced and contain more images of men than of women, we compute the FPR per gender, $\mathrm{FPR}_g$ with gender $g \in \{\text{woman}, \text{man}\}$, as
$$\mathrm{FPR}_g = \frac{1}{|\mathcal{D}_g|} \sum_{x \in \mathcal{D}_g} \hat{y},$$
where $\mathcal{D}_g \subset \mathcal{D}$ only contains images with gender $g$.
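The per-group FPR can be sketched as below. The same hypothetical helper applies unchanged to any demographic annotation (gender, skin-tone, or age), since only the group labels differ. It assumes, as in the audit, that all images are SFW, and it reports percentages.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def fpr_by_group(predictions: Sequence[int], groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-group FPR in percent.

    predictions: 0/1 NSFW predictions for images that are all SFW.
    groups: the demographic annotation of each image, in the same order.
    """
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    flagged = defaultdict(int)  # number of false positives per group
    total = defaultdict(int)    # number of images per group
    for y_hat, group in zip(predictions, groups):
        total[group] += 1
        flagged[group] += y_hat
    return {group: 100.0 * flagged[group] / total[group] for group in total}
```

Normalizing by the size of each group, rather than by the size of the whole dataset, is what makes the rates comparable when one group is far more frequent than the other.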
Results are shown in Table 4. Images with perceived women are misclassified as NSFW at higher rates than images with perceived men. The difference is disproportionately high for CLIP-classifier on the GCC dataset, reaching an alarming margin of 17.9%. The gender disparity, which appears in all three NSFW classifiers, is more pronounced in the GCC dataset than in the MSCOCO dataset. Some examples are shown in Figures 1 and 2 for the MSCOCO and GCC datasets, respectively. Upon inspecting the images, we find that most pictures of women depict them engaging in innocuous and common activities like sports, eating, or posing for a camera. For men, a significant portion of the limited number of images classified as NSFW showcases characteristics associated with femininity or gender nonconformity. This suggests that NSFW classifiers tend to flag the mere presence of attributes perceived as feminine or non-masculine.

Skin-Tone Examination
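We analyze the relationship between skin-tone and NSFW predictions. We compute the FPR per skin-tone, $\mathrm{FPR}_s$ with skin-tone $s \in \{\text{darker}, \text{lighter}\}$, as
$$\mathrm{FPR}_s = \frac{1}{|\mathcal{D}_s|} \sum_{x \in \mathcal{D}_s} \hat{y},$$
where $\mathcal{D}_s \subset \mathcal{D}$ is the subset of images annotated with skin-tone $s$.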
Results are shown in Table 5. Notably, all three classifiers exhibit a higher rate of false positives for images featuring individuals with perceived lighter skin-tones compared to those with darker skin-tones. In line with the analysis of gender bias, the CLIP-classifier on the GCC dataset shows the most substantial difference in the FPR, although the disparities are less pronounced than in the gender evaluation. Note that the number of images per class is more unbalanced than in gender, with about 8 times more individuals of lighter skin-tones than darker skin-tones. Regardless, these results suggest that skin-tone may not be as robust an indicator for NSFW classifiers as gender.

Age Examination
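The last demographic attribute we analyze is age. Similarly to gender and skin-tone, we compute the FPR per age, $\mathrm{FPR}_a$, over $\mathcal{D}_a$ with age $a \in \{\text{child}, \text{young}, \text{adult}, \text{senior}\}$, as
$$\mathrm{FPR}_a = \frac{1}{|\mathcal{D}_a|} \sum_{x \in \mathcal{D}_a} \hat{y},$$
where $\mathcal{D}_a \subset \mathcal{D}$ is the subset of images annotated with age $a$.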
Results are shown in Table 6, only for the GCC dataset, as the MSCOCO dataset does not contain age annotations. FPR is well above that of the control dataset for all age groups and classifiers. Of particular concern is the observation that the age groups most prone to misclassification are those associated with younger individuals. The Child category (0-14 years old) exhibits the highest rate of mispredicted NSFW in both the NSFW-CNN and CLIP-classifier, while the Young category (15-29 years old) has the highest mispredicted NSFW rate in the CLIP-distance classifier. For all models, many of the child images classified as NSFW are images of babies without clothes or just in diapers. This suggests that exposed skin may play a factor in classification, as will also be seen later in Section 4.5. Why children have a higher rate of false positives, however, is still unclear.

Regional Analysis
Our next evaluation involves looking into the NSFW classification mechanisms and understanding which particular regions of the image trigger the NSFW prediction.
NSFW-CNN regional analysis. We analyze the contribution of each region to the final prediction through Grad-CAM [53]. Grad-CAM is an explainable AI algorithm that generates heatmaps over the original image, representing the regions that have the most influence on the final prediction (in our case, whether the image is classified as NSFW or not). Some examples, with confidence above 0.9, are shown in Figure 3. We note the following observations: (1) Images misclassified as NSFW often depict individuals, especially those annotated as women, engaged in eating. The reason for this classification is unclear; it remains uncertain whether the model associates eating or open mouths with sexual content or if it is influenced by the prevalence of close-up shots of faces in these images. (2) Another category of frequently misclassified NSFW images involves hands, with the specific reason behind this misclassification also remaining unclear. Common to both types of misclassification is our belief that a significant amount of exposed skin contributes to the NSFW prediction.
CLIP-classifier regional analysis. In this case, we use RISE [41] to obtain pixel-level explanations of the regions with the highest contribution to the NSFW prediction. RISE is a method for empirically estimating pixel importance by masking random regions of the image and observing the differences in the model prediction. Examples of RISE heatmaps for CLIP-classifier are shown in Figure 4. The most notable observations can be summarized as follows:

(1) Images misclassified as NSFW by the CLIP-classifier share many traits with those misclassified by the NSFW-CNN model, such as images annotated as women being overwhelmingly more likely to be classified as NSFW than images annotated as men. Another similarity is the tendency to see more exposed skin as NSFW, though it seems to be a smaller factor here. (2) A common element seen in almost every picture is that the pixel-level explanations are focused on the area of the face. This is present even in images with a more sexually explicit tone, showing that this model seems to mainly use faces to decide whether an image is NSFW. (3) In the last image in Figure 4, the pixel contribution is focused around the woman's facial region, despite the image having a much more exposed man right beside her. This seems to suggest that not only does the model tend toward faces when classifying, but it tends specifically toward feminine faces, which is supported by the overwhelming majority of images classified as NSFW being annotated as women.
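The masking idea behind RISE can be sketched as follows. This is a deliberately simplified, pure-Python illustration: it uses coarse binary grid masks with nearest-neighbour upsampling on a grayscale image stored as nested lists, whereas, to our understanding, the published method uses smoothly upsampled and randomly shifted masks on full images. The function name, parameters, and defaults are hypothetical, and `model` stands for any callable that returns an NSFW confidence for an image.

```python
import random
from typing import Callable, List

Image = List[List[float]]


def rise_saliency(
    image: Image,
    model: Callable[[Image], float],
    num_masks: int = 500,
    grid: int = 2,
    keep_prob: float = 0.5,
    seed: int = 0,
) -> Image:
    """Estimate per-pixel importance by averaging scores over random masks."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for _ in range(num_masks):
        # Sample a coarse binary grid: 1 keeps a cell, 0 hides it.
        cells = [
            [1.0 if rng.random() < keep_prob else 0.0 for _ in range(grid)]
            for _ in range(grid)
        ]
        # Nearest-neighbour upsampling of the grid to the image size.
        mask = [
            [cells[i * grid // height][j * grid // width] for j in range(width)]
            for i in range(height)
        ]
        masked = [
            [image[i][j] * mask[i][j] for j in range(width)] for i in range(height)
        ]
        score = model(masked)
        # Pixels that were visible when the score was high accumulate importance.
        for i in range(height):
            for j in range(width):
                saliency[i][j] += score * mask[i][j]
    norm = num_masks * keep_prob
    return [[value / norm for value in row] for row in saliency]
```

Because the method only queries the model's output, it applies equally to a trained classifier head and to a thresholded similarity score, which is why it suits both CLIP-based classifiers.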
CLIP-distance regional analysis. For this model, we also use RISE to obtain the explanations for the predictions. In this case, we identify four themes within the misclassified images: faces, sausages, donuts, and eating. Examples are shown in Figure 5, and the most notable observations are summarized as follows: (1) We find heatmaps focusing on the facial regions of women. However, this pattern is not as prominent as in NSFW-CNN and CLIP-classifier, so we believe that facial features have a present but less pronounced effect in the CLIP-distance model. (2) Another theme shared with the previous models is classifying images of people eating as NSFW. Unlike the previous two models, however, CLIP-distance does not seem to associate the act of eating with an image being NSFW. Instead, it may see the object itself as NSFW. This can be seen in the fact that one food item the model predicts as NSFW is donuts, specifically the donut-hole area. Considering that CLIP-distance relies on embedding distances between images and textual embeddings of unknown NSFW concepts, it appears that some of those concepts may be represented similarly to those items. In this sense, the model is not capable of discerning between common food objects and NSFW content. It is important to note that neither NSFW-CNN nor CLIP-classifier classifies any of the images containing only food as NSFW; this behavior is specific to CLIP-distance.

Implications for Content Filtering
Finally, we analyze the implications of the above results, particularly when NSFW classifiers are used for filtering content in datasets, generative AI images, or social media platforms. Our examination revolves around five issues: the impact on image datasets, the impact on computer vision models, the impact on generative AI, the implications for user experience, and the implications for artistic creativity.
Impact on Image Datasets. A persistent issue in terms of social bias in image datasets is the representational gap between different demographic groups [21,71]. For example, for gender bias, the quantity of images depicting women tends to be significantly smaller than that of images depicting men. In addition to the already analyzed MSCOCO and GCC datasets, which exhibit 2.22 and 1.64 times more men than women, respectively, according to [26] the ratio of men to women in visual question answering (VQA) datasets ranges from 1.7 in GQA [29] to 2.1 in VQA 2.0 [23] and Visual7W [73]. Using an NSFW classifier to filter content during the dataset creation phase, coupled with the higher likelihood of images featuring women being misclassified as NSFW, can exacerbate the representational gap and increase the already high ratio of men to women in computer vision datasets.
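The effect described above can be made concrete with a small back-of-the-envelope calculation. The sketch below assumes that all images are SFW and that the filter discards exactly the false positives of each group, so the men-to-women ratio after filtering is the original ratio multiplied by (1 − FPR for men) / (1 − FPR for women). The function name is hypothetical, and the FPR values in the usage note are invented for illustration; they are not results from this audit.

```python
def post_filter_ratio(ratio_men_to_women: float, fpr_men: float, fpr_women: float) -> float:
    """Men-to-women ratio after removing each group's false positives.

    FPRs are given as fractions in [0, 1), not percentages.
    """
    if not (0.0 <= fpr_men < 1.0 and 0.0 <= fpr_women < 1.0):
        raise ValueError("FPRs must be fractions in [0, 1)")
    return ratio_men_to_women * (1.0 - fpr_men) / (1.0 - fpr_women)
```

For example, starting from the GCC ratio of 1.64 and hypothetical FPRs of 2% for men and 20% for women, `post_filter_ratio(1.64, 0.02, 0.20)` gives roughly 2.01, so the gap widens even though no image of either group was actually NSFW. With equal FPRs the ratio is unchanged.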
Impact on Computer Vision Models. Our findings have a direct impact on the performance of computer vision models that are trained on large multimodal datasets filtered through NSFW classifiers [44-46, 50, 69]. Models trained on unbalanced datasets not only mirror the biases present in the original data but also have the potential to amplify them [24,27,64,66,70,72], leading to skewed predictions at elevated rates. Recent research [65] underscores the importance of data in mitigating the adverse effects of bias, highlighting the need for careful training data selection to foster fair and accurate model outcomes.
Impact on Generative AI. With ongoing discussions about the ethical implications of image generation models [4,6,31,32], including bias [5,38,67], intellectual property [56], and privacy [10], some efforts to address the generation of toxic or inappropriate images have involved the inclusion of NSFW classifiers for post-hoc deletion of generated images [49]. However, as our study reveals, these classifiers exhibit a higher rate of misclassification for images containing women compared to those featuring men, which implies that the use of NSFW classifiers may reduce the final production of images depicting women, exacerbating the existing representational gap within these models. As the field advances, it becomes necessary to examine the consequences of such measures and advocate for a more equitable and inclusive trajectory in generative AI development.
Implications for User Experience. The use of automatic tools for content moderation on social media platforms has been widely discussed [22], especially for text data. When applied to visual content, NSFW models could disproportionately remove images and videos featuring women compared to men. As shown in Section 4.5, female faces undergo higher NSFW misclassification rates, which may lead social media users to encounter fewer images of women than of men. This skewed visibility may inadvertently foster a misleading impression that women are less prevalent in society. The unintended consequence of such content filtering mechanisms could contribute to distorted perceptions of gender representation within the online environment.
Implications for Artistic Creativity. Algorithmic content moderation on social media has a direct influence on the creativity of artists, directly impacting the visibility of their work and their income [39,47,48]. The results presented in this paper, where NSFW algorithms flag content based on the presence of female faces, add further evidence to the growing concerns about automatic content moderation algorithms censoring artistic pieces featuring the female body [18], even when the intent is purely artistic and nonsexual.

CONCLUSION
In this work, we analyzed three prominent not safe for work (NSFW) classifiers and their impact when used for content filtering and automatic content moderation. We conducted an analysis on the GCC and MSCOCO datasets with demographic annotations and compared the false positive rate (FPR) against a control dataset without humans (PASS dataset). By inspecting the regions with the highest contribution to the NSFW mispredictions, we concluded:
• NSFW classifiers mispredicted images of perceived women at higher rates than images of perceived men. The difference was as high as 17.9%. Upon inspection of the mispredicted images, women appeared doing standard activities like sports, eating, or posing for the camera. For men, we identified a number of images exhibiting gender-nonconforming attributes, indicating that the mere presence of attributes perceived as feminine or non-masculine can flag NSFW classifiers.
• NSFW classifiers mispredicted images of people with lighter skin-tone at higher rates than images of people with darker skin-tone. These results, however, should be considered cautiously due to the large imbalance in the number of samples, with 8 times more light skin-tone annotations than dark skin-tone annotations.
• NSFW classifiers tended to mispredict younger people at higher rates than older people. We found the result for the child category (0-14 years old) especially concerning, with two out of three classifiers exhibiting the highest FPR and the third one exhibiting the second highest FPR. This indicates that NSFW classifiers find NSFW traits in innocuous images of children, prompting reflection about the training data and the reasons why images showing children and younger adults activated NSFW predictions.
• When conducting a regional analysis, we found that all three NSFW classifiers tended to focus on the faces of women to make their prediction. We found that faces of women were a stronger signal for NSFW classifiers than images of men's nude torsos.
• The regional analysis also showed that hands and people eating were more likely to raise NSFW flags. Additionally, the NSFW classifier based on distance embeddings (CLIP-distance) had more difficulties distinguishing between safe and NSFW objects.
• Finally, by analyzing the impact of demographic biases on NSFW classifiers, we found that different FPRs across demographic groups have the potential to widen the representational gap in image datasets, computer vision models, generative AI, and online content in general. This has a direct impact on users' experience as well as on artistic creativity.

LIMITATIONS
Our results about NSFW classifiers and their implications for content filtering, while insightful, have certain limitations that warrant consideration. The exclusive use of a single metric, namely the False Positive Rate (FPR), may present an incomplete picture of the classifiers' overall performance. A comprehensive evaluation should ideally incorporate multiple metrics to ensure a nuanced understanding of their effectiveness. Nevertheless, the decision to refrain from computing accuracy and False Negative Rate (FNR) on NSFW datasets was deliberate, driven by both a lack of reliable NSFW datasets and ethical considerations related to the download and possession of NSFW data. Another limitation of this work is the use of imbalanced demographic annotations, a factor that can introduce noise and skew the results. The imbalanced nature of demographic data in computer vision datasets underscores the need for more balanced and representative datasets to draw robust conclusions.
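For reference, the two rates mentioned above can be written in their standard form, taking NSFW as the positive class. The symbols TP, FP, TN, and FN (true/false positives/negatives) are notation introduced here for illustration.

```latex
\mathrm{FPR} = \frac{FP}{FP + TN},
\qquad
\mathrm{FNR} = \frac{FN}{FN + TP}
```

On a dataset assumed to contain only safe images there are no positives (TP = FN = 0), so the FNR is undefined and the FPR reduces to the share of images flagged as NSFW. Estimating the FNR would instead require a labeled set of NSFW images.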

ACKNOWLEDGMENTS
This work is partly supported by JST CREST Grant No. JPMJCR20D3, JST FOREST Grant No. JPMJFR216O, and JSPS KAKENHI Nos. JP22K12091 and JP23H00497. The JST and JSPS had no role in the design and conduct of the study; access and collection of data; analysis and interpretation of data; preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication. The authors declare no other financial interests.

RESEARCHERS POSITIONALITY
The authors acknowledge the importance of using automatic filters to eliminate toxic content from datasets and online platforms. Our intention is not to discourage the use of such filters, which we deem necessary. Rather, with this work, our objective is to highlight the disparities in the functionality of NSFW classifiers across diverse demographic groups. By bringing attention to this issue, we aim to contribute to the collective efforts to address and rectify these disparities, ensuring a more equitable and effective application of content filtering mechanisms.

ETHICAL CONSIDERATIONS
An inherent ethical consideration of this work lies in defining what constitutes an NSFW image. The ambiguity surrounding the NSFW criteria prompts a critical examination of the ethical dimensions involved. Questions arise regarding the threshold for explicit content: what specific body parts, if exposed, classify an image as NSFW? Moreover, the consideration of cultural and contextual variations adds another layer of complexity. The ethical discourse extends to instances where certain body parts may be depicted in art, statues, or classic works, challenging the universality of NSFW categorization. Within the scope of the paper, we find it important to acknowledge these ethical considerations and the diversity of perspectives to ensure a balanced and culturally sensitive approach.

Figure 1: Examples of images from the MSCOCO dataset misclassified as NSFW per classifier. On the top row (orange background), images annotated as man. On the bottom row (purple background), images annotated as woman. Note that annotations are based on perceived gender.

Figure 2: Examples of images from the GCC dataset misclassified as NSFW per classifier. On the top row (orange background), images annotated as man. On the bottom row (purple background), images annotated as woman. Note that annotations are based on perceived gender.

Figure 3: NSFW-CNN classifier regional analysis conducted with Grad-CAM [53]. We find three main themes within the misclassified images: (a) people eating, (b) hands, and (c) women's faces. Red regions indicate a higher contribution to the model prediction, whereas blue regions indicate a low contribution.

Figure 4: CLIP-classifier regional analysis conducted with RISE [41] by estimating pixel importance in the input image. We find that the model focuses especially on women's faces when mispredicting safe images as NSFW. Red regions indicate a higher contribution to the model prediction, whereas blue regions indicate a low contribution.

Figure 5: CLIP-distance regional analysis conducted with RISE [41]. In this case, we find four groups of images that often trigger the NSFW misprediction: (a) women's faces, (b) sausages, (c) donuts, and (d) people eating. Red regions indicate a higher contribution to the model prediction, whereas blue regions indicate a low contribution.

Table 1: NSFW classifiers in our audit. Underline indicates the parts of the models that need training, Data source indicates where the training data was collected from, and Num. samples indicates the number of samples used for such training.

Table 2: Number of images used in our analysis per dataset and attribute.

Table 3: NSFW classifiers comparative evaluation. NSFW indicates the number of images misclassified as NSFW, FPR the false positive rate in %, and score the average confidence score for each classifier. Bold font highlights the classifier with the highest mispredictions and FPR for each dataset.

Table 4: False Positive Rate per gender in percentage (%). Diff. is the difference between the Woman and Man columns. Bold font highlights the gender with the highest mispredictions for each classifier and dataset.
categorize an image as NSFW based on the presence of traditionally associated feminine traits.

Table 5: False Positive Rate per skin color in percentage (%). Diff. is the difference between the columns Light (skin-tone) and Dark (skin-tone). Bold font highlights the skin-tone with the highest mispredictions for each classifier and dataset.

Table 6: False Positive Rate per age in percentage (%). Bold font highlights the age with the highest mispredictions for each classifier and dataset.