Wiki Loves Monuments: Crowdsourcing the Collective Image of the Worldwide Built Heritage

The wide adoption of digital technologies in the cultural heritage sector has promoted the emergence of new, distributed ways of working, communicating, and investigating cultural products and services. In particular, collaborative online platforms and crowdsourcing mechanisms have been widely adopted in the effort to solicit input from the community and promote engagement. In this work, we provide an extensive analysis of the Wiki Loves Monuments initiative, an annual, international photography contest in which volunteers are invited to take pictures of the built cultural heritage and upload them to Wikimedia Commons. We explore the geographical, temporal, and topical dimensions across the 2010–2021 editions. We first adopt a set of CNN-based artificial systems that allow the learning of deep scene features for various scene recognition tasks, exploring cross-country (dis)similarities. To overcome the rigidity of the framework based on scene descriptors, we train a deep convolutional neural network model to label a photo with its country of origin. The resulting model captures the best representation of a heritage site uploaded in a country, and it allows the domain experts to explore the complexity of cross-national architectural styles. Finally, as a validation step, we explore the link between architectural heritage and intangible cultural values, operationalized using the framework developed within the World Value Survey research program. We observe that cross-country cultural similarities match to a fair extent the interrelations emerging in the architectural domain. We think this study contributes to highlighting the richness and the potential of the Wikimedia data and tools ecosystem to act as a scientific object for art historians, iconologists, and archaeologists.


INTRODUCTION
Cultural heritage is a complex, multifaceted mosaic that holds up a mirror to who we were, who we are, and who we aspire to be [45]. Broadly defined, cultural heritage encompasses the extraordinarily rich set of tangible objects and materials in the collections of cultural institutions (movable); along with the heritage represented in the built environment (immovable) and in landscapes (natural). At the same time, cultural heritage includes an intangible dimension that captures elements such as folklore, traditions, customs, rituals, and, in general, the knowledge base of a society [7]. As pointed out in various contexts, the promotion and preservation of cultural heritage have an enormous potential to shape vibrant, innovative, and prosperous societies; improve quality of life; drive economic growth; and open up employment opportunities [7]. Recently, the wide adoption of digital technologies and applications, such as online social networks, multimedia sharing platforms, and digital libraries, is profoundly influencing and shaping how culture is experienced in contemporary society. The cultural heritage sector, as are other sectors, is following this transformation with the emergence of new, distributed ways of working, communicating, and investigating cultural products and services. In particular, we are interested in the role of collaborative online platforms, like Wikipedia, as a digital gateway to cultural heritage and a platform to promote engagement. Crowdsourcing mechanisms have been widely adopted to enrich the information about cultural heritage by soliciting input from the community. Even though some ethical and practical concerns have been raised [29], crowdsourcing has been proven effective in encouraging audiences to curate and modify the content of museums, libraries, and archives, which needs indexing or digitization [53]. 
This paradigm shift was able to place the individual at the center of content production and enrichment, pushing engagement with cultural heritage outside the boundaries of traditional institutions. In this work, we focus on the Wiki Loves Monuments (WLM) initiative, an annual, international photography contest in which volunteers are invited to take pictures of built cultural heritage and upload them to Wikimedia Commons, a sister project and the shared media repository of Wikipedia. WLM represents a way of capturing a snapshot of a nation's architectural heritage for future generations and documenting a country's most important historical sites. While Wikimedia Commons hosts visual representations of a wide variety of heritage items, e.g., digitized art objects and natural heritage, it is worth mentioning that the contribution of WLM focuses on one aspect of a broader picture, that is, the built heritage component. The collection of images gathered over more than 10 years through the competition is growing into an incredibly useful historical resource. Since the first edition in 2010, the project helped to collect information on 1.6M monuments from 93 national competitions, with more than 2.8M pictures submitted by over 86K participants. To characterize the impact of the competition, we provide a quantitative analysis of the WLM photographic collection along the dimensions of space, time, and topical coverage. Different from previous work that focused mostly on specific countries or a particular edition [49], we explore the entire corpus across the time frame 2010–2021. We think this effort could shed light on the potential of a crowdsourced initiative and a collaborative platform like Wikipedia in the promotion and conservation of the worldwide cultural heritage alongside standard cultural institutions.
As part of human activity, cultural heritage has long been linked to the notion of culture as the manifestation of a shared history and a common collective identity [24]. As a validation step, our work explores this connection with quantitative instruments, focusing on the relation between architectural heritage and cultural values. In other words, following what has been observed in a variety of empirical studies [3,42], we test whether the cross-country distances in the cultural sphere are correlated with the corresponding cross-country distances in the built heritage sphere. Several efforts have been devoted to the quantitative evaluation and characterization of culture [21,22,26,37,38,56,59], with researchers from different fields investigating a wide variety of facets [42]. This heterogeneity often leads to the realization that cultural dimensions are hard to interpret as absolute scores, and that cross-country cultural models can only be meaningfully interpreted in a comparative framework instead [23]. To operationalize culture, we adopt a subset of the data from the World Value Survey [27] (WVS), which is a large-scale, cross-national, and cross-sectional longitudinal survey research program devoted to the scientific and academic study of social, political, economic, religious, and cultural values of people in the world. In particular, we refer to the Inglehart-Welzel Index (IW Index) [27] as a multi-dimensional construct to characterize the emerging cultural traits of a country.
To summarize, the main contributions of this work can be listed as:
• For the first time, we provide a comprehensive characterization of the WLM initiative across the geographical, temporal, and topical dimensions.
• Leveraging the crowdsourced WLM photographic collection, we train a per-country visual descriptor to capture a digital representation of the architectural heritage of a nation. This visual descriptive model enables the researcher to explore the distinctive traits of a country's heritage and, above all, it provides a comparative framework to highlight (dis)similarities across countries.
• As a validation step, we provide quantitative evidence of a link between the visual representation of a nation's architectural heritage and the set of social, political, religious, and cultural values operationalized by the Inglehart-Welzel Index. This is consistent with the hypothesis that cultural heritage and societal beliefs are intertwined.

RELATED WORK
Digital Cultural Heritage. Since the mid-2000s, the adoption of digital tools and information and communication technologies (ICTs) has radically changed the digital acquisition, storage, conservation, representation, and promotion of worldwide cultural heritage, enabling a variety of potential novel applications and services. In particular, online social networks, multimedia sharing platforms, open digital libraries, and collaborative knowledge bases have provided a heterogeneous ecosystem of data sources and tools that reshaped how we experience an exhibition, a museum, an archaeological site, or monuments in historical city centers. Semantic Web technologies have been extensively used to enrich cultural heritage data [11,12,25,58], to provide multilingual access [10], to implement semiautomatic pipelines for content annotation, and to facilitate the curation, collaborative management, interoperability, and engagement with digital collections [13]. A wide body of literature in the Cultural Heritage domain explored the use of IoT systems [47] to bridge the digital and real dimensions into new smart environments and innovative forms of browsing of cultural spaces. For instance, the Smart Context-awaRe Browsing assistant for cultural EnvironmentS (SCRABS) [2] proposes a scalable prototype for the management and context-driven browsing of cultural environments. The system is characterized by the integration of distributed and heterogeneous pervasive data sources, context awareness, and advanced smart services such as retrieval, analytics, and recommendations based on users' preferences and the surrounding environment. The application of mobile phones in the framework of museums and exhibition spaces is steadily increasing due to their improved computational capabilities, which allow embedding interactive applications and rich graphics; their portability and widespread adoption; and their ability to shape new museum narratives [54].
Recently, Augmented Reality (AR) and Virtual Reality (VR) technologies have seen a great boost in terms of adoption and research perspectives. One common application of VR/AR involves the relocation of museum objects and cultural properties in the archaeological sites they originate from. Re-collocating allows the users to materialize their original location, scale, and function, creating a connection with the cultural identities to which they refer [15,39,48].
Cultural Heritage and Wikipedia. Wikipedia has been widely adopted as a digital gateway to tangible and intangible cultural heritage items and a platform to promote engagement. Starting from the observation that digitized CH items often have short descriptions and lack rich contextual information, Agirre et al. [1] explored the feasibility of matching metadata annotations and high-quality descriptors from Wikipedia articles to objects in the Europeana initiative. Hall et al. [19] implemented a methodology to promote and make visible the unpopular resources of a CH collection by augmenting the navigational interface of matching Wikipedia articles with thumbnail images that are linked to external collections. The open source analysis architecture Contropedia [46], focusing on the notion of memory work, provided a platform to study the discourse around cultural heritage, focusing on the identification and visualization of heritage-related disputes within an article and their comparison across language versions. The patterns of production, consumption, and geographical distribution of Wikipedia pages describing UNESCO World Heritage sites [40] have recently been investigated, underlining the distinctive role of Wikipedia in characterizing how people experience and relate to the past. Controversies in articles were analyzed, showing the prevalence of hyper-local and process issues over broader concerns, e.g., violence, destruction, and preservation.
Even though the authors stressed how Wikipedia is a valuable and important source of social information on heritage, they underlined some concerns about spatial coverage, the skewness of editors' cultural and sociodemographic provenance, and the prevalence of edits from the Anglosphere. In [50], the authors investigated the gaps in Wikipedia's coverage of the visual arts by comparing the extent and quality of the coverage of 100 artists and 100 artworks from Western and non-Western canon. Recently, image analysis has been successfully applied to a wide variety of tasks and application domains, e.g., object detection [51,52,63] or image captioning [20], with a surge in contributions in the context of cultural heritage projects [6,34,41]. Examples of relevant applications cover the automatic enrichment of digital visual collections, improved search capabilities [62], automated dating [43], photogrammetric analysis of historical images for virtual reconstruction [35], and historical buildings classification [32]. To avoid the cost of training from scratch complex deep learning architectures, a popular method, adopted also in our work, is transfer learning, which refers to utilizing "what has been learned in one setting [. . .] to improve generalization in another setting" [17]. Wevers [62] showed the role of transfer learning to improve scene detection applied to a historical press photo collection. Kulkarni et al. [30] applied transfer learning to predict which cultural heritage site an image refers to, allowing improved searches through specific terms, and helping in the studying and understanding of heritage assets. Within the context of the INCEPTION European project, authors in [33] applied transfer learning to the classification of cultural heritage images, showing a substantial improvement in accuracy over SVM and random forest baselines. 
A generally accepted concept in photography holds that a shot reflects the photographer's inner sense, which translates to the observation that the content of a photo can often contribute to characterizing people's attitudes and values on a subject [28]. In this direction, exploiting a continuously re-training deep learning model, the authors in [28] analyzed the visual content of a corpus of Flickr photos made by tourists to capture the tourists' urban image of a place as a reflection of the unique landscapes, cultural characteristics, and traditional elements of the visited region. Similarly, Pan et al. [44] analyzed 145 travel photos submitted to The New York Times to explore the interplay between motivations, image dimensions, and affective qualities of places as a reflection of the inner feelings of the photographers. In this work, we embrace this vision within the cultural heritage domain; we aim at exploring the interplay between the choice of subjects of the WLM submissions and the photographers' inner representation of the importance of the cultural heritage assets.

WIKI LOVES MONUMENTS PHOTOGRAPHY CONTEST
WLM is an international photography contest organized by volunteers around the world in which participants are invited to take pictures of local built cultural heritage sites, e.g., a building, ruin, or complex of historic significance, and upload them to the Wikimedia Commons platform. The competition is organized under a federated model, where national teams organize the competition in their specific country over 1 month (traditionally September), each with its own monuments, organization, partners, jury process, and awards. After the national juries have concluded their work, each national jury can submit up to 10 photographs to an international finale. The national competitions typically establish some kind of list or definition of what buildings qualify as a "monument," usually following official lists and definitions (e.g., "listed building" or "scheduled monument" in the United Kingdom, "monument historique" in France, and "tombamento" in Brazil), and may apply additional rules to the competition based on the national situation, such as copyright restrictions and antiquity laws. The photos are then uploaded to Wikimedia Commons under a free copyright license, so that the images can be used to illustrate relevant articles, and the aforementioned lists are generally accessible and reusable, contributing to expanding the quality of the knowledge base and helping to promote the history and national heritage of all participating countries. WLM started as a pilot project in the Netherlands in 2010 and, due to its increasing success, organizers in many countries joined the competition; it has been organized in over 90 countries and, in 2021, ran in more than 35 countries around the world. It has not only managed to collect more than 2.8 million images of built cultural heritage worldwide but also inspired the creation of a database of heritage sites so that photographers would know what to photograph and where.
This database currently holds more than 1.6 million data points collected from various heritage organizations and is likely the largest in existence. Arguably, WLM is a notable example of a successful crowdsourcing campaign and it is acknowledged by the Guinness Book of World Records as the largest photography competition in the world.
Temporal and Geographical Characterization. In this work, we focus on the WLM editions in the period 2010–2021, for which we collect all the submitted photos with the corresponding metadata. Photos are characterized by an identifier representing the depicted site, a reference to the WLM national organizer, and the filename that might contain a textual reference to the monument photographed. Figure 1 shows the spatial coverage of the resulting dataset; each photo has been assigned to a country according to the WLM contest of submission. We observe a highly heterogeneous spatial coverage with a strong presence of European countries; in fact, the top 6 countries for volume of submissions are in Europe, as shown in Figure 2. Within the top 30 most represented countries in this dataset, we find representatives of Asia and North/South America, while Africa is present only with Egypt. To characterize the temporal variability of contributors' participation, Figure 2 shows the volume of submissions (dashed bars) and the number of countries (solid bars) that are involved in each edition. After the first 2 years of growth, WLM reached an average engagement in the range of 200K to 250K photographs per year. The color-coded split by continent in Figure 2 confirms the predominance of Europe followed by Asia as the most prolific sources of information. Despite the European focus and the difficulty of reaching and engaging with a wider community, WLM has reached more than 90 different countries in 7 continents, and it arguably provides an extensive and diverse dataset to characterize the built cultural heritage across national boundaries.
Characterization of the Contributor Base. To shed light on the characteristics of the typical WLM contestant and contextualize the nature of the dataset, we explore several activity metrics at the contributor level. Even though in the rest of the manuscript the unit of interest will be a country and the data will be analyzed only in an aggregated form, we aim at characterizing to what extent the profile of a geographical region depends on the activity of a limited group of contributors. In this direction, Figure 3 shows the empirical cumulative distribution function of (a) the number of photos uploaded per contributor, (b) the distribution of the number of countries a contributor uploaded pictures for, and (c) the number of years of participation in the contest. We observe that 95% of the participants uploaded during the period of study up to 66 photos in a single country, entering a maximum of two editions of the contest. However, even though the vast majority of the contributors do not exhibit extreme behaviors, we observe the presence of a limited group of very active contributors. The top 10 most prolific contestants are responsible for about 265K submissions (around 9% of the collection). A handful of contributors (25) took part in all the editions, and even though usually participants upload photos within their country of origin, 25 contributors submitted photos across 10 or more countries.
To limit the impact of over-representation of a small set of active contributors, we apply the following filters to the original data:
(1) Pre-processing: Images were downloaded at 300px on the longest side and resized to a 224px or 150px square format, according to the input requirements of the vision pipelines in Sections 3.1 and 4.
(2) Near-duplicates removal: To reduce the effect of contributors uploading nearly identical photos, we apply a similarity search routine at the country level. Each photo in the corpus is represented by a vector of visual features extracted from VGG16 [57], a state-of-the-art deep learning model for object recognition initialized with pre-trained weights on ImageNet [55]. For each photo within a country, we compute the set of k-nearest neighbors and their similarity score, with k = 100. The similarity score is computed as the inverse of the cosine distance of the visual vectors extracted in the previous step, normalized in the range [0,1]. We define as duplicates the neighbors that show a similarity score greater than a threshold δ. After exploring different configurations and evaluating them through visual inspection, we set the threshold δ = 0.98. We finally keep a randomly selected representative for each duplicate set, discarding the remaining photos. To enable a fast retrieval of the neighbors set, we use the Python library Faiss, which provides efficient similarity search and clustering of dense vectors. This filter removes 2% of the initial dataset.
(3) Submissions per contributor: To limit the contribution of very active contributors, we randomly sample up to 1,000 photos as the representatives of the activity of a participant. This reduces the impact of individual preferences and styles across the geographical and temporal dimensions.
(4) Support: To ensure sufficient support, only countries with at least 5K images remaining are considered.
This process results in a dataset of around 2.66M images by 82,716 contributors from 56 countries.
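The filtering steps above can be illustrated with a minimal sketch. It assumes that the similarity score equals one minus the cosine distance (plain cosine similarity, which already lies in [0,1] for non-negative feature vectors), that duplicate sets are formed transitively, and that a brute-force neighbor search stands in for the Faiss index; the function names and the `(photo_id, contributor, country)` record layout are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def remove_near_duplicates(features, delta=0.98, k=100, seed=0):
    """Return the sorted indices of the photos kept for one country.

    Photos linked by a similarity above delta (among each photo's k nearest
    neighbors) are grouped transitively (an assumption), and one randomly
    chosen representative per group is kept.
    """
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Brute-force k-NN; at scale an approximate index would replace this.
        neighbors = sorted(
            ((cosine_similarity(features[i], features[j]), j)
             for j in range(n) if j != i),
            reverse=True,
        )[:k]
        for sim, j in neighbors:
            if sim > delta:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    rng = random.Random(seed)
    return sorted(rng.choice(members) for members in groups.values())


def cap_and_filter(photos, max_per_contributor=1000, min_per_country=5000, seed=0):
    """Cap photos per contributor, then drop countries with too little support.

    photos: list of (photo_id, contributor, country) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    by_contributor = defaultdict(list)
    for photo in photos:
        by_contributor[photo[1]].append(photo)

    capped = []
    for items in by_contributor.values():
        if len(items) > max_per_contributor:
            items = rng.sample(items, max_per_contributor)
        capped.extend(items)

    support = Counter(country for _, _, country in capped)
    return [p for p in capped if support[p[2]] >= min_per_country]
```

The default values mirror the ones reported in the text (δ = 0.98, k = 100, 1,000 photos per contributor, 5K photos per country); everything else is an illustrative choice.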

Toward a Topical Characterization of the Wiki Loves Monuments Corpus
After the geographical and temporal dimensions, in this section, we shift the focus to the content of a photo to capture which are the visual features that characterize the submissions in each national competition. Since the metadata do not always provide topical annotations nor an easy way to connect a submission to a monument in the corresponding national monuments registry, we refer to computer vision techniques to capture this aspect. In particular, we adopt the Places365 [65] scene recognition framework, a set of state-of-the-art convolutional neural network (CNN)-based artificial systems that allow the learning of deep scene features for various scene recognition tasks. Places are categorized by function making use of the spatial associations among the objects in an image. The framework is built upon a large-scale database containing 10M images, which are labeled with 365 scene semantic categories organized in a three-level taxonomy [64]. At the root of the hierarchy, there is the dichotomy indoor/outdoor places; the latter are grouped into natural and man-made outdoor scenes. Since we are interested in capturing the presence of elements related to the cultural heritage sphere, in this work, we focus on the indoor/cultural and outdoor/man-made/cultural scenes that count for 61 categories such as amphitheater, church, catacomb, museum, and others. The complete set of scene descriptors is listed in Table 1.
Table 1 (fragment recovered from the extracted text): aquarium, arcade, archive, art gallery, art school, art studio, artists loft, auditorium, beauty salon, booth, burial chamber, catacomb, church, classroom, conference center, jail cell, kindergarten classroom, lecture room, legislative chamber, library, movie theater indoor, museum, music studio, natural history museum, orchestra pit, science museum, stage indoor, television studio, throne room, ticket booth.
We adopt the pre-trained VGG16-places365 model provided in [64] to extract for each photo the scene categories with a prediction confidence score ρ > 0.15 that measures the reliability of class assignments. The threshold has been selected to filter out the annotations that have a confidence score less than the 30th percentile of the overall distribution. We explored different quality thresholds obtaining comparable results. Figure 4 (top-left) shows the top 5 indoor and outdoor scenes in our aggregated dataset, being, respectively, church, burial chamber, museum, catacomb, throne room, and church/outdoor, synagogue/outdoor, cemetery, tower and ruin. These associations are not surprising due to the geographical coverage dominated by European countries. However, shifting to a country-specific analysis, differences between cultural areas emerge: for instance, the most frequent outdoor cultural scenes in Thailand (Figure 4, top-center) are temple, pagoda, and palace, while in the Netherlands (Figure 4, top-right) oast house, synagogue, and church are predominant. To give a hint to the cross-country heterogeneity, we provide a representative for each geographical region. We list in the Additional Material the most frequent categories for the remaining national competitions (Figures A1-A4).
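As an illustration of the thresholding step, the sketch below derives a confidence cut-off from a percentile of the score distribution and counts the retained scene labels per country. The nearest-rank percentile definition and the `(country, scene, confidence)` record layout are assumptions made for the example, not details given in the text.

```python
import math
from collections import Counter


def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def scene_counts_by_country(predictions, threshold):
    """Count scene labels per country, keeping confidences above the threshold.

    predictions: iterable of (country, scene, confidence) tuples (hypothetical).
    """
    counts = {}
    for country, scene, confidence in predictions:
        if confidence > threshold:
            counts.setdefault(country, Counter())[scene] += 1
    return counts
```

Calling `scene_counts_by_country(predictions, 0.15)` reproduces the fixed cut-off used above, while `percentile(scores, 30)` shows how such a cut-off could be derived from the observed scores.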
To explore in more detail the complexity of the category assignments, we refer to a comparative framework in which each country $c$ is represented by a vector $\mathbf{v}_c^{\text{places365}} = \langle v_1, \ldots, v_n \rangle$, where $v_i$ models the frequency of the category $i$ within the corpus of the photos submitted in that country and $n$ is the number of categories with a non-null frequency ($n = 60$). This representation makes it possible to explore the similarity among countries according to the most frequently detected scenes. To obtain a first descriptive image of cross-national similarities, we refer to the T-distributed Stochastic Neighbor Embedding (t-SNE) [61] approach to embed the high-dimensional $\mathbf{v}_c^{\text{places365}}$ vectors in a two-dimensional space. Figure 5(a) shows the map limited to the countries with at least 5K photos tagged with at least one cultural category (56 countries, 1.5M photos in total). To ease the navigation of the map, we color the points according to the corresponding country's geographical region. Interpreting closeness as a similarity measure, we observe the emergence of interesting patterns: in several cases, geographical proximity matches the proximity in the embedding space, e.g., the case of Bangladesh, India, and Pakistan, or the South American cluster of Colombia, Argentina, Chile, Uruguay, and Bolivia, or in Southeast Asia the cases of Thailand, China, and Nepal, suggesting the presence of homogeneous cultural areas with a similar representation of categories among the photos. However, there are some notable examples where the geographic association fails: e.g., Ireland exhibits close proximity to Georgia and Serbia in the embedding space while being geographically distant. A similar observation holds for the Philippines and Brazil, or South Africa and Ukraine. These patterns might be linked to limitations of the adopted scene detection framework or to specific characteristics of the national competition, e.g., local suggestions on what to photograph.
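A minimal sketch of the per-country representation follows: it turns raw scene counts into a frequency vector over a fixed category order and compares two countries with the cosine distance. The normalization by the total count is an assumption about how "frequency" is computed, and the t-SNE projection is left out because it requires an external library.

```python
import math


def frequency_vector(counts, categories):
    """Relative frequency of each category, in the order given by categories."""
    total = sum(counts.get(c, 0) for c in categories)
    return [counts.get(c, 0) / total if total else 0.0 for c in categories]


def cosine_distance(u, v):
    """One minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Because the cosine distance ignores vector length, two countries with very different submission volumes but the same mix of scenes end up at distance zero, which is the behavior one would want for a comparison of profiles rather than of sizes.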
Figure 5(b) provides an alternative view to explore these geographical relations: the dendrogram resulting from an agglomerative hierarchical clustering of the $\mathbf{v}_c^{\text{places365}}$ vectors (distance function = cosine, linking criterion = average) is shown in a circular layout. Countries are colored with the corresponding geographical area and they are placed close by according to the similarity in their scenes' frequency profiles. The hierarchy shows the existence of semantically and geographically related groups as well as counterexamples. As previously pointed out, despite the ability to detail high-level similarities, this approach presents a few drawbacks: First, it is based on a limited set of visual descriptors that are not able to capture the complexity and diversity of the worldwide cultural heritage. Second, certain categories are prone to misclassifications: e.g., jail cell is likely to capture anything that has a metallic grid pattern (Figure 6(a)), or throne room often refers to indoor rooms in historical palaces with a chair or even an altar in a church (Figure 6(b)). Similarly, the natural history museum category is often triggered by the presence of statues of animals in facades (Figure 6(c)). Third, the nature and the guidelines of the photographic contest tend to bias the choice of the subjects of a photo since most competitions limit themselves to officially recognized heritage sites. This means that the government policy on what to acknowledge as heritage, but also which of those sites are easily accessible to photographers, will, even under perfect labeling, determine the country's vector. For example, the fourth most frequently detected scene globally is cemetery, which often depicts headstones of famous individuals or commemorative plaques of historical events.
Similarly, two of the three most frequent scenes in Italy appear to be cemetery and fountain, highlighting the choice of Italian contestants to focus on those architectural elements independently of their relevance to the national heritage. Finally, artistic styles are hardly captured by generic categories like palace (Figures 6(d) and 6(e)) or arch (Figures 6(f) and 6(g)), which hide the diversity of cultural movements and historical epochs; i.e., a palace in different geographical regions might show significant visual and artistic differences.
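The agglomerative clustering behind the dendrogram of Figure 5(b) can be sketched in a few lines. This is a naive implementation of average linkage over a precomputed distance matrix, written for readability rather than speed; in practice a library routine would normally be used, and the return format is a hypothetical choice.

```python
def average_linkage(dist):
    """Agglomerative clustering with the average linking criterion.

    dist: symmetric matrix (list of lists) of pairwise distances, e.g. cosine
    distances between country vectors.
    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges
```

The merge history contains the same information as a dendrogram: each tuple records which groups of countries were joined and at what average distance.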

MODELING THE COLLECTIVE IMAGE OF A COUNTRY CULTURAL HERITAGE
To overcome the drawbacks and limitations of the previous approach based on scene detectors, we aim at training a model able to capture the best visual representation of a heritage site uploaded in a country. We frame the problem as a multi-class classification task, in which the model learns to label a photo with its country of origin. To this end, we adopt a deep CNN architecture, called Xception [9], in its implementation available in Keras. Instead of training the model from scratch, which might take a considerable amount of time on large datasets, we apply the idea of transfer learning; starting from an Xception model pre-trained on ImageNet [55] for object recognition, we freeze the model weights except for the top layer, which is customized to solve our multi-class problem. Finally, we run a fine-tuning routine, which consists of unfreezing the entire model from the previous step and re-training it on the WLM data with a low learning rate ($10^{-4}$). In this framework, we limit our analysis to the countries with at least 5K photos to ensure a smooth learning phase, resulting in 56 countries. To keep a relative balance between classes, we consider a maximum number of 30K photos per country sampled randomly. We adopt an 80-20 split for training and test sets; 20% of the training data is held as a validation set for hyperparameter tuning. In Table 2 we present the performance of the model at an aggregate level.
Table 2 (caption). Precision, recall, and F1-score averages are weighted by the per-class support to account for label imbalance. The Matthews Correlation Coefficient (MCC) is a performance measure that combines the contribution of the true and false positives and negatives. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction, and −1 total disagreement between prediction and observation. Top k accuracy computes the number of times the correct country is among the top k predicted countries.
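The two-phase training scheme described above can be sketched with the Keras API as follows. This is a minimal sketch under stated assumptions, not the configuration used in the study: the 150px input size, the global average pooling, the Adam optimizer, and the first-phase learning rate are illustrative choices, and the helper names are hypothetical. Only the ImageNet-pretrained Xception backbone, the frozen-then-unfrozen schedule, the 56 classes, the 1e-4 fine-tuning rate, and the split proportions come from the text.

```python
import random


def train_val_test_split(n, seed=0):
    """Shuffle n indices into train/validation/test (80-20, then 20% of train)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = n // 5
    n_val = (n - n_test) // 5
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def build_country_classifier(num_countries=56, image_size=150):
    """Phase 1: frozen ImageNet-pretrained Xception with a new softmax head."""
    from tensorflow import keras  # imported lazily; requires TensorFlow

    base = keras.applications.Xception(
        weights="imagenet",
        include_top=False,
        input_shape=(image_size, image_size, 3),
        pooling="avg",
    )
    base.trainable = False  # only the new top layer is trained at first

    inputs = keras.Input(shape=(image_size, image_size, 3))
    # Xception expects inputs scaled to [-1, 1].
    x = keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)
    outputs = keras.layers.Dense(num_countries, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(1e-3),  # assumed phase-1 rate
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model, base


def unfreeze_for_fine_tuning(model, base, learning_rate=1e-4):
    """Phase 2: unfreeze the backbone and recompile with a low learning rate."""
    from tensorflow import keras

    base.trainable = True
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

A typical use would call `model.fit` on the training set once after `build_country_classifier` and once more after `unfreeze_for_fine_tuning`, monitoring the validation set in both phases; recompiling after changing `trainable` is needed for the change to take effect.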
We observe an F1-score of 0.45, computed as a weighted average of the per-class F1-scores to account for label imbalance. Additionally, to address the multi-class imbalanced scenario, we compute the Matthews Correlation Coefficient (MCC), a performance measure that takes into account all the dimensions of a confusion matrix, combining true/false positives and negatives. In essence, MCC is a correlation coefficient between −1 and +1; a coefficient of +1 represents a perfect prediction, 0 no better than random prediction, and −1 total disagreement between prediction and observation. We obtain MCC = 0.44, which indicates a solid gain with respect to a random classifier. Using the class probability vectors, we compute the top k accuracy with k ∈ {1, 3, 5}, which indicates how often the correct country is among the top k predicted countries. We observe that a photo's true country is among the top 3 predicted countries in 63% of cases and among the top 5 in 71% of cases. Given the high visual heterogeneity and the cross-country spillovers due to geographical or cultural proximity, we believe the model shows valid performance, especially in an information retrieval setting. To explore the diversity of performance across countries, in Table 3 we provide the per-class F1-score sorted in decreasing order. We observe a high variability across countries, with Nepal, Thailand, Armenia, the Netherlands, and Bolivia at the top of the ranking and Estonia, Uruguay, Algeria, Chile, and Hungary at the bottom. Qualitatively, countries with low support tend to perform worse because there is not enough data to train a high-quality discriminator. However, this effect is not due only to underfitting on a limited training set: the task is also inherently difficult when countries share close artistic styles and monument types.
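For readers who want to see how these aggregate measures are obtained, the sketch below computes support-weighted F1, multi-class MCC, and top k accuracy in plain Python. It is illustrative only: the function names are hypothetical, labels are assumed to be integer class indices, and the MCC uses the standard multi-class generalization based on the confusion matrix, which the source does not spell out.

```python
import math
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to the class support."""
    support = Counter(y_true)
    total = 0.0
    for c, n_c in support.items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = n_c - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += n_c * f1
    return total / len(y_true)

def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews Correlation Coefficient."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[c] * pred_counts[c] for c in classes)
    cov_tt = n * n - sum(v * v for v in true_counts.values())
    cov_pp = n * n - sum(v * v for v in pred_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0

def top_k_accuracy(y_true, prob_rows, k):
    """Share of samples whose true class is among the k most probable ones."""
    hits = 0
    for label, probs in zip(y_true, prob_rows):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(y_true)
```

In the two-class case `multiclass_mcc` reduces to the familiar binary MCC formula, which is a convenient sanity check.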
Classification errors are not random: the model tends to misclassify photos between countries that have similar visual profiles. To quantify this observation, for each label we compute the top 3 countries by number of misclassifications, weighted by per-class support (Table 4). We interpret the likelihood of misclassification as a form of similarity measure, indicating the inability to easily discriminate between labels. For example, Russia is misclassified with high probability with Ukraine, Finland, or Belarus, and Czechia with Slovakia, Austria, or Poland. For simplicity, in Table 4 we present the seven countries where the errors are most concentrated in a few classes. To explore this concept visually, in Figure 7 we show, for three pairs of countries, three pairs of photos with the highest visual similarity score. We compute the similarity between two pictures as the cosine similarity between their feature vectors extracted with our model. It is immediately apparent that some of these pictures are strikingly similar even though they belong to two different classes, indicating once again the inherent difficulty of the task and the fuzziness of the class boundaries.
Table 4. We interpret the likelihood of misclassifications as a form of similarity measure, indicating the inability to easily discriminate between countries. For example, Russia is misclassified with high probability with Ukraine, Romania, or Finland, and Spain with Portugal, Italy, or France. For simplicity, in this table we present the seven cases where the concentrations of the errors are highest.
Fig. 7. For three pairs of countries (Czechia/Austria, Belgium/the Netherlands, and Ukraine/Russia) we show three pairs of photos with a high visual similarity score. We compute the similarity between two photos as the cosine similarity between their feature vectors extracted by our model. A high similarity is a proxy for the difficulty of discriminating between classes: in fact, some of the examples are hardly distinguishable to the human eye, which underlines how class boundaries are sometimes fuzzy and, consequently, how difficult the multi-class task we are trying to solve is.
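The two quantities used in this paragraph, support-weighted misclassification counts and cosine similarity between feature vectors, can be illustrated with a short sketch. The function names and the toy country codes are hypothetical, and the feature vectors are assumed to be plain lists of floats rather than the actual model outputs.

```python
import math
from collections import Counter, defaultdict

def top_confusions(y_true, y_pred, top_n=3):
    """For each true label, the labels it is most often mistaken for,
    as a fraction of that label's support."""
    support = Counter(y_true)
    errors = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        if t != p:
            errors[t][p] += 1
    return {
        country: [(other, count / support[country])
                  for other, count in errors[country].most_common(top_n)]
        for country in support
    }

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

Dividing by the support makes the rates comparable across countries with very different numbers of test photos.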
In Figure 8 we visually explore the top 10 countries by F1-score showing a sample of 10 photos with the highest class assignment probability as a measure of confidence in the prediction. We refer to them as visual representatives for their high discriminative power. Not surprisingly, the diversity of subjects and artistic styles across countries emerges especially in the areas with different cultural backgrounds.
Finally, we refer to the same comparative framework in Section 3, where each country is represented by a vector $v^{WLM}_c = \langle v_1, \ldots, v_m \rangle$, which is computed as the average of all the feature vectors extracted from the photos belonging to country $c$. In the case of our fine-tuned model, a photo is represented by an $m$-dimensional vector, where $m$ = 1,000 is the number of object classes the Xception model was originally trained for. Figure 9(a) shows the map using t-SNE for the 56 countries covered by our analysis. Consistently with the case of the Places365 characterization, the geographical proximity and the cultural similarity among countries correlate with proximity in the embedding space. However, differently from the previous case, we observe (1) a higher degree of separability in the embedding space, with denser and more homogeneous clusters, and (2) a more coherent positioning in terms of geographical regions, which emerges from the consistent coloring within groups and the absence of some idiosyncrasies present in the scenes framework; e.g., Ireland is now close to Great Britain instead of Georgia, and Peru is no longer part of the Mediterranean cluster. We think the improved cross-country placement highlights the advantages of the framework based on the multi-class classification pipeline. However, it is interesting to underline that geography does not entirely explain the observed relations: our approach captures features that go beyond the obvious geographic clustering; e.g., countries like Brazil and Portugal, which are strongly linked culturally, appear close even though they are geographically very distant.
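As a small worked illustration of the country representation, the sketch below averages per-photo feature vectors into one vector per country. The function name `country_centroids` and the toy two-dimensional features are assumptions for the example; the real vectors would have 1,000 components.

```python
def country_centroids(features, countries):
    """Average the feature vectors of all photos sharing a country label."""
    sums, counts = {}, {}
    for vec, country in zip(features, countries):
        if country not in sums:
            sums[country] = [0.0] * len(vec)
            counts[country] = 0
        for i, x in enumerate(vec):
            sums[country][i] += x
        counts[country] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}
```

The resulting dictionary can then be passed to any dimensionality reduction or distance computation that compares countries.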

ON THE INTERPLAY BETWEEN HERITAGE AND CULTURAL VALUES
Cultural heritage has long been linked to a notion of culture that bundles together ethnicity, collective identity, territory, and the idea of a common origin and shared past [24]. In this framework, which has maintained a dominant position in mainstream heritage discourses since the 19th century, cultural heritage is the manifestation of a shared cultural history and a common collective identity. In this section, we aim at exploring this connection with quantitative instruments, focusing on the relation between built heritage and culture as a way to validate the cross-national differences we observed in Sections 3 and 4. Among the wide spectrum of models proposed in the literature, we adopt the mainstream etic approach in cross-cultural studies, which assumes the existence of a set of universal cultural dimensions that are equally relevant to all cultures and that allow positioning a society relative to other societies along each of the dimension continuums. Even though several attempts have been made to use large-scale online data, e.g., Facebook topical preferences [42] or Twitter news-oriented posts [3], social surveys are the de facto standard instrument within the cross-cultural studies community to quantitatively model cultural dimensions. The idea is to exploit the aggregation of individual opinions and the prevalence of some cultural traits to define national cultures, whereas cross-country comparisons are used to measure the cultural distance between countries or clusters of countries. In our work, we consider the World Values Survey (WVS) [27], a large-scale, cross-national, and cross-sectional longitudinal survey research program devoted to the scientific and academic study of the social, political, economic, religious, and cultural values of people in the world. The project grew out of the European Values Study (EVS) in 1981, and the corresponding social survey has been running every 5 years, operating in more than 120 world societies.
Its extensive geographical and thematic scope, together with the free availability of survey data and project findings to the broad public, has turned the WVS into one of the most authoritative and widely used cross-national surveys in the social sciences. To maximize geographical coverage, we adopt as a reference dataset the Integrated Value Survey (IVS) [4,18], which represents an effort to merge the WVS and EVS trend files into a single resource covering the time frame 1981-2021. Following the same methodology used in the construction of their Cultural Map [27], Inglehart and Welzel propose the IW Index [27] as a multi-dimensional construct to characterize the cultural traits of a country. Operationally, they select 10 variables from the IVS that explore the dimensions of happiness, trust, respect for authority, voice (willingness of people to express personal opinions by signing petitions), the importance of God, justification of homosexuality, abortion, national pride, individual rankings of social values, and obedience/independence [5].
In this scenario, each country is identified by a vector $v^{IWI}_c = \langle v_1, \ldots, v_{10} \rangle$, where each component $v_i$ represents the average value of the $i$-th question across IVS participants in country $c$. Since cultural dimensions are hard to interpret as absolute scores independently of the quality of the underlying model, we embrace the belief that cross-country cultural models can only be meaningfully interpreted and adopted in a comparative framework [23]. To enable comparisons among countries, we compute the cross-country distance matrix $M^{IWI}$, where the generic element $M^{IWI}_{ij}$ contains the cosine distance between the two countries $i$ and $j$, in other words, how different they are according to the 10 selected survey dimensions. We aim at exploring two questions: (1) Are the cross-national cultural relations linked to the built heritage ecosystem emerging from the WLM dataset? (2) Which approach (Places365 scene detectors or fine-tuned visual model) describes this relation better? To this purpose, we compute the matrices $M^{places365}$ and $M^{WLM}$ that encode, respectively, the relative distances in the embedding spaces built from the Places365 and WLM fine-tuned visual models. We adopt the Mantel test [36], a statistical test widely used in ecology to quantify the extent to which two distance matrices are correlated. We use the implementation in the scikit-bio Python package, with Spearman's rank correlation coefficient as the correlation metric. Our analysis is limited to the set of countries that are shared between the IVS and the WLM datasets, resulting in 49 countries. We observe a correlation coefficient of ρ = 0.34 for $M^{places365}$ and ρ = 0.45 for $M^{WLM}$. Both correlations are statistically significant, with p-value = 0.001. This means that countries that share similar cultural values tend to have a similar built heritage profile in this dataset.
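To make the logic of the Mantel test concrete, the sketch below re-implements a permutation-based version with Spearman correlation in plain Python. It is a didactic approximation and not the scikit-bio implementation used in the study: the function names are hypothetical, the p-value is two-sided, and the default of 999 permutations is an assumption, since the source does not report the permutation count.

```python
import random

def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def upper_triangle(matrix, order):
    """Flatten the strict upper triangle after relabeling rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel_spearman(m1, m2, permutations=999, seed=0):
    """Return (Spearman rho, permutation p-value) for two distance matrices."""
    rng = random.Random(seed)
    identity = list(range(len(m1)))
    fixed = ranks(upper_triangle(m1, identity))
    observed = pearson(fixed, ranks(upper_triangle(m2, identity)))
    extreme = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the countries of the second matrix
        stat = pearson(fixed, ranks(upper_triangle(m2, perm)))
        extreme += abs(stat) >= abs(observed) - 1e-12
    return observed, (extreme + 1) / (permutations + 1)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations. With this estimator and 999 permutations, the smallest attainable p-value is 0.001.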
Moreover, the approach based on the multi-class classification task and the fine-tuned visual model seems to capture these relations to a finer extent. However, not all countries behave the same: with this approach, we are able to quantify which pairs of countries are the closest/farthest, or which pairs have significantly different similarity estimates in the cultural values and WLM spaces (refer to Table A1 for an overview of the cross-country distance matrix for WLM). For example, United States/Canada, Chile/Argentina, and Serbia/Romania appear to be the pairs with the closest relation. The countries that are the most consistent with the cross-country cultural map are China, Belarus, Austria, Hungary, and Uruguay, while the ones that differ the most are Iran, Pakistan, Bangladesh, Armenia, and Thailand. The consistency score for a country is calculated as the sum of the pairwise differences in the distance estimates across the two representations. A more in-depth analysis of these relations, along with their theoretical investigation, is out of the scope of this work, which mainly aims at providing an example of how the crowdsourced collective knowledge emerging from WLM can be adopted to tackle relevant dimensions of the cross-cultural studies domain.
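The consistency score described above can be sketched as follows. This is one possible reading of the definition, not the authors' code: the function name is hypothetical, differences are taken in absolute value, and the two distance matrices are assumed to share the same country order and a comparable scale.

```python
def consistency_scores(dist_a, dist_b, countries):
    """Sum, for each country, of the absolute differences between its
    distances to every other country in the two representations."""
    n = len(countries)
    return {
        country: sum(abs(dist_a[i][j] - dist_b[i][j]) for j in range(n) if j != i)
        for i, country in enumerate(countries)
    }
```

Under this reading, a lower score means that a country is positioned similarly relative to the others in the cultural values space and in the WLM space.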

DISCUSSION
Beyond an interest in the WLM communities, this study highlights the richness that emerges from the data of the Wikimedia ecosystem as a scientific object. Indeed, part of our results ultimately bear on the question of stylistics for an art historian/iconologist or chrono-typology for an archaeologist. This approach, which dates back to the 19th century, has always been governed by the amount of data that can be processed to be as reliable as possible, even if it is only relative dating. The WLM contest based on Wikimedia Commons tools offers the possibility of working on perfectly documented Big Data. What this study confirms is that today's data analysis and processing tools offer new avenues of innovation in research for art historians, iconologists, archaeologists, and more broadly digital humanities. The automation of our research raises the question of the diffusion of architectural styles throughout the world that should be worked with by art historians interested in a global approach. One of our results is therefore to explicitly account for what WLM in particular and the wider pool of media in Wikimedia Commons represent, namely a pool of data ready to be explored at different geographical, chronological, and sociological levels, allowing international as well as local approaches. When treated with care, the media database of Wikimedia Commons could prove to provide a treasure of insights. Moreover, from a platform management perspective, the longitudinal and geographical analysis summarized in Figure 2 gives a consolidated visualization of the countries that have participated in the contest. With some effort, this analysis could be repeated and deepened for an expanded dataset, to include other built heritage photos on Wikimedia Commons and other free and less free repositories. 
It would be particularly valuable to better understand which components of the world's cultural heritage are poorly described with free photos, especially when combined with an analysis of identified and unidentified heritage sites.
The classification of images using CNNs is a methodology that could be applied more broadly by assisting in the maintenance and suggestion of labels in the Wikimedia universe. While this would require adapted models, it may be helpful to surface input labels for microtasks. The method could also be adapted in combination with the monuments database to recognize characteristics along axes other than country, such as style and building type, that could be used to detect possible identification errors and labels.
Limitations. Our approach carries a few inherent limitations that are worth discussing, in particular:
• Due to the original objective of the competition, which is limited to built cultural heritage, this study is not able to fully characterize the different dimensions within the cultural heritage concept, focusing only on a fraction of it. There are several initiatives and different perspectives on how to fill this gap. For example, Wiki Loves Heritage (WLH) is a photo contest organized for the first time in 2018 by Wikimedia Belgium that aims at broadening the set of items that are allowed to enter the competition to movable, immaterial, maritime heritage, landscape, and museum collections. However, WLH was run only in Belgium, limiting participation and adoption; moreover, there is no agreement that a new combined competition merging WLM-WLH would be the most effective way to fill the current gaps. Another approach would be to combine available datasets with a different focus, such as natural heritage (Wiki Loves Earth) or intangible heritage (Wiki Loves Folklore and Wiki Loves Africa), that have been modeled after WLM. However, creating a broader data repository merging images from heterogeneous sources would create some challenges that we were able to sidestep in this exercise.
• Focusing on the community of WLM organizers, our methodological pipeline reinforces the importance of collecting and surfacing metadata on submissions. For example, only roughly half of the photos uploaded to the contest have geographical coordinates embedded in the EXIF data, and therefore we chose to use the readily available and complete country data instead. The coordinates might be an interesting starting point for the study of regional and local trends. While the Wiki Loves Monuments dataset is relatively well documented with monument identifiers that connect a subset of the photos to a database, and Wikimedia Commons has a rich category structure, it is far from trivial to utilize this for the whole set. This may be the topic of further research, and the construction of a richer combined dataset may be worthwhile.
• The data used in this study suffers from a variety of sample biases that are intrinsic to the nature of the contest and the platform used for their collection. There are a number of factors that one could argue both introduce bias and effectively influence what constitutes a country's cultural heritage:
  - Participating countries: Even though the competition spans 90 countries in 7 continents over its 11 years, the coverage is highly heterogeneous, and European countries are over-represented both in number and coverage. Likely, the presence of an active community, resources, and a culture of
  The sites that are selected to be included in these lists are often selected by conservative mechanisms that may favor colonial-era heritage and limited viewpoints of what constitutes heritage. In addition, some countries are more inclusive. For example, Germany is estimated to have several hundred thousand sites on its lists, the Netherlands has 60K on its national list alone, and India only includes a few thousand sites. The fact that the competition uses primarily a pre-approved definition rather than the photographer's interpretation of what constitutes "built heritage," and that this mechanism differs somewhat in each country, can introduce a bias of what gets photographed in which country.
  - Access: The accessibility of such sites to participating photographers and the public at large determines which sites in a country are more likely to be photographed.
  - Interest: The willingness and interest of photographers to go out and capture the representation of any site may be affected by many factors, including whether the photo is exciting enough to result in a potential prize-winning photo.
  - Power users: There is a small number of very active participants who photograph a lot of sites and upload them to the competition. While their drive is admirable, it may result in an over-representation of the country or the region that they happen to live in or visit. We attempted to limit the effect of this factor by applying filters.
  - Laws: Many countries effectively prohibit the publication of photos of more recent (post-colonial) or ancient heritage sites through copyright (lack of Freedom of Panorama [14]) or antiquity laws [16,60]. This may either make it harder for organizers to collect and publish visual documentation of heritage or highly restrict and bias what can be documented.
• The image processing pipeline adopted in our work is based on deep convolutional neural network architectures that do not easily provide human-readable explanations for their choices. The current analysis lacks an interpretation layer that would provide domain experts with an actionable definition of the characteristic traits of the built cultural heritage of a country as emerging from the WLM corpus. Future work in this direction will reduce the gap between quantitative and qualitative interpretations, potentially enriching the toolbox available to quantitative art historians, iconologists, and archaeologists.

CONCLUSIONS
In this work, we presented an extensive characterization of the Wiki Loves Monuments photographic contest along the geographical, temporal, and topical dimensions. We highlighted the potential of the Wikimedia ecosystem to act as a scientific object for art historians, iconologists, and archaeologists, as well as the limitations and the biases that such a dataset introduces. Several avenues for future work are open. First, the analysis of the Wikimedia Commons Monuments Database and its systematic integration with the WLM dataset will help shed light on intra-national or regional patterns that in many countries show high heterogeneity and complexity (e.g., cultural differences and artistic styles between regions in North and South Italy). Moreover, the integration with the geographic coordinates, when available, would provide a refined understanding of which heritage sites are most attractive for a photographer and, possibly, their relation with the official visitation statistics. Second, while in this work we rely on a snapshot that aggregates the 11 years of the competition, it would be interesting to explore the within-country temporal evolution of photographed sites to spot trending venues or the emergence of particular styles in the portrayed subjects or in the photographic techniques. Third, a more in-depth analysis of the visual content of the uploaded photos could provide a more nuanced framework to disentangle the effects of photographic styles and the characteristics of different artistic movements, providing the researcher with an explainable, multi-dimensional characterization of a country's built heritage. In this direction, several deep learning frameworks have recently been applied to the art domain, proposing aesthetic measures, artistic style detectors, and salient object detection tools.
In addition to the results presented, we hope to contribute to attracting new scientists to the richness and the potential of the projects within the Wikimedia ecosystem. This data mining exploration has only scratched the surface of what is possible, and it underlines the importance of working under precise hypotheses. This calls for a closer dialogue with the digital humanities scholars who hold those hypotheses and for the adoption of interdisciplinary methodologies and teams.