An Ontological Approach for Unlocking the Colonial Archive

Cultural Heritage institutions have been exploring new ways of making their catalogues available in digital format. Recently, new approaches have emerged for reusing these contents and making them available for computational purposes. This work introduces a methodology to transform digital collections into Linked Open Data following best practices. The framework has been applied to Indigenous and Spanish colonial archives, based on the collection Relaciones Geográficas of Mexico and Guatemala provided by the LLILAS Benson Latin American Studies and Collections. The results of this work are publicly available. This work aims to encourage Cultural Heritage institutions to publish and reuse their digital collections using advanced methods and techniques.


INTRODUCTION
During the last decade, Cultural Heritage (CH) organisations have been publishing and providing access to their catalogues in digital format. Recently, new approaches have emerged for reusing these contents and making them available for computational purposes.

RELATED WORK
Cultural Heritage institutions have explored the benefits of providing computational access to their collections. In this context, Labs have emerged as a crucial element in organisations for reaching external researchers through the publication of open datasets, the use and publication of APIs, and the creation of collaborative research programs. Some examples are the Data Foundry at the National Library of Scotland, the LC Labs at the Library of Congress, and the British Library Labs.
The Semantic Web was introduced by Tim Berners-Lee as an extension of the traditional Web based on standards provided by the W3C, such as the Resource Description Framework (RDF), the Web Ontology Language (OWL), and the SPARQL Protocol and RDF Query Language (SPARQL), to provide machine-readable information [32]. In addition, the author suggested a 5-star deployment scheme for open data [56] that was later extended with two additional stars [26]. Linked Open Data (LOD) refers to the publication of information under permissive and open licenses, providing URIs to identify resources and including links to external repositories [2].
CH institutions have adopted the Semantic Web principles by making their contents available as LOD using known standard vocabularies such as CIDOC-CRM and linking to external repositories [1, 11, 13]. Libraries have adopted a leading role in this sense, providing their catalogues as LOD using different vocabularies. For instance, the Library of Congress has published a Linked Data Service that provides its contents using the Bibliographic Framework (BIBFRAME) as the main vocabulary. Other institutions, such as the National Libraries of France and Spain, have published their catalogues as LOD using a vocabulary based on the Functional Requirements for Bibliographic Records (FRBR) promoted by IFLA [15, 53]. FRBR has been used in other domains, such as Functional Requirements for Information Resources for datasets [10]. The British Library published its contents as LOD based on the Bibliographic Ontology (BIBO). Other initiatives such as Europeana and the Digital Public Library of America (DPLA) have adopted the Europeana Data Model (EDM) as the main vocabulary to describe their contents. Furthermore, reference models such as the Open Archival Information System (OAIS) were developed to specify how the systems that hold the information interact to manage the data objects. Within the OAIS model, it is paramount to implement the correct knowledge standardisation through the use of such controlled vocabularies to help maintain the long-term preservation of these digital collections [31].
The Smithsonian American Art Museum has made information about artists and works available as LOD using CIDOC-CRM to foster new opportunities for discovery, research, and collaboration [50]. The Museo del Prado in Spain has adopted CIDOC-CRM and FRBR as the metadata model to describe its collection. The Amsterdam Museum Linked Open Data set is a five-star Linked Data representation and comprises the entire collection of the Amsterdam Museum [12]. In addition, Linked Art is a community of museum and CH professionals collaborating to define a data model to describe art. National initiatives have explored the benefits of using LOD in Digital Humanities, such as the Linked Open Data Infrastructure for Digital Humanities (LODI4DH) in Finland [24], which has made available the WarSampo knowledge graph, a LOD service based on CIDOC-CRM for publishing data about the Second World War with a focus on Finnish military history [29]. Another approach is based on the Spanish Civil War and includes a photographic archive as LOD [48].
The Linked Open Data Cloud presents datasets from different domains (geography, media, linguistics, etc.) that have been published following the Linked Data principles. In addition, Wikidata provides a section that includes the SPARQL endpoints of LOD projects that have been linked from the platform. Table 1 shows an overview of the projects and vocabularies used to make digital collections and materials available as LOD using the Semantic Web principles.
The enrichment of the datasets with external repositories has become a challenge for the research community. The use of advanced techniques such as Named Entity Recognition (NER) and Entity Linking enables the recognition of entities in text [5, 30]. In this sense, repositories such as DBpedia and Wikidata have been extensively used by CH organisations to enrich their catalogues and to create links to resources such as authors and works. Wikibase is a proven and maintained open-source platform for generic knowledge base maintenance that has previously been tested in CH organisations [21, 28].
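As a minimal sketch of how such a recognition step might look in practice, the following Python fragment uses spaCy with its small Spanish model (es_core_news_sm, assumed to be installed separately) on an illustrative sentence; the recognised entities could then be linked to Wikidata or DBpedia in a subsequent Entity Linking step.

import spacy

# Assumes the Spanish model has been installed beforehand, e.g. with
# "python -m spacy download es_core_news_sm"; the sentence is illustrative.
nlp = spacy.load("es_core_news_sm")
doc = nlp("Hernando Cortés llegó a Tenochtitlan en 1519.")

for ent in doc.ents:
    # Each recognised entity is a candidate for Entity Linking against
    # repositories such as Wikidata or DBpedia.
    print(ent.text, ent.label_)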
Advances in technology have provided a wide variety of tools to transform the original sources into LOD, but also to enrich them with external repositories, reproduce the transformation process, and assess their quality. For example, OpenRefine is a tool for cleaning, transforming, and enriching data with external services. Many modules have been developed as Python packages, such as RDFLib to work with RDF, and spaCy, NLTK, and Gensim [46] for natural language processing purposes. With regard to reproducibility, Jupyter has emerged as a powerful tool that enables researchers to reproduce results [7]. In terms of quality, several approaches have proposed data quality criteria for LOD, as well as methods based on the use of SPARQL and Shape Expressions (ShEx) to describe and assess LOD quality [6, 16]. Regarding the storage of RDF, many organisations have used Virtuoso to provide a public SPARQL endpoint. Other initiatives are based on dump files and additional storage systems such as RDF4J and Jena. Although the W3C recommendation is to reuse standard vocabularies instead of creating new ones, Protégé enables researchers to create their own ontologies, as well as to import and modify previously generated ones.
Previous works have focused on the challenges for the next generation of Web technologies based on the Semantic Web and AI [22, 25, 40]. The combination of Machine Learning and the Semantic Web to facilitate the work of curators in CH organisations has been explored in the past [3]. In addition, previous works have measured the impact of the Optical Character Recognition (OCR) errors produced by digitisation systems on natural language processing (NLP) tasks [51].
Despite current technological innovations, the vast majority of the technologies used to describe knowledge remain rooted in Western understandings of the world and dominated by a white, patriarchal perspective [14, 47]. These challenges have been flagged up by CIDOC-CRM through the creation of Issue 530: Bias in Information [9], which points to the inability of models to describe the information of different world views. The absence of diverse resources and of non-Western ways of thinking, describing, and classifying results in the effect of scarcity bias. While this issue is increasingly recognised, it is still assumed that technologies such as AI and Machine Learning will be able to fill knowledge gaps in some way, whilst the research is carried out in isolated (often Anglo-centric) silos. This is known to result in workflows limited by the research objectives of creators that are usually embedded in Western systems [14], and commonly limited to the cosmovision of the Global North.

While the above-mentioned efforts provide an extensive demonstration of how to transform, publish, assess, and reuse LOD, to the best of our knowledge none of the previous approaches have considered the use of Indigenous and Spanish colonial archives to apply innovative research methods. Working with indigenous and colonial datasets from the region once known as Mesoamerica expands these efforts to a vast territory with shared cosmovisions and biopolitical relationships, interconnecting 364 indigenous languages and variants and at least 68 different cultures. This covers a large geopolitical territory from North America to parts of Central America and the Caribbean. This research uses as an example the Relaciones Geográficas de Nueva España (the Geographic Reports of New Spain). This collection is the result of a survey with thousands of pages describing the situation of indigenous peoples and Europeans at the time. Spanish and Indigenous nobles, rulers, administrators, and scholars participated in this effort, which included a rich view of the different ethnic groups and diverse cosmovisions held in this territory during the sixteenth century. This work will be useful in the efforts to diversify views in technological development, and for the CH community to identify best practices and guidelines regarding the publication and reuse of LOD datasets. This research might also open the door to the use of other advanced computational methods, such as Computer Vision (CV) approaches, to integrate a body of commonly non-accessible knowledge into the LOD and AI ecosystems, also helping in the creation of new understandings about this historical period.

The project includes several datasets from relevant organisations such as the LLILAS Benson library (US) and the General Archive of the Nation (Mexico). In particular, this example is based on the collection of Relaciones Geográficas de Nueva España belonging to Mexico and Guatemala held by the LLILAS Benson. As said, this is a corpus of historical documents and paintings (maps) regarded as one of the most important datasets for the history of Early Colonial Mexico and Guatemala. Compiled from 1577 to 1585, the dataset is the result of a questionnaire ordered by Philip II, the king of Spain at the time. In 50 questions, the mandate ordered the collection of all sorts of relevant information about people, ways of life, health, economic and cultural resources, languages, climate, social, economic, and military organisations, as well as available foods, plants, and animals, among many others. As such, the collection has become one of the major sources used by historians, archaeologists, and anthropologists interested in the history of Mexico and Latin America, and its importance cannot be overstated. Furthermore, written in a combination of Spanish with around 69 different indigenous languages, and containing geographic and historical information for thousands of towns and villages, the Relaciones have always been of interest to indigenous communities, yet until recently they have usually been available only to specialists. The LLILAS Benson dataset contains 81 items including the textual reports and their maps, and is available online under a CC0 license. The metadata of each record is provided as JSON, including all the information related to the item, such as title, source collection, publication date, description, and list of authors.

A Method to Transform a Cultural Heritage Digital Collection into RDF
Following previous approaches, the method proposed to transform a Cultural Heritage digital collection into RDF works in six steps: (i) identification of the original sources; (ii) extraction of metadata; (iii) data mapping to the main vocabulary; (iv) generation of RDF; (v) data enrichment; and (vi) quality assessment.
The identification of the original sources includes several aspects, such as how the data has been made available to the public and under which license. In general, datasets are available as dump files (e.g., zipped files), but organisations are starting to adopt APIs to expose data, enabling developers to access their contents and perform a wide range of scientific examinations and analyses. Some examples of APIs include the International Image Interoperability Framework (IIIF), the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and advanced initiatives such as SPARQL. With regard to licenses, datasets may be available under permissive and open licenses such as CC0 and CC-BY, but in other cases the license may not be clear or may be based on national regulations. For example, the Data Foundry at the National Library of Scotland provides datasets under a CC0 license. Additional aspects to consider when selecting a dataset include coverage (e.g., information about missing items or periods), data quality, trustworthiness (e.g., manual or automated metadata curation), and particular features such as the type of content (e.g., text, image, and map) and the metadata format used to describe the contents (ALTO, XML, Dublin Core, etc.). Most importantly, enabling access to the data through APIs can facilitate the implementation of interactive tools for engagement, citizen science, and crowdsourcing processes that can help promote findability and reusability.
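As a minimal sketch of this first step, the following fragment retrieves the JSON metadata of a single item over HTTP and inspects its licence; the endpoint URL and field names are hypothetical placeholders rather than a documented API.

import requests

ITEM_URL = "https://example.org/api/items/relaciones-geograficas/1"  # hypothetical endpoint

response = requests.get(ITEM_URL, timeout=30)
response.raise_for_status()
record = response.json()

# Checking the licence and basic descriptive fields helps decide whether the
# dataset is suitable for transformation and republication as LOD.
print(record.get("title"), record.get("license"))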
The following step corresponds to the extraction of metadata from the original sources. In general, and depending on the content, CH organisations provide the datasets using different formats such as plain text, comma-separated values (CSV), JavaScript Object Notation (JSON), and Extensible Markup Language (XML). Some examples are A Medical History of British India provided by the Data Foundry, Chronicling America, and the Museum of Modern Art (MoMA) [41]. More advanced approaches are based on RDF, a standard model for data interchange on the Web promoted by the W3C. RDF enables the definition of subject-predicate-object triples to add information to resources using predicates (e.g., Shakespeare is_the_author_of Hamlet). The datasets may be published as dump files or through a SPARQL endpoint. For instance, the Library of Congress provides bulk download files for its LOD datasets. Some examples of datasets providing a SPARQL endpoint include the Smithsonian American Art Museum and the World War I as Linked Open Data dataset from the Linked Data Finland platform. Each format can be read and analysed using a particular open-source Python or Java module. For instance, many packages have been published for Python: tabular data published as CSV can be analysed using pandas, XML can be read and analysed with a wide range of libraries, such as Pymarc for MARCXML, and RDFLib is a pure Python module for working with RDF that includes parsers, serializers, and a SPARQL implementation.
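The fragment below is a minimal sketch of this extraction step, assuming the metadata has been downloaded as one JSON file per record (as in the LLILAS Benson collection); the field names are illustrative and would need to be adapted to the actual records.

import json
import pathlib

import pandas as pd

rows = []
for path in pathlib.Path("metadata").glob("*.json"):  # hypothetical local folder of records
    with open(path, encoding="utf-8") as fh:
        record = json.load(fh)
    rows.append({
        "identifier": record.get("id"),
        "title": record.get("title"),
        "date": record.get("publication_date"),
        "creators": "; ".join(record.get("authors", [])),
    })

# A tabular view of the extracted fields makes the later mapping to the
# selected vocabulary easier to review.
df = pd.DataFrame(rows)
print(df.head())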
The data mapping step requires knowledge of several technologies (e.g., RDF and OWL) to understand the selected vocabulary and define the resources correctly. Each vocabulary may include different classes and properties to define the final dataset. For example, Figure 1 shows the main classes and properties defined in the EDM vocabulary. Other initiatives differ in the level of granularity used to describe the works, providing a hierarchy based on different classes. For instance, FRBR and RDA provide four classes (Work, Expression, Manifestation, and Item) to define the resources, while BIBFRAME provides two classes (Work and Instance). In both cases, the top-level class Work reflects a conceptual entity, and the remaining classes reflect the materialisation of the individuals, which can be physical or digital in nature. However, in some cases the separation into different classes can be inconvenient or even impossible to achieve. In case a new vocabulary is required, tools such as Protégé provide an editor to define the classes and properties.
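A minimal sketch of such a mapping, assuming EDM and Dublin Core as target vocabularies, is a simple lookup from source field names (illustrative here) to the corresponding property URIs:

from rdflib import Namespace
from rdflib.namespace import DC, DCTERMS

# The EDM namespace; edm:hasType is one of its properties.
EDM = Namespace("http://www.europeana.eu/schemas/edm/")

# Illustrative source field names mapped to target properties.
FIELD_TO_PROPERTY = {
    "title": DC.title,
    "description": DC.description,
    "publication_date": DCTERMS.issued,
    "type_of_content": EDM.hasType,
}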
The generation of RDF is based on the original dataset and the data mapping. Tools such as OpenRefine enable the mapping between an original dataset provided as a CSV or JSON file and the selected vocabulary through its interface. Each metadata column is mapped to a property defined by the vocabulary. For instance, a title provided as a column in a CSV file is mapped to the Dublin Core property dc:title. OpenRefine supports GREL to manipulate strings and enables a preview of the transformation before the final result is produced. Other approaches to create RDF are based on Python and Java modules such as RDFLib and Jena.
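As a minimal sketch of this step with RDFLib, the following fragment builds a small graph describing one item with EDM and Dublin Core; the resource URI and literal values are illustrative rather than the identifiers actually minted for the collection.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")

g = Graph()
g.bind("edm", EDM)
g.bind("dc", DC)

cho = URIRef("https://example.org/resource/relaciones/1")  # hypothetical URI pattern
g.add((cho, RDF.type, EDM.ProvidedCHO))
g.add((cho, DC.title, Literal("Relación de Teozacoalco y Amoltepeque", lang="es")))  # illustrative title
g.add((cho, DC.language, Literal("es")))

print(g.serialize(format="turtle"))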
Data enrichment is performed using external repositories. In terms of how the links are created, OWL provides the property owl:sameAs to define external links. Other approaches are based on the creation of particular properties, as in the case of Wikidata, to link the resources provided by a dataset (e.g., authors and works). Additional datasets can be used for different domains, such as the Virtual International Authority File (VIAF) for authors and GeoNames for geographic locations. While OpenRefine includes reconciliation services for knowledge bases such as Wikidata, there are additional tools based on Entity Linking methods to assign a unique identity to entities mentioned in text. Previous works are based on the DBpedia and Wikidata knowledge bases [30, 35].
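The fragment below is a minimal sketch of how such links can be added with RDFLib; the local URI and the Wikidata and GeoNames identifiers are placeholders rather than verified matches.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
place = URIRef("https://example.org/resource/place/oaxaca")  # hypothetical local resource

# Placeholder external identifiers; real links would come from reconciliation
# or Entity Linking against Wikidata and GeoNames.
g.add((place, OWL.sameAs, URIRef("http://www.wikidata.org/entity/Q12345")))
g.add((place, OWL.sameAs, URIRef("https://sws.geonames.org/1234567/")))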
Previous approaches have proposed criteria to assess the quality of LOD covering several aspects. In that sense, SPARQL can be used to query the data and test its quality (e.g., the number of resources typed as Person or the list of properties used by a particular class) [16]. For instance, Listing 1 shows an example of a SPARQL query to retrieve the number of resources typed as edm:Place. Other approaches are based on advanced assessment methods such as ShEx, which provides a human-readable text format to assess and describe RDF [6]. A ShEx schema provides node and triple constraints defining the structure of a resource as RDF data. Listing 2 shows an example of a ShEx schema to assess a resource described using EDM as vocabulary and typed as edm:Place.
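Since the listings themselves are not reproduced here, the following fragment is a minimal sketch, in the spirit of the query described in Listing 1, of how such a count can be run with RDFLib over a local dump; the file name is an assumption.

from rdflib import Graph

g = Graph()
g.parse("relaciones.ttl", format="turtle")  # hypothetical dump file

QUERY = """
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
SELECT (COUNT(?place) AS ?total)
WHERE { ?place a edm:Place . }
"""

for row in g.query(QUERY):
    print("edm:Place resources:", row.total)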
Listing 1. A SPARQL query to retrieve the number of resources using EDM as vocabulary and typed as edm:Place.

Listing 2. A ShEx schema to validate a place described using EDM as vocabulary.

Compared to other approaches, and as an example of application of the method described in Figure 2, the transformation into RDF of the collection Relaciones Geográficas de Nueva España is based on the following features:
• the metadata has been extracted and mapped to EDM as the main vocabulary by means of OpenRefine. EDM has been selected since it is an international initiative used by relevant institutions and the original sources are based on bibliographic information. Table 2 shows the URL patterns used to create the RDF resources according to the EDM vocabulary, and Figure 3 shows the representation of a record using the EDM vocabulary.
• the dataset has been enriched with external repositories such as the Getty Vocabularies, Wikidata, and GeoNames. For instance, Getty has been used for the properties dc:coverage and edm:hasType to enrich the final dataset with entities related to centuries and type of content. In addition, geographic locations have been linked to GeoNames and Wikidata using the property owl:sameAs.
• a data quality assessment has been performed using ShEx, since it provides a human-readable text description of the datasets while also being a recent method to assess RDF. This step includes two tasks: (i) automatic extraction of the ShEx schemas using sheXer [17]; and (ii) using the Python module PyShEx to test the ShEx schemas against a random sample of the RDF data (see the sketch after this list).
• the dataset has been described using the Vocabulary of Interlinked Datasets (VoID), which is concerned with metadata about RDF datasets [55]. Table 3 shows an overview of the final dataset.
• a code repository is available at GitHub including all the scripts developed and the final dataset. Following guidelines and best practices, a collection of Jupyter Notebooks, reproducible and runnable on the cloud, is provided as part of the project to show how the final dataset can be explored.
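The fragment below is a minimal sketch of the PyShEx validation task mentioned above; the schema, the dump file name, and the focus node are illustrative and do not reproduce the actual schemas extracted with sheXer.

from pathlib import Path

from pyshex import ShExEvaluator

SHEX = """
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<http://example.org/shapes/Place> {
  a [edm:Place] ;
  dc:title xsd:string?
}
"""

results = ShExEvaluator(
    rdf=Path("relaciones.ttl").read_text(),             # hypothetical dump file (Turtle)
    schema=SHEX,
    focus="https://example.org/resource/place/oaxaca",  # hypothetical resource to validate
    start="http://example.org/shapes/Place",
).evaluate()

for r in results:
    print("conforms" if r.result else f"fails: {r.reason}")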

Exploring and Reusing the Dataset as LOD
This section introduces examples of reuse based on the LOD repository created in this work. In addition, a detailed description of how the data can be used to generate new knowledge is presented.

Geographic Locations: GeoNames and Wikidata
Previous works and projects have explored the use of a historical gazetteer for historical places, such as the DECM Historical Gazetteer, which includes the Relaciones Geográficas [38]. Other approaches are based on the use of reproducible Jupyter Notebooks to provide a zoomable and interactive map [7]. This example reuses the data provided as RDF to query the dataset and retrieve the latitude and longitude of the resources typed as edm:Place. Listing 3 shows the SPARQL query to retrieve the geographic information. Then, the data is represented as a map using the Python module folium, as shown in Figure 4. Each point on the map provides a link to Wikidata. The representation has been included as a Jupyter Notebook. A refined version of the map, which enables the user to select the subjects, is shown in Figure 5. A user-friendly interface has been developed that shows the works in the dataset according to their location (edm:Place) and subject (dc:subject) on an interactive map. Depending on the selected topics, each point on the map presents the number of records. By zooming in we can see information about the work as well as the links to Wikidata and GeoNames. The following steps have been followed: (i) a CSV file has been generated with the data for visualisation through an Extract, Transform and Load (ETL) process; (ii) the different topics used for faceting the data have been retrieved; and (iii) the data and topics have been loaded into a map using Leaflet (an open-source JavaScript library), OpenStreetMap (a collaborative project to create free and editable maps), and Mapbox to customise the map.
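As a minimal sketch of the visualisation described above, the fragment below runs a SPARQL query over a local dump and plots the results with folium; the dump file name and the use of the W3C Basic Geo (wgs84_pos) properties for coordinates are assumptions rather than the exact content of Listing 3.

import folium
from rdflib import Graph

g = Graph()
g.parse("relaciones.ttl", format="turtle")  # hypothetical dump file

QUERY = """
PREFIX edm:   <http://www.europeana.eu/schemas/edm/>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?place ?lat ?long
WHERE { ?place a edm:Place ; wgs84:lat ?lat ; wgs84:long ?long . }
"""

# Centre the map roughly on the region covered by the collection.
m = folium.Map(location=[19.0, -97.0], zoom_start=6)
for place, lat, long in g.query(QUERY):
    folium.Marker([float(lat), float(long)], popup=str(place)).add_to(m)

m.save("relaciones_map.html")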

Generating New Knowledge from Old Sources -Reutilisation and Third Party Use
While the method presented here is capable of transforming available metadata into LOD, the research community is working to achieve the same results for information contained within these collections that, to a certain extent, remains "hidden", is only available to highly skilled specialists, or is difficult to process through traditional means of research due to the sheer volume of information. Current approaches using AI techniques in the Digital Humanities have proven to be successful in this regard, automatically identifying, cross-referencing, and analysing very large volumes of data of historical interest from these collections using Natural Language Processing techniques, and are achieving an acceleration in the generation of knowledge [39]. However, although the nature of historical information is to be linked (for example, historical figures such as the conquistador Hernando Cortés tend to be mentioned in many different documents, or many documents might relate to one place), researchers can only make these connections by painstakingly reading thousands of pages. These connections might then be recorded in a book or article, and they remain "locked" again until someone else discovers them either in such publications or in the historical documents. In this sense, linking data from information contained in digital collections beyond metadata has the potential to unleash new discoveries and achieve the identification of previously unseen patterns.

In the case of pictorial documents, further challenges might also emerge. Before the arrival of Europeans in the Americas, Mesoamerican cultures used an expression and communication system based on orality and paintings. On the one hand, stories or events, mythical and theological understandings, genealogies, and history were preserved through the tlamatinime, or wise men and women, who used orality, music, and dance as mnemonic devices, preserving the community's memory. On the other, this knowledge was also depicted through the creation of images in which pictography, ideographic symbolism, and phonetic mediation were combined with the use of composition, colour, and the spatiality of the elements in the paintings to tell a story [27]. This pictorial discourse existed in parallel with the oral one, making one think in images rather than words, although the codices could also be used as part of oral performances. Therefore, the Nahua (Aztec) codices, for instance, express very complex sets of information. For example, research by Garduño-Monroy [20] has showcased this by identifying the complexity in the depiction of natural processes and events in the Telleriano-Remensis Codex. The document recorded earthquakes through the amalgamation of the glyphs for movement (ollin) and earth (tlalli). Furthermore, the glyph tlalli is also used as a quantitative value measuring seismic intensity (see Figure 6). The fact that the Aztecs had a form of Richter scale is widely unknown, but even if technologies such as Geographic Information Systems can be used to record this kind of seismic data, like many other current technologies they are not ready to ingest or make use of indigenous knowledge and cosmovisions (views and ways of understanding the world) such as the ones expressed by the Mesoamerican codices.
Another example is the maps of the Relaciones Geográficas. These paintings were the result of the Relaciones questionnaire asking for a map of each of the towns it reached. While the Spanish crown expected maps in the cartographic tradition of the time, what it received in response was a combination of the Mesoamerican spatial tradition of painting codices with the new European conceptions of space (see Figure 7). These maps contain important information that, beyond historical interest, can have legal implications and is still used today by indigenous communities for religious or social purposes, as well as to defend their rights and lands [49]. The maps contain a wide range of features such as toponymic glyphs painted in the Mesoamerican style, place names and annotations already in Latin script, and other iconography shared by many of them. As such, the historical and social information contained in these maps can be of great importance. However, this knowledge remains, to a certain extent, invisible and accessible to just a few specialists.
Regarding textual sources, one of the main benefits of automating entity recognition through machine learning processes is that it can help expand access to information in archives, whilst also diversifying it [33]. In this realm, our work has previously contributed to the creation of novel approaches combining methods from Corpus Linguistics, Natural Language Processing, and Geographic Information Sciences, where the method called Geographical Text Analysis enables the automated identification and analysis of historical information at a large scale (i.e., the identification of entities within thousands of documents that would otherwise take a lifetime to explore) [37]. The consideration of information classification in this research has opened the door for an ontological change, in both a philosophical and a computational sense. In the same way, research in the Unlocking the Colonial Archive project is looking to extend this kind of work to Mesoamerican visual language and cosmovisions. For this, there is previous evidence of the benefits of image captioning approaches [23, 45], where CV can help generate descriptions of the visual sources. However, CV methodologies are primarily successful with natural images, which are pre-iconologic. Previous studies in iconology by Panofsky [44] have highlighted three core layers of analysis that are relevant when images are approached through CV. First, there is a pre-iconographical description, which deals with natural subject matters such as factual or expressional elements. On a second layer, an analysis takes place concerning the allegories and stories of the subject; this is the case with artistic paintings, which display representations of the world. Finally, on a third layer, there is an iconological interpretation that aims to flesh out the intrinsic meaning and the values of symbols. CV approaches that engage with Mesoamerican paintings and writings have to engage with this third layer of interpretation, where, as in the case of the historical texts, this knowledge is invisibilized and accessible to just a few specialists.
The application of multimodal CV to integrate vision and language is an ongoing research challenge [8, 19, 54]. Following previous approaches, we have begun experimenting with CV techniques to detect key iconography that describes cultural elements and visual landmarks, as well as the glosas, or small textual inscriptions, that sometimes accompany these paintings and codices. In this case, we first carried out an experiment to create artistic representations using map paintings from Mapilu, a collection of maps from the National Archive of Mexico, and the Relaciones collection from LLILAS Benson. This experiment aimed to detect and replicate key iconography elements such as water, roads, and a type of settlement called estancias. We used the Pix2Pix model to translate diverse locations from current Google Maps satellite images into the representation of the maps from our collections (see Figures 9 and 10). A second experiment used the YOLOv4 model to automatically detect estancias, architectural features such as churches and houses, and calligraphic elements. By identifying these, we will have the ability to produce data objects that can be made reusable for machine and human use. For instance, key settlements and historical place names can be linked into historical gazetteers and become part of the scientific examination process, as well as data that can be used by indigenous communities. Furthermore, by being able to identify and compare similar iconography and representations from hundreds of maps, historians and archaeologists might not only be able to locate specific features and archaeological sites in the landscape, but also understand how this tradition of "writing by painting" changed when it intertwined with the European cosmogony. In this way, connecting these kinds of information in LOD opens a variety of possibilities. This approach will allow us to identify features of interest in a way that would either take a lifetime or that was impossible before due to the scale of the collections. This will not only have a substantial impact on research in fields such as historical and human geography, archaeology, history, and anthropology, but it will also help computer science diversify knowledge while expanding the accessibility of these datasets for indigenous communities.
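As an illustration of how such detections might be run, the fragment below loads YOLOv4 weights with OpenCV's DNN module and prints candidate boxes on a map image; the configuration, weights, class indices, and image file are hypothetical stand-ins for the models and data actually used in the experiments.

import cv2
import numpy as np

# Hypothetical files: a YOLOv4 network fine-tuned on map features such as
# churches, houses, estancias, and calligraphic elements.
net = cv2.dnn.readNetFromDarknet("maps_yolov4.cfg", "maps_yolov4.weights")

image = cv2.imread("relacion_map.jpg")  # hypothetical digitised map
height, width = image.shape[:2]

blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:
            # YOLO returns box centres and sizes relative to the input image.
            cx, cy, w, h = detection[:4] * np.array([width, height, width, height])
            print(class_id, confidence, cx - w / 2, cy - h / 2, w, h)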

Discussion
Computational technologies, including the Web, the Semantic Web, and Artificial Intelligence, have benefited a wide range of communities by enabling access, accelerating knowledge production, and helping preserve the provenance and sustainability of information. However, a wide range of this world's knowledge remains invisible to current technologies. Decoloniality has aimed to make such knowledge visible in order to displace Western rationality as the only "framework and possibility of existence, analysis and thought" [36, p. 17]. Such invisibility of data is related not only to the technologies and representations of knowledge but also to the tools that make use of these sets of information. Therefore, it is important to begin including counter and alternative narratives in frameworks within AI and other widely spread computational tools. This includes technologies such as Geographic Information Systems, as well as ontologies and controlled vocabularies, to diversify cosmovisions in a way that becomes applicable to our current socio-technical systems. For example, EDM is a vocabulary used by relevant institutions that provides a data model to group resources using views and to describe authors, places, and subjects, among other classes. However, this approach has limitations, and the data mapping could be improved in several ways. For instance, the roles provided as text in the original sources (e.g., signer, artist, interpreter, collector, scribe, contributor, witness, etc.) could be mapped to external vocabularies such as the BnF Roles to enable a better representation and description of the data. In addition, subjects provided as strings in the original sources have been described using the property dc:subject, and they could be improved by using the Simple Knowledge Organization System (SKOS) vocabulary. Similarly, locations are provided as strings describing the country, state, and city (e.g., Mexico (country)| Oaxaca (state)| Santa María Peñoles (city)), and they could be separated to provide individual resources.
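As a small sketch of how the pipe-separated location strings mentioned above could be split before minting individual resources, the following fragment parses the example string into its components; the regular expression simply reflects the pattern shown in that example.

import re

raw = "Mexico (country)| Oaxaca (state)| Santa María Peñoles (city)"

parts = {}
for chunk in raw.split("|"):
    match = re.match(r"\s*(.+?)\s*\((\w+)\)\s*$", chunk)
    if match:
        name, level = match.groups()
        parts[level] = name

print(parts)  # {'country': 'Mexico', 'state': 'Oaxaca', 'city': 'Santa María Peñoles'}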
With regard to the enrichment, and due to the limited number of resources, locations and authors have been manually identified in external datasets including Wikidata and GeoNames. In addition, centuries provided as text in the original sources have been linked to the appropriate resource in the Getty vocabularies using the property dc:coverage. This step could be improved by applying automated techniques such as Entity Linking.
With regard to the data quality assessment, the ShEx schemas automatically generated were slightly updated in terms of cardinalities. The tool sheXer enables the use of several parameters, such as the number of resources to analyse when generating the ShEx schemas. A low number of resources may result in an incomplete or incorrect ShEx schema.
Finally, this work is just the start, and these are only a few observations, but if scientists are to provide methods to challenge technical and cultural Western hegemony, there is a requirement to collect and analyse new forms of knowledge and to provide new imaginative ways in which these data objects can be engaged with [14, p. 53]. While this is our aim, larger conversations and work with Global South scholars and communities need to be carried out in order to imagine and create technologies and datasets that are more inclusive of different worldviews. As mentioned before, we have started to work in this area, particularly focusing on the inclusion of more diverse data in the AI and LOD ecosystems, including the creation of decolonial datasets and the examination of scarcity biases within ontologies. Our work presents novel means of 'unlocking' non-Western conceptual knowledge and ways of understanding space that have remained inaccessible to current state-of-the-art computational methods, and it helps bridge them to further research pipelines such as crowdsourcing and citizen science methods, geospatial analysis, and the generation and expansion of data modelling in LOD.

CONCLUSIONS
In the last decade, there has been a growing interest in making the digital collections published by CH organisations available as LOD. Based on previous work, we defined a method, described in Section 3, for making these collections available as LOD. The method was applied to an important historical collection published by LLILAS Benson. The discussion explains why this research is important, and our evaluation showed several examples of reuse that can be useful to encourage other institutions, particularly those with colonial data, to publish their collections as LOD. Their reuse, enrichment, and assessment are becoming increasingly relevant to make the content visible and accessible to both humans and computers, whilst opening the door to the diversification of data and of a wide range of knowledge from the Global South, as well as to the creation of decolonial computational methods.
Future work to be explored includes the application of the method to additional datasets, the inclusion of the full text, and the use of additional vocabularies to describe the contents. In this sense, we are aiming to further develop the ontology for annotation based on previous work carried out by El Taller, a group of scholars and nahuatlatos formed in 1978 [18]. Their research engaged with multimodal understandings in the depiction of toponyms, personal names, architectural features, and artistic and phonetic-pictorial elements, among many others, in these kinds of documents. Finally, we believe that our research and workflow will help in the integration of knowledge from the Global South into computational research, as well as into the AI and LOD ecosystems.

A LLILAS BENSON CASE STUDY: THE SIXTEENTH-CENTURY GEOGRAPHIC REPORTS OF NEW SPAIN

This section presents the method used to transform a Cultural Heritage digital collection into RDF and the datasets provided by the LLILAS Benson as part of the project Unlocking The Colonial Archive: Harnessing Artificial Intelligence for Indigenous and Spanish American Collections, funded by the AHRC, UK Research and Innovation (UKRI), and the US National Endowment for the Humanities (NEH).

Fig. 1. Representation of the main classes for resources in the EDM vocabulary. EDM is made available by Europeana.

Fig. 2. Methodology to transform the original sources into RDF using open source tools.

Fig. 3. Representation of a record using EDM as the main vocabulary.

Table 3. Overview of the Final Dataset
Description            N.
N. of items
N. of triples          3767
N. of properties       34
N. of classes          7
N. of external links   125

Fig. 4. Map representing the geographic locations provided by the original sources. This map has been generated using the Python module folium.

Listing 3. SPARQL query to retrieve the geographic information in the dataset.

Fig. 5. Interactive map representing the geographic locations, in which the user is able to select the subjects.

Fig. 9. Annotation and experiments of a map data sample for the Pix2Pix model.

Fig. 10. Image annotation for translation of the map of Tezoacalco for the Pix2Pix model.

Table 1. Overview of CH Projects Published as LOD