Universal Knowledge Graph Embeddings

A variety of knowledge graph embedding approaches have been developed. Most of them obtain embeddings by learning the structure of the knowledge graph within a link prediction setting. As a result, the embeddings reflect only the structure of a single knowledge graph, and embeddings for different knowledge graphs are not aligned, e.g., they cannot be used to find similar entities across knowledge graphs via nearest neighbor search. However, knowledge graph embedding applications such as entity disambiguation require a more global representation, i.e., a representation that is valid across multiple sources. We propose to learn universal knowledge graph embeddings from large-scale interlinked knowledge sources. To this end, we fuse large knowledge graphs based on the owl:sameAs relation such that every entity is represented by a unique identity. We instantiate our idea by computing universal embeddings based on DBpedia and Wikidata, yielding embeddings for about 180 million entities, 15 thousand relations, and 1.2 billion triples. We believe our computed embeddings will support the emerging field of graph foundation models. Moreover, we develop a convenient API to provide embeddings as a service. Experiments on link prediction suggest that universal knowledge graph embeddings encode better semantics compared to embeddings computed on a single knowledge graph. For reproducibility, we provide open access to our source code and datasets.

While pretrained models for a few knowledge graphs (KGs) are available, their embedding spaces are not aligned, i.e., the same entities have different representations across different knowledge graphs. As a result, the usability of such embeddings is often limited to downstream tasks on the KG they were trained on [9]. However, a growing number of real-world applications of KG embeddings (e.g., graph foundation models [8, 10]) require entities to have a representation that integrates information from multiple sources.
The need for such unified entity representations has recently motivated several works [3, 14-16, 21, 24]. Some of these approaches are tailored towards multi-lingual KG embeddings, i.e., the task of computing aligned embeddings between multiple language versions of the same KG by relying on the available owl:sameAs links [3, 14, 21]. Other approaches employ a bootstrapping strategy based on the matching scores between entities during training [15] or use additional information on entities such as attribute embeddings [16, 24]. Although current alignment approaches for KGs have shown promising results on benchmark datasets, they inherently suffer from scalability issues. This is corroborated by the lack of pretrained KG embeddings computed with the aforementioned approaches for large datasets such as Wikidata [18] and DBpedia [1]. Moreover, most entity alignment approaches can only handle two KGs at a time and do not assign the same embedding vector to matching entities.
In this paper, we merge a given set of KGs into a single KG to compute embeddings that capture comprehensive information about each entity. We call these embeddings universal knowledge graph embeddings. By assigning a unique ID to all matching entities, we not only reduce memory consumption and computation costs but also tackle KG incompleteness, a well-known issue in the research community [19, 25, 26]. For instance, Wikidata and DBpedia contain 360 and 138 triples about Iraq, respectively. Consequently, a traditional KGE model trained on either KG alone can only capture incomplete information about their shared entities. Our approach mitigates this limitation by integrating information from both KGs into the embeddings.
To quantify the quality of our embeddings, we apply our approach to four KGE models, i.e., DistMult [23], ComplEx [17], QMult [6], and ConEx [5], and evaluate their performance on link prediction. Overall, our results suggest that the benefits of using the additional information derived from sameAs links are particularly noticeable for ConEx. We use the latter to compute and provide high-quality unified embeddings for the most populous KGs of the Linked Open Data (LOD) cloud, i.e., DBpedia and Wikidata. The merged graph encompasses about 180 million entities, 15 thousand relations, and 1.2 billion triples. Moreover, we develop an API with convenient methods to make the computed embeddings easily accessible.

RELATED WORK
A knowledge graph embedding (KGE) model maps a knowledge graph (KG) into a continuous vector space, commonly by solving an optimization problem that aims to preserve the structural information of the input KG. For example, translational distance models such as TransE [2] compute embeddings by modelling each triple as a translation between its head and tail entities. Specifically, TransE represents both entities and relations as vectors in the same semantic space and learns embeddings by minimizing the distance between h + r and t for every triple (h, r, t) in the considered KG. DistMult [23] adopts the scoring technique of TransE but uses multiplications to model entity and relation interactions. Other types of KG embedding models include ComplEx [17], ConEx [6] and RESCAL [11]. ComplEx and ConEx model entities and relations as complex vectors (i.e., with real and imaginary parts) to handle both symmetric and antisymmetric relations. As ComplEx cannot handle transitive relations (see Sun et al. 2019), ConEx further improves on ComplEx by applying a 2D convolution operation on the complex-valued embeddings of head entities and relations. By associating each relation with a matrix, RESCAL captures pairwise interactions between entities and is regarded as one of the most expressive models [20]. As KGs grow in size, computation-efficient algorithms are required to train embeddings for KGs consisting of millions of entities and billions of triples. Zheng et al. 2020 developed DGL-KE, an open-source package that employs several optimization techniques to accelerate training on large KGs. For example, they partition a large KG to perform gradient updates on each partition and regularly fetch embeddings from other partitions, which involves a significant communication overhead.
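To make the scoring functions concrete, the following is a minimal pure-Python sketch of how TransE, DistMult, and ComplEx score a single triple (the actual models are trained with negative sampling, mini-batching, and regularization, which are omitted here):

```python
import math

def transe_score(h, r, t):
    # TransE: a triple (h, r, t) is plausible when h + r is close to t,
    # so the score is the negative Euclidean distance.
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def distmult_score(h, r, t):
    # DistMult: trilinear (multiplicative) interaction of h, r, and t.
    # Symmetric in h and t, hence it cannot model antisymmetric relations.
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    # ComplEx: embeddings are complex vectors; taking the real part of the
    # trilinear product with the conjugate of t handles antisymmetry.
    return sum((hi * ri * ti.conjugate()).real for hi, ri, ti in zip(h, r, t))

h, r, t = [1.0, 0.5], [0.2, -0.1], [1.2, 0.4]
print(transe_score(h, r, t))   # h + r equals t here, so the distance is 0
print(distmult_score(h, r, t))
```

Note how swapping h and t leaves the DistMult score unchanged but, in general, changes the ComplEx score, which is exactly the symmetry limitation discussed above.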
Although KG embeddings can benefit downstream tasks such as link prediction and KG completion, their successful application is often limited to the KGs they were trained on. As a result, applications to other tasks, such as entity resolution across two or more KGs, require aligned KG embeddings.

UNIVERSAL KNOWLEDGE GRAPH EMBEDDINGS

Preliminaries
A KG G can be regarded as a set of triples G ⊆ E_G × R_G × E_G, where E_G and R_G represent its sets of entities and relations, respectively. When there is no ambiguity, we simply write E and R. Let G_1, . . ., G_n denote n KGs (in an arbitrary order), e.g., DBpedia, Wikidata, Freebase. Alignments between G_i and G_j are given by sameAs links. We use these links to fuse the given KGs as described in the next section.

Graph Fusion and Embedding Computation
In this work, we fuse all KGs G_i for i = 1, . . ., n into a single KG G* in which all aligned entities are represented by a unique ID. Algorithm 1 describes how the fusion is carried out. First, the algorithm chooses a reference KG (in this work, we select G_1) as the initial set of triples for G* (line 1). Then, it iterates over the rest of the KGs (line 2) and adds their triples (line 10). In this process, entities that are already present in G* via sameAs links are renamed to their identifiers in G*, so that each group of matching entities is represented by a single ID.
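The fusion step can be sketched as follows (a minimal illustration, not a reproduction of Algorithm 1: triples are assumed to be (head, relation, tail) tuples, and sameAs links are assumed to be given as a mapping from entities of the further KGs to their matching entity in the reference KG):

```python
def fuse(reference_kg, other_kgs, same_as):
    """Fuse KGs into a single graph where matched entities share one ID.

    reference_kg: set of (h, r, t) triples used as the initial G*.
    other_kgs: list of triple sets for the remaining KGs.
    same_as: dict mapping an entity of a further KG to its matching
             entity (the unique ID) in the reference KG.
    """
    fused = set(reference_kg)      # G* starts as the reference KG
    for kg in other_kgs:           # iterate over the rest of the KGs
        for h, r, t in kg:
            # Rename entities already present in G* via sameAs links;
            # unmatched entities keep their original identifier.
            fused.add((same_as.get(h, h), r, same_as.get(t, t)))
    return fused

# Toy example: one DBpedia triple and its Wikidata counterpart.
dbpedia = {("dbr:Iraq", "dbo:capital", "dbr:Baghdad")}
wikidata = {("wd:Q796", "wdt:P36", "wd:Q1530")}
links = {"wd:Q796": "dbr:Iraq", "wd:Q1530": "dbr:Baghdad"}
fused = fuse(dbpedia, [wikidata], links)
```

In the toy example, both triples end up sharing the same entity IDs, so an embedding model trained on the fused graph sees the information from both sources for each shared entity.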

Knowledge Graphs
We downloaded and preprocessed the September 2022 version of DBpedia [1] and the March 2022 version of Wikidata [18]. In this work, we only consider the English version of DBpedia and its external links to Wikidata, i.e., sameAs links. The preprocessing step consists of removing files that are not in triple format as well as literals.
DBpedia (G_1). DBpedia is the most popular and prominent KG in the LOD cloud. It is automatically created from Wikipedia content, such as infobox tables, categorizations, and links to external websites. Since DBpedia serves as the hub of the LOD cloud, it contains many links to other LOD datasets such as Freebase, Caligraph, and Wikidata.
Wikidata (G_2). Wikidata is a community-created knowledge base providing factual information to Wikipedia and other projects of the Wikimedia Foundation. As of April 2022, Wikidata contains over 97 million items and 1.37 billion statements. Each item page contains labels, short descriptions, aliases, statements, and site links. Each statement consists of a claim and an optional reference, and each claim consists of a property-value pair and optional qualifiers.
Statistics of Knowledge Graphs. Table 1 presents the statistics of DBpedia and Wikidata after our preprocessing step. MERGE is obtained by applying Algorithm 1 to {DBpedia, Wikidata}. Ideally, the sum of the numbers of entities of DBpedia and Wikidata should equal the number of entities of MERGE plus the number of sameAs links. However, some entities in DBpedia were matched with multiple entities in Wikidata via sameAs links, and vice versa. As can be seen in the table, this causes the equality not to hold.

EXPERIMENTS
We conduct our experiments to answer one fundamental question: "How do our universal knowledge graph embeddings compare to embeddings from traditional KGE approaches?" To this end, we set up a link prediction task where we compare two independently trained instances of the same embedding model (see the next sections for more details).

Evaluation Setup
Evaluation Datasets. We conduct our experiments on subsets of DBpedia and Wikidata due to the computational complexity of our evaluation metrics. Specifically, we randomly select 1% of the entities in DBpedia that share sameAs links with Wikidata, and then obtain their 1-hop neighborhood together with the corresponding relation types. We then analogously compute the corresponding subset of Wikidata by using the entities identified by the 1% initially selected in DBpedia. The samples we obtain are then randomly split into training and test datasets. Overall, we obtain five datasets for our experiments: the training and test splits of the DBpedia sample, the training and test splits of the Wikidata sample, and MERGE (the merge of DBpedia and Wikidata using Algorithm 1). For the sake of clarity, we use the notations DBpedia+ and Wikidata+ to refer to MERGE depending on whether we are evaluating embedding models on DBpedia or Wikidata. Note that the splits are performed in such a way that all entities and relation types in the test datasets also appear in the training datasets. This ensures that we do not encounter out-of-vocabulary entities and relations at inference time. The sizes of the splits are specified in Table 2. The average degree, abbreviated as Deg., represents the average number of edges connected to an entity.
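The sampling procedure above can be sketched as follows (a simplified illustration: the KG is assumed to be a list of (head, relation, tail) triples, and the sameAs-linked entities a list of entity IDs; unlike this sketch, the actual splits additionally guarantee that every test entity and relation also occurs in training):

```python
import random

def sample_eval_subset(kg, linked_entities, rate=0.01, test_ratio=0.2, seed=42):
    """Sample a fraction of the sameAs-linked entities, extract their
    1-hop neighborhood, and split the resulting triples into train/test."""
    rng = random.Random(seed)
    k = max(1, int(len(linked_entities) * rate))
    seeds = set(rng.sample(sorted(linked_entities), k))
    # 1-hop neighborhood: every triple in which a seed entity participates.
    subset = [(h, r, t) for h, r, t in kg if h in seeds or t in seeds]
    rng.shuffle(subset)
    cut = int(len(subset) * (1 - test_ratio))
    return subset[:cut], subset[cut:]

# Toy example with rate=1.0 so that both linked entities become seeds.
kg = [("e1", "p", "e2"), ("e2", "p", "e3"), ("e3", "p", "e4")]
train, test = sample_eval_subset(kg, ["e1", "e2"], rate=1.0, test_ratio=0.34)
```

The triple ("e3", "p", "e4") is excluded because neither of its entities is a seed, mirroring how the evaluation subsets are restricted to the 1-hop neighborhood of the selected entities.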
Metrics. We use two standard metrics to evaluate KGEs: hits@k (H@k) and mean reciprocal rank (MRR). Formally, let G be a knowledge graph, i.e., a set of triples. We denote by rank[e_1 | r, e_2] the rank of the score of e_1 given the relation r and the tail entity e_2 among the set of all scores {score(x | r, e_2) s.t. x ∈ E_G}. Similarly, rank[e_2 | e_1, r] denotes the rank of the score of e_2 given the head entity e_1 and the relation r among {score(x | e_1, r) s.t. x ∈ E_G}. We define the metrics as

Hits@k = (1 / (2|G|)) Σ_{(e_1, r, e_2) ∈ G} (1[rank[e_1 | r, e_2] ≤ k] + 1[rank[e_2 | e_1, r] ≤ k]),

MRR = (1 / (2|G|)) Σ_{(e_1, r, e_2) ∈ G} (1 / rank[e_1 | r, e_2] + 1 / rank[e_2 | e_1, r]),

where 1[·] denotes the indicator function.

Hardware. The entire DBpedia and Wikidata datasets for which we provide embeddings as a service were processed on a virtual machine (VM) with 128 CPUs (AMD EPYC 7742 64-Core Processor) and 1 TB RAM. The computation of universal knowledge graph embeddings was carried out using the DICE embedding framework [7].

Table 3: Link prediction results in terms of mean reciprocal rank (MRR) and hits@k (H@k). We compare the performance of each embedding model on the two types of datasets: single KG (DBpedia, Wikidata) and enriched KG (DBpedia+, Wikidata+). The bold values correspond to the best performance achieved row-wise. All models use 32 embedding dimensions and have approximately the same number of parameters.
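The ranking-based metrics defined in the Metrics paragraph can be sketched as follows (a minimal illustration with an arbitrary scoring function; each test triple contributes one head-prediction rank and one tail-prediction rank):

```python
def rank_of(scores, target):
    # Rank = 1 + number of candidate entities with a strictly higher score.
    return 1 + sum(1 for s in scores.values() if s > scores[target])

def evaluate(test_triples, entities, score):
    """Compute MRR and Hits@k (k = 1, 3, 10) over head and tail predictions."""
    ranks = []
    for h, r, t in test_triples:
        head_scores = {e: score(e, r, t) for e in entities}  # rank[e1 | r, e2]
        tail_scores = {e: score(h, r, e) for e in entities}  # rank[e2 | e1, r]
        ranks.append(rank_of(head_scores, h))
        ranks.append(rank_of(tail_scores, t))
    mrr = sum(1.0 / rk for rk in ranks) / len(ranks)
    hits = {k: sum(rk <= k for rk in ranks) / len(ranks) for k in (1, 3, 10)}
    return mrr, hits

# Toy scorer that assigns the highest score to exactly one known triple,
# so both of its ranks are 1 and all metrics reach their maximum of 1.0.
entities = ["a", "b", "c"]
score = lambda h, r, t: 1.0 if (h, r, t) == ("a", "r", "b") else 0.0
mrr, hits = evaluate([("a", "r", "b")], entities, score)
```

Since both ranks sit in the denominator of MRR, a single badly ranked triple lowers the metric considerably, which is why MRR is more sensitive than Hits@10 to occasional failures.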

Results and Discussion
In Table 3, we present the results of our experiments comparing the performance of embedding models trained on a single KG against those trained on merged KGs (leveraging sameAs links as described in Algorithm 1). Four embedding models are considered: ConEx [6], ComplEx [17], QMult [5] and DistMult [23]. From the table, we observe that ConEx achieves the highest performance w.r.t. all metrics on all datasets. Moreover, its performance on the merged KGs (DBpedia+ and Wikidata+) is notably higher compared to that on DBpedia and Wikidata. The ComplEx model also performs better on DBpedia+ and Wikidata+ than on DBpedia and Wikidata, respectively. On the other hand, DistMult and QMult perform poorly on both the training and test datasets.
One would expect the additional information about entities in DBpedia+ and Wikidata+ to improve the performance of embedding models on downstream tasks such as link prediction. Although this is clearly the case for ConEx and ComplEx (with up to 2× improvement for ConEx), we observed the opposite for DistMult and QMult. Interestingly, the poorly performing models correspond to the extreme cases of model complexity, i.e., DistMult is the simplest and QMult the most expressive among the four models we considered. This suggests that, with 32 embedding dimensions, DistMult cannot learn meaningful representations for entities and relations in our evaluation data due to its simplicity. Likewise, QMult fails to find optimal representations of entities and relations because it cannot encode its inherent high degree of freedom in 32 dimensions. The ConEx architecture appears to strike a good balance between expressiveness and the chosen number of embedding dimensions. In fact, our preliminary experiments with 300 embedding dimensions ranked DistMult as the best-performing model ahead of ConEx and ComplEx, at the cost of longer training times and higher memory consumption. In view of this observation, we use ConEx with 32 embedding dimensions to compute our universal embeddings for large KGs and provide them on a platform (see the next section). The answer to the fundamental question behind our work is hence that we can learn rich embeddings on a KG that integrates information about entities from different external sources, in particular other KGs. A precondition for achieving this goal is intrinsic to a common challenge in representation learning: finding fitting hyper-parameters.

IMPLEMENTATION OF A SERVICE PLATFORM
We offer the computed data as an open service, following the FAIR principles [22], so that the universal embeddings are available to a broad audience. The platform consists of a RESTful, TLS-secured API along with a website and interactive documentation. The API contains a hidden webservice for developers to maintain the data and a public webservice with eight methods providing RDF entity identifiers and the related embeddings.
Regarding the FAIR principles, the data is findable, as existing entity identifiers are used and can be accessed and explored with an autocomplete feature on the website. The large volume of data is accessible, as subsets can be retrieved using API methods such as random or autocomplete, which explore the full data. Accessibility is additionally enhanced by meta queries, e.g., for the size of the offered datasets. Interoperability is given by reusing existing RDF namespaces and identifiers. In addition, the API is versioned and uses the lightweight JSON format, which is natively supported in languages such as Python and JavaScript.
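To illustrate the JSON format, a response for a single entity might look as follows (note that this payload shape and its field names are a hypothetical illustration; the concrete routes and fields of the API are documented on the platform itself):

```python
import json

# Hypothetical example of a JSON payload the embedding service might return;
# the actual field names of the API may differ.
payload = """{
  "entity": "http://dbpedia.org/resource/Iraq",
  "embedding": [0.12, -0.05, 0.33, 0.07]
}"""

record = json.loads(payload)
vector = record["embedding"]
print(record["entity"], len(vector))
```

Because the payload is plain JSON, the same response can be consumed without extra dependencies in both Python (json module) and JavaScript (JSON.parse).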

CONCLUSION AND OUTLOOK
In this paper, we discuss the challenges related to computing embeddings for entities shared across multiple knowledge graphs. In particular, we note the lack of such embeddings for large knowledge graphs and propose a simple but effective approach to compute embeddings for shared entities. Given a countable set of knowledge graphs, our approach iterates over all triples and assigns a unique ID to all matching entities (i.e., shared entities). An embedding model is then applied to learn embeddings on the resulting graph; we call the resulting embeddings universal knowledge graph embeddings. We use our approach to compute embeddings for recent versions of DBpedia and Wikidata, and provide them as an open service via a convenient API. Experiments on link prediction suggest that our universal embeddings are better than those computed on separate knowledge graphs. Regarding the API, we currently provide embeddings via autocomplete search and random entity selection. In future releases, we will integrate an approximate embedding-level nearest-neighbour search to support real-time queries for similar entities over the complete data. We will also collect more large-scale knowledge graphs from the Linked Open Data Cloud to update our universal embeddings.

Table 1: Statistics of the full datasets for universal knowledge graph embeddings. Deg. denotes the average degree of entities.

Table 2: Statistics of evaluation datasets. Deg. is the average degree of entities.