MMEAD: MS MARCO Entity Annotations and Disambiguations

MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets. We specify a format to store and share links for both document and passage collections of MS MARCO. Following this specification, we release entity links to Wikipedia for documents and passages in both MS MARCO collections (v1 and v2). Entity links have been produced by the REL and BLINK systems. MMEAD is an easy-to-install Python package, allowing users to load the link data and entity embeddings effortlessly. Using MMEAD takes only a few lines of code. Finally, we show how MMEAD can be used for IR research that uses entity information. We show how to improve recall@1000 and MRR@10 on more complex queries on the MS MARCO v1 passage dataset by using this resource. We also demonstrate how entity expansions can be used for interactive search applications.


INTRODUCTION
The MS MARCO datasets [3] have become the de facto benchmark for evaluating deep learning methods for Information Retrieval (IR). The TREC deep learning track [8], which has run since 2019, derives its datasets from the MS MARCO passage and document collections. The collections have been used in zero- and few-shot scenarios for diverse retrieval tasks and domains [23,24,28]. They also serve as primary resources for training deep learning models for downstream IR tasks such as conversational search [10] and search over knowledge graphs [14] to achieve state-of-the-art results.
Purely text-based neural IR models, trained using MS MARCO collections, generally cannot reason over complex concepts in the social and physical world [5,21]. In response, recently proposed neuro-symbolic methods aim to combine neural models and symbolic AI approaches, e.g., by using knowledge graphs, which map concepts to symbols and relations. An essential step in developing neuro-symbolic models is connecting text to entities that represent the world's concepts formally. This step is mainly done using entity linking, an intermediary between text and knowledge graphs, which detects entity mentions in the text and links them to the corresponding entries in a knowledge graph.
Despite the proven effectiveness of neuro-symbolic AI, and for IR models in particular [6,14,25], the IR community has made limited efforts to develop such models. A primary hindrance is the annotation of large-scale collections with entities; entity linking methods are computationally expensive. Running them over a large text corpus (e.g., MS MARCO v2 with 12M documents and 140M passages) requires extensive resources. This paper aims to fill this gap by making entity annotations of the MS MARCO ranking collections readily available and easy to use.
With this work, we publish MMEAD, a resource that provides entity links for the MS MARCO document and passage ranking collections. Two state-of-the-art entity linking tools, namely REL [18,26] and BLINK [27], are utilized for annotating the corpora. The annotations are stored in a DuckDB database, enabling efficient analytical operations and fast access to the entities. The resource is available as a Python package and can be installed from PyPI effortlessly. The resource also includes a sample demo, enabling queries with complex compositional structures about entities.
We envision that MMEAD will foster neuro-symbolic IR research and can be used to further improve neural retrieval models. In our experiments, we show significant improvements on recall for neural re-ranking IR models when using MMEAD annotations as bag-of-words expansions for queries and passages. Our experiments reveal that the difference in effectiveness is even greater (in terms of both recall and MRR) for complex queries that require further reasoning over entities.
To show the usefulness of our resource, we also present how to enrich interactive search applications. Specifically, we demonstrate how to obtain entities' geographical locations by relating the entities found in passages to their Wikidata entries. Plotting these entities on the world map shows that the MS MARCO passages can be geo-located all over the world. We can also move from location to web text by retrieving all passages associated with a geographical location, which we present through an interactive demo.
In summary, this paper makes the following contributions:
• We annotate the documents of the MS MARCO passage and document collections and share these annotations. By sharing these annotations, we ease future research in neuro-symbolic retrieval, which extensively uses entity information. We also provide useful metadata such as Wikipedia2Vec [29] entity embeddings.
• We provide a Python library that makes our data easy to use. All data is stored in DuckDB tables, which can be loaded and queried quickly. The library is easy to install through PyPI, and the entity annotations are available with only a few lines of code.
• We experimentally show that retrieval effectiveness measured by recall significantly increases when using MMEAD. The improvement is even greater for hard queries, where we observe low retrieval effectiveness using text-only IR models.
• We demonstrate how the data can be used in geographical applications. For example, we can plot on a static map all entities found in the MS MARCO v2 passage collection for which geographical data is available. Additionally, through an interactive demo, we can retrieve all passages associated with a geographical location.

BACKGROUND
In this section, we describe systems that are used for creating entity annotations on the MS MARCO collections for MMEAD.

REL
REL (Radboud Entity Linker) [26] is a state-of-the-art open-source entity linking tool designed for high throughput and precision. REL links entities to a knowledge graph (Wikipedia) using a three-stage approach: (1) mention detection, (2) candidate selection, and (3) entity disambiguation. We briefly explain these three steps:
(1) Mention Detection. REL starts the entity linking process by first identifying all text spans that might refer to an entity. In this stage, it is essential that all possible entities in the text are identified, as only the output of this stage can be considered an entity by REL. These spans are identified using a named entity recognition (NER) model based on contextual word embeddings.
(2) Candidate Selection. [...] [15] and $P_{\mathrm{Wiki}}(e \mid m)$ is computed based on the summation of hyperlink counts in Wikipedia and the CrossWikis corpus [22]. The remaining three candidate entities are determined according to the similarity of an entity and the context of a mention. For the top-ranked candidates based on $p(e \mid m)$ probabilities, the context similarity is calculated by $\mathbf{e}^{\top} \sum_{w \in c} \mathbf{w}$. Here $\mathbf{e}$ is the entity embedding for entity $e$, and $\mathbf{w}$ are the word embeddings in context $c$, with a maximum length of 100 word tokens. The entity and word embeddings are jointly learned using Wikipedia2Vec [29].
(3) Entity Disambiguation. The final stage tries to select the correct entity from the candidate entities and maps it to the corresponding entry in a knowledge graph (Wikipedia). For this, REL assumes a latent relation between entities in the text and utilizes the Ment-norm method proposed by Le and Titov [19].
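The context-similarity score used during candidate selection can be illustrated with a small sketch. It reads the score as the dot product between a candidate's entity embedding and the sum of the word embeddings in the mention context. The vectors are toy 3-dimensional values rather than real Wikipedia2Vec embeddings, and the function name `context_similarity` and the candidate names are hypothetical, not part of REL.

```python
from typing import Dict, List, Sequence


def context_similarity(entity_vec: Sequence[float],
                       context_vecs: List[Sequence[float]]) -> float:
    """Dot product of an entity embedding with the summed context word embeddings."""
    dim = len(entity_vec)
    # Sum the context word embeddings component-wise.
    summed = [sum(vec[i] for vec in context_vecs) for i in range(dim)]
    # Dot product between the entity embedding and the summed context vector.
    return sum(e_i * s_i for e_i, s_i in zip(entity_vec, summed))


# Toy embeddings (hypothetical values, for illustration only).
context = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]  # words surrounding the mention
candidates: Dict[str, List[float]] = {
    "Candidate_A": [1.0, 1.0, 0.0],
    "Candidate_B": [0.0, 0.0, 1.0],
}

# Score every candidate and rank them from most to least similar.
scores = {name: context_similarity(vec, context) for name, vec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setting the candidate whose embedding points in the same direction as the context words is ranked first; a real system would combine such scores with the mention priors described above.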
REL is designed to be a modular system, making it easy to swap, for example, the NER system with another. All necessary scripts to train the REL system are available on GitHub, making it easy to update REL to a more recent Wikipedia dump. Recently, a batch extension of REL, REBL [18], was released, which improves the efficiency of REL for large-scale annotations, particularly in the candidate selection and entity disambiguation stages.

BLINK
BLINK [27] is a BERT-based [11] model for candidate selection and entity disambiguation, which assumes that entity mentions are already given. When utilized in an end-to-end entity linking setup, BLINK achieves similar effectiveness scores as REL. Below we describe the three steps of mention detection, candidate selection, and entity disambiguation for end-to-end entity linking using BLINK.
(1) Mention Detection. The mention detection stage can be done using an NER model. As with REL, we utilized Flair NER [1] for mention detection.
(2) Candidate Selection. BLINK considers ten candidates for each mention. The candidates are selected through a bi-encoder (similar to Humeau et al. [16]) that embeds mention contexts and entity descriptions. The mention and the entity are encoded into separate vectors using the [CLS] token of BERT. The similarity score is then calculated using the dot-product of the two vectors representing the mention context and the entity.
(3) Entity Disambiguation. For entity disambiguation, BLINK employs a cross-encoder to re-rank the top 10 candidates selected by the candidate selection stage. The cross-encoder usage is similar to the work by Humeau et al. [16], which employs a cross-attention mechanism between the mention context and entity descriptions. The input is the concatenation of the mention text and the candidate entity description.

DuckDB
DuckDB [20] is an in-process column-oriented database management system. It is designed with requirements that are beneficial for the MMEAD resource:
(1) Efficient analytics. DuckDB is designed for analytical (OLAP) workloads, while many other database systems are optimized for transactional queries (OLTP). DuckDB is especially suitable for cases where analytics are more important than transactions. As we release a resource, transactions (after loading the data) are unnecessary, making an analytics database more useful than a transaction-focused one.
(2) In-process. DuckDB runs in-process, which means no database server is necessary, and all data processing happens in-process. This allows the database to be installed from PyPI without any additional steps.
(3) Efficient data transfer. Because DuckDB runs in-process, it can transfer data from and to the database more easily, as the address space is shared. In particular, DuckDB uses an API built around NumPy and Pandas, which makes data (almost) immediately available for further data analysis within Python. DuckDB also supports the JSON and Parquet file formats, making data loading especially fast when data is provided in such formats.

MMEAD
MMEAD provides links for MS MARCO collections v1 and v2 created by the REL entity linker, and links for the MS MARCO v1 passage collection by the BLINK entity linker. For REL, we use its batch entity linking extension, REBL [18]. The knowledge graphs used for the REL and BLINK entity linkers are Wikipedia dumps from 2019-07 and 2019-08, respectively. Both dumps are publicly available from the linking systems' GitHub pages.

Goals
The design criteria for MMEAD are based on the following goals:
• Easy-to-use. It should be easy to load and use the linked entities in experiments. With only a few lines of code, it should be possible to load entities and use them for analysis. Additional information should also be readily available, like where entities appear in the text and their latent representations.
• High-quality entity links. We wish to release high-quality entity links for the MS MARCO collections, so that applying neuro-symbolic models and reasoning over entities becomes feasible.
• Extensibility. It should be easy to link the collections with a different entity linking system and publish them in the same format as MMEAD. This way, we can integrate links produced by other entity linking systems and make them automatically available through the MMEAD framework.
• Useful metadata. Additional data that can help with experiments should be provided; this includes mapping entities to their respective identifiers and latent representations.

Design
Easy-to-use. To create an easy-to-use package, we make the MMEAD data publicly available as JSONL files, which is the same format as the MS MARCO v2 collections. Each line of JSON contains entity links for one of the documents or passages in the collections; see Figure 1. The corresponding document can be identified through the JSON field that represents the document/passage identifier: docid for documents and pid for passages. Then, for every section of a document, a separate JSON field is available to access the entities in that section. For passages, there is only one section containing the entity annotations of the passage, while for MS MARCO v2 documents, we link not only the body of the document but also the header and the title.
All essential information about the entity mentions and linked entities is stored in the JSON objects. Specifically, the following metadata is made available: entity_id, start_pos, end_pos, entity, and details. The field entity_id stores the identifier that refers to the entry in the knowledge graph (Wikipedia, in our case). The start_pos and end_pos fields store the start and end positions of the text span that refers to the linked entity (i.e., as a standoff annotation of the entity mention). The positions are UTF-8 indices into the text, ready to be used in Python to extract the relevant parts of the document. The field entity stores the text representation of the entity from the knowledge graph. We chose to store this field for convenience and human readability. The details field is a JSON object that stores linker-specific information; examples include the entity type available from the NER module and the confidence of the identified mention.
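As a rough illustration of how such standoff annotations can be consumed, the sketch below builds one JSONL-style record and uses start_pos and end_pos to slice the mentions out of the passage text. Several details are assumptions made only for this example: the section field name `passage`, the `pid` value, the `entity_id` values, and the convention that end_pos is exclusive and indexes directly into a Python string. The helper `make_link` is hypothetical and not part of MMEAD.

```python
import json

text = ("The Manhattan Project and its atomic bomb helped bring an end "
        "to World War II.")


def make_link(mention: str, entity: str, entity_id: int) -> dict:
    """Build one annotation with standoff offsets into `text` (hypothetical helper)."""
    start = text.index(mention)
    return {
        "entity_id": entity_id,            # placeholder identifier, not a real Wikipedia id
        "start_pos": start,
        "end_pos": start + len(mention),   # assumed to be exclusive
        "entity": entity,
        "details": {},                     # linker-specific information would go here
    }


# One line of a JSONL file; the "passage" section name and pid are assumptions.
line = json.dumps({
    "pid": 0,
    "passage": [
        make_link("Manhattan Project", "Manhattan Project", 1),
        make_link("World War II", "World War II", 2),
    ],
})

# Reading the record back and recovering the mention strings from the offsets.
record = json.loads(line)
mentions = [text[link["start_pos"]:link["end_pos"]] for link in record["passage"]]
print(mentions)
```

Because the annotations only store offsets, the original collection text has to be available alongside the links to recover the mention strings.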
High-quality entity links. MMEAD provides entity links produced by state-of-the-art entity linking systems. For this paper, we provide links from REL for both MS MARCO v1 and v2 passages and docs, and links from BLINK for MS MARCO v1 passages. Both these systems have high precision, ensuring that identified mentions and their corresponding entities are likely correct. The knowledge graphs used by the entity linkers are the same as those used in the original studies; this way, extensive research has been done to confirm the precision of the linking systems.
Extensibility. We ensure extensibility by clearly describing the format in which the entity links are provided. If another system shares its links in the same format, the MMEAD Python library can work with the data directly. The details field per entity annotation enables inclusion of linker-specific information. REL provides specific instructions on updating the system to newer versions of Wikipedia in its documentation, making it possible to easily release links to newer versions of Wikipedia.
Useful metadata. Alongside the entity links, we also provide additional useful metadata. Specifically, we release Wikipedia2Vec [29] embeddings (300d and 500d feature vectors). REL uses the 300d Wikipedia2Vec feature vectors internally for candidate selection. These feature vectors consist of word embeddings and entity embeddings mapped into the same high-dimensional feature space. These embeddings can be used directly for information retrieval research [13,14]. We also release a mapping of entities to their identifiers. The entity descriptions can change in different versions of Wikipedia, but their identifiers remain constant. The identifier can also be used to find the corresponding entity in other knowledge graphs such as Wikidata.

An Example
A passage from the MS MARCO v1 passage ranking collection is shown below.
The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.

HOW TO USE
MMEAD comes with easy-to-use Python code, allowing users to work with the resource effortlessly. To start, MMEAD can be installed from PyPI using pip:
    $ pip install mmead
After installation, the entity links can be loaded into a DuckDB [20] database with only a couple of lines of code, as shown in Figure 2.
    >>> from mmead import get_links
    >>> links = get_links('v1', 'passage', linker='rel')
When running this code for the first time, initialization will take some time, as all the data need to be downloaded and ingested into the DuckDB database. After loading the data for the first time, it is automatically stored on disk. Loading the persisted data for later usage will only take seconds.
Once the data is loaded, it is ready to use. We provide a simple interface to access the data. The code shown in Figure 3 loads the entity links available for a document in the MS MARCO v1 passage ranking collection. When using this function, the data is provided in JSON format, making it easy to access the annotations.
We also provide word and entity embeddings generated by Wikipedia2Vec [29] based on the 2019-07 Wikipedia dump. These embeddings are stored in DuckDB tables and are available as NumPy arrays after loading. Figure 4 shows how embeddings are loaded using MMEAD. The example demonstrates that the entity embedding of Montreal and the word embedding of "Montreal" are closer to each other than the word embeddings of the two words "Montreal" and "green" based on dot-product as a similarity function. The dimensionality of the embedding vectors (300 or 500) can be specified in the code.
The mapping between the official Wikipedia identifiers and entity text representations is extracted from the 2019-07 Wikipedia dump. If entity annotations from another version of Wikipedia are available, the MMEAD mappings can be used to match entities between the dumps. Entities that only emerge in newer versions of Wikipedia cannot be mapped to the version that is available in MMEAD. However, existing entities in MMEAD can be mapped to newer versions of Wikipedia in a straightforward manner. Figure 5 shows how entity identifiers can be matched to their text and the other way around.
As DuckDB is used as a database engine for MMEAD, it is possible to directly access the underlying tables and issue structured queries in an efficient manner. Figure 6 shows an example, where a connection to the database is created, and the identifiers of passages containing the entity Nijmegen are retrieved.
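The kind of structured query described here can be sketched as follows. To keep the example free of external dependencies, it uses Python's built-in sqlite3 module as a stand-in for DuckDB; the table name `links`, its columns, and the rows are hypothetical and do not reflect MMEAD's actual schema.

```python
import sqlite3

# In-memory database standing in for the on-disk DuckDB database.
con = sqlite3.connect(":memory:")

# Hypothetical table of entity links: one row per linked mention.
con.execute(
    "CREATE TABLE links (pid INTEGER, entity TEXT, start_pos INTEGER, end_pos INTEGER)"
)
con.executemany(
    "INSERT INTO links VALUES (?, ?, ?, ?)",
    [
        (1, "Nijmegen", 0, 8),
        (1, "Netherlands", 20, 31),
        (2, "Montreal", 5, 13),
        (3, "Nijmegen", 12, 20),
    ],
)

# Retrieve the identifiers of all passages that mention the entity Nijmegen.
rows = con.execute(
    "SELECT DISTINCT pid FROM links WHERE entity = ? ORDER BY pid",
    ("Nijmegen",),
).fetchall()
pids = [pid for (pid,) in rows]
print(pids)

con.close()
```

The SQL itself is the point of the sketch: once the links live in a relational table, selecting passages by entity is a single filtered query.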
All data can be downloaded directly as well, and links to the data are provided on our GitHub page.
    >>> from mmead import get_embeddings
    >>> e = get_embeddings(300, verbose=False)

ENTITY EXPANSION WITH MMEAD
To demonstrate the usefulness of MMEAD for (neural) retrieval models, we have conducted experiments that extend existing models with MMEAD annotations. These experiments serve a demonstrative purpose only, and the full potential of this resource is to be further explored in (neuro-)symbolic IR models [14,25].

Methods
Figure 6: All data is stored in DuckDB tables, and thus it is possible to directly access the tables and issue queries. In this example, we extract the identifiers of passages that contain the city of Nijmegen.
BM25 expansion. We experimented with three retrieval methods to show the benefits of entity annotation for passage ranking: one baseline method and two methods that use query entity expansion [9] using REL:
a. BM25 - No Expansion. As a baseline method, we used BM25 as implemented in Anserini [17] using hyper-parameters $k_1 = 0.82$ and $b = 0.68$, shown to be optimal for the MS MARCO dataset. MS MARCO was indexed normally, and no expansion was considered for the queries or the passages.
b. BM25 - Entity Text Expansion. In this method, passages and queries are expanded with the text representation of their annotated entities (from REL). Once the passages and queries have been expanded with entities, we run BM25 with the same hyper-parameter settings as described in a.
c. BM25 - Entity Hash Expansion. Instead of using the text representation of entities as an expansion, we expanded the passages and queries by the MD5 hash of the entity text (from REL). The use of MD5 hashing is to provide a consistent representation of multi-word terms and to avoid partial or incorrect matching between queries and non-relevant passages; e.g., passages that contain the word "united" do not benefit if the query contains "United States" as an entity. Again, after expansion, we run BM25 with the same hyper-parameter settings described in a.
In these experiments, the identified entities are deduplicated. As a demonstration of the proposed text expansion methods, Figure 8 shows how the query expansion is performed using explicit and hashed forms. The added entities provide more precise context and help eliminate ambiguous terms. Figure 9 shows the expansion methods on the relevant passage for this query. The relevant passage can be found through our expansion technique. The linking system recognizes that both the query and the passage contain a reference to the entity Sacagawea, even though they are spelled differently in the query and the passage.
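The two expansion variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the experiments: the expansion terms are simply appended to the text, the MD5 hash is taken over the UTF-8 encoded entity text and written as a hexadecimal digest, and the function names, the query, and the linked entities are hypothetical.

```python
import hashlib
from typing import Iterable, List


def dedupe(entities: Iterable[str]) -> List[str]:
    """Remove duplicate entities while keeping first-seen order."""
    return list(dict.fromkeys(entities))


def expand_with_text(text: str, entities: Iterable[str]) -> str:
    """Append the text representation of each distinct entity."""
    return " ".join([text, *dedupe(entities)])


def expand_with_hash(text: str, entities: Iterable[str]) -> str:
    """Append one MD5 hex digest per distinct entity, so a multi-word
    entity becomes a single token that cannot match partially."""
    hashes = [hashlib.md5(e.encode("utf-8")).hexdigest() for e in dedupe(entities)]
    return " ".join([text, *hashes])


query = "capital of the united states"
entities = ["United States", "United States"]  # hypothetical linker output

text_expanded = expand_with_text(query, entities)
hash_expanded = expand_with_hash(query, entities)
print(text_expanded)
print(hash_expanded)
```

With text expansion, a passage containing only the word "united" still shares a term with the expanded query; with hash expansion, only passages linked to the same entity share the added token.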
Reciprocal Rank Fusion. As a second series of experiments, we applied Reciprocal Rank Fusion (RRF) [7] to the runs described above. RRF is a fusion technique that can combine rankings produced by different systems. RRF creates a new ranking by only considering the rank of a document in the input. Given a set of documents $D$ and a set of rankings $R$, RRF can be computed as:
$$\mathrm{RRF}(d \in D) = \sum_{r \in R} \frac{1}{k + r(d)}$$
where $r(d)$ is the rank of document $d$ in ranking $r$. Here $k$ is a hyperparameter that can be optimized, but we simply used a default value of $k = 60$ for all settings.
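A minimal sketch of this fusion step is given below. It follows the standard formulation of RRF, in which each document receives the sum over all input rankings of 1 / (k + rank), with ranks starting at 1 and k = 60. The function name and the toy runs are hypothetical; documents missing from a ranking are assumed to contribute nothing for that ranking.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Return the RRF score of every document that appears in any ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document of a run has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


# Two toy runs, e.g. one without expansion and one with entity text expansion.
run_a = ["d1", "d2", "d3"]
run_b = ["d3", "d1", "d4"]

fused_scores = reciprocal_rank_fusion([run_a, run_b])
# Sort documents by fused score, highest first, to obtain the new ranking.
fused_ranking = sorted(fused_scores, key=fused_scores.get, reverse=True)
print(fused_ranking)
```

Documents that appear near the top of several runs accumulate the highest scores, which is why fusing an expanded and a non-expanded run can surface passages that either run alone ranks low.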
Applying RRF to these runs provides us with four new rankings (runs d-g): the RRF of the pairwise combinations of the three rankings described above and the RRF of all three of these runs.
Table 2: Results on the MS MARCO v1 passage collection, using only the queries that have entity annotations. Bolded numbers are the highest achieved effectiveness. Scores with a dagger (†) are significantly better compared to BM25 with no expansion (run a), following a paired t-test with Bonferroni correction. For MRR, we have not calculated significance scores due to its ordinal scale [12].

Experimental Setup
In our experiments, we use MMEAD as a resource to expand queries and passages with entities. The experiments are performed using the MS MARCO v1 passage ranking collection, where only queries containing at least one entity annotation are used. We do not expect meaningful differences for queries without any linked entities, as the expanded query is identical to the original query in that case (due to the simplicity of the method applied here).
As we expect the linked entities to provide additional semantic information about the queries and passages, we conduct further testing on the obstinate query sets of the MS MARCO Chameleons [2], which consist of challenging queries from the original MS MARCO passage dataset. In general, ranking methods show poor effectiveness in finding relevant matches for these queries. Our testing focuses on the bottom 50% of the worst-performing queries from the subsets of Veiled Chameleon (Hard), Pygmy Chameleon (Harder), and Lesser Chameleon (Hardest), which represent increasing levels of difficulty.
This gives us four query sets on which we evaluate: (1) all queries that contain entity annotations (dev, 1984 queries), (2) all queries in the hard subset that contain entity annotations (hard, 680 queries), (3) all queries in the harder subset that contain entity annotations (harder, 493 queries), and lastly, (4) all queries in the hardest subset that have entity annotations (hardest, 322 queries).
The experiments are evaluated using Mean Reciprocal Rank (MRR) at rank ten and Recall (R) at rank one thousand. MRR@10 is the official metric for the MS MARCO passage ranking task, while R@1000 gives an upper limit on how well re-ranking systems could perform. The Anserini [30] toolkit is used to generate our experiments.
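For reference, the two metrics can be computed per query as sketched below; the reported MRR@10 and R@1000 are then the means of these values over all queries in a set. The function names and the toy ranking are hypothetical, and queries without any relevant passage are assumed to receive a recall of 0.

```python
from typing import List, Set


def reciprocal_rank_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage within the top k, or 0 if none is found."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking: List[str], relevant: Set[str], k: int = 1000) -> float:
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant:
        return 0.0
    found = len(relevant.intersection(ranking[:k]))
    return found / len(relevant)


# Toy example: one query with two relevant passages, one of which is retrieved.
ranking = ["p7", "p3", "p9", "p1"]
relevant = {"p9", "p5"}

rr = reciprocal_rank_at_k(ranking, relevant)
rec = recall_at_k(ranking, relevant)
print(rr, rec)
```

The example also shows why the two metrics can move independently: recall only counts whether a relevant passage is retrieved at all, while the reciprocal rank depends on how high the first one appears.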

Results
Table 2 presents the results of our experiments. If we first look at lines a-c in the results table, we can examine the effects of our expansion methods compared to the baseline run. Looking at R@1000, we can see that more relevant passages are found using entity expansion for the dev collection and its harder subsets. We do not find additional relevant documents/passages on the dev set when we use the entity hashes, and entity text seems to be the better approach. There is, however, no increase in MRR@10 when using this expansion method. Entity expansions help when evaluating using R@1000, especially when the queries are more complex. The difference in recall effectiveness becomes larger the more complex the queries get. MRR@10 only improves when using entity text expansion.
The reciprocal rank fusion methods are presented in lines d-g. When using these methods, the R@1000 increases more. Again, the subsets that contain more complex queries tend to benefit more. Regarding R@1000 effectiveness, the best RRF method combines the ranking from the normal, non-expanded index with the ranking from the index that has been expanded with the entity text. Again, entity text expansion helps recall more than using hash expansion. Although the RRF methods improve recall, MRR@10 does not benefit from RRF when compared to using only one of the expansion techniques.

BEYOND QUANTITATIVE RESULTS
In the previous section, we demonstrated the potential value of MMEAD using quantitative evaluations, where we leverage entities to improve retrieval effectiveness in standard benchmark datasets. Beyond these quantitative results, MMEAD can also help enrich interactive search applications in various ways. This section describes a few such examples. Entity links to Wikidata provide an entrée into the broader world of open-linked data, which enables integration with other existing resources. This allows us to build interesting "mashups" or support search beyond simple keyword queries. As a simple example, we can take the entities referenced in MS MARCO, look up the coordinates for geographic entities, and plot them on a map. Figure 7 shows a world map with all entities found in the MS MARCO v2 passage collection mapped onto it (each shown with a transparent blue dot). The results are as expected, where the blue dots' density largely mirrors worldwide population density, although (also as expected) we observe more representation from entities in North America, Europe, and other better-developed parts of the world.
Figure 7 is a static visualization, but we can take the same underlying data and principles to create interesting interactive demonstrations. Geo-based search is an obvious idea, where users can [...] While it is possible that pretrained transformers might implicitly contain this information, they can never offer the same degree of fine-grained control provided by explicit entity linking. As a simple demonstration, we have taken MMEAD, reformatted the entity links into RDF, and ingested the results into the QLever SPARQL engine [4]. By combining MMEAD with RDF data from Wikidata and OpenStreetMap, we can issue SPARQL queries such as "Show me all passages in MS MARCO about France". The query is shown in Figure 10, which gives us 122,316 entities found in the collection that have a connection with France (most of them are located in France). Then we can automatically show the entities on a map, as presented in Figure 11 (showing the first 1000 entities found).
Not all linked entities are located in France, however. For example, some entities are related to France (entities for which France is mentioned in their Wikidata), but are located elsewhere in the world. One of the blue dots in Germany corresponds to the river Moselle. This river rises in France and flows into the Rhine in Germany. Instead of querying for France, we can also query for different countries. Table 3 shows the number of entities found for a sample of countries.

CONCLUSION AND FUTURE WORK
This research presents the resource MMEAD, or MS MARCO Entity Annotations and Disambiguations. MMEAD contains entity annotations for the passages and documents in MS MARCO v1 and v2. These annotations simplify entity-oriented research on the MS MARCO collections. Links have been provided using the REL and BLINK entity linking systems. Using DuckDB, the data can quickly be queried, making the resource easy to use. We also demonstrated that our resource can enrich interactive search applications. In particular, we present an interactive demo where all entities related to geographical locations can be positioned on a map. We experimentally show that MMEAD improves recall effectiveness significantly when using entities for query and passage expansion. When using reciprocal rank fusion, the effectiveness difference becomes even more prominent and new relevant passages are found. The question remains whether these passages can be ranked higher by new retrieval models. With MMEAD, we support information retrieval research that combines deep learning and entity information.
In the future, we would like annotations from a more diverse group of linking systems. Using the MMEAD format, releasing entity links for collections beyond MS MARCO is also possible. We already showed that using entity links improves recall when using the linked entities for query expansion. What the effects are when training, e.g., DPR methods that include the entity links, is yet to be investigated; this is an exciting research opportunity that lies ahead.

Figure 1: Example of MMEAD annotations for an MS MARCO passage in JSON format. The field tag depicts the type of the entity and md_score shows the certainty of the mention detection component in identifying the text span as a mention.

Figure 2: Example of how to load MMEAD entity links for the MS MARCO v1 passage collection.

Figure 3: Example of how to load the entity links for a document. For formatting reasons, we do not show the full output.

Figure 4: Example code for loading word and entity embeddings. It shows that the dot-product between the "Montreal" word and entity embeddings is greater than the dot-product of the embedding vectors for the word "Montreal" and a random word. The word embeddings of Montreal and Toronto, two cities in Canada, are more similar.

Figure 5: Entity names and identifiers are accessible in MMEAD. Given an entity text, we can directly find its corresponding identifier and vice versa.
d. RRF - No Expansion + Entity Text. RRF fusion of runs a and b. The run with no expansions and the run with entity text expansions are considered.
e. RRF - No Expansion + Entity Hash. RRF fusion of runs a and c. The run with no expansions and the run with entity hash expansions are considered.
f. RRF - Entity Text + Entity Hash. RRF fusion of runs b and c. The run with entity text expansions and the run with entity hash expansions are considered.
g. RRF - No Expansion + Entity Text + Entity Hash. RRF fusion of runs a, b, and c. All three runs are considered.

Figure 7: Locations of entities found in the MS MARCO v2 passage collection.

Figure 9: The relevant passage for the query presented in Figure 8; (a) the non-expanded passage, (b) the passage with entity text expansion, and (c) the passage with entity hash expansion. Text expansions are in italics. The MD5 hashes shown in (c) are shortened in this example for formatting.

Figure 10: SPARQL query that produces all entities in the passages of the MS MARCO v2 collection that are related to the country of France.

Figure 11: First 1000 entities found that are connected to France. Entities are represented with a blue dot on the map.

Table 1: Number of entities linked by REL; we show the total number of entities found and how many entities there are per passage/document on average.

Table 3: Number of entities found per country for some example countries where the entity has an English label.