CompMix: A Benchmark for Heterogeneous Question Answering

Fact-centric question answering (QA) often requires access to multiple, heterogeneous information sources. By jointly considering several sources like a knowledge base (KB), a text collection, and tables from the web, QA systems can enhance their answer coverage and confidence. However, existing QA benchmarks are mostly constructed with a single source of knowledge in mind. This limits the ability of these benchmarks to fairly evaluate QA systems that can tap into more than one information repository. To bridge this gap, we release CompMix, a crowdsourced QA benchmark that naturally demands the integration of a mixture of input sources. CompMix has a total of 9,410 questions, and features several complex intents like joins and temporal conditions. Evaluation of a range of QA systems on CompMix highlights the need for further research on leveraging information from heterogeneous sources.

1 Introduction

Using only a single information source limits the answer coverage of QA systems: individual sources are not complete, and may fail to cover the knowledge required for answering a user question. Consider, as an example, the question below:

"Who was fouled before the first penalty in the 2022 FIFA final?"
This kind of detailed information on a sports event is rarely covered in a structured information source like a KB or table, but can be found in text discussing the content of the match. On the other hand, structured sources often include information that is not present in text. Tables often store match-specific details, and would contain, for instance, the answer to the following question:

"Argentina's ball possession in the 2022 WC final?"
For some questions, answers appear in multiple sources. Such answer redundancy can also be helpful for QA systems, and boost their confidence in predicted answers. For instance, consider:

"In which stadium was the 2022 soccer world cup final played?"
The answer to this question occurs in a Wikipedia infobox, text content, and Wikidata. It may even be necessary to join evidence from multiple sources for answering a more complex question:

"Which team was behind by two goals but still won a FIFA final?"
The list of FIFA World Cup finals and their winners could be looked up in a KB, but the goal-deficit information associated with the match timeline would either be discussed in text, or could be derived by reasoning over statistics in tables. These observations have triggered work on heterogeneous QA Sun et al. (2018, 2019); Oguz et al. (2022); Savenkov and Agichtein (2016); Xu et al. (2016a,b); Xiong et al. (2019): jointly harnessing multiple sources for answering factual questions Roy and Anand (2022).
Limitations of the state-of-the-art. There are currently three strategies for evaluating heterogeneous QA: (i) using benchmarks for single-source QA but showing that using more sources improves performance Xu et al. (2016a,b); Savenkov and Agichtein (2016); Oguz et al. (2022); (ii) using benchmarks for single-source QA, but artificially removing parts of the "main" source before augmenting the benchmark with new sources Sun et al. (2018, 2019); and (iii) using dedicated benchmarks for heterogeneous QA Talmor and Berant (2018); Chen et al. (2020). The first approach usually leads to quick saturation on benchmarks: all answers are still available in the primary source, which is what the methods primarily target, and auxiliary sources bring in only incremental gains. The second approach is inherently flawed because considering heterogeneous sources obviously improves performance, as the main source is intentionally weakened. This creates an artificial situation and does not expose the true strengths and weaknesses of methods built for heterogeneous QA.
Our contribution belongs to the third approach. There are a few existing benchmarks for multi-source QA Talmor and Berant (2018); Miller et al. (2016); Zhang et al. (2018), but these either contain synthetic questions that do not reflect the idiosyncrasies in formulation and intent of real users, or cover only a narrow spectrum of sources and domains Chen et al. (2020, 2021b); Zhu et al. (2021); Li et al. (2021); Chen et al. (2021a).
A new benchmark. We make the case for a benchmark that inherently requires the usage of a mixture of information sources, as a more natural testbed for evaluating heterogeneous QA systems. To this end, we release COMPMIX (Complete questions over a Mixture of sources), a crowdsourced QA benchmark with questions that require heterogeneous sources for answering (the Wikidata KB, and Wikipedia text, tables, and infoboxes). The dataset has 9,410 human-created questions from five different domains: books, movies, music, TV series, and soccer. The answers are grounded in the Wikidata KB, which allows the use of consistent evaluation metrics for QA systems returning either entity IDs or simple strings.
Contributions. This paper presents our benchmark COMPMIX, accompanied by an in-depth analysis. We identify complex phenomena in the questions, like temporal conditions, multiple entities and relations, aggregations, and comparisons. We investigate the effect of combining multiple sources on answer coverage and redundancy, and show that heterogeneous sources are truly required.
Finally, we evaluate multiple recent heterogeneous QA methods on COMPMIX, and identify questions for which none of these systems gives correct answers. Interestingly, the results for a recent GPT model show that even a large language model (LLM) can answer only half of the questions of this realistic and challenging benchmark. The COMPMIX benchmark is publicly available at https://qa.mpi-inf.mpg.de/compmix.
Most existing QA benchmarks were created with the intention of having a specific underlying source for answering, which already contains almost all answers to the questions. This restricts their utility as a testbed for heterogeneous QA. Thus, existing work on heterogeneous QA, being forced to rely on these benchmarks, would often remove significant chunks of information from this "main" information source (≃50% of Freebase was removed for evaluating on WebQuestions in Sun et al. (2019)), and add parts of other sources to simulate a setting with heterogeneous sources.
All existing benchmarks for heterogeneous QA suffer from one or more of the following issues: (i) their questions are not fully human-generated, and hence lack the diverse formulations of real users Talmor and Berant (2018); (ii) they do not include a full KB among their knowledge sources; (iii) they cover only a narrow set of sources; (iv) they are restricted to a single domain; or (v) their questions are incomplete and depend on surrounding context. COMPMIX removes these shortcomings: (i) it is crowdsourced; (ii) it includes the full KB as one of the knowledge sources; (iii) it spans four sources; (iv) it covers five domains; and (v) it contains self-contained complete questions. A succinct comparison of salient properties across benchmarks is in Table 1.
Figure 1: Answer-type frequencies per domain in COMPMIX.

2 COMPMIX
We create COMPMIX by collating the completed (intent-explicit) versions of the potentially incomplete (intent-implicit) questions in the CONVMIX benchmark Christmann et al. (2022b), a dataset for conversational QA over heterogeneous sources. These completed questions are provided directly by crowdworkers on Amazon Mechanical Turk (AMT), i.e., they are created by humans. The answers to the questions were derived from four sources: the full Wikidata KB, or the text, tables, or infoboxes from all of Wikipedia. The questions span five domains: movies, TV series, music, books, and soccer (the distribution of expected answer types per domain is in Fig. 1). Overall, the benchmark comprises 9,410 questions, split into a train set (4,966), a development set (1,680), and a test set (2,764). Basic statistics for COMPMIX can be found in Table 2. A notable property of our dataset is the presence of a significant fraction of questions with long-tail entities (last row of Table 2), a major vulnerability of LLM-based methods.
COMPMIX includes questions, their domains, and their corresponding answers. Answers are Wikidata entity identifiers (text labels are also provided), plaintext strings, or normalized dates. This enables consistent evaluation across extractive and generative answering models. In addition, entity markup in the question formulations is provided by crowdworkers. Answer sources are given, too: "KB", "text", "table", or "infobox".
3 Benchmark analysis

Answer coverage
One key desideratum of the benchmark is that heterogeneous sources are actually required for answering its questions. To verify that this is the case, we analyzed the answer coverage of each information source, i.e., the fraction of questions for which a source contains the answer. In a good benchmark for heterogeneous QA, each individual source should have an answer coverage far below 100%.
At the time of benchmark creation, Turkers were given a domain, picked an entity of their choice from that domain, asked a natural question about this entity, and then provided an answer to the question. They also indicated the source they consulted for locating their answer.
For computing coverage, we first consider these source annotations by the crowdworkers. However, this measurement only captures whether a specific information source has the desired information, without any implications concerning the other sources.
Therefore, we also conducted an automated analysis of answer coverage using a recall-oriented retriever that, given a question, tries to obtain as many relevant pieces of evidence as possible from all our sources. This retriever is implemented as in Christmann et al. (2022b, 2023): it first disambiguates KB entities in the question (using CLOCQ Christmann et al. (2022a), a recent system), and then retrieves KB facts, text sentences, table records, and infobox entries with these disambiguated KB entities. For each piece of evidence, mentions of entities are linked to the KB. We measure automated answer coverage as the fraction of questions for which the gold answer is among the mentioned entities in the pool of retrieved evidence. As with any large-scale automated analysis, this statistic is a noisy proxy, because the mere presence of an answer does not necessarily mean that the surrounding evidence is question-relevant.
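For illustration, a minimal sketch of this coverage computation follows; the retrieve_evidence function, the Evidence record layout, and all field names are hypothetical stand-ins for the actual CLOCQ-based pipeline, not our released code.

```python
from typing import Dict, List, Set

# Hypothetical record for one retrieved piece of evidence: its source
# ("kb", "text", "table", or "infobox") and the Wikidata IDs of all
# entity mentions linked within it. Field names are assumptions.
Evidence = Dict[str, object]

def retrieve_evidence(question: str, sources: List[str]) -> List[Evidence]:
    """Placeholder for the recall-oriented retriever (CLOCQ-based entity
    disambiguation followed by per-source evidence retrieval)."""
    raise NotImplementedError

def answer_coverage(benchmark: List[dict], sources: List[str]) -> float:
    """Fraction of questions whose gold answer ID appears among the
    entity mentions in the retrieved evidence pool."""
    covered = 0
    for item in benchmark:
        evidence_pool = retrieve_evidence(item["question"], sources)
        mentioned: Set[str] = set()
        for ev in evidence_pool:
            mentioned.update(ev["entity_ids"])
        if mentioned & set(item["gold_answer_ids"]):
            covered += 1
    return covered / len(benchmark)

# Coverage grows monotonically as sources are added, e.g.:
#   answer_coverage(data, ["kb"])
#   answer_coverage(data, ["kb", "text", "table", "infobox"])  # ~87% reported
```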
The results of both analyses are in Table 3. First, we see that the AMT annotators used the KB, text, and infoboxes almost equally often to answer their questions (tables were also consulted in ≥10% of the cases). This shows that COMPMIX is not biased towards any specific underlying source. Second, from the automated measurement, we learn that adding an information source always improves the answer coverage. Note that this is a natural expansion, as opposed to augmentation after artificially suppressing large parts of specific sources. By including all sources, the answer coverage goes up to about 87%. Note that our recall-oriented retriever only provides a loose upper bound: an actual retriever that balances recall and precision would currently reach a lower number (cf. Sec. 4). Thus, our benchmark leaves substantial room for the development of smart heterogeneous retrievers.
Overall, these measurements suggest that all four sources are naturally required for answering the questions in COMPMIX, and different sources complement each other nicely.

Answer redundancy
Answer redundancy creates scope to test a heterogeneous QA system's ability to boost confidence in its predictions when matches occur across multiple sources. For each question, we therefore measured the number of sources for which the retrieved pieces of evidence actually contain the gold answer.
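Under the same hypothetical data structures as the coverage sketch above, this per-question redundancy measurement reduces to counting distinct answer-bearing sources:

```python
def answer_redundancy(item: dict, evidence_pool: List[Evidence]) -> int:
    """Number of distinct sources (0..4) whose retrieved evidence
    mentions the gold answer; reuses the hypothetical Evidence
    records from the coverage sketch above."""
    gold = set(item["gold_answer_ids"])
    answer_sources = {
        ev["source"]
        for ev in evidence_pool
        if gold & set(ev["entity_ids"])
    }
    return len(answer_sources)
```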
Results are in Table 4. For a substantial proportion of questions, the answer is located in two (≃17%) or three (≃34%) of the four sources. A sizable chunk even has redundancy across all sources (≃20%). This shows that COMPMIX offers ample answer redundancy to be exploited by appropriate heterogeneous QA models.

Anecdotal examples
For each of our five domains, Table 5 shows representative examples from the COMPMIX benchmark.

4 Evaluation with COMPMIX
Metrics. We use standard QA metrics for evaluating models on COMPMIX: (i) Precision at 1 (P@1), which is 1 if the top-ranked system answer is correct, and 0 otherwise; (ii) Mean Reciprocal Rank (MRR), which is the reciprocal of the first rank at which a correct answer is located (0 if there is none); and (iii) Hit at 5 (Hit@5), which is 1 if a gold answer is among the first five system responses, and 0 otherwise. A system answer is considered correct if it exactly matches (case-insensitive) a gold Wikidata ID (if the QA system returns IDs) or the accompanying plaintext string/entity label (if the QA system returns simple text). Metrics are averaged over all questions.
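For concreteness, here is a minimal sketch of how these per-question metrics can be computed from a ranked answer list; the is_correct helper and the pooling of gold IDs and labels into one set are our assumptions, not the official evaluation code.

```python
def is_correct(answer: str, gold: set) -> bool:
    """Case-insensitive exact match against the gold Wikidata IDs
    and their entity labels (assumed to be merged into one set)."""
    return answer.lower() in {g.lower() for g in gold}

def question_metrics(ranked_answers: list, gold: set) -> dict:
    """P@1, MRR, and Hit@5 for a single question."""
    first_rank = next(
        (i + 1 for i, a in enumerate(ranked_answers) if is_correct(a, gold)),
        None,
    )
    return {
        "P@1": 1.0 if first_rank == 1 else 0.0,
        "MRR": 1.0 / first_rank if first_rank else 0.0,
        "Hit@5": 1.0 if first_rank and first_rank <= 5 else 0.0,
    }

# The benchmark score for each metric is the mean over all questions.
```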
Models. To better understand the state-of-the-art in heterogeneous QA, we evaluate several recent QA models that incorporate heterogeneous sources on COMPMIX. We also include GPT in our model suite, to verify whether LLMs trained on colossal web corpora are already sufficient for this task. We compare the following models:
• UNIK-QA Oguz et al. (2022) follows a retriever-reader pipeline, and verbalizes evidence from each source into text. DPR Karpukhin et al. (2020) retrieves relevant evidence from the verbalized text, and a Fusion-in-Decoder (FiD) model Izacard and Grave (2021) generates the answer. Due to the unavailability of end-to-end source code, we approximate UNIK-QA by replacing DPR with BM25 Robertson and Zaragoza (2009). FiD generates strings, which are mapped to a ranked list of KB items following Christmann et al. (2023).
• CONVINSE Christmann et al. (2022b) is a method for conversational QA over heterogeneous sources, but can also be applied to complete questions. It derives an intent-explicit structured representation for a question, and feeds this into a retriever-reader pipeline.
• EXPLAIGNN Christmann et al. (2023) is another method for heterogeneous QA, which makes use of iterative graph neural networks for deriving the answer, instead of a generative reader model like FiD.
• GPT-3. For evaluating GPT-3 Brown et al. (2020) (model: text-davinci-003), we use the following prompt, which performed the best among different alternatives: "Please answer the following question by providing the crisp answer entity, date, year, or numeric number. Q: <question>".
The generated answer string is then compared with the label and KB aliases of the gold answer(s), to allow for potential synonymy (all strings lowercased). P@1 = 1 for exact matches, and zero otherwise. GPT-3 generates only a single answer, and thus metrics for ranked lists are inapplicable.
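The following sketch illustrates this GPT-3 setup, assuming the legacy openai Python client (pre-1.0) that exposed the Completion endpoint; the function names, decoding parameters, and alias pooling are hypothetical, with only the prompt and model name taken from the text above.

```python
import openai  # legacy (<1.0) client exposing the Completion endpoint

PROMPT = ("Please answer the following question by providing the crisp "
          "answer entity, date, year, or numeric number.\nQ: {question}")

def ask_gpt3(question: str) -> str:
    """Single-answer generation with the prompt above.
    Decoding parameters are our assumptions, not the paper's."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=PROMPT.format(question=question),
        temperature=0,   # assumed: deterministic decoding
        max_tokens=32,   # assumed: short factoid answers
    )
    return response["choices"][0]["text"].strip()

def gpt_answer_correct(generated: str, gold_labels: list, gold_aliases: list) -> bool:
    """Lowercased exact match against the gold label(s) and KB aliases."""
    candidates = {s.lower() for s in gold_labels + gold_aliases}
    return generated.lower() in candidates
```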
Results. Findings in Table 6 reveal two key takeaways: (i) systems from the literature only reach about 45% P@1 on COMPMIX, showing substantial room for model improvement. Much higher numbers have been reported for the compared models in previous, sub-optimal evaluation settings (UNIK-QA reaches 80% accuracy on WebQuestionsSP): this highlights the challenges in COMPMIX; (ii) the task is far from solved for LLMs, with the P@1 reached by GPT-3 being merely 50%. We attribute this to the large number of rare and emerging entities in our benchmark (see Table 2). To put aggregate performance in perspective, we found that for 2,264 questions (81.9%), at least one of the methods failed to produce a correct answer. On the other hand, for 759 questions (27.5%), none of the methods (including GPT-3) could find the correct answer. Table 7 shows one such unanswered question per domain. The second and fifth questions make a perfect case for merging multiple sources, as subtle cues like "adult Pi Patel" or "twin brothers" are likely to be mentioned in textual sources, while movie cast or club membership is more easily looked up via structured repositories.
Table 7: Anecdotal questions for which none of the tested methods could derive the correct answer.
What was the original title of the book Twilight?
Who played as adult Pi Patel in Life of Pi movie?
What album is the song Closing Time on?
Who composed the theme music for the TV series Fury?
Who were the twin brothers who played soccer for Manchester United?
5 Data Sharing and Ethics

Licensing. The COMPMIX benchmark is licensed under a Creative Commons Attribution 4.0 International License.
Availability. The benchmark is released on our project website, which includes a leaderboard to keep track of the state-of-the-art. COMPMIX is also offered on Hugging Face for a broader audience. The DOI of COMPMIX is https://doi.org/10.57967/hf/0707.
Ethical considerations.COMPMIX collates completed questions from the CONVMIX benchmark.
For collecting CONVMIX, human annotators from AMT asked factoid questions in a conversational setting. No personal or other critical data was collected or published. The COMPMIX benchmark does not contain any personal or other critical data. All questions are provided anonymously.
The annotators who collected the CONVMIX dataset were paid fair compensation for their work, consistent with the German minimum wage (irrespective of their country of residence).

6 Conclusion
We release COMPMIX, a benchmark for heterogeneous QA that inherently requires the usage of multiple sources. Answering questions in COMPMIX requires systems to work consistently well for intents spread across five domains, and to deal with a wide variety of challenging human formulations asking about rare entities. Our hope is that this resource can help facilitate progress in developing more robust QA models that appropriately exploit complementary and potentially redundant sources of information. A promising direction for extension would be to include questions whose answers exhibit a different flavor of heterogeneity: sentences, passages, or longer lists.


Table 1: Comparing benchmarks for heterogeneous QA (OR: Open Retrieval; HQ: Human Questions; OD: Open Domain).

Table 2: Basic statistics for the COMPMIX benchmark.

Table 3: Answer coverage across information sources.

Table 5: Representative questions from COMPMIX. Sources that can be used for answering these questions are given in brackets.