The Web Data Commons Schema.org Table Corpora

Research on table representation learning, data retrieval, and data integration in the context of data lakes requires large table corpora for training and evaluating the developed methods. Over the years, several large table corpora, such as WikiTables, GitTables, or the Dresden Web Table Corpus, have been published and are used by the research community. This paper complements the set of public table corpora with the Web Data Commons Schema.org Table Corpora, two table corpora consisting of 4.2 million (2020 release) and 5 million (2023 release) relational tables describing products, events, local businesses, job postings, recipes, movies, and books, as well as 37 further types of entities. The feature that distinguishes the corpora from all other publicly available large table corpora is that all tables describing entities of a specific type use the same attributes to describe these entities, i.e. all tables use a shared schema, the schema.org vocabulary. The shared schema eases the integration of data from different sources and allows training processes to focus on specific types of entities or specific attributes. Altogether, the tables contain ∼653 million rows of data which were extracted from the Common Crawl web corpus and grouped into separate tables for each class/host combination, i.e. all records of a specific class that originate from a specific website are put into a single table. This paper describes the creation of the WDC Schema.org Table Corpora, gives an overview of the content of the corpora, and discusses their use cases.


INTRODUCTION
The schema.org community effort defines a shared vocabulary for describing entities such as products, local businesses, events, job offers, questions and answers, as well as many other types of entities [7, 10]. Schema.org terms are used together with the Microdata and RDFa syntaxes to annotate structured data within the BODY of HTML pages. Alternatively, the terms are used in combination with the JSON-LD syntax to embed structured data in the HEAD section of HTML pages. Since 2011, the search engines Google, Bing, and Yandex have used schema.org data to display rich snippets in search results, show info boxes next to search results, and place entities on maps. Other applications that use schema.org data include Google Shopping, Google for Jobs, and Google Dataset Search. In order to have their content displayed in these applications, millions of websites have started to use the schema.org vocabulary to annotate structured data within their pages, and today approximately 50% of all web pages contain schema.org annotations [2]. The Web Data Commons (WDC) project regularly extracts schema.org data from the Common Crawl, the largest public web corpus. The Common Crawl is released monthly and typically contains around 3 billion HTML pages originating from over 30 million different websites (hosts). The WDC project uses the extracted data to calculate statistics about the adoption of schema.org on the Web [2] and publishes the extracted data in the form of N-Quads, a provenance-enabled graph data format.
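To make the annotation mechanism concrete, the following minimal sketch shows an HTML page that embeds a schema.org Movie description as JSON-LD in its HEAD section and parses it with the extruct Python library. The page content and the Movie example are illustrative, not taken from the corpus.

```python
import extruct

# A minimal HTML page embedding a schema.org entity as JSON-LD
# in its HEAD section (illustrative example, not from the corpus).
html = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Movie",
  "name": "Example Movie",
  "actor": [{"@type": "Person", "name": "Jane Doe"}]
}
</script>
</head><body>...</body></html>
"""

# Extract the embedded JSON-LD (extruct also supports Microdata and RDFa).
data = extruct.extract(html, syntaxes=["json-ld"])
print(data["json-ld"][0]["name"])  # -> "Example Movie"
```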
In order to allow the extracted data to be directly used for applications that require tabular data, as well as to spare users from dealing with duplicate data resulting from the structure of the original websites, we generate the WDC Schema.org Table Corpora from the extracted data by (i) grouping the data by website, (ii) removing incomplete entities that were extracted from listing pages, and (iii) deduplicating the data. We represent the resulting tables in a JSON format that can be read directly by the pandas library. We publish two releases of the WDC Schema.org Table Corpora: the 2020 release consists of 4.2 million relational tables that together contain ∼292 million rows of data. The corpus was generated from the WDC 2020 JSON-LD and Microdata extraction. The second corpus contains 5 million tables that together contain ∼361 million rows of data and was generated from the WDC 2023 extraction. The tables in the corpora belong to 44 different schema.org classes, with product, local business, and event being the most widely used classes. As the tables are generated from schema.org annotations, all tables that describe entities of a specific type use the same set of attributes to describe these entities, i.e. all tables use a shared schema. However, as the data originates from over 4.3 million different websites (hosts) from all across the Web, the actual data values are heterogeneous concerning value format, unit of measurement, and language.
This paper is structured as follows: Section 2 gives an overview of the creation process of the WDC Schema.org Table Corpora. Section 3 describes the data format, Section 4 profiles the content of the 2023 release, Section 5 compares the corpora to related table corpora, and Section 6 discusses their use cases.

CREATION PROCESS
This section describes the process of creating the WDC Schema.org Table Corpora. We use the 2023 release to illustrate the number of tables and rows and the amount of computation used.
1. Extracting Data from the Common Crawl. The WDC project has developed a parsing framework for extracting structured data from the Common Crawl [14]. The framework runs in the AWS cloud and supports the parallel processing of multiple (W)ARC files. To extract JSON-LD, Microdata, RDFa, and Microformats data from the HTML pages contained in the (W)ARC files, the framework uses the Any23 parser library. For the 2023 release, we used 250 AWS spot instances with 8 × 3.2 GHz CPUs and 16 GB RAM for the extraction, which altogether required 4,602 machine hours. The extracted corpus consists of 97 billion RDF quads (N-Quads). Webmasters primarily use the JSON-LD and Microdata syntaxes to annotate web pages with schema.org terms. Therefore, we merge the extracted JSON-LD and Microdata data to form class-specific subsets for selected schema.org classes. The subsets consist of all entities of a specific class along with entities of other classes present on the same page and contain 39 billion RDF quads. It took 5 days of compute time on a local shared server equipped with 96 × 3.6 GHz CPU cores and 1024 GB RAM to create the schema.org subsets.
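The WDC framework itself runs the Any23 parser on AWS; as a rough local illustration of the same idea, the following sketch iterates over a Common Crawl WARC file with the warcio library and extracts embedded JSON-LD and Microdata with extruct. The file name is a placeholder, and the tooling is a stand-in for the actual framework.

```python
import extruct
from warcio.archiveiterator import ArchiveIterator

# Placeholder path to a locally downloaded Common Crawl WARC file.
WARC_PATH = "CC-MAIN-example.warc.gz"

with open(WARC_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only HTTP response records contain page content.
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read().decode("utf-8", errors="replace")
        try:
            data = extruct.extract(html, base_url=url,
                                   syntaxes=["json-ld", "microdata"])
        except Exception:
            continue  # skip pages that fail to parse
        if data["json-ld"] or data["microdata"]:
            print(url, len(data["json-ld"]), len(data["microdata"]))
```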

2. Group by Host.
Next, all entities, their attributes, and the attribute values are converted from RDF quads into tabular form and grouped by host. If an attribute contains child entities instead of a literal value, all child entities and their attributes are extracted as a list. However, only literal values are considered for the attributes of child entities; any child entity attributes further down the hierarchy are dismissed. For example, a web page about a movie might annotate the name of the movie and details of the actors who appear in the movie, including their names and their spouses. In this case, the movie's name and a list of actors are extracted. For each actor, only the actor's name is extracted because it is a literal value. Child entities further down the hierarchy, such as the actor's spouse, are omitted. After this step, the 2023 version of the table corpus contains ∼7.5 million tables with overall ∼1.4 billion rows.
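A minimal sketch of this flattening rule, applied to a hypothetical nested movie record: literal values are kept, direct child entities are reduced to their literal attributes, and anything deeper (the actor's spouse) is dropped.

```python
def flatten_entity(entity: dict) -> dict:
    """Keep literals; reduce direct child entities to their literal
    attributes; drop anything nested more than one level deep."""
    row = {}
    for attr, value in entity.items():
        values = value if isinstance(value, list) else [value]
        if all(not isinstance(v, dict) for v in values):
            row[attr] = value  # literal value(s), kept as-is
        else:
            # Child entities: keep only their literal attributes.
            row[attr] = [
                {k: v for k, v in child.items()
                 if not isinstance(v, (dict, list))}
                for child in values if isinstance(child, dict)
            ]
    return row

movie = {
    "name": "Example Movie",
    "actor": [{"name": "Jane Doe", "spouse": {"name": "John Doe"}}],
}
print(flatten_entity(movie))
# -> {'name': 'Example Movie', 'actor': [{'name': 'Jane Doe'}]}
```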
3. Removal of Listing Pages and Sparse Entities. Listing pages contain concise information about entities that are described in more detail on other pages. In order to have attribute-rich entity descriptions in the corpus, we want to exclude descriptions originating from listing pages. Other pages provide a detailed description of one entity along with brief descriptions of further entities as part of navigation elements or advertisements. Our objective is to extract only the main entity from such pages. We apply the following heuristic to exclude sparse entities: If a web page contains only one relevant entity with at least three attributes, the entity is extracted. For web pages that contain multiple entities, we concatenate the attribute values of each entity and calculate the median and the mean absolute deviation (MAD) of the lengths of the concatenated attribute values across all entities on the page. Entities with at least three attributes and a concatenated attribute value length greater than the median plus three times the MAD (positive outliers) are extracted. If a web page marks up multiple entities without outliers, those entities are dismissed as originating from a listing page. Applying these heuristics reduces the size of the corpus to ∼5 million tables and ∼429 million rows.
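A sketch of this outlier heuristic under the rules stated above; the entity representation is simplified to dicts of attribute values, and the median-plus-three-MAD threshold follows the text.

```python
from statistics import fmean, median

def select_main_entities(entities: list[dict]) -> list[dict]:
    """Listing-page heuristic (sketch): keep only attribute-rich
    entities that are positive outliers in description length."""
    if len(entities) == 1:
        e = entities[0]
        return [e] if len(e) >= 3 else []  # single entity: keep if rich
    # Length of the concatenated attribute values per entity.
    lengths = [len("".join(str(v) for v in e.values())) for e in entities]
    avg = fmean(lengths)
    mad = fmean(abs(x - avg) for x in lengths)  # mean absolute deviation
    threshold = median(lengths) + 3 * mad
    # Pages without positive outliers yield an empty list, i.e. the
    # entities are dismissed as originating from a listing page.
    return [e for e, length in zip(entities, lengths)
            if len(e) >= 3 and length > threshold]
```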

4. Content-based Deduplication.
Content-based deduplication removes exactly equal entity descriptions that originate from different web pages of the same host. This process considers all attributes except for schema.org/url, which is excluded at the top and second level as its values may differ across pages and lead to false positives during deduplication. For the final table of each host, we keep only attributes with a value density above 25% and dismiss all sparser attributes. After content-based deduplication, the 2023 release of the WDC Schema.org Table Corpus contains ∼5 million tables with altogether ∼361 million rows of data. It took 10 days of compute time on a local shared server equipped with 96 × 3.6 GHz CPU cores and 1024 GB RAM to create the 2023 release from the RDF quads extracted in Step 1.
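A pandas sketch of this step under simplifying assumptions: duplicates are detected via a JSON serialization of each row with the url column excluded (rows may contain unhashable lists of child entities), and columns whose value density is at most 25% are dropped.

```python
import json
import pandas as pd

def deduplicate_table(df: pd.DataFrame) -> pd.DataFrame:
    # Serialize each row over all attributes except 'url', so that
    # rows containing lists of child entities can be compared.
    content = df.drop(columns=["url"], errors="ignore")
    keys = content.apply(
        lambda row: json.dumps(row.to_dict(), sort_keys=True, default=str),
        axis=1)
    df = df.loc[~keys.duplicated()]  # keep the first occurrence only
    # Keep only attributes with a value density above 25%.
    dense = [c for c in df.columns if df[c].notna().mean() > 0.25]
    return df[dense]
```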

DATA FORMAT
The tables are encoded in the JSON Lines format and can be read by the pandas Python library. An example table of the class Movie is shown in Figure 1. Each record represents an entity annotated using the schema.org vocabulary. The 'row_id' is an identifier for the extracted entity that is created during the extraction process. The 'name' column contains literal values of the schema.org/name attribute of the extracted entity, while the 'actor' column shows an example of extracted child entities of type schema.org/actor. These are represented as lists containing all literal attribute names and values of the respective child entity. Due to space limitations, only three actors are shown in Figure 1.
In total, the 2023 release has a size of 51 GB. For each class of entities, we provide three download files: (1) the top 100 tables containing the largest number of rows, (2) tables containing at least 3 rows, and (3) the tail tables containing the remaining smaller tables. Users may choose one or any combination of these files, depending on the intended application.
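The tables can be loaded directly with pandas; the file name below is a placeholder for one of the per-class table files.

```python
import pandas as pd

# Placeholder file name; the tables are distributed as JSON Lines
# files, one table per class/host combination.
df = pd.read_json("Movie_example.json.gz", lines=True)
print(df.columns.tolist())  # e.g. ['row_id', 'name', 'actor', ...]
print(len(df), "rows")
```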

CONTENT OF THE 2023 TABLE CORPUS
This section presents profiling information for the 2023 release of the WDC Schema.org Table Corpus. The corpus comprises around 5 million tables, containing over 361 million rows of data in total. The tables cover 42 schema.org classes and originate from over 4.33 million websites (hosts).

Tables by Class.
Statistics on the number of tables per class, their rows, and the average number of attributes for a selection of schema.org classes are presented in Table 1. The selected schema.org classes demonstrate the breadth of the corpus, ranging from classes with many tables, such as Product, to classes with fewer tables, such as Dataset. In addition to the statistics for the complete corpus (Overall), Table 1 provides separate statistics for the largest 100 tables (Top 100) and for tables with at least three rows (Minimum 3). For example, the corpus contains over one million LocalBusiness tables describing altogether 8 million business entities with on average 4 attributes, such as name, address, telephone, or average rating. By distinguishing between the Top 100 and Minimum 3 tables, we can see that the Top 100 tables account for 1% to 11% of all rows for the three most popular classes: Product, LocalBusiness, and Event.
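Statistics of this kind can be recomputed from the corpus itself; a minimal sketch, assuming the tables of one class have been downloaded as local JSON Lines files (the directory layout is hypothetical):

```python
import glob
import pandas as pd

# Hypothetical layout: one JSON Lines file per class/host table.
files = glob.glob("LocalBusiness/*.json.gz")
tables = [pd.read_json(f, lines=True) for f in files]

n_tables = len(tables)
n_rows = sum(len(t) for t in tables)
avg_attrs = sum(len(t.columns) for t in tables) / max(n_tables, 1)
print(f"{n_tables} tables, {n_rows} rows, "
      f"{avg_attrs:.1f} attributes on average")
```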
Attributes. The WDC Schema.org Table Corpus is constructed using schema.org annotations. As a result, all tables draw from the same set of attributes, while the attributes present in a specific table depend on the annotations included by the corresponding host in its web pages. Table 2 shows a selection of attributes appearing in tables for the classes Product, LocalBusiness, and Movie. Common attributes such as name and description are present in many tables across multiple classes. Other attributes, such as productID and genre, are more class-specific and less frequently used, indicating a long-tail distribution of such attributes in the corpus. Nevertheless, for both head and long-tail attributes, the tables exhibit a high average value density of 95%. This shows that if hosts use a schema.org term, they do so consistently. Some attributes are entity identifiers that can be used to link entities across tables, for example to derive training data for entity matching tasks. Examples of such attributes are SKU, productID, MPN, and GTIN13 for the class Product, as well as the telephone number for LocalBusiness.
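As an illustration of identifier-based linking, the following sketch joins two hypothetical Product tables on their GTIN13 values to derive matching record pairs; the file names and the 'gtin13' and 'name' column names are assumptions about the table layout.

```python
import pandas as pd

# Two hypothetical Product tables from different hosts.
shop_a = pd.read_json("Product_hostA.json.gz", lines=True)
shop_b = pd.read_json("Product_hostB.json.gz", lines=True)

# Rows sharing a GTIN13 describe the same product: positive pairs
# for training or evaluating entity matching methods.
pairs = shop_a.merge(shop_b, on="gtin13", suffixes=("_a", "_b"))
print(pairs[["gtin13", "name_a", "name_b"]].head())
```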

RELATED WORK
Various table corpora have been created in recent years. Table 3 lists these corpora and shows statistics on their number of tables (Tabs.), the average number of rows (Avg. # Rows) and attributes (Avg. # Attr.), and whether all tables in a corpus use a single shared schema. The WDC Web Table Corpus [12] and the Dresden Web Table Corpus [6] extract relational HTML tables from web pages in the Common Crawl. These web tables [17] have been used in related work on table search [4] and table augmentation [3]. The WikiTables corpus contains tables that were extracted from Wikipedia [1]. VizNet [8] consists of tables that were chosen for benchmarking visualization methods. Open Data Portal Watch [15] contains tabular data that was collected from open data portals. Table 3 shows that the tables in the WDC Web Table Corpus, the Dresden Web Table Corpus, WikiTables, and VizNet have a relatively small number of rows. Compared to these web table corpora, the WDC Schema.org Table Corpora and GitTables [9] contain on average more rows. GitTables [9] consists of tables that are extracted from CSV files shared on GitHub. The tables in all the referenced corpora do not use a single shared schema; instead, each table uses a different, proprietary schema. As a post-processing step, the tables in GitTables are annotated with semantic types from DBpedia and schema.org using an automated annotation method, which might misinterpret table semantics [9]. The WDC Schema.org Table Corpora, in contrast, are generated from schema.org annotations. As a result, the tables in our corpora follow a single shared schema.

USE CASES
This section describes various use cases of the WDC Schema.org Table Corpus.
Benchmarking. The table corpus is a useful resource for evaluating table annotation, schema matching, entity matching, and data retrieval methods due to its shared schema and the presence of entity identifiers, such as GTINs, MPNs, or phone numbers, in many tables. For example, the SOTAB Table Annotation Benchmark [11] was constructed by selecting a subset of the tables from the 2020 version of the corpus, removing the attribute labels from the tables, and having annotation systems predict the removed labels. The SOTAB Benchmark was used in the 2023 edition of the SemTab challenge (https://sem-tab-challenge.github.io/2023/). A second benchmark that uses tables from the 2020 version of the corpus is the WDC Schema Matching Benchmark (https://webdatacommons.org/structureddata/smb/), which requires instance-based schema matching methods to discover correspondences between table columns. A benchmark that uses schema.org entity identifiers as ground truth is the WDC Products entity matching benchmark [16].
Source of Training Data. Its structuredness, common schema, and entity identifiers also make the corpus an interesting source of (pre-)training data for table representation learning as well as for data integration tasks, e.g. entity matching [16] and table annotation [11]. The corpus further contains 4 million question-answer pairs originating from 38,000 websites (see row Question in Table 2), which could be used for fine-tuning LLMs or as background knowledge for retrieval-augmented question answering. Besides such task-specific training, the corpus can also be used as a structured pre-training resource for table representation learning methods [5] or for fine-tuning LLMs for structured data tasks [13].
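A sketch of deriving question-answer pairs from the Question tables, assuming the question text is stored under 'name' and the accepted answer as a child-entity list whose dicts carry a 'text' literal; the exact column names are assumptions and may differ per table.

```python
import pandas as pd

# Placeholder file; 'name', 'acceptedanswer', and 'text' are assumed
# column/attribute names based on the schema.org Question class.
questions = pd.read_json("Question_example.json.gz", lines=True)

qa_pairs = []
for _, row in questions.iterrows():
    answers = row.get("acceptedanswer")
    if not isinstance(answers, list):
        continue  # missing or malformed values are skipped
    # Child entities are stored as lists of literal-attribute dicts.
    for answer in answers:
        if isinstance(answer, dict) and "text" in answer:
            qa_pairs.append({"question": row["name"],
                             "answer": answer["text"]})
print(len(qa_pairs), "question-answer pairs")
```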
Analyzing the Adoption of Semantic Web Technologies. Together with the corpora, we publish detailed statistics about which host uses which schema.org terms. From a web science perspective, these statistics, together with the data itself, can be used to analyze the adoption of the schema.org vocabulary within specific application domains as well as on the Web in general [2, 7].
Source of Domain Data. Last but not least, the corpus can be used as a large source of domain data. For example, if a user wants to assemble a list of shops or hotels in a city, the 1 million local business tables in the corpus with altogether 8 million rows could be a useful starting point. Or, if a user wants to analyze the skills that are currently in demand on the job market, they could use the 3 million job postings in the corpus for their analysis.
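A minimal sketch of the first example, filtering LocalBusiness rows by city; the file name is a placeholder, and the assumption is that addresses are stored as child-entity lists containing an 'addresslocality' literal (following the schema.org PostalAddress type).

```python
import pandas as pd

businesses = pd.read_json("LocalBusiness_example.json.gz", lines=True)

def in_city(address_list, city="Berlin"):
    # 'address' holds child entities as lists of literal attributes;
    # the 'addresslocality' key is an assumption based on schema.org.
    if not isinstance(address_list, list):
        return False
    return any(a.get("addresslocality") == city
               for a in address_list if isinstance(a, dict))

hotels_in_berlin = businesses[businesses["address"].apply(in_city)]
print(len(hotels_in_berlin), "businesses in Berlin")
```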

Table 1: Number of tables and rows for selected schema.org classes in millions (M) and thousands (k).

Table 2: Fraction of tables containing specific attributes.