HKPoly: A Polystore Architecture to Support Data Linkage and Queries on Distributed and Heterogeneous Data

Context: Modern information systems commonly manipulate heterogeneous data and schemas fragmented across the data stores that best fit their storage and access requirements. Moreover, different organizations' business processes independently consume these fragments without explicit links between the employed data. Problem: Supporting heterogeneous and not explicitly connected data residing in distinct data repositories is a big challenge. Solution: This work proposes HKPoly: a federated architecture that encapsulates data heterogeneity, location, and linkage. IS Theory: We employed Representation Theory to create the models of the architecture and its components. Method: Architecture implementation, its application in an Oil & Gas scenario, and its comparison to a multi-database system. Results: The proposal allows query writing to be two times less complex than the equivalent query written for the relational multi-database system, adding an overhead of about 30% in query processing time. Contributions: An architecture to query heterogeneous data, the requirements and components for its implementation, and an implementation example using state-of-the-art concepts.


INTRODUCTION
Modern information systems usually manipulate datasets with different models and usages, employing specific tools and techniques, e.g., applications in medical informatics, oceanography, metagenomics, and exploration and production phases in Oil & Gas [30].
In oil reserves discovery, independent workflows process sizeable raw data files to generate training and validation datasets used by Deep Learning (DL) models. The workflows (Figure 1) consume and generate data from/to heterogeneous data stores where data is not directly linked. The first workflow (Wf1) processes geological raw data files residing on a Parallel File System to extract metadata. It assesses missing and displaced information to measure metadata quality and stores the metadata in a relational DBMS (R-DBMS). Wf2 considers the high-quality data files and generates geospatial indexes to accelerate geospatial queries over the geological data. It stores the indexes in a document-oriented DBMS (Doc DBMS). Wf3 augments the high-quality raw geodata files with extra knowledge informed by geoscience experts and stores that knowledge in a Triplestore System (T-DBMS). Wf4 prepares the learning datasets used by DL algorithms. It queries the data from the Doc DBMS and the related knowledge from the T-DBMS to generate datasets, which are stored as HDF5 files in the Parallel File System, ready to be used by DL algorithms. More details are presented by [30].
In this scenario, the user is an ML expert with deep knowledge of the domain. When reporting results, the user must query the data residing in the heterogeneous data stores.
Several data management solutions handle heterogeneous data access, such as distributed file systems (e.g., GFS [10]), NoSQL databases (e.g., MongoDB, AllegroGraph), new data processing frameworks (e.g., Spark), and hybrid multi-model (e.g., OrientDB) or hybrid NewSQL (e.g., LeanXcale) systems. Each solution generally manages a few kinds of data models or formats [2]. On the other hand, migrating the heterogeneous data to a single database and schema does not work: data conversion and loading is a very costly and time-consuming task, which is difficult to justify [32].

Figure 1: Workflows and data stores of the scenario [30].
A Multidatabase System (MDBS) (or federated data system) allows users to access various databases in an integrated way. A polystore system is a kind of MDBS [24]. There is still a lack of a more abstract query language to access heterogeneous data [32]. This work's research question is "How to query data that are not explicitly connected, residing on heterogeneous data stores?" To answer it, we propose a polystore system architecture that uses provenance for data linkage. We create links between data by collecting the provenance of the execution of the organization's business processes' activities (or workflows) [6]. The architecture provides an abstract layer to formulate queries based on a global conceptual schema, and it supports the creation of remote data mappings and transparent record linkage.
We implemented the architecture as a RESTful Web service [28] with metadata stored in a Knowledge Graph. We evaluated it in a simulation of a real case in the Oil & Gas industry, considering query complexity from the user and system perspectives. The results indicate the architecture's applicability and utility; further improvements in query processing time are required.
We used Representation Theory. We elicited the requirements and designed the components and data structures (i.e., the deep structure). We identified how users interact with the architecture (i.e., the surface structure). We implemented it and illustrated how it carries out the proposed components and accesses the Knowledge Graph (i.e., the physical structure).
The remainder of this work is organized as follows: Section 2 presents background and the justifications for our choices; Section 3, the proposed architecture; Section 4, the architecture implementation and evaluation; Section 5, related work; and Section 6, conclusion and future work.

BACKGROUND
This section presents the main concepts related to this work and the reasoning for their use in our proposal.

Multidatabase and Postgres FDW
A Multidatabase System (MDBS) provides a software layer that runs on top of individual Database Management Systems (DBMSs) and facilitates users' access to various databases. The MDBS creates a Global Conceptual Schema (GCS) to represent an integrated view of a database whose parts are allocated to different sites. Each site runs an individual DBMS, sharing parts of its local database, represented by a Local Conceptual Schema (LCS). The database consumer formulates queries based on the GCS, and the MDBS translates them into local queries sent to be executed by the individual DBMSs. The MDBS receives the responses, consolidates them, and returns an integrated result to the user [24].
The International Organization for Standardization (ISO) developed SQL/MED (Management of External Data), an extension of SQL that allows applications to use standard SQL to access SQL and non-SQL data. SQL/MED defines an API that standardizes the communication between an SQL-based server and wrappers (aka Foreign-Data Wrappers or FDWs) that access external servers (aka Foreign Servers). SQL/MED defines how a relational DBMS works as an MDBS. PostgreSQL, MySQL, and MariaDB support SQL/MED. Other DBMSs do not implement the standard, although they provide similar features: Db2 provides a distributed database feature, Microsoft SQL Server offers Linked Server, and Oracle provides database link [23]. The robustness of PostgreSQL, its open-source license, the wrappers already implemented, and its wide use motivated us to employ it in our proposal.
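As a minimal sketch of what an SQL/MED setup can look like in practice, the Python script below registers a MongoDB collection as a PostgreSQL foreign table, assuming the mongo_fdw extension is installed; all connection settings, table names, and columns are illustrative, not taken from HKPoly.

    import psycopg2  # assumes psycopg2 and a PostgreSQL server with mongo_fdw available

    DDL = """
    CREATE EXTENSION IF NOT EXISTS mongo_fdw;
    CREATE SERVER IF NOT EXISTS mongo_server
        FOREIGN DATA WRAPPER mongo_fdw
        OPTIONS (address 'mongo-host', port '27017');
    CREATE FOREIGN TABLE IF NOT EXISTS seismic_index (
        _id        name,
        name       text,
        inline_min integer,
        inline_max integer
    ) SERVER mongo_server
      OPTIONS (database 'geodata', collection 'seismic');
    """

    # Register the remote collection; afterwards it can be queried like a table.
    with psycopg2.connect("dbname=polystore user=postgres") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)

Once the foreign table exists, it can be joined with regular tables in plain SQL, which is the mechanism HKPoly relies on (Section 4.2).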

Provenance
Provenance is also known as the audit trail, lineage, and pedigree of a data product.It contains information about the process and data used to derive the data product [6].
We employ ProvLake, a lineage data management system capable of capturing, integrating, and querying data across multiple workflows by leveraging provenance data. ProvLake's data model is built on W3C PROV [12] and PROV-Wf [5]. It supports data representation for workflows on data lakes [30]. ProvLake is efficient for High-Performance Computing workloads due to its system design principles, which aim at very low data capture overhead. While capturing the data, ProvLake logically integrates and ingests them into a provenance database ready for analysis at runtime. ProvLake was validated in a use case encompassing four workflows that process 5 TB datasets for a deep learning classifier. Compared with Komadu, the closest state-of-the-art solution, ProvLake enables runtime multiworkflow data analysis with much smaller overhead, as low as 0.1% [30].
ProvLake provides a lightweight data tracking API to be added to workflow code (ProvLakeLib), such as scripts, and a Web API for runtime analytical queries that integrate multistore data.
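The sketch below illustrates the kind of instrumentation such a tracking API enables; the class and endpoint names are ours for illustration and do not reproduce ProvLakeLib's actual interface.

    import json
    import urllib.request

    class ProvTracker:
        """Hypothetical lineage tracker posting provenance records to a Web API."""

        def __init__(self, endpoint: str, workflow: str):
            self.endpoint, self.workflow = endpoint, workflow

        def task(self, name: str, used: dict, generated: dict) -> None:
            # One record per data transformation execution: its inputs and outputs.
            payload = {"workflow": self.workflow, "task": name,
                       "used": used, "generated": generated}
            req = urllib.request.Request(self.endpoint,
                                         data=json.dumps(payload).encode(),
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)

    tracker = ProvTracker("http://hkprov-manager/prov",  # hypothetical endpoint
                          "geological_data_ingestion_workflow")
    tracker.task("data_quality_assessment",
                 used={"seismic_file": "netherlands.segy"},
                 generated={"assessment_id": 42})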

Knowledge Graph and Hyperknowledge (HK)
Knowledge Graphs (KG) have gained broad use in research and business [16] since the term was popularized by a Google blog post [29]. There are several definitions for the term; we follow [8]: "A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge". We chose this one because we create a knowledge base acquiring information on schemas and instances corresponding to the GCS, LCSs, mappings, and provenance, and we use inference mechanisms, e.g., to navigate the KG and compute queries to be executed on the remote data stores. We use the IBM Hyperlinked Knowledge Graph (or Hyperknowledge, HK for short) for knowledge representation.
HK is a hybrid conceptual model able to handle a multitude of media content and the meaning behind this data. It supports describing associations among symbolic semantics and non-symbolic data fragments within the same knowledge base. This representational approach combines, in a single formalism, user interaction, data segments (e.g., sentences of a text document, fragments of images, segments of seismic data, fragments of executable code), and semantic representations (e.g., knowledge entities in an ontology that can be reasoned upon) [20].
The foundation of HK is a pair of typical hypermedia concepts: (i) Nodes represent information units; (ii) Links define relationships among fragments of information (anchors) and properties of nodes. Nodes can be of two classes: (i) Terminal nodes, which are composed of a collection of information units; (ii) Composite nodes, whose content is a set of nodes of the two classes. Composite nodes may be specialized to define the semantics of node collections; e.g., a Context composite node contains a set of links and other attributes.
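The following dataclass sketch renders our reading of these concepts; it is illustrative and unrelated to the actual HK libraries.

    from dataclasses import dataclass, field

    @dataclass
    class Node:                       # information unit
        id: str

    @dataclass
    class Link:                       # relationship among anchors and nodes
        relation: str
        endpoints: list[str]          # ids of nodes (or anchors within nodes)

    @dataclass
    class Context(Node):              # composite node: a set of nodes plus links
        nodes: list[Node] = field(default_factory=list)
        links: list[Link] = field(default_factory=list)

    ctx = Context(id="seismic-context",
                  nodes=[Node("Seismic"), Node("netherlands.segy#slice42")],
                  links=[Link("depicts", ["netherlands.segy#slice42", "Seismic"])])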

Hyperknowledge Platform -HK-Platform
HK-Platform is a set of tools developed for handling HK. From this set, we use two: HKBase and KES.
HKBase provides a RESTful API to manipulate structured data represented in HK and unstructured data (e.g., images, videos). Structured data may be stored in data stores of different types, e.g., Triplestores (Jena or AllegroGraph), Graph Databases (JanusGraph), or Document Databases (MongoDB). Unstructured data is stored in Object Stores or File Stores (e.g., MinIO, FileSystem).
In HKBase, data may be retrieved using either HyQL (the Hyperknowledge Query Language) or SPARQL. HyQL provides access to HK data in a language closer to its representation [21]. SPARQL is the query language for RDF (Resource Description Framework) defined by the W3C [26]. Triplestores, or RDF stores, are the storage of choice for storing and querying RDF data. When HKBase stores HK in a triplestore database, one can query the database using SPARQL as if only RDF data were stored.
KES (Knowledge Explorer System) [22] is a Web application for collaborative management of HK databases.It enables creating, validating, and curating HK data in an interactive visual approach.
Our polystore solution is provisioned as an HKBase service and we used KES to inspect the ingested data and as an alternative to manually manage models and mappings.

HYPERKNOWLEDGE POLYSTORE
Our polystore architecture provides users with a single layer for data access, encapsulating data store heterogeneity, location, and data linkage using data mappings and provenance. We named it Hyperknowledge Polystore (or HKPoly for short) because it uses the Hyperknowledge metamodel (Section 2.3) in its implementation to represent the HKPoly data schemas.

Requirements and Stakeholders
HKPoly supports the following requirements: (R1) creation of the domain model (GCS) using a simple language; (R2) representation of the GCS schema; (R3) representation of the LCSs of the shared portions of the remote data stores; (R4) mapping between the GCS and the LCSs; (R5) capture of the provenance of the workflow executions; (R6) processing of queries formulated over the GCS. The stakeholders are: (S1) Domain Expert: understands the domain and creates the domain model (R1); (S2) Database Administrator: understands the remote data stores and creates their LCSs (R3); (S3) Knowledge Engineer: creates the GCS schema (R2) and the mappings among GCS and LCSs (R4); (S4) Provenance Specialist: understands the domain processes, creates the provenance schema of these processes, and supports developers to instrument the processes' applications (R5); (S5) Developer: instruments the applications that support the domain processes (R5); (S6) Client User: understands the GCS, formulates queries, and calls HKPoly to process the queries (R6).

Architecture overview
HKPoly supports R1, R2, R3, and R4 via a service to manage domain and remote data store metadata. The service is accessed using a client application (Figure 2.a) or a User Interface (Figure 2.b). Metadata is stored in the HKPoly Knowledge Graph (or HKPoly KG).
The HKPoly KG metamodel (Figure 3) was created based on W3C PROV and ProvLake (Section 2.2). We propose elements inheriting from W3C PROV (like Collection and Entity), and we use ProvLake elements to represent the workflow provenance schema and execution (e.g., Workflow, WorkflowExecution, DataTransformation, DataTransformationExecution, Attribute, AttributeValue).
R1 Support: The HKPoly client user employs a simple language for the GCS creation, i.e., a language without many constructs.
R2 Support: The Knowledge Engineer (KE) formulates a data schema to represent GCS domain elements as instances of classes DatasetSchema and Attribute of HKPoly metamodel.
R3 Support: The Database Administrator (DBA) creates LCS schemas for each shared portion of the remote data stores' databases as instances of the metamodel presented in Figure 3, detailed as follows. The DBA can implement a script to automate this task. A Database resides in a DataStore (isInStore relationship), and it may have DatabaseSchemas, which may have DatasetSchemas (isSchemaOf and isDataSchemaOf relationships, respectively).
DatasetSchema and Attributes are used to represent the GCS and LCS schemas, e.g., tables of a remote PostgreSQL database.
An Attribute may be simple or complex (like a list or a dictionary). Simple Attributes cannot be subdivided. ComplexAttributes may be composed of other attributes; this composition is represented by the isMemberOfComplexAttribute relationship. We use alias relationships to represent equivalent semantics and mappings of Attributes, e.g., an alias linking Seismic.uri of the GCS with Seismic.id of a relational table in a PostgreSQL LCS.
R4 Support: The Knowledge Engineer (KE) maps GCS and LCS by creating alias relationships between GCS and LCS attributes.
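A minimal sketch of such alias mappings, reduced to a plain dictionary for illustration (store and attribute names follow the Seismic example but are hypothetical):

    # GCS attribute -> (data store, LCS attribute) pairs created as alias relationships.
    GCS_TO_LCS = {
        "Seismic.uri": [
            ("postgresql",   "seismic_header.id"),
            ("mongodb",      "seismic.identifier"),
            ("allegrograph", "kb_seismic.uri"),
        ],
        "Seismic.name": [
            ("postgresql",   "seismic_header.name"),
            ("allegrograph", "kb_seismic.label"),
        ],
    }

    def resolve(gcs_attribute: str) -> list[tuple[str, str]]:
        """Return the local attributes aliased to a GCS attribute."""
        return GCS_TO_LCS.get(gcs_attribute, [])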
R5 Support: The Provenance Specialist (PS) understands the workflows and creates their provenance schema as instances of the classes presented in gray in Figure 4, such as Workflow, DataTransformation, and Attribute. The PS works with Domain Experts and Developers to create this schema.
Afterward, Developers include calls to ProvLakeLib (Section 2.2) in the code of the applications that support the workflows, to capture execution provenance and send it to the Provenance Manager (Figure 5), which stores the workflow executions' provenance data as instances of the classes in white in Figure 4, such as WorkflowExecution, DataTransformationExecution, and the used and generated AttributeValues. DataReferences are values that identify the data records residing in the DataStores, e.g., an id or a URI.

Figure 4: Provenance model based on ProvLake [30].

Figure 5: Provenance manager architecture (components: instrumented client application, provenance manager, KG DBMS, HKPoly KG).

The schema metadata, mappings, workflow schemas, and executions stored in the HKPoly KG enable data linkage. The mappings between GCS and LCS elements allow identifying instances from different data stores (i.e., distinct LCSs) that represent the same global concept when distinct workflows consume them.
R6 Support: The HKPoly architecture overview is depicted in Figure 6. A client application sends HKPoly a query written in a supported query language (e.g., HyQL) and based on the GCS concepts. HKPoly processes the query as follows: (i) it interprets and validates the input query concerning GCS elements and supported operators by querying the KG; (ii) it creates local queries for each LCS that maps to the GCS elements used in the query, getting the GCS and LCS mappings and the provenance data required for query building; (iii) it creates an optimized execution plan for the local queries; (iv) it sends the local queries to wrappers, which send them to the heterogeneous local data stores; (v) the wrappers receive the query results represented in the LCS, transform them according to the GCS, and return the restructured results to HKPoly; (vi) HKPoly consolidates the results in a single response; (vii) HKPoly returns the response to the client. A user can employ any Web API client tool to execute queries over the GCS since HKPoly is provisioned as a Web service. In our implementation, we use KES (Section 2.4) to inspect the ingested data and, as an alternative, to manually manage models and mappings.
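The skeleton below mirrors steps (i)-(vii); every function name is a placeholder standing in for the corresponding HKPoly component, not its actual code.

    def optimize(local_queries):
        # (iii) Placeholder plan: keep the original query order.
        return local_queries

    def consolidate(partials):
        # (vi) Placeholder consolidation: concatenate partial results.
        return [row for part in partials for row in part]

    def process_query(hyql_query, kg, wrappers):
        parsed = kg.validate(hyql_query)                 # (i) validate against the GCS
        local_queries = kg.map_to_lcs(parsed)            # (ii) one local query per mapped LCS
        partials = []
        for lcs, query in optimize(local_queries):
            raw = wrappers[lcs].execute(query)           # (iv) send to the wrapper/data store
            partials.append(wrappers[lcs].to_gcs(raw))   # (v) restructure results in GCS terms
        return consolidate(partials)                     # (vi)+(vii) single response to the client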

HKPOLY IMPLEMENTATION IN A REAL SCENARIO
This section presents the architecture implementation and its evaluation in a simulation of a real scenario.

Scenario
We evaluate the architecture implementation in oil reserves discovery, a critical scenario for the Oil & Gas (O&G) industry. We are concerned with seismic image interpretation. Typically, these images cover large extensions of the earth, and, by inspecting them, geoscientists try to identify geological features, e.g., salt bodies. Academia and the O&G industry aim at automating this activity [27], and deep learning (DL) is a promising ML technique [4].
Managing the data generated during the training of production DL models is hard [31]. This is particularly true in geoscience problems [11]. It requires preprocessing, cleaning, and performing complex integrated data analysis. This lifecycle is decomposed into parts addressed by collaborating teams of geoscientists, computational scientists, engineers, and statisticians, among others. Each team has a preferred way to automate tasks and store data while consuming data from other teams.
The case study includes activities ranging from preprocessing large raw geological data files to generating the training and validation datasets for the DL models depicted in Figure 1 (Section 1). The workflows use data stores with heterogeneous data models. An example of data to be queried and integrated is shown in Table 1.

HKPoly Implementation
The HKPoly service is implemented as a RESTful Web service [28] as a realization of the architecture depicted in Figure 6. Figure 7 presents a UML component diagram concerning the support of R1, R2, R3, and R4. The HKPoly service interface is provisioned as part of the HKBase API (Section 2.4), the IHKPoly interface in the diagram. Client applications call the HKPoly service to create the GCS, the LCSs, and the mappings between them. The HKPoly service uses HKBase's HKDataSource to store the HKPoly Knowledge Graph in HKBase's DBMS. Users may use KES (Section 2.4) to visualize and manipulate the ingested data.
The domain model (GCS) may be created using basic elements of HK (like Concept nodes and Connectors for relationships), and the GCS schema, the LCSs, and the mappings are created using the models presented in Section 3.2 (Figure 3). As an example, considering the Seismic domain data presented in the first column of Table 1, a user would model the domain (R1) as a node Seismic and edges connecting it with the name, inline, crossline, well, horizon, and epsg properties. The KE creates the GCS schema (R2) for the domain model. R5 is supported by the provenance manager component (aka HKProvManager), also provisioned as a RESTful Web service; it is a realization of the architecture presented in Figure 5.
Figure 8 presents a component diagram of HKProvManager. The developers of client applications create hooks in their code (illustrated by the Instrumented client application component), which make calls to the HKProvManager service. HKProvManager uses HKBase services to store the provenance data, using HKDataSource to access the DBMS. The service calls store provenance execution data, such as: (i) workflow start information; (ii) workflow DataTransformation executions and their input parameters and returned data (used and generated Attributes and AttributeValues); (iii) workflow end information.
The requirement for the instrumented code is that the Attributes' names must be the same as those defined in the LCS. This allows identifying which domain data are consumed by each workflow and, consequently, the linkage among the object fragments residing in each data store. These data are used by HKPoly to process client queries.
Figure 9 presents a diagram exemplifying the execution of the workflows of Figure 1 to process the Netherlands seismic; for simplification, workflows are presented as actions. The workflow Data quality assessment reads the netherlands segy file from the FileSystem and inserts the Netherlands assessments into a PostgreSQL database table. The workflow Geospatial index generation reads the same file and stores the Netherlands indexes in a MongoDB collection. The workflow Expert knowledge ingestion stores Netherlands knowledge information in AllegroGraph; an Expert provides this information when analyzing the netherlands segy file. Finally, the workflow Data preparation creates the training data by reading all the stored data and generating the netherlands.train file, which is stored in the FileSystem. During these workflow executions, while storing data in the data stores, the instrumented client application also sends provenance data by calling HKProvManager, which stores the DataReferences for the record, document, and triples (id, identifier, and URI, respectively) in the HKPoly Knowledge Graph. Table 4 presents examples of the captured DataReferences. The provenance data, GCS, LCSs, and mappings are used to compute the polystore query (R6).

Figure 10 presents the HKPoly component diagram concerning the support for R6. It is a realization of the architecture depicted in Figure 6. The client application creates a query based on the GCS model using the Hyperknowledge query language (HyQL). An example is Query 1, which retrieves Seismic attributes (lines 1-2) considering the workflow geological_data_ingestion_workflow (line 3) and the Seismic whose name is Netherlands (line 4). HKPoly interprets and validates the query using the data stored in the HKPoly Knowledge Graph. Query 2 is an example of a SPARQL query that HKPoly uses to retrieve the KG's data. It retrieves workflow execution and related GCS data (lines 1-3) and remote data (lines 4-5), which HKPoly uses to create the polystore query. It navigates the KG executing, e.g., property path and inverse property navigation (lines 10-11), and traversing the workflow elements (lines 8-15) and the GCS (lines 16-19). The att variable connects the workflow and GCS data (line 9 for the workflow and line 16 for the GCS). HKPoly then computes the SQL presented in Query 3, which is sent to PostgreSQL FDW; the FDW processes the query, and the results return to HKPoly, which returns the response to the client service consumer.
To access the remote heterogeneous data stores, HKPoly uses the PostgreSQL Foreign Data Wrapper (FDW) (Section 2.1) as the polystore connector. HKPoly does not implement a new connector since several are available, and our work focuses on validating the use of the GCS and provenance to enable a seamless access interface to users regarding abstraction and data linkage. In Query 3, the from clause (lines 6-8) lists the PostgreSQL Foreign Tables that map to the remote data stored in a segy file, the AllegroGraph knowledge base triples, the Mongo Seismic collection, and the Seismic_Header table. (Foreign tables behave like regular tables and are used to create mappings in a PostgreSQL database to external objects [19].) Lines 9-13 compute a "constant table" using the SQL VALUES clause (https://www.postgresql.org/docs/current/sql-values.html) with the workflow execution data values retrieved using Query 2. In the where clause, the Foreign Tables and the constant table are joined (lines 14-17), and, as Seismic.name is linked to the kb_seismic and seismic_header Foreign Tables, lines 18-19 are included in the SQL during its generation. The returned data is listed in lines 1-5, which reference the columns of the Foreign Tables mapping to the Seismic domain object.

There are several challenges in polystore development [2]. We focused on devising a solution to handle the semantic and record linkage mappings concerning the automatic translation of utterances to the local dialect of a storage system and the integration of the results [32]. Other challenges concern query language and query optimization. We use the HyQL query language to support ease of use with efficient processing over diverse stores, and we compared it with SQL. We left query optimization as future work; it requires handling data flow and multi-platform scheduling, as well as a cost model specific to heterogeneous distributed data stores.
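To make the constant-table construction described above concrete, here is a hypothetical sketch of how a VALUES clause holding provenance-derived data references could be spliced into the generated SQL; the table and column names echo Query 3 but are illustrative.

    def build_fdw_sql(data_refs):
        """data_refs: (record id, document id, triple URI) tuples retrieved via Query 2."""
        values = ",\n            ".join("(%s, '%s', '%s')" % ref for ref in data_refs)
        return f"""
        SELECT h.name, h.inline, h.crossline, k.well, k.horizon
        FROM seismic_header h, kb_seismic k, seismic_index m,
             (VALUES
            {values}
             ) AS prov(pg_id, mongo_id, kb_uri)
        WHERE h.id = prov.pg_id
          AND m.identifier = prov.mongo_id
          AND k.uri = prov.kb_uri
          AND h.name = k.label
        """

    print(build_fdw_sql([(1, "5f1d", "http://example.org/seismic/netherlands")]))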

Evaluation
Our evaluation measures query complexity considering the user perspective and the database perspective.

Query Complexity - User's Perspective
We count the query components to estimate the user's cognitive load. Özsu and Valduriez [24] state that the number of relations and operators characterizes complex SQL queries; the complexity increases with the number of equivalent operators. Examples of works that use query components are [33], [35], and [37]. For instance, Vashistha and Jain [35] measure query complexity by: the number of tables and columns in a query; the query length (i.e., number of characters); the number of operators (e.g., join, filter); and the query expressions (e.g., like, greater than, or, and). We do not choose a weighted sum because there is no consensus that a weighting technique would improve the evaluation.
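A toy version of this counting, using the Query 1 and Query 3 figures reported in Table 5 below:

    # Components = projection elements + FROM elements + joins + filters.
    def components(projection, from_items, joins, filters):
        return projection + from_items + joins + filters

    hyql_q1 = components(projection=5, from_items=1, joins=0, filters=1)  # 7
    sql_q3  = components(projection=5, from_items=5, joins=3, filters=1)  # 14
    print(hyql_q1, sql_q3, sql_q3 / hyql_q1)                              # 7 14 2.0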
Table 5 presents the number of query components. The user query (Query 1), written in HyQL, has 5 elements in the projection, 1 element in the from clause, and 1 filter, resulting in 7 components. The equivalent polystore SQL query (Query 3) has 5 elements in the projection, 5 elements in the from clause, 3 joins, and 1 filter, resulting in 14 components. So, in this example, the query written in HyQL is two times less complex than the SQL query generated by HKPoly to access the external data.

Query Complexity - Database's Perspective
We evaluate the database's query complexity by measuring query processing time, i.e., the time HKPoly takes to process the user query. It corresponds to the implementation supporting R6 (Figure 10). We compare this time against the time required to process the PostgreSQL FDW query alone. The HKPoly processing time includes: (i) receiving the query (through the IHKPoly interface), e.g., Query 1; (ii) parsing the received query; (iii) running queries over the HKPoly Knowledge Graph (using HKDataSource) to get the required metadata and mappings, running Query 2 and others; (iv) computing the PostgreSQL FDW query, e.g., Query 3; and (v) running the computed query using PostgreSQL FDW. The compared PostgreSQL FDW time corresponds only to the time elapsed in step (v). We vary the data volume of the HKPoly Knowledge Graph and of the remote data stores to analyze HKPoly's overhead as the remote data volume increases.

We build the following experimental deployment. We use a Docker container, allocating 5 CPUs, 10 GB RAM, and a 2 GB swap area on a machine with an Intel Core i5 quad-core 2 GHz CPU and 16 GB RAM. HKBase and its services (including HKPoly and HKDataSource) are deployed in a container. Considering that all servers were part of the same network, we mirrored the architecture presented in Figure 10. We use Apache Jena as HKBase's main DBMS, deployed in a single container. Each remote data store is also deployed in a separate container with a schema matching the seismic domain model, corresponding to the scenario presented in Section 4.1. Data resides in AllegroGraph, MongoDB, and PostgreSQL; the data attributes are illustrated in Table 2. The PostgreSQL FDW is deployed in a single container and loaded with data wrapper plugins to provide communication and data transformation between the remote databases. All these containers are started together with a composing script.
After setting up the test environment, we load domain data into the remote stores and provenance data into the HKPoly Knowledge Graph. To control the volume of data, we generate domain data in what we call a batch. A batch has 16 KB and represents three different files generated by a synthetic data generator whose schemas match the remote databases' schemas. They correspond to the data used by the three workflows presented in Figure 9. We simulate the workflows' execution and load the results into PostgreSQL, MongoDB, and AllegroGraph. Then, we execute the performance tests.
We run the query processing time performance test using different numbers of batches, varying the data volume. We start with one batch (B001) and then increase to 50, 100, 400, and 700. The number of query results increases linearly with the number of batches. For each query performance test, we purge the provenance execution data from the HKPoly Knowledge Graph repository, keeping the schemas (GCS, LCS, provenance, and mapping schemas), and erase the remote databases' contents. Also, for each batch quantity, we execute Query 1 fifty times, waiting one second between query requests, so we can calculate aggregated statistics for each quantity. According to Hoefler and Belli [15], the academic literature suggests that sample sizes between thirty and forty are sufficient for normal distributions. Although we have not verified that the query processing time is normally distributed, we use a sample size of fifty based on this assumption.
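A sketch of the measurement loop under these choices (fifty timed runs per batch size, one-second pauses, medians reported); run_query is a placeholder for posting Query 1 to the HKPoly service:

    import statistics
    import time

    def run_query() -> None:
        ...  # e.g., an HTTP POST of Query 1 to the HKPoly endpoint

    def measure(rounds: int = 50, pause_s: float = 1.0) -> float:
        samples = []
        for _ in range(rounds):
            start = time.perf_counter()
            run_query()
            samples.append((time.perf_counter() - start) * 1000)  # milliseconds
            time.sleep(pause_s)
        return statistics.median(samples)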
Finally, we instrument the HKPoly service to capture query execution times to understand the impact of building the query to be executed on the PostgreSQL FDW component (which we name the BuildingQuery step), i.e., transforming the input query (e.g., Query 1 into Query 3 in the case of one batch). We collect the BuildingQuery step time and the time to run the query on PostgreSQL FDW, whose sum corresponds to the whole query execution time. We assume that the time for the Client Application to send the query to HKPoly and the time for HKPoly to send the response back are negligible. Figure 11 presents the results of the query response time experiments for each batch quantity; it shows the medians of the overall query response time extracted from the fifty rounds executed for each batch experiment. Although it is straightforward to assume that the overall query response time increases as more data is loaded, the time consumed to build the PostgreSQL FDW SQL query is not the dominant part of the process. In the worst case, our experiments show that the transformation takes less than 30% of the overall query processing time. We expect this behavior to remain the same as more data is loaded into the HKPoly environment.

RELATED WORK
Several works have tackled the database federation problem [34] [2].
The Garlic system [3] is based on repository wrappers, an object-oriented data model, and a query language to provide a uniform view of heterogeneous data types and data sources. The DiscoveryLink system [14] fuses several DB2 [13] and Garlic [3] components to query data stored in heterogeneous and physically distributed data stores by using a virtual database.
While Garlic and DiscoveryLink use SQL, [17] propose a middleware that employs SPARQL, shares architectural principles with Garlic, and enables the integration of CSV files and relational databases with RDF data sources. This integration is achieved by RDF wrappers that are very similar to the repository wrappers of Garlic and DiscoveryLink. The proposal of [18] uses SPARQL and integrates time-dependent data with RDF data by enriching sensor sources and sensor data streams with semantic descriptions.
The Information Integrator tool [1] uses ontologies and F-Logic to create an integrated view of data and mappings between data objects in distributed heterogeneous data sources, e.g., databases and Web services. The solution is divided into four layers: (i) data sources; (ii) ontologies that represent each data source; (iii) a business ontology that provides a conceptualization of business entities; (iv) views (i.e., queries) of the business ontology. Elements of the layers are connected by mappings using F-Logic rules.
Comparison: HKPoly has similarities with those proposals. However, it aims for simplification, expressiveness gain, and connecting heterogeneous data using provenance generated by independent workflows, which are not considered in those works.

The BigDAWG polystore system [9] handles the competing notions of location transparency and semantic completeness when supporting diverse database systems working with different data models. It leverages the concepts of islands and shims to create a middleware that provides a uniform query interface while allowing users to exploit the full capabilities of each database connected to the system. BigDAWG still requires the implementation of so-called degenerate islands to support the full semantic power of a connected database, similar to previous solutions' repository wrapper-based architectures. As a novelty, this system does not require the user to commit to one specific data model, such as relational or RDF, as its base query language mainly comprises two general operators (CAST and SCOPE) that can be used to combine queries following multiple data models.
Although BigDAWG represents a breakthrough and has been evaluated with success in multiple medical applications, many large-scale applications continue to use federated database systems that extend the concept of repository wrappers (as in Garlic and DiscoveryLink) and use SQL as their primary query language. PostgreSQL is an example. It supports the implementation of Foreign Data Wrappers (FDWs) for external data source access. The FDWs convert data stored in the heterogeneous data sources into a relational representation, while the PostgreSQL engine is responsible for processing the multiple joins required to answer a query that integrates the corresponding tables. This system allows the implementation of condition pushdowns, in which condition clauses of SQL queries, described after the WHERE clause, are directly translated into operations in the native query language of each data source, enabling significant gains in query performance [36].
Comparison: BigDAWG requires the user to understand the query languages of the multiple federated databases integrated through the system and how to combine the queried data using its basic operators. PostgreSQL requires the user to know the schema of the FDW foreign tables to specify an SQL query to get remote data. In our view, these facts limit the usability of these solutions. We propose an architecture (and present its implementation) based on a hybrid conceptual model that enables users to write federated queries that are much richer and more comprehensive.
Giacomo et al. [7] present an overview of the concept of "Ontology-Based Data Access" (OBDA), propose a general OBDA framework, and exemplify its instantiation theoretically. An OBDA system implements three components: (i) Ontology: provides a formal high-level representation of a domain, e.g., used by clients to formulate queries; (ii) Data source layer: the remote data stores; (iii) Mappings: explicit representations that map data sources and ontologies, used to translate the client operations (e.g., query answering).
Comparison: The general approach of the OBDA framework is related to our proposal. However, it focuses on relational remote data stores. The authors point out that the encapsulation of NoSQL databases is still a challenge. Besides, they do not present the framework implementation or an experimental evaluation, nor do they handle data linkage across the data stores' data. Our approach is broader, encompassing heterogeneous data models and tackling the data linkage problem.

CONCLUSION
Providing an integrated view of heterogeneous data residing in different types of data stores is a big challenge [24, 32]. The problem handled in this work is how to build data connections between such data and execute queries effectively over them.
The main contribution of our work is HKPoly, a polystore architecture that uses a domain ontology, remote data stores' schema metadata, data mappings, and knowledge graphs to support users with a single abstract global conceptual schema to write their queries, encapsulating data heterogeneity, location, and linkage. As secondary contributions: (i) we analyzed the related components of federated database systems, the SQL/MED specification, PostgreSQL FDW, Hyperknowledge, and polystores; (ii) we presented an implementation of the architecture and evaluated it in a real case within the O&G industry; (iii) we demonstrated how a knowledge graph-centric approach can improve polystore queries by augmenting their semantics and facilitating the construction of queries over heterogeneous remote data stores linked by provenance.
Considering the user perspective, our solution allows users to write queries using a more abstract query language over a single, global data model that represents the organization's data. Users do not have to understand the schemas of each part of the organization's data stored in distinct heterogeneous systems, and they do not have to learn the query language of each of these data stores.
We evaluated the proposal in the O&G scenario aiming to present the feasibility of the solution considering query complexity from the user and database perspectives. The evaluation shows that the proposed architecture allows query writing that is two times less complex than the query one would have to write to use a multidatabase system (e.g., PostgreSQL FDW) directly. Considering the processing time, HKPoly adds an overhead of about 30% to transform the high-level query (Query 1) into the PostgreSQL FDW query (Query 3).
One threat to the validity of our work is that the experiments considered one use case and not a variety of data volumes. Our goal was to demonstrate the feasibility of the proposal, and the experiments ran in a real scenario with multiple workflows and heterogeneous data. Different data volumes must be considered in future experiments, e.g., increasing the number of batches using a geometric progression and using queries that return different result sizes.
Besides, we aim to evolve the HKPoly implementation to process other query operators and to evaluate the proposal considering workflows of other scenarios with distinct structures, including more types of queries. We plan to improve query execution processing time, mainly concerning the support of Requirement R6, and to handle multimedia data (e.g., video, audio, and text) [25].
Another line of future work concerns data update requirements, e.g., handling the reconstruction of a materialization that replaces its last output, changing the data identifiers. In this case, the previous links should be discarded and new data linkages created.
More details about this work are available on arXiv.

Figure 2: HKPoly-client interaction: (a) Client application calls the HKPoly service directly; (b) A user uses a UI which calls the HKPoly service.
Figure 3: HKPoly KG metamodel.


Figure 9: Example of execution of the workflows of Figure 1 to investigate the Netherlands seismic. Workflows are illustrated as actions in the diagram.
Figure 10: HKPoly component diagram concerning the support for R6.

Query 3: SQL to get the data of the Netherlands seismic.


Figure 11: HyQL-to-SQL query building and FDW SQL query execution response time (ms) over batch quantities.

Table 1: Seismic data and the data stores where they reside.

Table 2: GCS of Seismic (line 1) and LCSs of the Seismic data residing in the remote stores (lines 2-4).

Table 3: Seismic GCS and LCS attribute mappings.

Table 4: Remote data store data captured during the Seismic interpretation DL workflow execution.

Table 5: Query component counts as an estimate of the user's cognitive load.