A Declarative Query Language Enabled Autonomous Deep Web Search Engine

Although deep web search research has a long history, no convenient search engine has emerged to support access to deep web databases in ways similar to shallow web engines such as Google or Bing. Surfacing and custom access engines, the two most common techniques available today, can search only a tiny fraction of the deep web and carry concomitant limitations. In this paper, we discuss the implementation of DeepIR, a possible deep web search engine. It is powered by a flexible declarative query language, called DQL, that is similar to SQL. We assemble and leverage existing technologies to realize DeepIR and identify possible pain points for further research. We believe investigating the identified technological hurdles will help develop a smarter deep web search engine. An advantage of DeepIR is that its system components can be replaced with more powerful ones to improve its performance without impacting its query semantics.


INTRODUCTION
More than a decade of research devoted to the deep web is a testimony to the difficulty of developing a web scale deep web search engine. Most deep web databases are structured or semi-structured, and thus have an invisible scheme that users cannot see. From this perspective, querying the deep web is akin to querying a database schema-free [28]. A universal relation [26], on the other hand, assumes a single relation view of the universe that has a scheme. Deep web databases can be viewed as having both of these features. To the users, a deep web database appears as just one relation, like a universal relation, but potentially has several interconnected tables behind the firewall. Additionally, the scheme is abstract and invisible to the outside world. The only window given is the search interfaces the sites support to grant access [21,24].
ACM ISBN 979-8-4007-0243-3/24/04. https://doi.org/10.1145/3605098.3635945
It is intriguing to imagine a search engine that is capable of answering a query of the form in Example 1.1. List all pre-owned Toyota cars for sale in Moscow, Idaho, no older than 2010, at a price less than $5,000. This query, when asked on Google, returns the pages in Fig 1, among many others, most of which do not adhere to the query conditions. One can argue that the principle at play is flexible query answering [4] to approximate the query because, in this case, eCommerce queries are multi-faceted and often a complex function of numerous features, so presenting a possible match may still make a sale by presenting options. However, bait-and-switch results are also possible when a search engine serves cached data, gathered through surfacing [12] or harvesting [22] from the deep web in previous queries, that is no longer current or available. Often, web-crawling-based active harvesting of deep web databases is also used to gather contents and generate an open access mirror [50], notwithstanding significant limitations and caveats, including questionable usability.
In this paper, we report the implementation of a new deep web search engine called DeepIR, based on the theoretical foundations of the DQL structured query language [19]. DQL is an algebraic system. It isolates the functionality of a deep web search engine into several mature basic components: recommender systems [5,38], schema matching [30], wrapper generation [2], and data integration [6]. The model uses a clever integration of these systems to realize the search engine. In the next several sections, we discuss the DeepIR engine with a focus on DQL.
Our vision is to have a decomposable deep web querying system capable of accepting the most capable implementation for each of the DQL modules. For example, the interface identification task could be performed by a recommender system. To enable such a clean modular approach, we developed a high level abstraction and expressed it in terms of a declarative language and foundation. Through these means, various independently developed deep web technologies can synergistically come together at the implementation level. We will demonstrate that such a model and architecture is entirely feasible today by suitable adaptation of recent research in this area. DQL supports a declarative syntax for expressing structured queries over the internet to fashion arbitrary queries that initiate a deep web search. To illustrate the syntax and functionality of DQL, consider Abebi, who is searching online for a car, using a slightly modified version of the query in Ex 1.1: "Find 2010 or newer model red Toyota Camry cars on sale near San Francisco, CA at a price between $2,000 and $8,000". She has a rough idea of what she wants to buy: a fairly new Camry priced between $2,000 and $8,000, preferably with a red trim and alloy wheels. Notice that although she prefers a red Camry with alloy wheels, she did not mention the wheel type as a condition, to avoid being overly restrictive. Needless to say, the text version of this query on Google does not produce any meaningful results; as before, it does not even bring up the sites that actually have results at various levels of "matching" accuracy. In the next few sections, we illustrate how DeepIR is designed and how it functions as a deep web search engine, leveraging the DQL query language to respond to such queries better.
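In DQL's list/where syntax, Abebi's query might be expressed roughly along the following lines; the attribute names here (Exterior_color, Location, and so on) are illustrative assumptions, not necessarily the exact scheme DeepIR uses:

```
list Year, Make, Model, Exterior_color, Price, Site
where Make='toyota' and Model='camry'
  and (Price > 2000 and Price < 8000)
  and Year >= 2010 and Exterior_color='red'
  and Location='San Francisco, CA'
```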

IMPLEMENTATION OF DEEPIR
The architecture of the DeepIR system is shown in Fig 3. It faithfully follows the DQL query model. The system is implemented in Python on an Ubuntu server with MySQL 8 as its back-end database. The application's backbone is built with the Python Flask framework, which handles server-side tasks including routing and operational logic. The HTML5 user interface through which users interact with DeepIR is implemented using the Bootstrap CSS framework. We used Selenium WebDriver as our initial instrument for web scraping to deal with JavaScript elements and dynamic web content. Once a page is loaded and its dynamic content is rendered, we use the Python BeautifulSoup library to parse the HTML. Query mapping to individual deep web sites is achieved with the help of the CUPID schema matching system [30]. The wrapper to extract and collate site responses into a local table for secondary processing was written in Python.

DQL User Interface
The DeepIR interface is a web-form based user interaction system, as shown in Fig 4. Since DQL queries expect only a list of attributes and a set of conditions in the form of Boolean expressions, the interface has two text boxes: one for the LIST clause, and one for the WHERE clause. Once a query is framed and the retrieve button is pushed, DeepIR springs into action, gathers the responses, and displays them as a table over the scheme in the LIST clause plus an additional special system column called Site. This column, as shown in Fig 5, includes a hot link to the site from which the row was collected, listed as part of the data provenance.
The special Site link can be clicked to view the source, as shown in Fig 6. The only caveat is that the attribute names at the site may not exactly match the table scheme shown in the result. This is because a schema matching has been performed, and the table scheme is the user's view of the world. However, because the matches are semantic in nature, the attribute names should be quite similar.

DeepIR Query Processor.
Processing DQL queries in DeepIR proceeds via query translation to HTTP or API calls to remote deep web sites, and to SQL for local secondary processing. Every DQL query Q is parsed to collect the sets of attribute names used in the LIST (A_L) and WHERE (A_W) clauses to form the scheme of the assumed abstract universal relation U = A_L ∪ A_W. This scheme is then forwarded to the recommender system to obtain a list of possible deep web sites where the query has a chance of executing successfully. DeepIR uses a best effort approach [20,49] to compute deep web queries. This means it does its best to approximate the schema matching, wrapping, retrieving and processing under the resource constraints of the underlying systems.
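The parsing step can be sketched as follows. This is a minimal illustration of splitting a DQL query into its LIST and WHERE attribute sets, not DeepIR's actual parser; the function name and the regular expressions are our own assumptions.

```python
import re

def parse_dql(query: str):
    """Split a DQL query into its LIST attributes (A_L), the attribute
    names referenced in the WHERE clause (A_W), and their union, the
    scheme of the assumed universal relation."""
    m = re.match(r"\s*list\s+(.*?)\s+where\s+(.*)", query,
                 re.IGNORECASE | re.DOTALL)
    if m is None:
        raise ValueError("not a DQL query")
    list_part, where_part = m.group(1), m.group(2)
    a_list = [a.strip() for a in list_part.split(",")]
    # attribute names appear to the left of a comparison operator
    a_where = re.findall(r"([A-Za-z_]\w*)\s*(?:<=|>=|=|<|>)", where_part)
    scheme = set(a_list) | set(a_where)
    return a_list, a_where, scheme
```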
Based on the site recommendations and the site scheme, the processor decides how to form an HTTP request and prepares to receive a response. The request follows a schema mapping application using CUPID to determine the appropriate schema correspondence. The HTTP request is made based on the schema mapping, and a correspondence table is created for the site so that a reverse mapping can be applied when the response from the site is received. The reverse mapping helps adhere to, and not disrupt, the user's view of the world.
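The reverse-mapping step can be sketched as below, with the correspondence table shown as a plain dictionary from site attribute names to user attribute names; the attribute names and the function name are illustrative assumptions.

```python
def reverse_map(site_row, correspondence):
    """Rename site attributes back to the user's view using the
    correspondence table built when the request was formed, so the
    result adheres to the scheme of the assumed universal relation.
    Attributes without a correspondence entry pass through unchanged."""
    return {correspondence.get(k, k): v for k, v in site_row.items()}
```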

3.2.1 The Recommender System.
The DeepIR recommender system is implemented as a function over a materialized index of sites. The index is a list R of quadruples of the form ⟨u, n, I, O⟩, where u is a site address or URL, n is the name or label, and I = {⟨a, d⟩} is a set of input attribute name and domain type pairs. Similarly, O = {⟨a, d⟩} is a set of output attribute name and domain type pairs. Essentially, an entry in R captures the fact that a deep web site interface at u accepts a set of values for attributes a of types d, and returns a set of values over the output attributes and types.
Recommender systems base their suggestions on collected data and smart algorithms [31,51]. However, they usually do not archive the recommendations they make. Yet some systems [1,3] leverage some form of archival information. We too leverage previously harvested [33] site information, materialized using crawlers [25], to improve the best-effort performance adopted for DeepIR. The recommendation is then made by a function ρ that selects a site u only if A_W supplies all the required inputs of the site u, i.e., covers the input list I = {⟨a, d⟩}. Oftentimes the required set of inputs is smaller than the listed interface inputs, and the site could still accept a smaller input set. Finally, the recommendation is made if all the attributes in A_L are available in the output list O = {⟨a, d⟩} of u (more on this in Sec 3.3).
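The selection over the materialized index can be sketched as a simple subset-inclusion test. The index entry layout follows the quadruple ⟨u, n, I, O⟩ described above; the function name and example URL are hypothetical.

```python
def recommend(index, supplied_inputs, required_outputs):
    """Select sites from the materialized index whose interface can
    accept the inputs the query supplies and whose outputs cover every
    attribute the query needs. Each index entry is a quadruple
    (url, name, inputs, outputs), where inputs/outputs are sets of
    (attribute, domain-type) pairs; attribute names are assumed to be
    already normalized by the schema matcher."""
    hits = []
    for url, name, inputs, outputs in index:
        in_names = {a for a, _ in inputs}
        out_names = {a for a, _ in outputs}
        # subset inclusion in both directions: the site's inputs are
        # coverable, and the site's outputs cover the LIST attributes
        if in_names <= supplied_inputs and required_outputs <= out_names:
            hits.append(url)
    return hits
```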

Schema Matcher.
As mentioned earlier, most of the time, the user's view of the universal relation she has in mind will not match the site view of the database. Therefore, before a recommendation is made by the recommender system, a schema mapping is applied in an appropriate fashion, i.e., μ((A_L ∪ A_W), (I ∪ O)). In the current edition of DeepIR, we have successfully utilized the time tested CUPID as a single system wide schema matcher. However, investigations are underway to leverage newer matchers, and also to understand whether site specific schema matchers would improve matching performance and efficiency.

Wrapper.
Extraction of useful information from deep web sites has been the most difficult aspect of the DeepIR project. Over the last several decades, a war [11] has been raging among the proponents of open access to information, the ecommerce vendors who primarily want to protect their information [37], and the ethico-social defenders community [39]. While there are legitimate concerns about supporting an open information access paradigm, in most scientific and educational communities information harvesting and openness are prized [32], and web scraping [8,46,47] is the main tool used in wrappers for information harvesting. It is interesting to note that in recent research, scraping has been directly used on deep web data [27], as we do in DeepIR.

E-commerce sites often use JavaScript to prevent scraping, in order to protect proprietary information, retain a competitive edge, and stop bots. The move to exploit JavaScript for information hiding and protection has been gradual [18], and somewhat effective. For all practical purposes, DeepIR is also a bot that springs into action on users' behalf on command. The DeepIR bot also uses scraping for automation, and many sites attempt to block access completely or make it difficult to scrape. Nonetheless, we have been successful in designing a general enough scraper based wrapper using Selenium and BeautifulSoup, which performs flawlessly on non e-Commerce sites, and is effective on most e-Commerce sites that are not too serious about blocking automated scraping.
As alluded to earlier, the mixture of static and dynamic web content returned by the sites poses a significant hurdle for scraping.
To combat this design hurdle, we designed a two stage wrapper based on scraping. We use Selenium in stage one to load and render both static and dynamic pages, because Selenium is more effective on AJAX-based content than BeautifulSoup. In stage two, using BeautifulSoup, we analyze the page content to extract regular HTML structures that semantically mirror table rows.
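The stage-two extraction idea can be sketched with Python's standard-library HTML parser; DeepIR itself uses BeautifulSoup on Selenium-rendered pages, so this class is only an illustration of pulling out repeated structures that mirror table rows, not the project's actual wrapper code.

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect the text of each <tr>/<td> group as one row. A real
    wrapper would also recognize div-based grids that semantically
    mirror table rows; this sketch handles only genuine tables."""
    def __init__(self):
        super().__init__()
        self.rows, self._cells, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []          # start a fresh row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells:
            self.rows.append(self._cells)

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._cells.append(data.strip())
```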

Query Processing
With the help of the recommender, schema matcher and wrapper described above, the DeepIR query processor was implemented successfully. However, the simplistic view of query processing introduced in Sec 3.2 is somewhat deceptive. In reality, many queries cannot be computed using only a set of similar sites; a query may require, say, a set of car sales sites and a set of insurance or bank sites. Consequently, DeepIR classifies DQL queries into simple queries and non-simple queries. Simple queries can be computed using only one single type of site, whereas non-simple queries need multiple types. We elaborate on query processing involving these query types below.

Processing Simple Queries.
Ex 1.1 discussed in Sec 1 is a simple query, for it can most likely be processed on only a set of car sales deep web sites, such as cars.com or autotrader.com. This is because the attributes (A_L ∪ A_W) listed in the query Q are all present in these sites. These sites may lack one or more of these attributes, but none of the attributes need to be found in another type of database to complete the query computation. The algorithm for processing a simple query can be as straightforward as the one in Alg 1.
For the purposes of this algorithm, we define the following terms. Recall that the DeepIR recommender index is a list of entries of the form ⟨u, n, I, O⟩, and that a site u is recommended if A_W is a superset of the required input fields of the front-end interface of u, and the output fields of u are a superset of A_L. This means that DeepIR can gain access to the back-end of the site u, and that u can provide all the information the query Q requires.
For a deep web site u, we define a function τ that, given a site u in the recommender index list R, returns an interface template τ(u) = T_u as follows. We use this template to access the deep web content.
Definition 3.1. A simple interface template T_u is a quintuple of the form ⟨u, C_m, C_o, x, P⟩, where u is the site address, C_m is the set of mandatory input conditions to access the database, C_o is the set of optional input conditions (both have the form ⟨a, v⟩, where a is an attribute name and v is a set of permissible values), x is the URL access suffix syntax needed for the formation of the POST path, and P is the set of object properties.
Using this interface template, we are then able to construct a POST request for cars.com for the query below: "Find 2010 or newer model red Toyota Camry cars on sale near Moscow, ID at a price between $2,000 and $8,000", returning the price, dealer name, and phone number. Assuming that the zip code for Moscow, ID is 83843, we can formulate the POST request as the string below, where %5B%5D is the URL encoding of [].
https://www.cars.com/shopping/results/?stock_type=used&makes%5B%5D=toyota&models%5B%5D=toyota-camry&list_price_max=8000&maximum_distance=10&zip=83843

There are a few caveats, though. First, the mandatory input condition is that a zip code must be supplied. However, the query does not actually list one; instead, we supplied the zip code manually. Technically, then, this query is not computable at cars.com because the query does not satisfy the site access condition. A smarter way of computing this query is possible, but that is not the focus of this paper. A somewhat similar issue will be discussed in the next section. Finally, if year as an attribute is not listed in the interface template's P component, the site is also not suitable for this query and will be ignored.
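Assembling such a request string can be sketched with the standard library; `urlencode` percent-encodes repeated bracketed parameters such as makes[] into %5B%5D automatically. The helper name is our own; the parameter names come from the request string above.

```python
from urllib.parse import urlencode

def build_request(base, params):
    """Form the access URL from a template's base path and the mapped
    query conditions (a GET-style query string, as in the example)."""
    return base + "?" + urlencode(params, doseq=True)

url = build_request(
    "https://www.cars.com/shopping/results/",
    {"stock_type": "used",
     "makes[]": "toyota",           # encoded as makes%5B%5D
     "models[]": "toyota-camry",
     "list_price_max": 8000,
     "maximum_distance": 10,
     "zip": 83843})
```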
The technical undertone of the discussion above is more involved. Given an interface template T_u and a DQL query Q, we identify an admissible template using the definition below.
Definition 3.2. Let T_u = ⟨u, C_m, C_o, x, P⟩ be an interface template and Q be a DQL query. Also let A_L ∪ A_W be the query scheme, and C_W be the query condition. T_u is admissible only if ∀a(a ∈ A_L ∪ A_W → ∃b(b ∈ C_m ∪ C_o ∪ P ∧ μ(a, b) = 1)), ∀c(c ∈ C_m → ∃e(e ∈ A_W ∧ μ(c, e) = 1)), and C_W is satisfiable.
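A minimal sketch of this admissibility test follows, assuming the template is represented as a dict of attribute-name sets and the matcher is a function returning 1 on a semantic correspondence; the representation and names are illustrative, not DeepIR's internal ones, and satisfiability of C_W is not checked here.

```python
def admissible(template, a_l, a_w, matcher):
    """Check the two quantified conditions of Definition 3.2: every
    query attribute must map (via the schema matcher) to some template
    input condition or output property, and every mandatory input
    condition must be coverable by some attribute in the WHERE clause."""
    site_attrs = template["mandatory"] | template["optional"] | template["properties"]
    covered = all(any(matcher(a, b) for b in site_attrs) for a in a_l | a_w)
    mandatory_met = all(any(matcher(c, e) for e in a_w) for c in template["mandatory"])
    return covered and mandatory_met
```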
The definition above means that all the mandatory interface conditions can be met, and all the information the user needs can also be obtained. However, it is possible that some of the conditions in C_W cannot be applied at the site engine; they can nonetheless be tested and filtered by DQL in secondary processing, since all the needed information can be found at this site, if available. The algorithm in Alg 1 formalizes this process. As discussed earlier, a simple DQL query is one that can be processed at a single deep web site at a time, i.e., every template T_u recommended by the recommender system is admissible. However, as pointed out in the previous section, the query Q with the location Moscow, ID cannot be processed at the deep web site cars.com, even though it is executable with the corresponding zip code, i.e., 83843. We therefore need a second site that is able to map a location to a zip code, so we can use the zip code to process the query at cars.com. Such queries require intelligent determination of site capabilities, smart linkage, and enhanced processing techniques. A simpler version of these types of queries, which we call non-simple queries, involves specific attributes that are not available at a site, say the elevation of a dealer's shop, but can be semantically linked to complete a query.
Non-Simple Queries.
To understand the caveats presented by a non-simple query, consider a modified version of the query Q below. In this query, Abebi poses the same query with the additional constraint that the dealer selling the car must be at an elevation no higher than 2,000 ft in the state of Montana, where she is visiting, because she suffers from a breathing disorder and has difficulty at higher elevations. This time, she forms the query Q' as follows.
list Year, Make, Model, Exterior_color, Price, Site
where Make='toyota' and Model='camry' and (Price > 2000 and Price < 8000) and Year > 2010 and Exterior_color='red' and Location='Bozeman, MT' and Elevation < 2000

In this query, most likely, the elevation information for the dealers will not be available in the car sales database. However, the zip code will be. So the question is: is it possible to gather this information from another database to compute this query? More importantly, how do we decide to identify a JOIN query, and what is the stopping condition? In principle, all queries could potentially be answered by fragmenting the scheme and reconstructing a universal relation from the parts; the challenge is to find a semantically meaningful way to do so. While record [16] and entity [42] linkage and entity identification [36] technologies come in handy in this regard, their implementation in query processors is not so simple.
Non-Simple Query Processing Algorithm.
In DeepIR, it is the job of the recommender system to propose a join path for multi-site queries using a minimum attribute distance function as a reconstruction cost model. The sites chosen for exploration follow the same selection criteria (subset inclusion), and the join path must contain the minimum number of sites. Based on this cost model, DeepIR chose the freemaptools.com site to find the elevation given the ZipCode/address, with ZipCode as the JOIN column. Interestingly, FreeMapTools is a non-standard site for finding the elevation of a location, and the DeepIR recommender selected it over more familiar sites such as the USGS based on DeepIR's cost model for JOIN site selection. The algorithm that implements the non-simple query processing strategy is shown in Alg 2.
In Alg 2, we make use of the fact that a non-simple interface template is a sequential list of simple interface templates, each joined or linked by a set of attributes. For the query above, the non-simple interface template sequence is given as follows.
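The linkage step that such a template sequence implies can be sketched as a simple join of the rows retrieved from the two sites on the shared attribute (ZipCode in the running example). The function and field names are illustrative, and the example values are made up.

```python
def join_on(left_rows, right_rows, key):
    """Link rows retrieved from two different deep web sites on a
    shared attribute, mirroring how a non-simple interface template
    chains simple templates. Rows are dicts; unmatched left rows are
    dropped, as their missing attribute cannot be tested later."""
    lookup = {r[key]: r for r in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined
```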

Secondary Processing
Not all query conditions in a DQL query can be applied at the sites to filter unwanted tuples. For example, the elevation-less-than-2,000-feet condition cannot be checked at the FreeMapTools site. It must instead be filtered in the DQL local database after retrieval. Similarly, filters outside C_m and C_o can only be applied locally after retrieval. Therefore, all such conditions are applied once all the information is collected, and then the final results are displayed.
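This residual filtering step can be sketched as follows; the predicate representation (plain Python callables over row dicts) and the function name are our assumptions, since DeepIR applies these filters through its local SQL database.

```python
def apply_residual_filters(rows, predicates):
    """Apply the query conditions that no remote site could evaluate
    (e.g., Elevation < 2000) to the locally collected rows before the
    final result is displayed. A row survives only if it satisfies
    every residual predicate."""
    return [r for r in rows if all(p(r) for p in predicates)]
```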

COMPARISON WITH CONTEMPORARY DEEP WEB SEARCH ENGINES
Among the many deep web search engine efforts, such as SemLAV [14], VIQI [7], SDWS [13], Deep2Q [45], and SEEDEEP [43,44], only DEQUE [41] and MetaQuerier [10,17] probably stand on the same plane as DeepIR, with significant differences in approach and novelty. Both DEQUE and MetaQuerier are about eighteen years older than DeepIR and use the technologies of their time. Neither system is online, and no source code could be found. Under the circumstances, performance analyses in terms of processing time and quality are not options. Besides, such performance comparisons on the same deep web data sources are also moot, because the time to retrieve data does not depend on these systems and would theoretically be almost identical on the same set of queries. The real difference, we believe, is in their query expressiveness, user-friendliness or usability, and maintainability, as we highlight below.

Capability Comparison
DeepIR, MetaQuerier and DEQUE are all designed to accept a user query Q and map it to site specific queries Q_1, Q_2, ..., Q_n to compute a response. The main difference is in how they accomplish this task, and what types of queries they support. The ability to autonomously compute queries at a site is limited by the technology, such as JavaScript, used by the sites, as well as by efforts to prevent scraping. Both DEQUE and MetaQuerier would fail to compute anything today from most of the example sites they used, e.g., cars.com, without manual processing by hand. Both of these systems also need manual integration of the sites, and must pre-process them to construct site schemes over which queries can be submitted. DEQUE also requires extensive curation of the sites and cannot handle schema heterogeneity (MetaQuerier, however, was able to deal with schema mismatch using a schema matcher). DeepIR, in comparison to both, has a recommender system that finds and automatically and intelligently integrates relevant but arbitrary sites without user involvement or any pre-processing. This intelligent on-the-fly site capability discovery allows DeepIR to split a user query Q into multiple sub-queries that are diverse in application domains. For example, both DEQUE and MetaQuerier are able to compute (manually, of course) queries of the form Q discussed in Sec 2 from one singular type of site (all must belong to the same application domain): cars, book stores, or movies. In particular, these systems will not be able to compute queries of the form Q' discussed in Sec 3.3.2, the so-called non-simple queries, which force the system to process a query over multiple sites from disjoint domains and join the results back (e.g., cars and city elevation). The SEEDEEP system, however, allows queries across multiple domains, but does so fully manually.

Conceptual Novelty of DeepIR
The most striking difference from DEQUE and MetaQuerier is that DeepIR has a declarative language similar to SQL [19], a mathematical foundation, and a query processing model, as shown in Fig 3, requiring no user involvement whatsoever, no manual pre-processing of the sites, and no curation of any site information before a query can be processed. DeepIR also has built-in technologies to deal with JavaScript when scraping information, as shown in Figs 4 and 5, which both DEQUE and MetaQuerier would fail to handle. The separation of the recommender system, the schema matchers, and the wrappers in DeepIR also allows a system flexibility that the other systems cannot afford. DeepIR is able to improve its processing power and accuracy by replacing any of its functional components in Fig 3 as replaceable modules without changing the essence of the engine, which both DEQUE and MetaQuerier are unable to do.

CONCLUSION AND FUTURE RESEARCH
DeepIR is an attempt to declaratively search deep web contents with a no code approach, using a structured query language within a highly unstructured data space. Although DQL requires the LIST and WHERE clauses, at the interface level it basically appears as a two slot search box similar to Google or other popular search engines, only functionally better. Although the current edition is schema agnostic and somewhat forgiving of spelling mistakes (taken care of by the schema matchers), it is not known whether queries using keywords such as "Student" and "Studnt" will map onto the underlying databases identically. This is because the mapping is from the query variables to the deep web schemes, and not among the keywords within the query. Search engines such as Google do a better job of standardizing the query keywords first before making the search. We plan to improve the engine in this direction in the future.
It is likely that the mapping quality and wrapping performance will vary from site to site and with the technology the content creators employ. The problem is similar to adjusting these systems for frequent changes in the scheme [40] or in content [23]. There are two possible improvements we are contemplating for DeepIR. The first is to statically identify site specific matchers and wrappers from a library using offline analysis, and to include their usage in the query processor as a suggestion by the recommender system to improve processing performance. An alternative and more ambitious option is to do so dynamically at run time. Both of these options can also be considered for query optimization. However, more research is necessary to understand the feasibility and identify the possible gains.
A more effective and dynamic recommender system could also help improve retrieval quality. In the current edition, we have used a materialized index as the recommender, so its coverage is limited by the richness of the materialized index. It should be evident that the query engine DeepIR was the focus of this paper, and not components such as the recommender, matcher or wrapper. The FreeMapTools database used in Sec 3.3.2 was manually identified and included in the index. A more desirable approach would find the deep web sites at run time and determine their capabilities and appropriateness to the query at hand dynamically [29,35,48]. An investigation in this direction is also in progress.
DeepIR is part of a computational biology pipeline design system; it is an effort to help biologists find relevant online information from the deep web. Although FAIRness has been advocated for more than a decade, compliance is extremely low, making it difficult to find data of interest [34]. While serious efforts are being made to make biological data FAIR [9], our goal is simply to find the online data [15] regardless of their FAIRness. Our plan is to leverage the capabilities DeepIR offers to achieve this goal. However, that also means our recommender system should support dynamic site recommendation, and so must our matchers and wrappers. These are some of the research directions we seek to continue as future projects.

Figure 1: Result of Google search over the query in Ex 1.1

Figure 2: DQL abstract view of deep relational web.

Figure 6: Linked deep web pages as provenance.