Databases are now part of our everyday lives, even if their presence is not always explicit. Organized collections of data provide information that is essential to decision making, from medicine to business. Web mining is important for improving human-computer interaction in general, and in particular for exploiting the information available on the Internet. This area benefits from knowledge, concepts, and techniques from artificial intelligence, statistics, linguistics, and graph theory, among other fields.
Every day we encounter many different kinds of data arising from different sources. Combining data from several sources, stored using different technologies, provides a unified view of the data and empowers data processing and analysis.
Making data meaningful and valuable in a particular context is an imperative task. The logical structure of data is essential for its correct and efficient storage, organization, and processing. Current technological developments allow the collection of huge amounts of data, which can take decision-making processes to new levels. However, this is only possible if data can be transformed into knowledge. Various kinds of data mining algorithms are used to extract patterns from data, and the development of data preparation techniques remains both a challenging and a critical task.
The amount of private and personal data held in databases has grown radically with the ongoing digitalization of our lives. Moreover, access to databases is widespread and is made easier by the interconnection of information systems. Database systems must therefore be designed in a way that limits the disclosure of private information. Nowadays, business intelligence applications are widely used in organizations, and their strategic importance is clearly recognized. Data mining tools continue to spread through the business intelligence field, as does the acknowledgement of their value to companies. Cloud computing, in turn, relies on shared computing resources rather than local servers or personal devices to handle applications; it enables collaborative work and gives cheaper, continuous access to computational resources.
Automatic collection and retention of end-user actions have become the norm. Typical mobile crowd-sensing applications collect and process sensor data on devices and apply local analytic algorithms to produce consumable data for users. Web crowd sensing can also contribute detailed data in settings where proprietary data are extremely costly.
DARM: a privacy-preserving approach for distributed association rules mining on horizontally-partitioned data
Extracting association rules helps data owners unveil hidden patterns in their data in order to analyze and predict the behavior of their clients. However, mining association rules in a distributed environment is not a trivial task due ...
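The DARM protocol itself is truncated above, but the non-private baseline it must protect is easy to sketch: under horizontal partitioning, each site counts the support of candidate itemsets locally, and a coordinator sums the per-site counts. A minimal Python sketch with hypothetical data, restricted to pair-sized candidates for brevity; a privacy-preserving variant would compute the sum securely (for example, via secure multi-party summation) rather than in the clear.

    from itertools import combinations
    from collections import Counter

    # Hypothetical horizontally partitioned transactions: each "site"
    # holds its own subset over the same item universe.
    site_a = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}]
    site_b = [{"bread", "milk", "butter"}, {"bread", "milk"}]

    def local_counts(transactions, size):
        """Count candidate itemsets of a given size in one site's data."""
        counts = Counter()
        for t in transactions:
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
        return counts

    # A coordinator sums the per-site counts; in a privacy-preserving
    # protocol this sum would be computed without revealing any site's
    # individual contribution.
    global_counts = local_counts(site_a, 2) + local_counts(site_b, 2)
    total = len(site_a) + len(site_b)
    min_support = 0.4                  # fraction of all transactions

    frequent = {i: c / total for i, c in global_counts.items()
                if c / total >= min_support}
    print(frequent)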
A method for predicting citations to the scientific publications of individual researchers
Any researcher's publications at any time can be ordered from the most cited to the least cited, yielding a citation curve. We describe a novel method for predicting researchers' future citation curves. The method depends on treating the ...
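The citation curve itself is simple to compute. The following sketch, over hypothetical citation counts, builds one and derives the h-index as a familiar summary of such curves; the paper's prediction method is not reproduced here.

    # Hypothetical per-publication citation counts for one researcher.
    citations = [12, 0, 57, 3, 21, 8, 0, 45]

    # The citation curve: publications ordered from most to least cited.
    curve = sorted(citations, reverse=True)
    print(curve)        # [57, 45, 21, 12, 8, 3, 0, 0]

    # A common summary of such a curve: the h-index is the largest h
    # such that h publications each have at least h citations. Because
    # the curve is non-increasing, counting satisfying ranks suffices.
    h_index = sum(1 for rank, c in enumerate(curve, start=1) if c >= rank)
    print(h_index)      # 5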
Visual data integration based on description logic reasoning
Despite many innovative systems supporting the data integration process, designers advocate more abstract metaphors to master the inherent complexity of this activity. In fact, the visual notations provided in many modern data integration systems might ...
Semantic mediator querying
We present the whole querying process of our ontology-based data integration proposal, which we call Semantic Mediator. The global schema (a TBox) is composed of the source schemas (also TBoxes) and a taxonomy, which links the sources to each other. The ...
Discovering domain-specific public SPARQL endpoints: a life-sciences use-case
A significant portion of the LOD cloud consists of Life Sciences data sets, which together contain billions of clinical facts that interlink to form a "Web of Clinical Data". However, tools for new publishers to find relevant datasets that could ...
Mining named entities from search engine query logs
We present a seed expansion based approach to classify named entities in web search queries. Previous approaches to this classification problem relied on contextual clues in the form of keywords surrounding a named entity in the query. Here we propose ...
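A minimal sketch of the seed-expansion idea, with hypothetical queries and seed entities: harvest the query templates in which seeds occur, then treat other strings that fill those templates as candidate entities of the same class. The paper's actual candidate scoring is not reproduced.

    from collections import Counter

    # Hypothetical search-query log and seed entities of the target class.
    queries = [
        "casablanca movie review", "inception movie review",
        "casablanca cast", "inception cast",
        "cheap flights paris", "the godfather cast",
    ]
    seeds = {"casablanca", "inception"}

    # Step 1: harvest the query templates in which the seeds occur.
    contexts = Counter()
    for q in queries:
        for seed in seeds:
            if seed in q:
                contexts[q.replace(seed, "#")] += 1

    # Step 2: any string that fills a harvested template becomes a candidate.
    candidates = Counter()
    for q in queries:
        for ctx in contexts:
            prefix, _, suffix = ctx.partition("#")
            if (q.startswith(prefix) and q.endswith(suffix)
                    and len(q) > len(prefix) + len(suffix)):
                filler = q[len(prefix):len(q) - len(suffix)]
                if filler not in seeds:
                    candidates[filler] += contexts[ctx]

    print(candidates.most_common())   # [('the godfather', 2)]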
Named entities as privileged information for hierarchical text clustering
Text clustering is a text mining task which is often used to aid the organization, knowledge extraction, and exploratory search of text collections. Nowadays, automatic text clustering is becoming essential as the volume and variety of digital text ...
Multilevel refinement based on neighborhood similarity
The multilevel graph partitioning strategy aims to reduce the computational cost of the partitioning algorithm by applying it on a coarsened version of the original graph. This strategy is very useful when large-scale networks are analyzed. To improve ...
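The coarsening step that the multilevel strategy relies on is easy to illustrate. A minimal sketch on a toy graph: greedily match each vertex with an unmatched neighbor and contract each pair into a supernode, so that partitioning proceeds on the smaller graph. The paper's neighborhood-similarity refinement itself is not reproduced.

    # Toy adjacency-list graph; one coarsening step merges matched pairs.
    graph = {
        0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
        3: {2, 4}, 4: {3, 5}, 5: {4},
    }

    def coarsen(graph):
        match = {}                     # vertex -> id of its supernode
        for v in graph:
            if v in match:
                continue
            partner = next((u for u in graph[v] if u not in match), None)
            match[v] = v
            if partner is not None:
                match[partner] = v     # contract v with one unmatched neighbor

        coarse = {}                    # supernode -> neighboring supernodes
        for v, nbrs in graph.items():
            sv = match[v]
            coarse.setdefault(sv, set())
            for u in nbrs:
                if match[u] != sv:
                    coarse[sv].add(match[u])
        return match, coarse

    match, coarse = coarsen(graph)
    print(match)    # e.g. {0: 0, 1: 0, 2: 2, 3: 2, 4: 4, 5: 4}
    print(coarse)   # the smaller graph on which partitioning continues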
The state of data
We are currently experiencing an extraordinary acceleration in the growth rate of digital data. One of the reasons for this increase is the digitization of virtually all communications and records. This exponential growth is evidenced by the fact that ...
A scheme for privacy-preserving ontology mapping
Due to the rapid proliferation of ontology-based information systems and networks, there are strong demands for ontology mapping in a privacy-aware way. To address this problem, we propose Privacy-Preserving Quick Ontology Mapping (P2QOM), a ...
Specifying complex correspondences between relational schemas and RDF models for generating customized R2RML mappings
The W3C RDB2RDF Working Group proposed a standard language to map relational data into RDF triples, called R2RML. However, creating R2RML mappings may sometimes be a difficult task because it involves the creation of views (within the mappings or not) ...
Ontology-based multi-domain metadata for research data management using triple stores
Most current research data management solutions rely on a fixed set of descriptors (e.g. Dublin Core Terms) for the description of the resources that they manage. These are easy to understand and use, but their semantics are limited to general concepts, ...
Automatic creation of stock market lexicons for sentiment analysis using StockTwits data
Sentiment analysis has been increasingly applied to the stock market domain. In particular, investor sentiment indicators can be used to model and predict stock market variables. In this context, the quality of the sentiment analysis is highly dependent ...
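One simple way to derive such a lexicon, sketched below with hypothetical messages: StockTwits lets authors self-label messages as bullish or bearish, so each word can be scored by the smoothed log-odds of appearing under one label versus the other. This illustrates the general idea of a corpus-derived lexicon, not the paper's exact procedure.

    import math
    from collections import Counter

    # Hypothetical StockTwits-style messages self-labelled by their authors.
    messages = [
        ("bullish", "breakout incoming buy the dip"),
        ("bullish", "strong earnings buy"),
        ("bearish", "sell now downtrend confirmed"),
        ("bearish", "weak earnings sell"),
    ]

    bull, bear = Counter(), Counter()
    for label, text in messages:
        (bull if label == "bullish" else bear).update(text.split())

    # Smoothed log-odds of bullish vs. bearish usage per word; positive
    # scores suggest bullish words, negative scores bearish words.
    vocab = set(bull) | set(bear)
    n_bull, n_bear = sum(bull.values()), sum(bear.values())
    lexicon = {
        w: math.log((bull[w] + 1) / (n_bull + len(vocab)))
           - math.log((bear[w] + 1) / (n_bear + len(vocab)))
        for w in vocab
    }
    for w, s in sorted(lexicon.items(), key=lambda x: -x[1]):
        print(f"{w:12s} {s:+.2f}")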
Dealing with incompleteness and inconsistency in P2P deductive databases
This paper proposes a logic framework for modeling the interaction among incomplete and inconsistent deductive databases in a P2P environment. Each peer joining a P2P system provides or imports data from its neighbors by using a set of mapping rules, ...
Personalized classifiers: evolving a classifier from a large reference knowledge graph
Identifying the right choice of categories for organizing and representing a large digital library of documents is a challenging task. A completely automated approach to category creation from the underlying collection could be prone to noise. On the ...
MV-IDX: indexing in multi-version databases
An index in a Multi-Version DBMS (MV-DBMS) has to reflect different tuple versions of a single data item. Existing approaches follow the paradigm of logically separating the tuple version data from the data item, e.g. an index is only allowed to return ...
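One plausible arrangement of this idea, not the MV-IDX design itself: the index maps a key to a data item rather than to one physical version, and the item's version chain is filtered by the reader's timestamp. A minimal sketch with hypothetical keys and timestamps.

    from bisect import bisect_right

    class Item:
        """A data item holding its own version chain."""
        def __init__(self):
            self.timestamps = []   # begin timestamps, appended in ascending order
            self.values = []

        def insert(self, ts, value):
            self.timestamps.append(ts)
            self.values.append(value)

        def visible(self, read_ts):
            """Newest version whose begin timestamp is at or before read_ts."""
            pos = bisect_right(self.timestamps, read_ts)
            return self.values[pos - 1] if pos else None

    index = {}                     # key -> Item: version-oblivious index entries

    def put(key, ts, value):
        index.setdefault(key, Item()).insert(ts, value)

    put("acct-7", ts=1, value=100)
    put("acct-7", ts=5, value=80)

    print(index["acct-7"].visible(read_ts=3))   # 100 (older snapshot)
    print(index["acct-7"].visible(read_ts=9))   # 80  (current version)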
A study of machine learning methods for detecting user interest during web sessions
The ability to automatically detect user interest in real time during a web session is very appealing and can be very useful for a number of web intelligence applications. Low-level interaction events associated with user interest manifestations ...
Improving MMDB distributed transactional concurrency
Main Memory Database Systems (MMDBs) have been studied since the 1980s [3,4], when memory was quite costly ($1500 per MByte in 1984). We can now buy memory for about $10 per GByte. An advantage of MMDBs is that serial execution of a non-distributed ...
Condensed representation of frequent itemsets
One of the major problems in pattern mining is still the problem of pattern explosion, i.e., the large amounts of patterns produced by the mining algorithms when analyzing a database with a predefined minimum support threshold. The approach we take to ...
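A classic condensed representation, which may or may not be the one this paper adopts, is the set of closed itemsets: frequent itemsets with no proper superset of equal support. A brute-force sketch on a toy dataset shows how few closed sets can stand in, losslessly, for the full collection.

    from itertools import combinations

    transactions = [
        {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"},
    ]
    min_support = 2
    items = sorted(set().union(*transactions))

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Enumerate all frequent itemsets (brute force; fine for a toy set).
    frequent = {
        frozenset(c): support(frozenset(c))
        for n in range(1, len(items) + 1)
        for c in combinations(items, n)
        if support(frozenset(c)) >= min_support
    }

    # A closed itemset has no proper superset with the same support, so
    # the closed sets alone determine every frequent set's support.
    closed = {
        s: sup for s, sup in frequent.items()
        if not any(s < t and sup == tsup for t, tsup in frequent.items())
    }
    print(f"{len(frequent)} frequent itemsets, {len(closed)} closed")
    for s, sup in closed.items():
        print(set(s), sup)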
Survey on open source platform-as-a-service solutions for education
While cloud computing is becoming popular in industry and companies take advantage of Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) solutions, education is sometimes one step behind. SaaS ...
RSQL - a query language for dynamic data types
Database Management Systems (DBMS) are used by software applications to store, manipulate, and retrieve large sets of data. However, the requirements of current software systems pose various challenges to established DBMS. First, most software systems ...
CloudETL: scalable dimensional ETL for hive
Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud ...
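Independent of the MapReduce machinery, the dimensional core of such an ETL flow is surrogate-key handling: look up or create a key for each dimension value, then emit fact rows that reference those keys. A minimal single-process sketch with hypothetical rows; CloudETL's distributed implementation is not reproduced here.

    dim_product = {}           # natural key -> surrogate key

    def surrogate_key(natural_key):
        """Return the existing surrogate key or assign the next one."""
        return dim_product.setdefault(natural_key, len(dim_product) + 1)

    source_rows = [
        {"product": "ale", "qty": 3},
        {"product": "stout", "qty": 1},
        {"product": "ale", "qty": 2},
    ]
    fact_rows = [{"product_sk": surrogate_key(r["product"]), "qty": r["qty"]}
                 for r in source_rows]
    print(dim_product)         # {'ale': 1, 'stout': 2}
    print(fact_rows)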
A methodology for social BI
Social BI (SBI) is the emerging discipline that aims at combining corporate data with textual user-generated content (UGC) to let decision-makers analyze their business based on the trends perceived from the environment. Despite the increasing diffusion ...
Optimizing query execution for variable-aligned length compression of bitmap indices
Indexing is a fundamental mechanism for efficient data access. Recently, we proposed the Variable-Aligned Length (VAL) bitmap index encoding framework, which generalizes the commonly used word-aligned compression techniques. VAL presented a variable-...
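VAL generalizes word-aligned schemes such as WAH, whose core is easy to sketch: cut the bitmap into fixed-width groups, collapse maximal runs of all-zero or all-one groups into fill tokens, and keep mixed groups as literals. A toy Python sketch using 7-bit groups and tuples in place of packed words; VAL's variable alignment is not reproduced.

    GROUP = 7   # payload bits per word in classic 8-bit-word WAH

    def compress(bits):
        groups = [bits[i:i + GROUP] for i in range(0, len(bits), GROUP)]
        out = []
        for g in groups:
            if set(g) <= {"0"} or set(g) <= {"1"}:
                bit = g[0]
                # Extend the previous fill token if it has the same bit.
                if out and out[-1][0] == "fill" and out[-1][1] == bit:
                    out[-1] = ("fill", bit, out[-1][2] + 1)
                    continue
                out.append(("fill", bit, 1))
            else:
                out.append(("literal", g))
        return out

    bitmap = "0000000" * 3 + "0101100" + "1111111" * 2
    print(compress(bitmap))
    # [('fill', '0', 3), ('literal', '0101100'), ('fill', '1', 2)]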
A fragmented data-declustering strategy for high skew tolerance and efficient failure recovery
Data declustering is a common technique to improve data I/O performance by retrieving data in parallel from multiple storage nodes. Data-declustering methods with replicated data also increase system availability, reliability and skew tolerance. Current ...
Optimizing database index performance for solid state drives
As Solid State Disk (SSD) drive technology matures and costs continue to decrease, it is becoming a viable replacement for traditional, rotational hard disk drives. SSDs are based on NAND flash technology, which results in different wear and performance ...
Algebraic optimization of grouped preference queries
SQL queries containing GROUP BY are common in data warehouse environments and OLAP. From this, the concept of grouped Skyline queries emerged, wherein a Skyline of each group of tuples is requested. Grouped preference queries generalize this kind of ...
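The per-group Skyline is straightforward to state: within each group, keep the tuples that no other tuple in the group Pareto-dominates. A minimal sketch with hypothetical hotel data, minimizing both attributes; the paper's algebraic optimizations are not reproduced.

    from collections import defaultdict

    # Hypothetical hotel data: (group, name, price, distance).
    rows = [
        ("paris",  "H1", 120, 2.0), ("paris",  "H2", 90, 3.5),
        ("paris",  "H3", 150, 4.0), ("berlin", "H4", 80, 1.0),
        ("berlin", "H5", 70, 2.5),
    ]

    def dominates(a, b):
        """a dominates b: no worse on every attribute, strictly better somewhere."""
        return all(x <= y for x, y in zip(a, b)) and a != b

    groups = defaultdict(list)
    for city, name, price, dist in rows:
        groups[city].append((name, (price, dist)))

    for city, hotels in groups.items():
        skyline = [name for name, attrs in hotels
                   if not any(dominates(other, attrs) for _, other in hotels)]
        print(city, skyline)
    # paris ['H1', 'H2']
    # berlin ['H4', 'H5']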
Portable decision support system for heart failure detection and medical diagnosis
Heart disorders are among the most problematic issues of human health. There are currently many efforts to reduce the time to first assistance based on electronic systems that continuously record the electrical heart activity (ECG) for further ...
An experimental evaluation of similarity measures for uncertain time series
Uncertain time series analysis is important in applications such as wireless sensor networks and location-based services. This has been the subject of some recent studies, and a number of solution techniques have been proposed for similarity search ...
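One measure that appears in this literature, shown here as an illustration rather than as the paper's choice: model each timestamp as an independent Gaussian and compare series by expected squared Euclidean distance, which has the closed form (mu_x - mu_y)^2 + var_x + var_y per point.

    def expected_sq_distance(series_x, series_y):
        """Expected squared Euclidean distance between two uncertain series,
        each given as a list of (mean, variance) pairs per timestamp."""
        return sum(
            (mx - my) ** 2 + vx + vy
            for (mx, vx), (my, vy) in zip(series_x, series_y)
        )

    # Hypothetical uncertain series.
    x = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.1)]
    y = [(1.5, 0.1), (2.5, 0.3), (2.0, 0.2)]
    print(expected_sq_distance(x, y))   # 2.5 = 0.45 + 0.75 + 1.3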
Integration of linguistic and web information to improve biomedical terminology extraction
Comprehensive terminology is essential for a community to describe, exchange, and retrieve data. In many domains, the explosion of text data has reached a level at which automatic terminology extraction and enrichment are mandatory. ...