Leonardo's laptop: human needs and the new computing technologies
Ben Shneiderman
Pages: 1-1
DOI: 10.1145/1099554.1099555
The old computing was about what computers could do; the new computing is about what people can do. To accelerate the shift from the old to the new computing, designers need to:
- reduce computer user frustration. Recent studies show that 46% of time is lost to crashes, confusing instructions, navigation problems, etc. Public pressure for change could promote design improvements and increase reliability, thereby dramatically enhancing user experiences.
- promote universal usability. Interfaces must be tailorable to a wide range of hardware, software, networks, and users. When broad services such as voting, healthcare, and education are envisioned, the challenge to designers is substantial.
- envision a future in which human needs more directly shape technology evolution. Four circles of human relationships and four human activities map out the human needs for mobility, ubiquity, creativity, and community. The World Wide Med and million-person communities will be accessible through desktop, palmtop and fingertip devices to support e-learning, e-business, e-healthcare, and e-government.
Leonardo da Vinci could serve as an inspirational muse for the new computing. His example could push designers to improve quality through scientific study and more elegant visual design. Leonardo's example can guide us to the new computing, which emphasizes empowerment, creativity, and collaboration. Information visualization and personal photo interfaces will be shown: PhotoMesa (www.cs.umd.edu/hcil/photomesa) and PhotoFinder (www.cs.umd.edu/hcil/photolib). For more: http://mitpress.mit.edu/leonardoslaptop and http://www.cs.umd.edu/hcil/newcomputing.

Emerging data management systems: close-up and personal
Yannis Ioannidis
Pages: 2-2
DOI: 10.1145/1099554.1099556
Conventional data management occurs primarily in centralized servers or in well-interconnected distributed systems. These are removed from their end users, who interact with the systems mostly through static devices to obtain generic services around mainstream applications: banking, retail, business management, etc. Several recent advances in technologies, however, give rise to a new breed of applications, which altogether change the user experience and sense of data management. Very soon, several such systems will be in our pockets, and many more in our homes, our kitchen appliances, our clothes, etc. How would these systems operate? Many system and user aspects must be approached in novel ways, while several new issues come up and need to be addressed for the first time. Highlights include personalization, privacy, information trading, annotation, new interaction devices and corresponding interfaces, visualization, etc. In this talk, we take a close look at this emerging world of data management and give a very personal guided tour of it, offering some thoughts on how the new technical challenges might be approached.

From bits and bytes to information and knowledge
Thomas Hofmann
Pages: 3-3
DOI: 10.1145/1099554.1099557
Unstructured data is a valuable source of information and implicit knowledge. Yet, the bits and bytes of, e.g., text, image, or click-stream data need to be interpreted in order to transform them into business intelligence and actionable information. Clearly, this process needs to be automated to the largest possible extent in order to scale to the typical volumes of data. One way to accomplish this is through the use of machine learning and statistical modelling techniques. This talk will provide an overview of recent progress and new trends in machine learning and discuss their relevance for developing intelligent tools for search, information filtering, categorization, and knowledge extraction.

SESSION: Paper session IR-1 (information retrieval): XML retrieval

Structured queries in XML retrieval
Jaap Kamps, Maarten Marx, Maarten de Rijke, Börkur Sigurbjörnsson
Pages: 4-11
DOI: 10.1145/1099554.1099559
Document-centric XML is a mixture of text and structure. With the increased availability of document-centric XML content comes a need for query facilities in which both structural constraints and constraints on the content of the documents can be expressed. How does the expressiveness of languages for querying XML documents help users to express their information needs? We address this question from both an experimental and a theoretical point of view. Our experimental analysis compares a structure-ignorant with a structure-aware retrieval approach using the test-suite of the 2004 edition of the INEX XML retrieval evaluation initiative. Theoretically, we create mathematical models of users' knowledge of a set of documents and define query languages which exactly fit these models. One of these languages corresponds to an XML version of fielded search, the other to the INEX query language. Our main findings are: First, while structure is used in varying degrees of complexity, over half of the queries can be expressed in a fielded-search like format which does not use the hierarchical structure of the documents. Second, structure is used as a search hint, and not a strict requirement, when judged against the underlying information need. Third, the use of structure in queries functions as a precision enhancing device.

Score region algebra: building a transparent XML-R database
Vojkan Mihajlović, Henk Ernst Blok, Djoerd Hiemstra, Peter M. G. Apers
Pages: 12-19
DOI: 10.1145/1099554.1099560
A unified database framework that will enable better comprehension of ranked XML retrieval is still a challenge in the XML database field. We propose a logical algebra, named score region algebra, that enables transparent specification of information retrieval (IR) models for XML databases. The transparency is achieved by the ability to instantiate various retrieval models, using abstract score functions within algebra operators, while the logical query plan and operator definitions remain unchanged. Our algebra operators model three important aspects of XML retrieval: element relevance score computation, element score propagation, and element score combination. To illustrate the usefulness of our algebra we instantiate four different, well-known IR scoring models and combine them with different score propagation and combination functions. We implemented the algebra operators in a prototype system on top of a low-level database kernel. The evaluation of the system is performed on a collection of IEEE articles in XML format provided by INEX. We argue that state-of-the-art XML IR models can be transparently implemented using our score region algebra framework on top of any low-level physical database engine or existing RDBMS, allowing a more systematic investigation of retrieval model behavior.

Generalized contextualization method for XML information retrieval
Paavo Arvola, Marko Junkkari, Jaana Kekäläinen
Pages: 20-27
DOI: 10.1145/1099554.1099561
A general re-weighting method, called contextualization, for more efficient element ranking in XML retrieval is introduced. Re-weighting is based on the idea of using the ancestors of an element as a context: if the element appears in a good context -- good interpreted as probability of relevance -- its weight is increased in relevance scoring; if the element appears in a bad context, its weight is decreased. The formal presentation of contextualization is given in a general XML representation and manipulation frame, which is based on the utilization of structural indices. This provides a general approach independent of weighting schemas or query languages. Contextualization is evaluated with the INEX test collection. We tested four runs: no contextualization, and parent, root and tower contextualizations. The contextualization runs were significantly better than no contextualization. The root contextualization was the best among the re-weighted runs.
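
The re-weighting idea can be made concrete with a minimal, hypothetical sketch (illustrative only, not the authors' implementation or parameter choices): an element's own score is blended with the scores of its ancestors, so elements appearing in well-scoring contexts are promoted.

```python
# Illustrative ancestor-based contextualization sketch (hypothetical weights,
# not the paper's code): blend each element's score with its ancestors' scores.
def contextualize(scores, parent, weights=(0.6, 0.4)):
    """scores: element_id -> original score;
    parent: element_id -> parent element_id (None for a root);
    weights: (own weight, context weight)."""
    own_w, ctx_w = weights
    result = {}
    for elem, score in scores.items():
        # "Tower"-style context: average the scores of all ancestors.
        ancestors, node = [], parent.get(elem)
        while node is not None:
            ancestors.append(scores.get(node, 0.0))
            node = parent.get(node)
        ctx = sum(ancestors) / len(ancestors) if ancestors else 0.0
        result[elem] = own_w * score + ctx_w * ctx if ancestors else score
    return result

# A section inside a highly scored article is promoted over an equally scored
# section whose enclosing article scores poorly.
scores = {"article1": 0.8, "sec1.1": 0.5, "article2": 0.1, "sec2.1": 0.5}
parent = {"article1": None, "sec1.1": "article1", "article2": None, "sec2.1": "article2"}
print(contextualize(scores, parent))
```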

SESSION: Paper session DB-1 (databases): networks and peer-to-peer

Decentralized coordination of transactional processes in peer-to-peer environments
Klaus Haller, Heiko Schuldt, Can Türker
Pages: 28-35
DOI: 10.1145/1099554.1099563
Business processes executing in peer-to-peer environments usually invoke Web services on different, independent peers. Although peer-to-peer environments inherently lack global control, some business processes nevertheless require global transactional guarantees, i.e., atomicity and isolation applied at the level of processes. This paper introduces a new decentralized serialization graph testing protocol to ensure concurrency control and recovery in peer-to-peer environments. The uniqueness of the proposed protocol is that it ensures global correctness without relying on a global serialization graph. Essentially, each transactional process is equipped with partial knowledge that allows the transactional processes to coordinate. Globally correct execution is achieved by communication among dependent transactional processes and the peers they have accessed. In case of failures, a combination of partial backward and forward recovery is applied. Experimental results exhibit a significant performance gain over traditional distributed locking-based protocols with respect to the execution of transactions encompassing Web service requests.

On the complexity of computing peer agreements for consistent query answering in peer-to-peer data integration systems
Gianluigi Greco, Francesco Scarcello
Pages: 36-43
DOI: 10.1145/1099554.1099564
Peer-to-Peer (P2P) data integration systems have recently attracted significant attention for their ability to manage and share data dispersed over different peer sources. While integrating data for answering user queries, it often happens that inconsistencies arise, because some integrity constraints specified on peers' global schemas may be violated. In these cases, we may give semantics to the inconsistent system by suitably "repairing" the retrieved data, as typically done in the context of traditional data integration systems. However, some specific features of P2P systems, such as peer autonomy and peer preferences (e.g., different source trusting), should be properly addressed to make the whole approach effective. In this paper, we face these issues that were only marginally considered in the literature. We first present a formal framework for reasoning about autonomous peers that exploit individual preference criteria in repairing the data. The idea is that queries should be answered over the best possible database repairs with respect to the preferences of all peers, i.e., the states on which they are able to find an agreement. Then, we investigate the computational complexity of dealing with peer agreements and of answering queries in P2P data integration systems. It turns out that considering peer preferences makes these problems only mildly harder than in traditional data integration systems.

Internet scale string attribute publish/subscribe data networks
Ioannis Aekaterinidis, Peter Triantafillou
Pages: 44-51
DOI: 10.1145/1099554.1099565
With this work we aim to make a three-fold contribution. We first address the issue of efficiently supporting queries over string attributes involving prefix, suffix, containment, and equality operators in large-scale data networks. Our first design decision is to employ distributed hash tables (DHTs) for the data network's topology, harnessing their desirable properties. Our next design decision is to derive DHT-independent solutions, treating the DHT as a black box. Second, we exploit this infrastructure to develop efficient content-based publish/subscribe systems. The main contributions here are algorithms for the efficient processing of queries (subscriptions) and events (publications). Specifically, we show that our subscription processing algorithms require O(logN) messages for an N-node network, and our event processing algorithms require O(l x logN) messages (with l being the average string length). Third, we develop algorithms for optimizing the processing of multi-dimensional events, involving several string attributes. Further to our analysis, we provide simulation-based experiments showing promising performance results in terms of number of messages, required bandwidth, load balancing, and response times.
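
One way to see how prefix operators can ride on a DHT treated as a black box is the following simplified, hypothetical sketch (assumed put/get semantics; a dict stands in for the overlay, whose routing would cost O(logN) messages per operation): a prefix subscription is stored under the hash of that prefix, and a published string of length l is looked up under each of its l prefixes, which is consistent with the O(l x logN) event cost stated above.

```python
import hashlib

# Simplified, hypothetical sketch of prefix matching over a DHT-like key-value
# store. The DHT is a black box here; each put/get would be O(log N) messages.
class BlackBoxDHT:
    def __init__(self):
        self.store = {}                      # stands in for N cooperating nodes
    def _key(self, s):
        return hashlib.sha1(s.encode()).hexdigest()
    def put(self, k, v):
        self.store.setdefault(self._key(k), []).append(v)
    def get(self, k):
        return self.store.get(self._key(k), [])

def subscribe_prefix(dht, prefix, subscriber):
    dht.put("prefix:" + prefix, subscriber)      # one put per subscription

def publish(dht, value):
    # Look up every prefix of the published string: l gets per event.
    matched = set()
    for i in range(1, len(value) + 1):
        matched.update(dht.get("prefix:" + value[:i]))
    return matched

dht = BlackBoxDHT()
subscribe_prefix(dht, "data", "alice")
subscribe_prefix(dht, "database", "bob")
print(publish(dht, "databases"))   # both 'data' and 'database' prefixes match
```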

SESSION: Paper session KM-1 (knowledge management): knowledge systems

Intelligent creation of notification events in information systems: concept, implementation and evaluation
Michael Guppenberger, Burkhard Freitag
Pages: 52-59
DOI: 10.1145/1099554.1099567
An important feature of information systems is the ability to inform users about changes to the stored information. To do so, systems have to 'know' what changes a user wants to be informed about. This is well known from the field of publish/subscribe architectures. In this paper, we propose a solution that shows information system designers how to extend their information model in such a way that the notification mechanism can consider semantic knowledge when determining which parties to inform. Two different kinds of implementations are introduced and evaluated: one based on aspect-oriented programming (AOP), the other based on traditional database triggers. The evaluation of both approaches leads to a combined approach that preserves the advantages of both techniques, using Model Driven Architecture (MDA) to create the triggers from a UML model enhanced with stereotypes.
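
The AOP flavor of such a notification mechanism can be illustrated with a small hypothetical sketch (a Python decorator stands in for an AOP interceptor; the predicate plays the role of the semantic filter, and none of the names come from the paper):

```python
# Hypothetical AOP-style notification hook: a decorator intercepts update
# operations and notifies subscribers whose semantic predicate matches the change.
subscriptions = []   # list of (predicate, callback) pairs

def subscribe(predicate, callback):
    subscriptions.append((predicate, callback))

def notify_on_change(update_fn):
    def wrapper(entity, **changes):
        result = update_fn(entity, **changes)
        for predicate, callback in subscriptions:
            if predicate(entity, changes):          # semantic filter on the change
                callback(entity, changes)
        return result
    return wrapper

@notify_on_change
def update(entity, **changes):
    entity.update(changes)
    return entity

# A user who only cares about price changes of product 42:
subscribe(lambda e, c: e.get("id") == 42 and "price" in c,
          lambda e, c: print("notify: product 42 price is now", c["price"]))

update({"id": 42, "price": 10}, price=12)   # triggers the notification
update({"id": 7, "price": 5}, price=6)      # silently ignored
```

A database-trigger implementation would register an equivalent predicate and action inside the DBMS instead of in application code, which is the trade-off the evaluation in the paper examines.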

Opportunity map: a visualization framework for fast identification of actionable knowledge
Kaidi Zhao, Bing Liu, Thomas M. Tirpak, Weimin Xiao
Pages: 60-67
DOI: 10.1145/1099554.1099568
Data mining techniques frequently find a large number of patterns or rules, which makes it very difficult for a human analyst to interpret the results and to find the truly interesting and actionable rules. Due to the subjective nature of "interestingness", human involvement in the analysis process is crucial. In this paper, we propose a novel visual data mining framework for the purpose of identifying actionable knowledge quickly and easily from discovered rules and data. This framework is called the Opportunity Map. It is inspired by some interesting ideas from Quality Engineering, in particular Quality Function Deployment (QFD) and the House of Quality. It associates summarized data or discovered rules with the application objective using an interactive matrix, which enables the user to quickly identify where the opportunities are. The proposed system can be used to visually analyze discovered rules and other statistical properties of the data. The user can also interactively group actionable attributes and values, and see how they affect the targets of interest. Combined with drill-down and comparative analysis, the user can analyze rules and data at different levels of detail. The proposed visualization framework thus represents a systematic and yet flexible method of rule analysis. Applications of the system to large-scale data sets from our industrial partner have yielded promising results.

Establishing value mappings using statistical models and user feedback
Jaewoo Kang, Tae Sik Han, Dongwon Lee, Prasenjit Mitra
Pages: 68-75
DOI: 10.1145/1099554.1099569
In this paper, we present a "value mapping" algorithm that does not rely on syntactic similarity or semantic interpretation of the values. The algorithm first constructs a statistical model (e.g., co-occurrence frequency or entropy vector) that captures the unique characteristics of values and their co-occurrence. It then finds the matching values by computing the distances between the models while refining the models using user feedback through iterations. Our experimental results suggest that our approach successfully establishes value mappings even in the presence of opaque data values and thus can be a useful addition to the existing data integration techniques.
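
A toy sketch of the co-occurrence idea (hypothetical data and a greedy matcher, not the paper's algorithm or its feedback loop) shows why opaque codes can still be mapped: each value is characterized by the contexts it co-occurs with, and values are matched by comparing those profiles rather than the strings themselves.

```python
from collections import Counter
from math import sqrt

# Toy sketch: match attribute values across two sources by comparing their
# co-occurrence profiles with a shared context attribute (hypothetical example).
def cooccurrence_model(rows, value_col, context_col):
    model = {}
    for row in rows:
        model.setdefault(row[value_col], Counter())[row[context_col]] += 1
    return model

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na, nb = sqrt(sum(v * v for v in a.values())), sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_values(model_a, model_b):
    # Greedy matching: for each value in A, pick the closest value in B.
    return {va: max(model_b, key=lambda vb: cosine(ca, model_b[vb]))
            for va, ca in model_a.items()}

# Source A codes airlines as "UA"/"AA"; source B spells them out. The shared
# context (destination) makes the opaque codes comparable.
src_a = [{"carrier": "UA", "dest": "SFO"}, {"carrier": "UA", "dest": "ORD"},
         {"carrier": "AA", "dest": "DFW"}, {"carrier": "AA", "dest": "MIA"}]
src_b = [{"carrier": "United", "dest": "SFO"}, {"carrier": "United", "dest": "ORD"},
         {"carrier": "American", "dest": "DFW"}, {"carrier": "American", "dest": "MIA"}]
print(match_values(cooccurrence_model(src_a, "carrier", "dest"),
                   cooccurrence_model(src_b, "carrier", "dest")))
```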

SESSION: Paper session IR-2 (information retrieval): question answering

Retrieving answers from frequently asked questions pages on the web
Valentin Jijkoun, Maarten de Rijke
Pages: 76-83
DOI: 10.1145/1099554.1099571
We address the task of answering natural language questions by using the large number of Frequently Asked Questions (FAQ) pages available on the web. The task involves three steps: (1) fetching FAQ pages from the web; (2) automatic extraction of question/answer (Q/A) pairs from the collected pages; and (3) answering users' questions by retrieving appropriate Q/A pairs. We discuss our solutions for each of the three tasks, and give detailed evaluation results on a collected corpus of about 3.6Gb of text data (293K pages, 2.8M Q/A pairs), with real users' questions sampled from a web search engine log. Specifically, we propose simple but effective methods for Q/A extraction and investigate task-specific retrieval models for answering questions. Our best model finds answers for 36% of the test questions in the top 20 results. Our overall conclusion is that FAQ pages on the web provide an excellent resource for addressing real users' information needs in a highly focused manner.

Finding similar questions in large question and answer archives
Jiwoon Jeon, W. Bruce Croft, Joon Ho Lee
Pages: 84-90
DOI: 10.1145/1099554.1099572
There has recently been a significant increase in the number of community-based question and answer services on the Web where people answer other people's questions. These services rapidly build up large archives of questions and answers, and these archives are a valuable linguistic resource. One of the major tasks in a question and answer service is to find questions in the archive that are semantically similar to a user's question. This enables high quality answers from the archive to be retrieved and removes the time lag associated with a community-based system. In this paper, we discuss methods for question retrieval that are based on using the similarity between answers in the archive to estimate probabilities for a translation-based retrieval model. We show that with this model it is possible to find semantically similar questions with relatively little word overlap.
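
The flavor of a translation-based retrieval model can be conveyed by a small worked sketch (hypothetical probabilities and a simplified mixture with a background model, not the trained model from the paper): each query word is generated either by "translating" some word of the archived question or from the collection.

```python
# Simplified translation-based scoring sketch (hypothetical numbers, not the
# paper's model): P(query | archived question) mixes word-to-word translation
# probabilities with a background collection model.
def score(query_words, archived_words, t_prob, collection_prob, lam=0.8):
    s = 1.0
    for q in query_words:
        p_translate = sum(t_prob.get((q, w), 0.0) for w in archived_words) / len(archived_words)
        s *= lam * p_translate + (1 - lam) * collection_prob.get(q, 1e-6)
    return s

# t_prob[(q, w)] ~ probability that archive word w "translates" to query word q,
# so questions with little word overlap can still score highly.
t_prob = {("car", "auto"): 0.5, ("insurance", "insurance"): 0.7, ("car", "car"): 0.8}
collection_prob = {"car": 0.01, "insurance": 0.005}

query = ["car", "insurance"]
print(score(query, ["auto", "insurance", "cost"], t_prob, collection_prob))  # high
print(score(query, ["dog", "food"], t_prob, collection_prob))                # low
```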

Connecting topics in document collections with stepping stones and pathways
Fernando Das-Neves, Edward A. Fox, Xiaoyan Yu
Pages: 91-98
DOI: 10.1145/1099554.1099573
In this paper, we present Stepping Stones and Pathways (SSP), an alternative model of building and presenting answers for the cases when queries on document collections cannot be answered just by a ranked list. Stepping Stones can handle questions like: "What is the relation between topics X and Y?" SSP addresses cases in which the contents of a small set of related documents are needed as an answer rather than a single document, or in which "query splitting" is required to satisfactorily explore a document space. Query results are networks of document groups representing topics, each group relating to and connecting (by documents) to other groups in the network. Thus, a network answers the user's information need. We devise new and more effective representations and techniques to visualize such answers, and to involve users as part of the answer-finding process. In order to verify the validity of our approach, and since the questions we aim to answer involve multiple topics, we performed a study involving a custom-built broad collection of operating systems research papers, and evaluated the results with interested computer science students, using multiple measures.

SESSION: Paper session DB-2 (databases): security and privacy

Securing XML data in third-party distribution systems
Barbara Carminati, Elena Ferrari, Elisa Bertino
Pages: 99-106
DOI: 10.1145/1099554.1099575
Web-based third-party architectures for data publishing are today receiving growing attention, due to their scalability and their ability to efficiently manage large numbers of users and great amounts of data. A third-party architecture relies on a distinction between the Owner and the Publisher of information. The Owner is the producer of information, whereas the Publisher provides data management services and query processing functions for (a portion of) the Owner's information. In such an architecture, there are important security concerns, especially if we do not want to make any assumption about the trustworthiness of the Publishers. Although approaches have been proposed [4, 5] that provide partial solutions to this problem, no comprehensive framework able to support all the most important security properties in the presence of an untrusted Publisher has been developed so far. In this paper, we develop an XML-based solution to this problem, which makes use of non-conventional digital signature techniques and queries over encrypted data.

The case for access control on XML relationships
Béatrice Finance, Saïda Medjdoub, Philippe Pucheral
Pages: 107-114
DOI: 10.1145/1099554.1099576
With the emergence of XML as the de facto standard to exchange and disseminate information, the problem of regulating access to XML documents has attracted considerable attention in recent years. Existing models attach authorizations to nodes of an XML document but disregard relationships between them. However, ancestor and sibling relationships may reveal information as sensitive as that carried by the nodes themselves (e.g., classification). This paper advocates the integration of relationships as first-class citizens in access control models for XML and makes the following contributions. First, it characterizes important relationship authorizations and identifies the mechanisms required to translate them accurately into an authorized view of a source document. Second, it introduces a rule-based formulation for expressing these classes of relationship authorizations and defines an associated conflict resolution strategy. Rather than being yet another XML access control model, the proposed approach allows a seamless integration of relationship authorizations into existing XML access control models.

A function-based access control model for XML databases
Naizhen Qi, Michiharu Kudo, Jussi Myllymaki, Hamid Pirahesh
Pages: 115-122
DOI: 10.1145/1099554.1099577
XML documents are frequently used in applications such as business transactions and medical records involving sensitive information. Typically, parts of documents should be visible to users depending on their roles. For instance, an insurance agent may see the billing information part of a medical document but not the details of the patient's medical history. Access control on the basis of data location or value in an XML document is therefore essential. In practice, the number of access control rules is on the order of millions, which is a product of the number of document types (in 1000's) and the number of user roles (in 100's). Therefore, the solution requires high scalability and performance. Current approaches to access control over XML documents have suffered from scalability problems because they tend to work on individual documents. In this paper, we propose a novel approach to XML access control through rule functions that are managed separately from the documents. A rule function is an executable code fragment that encapsulates the access rules (paths and predicates), and is shared by all documents of the same document type. At runtime, the rule functions corresponding to the access request are executed to determine the accessibility of document fragments. Using synthetic and real data, we show the scalability of the scheme by comparing the accessibility evaluation cost of two rule function models. We show that rule functions generated on a per-user basis are more efficient for XML databases.

SESSION: Paper session KM-2 (knowledge management): index structures

Exact match search in sequence data using suffix trees
Mihail Halachev, Nematollaah Shiri, Anand Thamildurai
Pages: 123-130
DOI: 10.1145/1099554.1099579
We study suitable indexing techniques to support efficient exact match search in large biological sequence databases. We propose a suffix tree (ST) representation, called STA-DF, as an alternative to the array representation of ST (STA) proposed in [7] and utilized in [18]. To study the performance of STA and STA-DF, we develop a memory efficient ST-based Exact Match (STEM) search algorithm. We implemented STEM and both representations of ST and conducted extensive experiments. Our results indicate that the STA and STA-DF representations are very similar in construction time, storage utilization, and search time using STEM. In terms of the access patterns by STEM, our results show that compared to STA, the STA-DF representation exhibits better spatial and sequential locality of reference. This suggests that STA-DF would require fewer disk I/Os, and hence is more amenable to efficient and scalable disk-based computation.
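
For readers unfamiliar with suffix-based exact match search, the core idea can be shown with a deliberately simplified stand-in (a plain suffix array with binary search, not the STA/STA-DF suffix-tree representations studied in the paper): every occurrence of a pattern is the prefix of some suffix of the sequence.

```python
# Simplified stand-in for suffix-based exact match: a naive suffix array plus
# binary search (illustration only; not the paper's STA or STA-DF structures).
def build_suffix_array(seq):
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def exact_match(seq, sa, pattern):
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    # Upper bound: first suffix whose m-character prefix is > pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])   # all start positions of the pattern

seq = "ACGTACGGACGT"
sa = build_suffix_array(seq)
print(exact_match(seq, sa, "ACG"))   # -> [0, 4, 8]
```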

Rotation invariant indexing of shapes and line drawings
Michail Vlachos, Zografoula Vagena, Philip S. Yu, Vassilis Athitsos
Pages: 131-138
DOI: 10.1145/1099554.1099580
We present data representations, distance measures and organizational structures for fast and efficient retrieval of similar shapes in image databases. Using the Hough Transform we extract shape signatures that correspond to important features of an image. The new shape descriptor is robust against line discontinuities and takes into consideration not only the shape boundaries, but also the content inside the object perimeter. The object signatures are eventually projected into a space that renders them invariant to translation, scaling and rotation. In order to provide support for real-time query-by-content, we also introduce an index structure that hierarchically organizes compressed versions of the extracted object signatures. In this manner we can achieve a significant performance boost for multimedia retrieval. Our experiments suggest that by exploiting the proposed framework, similarity search in a database of 100,000 images would require under 1 sec, using an off-the-shelf personal computer.

DIST: a distributed spatio-temporal index structure for sensor networks
Anand Meka, Ambuj Singh
Pages: 139-146
DOI: 10.1145/1099554.1099581
We consider the general problem of tracking moving objects in sensor networks. The specific application we consider is that of tracking a chemical plume moving over a large infrastructure network. We present a distributed index structure DIST that stores and updates distributed summaries as the plume moves. We present algorithms for range queries on the history of the plume. DIST localizes information with respect to time and space using a hierarchy that scales with the plume size. The highlight of our work is an analytical model to predict the cost of query algorithms based on the query location, query size, and the plume's spatio-temporal distribution. Using this model, our adaptive approach chooses the optimal scheme. Experimental results show that DIST outperforms alternative techniques in query, update, and storage costs, and scales well with the number of plumes.

SESSION: Paper session IR-3 (information retrieval): web retrieval

Focused crawling for both topical relevance and quality of medical information
Thanh Tin Tang, David Hawking, Nick Craswell, Kathy Griffiths
Pages: 147-154
DOI: 10.1145/1099554.1099583
Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful. To address the problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated the quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality. We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance focused crawler and the quality focused crawler retrieved twice as many relevant pages as a breadth-first control. The quality focused crawler was quite effective in reducing the amount of low quality material fetched while crawling more high quality content, relative to the relevance focused crawler. Analysis suggests that quality of content might be improved by post-filtering a very big breadth-first crawl, at the cost of substantially increased network traffic.
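
The crawl-ordering rule described above (priority = link relevance x source quality) reduces to a small frontier computation. The following sketch is hypothetical: toy keyword-counting functions stand in for the learned anchor-relevance and page-quality models.

```python
import heapq

# Hypothetical sketch of the crawl frontier: the priority of an unfetched URL is
# (relevance predicted from its anchor context) * (quality of the linking page).
RELEVANT = {"depression", "mood", "treatment"}          # toy relevance lexicon
QUALITY = {"evidence", "randomized", "clinical", "trial"}  # toy quality lexicon

def relevance_from_anchor(anchor_text):
    words = anchor_text.lower().split()
    return sum(w in RELEVANT for w in words) / max(len(words), 1)

def quality_of_page(page_text):
    words = page_text.lower().split()
    return sum(w in QUALITY for w in words) / max(len(words), 1)

def enqueue_links(frontier, source_text, links):
    q = quality_of_page(source_text)
    for url, anchor in links:
        priority = relevance_from_anchor(anchor) * q
        heapq.heappush(frontier, (-priority, url))   # max-priority via min-heap

frontier = []
enqueue_links(frontier, "randomized clinical trial of treatment",
              [("http://a.example/depression", "depression treatment guide"),
               ("http://a.example/contact", "contact us")])
print(heapq.heappop(frontier))   # the depression link is fetched first
```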

Hybrid index structures for location-based web search
Yinghua Zhou, Xing Xie, Chuang Wang, Yuchang Gong, Wei-Ying Ma
Pages: 155-162
DOI: 10.1145/1099554.1099584
There is more and more commercial and research interest in location-based web search, i.e. finding web content whose topic is related to a particular place or region. In this type of search, location information should be indexed as well as text information. However, the index of a conventional text search engine is set-oriented, while location information is two-dimensional and in Euclidean space. This brings new research problems on how to efficiently represent the location attributes of web pages and how to combine the two types of indexes. In this paper, we propose to use a hybrid index structure, which integrates inverted files and R*-trees, to handle both textual and location-aware queries. Three different combining schemes are studied: (1) inverted file and R*-tree double index, (2) first inverted file then R*-tree, (3) first R*-tree then inverted file. To validate the performance of the proposed index structures, we design and implement a complete location-based web search engine which mainly consists of four parts: (1) an extractor which detects the geographical scopes of web pages and represents geographical scopes as multiple MBRs based on geographical coordinates; (2) an indexer which builds hybrid index structures to integrate text and location information; (3) a ranker which ranks results by geographical relevance as well as non-geographical relevance; (4) an interface which makes it easy for users to input location-based search queries and to obtain geographically and textually relevant results. Experiments on a large real-world web dataset show that both the second and the third structures are superior in query time and that the second is slightly better than the third. Additionally, indexes based on R*-trees are shown to be more efficient than indexes based on grid structures.
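
The "first inverted file, then R*-tree" combination can be illustrated with a compact hypothetical sketch: candidates are first narrowed by the text index, and only the survivors are checked against the query region. A simple MBR-intersection test stands in for the R*-tree probe.

```python
# Hypothetical sketch of the "first inverted file, then spatial filter" scheme.
# A linear MBR-intersection test stands in for the R*-tree used in the paper.
inverted_index = {          # term -> set of page ids
    "hotel": {1, 2, 3},
    "museum": {2, 4},
}
page_mbrs = {               # page id -> (min_x, min_y, max_x, max_y) of its scope
    1: (0, 0, 1, 1),
    2: (10, 10, 12, 12),
    3: (10.5, 10.5, 11, 11),
    4: (50, 50, 51, 51),
}

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def location_aware_query(terms, query_region):
    # Step 1: text filtering via the inverted file (intersect posting lists).
    candidates = set.intersection(*(inverted_index.get(t, set()) for t in terms))
    # Step 2: spatial filtering of the surviving candidates.
    return sorted(p for p in candidates if intersects(page_mbrs[p], query_region))

print(location_aware_query(["hotel"], (9, 9, 13, 13)))   # -> [2, 3]
```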

Person resolution in person search results: WebHawk
Xiaojun Wan, Jianfeng Gao, Mu Li, Binggong Ding
Pages: 163-170
DOI: 10.1145/1099554.1099585
Finding information about people on the Web using a search engine is difficult because there is a many-to-many mapping between person names and specific persons (i.e. referents). This paper describes a person resolution system, called WebHawk. Given a list of pages obtained by submitting a person query to a search engine, WebHawk facilitates person search in three steps: First, a filter removes those pages that contain no information about any person. Second, a clusterer groups the remaining pages into different clusters, each for one specific person. To make the resulting clusters more meaningful, an extractor is used to induce query-oriented personal information from each page. Finally, a namer generates an informative description for each cluster so that users can find any specific person easily. The architecture of WebHawk is presented, and the four components are discussed in detail, with a separate evaluation of each component presented where appropriate. A user study shows that WebHawk complements most existing search engines and successfully improves users' experience of person search on the Web.

SESSION: Paper session DB-3 (databases): sensors and data streams

Adaptive load shedding for windowed stream joins
Buğra Gedik, Kun-Lung Wu, Philip S. Yu, Ling Liu
Pages: 171-178
DOI: 10.1145/1099554.1099587
We present an adaptive load shedding approach for windowed stream joins. In contrast to the conventional approach of dropping tuples from the input streams, we explore the concept of selective processing for load shedding. We allow stream tuples to be stored in the windows and shed excessive CPU load by performing the join operations, not on the entire set of tuples within the windows, but on a dynamically changing subset of tuples that are learned to be highly beneficial. We support such dynamic selective processing through three forms of runtime adaptations: adaptation to input stream rates, adaptation to time correlation between the streams and adaptation to join directions. Indexes are used to further speed up the execution of stream joins. Experiments are conducted to evaluate our adaptive load shedding in terms of output rate. The results show that our selective processing approach to load shedding is very effective and significantly outperforms the approach that drops tuples from the input streams.
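
A toy sketch of selective processing (hypothetical, not the paper's adaptive algorithms or benefit estimators): incoming tuples are always inserted into the window, but under CPU overload the probe visits only the window partitions that have historically produced the most join output.

```python
from collections import deque

# Toy sketch of selective processing for a windowed stream join. Tuples are
# always stored; under overload only the most "beneficial" partitions are probed.
class SelectiveJoinWindow:
    def __init__(self, window_size, num_partitions):
        self.partitions = [deque() for _ in range(num_partitions)]
        self.benefit = [1.0] * num_partitions   # learned usefulness per partition
        self.window_size = window_size

    def insert(self, tup):                      # tup is a (key, payload) pair
        part = self.partitions[hash(tup[0]) % len(self.partitions)]
        part.append(tup)
        if len(part) > self.window_size:
            part.popleft()                      # expire the oldest tuple

    def probe(self, tup, cpu_budget):
        # Probe partitions in decreasing benefit order until the budget runs out;
        # whatever cannot be probed within the budget is shed.
        order = sorted(range(len(self.partitions)),
                       key=lambda p: self.benefit[p], reverse=True)
        matches, probed = [], 0
        for p in order:
            for stored in self.partitions[p]:
                if probed >= cpu_budget:
                    return matches
                probed += 1
                if stored[0] == tup[0]:
                    matches.append(stored)
                    self.benefit[p] += 0.1      # this partition produced output
        return matches

w = SelectiveJoinWindow(window_size=100, num_partitions=4)
w.insert(("user42", "click"))
print(w.probe(("user42", "view"), cpu_budget=50))   # -> [('user42', 'click')]
```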

Integrating DCT and DWT for approximating cube streams
Ming-Jyh Hsieh, Ming-Syan Chen, Philip S. Yu
Pages: 179-186
DOI: 10.1145/1099554.1099588
For time-relevant multi-dimensional data sets (MDS), users usually face a huge amount of data due to the large dimensionality, and approximate query processing has emerged as a viable solution. Specifically, cube streams handle MDSs in a continuous manner. Traditional cube approximation focuses on generating single snapshots rather than continuous ones. To address this issue, the application of generating snapshots for cube streams, called SCS, is investigated in this paper. Such an application collects data events for cube streams on-line and generates snapshots with limited resources in order to keep the approximated information in synopsis memory for further analysis. As compared to OLAP applications, SCS applications are subject to much tighter resource constraints for both processing time and memory and cannot be dealt with by existing methods due to the limited resources. In this paper, the DAWA algorithm, standing for a hybrid algorithm of Dct for Data and discrete WAvelet transform, is proposed to approximate cube streams. The DAWA algorithm combines the advantage of the high compression rate of DWT and that of the low memory cost of DCT. Consequently, DAWA requires a much smaller working buffer and outperforms both DWT-based and DCT-based methods in execution efficiency. Also, it is shown that DAWA provides answers of good quality for SCS applications with a small working buffer and a short execution time. The optimality of the DAWA algorithm is theoretically proved and also empirically demonstrated by our experiments.

Exploiting redundancy in sensor networks for energy efficient processing of spatiotemporal region queries
Alexandru Coman, Mario A. Nascimento, Jörg Sander
Pages: 187-194
DOI: 10.1145/1099554.1099589
Sensor networks are made of autonomous devices that are able to collect, store, process and share data with other devices. Spatiotemporal region queries can be used for retrieving information of interest from such networks. Such queries require answers only from the subset of the network nodes that fall into the query region. If the network is redundant in the sense that the measurements of some nodes can be substituted by those of other nodes with a certain degree of confidence, then a much smaller subset of nodes may be sufficient to answer the query at a lower energy cost. We investigate how to take advantage of such data redundancy and propose two techniques to process spatiotemporal region queries under these conditions. Our techniques reduce the energy cost of query processing by up to a factor of twenty compared to typical network flooding, thus prolonging the lifetime of the sensor network.

SESSION: Paper session KM-3 (knowledge management): classification & clustering

Collective multi-label classification
Nadia Ghamrawi, Andrew McCallum
Pages: 195-200
DOI: 10.1145/1099554.1099591
Common approaches to multi-label classification learn independent classifiers for each category, and employ ranking or thresholding schemes for classification. Because they do not exploit dependencies between labels, such techniques are only well-suited to problems in which categories are independent. However, in many domains labels are highly interdependent. This paper explores multi-label conditional random field (CRF) classification models that directly parameterize label co-occurrences in multi-label classification. Experiments show that the models outperform their single-label counterparts on standard text corpora. Even when multi-labels are sparse, the models improve subset classification error by as much as 40%.

Clustering high-dimensional data using an efficient and effective data space reduction
Ratko Orlandic, Ying Lai, Wai Gen Yee
Pages: 201-208
DOI: 10.1145/1099554.1099592
This paper introduces a new algorithm for clustering data in high-dimensional feature spaces, called GARDENHD. The algorithm is organized around the notion of data space reduction, i.e. the process of detecting dense areas (dense cells) in the space. It performs effective and efficient elimination of empty areas that characterize typical high-dimensional spaces and an efficient adjacency-connected agglomeration of dense cells into larger clusters. It produces a compact representation that can effectively capture the essence of data. GARDENHD is a hybrid of cell-based and density-based clustering. However, unlike typical clustering methods in its class, it applies a recursive partition of sparse regions in the space using a new space-partitioning strategy. The properties of this partitioning strategy greatly facilitate data space reduction. The experiments on synthetic and real data sets reveal that GARDENHD and its data space reduction are effective, efficient, and scalable.

Versatile structural disambiguation for semantic-aware applications
Federica Mandreoli, Riccardo Martoglia, Enrico Ronchetti
Pages: 209-216
DOI: 10.1145/1099554.1099593
In this paper, we propose a versatile disambiguation approach which can be used to make explicit the meaning of structure-based information such as XML schemas, XML document structures, web directories, and ontologies. It can support the semantic awareness of a wide range of applications, from schema matching and query rewriting to peer data management systems, and from XML data clustering to ontology-based automatic annotation of web pages and query expansion. The effectiveness of the achieved results has been experimentally proved and is founded both on a flexible exploitation of the structural context, whose extraction can be tailored to the specific application needs, and on the information provided by commonly available thesauri such as WordNet.

POSTER SESSION: Poster Session

D-CAPE: distributed and self-tuned continuous query processing
Timothy M. Sutherland, Bin Liu, Mariana Jbantova, Elke A. Rundensteiner
Pages: 217-218
DOI: 10.1145/1099554.1099595

Mining conserved XML query paths for dynamic-conscious caching
Qiankun Zhao, Sourav S. Bhowmick, Le Gruenwald
Pages: 219-220
DOI: 10.1145/1099554.1099596
Existing XML query pattern-based caching strategies focus on extracting the set of frequently issued query pattern trees based on the number of occurrences of the query pattern trees in the history. Each occurrence of the same query pattern tree is considered equally important for the caching strategy. However, the same query pattern tree may occur at different timepoints in the history of XML queries. This temporal feature can be used to improve the caching strategy. In this paper, we propose a novel type of query pattern called conserved query paths for efficient caching by integrating the support and temporal features together. Conserved query paths are paths in query pattern trees that never change or do not change significantly most of the time (if not always) in terms of their support values during a specific time period. We propose an algorithm to extract those conserved query paths. By ranking those conserved query paths, a dynamic-conscious caching (DCC) strategy is proposed for efficient XML query processing. Experiments show that the DCC caching strategy outperforms the existing XML query pattern tree-based caching strategies.

Optimizing continuous multijoin queries over distributed streams
Yongluan Zhou, Ying Yan, Beng Chin Ooi, Kian-Lee Tan, Aoying Zhou
Pages: 221-222
DOI: 10.1145/1099554.1099597

Processing XPath queries with XML summaries
Takeharu Eda, Makoto Onizuka, Masashi Yamamuro
Pages: 223-224
DOI: 10.1145/1099554.1099598
Range labeling and structural joins are well-studied techniques for efficiently processing XPath queries. However, when XPath queries become long, many structural joins are required. To solve this problem, we developed a method to reduce the number of joins and the number of nodes read from disk using strong DataGuides. Our method can process single paths without any joins, and twig patterns with joins amongst branching nodes and leaves in queries. Experimental results verified that our approach outperforms the best optimization technique for structural joins by factors of up to several hundred.

On reducing redundancy and improving efficiency of XML labeling schemes
Changqing Li, Tok Wang Ling, Jiaheng Lu, Tian Yu
Pages: 225-226
DOI: 10.1145/1099554.1099599
The basic relationships to be determined in XML query processing are ancestor-descendant (A-D), parent-child (P-C), sibling and ordering relationships. The containment labeling scheme can determine the A-D, P-C and ordering relationships fast, but it is very expensive for determining the sibling relationship. The prefix labeling scheme can determine all four basic relationships fast if the XML tree is shallow. However, if the XML tree is deep, the prefix scheme is inefficient since the prefix is long. Furthermore, the prefix label is repeated by all the siblings (only the self labels of these siblings are different). Thus, in this paper, we propose the P-Containment and P-Prefix schemes, which can determine all four basic relationships faster no matter what the XML structure is; meanwhile, P-Prefix can reduce the redundancies in the prefix labeling scheme.
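
For context, the containment labeling this poster builds on can be sketched in a few lines (a generic illustration of the classic scheme, not the proposed P-Containment or P-Prefix schemes): each element receives a (start, end, level) label from a depth-first pass, and the A-D and P-C relationships reduce to simple comparisons.

```python
from itertools import count

# Generic containment-labeling sketch (background only, not P-Containment /
# P-Prefix): label each element with (start, end, level) via a depth-first pass.
def label(tree, name="root"):
    counter, labels = count(1), {}

    def visit(node, node_name, level):
        start = next(counter)
        for child_name, subtree in node.items():
            visit(subtree, child_name, level + 1)
        labels[node_name] = (start, next(counter), level)  # toy: names assumed unique

    visit(tree, name, 0)
    return labels

def is_ancestor(a, d):          # ancestor-descendant via interval containment
    return a[0] < d[0] and d[1] < a[1]

def is_parent(a, d):            # parent-child adds a level check
    return is_ancestor(a, d) and d[2] == a[2] + 1

labels = label({"title": {}, "chapter": {"section": {}}}, name="book")
print(is_ancestor(labels["book"], labels["section"]))     # True
print(is_parent(labels["book"], labels["section"]))       # False (grandchild)
print(is_parent(labels["chapter"], labels["section"]))    # True
```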

Applying cosine series to join size estimation
Cheng Luo, Zhewei Jiang, Wen-Chi Hou
Pages: 227-228
DOI: 10.1145/1099554.1099600
This paper provides a general overview of two innovative applications of Cosine series in XML joins and data stream joins.

Database selection in intranet mediators for natural language queries
Fang Liu, Shuang Liu, Clement Yu, Weiyi Meng, Ophir Frieder, David Grossman
Pages: 229-230
DOI: 10.1145/1099554.1099601

Structure-based query-specific document summarization
Ramakrishna Varadarajan, Vagelis Hristidis
Pages: 231-232
DOI: 10.1145/1099554.1099602
Summarization of text documents is increasingly important with the amount of data available on the Internet. The large majority of current approaches view documents as linear sequences of words and create query-independent summaries. However, ignoring the structure of the document degrades the quality of summaries. Furthermore, the popularity of web search engines requires query-specific summaries. We present a method to create query-specific summaries by adding structure to documents through the extraction of associations between their fragments.

Typed functional query languages with equational specifications
Ken Q. Pu, Alberto O. Mendelzon
Pages: 233-234
DOI: 10.1145/1099554.1099603
We present a framework for functionally modeling query languages and data models. Data and queries are uniformly represented by first-order functions, and query-language constructs by polymorphic higher-order functions. The functions are typed by a database-oriented type system that supports polymorphism and nesting of types, thus one can perform static type-checking and type-inferencing of query expressions. The query language can be freely extended by introducing new querying constructs as polymorphic higher-order functions. While type information gives the input-output description of the functions, the semantic information is captured by equational specifications. Knowledge about the functions is represented as equalities of functional expressions in the form of equations. By equational axiomatization of the query language, the database problems of query equivalence and answering queries using views can be posed as equational word problems and equational matching.

DSAC: integrity for outsourced databases with signature aggregation and chaining
Maithili Narasimha, Gene Tsudik
Pages: 235-236
DOI: 10.1145/1099554.1099604
Database outsourcing is an important trend which involves data owners farming out their data management needs to an external service provider. One important requirement is to maintain the integrity and authenticity of outsourced data. Whenever an outsourced database is queried, the corresponding query reply must be demonstrably authentic. Furthermore, a reply must include a proof of completeness to convince the querier that no data matching the query predicate(s) has been omitted. In this paper, we suggest new techniques in support of efficient authenticity and completeness guarantees of such query replies.
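
The completeness idea behind record chaining can be illustrated with a small hypothetical sketch (plain hashes stand in for the owner's aggregated signatures, and the server is assumed to return the key of the record just before the range as a boundary witness): binding each record to its immediate predecessor in sorted key order makes omissions inside a range answer detectable.

```python
import hashlib

# Hypothetical chaining sketch: each record's tag binds it to its predecessor in
# sorted key order. Plain hashes stand in for the owner's signatures here.
def digest(*parts):
    return hashlib.sha256("|".join(str(p) for p in parts).encode()).hexdigest()

def sign_table(rows):                     # done by the data owner before outsourcing
    rows = sorted(rows, key=lambda r: r["key"])
    tags = {}
    for prev, cur in zip([{"key": "-inf"}] + rows, rows):
        tags[cur["key"]] = digest(prev["key"], cur["key"], cur["value"])
    return tags

def verify_range(answer, boundary_key, tags):
    """answer: server's rows sorted by key; boundary_key: key of the record
    immediately preceding the range ('-inf' if the range starts at the table)."""
    prev_key = boundary_key
    for row in answer:
        if tags.get(row["key"]) != digest(prev_key, row["key"], row["value"]):
            return False                  # a record was dropped, reordered, or altered
        prev_key = row["key"]
    return True

rows = [{"key": 1, "value": "a"}, {"key": 3, "value": "b"}, {"key": 7, "value": "c"}]
tags = sign_table(rows)
print(verify_range(rows[1:], boundary_key=1, tags=tags))   # True: reply is complete
print(verify_range(rows[2:], boundary_key=1, tags=tags))   # False: key 3 was omitted
```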
|
|
|
Answering aggregation queries on hierarchical web sites using adaptive sampling |
| |
Foto N. Afrati,
Paraskevas V. Lekeas,
Chen Li
|
|
Pages: 237-238 |
|
doi>10.1145/1099554.1099605 |
|
Full text: PDF
|
|
We study how to answer aggregation queries over hierarchical Web sites using adaptive sampling.
|
|
|
OSQR: overlapping clustering of query results |
| |
Bhuvan Bamba,
Prasan Roy,
Mukesh Mohania
|
|
Pages: 239-240 |
|
doi>10.1145/1099554.1099606 |
|
Full text: PDF
|
|
|
|
|
INFER: a relational query language without the complexity of SQL |
| |
Terrence Mason,
Ramon Lawrence
|
|
Pages: 241-242 |
|
doi>10.1145/1099554.1099607 |
|
Full text: PDF
|
|
The INFER query language allows users to express queries without referencing relations or specifying joins. Since the INFER syntax is similar to but less restrictive than SQL, users can easily write highly expressive queries that are automatically completed by INFER's inference engine. INFER's SQL-based syntax is familiar to current database users, and its improved ranking and query explanation system makes it easier to use.
|
|
|
Efficient data dissemination using locale covers |
| |
Sandeep Gupta,
Jinfeng Ni,
Chinya V. Ravishankar
|
|
Pages: 243-244 |
|
doi>10.1145/1099554.1099608 |
|
Full text: PDF
|
|
Location-dependent data are central to many emerging applications, ranging from traffic information services to sensor networks. The standard pull- and push-based data dissemination models become unworkable when the data volumes and the number of clients are high. We address this problem using locale covers, a subset of the original set of locations of interest, chosen to include at least one location in a suitably defined neighborhood of any client. Since location-dependent values are highly correlated with location, a query can be answered using a location close to the query point. We show that location-dependent queries may be answered satisfactorily using locale covers, with only a small loss of accuracy. Our approach is independent of the locations and speeds of clients, and is applicable to mobile clients.
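To make the covering idea concrete, here is a minimal greedy sketch (not the authors' algorithm) that selects a locale cover: a subset of locations such that every location of interest lies within a radius r of some selected location. The point format, radius, and distance function are illustrative assumptions.

```python
import math

def euclid(p, q):
    """Plain Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def greedy_locale_cover(locations, r, dist=euclid):
    """Greedily pick a subset of `locations` so that every location
    lies within distance `r` of some picked location (a locale cover)."""
    uncovered = set(range(len(locations)))
    cover = []
    while uncovered:
        # Pick the location that covers the most currently uncovered points.
        best = max(
            range(len(locations)),
            key=lambda i: sum(1 for j in uncovered
                              if dist(locations[i], locations[j]) <= r),
        )
        cover.append(locations[best])
        uncovered = {j for j in uncovered
                     if dist(locations[best], locations[j]) > r}
    return cover

if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (5, 5), (5, 6), (10, 0)]
    print(greedy_locale_cover(pts, r=2.0))
```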
|
|
|
Incremental evaluation of a monotone XPath fragment |
| |
Hidetaka Matsumura,
Keishi Tajima
|
|
Pages: 245-246 |
|
doi>10.1145/1099554.1099609 |
|
Full text: PDF
|
|
This paper presents a scheme for incremental evaluation of XPath queries. We focus on a monotone fragment of XPath, i.e., one in which a deletion from (or insertion into) the database can only cause deletions from (resp. insertions into) the query answers. To process deletions efficiently, we store information on partial matchings, i.e., which elements participated in matchings for which query answers, together with counters recording how many matchings each query answer had. We also use the information on partial matchings to skip part of the computation upon data insertion. We investigate properties of the XPath fragment in order to keep the amount of stored information as small as possible.
|
|
|
Discovering strong skyline points in high dimensional spaces |
| |
Zhenjie Zhang,
Xinyu Guo,
Hua Lu,
Anthony K. H. Tung,
Nan Wang
|
|
Pages: 247-248 |
|
doi>10.1145/1099554.1099610 |
|
Full text: PDF
|
|
Current interest in skyline computation arises from its relation to preference queries. Since a skyline point is guaranteed not to lose out in all dimensions when compared to any other point in the data set, for each skyline point there exists an assignment of weights to the dimensions under which that point becomes the top user preference. We believe that the usefulness of skyline points is not limited to such applications and can be extended to data analysis and knowledge discovery as well. However, since the skyline of a high dimensional dataset (common in data analysis applications) can contain too many points, various means must be developed to filter out the less interesting skyline points in high dimensions. In this paper, we propose algorithms to find a set of interesting skyline points called strong skyline points. Extensive experiments show that our proposal is both effective and efficient.
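For readers unfamiliar with skylines, the sketch below shows the underlying dominance test and a naive skyline computation (assuming larger values are better in every dimension); it is illustrative only and is not the strong-skyline algorithm proposed in the paper.

```python
def dominates(p, q):
    """True if point p is at least as good as q in every dimension
    and strictly better in at least one (larger is better)."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

if __name__ == "__main__":
    data = [(3, 7), (5, 5), (2, 9), (4, 4), (6, 2)]
    print(skyline(data))   # [(3, 7), (5, 5), (2, 9), (6, 2)]
```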
|
|
|
Mining undiscovered public knowledge from complementary and non-interactive biomedical literature through semantic pruning |
| |
Xiaohua Hu,
Illhoi Yoo,
Min Song,
Yanqing Zhang,
Il-Yeol Song
|
|
Pages: 249-250 |
|
doi>10.1145/1099554.1099611 |
|
Full text: PDF
|
|
Two complementary and non-interactive sets of literature articles, when considered together, can reveal useful information of scientific interest that is not apparent in either document set alone. Swanson called the existence of such knowledge undiscovered public knowledge (UDPK). This paper proposes a semantic-based mining model for UDPK. Our method replaces manual ad hoc pruning with semantic knowledge from biomedical ontologies. Using the semantic types and semantic relationships of biomedical concepts, our prototype system can identify relevant concepts collected from Medline and generate novel hypotheses connecting these concepts. The system successfully replicates Swanson's two famous discoveries: Raynaud disease/fish oils and migraine/magnesium. Compared with previous approaches, our method generates far fewer but more relevant novel hypotheses, and requires much less human intervention in the discovery procedure.
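To illustrate the underlying Swanson-style A-B-C linking (a simplification that omits the paper's semantic pruning), the sketch below finds intermediate terms B that co-occur with a source concept A in one document set and with a target concept C in another. The document representation and term preprocessing are hypothetical.

```python
from collections import Counter

def cooccurring_terms(documents, concept):
    """Terms that appear in documents mentioning `concept`.
    Each document is a set of lowercase terms (hypothetical preprocessing)."""
    counts = Counter()
    for terms in documents:
        if concept in terms:
            counts.update(terms - {concept})
    return counts

def linking_terms(docs_a, concept_a, docs_c, concept_c, top_k=10):
    """Swanson-style A-B-C linking: intermediate B-terms that co-occur with
    A in one literature and with C in the other, ranked by joint frequency."""
    b_from_a = cooccurring_terms(docs_a, concept_a)
    b_from_c = cooccurring_terms(docs_c, concept_c)
    shared = set(b_from_a) & set(b_from_c)
    ranked = sorted(shared, key=lambda b: b_from_a[b] * b_from_c[b], reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    raynaud_docs = [{"raynaud disease", "blood viscosity"},
                    {"raynaud disease", "platelet aggregation"}]
    fish_oil_docs = [{"fish oils", "blood viscosity"},
                     {"fish oils", "epa"}]
    print(linking_terms(raynaud_docs, "raynaud disease", fish_oil_docs, "fish oils"))
```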
|
|
|
Access control for XML: a dynamic query rewriting approach |
| |
Sriram Mohan,
Arijit Sengupta,
Yuqing Wu
|
|
Pages: 251-252 |
|
doi>10.1145/1099554.1099612 |
|
Full text: PDF
|
|
Being able to express and enforce role-based access control on XML data is a critical component of XML data management. However, given the semi-structured nature of XML, this is non-trivial, as access control can be applied to the values of nodes as well as to the structural relationships between nodes. In this context, we adopt and extend a graph editing language for specifying role-based access constraints in the form of security views. A Security Annotated Schema (SAS) is proposed as the internal representation for the security views and can be automatically constructed from the original schema and the security view specification. To enforce the access constraints on user queries, we propose Secure Query Rewrite (SQR) -- a set of rules that rewrite a user XPath query on the security view into an equivalent XQuery expression against the original data, with the guarantee that users see only the information exposed by the view and none of the blocked data. Experimental evaluation demonstrates the efficiency and the expressiveness of our approach.
|
|
|
Relational computation for mining association rules from XML data |
| |
Hong-Cheu Liu,
John Zeleznikow
|
|
Pages: 253-254 |
|
doi>10.1145/1099554.1099613 |
|
Full text: PDF
|
|
We develop a fixpoint operator for computing large itemsets and demonstrate three query-paradigm solutions for association rule mining that use the idea of least fixpoint computation, indicating some optimisation issues along the way. The results of our research provide a theoretical foundation for the relational computation of association rules and its application to XML mining.
|
|
|
Mining all maximal frequent word sequences in a set of sentences |
| |
Helena Ahonen-Myka
|
|
Pages: 255-256 |
|
doi>10.1145/1099554.1099614 |
|
Full text: PDF
|
|
We present an efficient algorithm for finding all maximal frequent word sequences in a set of sentences. Given a frequency threshold σ, a word sequence s is considered frequent if its words occur in at least σ sentences, and in each of these sentences they occur in the same order as in s. Hence, the words of a sequence s do not have to occur consecutively in the sentences.
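As a minimal illustration of this notion of frequency (not the authors' mining algorithm), the sketch below checks whether a candidate word sequence occurs, as an ordered but not necessarily consecutive subsequence, in at least σ sentences.

```python
def is_subsequence(seq, sentence):
    """True if the words of `seq` appear in `sentence` in order
    (not necessarily consecutively)."""
    it = iter(sentence)
    return all(word in it for word in seq)

def is_frequent(seq, sentences, sigma):
    """A word sequence is frequent if at least `sigma` sentences contain it
    as an ordered subsequence."""
    support = sum(1 for s in sentences if is_subsequence(seq, s))
    return support >= sigma

if __name__ == "__main__":
    sentences = [
        "the cat sat on the mat".split(),
        "a cat quietly sat near the mat".split(),
        "dogs sat on mats".split(),
    ]
    print(is_frequent(["cat", "sat", "mat"], sentences, sigma=2))  # True
```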
|
|
|
Joint deduplication of multiple record types in relational data |
| |
Aron Culotta,
Andrew McCallum
|
|
Pages: 257-258 |
|
doi>10.1145/1099554.1099615 |
|
Full text: PDF
|
|
Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these interdependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. For two citation matching datasets, we show that collectively deduplicating paper and venue records results in up to a 30% error reduction in venue deduplication, and up to a 20% error reduction in paper deduplication.
|
|
|
Localized routing trees for query processing in sensor networks |
| |
Jie Lian,
Lei Chen,
Kshirasagar Naik,
M. Tamer Özsu,
G. Agnew
|
|
Pages: 259-260 |
|
doi>10.1145/1099554.1099616 |
|
Full text: PDF
|
|
In this paper, we propose a novel energy-efficient approach, a localized routing tree (LRT) coupled with a route redirection (RR) strategy, to support various types of queries. LRTs take care of the sensors near the sink and reduce the energy consumption of these sensors, while RR reduces the energy cost of data receptions. Simulation studies show that, compared to existing approaches, LRT together with RR significantly improves query capacity.
|
|
|
A latent semantic classification model |
| |
Ming-Wen Wang,
Jian-Yun Nie,
Xue-Qiang Zeng
|
|
Pages: 261-262 |
|
doi>10.1145/1099554.1099617 |
|
Full text: PDF
|
|
Latent Semantic Indexing (LSI) has been successfully applied to information retrieval and text classification. However, when LSI is used for classification, some important features of small classes may be ignored because of their small feature values. To solve this problem, we propose the latent semantic classification (LSC) model, which extends the LSI model in the following way: the classification information of the training documents is introduced into the latent semantic structure via a second set of latent variables, so that both indexing and classification information can be taken into account during the classification process. Our experiments on Reuters show that the new model performs better than existing classification methods such as kNN and SVM.
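For context, here is a minimal latent-semantic-indexing sketch using a truncated SVD of the term-document matrix; it illustrates only the LSI starting point, not the classification extension (LSC) proposed in the paper, and the toy matrix is made up.

```python
import numpy as np

def lsi_project(term_doc_matrix, k):
    """Project documents into a k-dimensional latent semantic space
    via a truncated SVD of the term-document matrix (terms x docs)."""
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Documents are represented by the top-k right singular vectors,
    # scaled by the corresponding singular values.
    return (np.diag(s[:k]) @ vt[:k, :]).T   # shape: (docs, k)

if __name__ == "__main__":
    # Toy term-document counts: 4 terms x 3 documents.
    x = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 1, 2]], dtype=float)
    docs_k = lsi_project(x, k=2)
    print(docs_k.shape)   # (3, 2)
```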
|
|
|
Supporting ranked search in parallel search cluster networks |
| |
Fang Xiong,
Qiong Luo,
Dyce Jing Zhao
|
|
Pages: 263-264 |
|
doi>10.1145/1099554.1099618 |
|
Full text: PDF
|
|
We investigate how to support ranked keyword search in a Parallel Search Cluster Network, which is a newly proposed peer-to-peer network overlay. In particular, we study how to efficiently acquire and distribute the global information required by ranked keyword search by taking advantage of the architectural features of PSCNs.
|
|
|
Web opinion poll: extracting people's view by impression mining from the web |
| |
Tadahiko Kumamoto,
Katsumi Tanaka
|
|
Pages: 265-266 |
|
doi>10.1145/1099554.1099619 |
|
Full text: PDF
|
|
|
|
|
Statistical relationship determination in automatic thesaurus construction |
| |
Libo Chen,
Peter Fankhauser,
Ulrich Thiel,
Thomas Kamps
|
|
Pages: 267-268 |
|
doi>10.1145/1099554.1099620 |
|
Full text: PDF
|
|
Statistical relationship determination among terms is one of the key issues in automatic thesaurus construction. We systematically analyze existing relevant approaches based on their underlying probabilistic assumptions, and propose a combined approach that overcomes their limitations.
|
|
|
Model-guided information discovery for intelligence analysis |
| |
Rafael Alonso,
Hua Li
|
|
Pages: 269-270 |
|
doi>10.1145/1099554.1099621 |
|
Full text: PDF
|
|
Intelligence analysis can be aided and guided by models of the analysts' interests and priorities. This paper describes our approach to analyst modeling as part of the Ant CAFÉ project, in which analyst models are used to guide the searching behavior of a swarm of intelligent agents. Structural elements of our analyst model include concepts and relations, both of which help to capture the analyst's current interests and concerns. In addition, the concepts and relationships have associated scalar parameters that provide a quantitative measure of the user's level of interest. We have developed algorithms for dynamically adapting the weights and evolving the elements of the model itself. To evaluate these algorithms we have built an Analyst Modeling Environment workbench. We have tested our approach on this workbench using traces generated by human analysts, and have demonstrated improvements over current state-of-the-art search engines.
|
|
|
Biasing web search results for topic familiarity |
| |
Giridhar Kumaran,
Rosie Jones,
Omid Madani
|
|
Pages: 271-272 |
|
doi>10.1145/1099554.1099622 |
|
Full text: PDF
|
|
Depending on a web searcher's familiarity with a query's target topic, it may be more appropriate to show her introductory or advanced documents. The TREC HARD [1] track defined topic familiarity as meta-data associated with a user's query. We instead define a user-independent and query-independent model of the topic familiarity required to read a document, so that it can be matched to a given user in response to a query. An introductory web page is defined as "a web page that does not presuppose any background knowledge of the topic it is on, and to an extent introduces or defines the key terms in the topic," while an advanced web page is defined as "a web page that assumes sufficient background knowledge of the topic it is on, and familiarity with the key technical/important terms in the topic, and potentially builds on them." We develop a method for biasing the initial mix of documents returned by a search engine to increase the number of documents of the desired familiarity level up to position 5 and up to position 10. Our method involves building a supervised text classifier, incorporating features based on reading level, the distribution of stop-words in the text, and non-text features such as average line length. Using this familiarity classifier, we achieve statistically significant improvements at reranking the result set to show introductory documents higher up the ranked list. Our classifier can be seamlessly integrated into current search engine technology without major modifications to existing architectures.
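The sketch below illustrates the kind of cheap document-level features the abstract mentions (stop-word distribution, a crude reading-level proxy, average line length). The exact feature set and classifier in the paper may differ, and the stop-word list here is a tiny hypothetical sample.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}  # tiny sample

def familiarity_features(text):
    """Simple lexical features that could feed a familiarity classifier."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    lines = [l for l in text.splitlines() if l.strip()]
    n_words = max(len(words), 1)
    avg_word_len = sum(len(w) for w in words) / n_words      # crude reading-level proxy
    stopword_ratio = sum(w in STOPWORDS for w in words) / n_words
    avg_line_len = (sum(len(l) for l in lines) / len(lines)) if lines else 0.0
    return {
        "avg_word_length": avg_word_len,
        "stopword_ratio": stopword_ratio,
        "avg_line_length": avg_line_len,
    }

if __name__ == "__main__":
    print(familiarity_features("An introduction to search engines.\nIt is short and simple."))
```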
|
|
|
Accurate language model estimation with document expansion |
| |
Tao Tao,
Xuanhui Wang,
Qiaozhu Mei,
ChengXiang Zhai
|
|
Pages: 273-274 |
|
doi>10.1145/1099554.1099623 |
|
Full text: PDF
|
|
|
|
|
Mining community structure of named entities from free text |
| |
Xin Li,
Bing Liu
|
|
Pages: 275-276 |
|
doi>10.1145/1099554.1099624 |
|
Full text: PDF
|
|
Although community discovery has been studied extensively in the Web environment, limited research has been done for free text. Co-occurrence of words and entities in sentences and documents usually implies connections among them. In this paper, we investigate the co-occurrences of named entities in text and mine communities among these entities. We show that identifying communities from free text can be transformed into a graph clustering problem. A hierarchical clustering algorithm is then proposed. Our experiments show that the algorithm is effective at discovering named entity communities from text documents.
|
|
|
A practical system of keyphrase extraction for web pages |
| |
Mo Chen,
Jian-Tao Sun,
Hua-Jun Zeng,
Kwok-Yan Lam
|
|
Pages: 277-278 |
|
doi>10.1145/1099554.1099625 |
|
Full text: PDF
|
|
Keyphrases can help Web users grasp the main topic(s) of a Web page. We present a practical system for automatic keyphrase extraction from Web pages. In this system, a regression model is first trained on a set of human-labeled documents and then used to extract keyphrases from new pages automatically. This paper makes three contributions. First, the structure information in a Web page is exploited for the keyphrase extraction task. Second, the query log data associated with a Web page, collected by a search engine server, are used to help keyphrase extraction. Third, we put forward a method for evaluating the similarity of phrases.
|
|
|
Incremental stock time series data delivery and visualization |
| |
Tak-chung Fu,
Fu-lai Chung,
Pui-ying Tang,
Robert Luk,
Chak-man Ng
|
|
Pages: 279-280 |
|
doi>10.1145/1099554.1099626 |
|
Full text: PDF
|
|
SB-Tree is a binary tree data structure proposed to represent time series according to the importance of data points. Its use in stock data management is distinguished by preserving the attribute values of critical data points, retrieving time series data according to the importance of data points, and facilitating multi-resolution time series retrieval. As new stock data arrive continuously, an effective updating mechanism for the SB-Tree is needed. In this paper, a study of different updating approaches is reported. Three families of updating methods are proposed: periodic rebuild, batch update, and point-by-point update. Their efficiency, effectiveness, and characteristics are compared and reported.
|
|
|
Generating better concept hierarchies using automatic document classification |
| |
Razvan Stefan Bot,
Yi-fang Brook Wu,
Xin Chen,
Quanzhi Li
|
|
Pages: 281-282 |
|
doi>10.1145/1099554.1099627 |
|
Full text: PDF
|
|
This paper presents a hybrid concept hierarchy development technique for web documents returned by a meta-search engine. The aim of the technique is to separate the initially retrieved documents into topic-oriented categories prior to the actual concept hierarchy generation. The topical categories correspond to different semantic aspects of the query. This is done using 1-of-n automatic document classification on the initial set of returned documents. Then an individual topical concept hierarchy is automatically generated within each of the resulting categories. Both steps are executed on the fly at retrieval time. Due to the efficiency constraints imposed by the web retrieval context, the algorithm uses only document snippets (rather than full web pages) for both document classification and concept hierarchy generation. Experimental results show that the algorithm improves the quality of the concept hierarchy presented to the searcher while keeping the efficiency parameters within reasonable bounds.
|
|
|
Domain-specific keyphrase extraction |
| |
Yi-fang Brook Wu,
Quanzhi Li,
Razvan Stefan Bot,
Xin Chen
|
|
Pages: 283-284 |
|
doi>10.1145/1099554.1099628 |
|
Full text: PDF
|
|
Document keyphrases provide semantic metadata that characterize documents and give an overview of their content. They can be used in many text-mining and knowledge-management applications. This paper describes a Keyphrase Identification Program (KIP), which extracts document keyphrases by using prior positive samples of human-identified domain keyphrases to assign weights to candidate keyphrases. The logic of our algorithm is: the more keywords a candidate keyphrase contains and the more significant these keywords are, the more likely the candidate phrase is a keyphrase. To obtain prior positive inputs, KIP first populates its glossary database using manually identified keyphrases and keywords. It then examines the composition of all noun phrases of a document, looks up the database, and calculates scores for these noun phrases. The ones with higher scores are extracted as keyphrases.
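A minimal sketch of the stated scoring logic (more glossary keywords, and more significant ones, yield higher scores); the weights and candidate phrases below are made up, and the paper's actual scoring formula may differ.

```python
def keyphrase_score(candidate, keyword_weights):
    """Score a candidate noun phrase by summing the glossary weights of the
    keywords it contains; unknown words contribute nothing."""
    return sum(keyword_weights.get(word, 0.0) for word in candidate.lower().split())

if __name__ == "__main__":
    # Hypothetical glossary weights learned from human-identified keyphrases.
    weights = {"information": 1.0, "retrieval": 2.0, "keyphrase": 3.0, "extraction": 2.5}
    candidates = ["information retrieval", "keyphrase extraction system", "the weather"]
    ranked = sorted(candidates, key=lambda c: keyphrase_score(c, weights), reverse=True)
    for c in ranked:
        print(f"{keyphrase_score(c, weights):.1f}  {c}")
```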
|
|
|
An RSA-based time-bound hierarchical key assignment scheme for electronic article subscription |
| |
Jyh-haw Yeh
|
|
Pages: 285-286 |
|
doi>10.1145/1099554.1099629 |
|
Full text: PDF
|
|
The time-bound hierarchical key assignment problem is to assign time-sensitive keys to security classes in a partially ordered hierarchy so that legal data accesses among classes can be enforced. Two time-bound hierarchical key assignment schemes have been proposed in the literature, but both were proved insecure against collusive attacks. In this paper, we propose an RSA-based time-bound hierarchical key assignment scheme and describe a possible application. The security analysis shows that the new scheme is secure against collusive attacks.
|
|
|
Maximal termsets as a query structuring mechanism |
| |
Bruno Pôssas,
Nivio Ziviani,
Berthier Ribeiro-Neto,
Wagner Meira, Jr.
|
|
Pages: 287-288 |
|
doi>10.1145/1099554.1099630 |
|
Full text: PDF
|
|
Search engines process queries conjunctively to restrict the size of the answer set. Further, it is not rare to observe a mismatch between the vocabulary used in the text of Web pages and the terms used to compose Web queries. The combination of these two features might lead to irrelevant query results, particularly in the case of more specific queries composed of three or more terms. To deal with this problem we propose a new technique for automatically structuring Web queries as a set of smaller subqueries. To select representative subqueries we use information on their distributions in the document collection. This can be adequately modeled using the concept of maximal termsets derived from the formalism of association rule theory. Experimentation shows that our technique leads to improved results. For the TREC-8 test collection, for instance, our technique led to gains in average precision of roughly 28% relative to a BM25 ranking formula.
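To make the idea concrete, the sketch below derives candidate subqueries as the maximal termsets of a query that are frequent in the collection (i.e., frequent termsets not contained in any other frequent termset). It is a naive enumeration suitable only for short queries, not the paper's algorithm, and the toy collection is hypothetical.

```python
from itertools import combinations

def frequent_termsets(query_terms, documents, min_support):
    """All subsets of the query terms that occur together in at least
    `min_support` documents (each document is a set of terms)."""
    frequent = []
    for size in range(1, len(query_terms) + 1):
        for subset in combinations(sorted(query_terms), size):
            support = sum(1 for d in documents if set(subset) <= d)
            if support >= min_support:
                frequent.append(frozenset(subset))
    return frequent

def maximal_termsets(termsets):
    """Keep only termsets not strictly contained in another frequent termset."""
    return [t for t in termsets
            if not any(t < other for other in termsets)]

if __name__ == "__main__":
    docs = [{"web", "query", "structuring"},
            {"web", "query", "ranking"},
            {"query", "structuring"}]
    query = {"web", "query", "structuring", "ranking"}
    print(maximal_termsets(frequent_termsets(query, docs, min_support=2)))
```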
|
|
|
Accurately extracting coherent relevant passages using hidden Markov models |
| |
Jing Jiang,
ChengXiang Zhai
|
|
Pages: 289-290 |
|
doi>10.1145/1099554.1099631 |
|
Full text: PDF
|
|
In this paper, we present a principled method for accurately extracting coherent relevant passages of variable lengths using HMMs. We show that with appropriate parameter estimation, the HMM method outperforms a number of strong baseline methods on two data sets.
|
|
|
Structural features in content oriented XML retrieval |
| |
Georgina Ramírez,
Thijs Westerveld,
Arjen P. de Vries
|
|
Pages: 291-292 |
|
doi>10.1145/1099554.1099632 |
|
Full text: PDF
|
|
The structural features of XML components are an extra source of information that should be used in content-oriented retrieval tasks on this type of document. In this paper we explore one of the structural features of the INEX collection [1] that could be used in content-oriented search. We analyse the gain this knowledge could add to the performance of an information retrieval system and present a first approach to how this structural information could be extracted from a relevance feedback process and used as priors in a language modelling framework.
|
|
|
Text document clustering based on frequent word sequences |
| |
Yanjun Li,
Soon M. Chung
|
|
Pages: 293-294 |
|
doi>10.1145/1099554.1099633 |
|
Full text: PDF
|
|
In this paper, we propose a new text clustering algorithm, named Clustering based on Frequent Word Sequences (CFWS). A word sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. In the past, the vector space model was commonly used for information retrieval, but it treats documents as bags of words, ignoring the sequential pattern of word occurrences in the documents. However, the meaning of natural languages strongly depends on word sequences, and frequent word sequences can provide compact and valuable information about the text database. The bisecting k-means and FIHC algorithms are evaluated on text clustering performance and compared with the proposed CFWS algorithm. Our evaluation shows that CFWS performs much better.
|
|
|
Information retrieval and machine learning for probabilistic schema matching |
| |
Henrik Nottelmann,
Umberto Straccia
|
|
Pages: 295-296 |
|
doi>10.1145/1099554.1099634 |
|
Full text: PDF
|
|
Schema matching is the problem of finding correspondences (mapping rules, e.g. logical formulae) between heterogeneous schemas. This paper presents a probabilistic framework, called sPLMap, for automatically learning schema mapping rules. Similar to LSD, different techniques, mostly from the IR field, are combined. Our approach, however, is also able to give a probabilistic interpretation of the prediction weights of the candidates, and to select the rule set with the highest matching probability.
|
|
|
Learning to summarise XML documents using content and structure |
| |
Massih R. Amini,
Anastasios Tombros,
Nicolas Usunier,
Mounia Lalmas,
Patrick Gallinari
|
|
Pages: 297-298 |
|
doi>10.1145/1099554.1099635 |
|
Full text: PDF
|
|
Documents formatted in eXtensible Markup Language (XML) are becoming increasingly available in collections of various document types. In this paper, we present an approach for the summarisation of XML documents. The novelty of this approach lies in that it is based on features not only from the content of documents, but also from their logical structure. We follow a machine-learning-based, sentence-extraction summarisation technique. To find which features are most effective for producing summaries, this approach views sentence extraction as an ordering task. We evaluated our summarisation model using the INEX dataset. The results demonstrate that the inclusion of features from the logical structure of documents increases the effectiveness of the summariser, and that the learnable system is effective and well-suited to the task of summarisation in the context of XML documents.
|
|
|
Trust-based collaborative filtering |
| |
Jianshu Weng,
Chunyan Miao,
Angela Goh,
Dongtao Li
|
|
Pages: 299-300 |
|
doi>10.1145/1099554.1099636 |
|
Full text: PDF
|
|
|
|
|
The earth mover's distance as a semantic measure for document similarity |
| |
Xiaojun Wan,
Yuxin Peng
|
|
Pages: 301-302 |
|
doi>10.1145/1099554.1099637 |
|
Full text: PDF
|
|
Different words are usually assumed to be semantically independent in most existing similarity measures, which is often not true in practice. The semantic relatedness between words cannot be conveniently employed in the existing measures. We propose a novel similarity measure based on the earth mover's distance (EMD). In the proposed measure, the semantic distances between words are computed based on the electronic lexical database WordNet, and the EMD is then employed to calculate document similarity with a many-to-many matching between words. Experiments and results demonstrate the effectiveness of the proposed similarity measure.
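A minimal sketch of document similarity via the earth mover's distance, assuming both documents are represented as normalized word distributions and a word-to-word ground-distance matrix (e.g., derived from WordNet) is already given. It solves the underlying transportation problem with scipy's linear programming routine and is illustrative rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def emd(weights_a, weights_b, dist):
    """Earth mover's distance between two normalized weight vectors,
    given a pairwise ground-distance matrix dist[i, j]."""
    n, m = len(weights_a), len(weights_b)
    # Variables: flow f[i, j] >= 0, flattened row-major.
    c = dist.reshape(n * m)
    a_eq, b_eq = [], []
    for i in range(n):                       # each source ships all its weight
        row = np.zeros(n * m)
        row[i * m:(i + 1) * m] = 1.0
        a_eq.append(row); b_eq.append(weights_a[i])
    for j in range(m):                       # each sink receives all its weight
        col = np.zeros(n * m)
        col[j::m] = 1.0
        a_eq.append(col); b_eq.append(weights_b[j])
    res = linprog(c, A_eq=np.array(a_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

if __name__ == "__main__":
    # Two toy documents over the vocabulary [car, automobile, banana].
    doc_a = np.array([1.0, 0.0, 0.0])
    doc_b = np.array([0.0, 1.0, 0.0])
    # Hypothetical WordNet-style distances: car and automobile are near-synonyms.
    d = np.array([[0.0, 0.1, 1.0],
                  [0.1, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
    print(emd(doc_a, doc_b, d))   # ~0.1
```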
|
|
|
Slicing*-tree based web page transformation for small displays |
| |
Xiangye Xiao,
Qiong Luo,
Dan Hong,
Hongbo Fu
|
|
Pages: 303-304 |
|
doi>10.1145/1099554.1099638 |
|
Full text: PDF
|
|
We propose a new Web page transformation method for browsing on mobile devices with small displays. In our approach, an original web page that does not fit into the screen is transformed into a set of pages, each of which fits into the screen. This transformation is done by slicing the original page. The resulting set of transformed pages forms a multi-level tree structure, called a slicing*-tree, in which an internal node consists of a thumbnail image with hyperlinks and a leaf node is a block from the original web page. Our slicing*-tree based Web page transformation eases Web browsing on small displays by providing screen-fitting visual context and reducing page scrolling effort.
|
|
|
An evaluation of evolved term-weighting schemes in information retrieval |
| |
Ronan Cummins,
Colm O'Riordan
|
|
Pages: 305-306 |
|
doi>10.1145/1099554.1099639 |
|
Full text: PDF
|
|
This paper presents an evaluation of evolved term-weighting schemes on short, medium and long TREC queries. A previously evolved global (collection-wide) term-weighting scheme is evaluated on unseen TREC data and is shown to increase mean average precision over idf. A local (within-document) evolved term-weighting scheme is presented which is dependent on the best performing global scheme. The full evolved scheme (i.e., the combined local and global scheme) is compared to both the BM25 scheme and the Pivoted Normalisation scheme. Our results show that the local evolved solution does not perform well on some collections due to its document normalisation properties, and we conclude that Okapi-tf can be tuned to interact effectively with the evolved global weighting scheme presented, increasing mean average precision over the standard BM25 scheme.
|
|
|
Web-centric language models |
| |
Jaap Kamps
|
|
Pages: 307-308 |
|
doi>10.1145/1099554.1099640 |
|
Full text: PDF
|
|
We investigate language models for informational and navigational web search. Retrieval on the web is a task that differs substantially from ordinary ad hoc retrieval. We perform an analysis of the prior probability of relevance for a wide range of non-content features, shedding further light on the importance of non-content features for web retrieval. This directly explains the success or failure of various techniques, e.g., why the link topology is particularly helpful for singling out important sites. Language models can naturally incorporate multiple document representations, as well as non-content information. For the former, we employ mixture language models based on document full-text, incoming anchor-text, and document titles. For the latter, we study a range of priors based on document length, URL structure, and link topology. We look at three types of topics -- distillation, home page, and named page -- as well as a mixed query set. We find that the mixture models lead to considerable improvements of retrieval effectiveness for all topic types. The web-centric priors generally lead to further improvement of retrieval effectiveness.
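As a sketch of one standard way such a field mixture with a document prior is scored in the language-modeling framework (the paper's exact estimation details may differ), with fields f ranging over full-text, anchor-text, and title:

```latex
P(d \mid q) \;\propto\; P(d)\,
  \prod_{t \in q} \sum_{f \in \{\text{text},\,\text{anchor},\,\text{title}\}}
    \lambda_f \, P(t \mid \theta_{d,f}),
\qquad \sum_{f} \lambda_f = 1
```

Here P(d) is the non-content prior (e.g., based on document length, URL structure, or link topology) and P(t | θ_{d,f}) is the smoothed language model of field f of document d.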
|
|
|
Using RankBoost to compare retrieval systems |
| |
Huyen-Trang Vu,
Patrick Gallinari
|
|
Pages: 309-310 |
|
doi>10.1145/1099554.1099641 |
|
Full text: PDF
|
|
This paper presents a new pooling method for constructing the assessment sets used in the evaluation of retrieval systems. Our proposal is based on RankBoost, a machine learning voting algorithm. It leads to smaller pools than classical pooling and thus reduces the manual assessment workload for building test collections. Experimental results obtained on an XML document collection demonstrate the effectiveness of the approach according to different evaluation criteria.
|
|
|
Static score bucketing in inverted indexes |
| |
Chavdar Botev,
Nadav Eiron,
Marcus Fontoura,
Ning Li,
Eugene Shekita
|
|
Pages: 311-312 |
|
doi>10.1145/1099554.1099642 |
|
Full text: PDF
|
|
Maintaining strict static score order of inverted lists is a heuristic used by search engines to improve the quality of query results when the entire inverted lists cannot be processed. This heuristic, however, increases the cost of index generation and requires complex index build algorithms. In this paper, we study a new index organization based on static score bucketing. We show that this new technique significantly improves index build performance while having minimal impact on the quality of search results.
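A minimal sketch of the bucketing idea: instead of keeping each posting list strictly sorted by static score, documents are appended to one of a few quantized score buckets, and the buckets are emitted from highest to lowest at query time. The bucket boundaries and posting payload here are made up, not the paper's configuration.

```python
import bisect
from collections import defaultdict

class BucketedPostings:
    """Posting list kept in static-score buckets rather than strict score order."""

    def __init__(self, boundaries=(0.2, 0.4, 0.6, 0.8)):
        self.boundaries = boundaries          # ascending static-score cut points
        self.buckets = defaultdict(list)      # bucket index -> postings (append-only)

    def add(self, doc_id, static_score):
        bucket = bisect.bisect_right(self.boundaries, static_score)
        self.buckets[bucket].append(doc_id)   # O(1) append, no global re-sort

    def iter_by_score(self):
        """Yield postings from the highest-score bucket downwards."""
        for bucket in sorted(self.buckets, reverse=True):
            yield from self.buckets[bucket]

if __name__ == "__main__":
    plist = BucketedPostings()
    for doc, score in [(1, 0.95), (2, 0.30), (3, 0.85), (4, 0.10)]:
        plist.add(doc, score)
    print(list(plist.iter_by_score()))   # [1, 3, 2, 4]
```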
|
|
|
Scalable ranking for preference queries |
| |
Ying Feng,
Divyakant Agrawal,
Amr El Abbadi,
Ambuj Singh
|
|
Pages: 313-314 |
|
doi>10.1145/1099554.1099643 |
|
Full text: PDF
|
|
Top-k preference queries with multiple attributes are critical for decision-making applications. Previous research has concentrated on improving the computational efficiency mainly by using novel index structures and search strategies. Since current applications need to scale to terabytes of data and thousands of users, performance of such systems is strongly impacted by the amount of available memory. This paper proposes a scalable approach for memory-bounded top-k query processing.
|
|
|
Finding experts in community-based question-answering services |
| |
Xiaoyong Liu,
W. Bruce Croft,
Matthew Koll
|
|
Pages: 315-316 |
|
doi>10.1145/1099554.1099644 |
|
Full text: PDF
|
|
|
|
|
Indexing time vs. query time: trade-offs in dynamic information retrieval systems |
| |
Stefan Büttcher,
Charles L. A. Clarke
|
|
Pages: 317-318 |
|
doi>10.1145/1099554.1099645 |
|
Full text: PDF
|
|
We examine issues in the design of fully dynamic information retrieval systems supporting both document insertions and deletions. The two main components of such a system, index maintenance and query processing, affect each other, as high query performance is usually paid for by additional work during update operations. Two aspects of the system -- incremental updates and garbage collection for delayed document deletions -- are discussed, with a focus on the respective indexing vs. query performance trade-offs. Depending on the relative number of queries and update operations, different strategies lead to optimal overall performance.
|
|
|
Poison pills: harmful relevant documents in feedback |
| |
Egidio Terra,
Robert Warren
|
|
Pages: 319-320 |
|
doi>10.1145/1099554.1099646 |
|
Full text: PDF
|
|
|
|
|
Discretization based learning approach to information retrieval |
| |
Dmitri Roussinov,
Weiguo Fan,
Fernando A. Das Neves
|
|
Pages: 321-322 |
|
doi>10.1145/1099554.1099647 |
|
Full text: PDF
|
|
We have designed a representation scheme based on a discrete representation of the document ranking function that is capable of reproducing and enhancing the properties of such popular ranking functions as tf.idf, BM25, or those based on language models. Our tests demonstrate that our approach can reach the performance of the best known scoring functions solely through training, without using any known heuristic or analytic formulas.
|
|
|
Semantic verification for fact seeking engines |
| |
Dmitri Roussinov,
Weiguo Fan,
Fernando A. Das Neves
|
|
Pages: 323-324 |
|
doi>10.1145/1099554.1099648 |
|
Full text: PDF
|
|
We present the architecture of our web question answering (fact seeking) system and introduce a novel algorithm to validate the semantic categories of expected answers. When tested on the questions used in prior research, our system demonstrated performance comparable to current state-of-the-art systems. Our semantic verification algorithm improved the accuracy of answers for the affected questions by 30%.
|
|
|
Fast webpage classification using URL features |
| |
Min-Yen Kan,
Hoang Oanh Nguyen Thi
|
|
Pages: 325-326 |
|
doi>10.1145/1099554.1099649 |
|
Full text: PDF
|
|
We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is faster than typical web page classification, as the pages do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness on two standardized domains. Our results show that in certain scenarios, URL-based methods approach the performance of current state-of-the-art full-text and link-based methods.
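A small sketch of URL-only feature extraction of the kind described (segmentation into chunks plus simple component, sequential, and orthographic features). The segmentation rules and feature names here are illustrative assumptions, not the paper's exact feature set.

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """Bag of simple features extracted from the URL alone (no page fetch)."""
    parsed = urlparse(url)
    host_chunks = [c for c in parsed.netloc.lower().split(".") if c]
    path_chunks = [c for c in re.split(r"[^a-z0-9]+", parsed.path.lower()) if c]
    features = {}
    for c in host_chunks:
        features[f"host:{c}"] = 1                 # component features
    for i, c in enumerate(path_chunks):
        features[f"path:{c}"] = 1
        features[f"path[{i}]:{c}"] = 1            # sequential (position-aware) features
        if c.isdigit():
            features["has_numeric_chunk"] = 1     # orthographic features
    features["depth"] = len(path_chunks)
    return features

if __name__ == "__main__":
    print(url_features("http://www.example.edu/courses/cs101/syllabus.html"))
```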
|
|
|
On the estimation of frequent itemsets for data streams: theory and experiments |
| |
Pierre-Alain Laur,
Richard Nock,
Jean-Emile Symphor,
Pascal Poncelet
|
|
Pages: 327-328 |
|
doi>10.1145/1099554.1099650 |
|
Full text: PDF
|
|
In this paper, we devise a method for estimating the true support of itemsets on data streams, with the objective of maximizing one chosen criterion among {precision, recall} while keeping the degradation of the other criterion as small as possible. We discuss the strengths, weaknesses and range of applicability of this method, which relies on conventional uniform convergence results yet guarantees statistical optimality from different standpoints.
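As a sketch of the kind of uniform-convergence guarantee such methods rely on (the paper's precise bound may differ): by Hoeffding's inequality, the observed support p̂ of an itemset over m stream transactions satisfies, with probability at least 1 - δ,

```latex
|\hat{p} - p| \;\le\; \varepsilon,
\qquad
\varepsilon \;=\; \sqrt{\frac{\ln(2/\delta)}{2m}}
```

Reporting itemsets with p̂ ≥ θ - ε (for a support threshold θ) therefore favours recall, since few truly frequent itemsets are missed, while requiring p̂ ≥ θ + ε favours precision.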
|
|
|
Unapparent information revelation: a concept chain graph approach |
| |
Rohini K. Srihari,
Sudarshan Lamkhede,
Anmol Bhasin
|
|
Pages: 329-330 |
|
doi>10.1145/1099554.1099651 |
|
Full text: PDF
|
|
Information generated by multiple authors working independently at different times, when analyzed synergistically, reveals more information than is apparent. For example, a traditional search for connections between the trucking industry and Iraqi banks may not produce any documents mentioning both. However, a search that follows trails of associations across documents may suggest a connection between an auto parts manufacturer who exports to Iraq and an Iraqi bank providing loans to buy cars. The work described here extends link analysis based on named entities and labeled relationships to general concepts and unnamed associations. Unapparent Information Revelation involves finding chains connecting concepts across documents; it uses a new representation formalism called Concept Chain Graphs.
|
|
|
Document quality models for web ad hoc retrieval |
| |
Yun Zhou,
W. Bruce Croft
|
|
Pages: 331-332 |
|
doi>10.1145/1099554.1099652 |
|
Full text: PDF
|
|
The quality of document content, an issue usually ignored in the traditional ad hoc retrieval task, is critical for Web search. Web pages have a huge variation in quality relative to, for example, newswire articles. To address this problem, we propose a document quality language model approach that is incorporated into the basic query likelihood retrieval model in the form of a prior probability. Our results demonstrate that, on average, the new model is significantly better than the baseline (query likelihood model) in terms of precision at the top ranks.
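In the language-modeling framework, such a prior typically enters query-likelihood ranking as follows (a standard formulation; the paper's quality model supplies the estimate of P(d)):

```latex
\operatorname{score}(d, q) \;=\; \log P(d) \;+\; \sum_{t \in q} \log P(t \mid \theta_d)
```

where P(d) is the document prior derived from the quality model and P(t | θ_d) is the smoothed document language model.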
|
|
|
Cooperative caching for k-NN search in ad hoc networks |
| |
Bo Yang,
Ali R. Hurson
|
|
Pages: 333-334 |
|
doi>10.1145/1099554.1099653 |
|
Full text: PDF
|
|
Mobile ad hoc networks have multiple limitations in performing similarity-based nearest neighbor search - dynamic topology, frequent disconnections, limited power, and restricted bandwidth. Cooperative caching is an effective technique to reduce network traffic and increase accessibility. In this paper, we propose to solve the k-nearest-neighbor search problem in ad hoc networks using a semantic-based caching scheme which reflects the content distribution in the network. The proposed scheme describes the semantic similarity among data objects using constraints, and employs cooperative caching to estimate the content distribution in the network. The query resolution based on the cooperative caching scheme is non-flooding and hierarchy-free.
|
|
|
A new framework to combine descriptors for content-based image retrieval |
| |
Ricardo da S. Torres,
Alexandre X. Falcão,
Baoping Zhang,
Weiguo Fan,
Edward A. Fox,
Marcos André Gonçalves,
Pavel Calado
|
|
Pages: 335-336 |
|
doi>10.1145/1099554.1099654 |
|
Full text: PDF
|
|
In this paper, we propose a novel framework using Genetic Programming to combine image database descriptors for content-based image retrieval (CBIR). Our framework is validated through several experiments involving two image databases and specific domains, where the images are retrieved based on the shape of their objects.
|
|
|
A structure-sensitive framework for text categorization |
| |
Ganesh Ramakrishnan,
Deepa Paranjpe,
Byron Dom
|
|
Pages: 337-338 |
|
doi>10.1145/1099554.1099655 |
|
Full text: PDF
|
|
This paper presents a framework called Structure Sensitive CATegorization (SSCAT), which exploits document structure for improved categorization. There are two parts to this framework. (1) Documents often have layout structure, such that logically coherent text is grouped into fields using some mark-up language. We use a log-linear model that associates one or more features with each field. Weights associated with the field features are learnt from training data, and these weights quantify the per-class importance of the field features in determining the category of the document. (2) We employ a technique that exploits the parse trees of fields that are phrasal constructs, such as the title, and associates weights with words in these constructs, boosting the weights of important words called focus words. These weights are learnt from example instances of phrasal constructs marked with the corresponding focus words. The learning is accomplished by training a classifier that uses linguistic features obtained from the text's parse structure. The weighted words, in fields with phrasal constructs, are used to obtain features for the corresponding fields in the overall framework. SSCAT was tested on the supervised categorization of over one million products from Yahoo!'s on-line shopping data. With an accuracy of over 90%, our classifier outperforms Naive Bayes and Support Vector Machines. This not only shows the effectiveness of SSCAT but also strengthens our belief that linguistic features based on natural language structure can improve tasks such as text categorization.
|
|
|
Efficient and effective server-sided distributed clustering |
| |
Hans-Peter Kriegel,
Martin Pfeifle
|
|
Pages: 339-340 |
|
doi>10.1145/1099554.1099656 |
|
Full text: PDF
|
|
Clustering has become an increasingly important task in modern application domains where the data are originally located at different sites. In order to create a central clustering, all clients have to transmit their data to a central server. Due to technical limitations and security aspects, often only vague object descriptions are available at the central site. The server then has to carry out the clustering based on vague and uncertain data. In a recent paper, an approach for clustering uncertain data was proposed based on the concept of medoid clusterings. The idea of this approach is to first create several sample clusterings. Then, based on suitable distance functions between clusterings, the most average clustering, i.e. the medoid clustering, is determined. In this paper, we extend this approach to partitioning clustering algorithms and propose to compute a centroid clustering based on these input sample clusterings. These centroid clusterings are new artificial clusterings which minimize the distance to all the sample clusterings.
|
|
|
Evaluation of a MCA-based approach to organize data cubes |
| |
Riadh Ben Messaoud,
Omar Boussaid,
Sabine Loudcher Rabaséda
|
|
Pages: 341-342 |
|
doi>10.1145/1099554.1099657 |
|
Full text: PDF
|
|
In the OLAP context, exploration of huge and sparse data cubes is a tedious task that does not always lead to efficient results. We propose to use Multiple Correspondence Analysis (MCA) in order to enhance data cube representations and make them more suitable for visualization and thus easier to analyze. We also provide an original quality criterion to measure the relevance of the obtained data representations. Experimental results obtained on real data samples show the interest and the efficiency of our approach.
|
|
|
Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors |
| |
Francisco M. Couto,
Mário J. Silva,
Pedro M. Coutinho
|
|
Pages: 343-344 |
|
doi>10.1145/1099554.1099658 |
|
Full text: PDF
|
|
Many bioinformatics applications would benefit from comparing proteins based on their biological role rather than their sequence. In most biological databases, proteins are already annotated with ontology terms. Previous studies identified a correlation between the sequence similarity and the semantic similarity of proteins, where the semantic similarity of proteins was computed from their annotated GO terms. However, proteins sharing a biological role do not necessarily have a similar sequence. This paper introduces our study of the correlation between GO and family similarity. Family similarity overcomes some of the limitations of sequence similarity, and we obtained a strong correlation between GO and family similarity. Additionally, this paper introduces GraSM, a novel method that uses all the information in the graph structure of the GO, instead of considering it as a hierarchical tree. When calculating the semantic similarity of two concepts, GraSM selects the disjunctive common ancestors rather than only the most informative common ancestor. GraSM produced a higher family-similarity correlation than the original semantic similarity measures.
|
|
|
Extracting a website's content structure from its link structure |
| |
Nan Liu,
Christopher C. Yang
|
|
Pages: 345-346 |
|
doi>10.1145/1099554.1099660 |
|
Full text: PDF
|
|
Hierarchical models are commonly used to organize a Website's content. A Website's content structure can be represented by a topic hierarchy, a directed tree rooted at a Website's homepage in which the vertices and edges correspond to Web pages and hyperlinks. In this work, we propose an algorithm for extracting a Website's topic hierarchy from its link structure. The proposed algorithm consists of a construction stage and a refining stage, in which we analyze the semantic relationships between web pages based on link structure, web page content and directory structure. We've done extensive experiments using different Websites and obtained very promising results. expand
|
|
|
Frequent pattern discovery with memory constraint |
| |
Kun-Ta Chuang,
Ming-Syan Chen
|
|
Pages: 347-348 |
|
doi>10.1145/1099554.1099659 |
|
Full text: PDF
|
|
We explore in this paper a practical and interesting mining task: retrieving frequent itemsets under a memory constraint. As opposed to most previous works, which concentrate on improving mining efficiency or on reducing memory usage on a best-effort basis, we make a first attempt to explicitly constrain the amount of memory that can be utilized when mining frequent itemsets.
|
|
|
Improving intranet search-engines using context information from databases |
| |
Christoph Mangold,
Holger Schwarz,
Bernhard Mitschang
|
|
Pages: 349-350 |
|
doi>10.1145/1099554.1099661 |
|
Full text: PDF
|
|
Information in enterprises comes in documents and databases. From a semantic viewpoint, both kinds of information are usually tightly connected. In this paper, we propose to enhance common search engines with contextual information retrieved from databases. We establish system requirements and anecdotally demonstrate how documents and database information can be represented as the nodes of a graph. Then, we give an example of how we exploit this graph information for document retrieval.
|
|
|
A new permutation approach for distributed association rule mining |
| |
Yiqun Huang,
Zhengding Lu,
Heping Hu
|
|
Pages: 351-352 |
|
doi>10.1145/1099554.1099662 |
|
Full text: PDF
|
|
Privacy preserving distributed data mining has become a promising research area. This paper addresses the problem of association rule mining where the global database is vertically partitioned. When transactions are distributed in different sites, scalar product is a feasible tool to discover frequent itemsets. We present a new protocol to compute scalar product between two parties with a permutation approach. We analyze the protocol in detail and demonstrate its effectiveness and high privacy properties, and compare it to other published protocols. expand
|
|
|
On off-topic access detection in information systems |
| |
Nazli Goharian,
Ling Ma
|
|
Pages: 353-354 |
|
doi>10.1145/1099554.1099663 |
|
Full text: PDF
|
|
We focus on detecting insider access violations to off-topic documents. Previously, we utilized information retrieval techniques, e.g., clustering and relevance feedback, to warn of potential misuse. For the relevance feedback approach, we minimize the indicative features needed for detection using data mining techniques. We show that the derived reduced feature subset achieves equivalent performance to that of the previously derived full set of features. expand
|
|
|
Privacy leakage in multi-relational databases via pattern based semi-supervised learning |
| |
Hui Xiong,
Michael Steinbach,
Vipin Kumar
|
|
Pages: 355-356 |
|
doi>10.1145/1099554.1099664 |
|
Full text: PDF
|
|
In multi-relational databases, a view, which is a context- and content-dependent subset of one or more tables (or other views), is often used to preserve privacy by hiding sensitive information. However, recent developments in data mining present a new challenge for database security even when traditional database security techniques, such as database access control, are employed. This paper presents a data mining framework using semi-supervised learning that demonstrates the potential for privacy leakage in multi-relational databases. Many different types of semi-supervised learning techniques, such as the K-nearest neighbor (KNN) method, can be used to demonstrate privacy leakage. However, we also introduce a new approach to semi-supervised learning, hyperclique pattern based semi-supervised learning (HPSL), which differs from traditional semi-supervised learning approaches in that it considers the similarity among groups of objects instead of only pairs of objects. Our experimental results show that both the KNN and HPSL methods have the ability to compromise database security, although HPSL is better at this privacy violation than the KNN method. expand
|
|
|
Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering |
| |
Yingbo Miao,
Vlado Kešelj,
Evangelos Milios
|
|
Pages: 357-358 |
|
doi>10.1145/1099554.1099665 |
|
Full text: PDF
|
|
We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, the documents are represented as vectors, in which each dimension corresponds to a word. We propose a document representation based on the most frequent character N-grams, with window size of up to 10 characters. We derive a new distance measure, which produces uniformly better results when compared to the word-based and term-based methods. The result becomes more significant in the light of the robustness of the N-gram method with no language-dependent preprocessing. Experiments on the performance of a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both word and term representations. The comparison between word and term representations depends on the data set and the selected dimensionality. expand
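A minimal sketch of a frequent-character-n-gram document representation in the spirit described above, using trigrams; the paper's specific distance measure is not reproduced, so a plain cosine distance stands in for it, and the profile size is an arbitrary choice.

    import math
    from collections import Counter

    def ngram_profile(text, n=3, top_k=500):
        # Most frequent overlapping character n-grams of a whitespace-normalized text.
        text = " ".join(text.lower().split())
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        return dict(counts.most_common(top_k))

    def cosine_distance(p, q):
        dot = sum(w * q.get(g, 0) for g, w in p.items())
        norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
        return 1.0 - (dot / norm if norm else 0.0)

    a = ngram_profile("Character n-grams need no language-dependent preprocessing.")
    b = ngram_profile("N-gram profiles of characters avoid language-dependent preprocessing.")
    print(cosine_distance(a, b))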
|
|
|
Inferring document similarity from hyperlinks |
| |
David Grangier,
Samy Bengio
|
|
Pages: 359-360 |
|
doi>10.1145/1099554.1099666 |
|
Full text: PDF
|
|
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function (a, b) → sim(a, b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.
|
|
|
A hybrid approach to NER by MEMM and manual rules |
| |
Moshe Fresko,
Binyamin Rosenfeld,
Ronen Feldman
|
|
Pages: 361-362 |
|
doi>10.1145/1099554.1099667 |
|
Full text: PDF
|
|
This paper describes a framework for defining domain-specific Feature Functions in a user-friendly form, to be used in a Maximum Entropy Markov Model (MEMM) for the Named Entity Recognition (NER) task. Our system, called MERGE, allows defining general Feature Function Templates, as well as Linguistic Rules incorporated into the classifier. We show a simple way of translating these rules into specific feature functions. We show that, with a small amount of expert rule-tuning, MERGE can perform better than both purely machine-learning-based systems and purely knowledge-based approaches.
|
|
|
Situation-aware risk management in autonomous agents |
| |
Martin Lorenz,
Jan D. Gehrke,
Hagen Langer,
Ingo J. Timm,
Joachim Hammer
|
|
Pages: 363-364 |
|
doi>10.1145/1099554.1099668 |
|
Full text: PDF
|
|
We present a novel approach to enable decision-making in a highly distributed multiagent environment where individual agents need to act in an autonomous fashion. Our architecture framework integrates risk management, knowledge management, and agent deliberation to enable sophisticated, autonomous decision-making. Instead of a centralized knowledge repository, our approach supports a highly distributed knowledge base in which each agent manages a fraction of the knowledge needed by the entire system. expand
|
|
|
SESSION: Paper session IR-4 (information retrieval): machine learning |
|
|
|
|
Mining officially unrecognized side effects of drugs by combining web search and machine learning |
| |
Carlo A. Curino,
Yuanyuan Jia,
Bruce Lambert,
Patricia M. West,
Clement Yu
|
|
Pages: 365-372 |
|
doi>10.1145/1099554.1099670 |
|
Full text: PDF
|
|
We consider the problem of finding officially unrecognized side effects of drugs. By submitting queries to the Web involving a given drug name, it is possible to retrieve pages concerning the drug. However, many retrieved pages are irrelevant and some relevant pages are not retrieved. More relevant pages can be obtained by adding the active ingredient of the drug to the query. In order to eliminate irrelevant pages, we propose a machine learning process to filter out the undesirable pages. The process is shown experimentally to be very effective. Since obtaining training data for the machine learning process can be time consuming and expensive, we provide an automatic method to generate the training data. The method is also shown to be very accurate. The side effects of three drugs which are not recognized by the FDA were validated by an expert. We believe that the same approach can be applied to many real-life problems and will yield high precision. Thus, this could lead to a new way to perform retrieval with high accuracy.
|
|
|
MailRank: using ranking for spam detection |
| |
Paul-Alexandru Chirita,
Jörg Diederich,
Wolfgang Nejdl
|
|
Pages: 373-380 |
|
doi>10.1145/1099554.1099671 |
|
Full text: PDF
|
|
Can we use social networks to combat spam? This paper investigates the feasibility of MailRank, a new email ranking and classification scheme exploiting the social communication network created via email interactions. The underlying email network data is collected from the email contacts of all MailRank users and updated automatically based on their email activities to keep maintenance easy. MailRank is used to rate the sender addresses of arriving emails so that emails can be ranked and classified as spam or non-spam according to the trustworthiness of their senders. The paper presents two variants: Basic MailRank computes a global reputation score for each email address, whereas in Personalized MailRank the score of each email address is different for each MailRank user. The evaluation shows that MailRank is highly resistant against spammer attacks, which obviously have to be considered right from the beginning in such an application scenario. MailRank also performs well even for rather sparse networks, i.e., where only a small set of peers actually takes part in the ranking of email addresses.
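The paper's exact formulation is not given here; the sketch below runs a generic PageRank-style power iteration over a toy sender graph to illustrate the kind of global reputation score Basic MailRank computes. The addresses, damping factor, and iteration count are illustrative only.

    def reputation_scores(edges, damping=0.85, iterations=50):
        # edges: (sender, recipient) pairs, each treated as a vote of trust.
        nodes = sorted({a for e in edges for a in e})
        out_links = {n: [] for n in nodes}
        for src, dst in edges:
            out_links[src].append(dst)
        score = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for src, targets in out_links.items():
                if targets:
                    share = damping * score[src] / len(targets)
                    for dst in targets:
                        new[dst] += share
                else:  # address that never sends: spread its mass uniformly
                    for n in nodes:
                        new[n] += damping * score[src] / len(nodes)
            score = new
        return score

    emails = [("alice@x.org", "bob@y.org"), ("bob@y.org", "carol@z.org"),
              ("carol@z.org", "bob@y.org"), ("spammer@junk.biz", "bob@y.org")]
    print(sorted(reputation_scores(emails).items(), key=lambda kv: -kv[1]))

Arriving mail whose sender falls below some score threshold would then be flagged; a personalized variant would bias the initial mass toward each user's own contacts.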
|
|
|
ViPER: augmenting automatic information extraction with visual perceptions |
| |
Kai Simon,
Georg Lausen
|
|
Pages: 381-388 |
|
doi>10.1145/1099554.1099672 |
|
Full text: PDF
|
|
In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised Web data extraction becomes feasible when pages are made up of repetitive patterns, as is the case, e.g., for search engine result pages. The extraction rules are generated automatically, without any training or human interaction, by operating on the DOM tree or, respectively, the flat tag token sequence of a single page. Our contribution to automatic data extraction in this paper is twofold. First, we identify and rank potential repetitive patterns with respect to the user's visual perception of the Web page, well aware that the location and size of matching elements within a Web page constitute important criteria for defining relevance. Second, matching sub-sequences of the highest-weighted pattern are aligned with global multiple sequence alignment techniques. Experimental results show that our system is able to achieve high accuracy in distilling and aligning regularly structured objects inside complex Web pages.
|
|
|
SESSION: Paper session DB-4 (databases): XML and query processing |
|
|
|
|
Interconnection semantics for keyword search in XML |
| |
Sara Cohen,
Yaron Kanza,
Benny Kimelfeld,
Yehoshua Sagiv
|
|
Pages: 389-396 |
|
doi>10.1145/1099554.1099674 |
|
Full text: PDF
|
|
A framework for describing semantic relationships among nodes in XML documents is presented. In contrast to earlier work, the XML documents may have ID references (i.e., they correspond to graphs and not just trees). A specific interconnection semantics in this framework can be defined explicitly or derived automatically. The main advantage of interconnection semantics is the ability to pose queries on XML data in the style of keyword search. Several methods for automatically deriving interconnection semantics are presented. The complexity of the evaluation and the satisfiability problems under the derived semantics is analyzed. For many important cases, the complexity is tractable and hence, the proposed interconnection semantics can be efficiently applied to real-world XML documents. expand
|
|
|
Efficient indexing and querying of XML data using modified Prüfer sequences |
| |
K. Hima Prasad,
P. Sreenivasa Kumar
|
|
Pages: 397-404 |
|
doi>10.1145/1099554.1099675 |
|
Full text: PDF
|
|
With the advent of XML as the new standard for information representation and exchange, indexing and querying of XML data is of major concern. In this paper, we propose a method for representing an XML document as a sequence based on a variation of Prüfer sequences. We incorporate new components into the node encodings, such as the level and the number of descendants of a certain kind, and develop methods for holistic processing of tree pattern queries. Query processing involves converting the query into a sequence as well and performing subsequence matching on the document sequence. We establish certain interesting properties of the proposed sequencing method that give rise to a new, efficient pattern matching algorithm. The sequence data is stored in two-level B+-trees to support query processing. We also propose an optimization for the parent-child axis to speed up query processing. Our approach does not require any post-processing and guarantees results that are free of false positives and duplicates. Experimental results show that our system performs significantly better than previous systems in a large number of cases.
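For orientation, the sketch below builds a classic Prüfer-style sequence for a small rooted tree whose nodes carry a postorder numbering; the paper's modified encoding additionally attaches components such as the level and descendant counts to each entry, which is not reproduced here.

    import heapq

    def prufer_like_sequence(parent):
        # parent maps each non-root node label to its parent's label; labels are
        # integers (e.g. a postorder numbering of XML elements). Repeatedly delete
        # the leaf with the smallest label and record its parent.
        nodes = set(parent) | set(parent.values())
        children = {n: 0 for n in nodes}
        for p in parent.values():
            children[p] += 1
        heap = [n for n in nodes if children[n] == 0]
        heapq.heapify(heap)
        sequence = []
        while heap:
            leaf = heapq.heappop(heap)
            if leaf not in parent:          # the root: nothing left to record
                continue
            p = parent[leaf]
            sequence.append(p)
            children[p] -= 1
            if children[p] == 0:
                heapq.heappush(heap, p)
        return sequence

    # book(5) -> title(1), author(4);  author(4) -> first(2), last(3)
    print(prufer_like_sequence({1: 5, 2: 4, 3: 4, 4: 5}))   # [5, 4, 4, 5]

A tree pattern query would be turned into a sequence in the same way, so that matching reduces to constrained subsequence matching on the document sequence.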
|
|
|
Towards automatic association of relevant unstructured content with structured query results |
| |
Prasan Roy,
Mukesh Mohania,
Bhuvan Bamba,
Shree Raman
|
|
Pages: 405-412 |
|
doi>10.1145/1099554.1099676 |
|
Full text: PDF
|
|
Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of seamlessly integrating critical business information distributed across both structured and unstructured data sources. In existing information integration solutions, the application needs to formulate the SQL logic to retrieve the needed structured data on one hand, and identify a set of keywords to retrieve the related unstructured data on the other. This paper proposes a novel approach wherein the application specifies its information needs using only a SQL query on the structured data, and this query is automatically "translated" into a set of keywords that can be used to retrieve relevant unstructured data. We describe the techniques used for obtaining these keywords from (i) the query result, and (ii) additional related information in the underlying database. We further show that these techniques achieve high accuracy with very reasonable overheads.
|
|
|
SESSION: Paper session KM-4 (knowledge management): information extraction |
|
|
|
|
Predicting accuracy of extracting information from unstructured text collections |
| |
Eugene Agichtein,
Silviu Cucerzan
|
|
Pages: 413-420 |
|
doi>10.1145/1099554.1099678 |
|
Full text: PDF
|
|
Exploiting lexical and semantic relationships in large unstructured text collections can significantly enhance managing, integrating, and querying information locked in unstructured text. Most notably, named entities and relations between entities are crucial for effective question answering and other information retrieval and knowledge management tasks. Unfortunately, the success in extracting these relationships can vary for different domains, languages, and document collections. Predicting extraction performance is an important step towards scalable and intelligent knowledge management, information retrieval and information integration. We present a general language modeling method for quantifying the difficulty of information extraction tasks. We demonstrate the viability of our approach by predicting performance of real world information extraction tasks, Named Entity recognition and Relation Extraction. expand
|
|
|
WAM-Miner: in the search of web access motifs from historical web log data |
| |
Qiankun Zhao,
Sourav S. Bhowmick,
Le Gruenwald
|
|
Pages: 421-428 |
|
doi>10.1145/1099554.1099679 |
|
Full text: PDF
|
|
Existing web usage mining techniques focus only on discovering knowledge based on the statistical measures obtained from the static characteristics of web usage data. They do not consider the dynamic nature of web usage data. In this paper, we focus on discovering novel knowledge by analyzing the change patterns of historical web access sequence data. We present an algorithm called WAM-Miner to discover Web Access Motifs (WAMs). WAMs are web access patterns that never change or do not change significantly most of the time (if not always) in terms of their support values during a specific time period. WAMs are useful for many applications, such as intelligent web advertisement, web site restructuring, business intelligence, and intelligent web caching.
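To illustrate the notion of a WAM (not the WAM-Miner algorithm itself), the sketch below keeps only those access patterns whose support stays nearly constant across time windows; the thresholds and toy support histories are made up.

    from statistics import mean, pstdev

    def web_access_motifs(support_series, max_rel_std=0.1, min_support=0.05):
        # support_series maps an access pattern (e.g. a tuple of page ids) to its
        # support value in each successive time window.
        motifs = {}
        for pattern, supports in support_series.items():
            avg = mean(supports)
            if avg >= min_support and pstdev(supports) / avg <= max_rel_std:
                motifs[pattern] = supports
        return motifs

    history = {
        ("home", "products", "checkout"): [0.12, 0.11, 0.12, 0.13],   # stable
        ("home", "promo-2005"):           [0.30, 0.05, 0.01, 0.00],   # decaying
    }
    print(list(web_access_motifs(history)))   # only the stable pattern survives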
|
|
|
A framework for mining topological patterns in spatio-temporal databases |
| |
Junmei Wang,
Wynne Hsu,
Mong Li Lee
|
|
Pages: 429-436 |
|
doi>10.1145/1099554.1099680 |
|
Full text: PDF
|
|
Mining topological patterns in spatial databases has received a lot of attention. However, existing work typically ignores the temporal aspect and suffers from certain efficiency problems; it is not scalable for mining topological patterns in spatio-temporal databases. In this paper, we study the problem of mining topological patterns by incorporating the temporal aspect into the mining process. We introduce a summary structure that records the instance counts of a feature in a region within a time window. Using this structure, we design an algorithm, TopologyMiner, to find interesting topological patterns without the need to generate candidates. Experimental results show that TopologyMiner is effective and scalable in finding topological patterns and outperforms an Apriori-like algorithm by a few orders of magnitude.
|
|
|
SESSION: Industry track session |
|
|
|
|
Automated cleansing for spend analytics |
| |
Moninder Singh,
Jayant R. Kalagnanam,
Sudhir Verma,
Amit J. Shah,
Swaroop K. Chalasani
|
|
Pages: 437-445 |
|
doi>10.1145/1099554.1099682 |
|
Full text: PDF
|
|
The development of an aggregate view of the procurement spend across an enterprise using transactional data is increasingly becoming a very important and strategic activity. Not only does it provide a complete and accurate picture of what the enterprise is buying and from whom, it also allows it to consolidate suppliers, as well as negotiate better prices. The importance, as well as the complexity, of this cleansing exercise is further magnified by the increasing popularity of Business Transformation Outsourcing (BTO) wherein enterprises are turning over non-core activities, such as indirect procurement, to third parties, who now need to develop an integrated view of spend across multiple enterprises in order to optimize procurement and generate maximum savings. However, the creation of such an integrated view of procurement spend requires the creation of a homogeneous data repository from disparate (heterogeneous) data sources across various geographic and functional organizations throughout the enterprise(s). Such repositories get transactional data from various sources such as invoices, purchase orders, account ledgers. As such, the transactions are not cross-indexed, refer to the same suppliers by different names, and use different ways of representing information about the same commodities. Before an aggregated spend view can be developed, this data needs to be cleansed, primarily to normalize the supplier names and correctly map each transaction to the appropriate commodity code. Commodity mapping, in particular, is made more difficult by the fact that it has to be done on the basis of unstructured text descriptions found in the various data sources. We describe an on-demand system to automatically perform this cleansing activity using techniques from information retrieval and machine learning. Built on standard integration and application infrastructure software, this system provides enterprises with a fast, reliable, accurate and on-demand way of cleansing transactional data and generating an integrated view of spend. This system is currently in the process of being deployed by IBM for use in its BTO practice. expand
|
|
|
Feature-based recommendation system |
| |
Eui-Hong (Sam) Han,
George Karypis
|
|
Pages: 446-452 |
|
doi>10.1145/1099554.1099683 |
|
Full text: PDF
|
|
The explosive growth of the world-wide-web and the emergence of e-commerce have led to the development of recommender systems--a personalized information filtering technology used to identify a set of N items that will be of interest to a certain user. User-based and model-based collaborative filtering are the most successful technologies for building recommender systems to date and are extensively used in many commercial recommender systems. The basic assumption in these algorithms is that there are sufficient historical data for measuring similarity between products or users. However, this assumption does not hold in various application domains such as electronics retail, home shopping networks, and on-line retail, where new products are introduced and existing products disappear from the catalog. Another such application domain is the home improvement retail industry, where many products (such as window treatments, bathrooms, kitchens, or decks) are custom made. Each product is unique and there are very few duplicate products. In this domain, the probability of the exact same two products being bought together is close to zero. In this paper, we discuss the challenges of providing recommendations in domains where no sufficient historical data exist for measuring similarity between products or users. We present feature-based recommendation algorithms that overcome the limitations of existing top-N recommendation algorithms. The experimental evaluation of the proposed algorithms on real-life data sets shows great promise. A pilot project deploying the proposed feature-based recommendation algorithms on an on-line retail web site shows a 75% increase in recommendation revenue over the first two-month period.
|
|
|
Automatic analysis of call-center conversations |
| |
Gilad Mishne,
David Carmel,
Ron Hoory,
Alexey Roytman,
Aya Soffer
|
|
Pages: 453-459 |
|
doi>10.1145/1099554.1099684 |
|
Full text: PDF
|
|
We describe a system for automating call-center analysis and monitoring. Our system integrates transcription of incoming calls with analysis of their content; for the analysis, we introduce a novel method of estimating the domain-specific importance of conversation fragments, based on divergence of corpus statistics. Combining this method with Information Retrieval approaches, we provide knowledge-mining tools both for the call-center agents and for administrators of the center. expand
|
|
|
A new approach to intranet search based on information extraction |
| |
Hang Li,
Yunbo Cao,
Jun Xu,
Yunhua Hu,
Shenjie Li,
Dmitriy Meyerzon
|
|
Pages: 460-468 |
|
doi>10.1145/1099554.1099685 |
|
Full text: PDF
|
|
This paper is concerned with 'intranet search'. By intranet search, we mean searching for information on an intranet within an organization. Through an analysis of survey results and of search log data, we have found that search needs on an intranet can be categorized into types. These types include searching for definitions, persons, experts, and homepages. Traditional information retrieval focuses only on the search for relevant documents, not on the search for special types of information. We propose a new approach to intranet search in which we search for information of each of the special types, in addition to the traditional relevance search. Information extraction technologies can play key roles in such a 'search by type' approach, because we must first extract the necessary information of each type from the documents. We have developed an intranet search system called 'Information Desk'. In the system, we try to address the most important types of search first: finding term definitions, homepages of groups or topics, employees' personal information, and experts on topics. For each type of search, we use information extraction technologies to extract, fuse, and summarize information in advance. The system is in operation on the intranet of Microsoft and is accessed by about 500 employees per month. Feedback from users and system logs shows that users consider the approach useful and that the system really helps people find information. This paper describes the architecture, features, component technologies, and evaluation results of the system.
|
|
|
SESSION: Paper session IR-5 (information retrieval): machine learning and collaborative filtering |
|
|
|
|
A novel refinement approach for text categorization |
| |
Songbo Tan,
Xueqi Cheng,
Moustafa M. Ghanem,
Bin Wang,
Hongbo Xu
|
|
Pages: 469-476 |
|
doi>10.1145/1099554.1099687 |
|
Full text: PDF
|
|
In this paper we present a novel strategy, DragPushing, for improving the performance of text classifiers. The strategy is generic and takes advantage of training errors to successively refine the classification model of a base classifier. We describe how it is applied to generate two new classification algorithms: a Refined Centroid Classifier and a Refined Naïve Bayes Classifier. We present an extensive experimental evaluation of both algorithms on three English collections and one Chinese corpus. The results indicate that in each case, the refined classifiers achieve significant performance improvements over the base classifiers used. Furthermore, the performance of the Refined Centroid Classifier is comparable to, if not better than, that of a state-of-the-art support vector machine (SVM)-based classifier, but at a much lower computational cost.
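The following is a rough sketch, based on the description above rather than on the paper's pseudo-code, of how a centroid classifier can be refined on its own training errors: the centroid of the true class is dragged toward a misclassified document and the wrongly predicted centroid is pushed away. The learning rate and epoch count are illustrative.

    import numpy as np

    def initial_centroids(X, y, n_classes):
        return np.vstack([X[y == c].mean(axis=0) for c in range(n_classes)])

    def refine_centroids(X, y, centroids, eta=0.1, epochs=5):
        # X: (n_docs, n_terms) tf-idf matrix, y: class index per document.
        for _ in range(epochs):
            for x, true_cls in zip(X, y):
                pred = int(np.argmax(centroids @ x))   # dot-product similarity
                if pred != true_cls:                   # refine only on errors
                    centroids[true_cls] += eta * x     # drag
                    centroids[pred] -= eta * x         # push
        return centroids

    X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.3], [0.1, 0.8, 0.4]])
    y = np.array([0, 0, 1, 1])
    C = refine_centroids(X, y, initial_centroids(X, y, 2))
    print(np.argmax(C @ X.T, axis=0))   # class predictions after refinement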
|
|
|
Intelligent GP fusion from multiple sources for text classification |
| |
Baoping Zhang,
Yuxin Chen,
Weiguo Fan,
Edward A. Fox,
Marcos Gonçalves,
Marco Cristo,
Pável Calado
|
|
Pages: 477-484 |
|
doi>10.1145/1099554.1099688 |
|
Full text: PDF
|
|
This paper shows how citation-based information and structural content (e.g., title, abstract) can be combined to improve classification of text documents into predefined categories. We evaluate different measures of similarity -- five derived from the citation information of the collection, and three derived from the structural content -- and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments with the ACM Computing Classification Scheme, using documents from the ACM Digital Library, indicate that GP can discover similarity functions superior to those based solely on a single type of evidence. Effectiveness of the similarity functions discovered through simple majority voting is better than that of content-based as well as combination-based Support Vector Machine classifiers. Experiments also were conducted to compare the performance between GP techniques and other fusion techniques such as Genetic Algorithms (GA) and linear fusion. Empirical results show that GP was able to discover better similarity functions than GA or other fusion techniques. expand
|
|
|
Time weight collaborative filtering |
| |
Yi Ding,
Xue Li
|
|
Pages: 485-492 |
|
doi>10.1145/1099554.1099689 |
|
Full text: PDF
|
|
Collaborative filtering is regarded as one of the most promising recommendation algorithms. The item-based approaches to collaborative filtering identify the similarity between two items by comparing users' ratings on them. In these approaches, ratings produced at different times are weighted equally; that is to say, changes in user purchase interest are not taken into consideration. For example, an item that was rated recently by a user should have a bigger impact on the prediction of future user behaviour than an item that was rated a long time ago. In this paper, we present a novel algorithm to compute time weights for different items in a manner that assigns a decreasing weight to old data. More specifically, users' purchase habits vary, and even the same user has quite different attitudes towards different items. Our proposed algorithm uses clustering to discriminate between different kinds of items. For each item cluster, we trace each user's change in purchase interest and introduce a personalized decay factor according to the user's own purchase behaviour. Empirical studies have shown that our new algorithm substantially improves the precision of item-based collaborative filtering without introducing higher-order computational complexity.
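A minimal sketch of the time-decay idea: older ratings get exponentially smaller weights when items are scored. The half-life and the simple weighted average are illustrative; the paper instead learns a personalized decay factor per item cluster.

    import math
    from collections import defaultdict

    def time_weighted_scores(ratings, now, half_life_days=90.0):
        # ratings: (item, rating, timestamp_in_days); weight = exp(-lambda * age).
        lam = math.log(2) / half_life_days
        num, den = defaultdict(float), defaultdict(float)
        for item, rating, t in ratings:
            w = math.exp(-lam * (now - t))
            num[item] += w * rating
            den[item] += w
        return {item: num[item] / den[item] for item in num}

    history = [("dvd_player", 5, 10), ("dvd_player", 2, 350), ("toaster", 4, 340)]
    print(time_weighted_scores(history, now=365))   # recent ratings dominate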
|
|
|
SESSION: Paper session DB-5 (databases): updates and change detection |
|
|
|
|
Handling frequent updates of moving objects |
| |
Bin Lin,
Jianwen Su
|
|
Pages: 493-500 |
|
doi>10.1145/1099554.1099691 |
|
Full text: PDF
|
|
A critical issue in moving object databases is to develop appropriate indexing structures for continuously moving object locations so that queries can still be performed efficiently. However, such location changes typically cause a high volume of updates, which in turn poses serious problems for maintaining index structures. In this paper we propose a Lazy Group Update (LGU) algorithm for disk-based index structures of moving objects. LGU contains two key additional structures to group "similar" updates so that they can be performed together: a disk-based insertion buffer (I-Buffer) for each internal node, and a memory-based deletion table (D-Table) for the entire tree. Different strategies of "pushing down" an overflowing I-Buffer to the next level are studied. Comprehensive empirical studies over uniform and skewed datasets, as well as simulated street traffic data, show that LGU achieves a significant improvement in update throughput while allowing reasonable performance for queries.
|
|
|
QED: a novel quaternary encoding to completely avoid re-labeling in XML updates |
| |
Changqing Li,
Tok Wang Ling
|
|
Pages: 501-508 |
|
doi>10.1145/1099554.1099692 |
|
Full text: PDF
|
|
The method of assigning labels to the nodes of an XML tree is called a labeling scheme. Based on the labels alone, both ordered and un-ordered queries can be processed without accessing the original XML file. Another important aspect of a labeling scheme is the label update cost when a node is inserted into or deleted from the XML tree. All current labeling schemes have a high update cost; therefore, in this paper we propose a novel quaternary encoding approach for labeling schemes. Based on this encoding approach, we need not re-label any existing nodes when an update is performed. Extensive experimental results on XML datasets illustrate that our QED works much better than the existing labeling schemes on label updates, whether measured by the number of re-labeled nodes or the time spent re-labeling.
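The paper's own encoding and insertion rules are not reproduced here; the sketch below only illustrates why such labels avoid re-labeling, assuming codes over the digits 1-3 (with 0 kept free as a separator) that always end in 2 or 3 and are compared lexicographically. Under that assumption, a fresh code can always be generated strictly between two existing codes, so neighbouring labels never have to change.

    def code_between(left, right):
        # Return a code strictly between left < right, using only digits 1-3 and
        # ending in 2 or 3, without modifying either endpoint.
        assert left < right
        i = 0
        while i < len(left) and i < len(right) and left[i] == right[i]:
            i += 1
        if i < len(left):                     # the codes differ at position i
            if int(right[i]) - int(left[i]) >= 2:
                return left[:i] + str(int(left[i]) + 1)   # room at this position
            return left + "2"                             # otherwise extend the left code
        # left is a proper prefix of right: extend it just below right's remainder
        out, rest = left, right[i:]
        while rest[0] == "1":
            out += "1"
            rest = rest[1:]
        return out + ("2" if rest[0] == "3" else "12")

    print(code_between("112", "113"))   # '1122' sorts between the two
    print(code_between("2", "212"))     # '2112'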
|
|
|
Detecting changes on unordered XML documents using relational databases: a schema-conscious approach |
| |
Erwin Leonardi,
Sourav S. Bhowmick
|
|
Pages: 509-516 |
|
doi>10.1145/1099554.1099693 |
|
Full text: PDF
|
|
Several relational approaches have been proposed to detect the changes to XML documents by using relational databases. These approaches store the XML documents in the relational database and issue SQL queries (whenever appropriate) to detect the changes. All of these relational-based approaches use the schema-oblivious XML storage strategy for detecting the changes. However, there is growing evidence that schema-conscious storage approaches perform significantly better than schema-oblivious approaches as far as XML query processing is concerned. In this paper, we study a relational-based unordered XML change detection technique (called HELIOS) that uses a schema-conscious approach (Shared-Inlining) as the underlying storage strategy. HELIOS is up to 52 times faster than X-Diff [7] for large datasets (more than 1000 nodes). It is also up to 6.7 times faster than XANDY [4]. The result quality of deltas detected by HELIOS is comparable to the result quality of deltas detected by XANDY.
|
|
|
SESSION: Paper session IR-6 (information retrieval): IR models 1 |
|
|
|
|
Similarity measures for tracking information flow |
| |
Donald Metzler,
Yaniv Bernstein,
W. Bruce Croft,
Alistair Moffat,
Justin Zobel
|
|
Pages: 517-524 |
|
doi>10.1145/1099554.1099695 |
|
Full text: PDF
|
|
Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity -- resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance -- are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.
|
|
|
Word sense disambiguation in queries |
| |
Shuang Liu,
Clement Yu,
Weiyi Meng
|
|
Pages: 525-532 |
|
doi>10.1145/1099554.1099696 |
|
Full text: PDF
|
|
This paper presents a new approach to determining the senses of words in queries by using WordNet. In our approach, noun phrases in a query are determined first. For each word in the query, information associated with it, including its synonyms, hyponyms, hypernyms, the definitions of its synonyms and hyponyms, and its domains, can be used for word sense disambiguation. By comparing these pieces of information associated with the words which form a phrase, it may be possible to assign senses to these words. If this disambiguation fails, then the other query words, if any exist, are used, by going through exactly the same process. If the sense of a query word cannot be determined in this manner, then a guess of the sense of the word is made, provided the guess has at least a 50% chance of being correct. If no sense of the word has a 50% or higher chance of being used, then we apply a Web search to assist in the word sense disambiguation process. Experimental results show that our approach has 100% applicability and 90% accuracy on the most recent TREC robust track collection of 250 queries. We integrate this disambiguation algorithm into our retrieval system to examine the effect of word sense disambiguation on text retrieval. Experimental results show that the disambiguation algorithm, together with the other components of our retrieval system, yields a result which is 13.7% above that produced by the same system without the disambiguation, and 9.2% above that produced by using Lesk's algorithm. Our retrieval effectiveness is 7% better than the best reported result in the literature.
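As a rough illustration of gloss-based disambiguation between co-occurring query words (a simplified Lesk-style overlap, not the paper's full procedure or its guessing and Web-search fallbacks), the sketch below uses NLTK's WordNet interface; the example words are arbitrary.

    # Requires: pip install nltk, then nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    def expanded_gloss(synset):
        # Words from the synset's definition, its synonyms, and the definitions
        # of its hypernyms and hyponyms.
        words = set(synset.definition().lower().split())
        words.update(l.lower() for l in synset.lemma_names())
        for related in synset.hypernyms() + synset.hyponyms():
            words.update(related.definition().lower().split())
        return words

    def disambiguate(word, other_word):
        # Pick the sense of `word` whose expanded gloss overlaps most with the
        # expanded glosses of the senses of a co-occurring query word.
        context = set()
        for s in wn.synsets(other_word, pos=wn.NOUN):
            context |= expanded_gloss(s)
        senses = wn.synsets(word, pos=wn.NOUN)
        return max(senses, key=lambda s: len(expanded_gloss(s) & context), default=None)

    print(disambiguate("bank", "river"))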
|
|
|
ERkNN: efficient reverse k-nearest neighbors retrieval with local kNN-distance estimation |
| |
Chenyi Xia,
Wynne Hsu,
Mong Li Lee
|
|
Pages: 533-540 |
|
doi>10.1145/1099554.1099697 |
|
Full text: PDF
|
|
The Reverse k-Nearest Neighbors (RkNN) queries are important in profile-based marketing, information retrieval, decision support and data mining systems. However, they are very expensive and existing algorithms are not scalable to queries in high dimensional spaces or of large values of k. This paper describes an efficient estimation-based RkNN search algorithm (ERkNN) which answers RkNN queries based on local kNN-distance estimation methods. The proposed approach utilizes estimation-based filtering strategy to lower the computation cost of RkNN queries. The results of extensive experiments on both synthetic and real life datasets demonstrate that ERkNN algorithm retrieves RkNN efficiently and is scalable with respect to data dimensionality, k, and data size. expand
|
|
|
SESSION: Industry track session |
|
|
|
|
Kalchas: a dynamic XML search engine |
| |
Rasmus Kaae,
Thanh-Duy Nguyen,
Dennis Nørgaard,
Albrecht Schmidt
|
|
Pages: 541-548 |
|
doi>10.1145/1099554.1099699 |
|
Full text: PDF
|
|
This paper outlines the system architecture and the core data structures of Kalchas, a full-text search engine for XML data with an emphasis on dynamic indexing, and identifies features worth demonstrating. The concept of a dynamic index implies that the aim is to reflect the creation of, deletion of, and updates to relevant files in the search index as early as possible. This is achieved by a number of techniques, including ideas drawn from partitioned B-Trees and inverted indices. The actual ranked retrieval of documents is implemented with XML-specific query operators for lowest common ancestor queries. A live demonstration will discuss Kalchas' behaviour in typical use cases, such as interactive editing sessions and bulk loading large amounts of static files, as well as querying the contents of the indexed files; it tries to clarify both the shortcomings and the advantages of the method.
|
|
|
Order checking in a CPOE using event analyzer |
| |
Lilian Harada,
Yuuji Hotta
|
|
Pages: 549-555 |
|
doi>10.1145/1099554.1099700 |
|
Full text: PDF
|
|
In this paper we present our experience in applying Event Analyzer, a processing engine we have developed to extract patterns from a sequence of events, to the checking of medical orders in a CPOE system. We present some extensions we have implemented in Event Analyzer in order to fulfill the needs of this order checking, as well as some performance evaluation results. We also outline some problems we are now facing in adapting Event Analyzer's pattern detection engine to support streaming orders in an on-line CPOE checking system.
|
|
|
SyynX solutions: practical knowledge management in a medical environment |
| |
Christian Herzog,
Gianpiero Liuzzi,
Mario Diwersy
|
|
Pages: 556-559 |
|
doi>10.1145/1099554.1099701 |
|
Full text: PDF
|
|
In this paper we describe the Knowledge Management approach for the biomedical scientific community developed by SyynX Solutions GmbH [1].
|
|
|
Leveraging collective knowledge |
| |
Henry Kon,
Michael Hoey
|
|
Pages: 560-567 |
|
doi>10.1145/1099554.1099702 |
|
Full text: PDF
|
|
As more organizations begin to deploy taxonomies for categorization and faceted search, the cost of producing these knowledge models is becoming the largest expense on a project. At a cost of 200-300 dollars per topic, manually developing subject area taxonomies does not scale for any but the smallest of projects. This paper discusses an approach called Orthogonal Corpus Indexing (OCI). OCI leverages existing published knowledge in the subject area of the taxonomy model. This knowledge is algorithmically mapped into multiple taxonomies via the OCI algorithm. The resulting taxonomies cost 1/100th of what manual methods cost and are created with embedded rule sets for categorization engines. This paper discusses the theory of OCI and its practical use, as well as examples of knowledge management techniques that become possible when taxonomies are large, detailed, and inexpensive.
|
|
|
Taxonomies by the numbers: building high-performance taxonomies |
| |
Stephen C. Gates,
Wilfried Teiken,
Keh-Shin F. Cheng
|
|
Pages: 568-577 |
|
doi>10.1145/1099554.1099703 |
|
Full text: PDF
|
|
In this paper, we describe a system for the construction of taxonomies which yield high accuracies with automated categorization systems, even on Web and intranet documents. In particular, we describe the way in which measurement of five key features of the system can be used to predict when categories are sufficiently well defined to yield high accuracy categorization. We describe the use of this system to construct a large (8800-category) general-purpose taxonomy and categorization system. expand
|
|
|
SESSION: Paper session IR-7 (information retrieval): distributed retrieval |
|
|
|
|
Distributed PageRank computation based on iterative aggregation-disaggregation methods |
| |
Yangbo Zhu,
Shaozhi Ye,
Xing Li
|
|
Pages: 578-585 |
|
doi>10.1145/1099554.1099705 |
|
Full text: PDF
|
|
PageRank has been widely used as a major factor in search engine ranking systems. However, global link graph information is required when computing PageRank, which causes prohibitive communication cost if accurate results are to be achieved in a distributed setting. In this paper, we propose a distributed PageRank computation algorithm based on the iterative aggregation-disaggregation (IAD) method with Block Jacobi smoothing. The basic idea is divide-and-conquer: we treat each web site as a node to exploit the block structure of hyperlinks. Local PageRank is computed by each node itself and then updated through low-cost communication with a coordinator. We prove the global convergence of the Block Jacobi method and then analyze the communication overhead and major advantages of our algorithm. Experiments on three real web graphs show that our method converges 5-7 times faster than the traditional Power method. We believe our work provides an efficient and practical distributed solution for PageRank on large-scale Web graphs.
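The sketch below is not the paper's IAD/Block Jacobi iteration; it is a simplified, BlockRank-style illustration of the same divide-and-conquer intuition: each site computes a local PageRank over its internal links, a coordinator ranks the aggregated site-level graph, and the two are combined into an approximate global vector (which IAD-type methods would then keep refining). The graph and site assignment are toy data.

    import numpy as np

    def pagerank(M, d=0.85, iters=100):
        # Plain power iteration on a column-stochastic link matrix M.
        n = M.shape[0]
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - d) / n + d * M @ r
        return r

    def stochastic(adj):
        # Column-normalize an adjacency matrix where adj[i, j] = 1 means j links to i.
        col = adj.sum(axis=0)
        col[col == 0] = 1.0
        return adj / col

    sites = {"A": [0, 1], "B": [2, 3]}                   # page -> site partition
    adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                    [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)

    local = np.zeros(4)                                  # step 1: local PageRank per site
    for pages in sites.values():
        local[pages] = pagerank(stochastic(adj[np.ix_(pages, pages)]))

    names = list(sites)                                  # step 2: coordinator ranks the site graph
    site_adj = np.array([[adj[np.ix_(sites[a], sites[b])].sum() if a != b else 0.0
                          for b in names] for a in names])
    site_rank = pagerank(stochastic(site_adj))

    approx = local.copy()                                # step 3: combine the two levels
    for i, s in enumerate(names):
        approx[sites[s]] *= site_rank[i]
    print(approx / approx.sum())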
|
|
|
Scalable summary based retrieval in P2P networks |
| |
Wolfgang Müller,
Martin Eisenhardt,
Andreas Henrich
|
|
Pages: 586-593 |
|
doi>10.1145/1099554.1099706 |
|
Full text: PDF
|
|
Much of the present P2P-IR literature is focused on distributed indexing structures. Within this paper, we present an approach based on the replication of peer data summaries via rumor spreading and multicast in a structured overlay. We describe Rumorama, a P2P framework for similarity queries inspired by GlOSS and CORI and their P2P adaptation, PlanetP. Rumorama achieves a hierarchization of PlanetP-like summary-based P2P-IR networks. In a Rumorama network, each peer views the network as a small PlanetP network with connections to peers that see other small PlanetP networks. One important aspect is that each peer can choose the size of the PlanetP network it wants to see according to its local processing power and bandwidth. Even in this adaptive environment, Rumorama manages to process a query such that the summary of each peer is considered exactly once in a network without churn. However, the actual number of peers to be contacted for a query is a small fraction of the total number of peers in the network. Within this article, we present the Rumorama base protocol, as well as experiments demonstrating the scalability and viability of the approach under churn.
|
|
|
SESSION: Paper session DB-6 (databases): algorithms |
|
|
|
|
Compact reachability labeling for graph-structured data |
| |
Hao He,
Haixun Wang,
Jun Yang,
Philip S. Yu
|
|
Pages: 594-601 |
|
doi>10.1145/1099554.1099708 |
|
Full text: PDF
|
|
Testing reachability between nodes in a graph is a well-known problem with many important applications, including knowledge representation, program analysis, and more recently, biological and ontology databases inferencing as well as XML query processing. Various approaches have been proposed to encode graph reachability information using node labeling schemes, but most existing schemes only work well for specific types of graphs. In this paper, we propose a novel approach, HLSS (Hierarchical Labeling of Sub-Structures), which identifies different types of substructures within a graph and encodes them using techniques suitable to the characteristics of each of them. We implement HLSS with an efficient two-phase algorithm, where the first phase identifies and encodes strongly connected components as well as tree substructures, and the second phase encodes the remaining reachability relationships by compressing dense rectangular submatrices in the transitive closure matrix. For the important subproblem of finding densest submatrices, we demonstrate the hardness of the problem and propose several practical algorithms. Experiments show that HLSS handles different types of graphs well, while existing approaches fall prey to graphs with substructures they are not designed to handle.
|
|
|
A formal characterization of PIVOT/UNPIVOT |
| |
Catharine M. Wyss,
Edward L. Robertson
|
|
Pages: 602-608 |
|
doi>10.1145/1099554.1099709 |
|
Full text: PDF
|
|
PIVOT is an important relational operation that allows data in rows to be exchanged for columns. Although most current relational database management systems support PIVOT-type operations, to date a purely formal, algebraic characterization of PIVOT has been lacking. In this paper, we present a characterization in terms of extended relational algebra operators τ (transpose), Π (drop projection), and μ (unique optimal tuple merge). This enables us to (1) draw parallels with PIVOT and existing operators employed in Dynamic Data Mapping Systems (DDMS), (2) formally characterize invertible PIVOT instances, and (3) provide complexity results for PIVOT-type operations. These contributions are an important part of ongoing work on formal models for relational OLAP. expand
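To make the operation concrete, here is a small Python stand-in for PIVOT and UNPIVOT on a relation given as a list of dicts; it illustrates the row-to-column exchange, not the paper's algebraic characterization via transpose, drop projection, and optimal tuple merge. Column names are made up.

    def pivot(rows, key, attr_col, value_col):
        # Turn the distinct values of attr_col into columns, grouping rows by key.
        result = {}
        for row in rows:
            result.setdefault(row[key], {key: row[key]})[row[attr_col]] = row[value_col]
        return list(result.values())

    def unpivot(rows, key, attr_col, value_col):
        # Inverse: fold every non-key column back into (attribute, value) rows.
        return [{key: row[key], attr_col: col, value_col: val}
                for row in rows for col, val in row.items() if col != key]

    sales = [{"store": "north", "quarter": "Q1", "amount": 120},
             {"store": "north", "quarter": "Q2", "amount": 95},
             {"store": "south", "quarter": "Q1", "amount": 80}]
    wide = pivot(sales, "store", "quarter", "amount")
    print(wide)                                    # one row per store, one column per quarter
    print(unpivot(wide, "store", "quarter", "amount"))

Invertibility is exactly the subtle part: UNPIVOT recovers the original rows here only because no (store, quarter) pair was duplicated, which is the kind of condition a formal treatment of invertible PIVOT instances pins down.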
|
|
|
SESSION: Paper session DB-7 (databases): privacy and sharing |
|
|
|
|
A novel approach for privacy-preserving video sharing |
| |
Jianping Fan,
Hangzai Luo,
Mohand-Said Hacid,
Elisa Bertino
|
|
Pages: 609-616 |
|
doi>10.1145/1099554.1099711 |
|
Full text: PDF
|
|
To support privacy-preserving video sharing, we have proposed a novel framework that is able to protect the video content privacy at the individual video clip level and prevent statistical inferences from video collections. To protect the video content ...
To support privacy-preserving video sharing, we have proposed a novel framework that is able to protect the video content privacy at the individual video clip level and prevent statistical inferences from video collections. To protect the video content privacy at the individual video clip level, we have developed an effective algorithm to automatically detect privacy-sensitive video objects and video events. To prevent the statistical inferences from video collections, we have developed a distributed framework for privacy-preserving classifier training, which is able to significantly reduce the costs of data transmission and reliably limit the privacy breaches by determining the optimal size of blurred test samples for classifier validation. Our experiments on a specific domain of patient training and counseling videos show convincing results. expand
|
|
|
SESSION: Paper session IR-8 (information retrieval): sentiment and genre classification |
|
|
|
|
Determining the semantic orientation of terms through gloss classification |
| |
Andrea Esuli,
Fabrizio Sebastiani
|
|
Pages: 617-624 |
|
doi>10.1145/1099554.1099713 |
|
Full text: PDF
|
|
Sentiment classification is a recent subdiscipline of text classification which is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users' opinions about ...
Sentiment classification is a recent subdiscipline of text classification which is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users' opinions about products or about political candidates as expressed in online forums, to customer relationship management. Functional to the extraction of opinions from text is the determination of the orientation of "subjective" terms contained in text, i.e. the determination of whether a term that carries opinionated content has a positive or a negative connotation. In this paper we present a new method for determining the orientation of subjective terms. The method is based on the quantitative analysis of the glosses of such terms, i.e. the definitions that these terms are given in on-line dictionaries, and on the use of the resulting term representations for semi-supervised term classification. The method we present outperforms all known methods when tested on the recognized standard benchmarks for this task. expand
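As a rough illustration of the gloss-based idea (not the paper's semi-supervised classifier), the sketch below represents each term by the bag of words in its dictionary gloss and scores its orientation by similarity to centroids built from positive and negative seed terms; the mini-dictionary of glosses and the seed choice are invented for the example.

    # Toy gloss-based orientation scoring: represent each term by the bag of
    # words in its gloss, then compare against centroids built from seed terms.
    import math
    from collections import Counter

    def gloss_vector(gloss):
        return Counter(gloss.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def centroid(vectors):
        c = Counter()
        for v in vectors:
            c.update(v)
        return c

    # Hypothetical mini-dictionary of glosses (illustration only).
    glosses = {
        "good": "having desirable or positive qualities",
        "bad": "having undesirable or negative qualities",
        "superb": "of very high quality desirable and admirable",
        "awful": "extremely unpleasant negative and undesirable",
    }
    pos_c = centroid([gloss_vector(glosses[t]) for t in ["good"]])
    neg_c = centroid([gloss_vector(glosses[t]) for t in ["bad"]])

    def orientation(term):
        v = gloss_vector(glosses[term])
        return cosine(v, pos_c) - cosine(v, neg_c)  # > 0 means positive leaning

    print(orientation("superb") > 0, orientation("awful") < 0)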
|
|
|
Using appraisal groups for sentiment analysis |
| |
Casey Whitelaw,
Navendu Garg,
Shlomo Argamon
|
|
Pages: 625-631 |
|
doi>10.1145/1099554.1099714 |
|
Full text: PDF
|
|
Little work to date in sentiment analysis (classifying texts by "positive" or "negative" orientation) has attempted to use fine-grained semantic distinctions in features used for classification. We present a new method for sentiment classification based ...
Little work to date in sentiment analysis (classifying texts by "positive" or "negative" orientation) has attempted to use fine-grained semantic distinctions in features used for classification. We present a new method for sentiment classification based on extracting and analyzing appraisal groups such as "very good" or "not terribly funny". An appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on Appraisal Theory. Semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. We classify movie reviews using features based upon these taxonomies combined with standard "bag-of-words" features, and report state-of-the-art accuracy of 90.2%. In addition, we find that some types of appraisal appear to be more significant for sentiment classification than others. expand
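A highly simplified sketch of appraisal-group extraction follows: a head adjective carries an orientation, and preceding modifiers intensify or negate it. The tiny word lists and weights are invented for illustration; the paper's lexicon and Appraisal Theory taxonomies are far richer.

    # Toy extraction of appraisal groups such as "very good" or "not terribly
    # funny": a head adjective carries an orientation, preceding modifiers
    # intensify or negate it.  Lexicon entries here are illustrative only.
    ADJECTIVES = {"good": 1.0, "funny": 1.0, "bad": -1.0, "dull": -1.0}
    INTENSIFIERS = {"very": 1.5, "terribly": 1.5, "extremely": 2.0}
    NEGATORS = {"not", "never", "hardly"}

    def appraisal_groups(text):
        tokens = text.lower().split()
        groups = []
        for i, tok in enumerate(tokens):
            if tok not in ADJECTIVES:
                continue
            score, j = ADJECTIVES[tok], i - 1
            words = [tok]
            while j >= 0 and (tokens[j] in INTENSIFIERS or tokens[j] in NEGATORS):
                words.insert(0, tokens[j])
                if tokens[j] in INTENSIFIERS:
                    score *= INTENSIFIERS[tokens[j]]
                else:                      # negation flips the orientation
                    score *= -1.0
                j -= 1
            groups.append((" ".join(words), score))
        return groups

    print(appraisal_groups("the plot was not terribly funny but very good"))
    # [('not terribly funny', -1.5), ('very good', 1.5)]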
|
|
|
Effects of web document evolution on genre classification |
| |
Elizabeth Sugar Boese,
Adele E. Howe
|
|
Pages: 632-639 |
|
doi>10.1145/1099554.1099715 |
|
Full text: PDF
|
|
The World Wide Web is a massive corpus that constantly evolves. Classification experiments usually grab a snapshot (temporally and spatially) of the Web for a corpus. In this paper, we examine the effects of page evolution on genre classification of ...
The World Wide Web is a massive corpus that constantly evolves. Classification experiments usually grab a snapshot (temporally and spatially) of the Web for a corpus. In this paper, we examine the effects of page evolution on genre classification of Web pages. Web genre refers to the type of the page characterized by features such as style, form or presentation layout, and meta-content; Web genre can be used to tune spider crawling re-visits and inform relevance judgments for search engines. We found that pages in some genres change rarely if at all and can be used in present-day research experiments without requiring an updated version. We show that an old corpus can be used for training when testing on new Web pages, with only a marginal drop in accuracy rates on genre classification. We also show that features found to be useful in one corpus do not transfer well to other corpora with different genres. expand
|
|
|
SESSION: Paper session DB-8 (databases): query optimisation |
|
|
|
|
Query workload-aware overlay construction using histograms |
| |
Georgia Koloniari,
Yannis Petrakis,
Evaggelia Pitoura,
Thodoris Tsotsos
|
|
Pages: 640-647 |
|
doi>10.1145/1099554.1099717 |
|
Full text: PDF
|
|
Peer-to-peer (p2p) systems offer an efficient means of data sharing among a dynamically changing set of a large number of autonomous nodes. Each node in a p2p system is connected with a small number of other nodes, thus creating an overlay network of nodes. ...
Peer-to-peer (p2p) systems offer an efficient means of data sharing among a dynamically changing set of a large number of autonomous nodes. Each node in a p2p system is connected with a small number of other nodes, thus creating an overlay network of nodes. A query posed at a node is routed through the overlay network towards nodes hosting data items that satisfy it. In this paper, we consider building overlays that exploit the query workload so that nodes are clustered based on their results to a given query workload. The motivation is to create overlays where nodes that match a large number of similar queries are a few links apart. Query frequency is also taken into account so that popular queries have a greater effect on the formation of the overlay than unpopular ones. We focus on range selection queries and use histograms to estimate the query results of each node. Then, nodes are clustered based on the similarity of their histograms. To this end, we introduce a workload-aware edit distance metric between histograms that takes into account the query workload. Our experimental results show that workload-aware overlays increase the percentage of query results returned for a given number of nodes visited as compared to both random (i.e., unclustered) overlays and non workload-aware clustered overlays (i.e., overlays that cluster nodes based solely on the nodes' content). expand
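As a simplified stand-in for the paper's workload-aware edit distance, the sketch below compares two nodes' histograms with a distance in which buckets touched more often by the query workload contribute more, so that nodes differing on popular value ranges end up far apart and hence in different clusters. The weighting scheme is an assumption made for illustration, not the paper's metric.

    # Simplified workload-aware distance between two nodes' histograms: buckets
    # that the query workload touches frequently weigh more, so nodes that
    # differ on "popular" buckets end up in different clusters.
    def workload_weighted_distance(hist_a, hist_b, bucket_query_freq):
        """hist_a, hist_b: bucket counts under a common binning.
        bucket_query_freq: how many workload queries touch each bucket."""
        assert len(hist_a) == len(hist_b) == len(bucket_query_freq)
        total = sum(bucket_query_freq) or 1
        dist = 0.0
        for a, b, f in zip(hist_a, hist_b, bucket_query_freq):
            dist += (f / total) * abs(a - b)
        return dist

    # Two nodes that disagree only on a rarely queried bucket remain "close".
    node1 = [50, 10, 0, 0]
    node2 = [50, 10, 0, 40]
    node3 = [0, 10, 0, 0]
    freq = [30, 5, 0, 1]     # bucket 0 is popular, bucket 3 is rarely queried
    print(workload_weighted_distance(node1, node2, freq))   # small
    print(workload_weighted_distance(node1, node3, freq))   # large

Any standard clustering algorithm can then be run on this distance to place nodes with similar workload-relevant content near each other in the overlay.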
|
|
|
Optimizing candidate check costs for bitmap indices |
| |
Doron Rotem,
Kurt Stockinger,
Kesheng Wu
|
|
Pages: 648-655 |
|
doi>10.1145/1099554.1099718 |
|
Full text: PDF
|
|
In this paper, we propose a new strategy for optimizing the placement of bin boundaries to minimize the cost of query evaluation using bitmap indices with binning. For attributes with a large number of distinct values, often the most efficient index ...
In this paper, we propose a new strategy for optimizing the placement of bin boundaries to minimize the cost of query evaluation using bitmap indices with binning. For attributes with a large number of distinct values, often the most efficient index scheme is a bitmap index with binning. However, this type of index may not be able to fully resolve some user queries. To fully resolve these queries, one has to access parts of the original data to check whether certain candidate records actually satisfy the specified conditions. We call this procedure the candidate check, which usually dominates the total query processing time. Given a set of user queries, we seek to minimize the total time required to answer the queries by optimally placing the bin boundaries. We show that our dynamic programming based algorithm can efficiently determine the bin boundaries. We verify our analysis with some real user queries from the Sloan Digital Sky Survey. For queries that require a significant amount of time to perform the candidate check, using our optimal bin boundaries reduces the candidate check time by a factor of 2 and the total query processing time by 40%. expand
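The sketch below illustrates the flavor of such a dynamic program under a deliberately simplified cost model (an assumption for illustration, not the paper's model): every workload query endpoint that falls inside a bin forces a candidate check over all rows of that bin, and we look for the partition of the sorted attribute values into a fixed number of bins that minimizes the total candidate-check cost.

    # Dynamic programming over bin boundary placement.  Simplified cost model:
    # a query endpoint landing in a bin forces a candidate check on every row
    # of that bin, so cost(bin) = (#rows in bin) * (#endpoints in bin).
    import functools

    def optimal_bins(rows, endpoints, num_bins):
        """rows[i]: row count of the i-th distinct value (sorted order).
        endpoints[i]: workload query endpoints falling on that value.
        Returns (minimal total cost, bins as (first, last) index pairs)."""
        n = len(rows)

        def bin_cost(i, j):                      # values i..j-1 form one bin
            return sum(rows[i:j]) * sum(endpoints[i:j])

        @functools.lru_cache(maxsize=None)
        def best(i, b):                          # cover values i..n-1 with b bins
            if b == 1:
                return bin_cost(i, n), [(i, n - 1)]
            options = []
            for j in range(i + 1, n - b + 2):    # keep enough values for b-1 bins
                rest_cost, rest_bins = best(j, b - 1)
                options.append((bin_cost(i, j) + rest_cost, [(i, j - 1)] + rest_bins))
            return min(options, key=lambda t: t[0])

        return best(0, num_bins)

    rows      = [100, 100, 5, 5, 100, 100]
    endpoints = [0,   0,   8, 8, 0,   0]   # queries cut through the middle values
    print(optimal_bins(rows, endpoints, 3))
    # (160, [(0, 1), (2, 3), (4, 5)]): the heavily queried values get their own bin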
|
|
|
Towards estimating the number of distinct value combinations for a set of attributes |
| |
Xiaohui Yu,
Calisto Zuzarte,
Kenneth C. Sevcik
|
|
Pages: 656-663 |
|
doi>10.1145/1099554.1099719 |
|
Full text: PDF
|
|
Accurately and efficiently estimating the number of distinct values for some attribute(s) or sets of attributes in a data set is of critical importance to many database operations, such as query optimization and approximation query answering. Previous ...
Accurately and efficiently estimating the number of distinct values for some attribute(s) or sets of attributes in a data set is of critical importance to many database operations, such as query optimization and approximate query answering. Previous work has focused on the estimation of the number of distinct values for a single attribute, and most existing work adopts a data sampling approach. This paper addresses the equally important issue of estimating the number of distinct value combinations for multiple attributes, which we call COLSCARD (for COLumn Set CARDinality). It also takes a different approach that uses existing statistical information (e.g., histograms) available on the individual attributes to assist estimation. We start with cases where exact frequency information on individual attributes is available, and present a pair of lower and upper bounds on COLSCARD that are consistent with the available information, as well as a probabilistic estimator of COLSCARD. We then proceed to study the case where only partial information (in the form of histograms) is available on individual attributes, and show how the proposed estimator can be adapted to this case. We consider two types of widely used histograms and show how they can be constructed in order to obtain optimal approximation. An experimental evaluation of the proposed estimation method on synthetic data as well as two real data sets is provided. expand
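To make the setting concrete, the snippet below shows only the trivial end of this reasoning: bounds on the number of distinct value combinations implied by the per-attribute distinct counts and the table size, plus a uniform-independence estimate; the paper's estimator is considerably more refined and histogram-aware.

    # Back-of-the-envelope bounds for the number of distinct value combinations
    # (COLSCARD) of a set of attributes, given only per-attribute distinct
    # counts and the table size.  The paper's estimator is more sophisticated.
    import math

    def colscard_bounds(distinct_counts, num_rows):
        lower = max(distinct_counts)          # one attribute alone forces this many
        upper = min(math.prod(distinct_counts), num_rows)
        return lower, upper

    def colscard_uniform_estimate(distinct_counts, num_rows):
        """Expected #distinct combinations if attributes were independent and
        each of the D possible combinations equally likely ("balls into bins")."""
        d = math.prod(distinct_counts)
        return d * (1.0 - (1.0 - 1.0 / d) ** num_rows)

    print(colscard_bounds([10, 20], 1000))                     # (20, 200)
    print(round(colscard_uniform_estimate([10, 20], 1000)))    # 199, near the upper bound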
|
|
|
SESSION: Paper session IR-9 (information retrieval): IR models 2 |
|
|
|
|
A geometric interpretation and analysis of R-precision |
| |
Javed A. Aslam,
Emine Yilmaz
|
|
Pages: 664-671 |
|
doi>10.1145/1099554.1099721 |
|
Full text: PDF
|
|
Average precision and R-precision are two of the most commonly cited measures of overall retrieval performance, but their correlation, though well-known, has defied explanation. We recently devised a geometric interpretation of R-precision which suggests ...
Average precision and R-precision are two of the most commonly cited measures of overall retrieval performance, but their correlation, though well-known, has defied explanation. We recently devised a geometric interpretation of R-precision which suggests that under a reasonable set of assumptions, R-precision approximates the area under the precision-recall curve, as does average precision, thus explaining their correlation. In this paper, we consider these assumptions and our geometric interpretation of R-precision in order to further understand, and make reasonable use of, the information that R-precision provides. Given our geometric interpretation of R-precision, we show that R-precision is highly informative by demonstrating that it can be used to (1) accurately infer precision-recall curves, (2) accurately infer other measures of retrieval performance, and (3) devise new measures of retrieval performance. Through our analysis, we also state the conditions under which R-precision is informative. expand
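For readers unfamiliar with the two measures, here is a small self-contained computation of R-precision and average precision from a ranked list (standard textbook definitions, not code from the paper).

    # R-precision vs. average precision on a toy ranking.  'relevant' is the set
    # of relevant document ids, 'ranking' the system's ranked list.
    def r_precision(ranking, relevant):
        r = len(relevant)                       # R = total number of relevant docs
        return sum(1 for d in ranking[:r] if d in relevant) / r

    def average_precision(ranking, relevant):
        hits, precisions = 0, []
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)  # precision at each relevant doc
        return sum(precisions) / len(relevant)

    ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]
    relevant = {"d3", "d1", "d5"}
    print(r_precision(ranking, relevant))       # 2/3: two relevant docs in top R=3
    print(average_precision(ranking, relevant)) # (1/1 + 2/3 + 3/6) / 3 ~= 0.72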
|
|
|
Regularizing ad hoc retrieval scores |
| |
Fernando Diaz
|
|
Pages: 672-679 |
|
doi>10.1145/1099554.1099722 |
|
Full text: PDF
|
|
The cluster hypothesis states: closely related documents tend to be relevant to the same request. We exploit this hypothesis directly by adjusting ad hoc retrieval scores from an initial retrieval so that topically related documents receive similar scores. ...
The cluster hypothesis states: closely related documents tend to be relevant to the same request. We exploit this hypothesis directly by adjusting ad hoc retrieval scores from an initial retrieval so that topically related documents receive similar scores. We refer to this process as score regularization. Score regularization can be presented as an optimization problem, allowing the use of results from semi-supervised learning. We demonstrate that regularized scores consistently and significantly rank documents better than unregularized scores, given a variety of initial retrieval algorithms. We evaluate our method on two large corpora across a substantial number of topics. expand
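A minimal sketch of the general regularization idea follows: initial retrieval scores are repeatedly mixed with the similarity-weighted average score of each document's neighbours in a document graph, so that topically related documents end up with similar scores. The update rule is the standard semi-supervised smoothing iteration and the parameter values are illustrative; it is not necessarily the paper's exact formulation.

    # Score regularization sketch: repeatedly mix each document's score with the
    # (similarity-weighted) average score of its neighbours in a document graph.
    def regularize_scores(initial, neighbours, alpha=0.5, iterations=20):
        """initial: dict doc -> retrieval score.
        neighbours: dict doc -> list of (other_doc, similarity weight)."""
        scores = dict(initial)
        for _ in range(iterations):
            updated = {}
            for doc, s0 in initial.items():
                nbrs = neighbours.get(doc, [])
                total_w = sum(w for _, w in nbrs)
                smoothed = (sum(w * scores[n] for n, w in nbrs) / total_w) if total_w else scores[doc]
                # keep a fraction of the original score, take the rest from neighbours
                updated[doc] = (1 - alpha) * s0 + alpha * smoothed
            scores = updated
        return scores

    initial = {"a": 0.9, "b": 0.1, "c": 0.8}
    neighbours = {"a": [("b", 1.0)], "b": [("a", 1.0)], "c": []}  # a and b are topically close
    print(regularize_scores(initial, neighbours))
    # b is pulled up towards a; c, with no neighbours, keeps its original score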
|
|
|
Incremental test collections |
| |
Ben Carterette,
James Allan
|
|
Pages: 680-687 |
|
doi>10.1145/1099554.1099723 |
|
Full text: PDF
|
|
Corpora and topics are readily available for information retrieval research. Relevance judgments, which are necessary for system evaluation, are expensive; the cost of obtaining them prohibits in-house evaluation of retrieval systems on new corpora or ...
Corpora and topics are readily available for information retrieval research. Relevance judgments, which are necessary for system evaluation, are expensive; the cost of obtaining them prohibits in-house evaluation of retrieval systems on new corpora or new topics. We present an algorithm for cheaply constructing sets of relevance judgments. Our method intelligently selects documents to be judged and decides when to stop in such a way that with very little work there can be a high degree of confidence in the result of the evaluation. We demonstrate the algorithm's effectiveness by showing that it produces small sets of relevance judgments that reliably discriminate between two systems. The algorithm can be used to incrementally design retrieval systems by simultaneously comparing sets of systems. The number of additional judgments needed after each incremental design change decreases at a rate reciprocal to the number of systems being compared. To demonstrate the effectiveness of our method, we evaluate TREC ad hoc submissions, showing that with 95% fewer relevance judgments we can reach a Kendall's tau rank correlation of at least 0.9. expand
|
|
|
SESSION: Paper session IR-10 (information retrieval): query expansion |
|
|
|
|
Query expansion using term relationships in language models for information retrieval |
| |
Jing Bai,
Dawei Song,
Peter Bruza,
Jian-Yun Nie,
Guihong Cao
|
|
Pages: 688-695 |
|
doi>10.1145/1099554.1099725 |
|
Full text: PDF
|
|
Language Modeling (LM) has been successfully applied to Information Retrieval (IR). However, most of the existing LM approaches only rely on term occurrences in documents, queries and document collections. In traditional unigram based models, terms (or ...
Language Modeling (LM) has been successfully applied to Information Retrieval (IR). However, most of the existing LM approaches only rely on term occurrences in documents, queries and document collections. In traditional unigram based models, terms (or words) are usually considered to be independent. In some recent studies, dependence models have been proposed to incorporate term relationships into LM, so that links can be created between words in the same sentence, and term relationships (e.g. synonymy) can be used to expand the document model. In this study, we further extend this family of dependence models in the following two ways: (1) Term relationships are used to expand the query model instead of the document model, so that the query expansion process can be implemented naturally; (2) We exploit more sophisticated inferential relationships extracted with Information Flow (IF). Information flow relationships are not simply pairwise term relationships like those used in previous studies, but hold between a set of terms and another term. They allow for context-dependent query expansion. Our experiments conducted on TREC collections show that we can obtain large and significant improvements with our approach. This study shows that LM is an appropriate framework to implement effective query expansion. expand
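A small sketch of expanding the query model with pairwise term relationships is shown below: the expanded model interpolates the maximum-likelihood query model with probability mass routed through related terms, P'(w|Q) = lambda * P_ml(w|Q) + (1 - lambda) * sum_u P(w|u) * P_ml(u|Q). The toy relation table is invented, and the paper's information-flow relationships hold between term sets and terms, which this pairwise version does not capture.

    # Query-model expansion with pairwise term relationships.
    from collections import Counter

    def expand_query_model(query_terms, relations, lam=0.7):
        """relations: dict term u -> dict of related terms w with P(w | u)."""
        counts = Counter(query_terms)
        total = sum(counts.values())
        p_ml = {u: c / total for u, c in counts.items()}

        expanded = Counter()
        for u, p_u in p_ml.items():
            expanded[u] += lam * p_u
            # a term with no known relations simply relates to itself,
            # which keeps the expanded model normalised
            for w, p_w_given_u in relations.get(u, {u: 1.0}).items():
                expanded[w] += (1 - lam) * p_w_given_u * p_u
        return dict(expanded)

    relations = {"java": {"programming": 0.6, "coffee": 0.4}}   # illustrative only
    print(expand_query_model(["java", "tutorial"], relations))
    # mass flows from 'java' to 'programming' and 'coffee'; 'tutorial' is unchanged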
|
|
|
Concept-based interactive query expansion |
| |
Bruno M. Fonseca,
Paulo Golgher,
Bruno Pôssas,
Berthier Ribeiro-Neto,
Nivio Ziviani
|
|
Pages: 696-703 |
|
doi>10.1145/1099554.1099726 |
|
Full text: PDF
|
|
Despite the recent advances in search quality, the fast increase in the size of the Web collection has introduced new challenges for Web ranking algorithms. In fact, there are still many situations in which the users are presented with imprecise or very ...
Despite the recent advances in search quality, the fast increase in the size of the Web collection has introduced new challenges for Web ranking algorithms. In fact, there are still many situations in which the users are presented with imprecise or very poor results. One of the key difficulties is the fact that users usually submit very short and ambiguous queries, and they do not fully specify their information needs. That is, it is necessary to improve the query formation process if better answers are to be provided. In this work we propose a novel concept-based query expansion technique, which allows disambiguating queries submitted to search engines. The concepts are extracted by analyzing and locating cycles in a special type of query relations graph. This is a directed graph built from query relations mined using association rules. The concepts related to the current query are then shown to the user, who selects the concept that he judges most related to his query. This concept is used to expand the original query, and the expanded query is processed instead. Using a Web test collection, we show that our approach leads to gains in average precision figures of roughly 32%. Further, if the user also provides information on the type of relation between his query and the selected concept, the gains in average precision go up to roughly 52%. expand
|
|
|
Query expansion using random walk models |
| |
Kevyn Collins-Thompson,
Jamie Callan
|
|
Pages: 704-711 |
|
doi>10.1145/1099554.1099727 |
|
Full text: PDF
|
|
It has long been recognized that capturing term relationships is an important aspect of information retrieval. Even with large amounts of data, we usually only have significant evidence for a fraction of all potential term pairs. It is therefore important ...
It has long been recognized that capturing term relationships is an important aspect of information retrieval. Even with large amounts of data, we usually only have significant evidence for a fraction of all potential term pairs. It is therefore important to consider whether multiple sources of evidence may be combined to predict term relations more accurately. This is particularly important when trying to predict the probability of relevance of a set of terms given a query, which may involve both lexical and semantic relations between the terms. We describe a Markov chain framework that combines multiple sources of knowledge on term associations. The stationary distribution of the model is used to obtain probability estimates that a potential expansion term reflects aspects of the original query. We use this model for query expansion and evaluate the effectiveness of the model by examining the accuracy and robustness of the expansion methods, and investigate the relative effectiveness of various sources of term evidence. Statistically significant differences in accuracy were observed depending on the weighting of evidence in the random walk. For example, using co-occurrence data later in the walk was generally better than using it early, suggesting further improvements in effectiveness may be possible by learning walk behaviors. expand
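The sketch below illustrates the underlying machinery in miniature: several sources of term-association evidence are mixed into one transition distribution, and a random walk with restart from the query terms yields probabilities that rank candidate expansion terms. The mixing weights and the toy association tables are invented for illustration, and terms with no outgoing associations simply drop their mass in this toy version.

    # Random-walk query expansion sketch: mix several term-association sources
    # into one transition distribution, then run a walk with restart from the
    # query terms.  The resulting probabilities rank candidate expansion terms.
    def mix_sources(sources, weights):
        """sources: list of dicts term -> {term: strength}; weights sum to 1."""
        mixed = {}
        for src, w in zip(sources, weights):
            for u, row in src.items():
                total = sum(row.values())
                for v, s in row.items():
                    mixed.setdefault(u, {}).setdefault(v, 0.0)
                    mixed[u][v] += w * s / total        # row-normalise each source
        return mixed

    def walk_with_restart(transitions, seeds, restart=0.3, steps=50):
        prob = {t: 1.0 / len(seeds) for t in seeds}
        for _ in range(steps):
            nxt = {t: restart / len(seeds) for t in seeds}
            for u, p in prob.items():
                for v, t_uv in transitions.get(u, {}).items():
                    nxt[v] = nxt.get(v, 0.0) + (1 - restart) * p * t_uv
            prob = nxt
        return sorted(prob.items(), key=lambda kv: -kv[1])

    cooccurrence = {"parkinson": {"disease": 3.0, "law": 1.0}}   # toy evidence
    synonyms     = {"parkinson": {"parkinsonism": 1.0}}
    transitions = mix_sources([cooccurrence, synonyms], [0.7, 0.3])
    print(walk_with_restart(transitions, ["parkinson"])[:3])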
|
|
|
SESSION: Paper session DB-9 (databases): query processing 1 |
|
|
|
|
Semantic querying of tree-structured data sources using partially specified tree patterns |
| |
Dimitri Theodoratos,
Theodore Dalamagas,
Antonis Koufopoulos,
Narain Gehani
|
|
Pages: 712-719 |
|
doi>10.1145/1099554.1099729 |
|
Full text: PDF
|
|
Nowadays, huge volumes of data are organized or exported in a tree-structured form. Querying capabilities are provided through queries that are based on branching path expression. Even for a single knowledge domain structural differences raise difficulties ...
Nowadays, huge volumes of data are organized or exported in a tree-structured form. Querying capabilities are provided through queries that are based on branching path expressions. Even for a single knowledge domain, structural differences raise difficulties for querying data sources in a uniform way. In this paper, we present a method for semantically querying tree-structured data sources using partially specified tree patterns. Based on dimensions, which are sets of semantically related nodes in tree structures, we define dimension graphs. Dimension graphs can be automatically extracted from trees and abstract their structural information. They are semantically rich constructs that support the formulation of queries and their efficient evaluation. We design a tree-pattern query language to query multiple tree-structured data sources. A central feature of this language is that the structure can be specified fully, partially, or not at all in the queries. Therefore, it can be used to query multiple trees with structural differences. We study the derivation of structural expressions in queries by introducing a set of inference rules for structural expressions. We define two types of query unsatisfiability and provide necessary and sufficient conditions for checking each of them. Our approach is validated through experimental evaluation. expand
|
|
|
Selectivity-based partitioning: a divide-and-union paradigm for effective query optimization |
| |
Neoklis Polyzotis
|
|
Pages: 720-727 |
|
doi>10.1145/1099554.1099730 |
|
Full text: PDF
|
|
Modern query optimizers select an efficient join ordering for a physical execution plan based essentially on the average join selectivity factors among the referenced tables. In this paper, we argue that this "monolithic" approach can miss important ...
Modern query optimizers select an efficient join ordering for a physical execution plan based essentially on the average join selectivity factors among the referenced tables. In this paper, we argue that this "monolithic" approach can miss important opportunities for the effective optimization of relational queries. We propose selectivity-based partitioning, a novel optimization paradigm that takes into account the join correlations among relation fragments in order to essentially enable multiple (and more effective) join orders for the evaluation of a single query. In a nutshell, the basic idea is to carefully partition a relation according to the selectivities of the join operations, and subsequently rewrite the query as a union of constituent queries over the computed partitions. We provide a formal definition of the related optimization problem and derive properties that characterize the set of optimal solutions. Based on our analysis, we develop a heuristic algorithm for computing efficiently an effective partitioning of the input query. Results from a preliminary experimental study verify the effectiveness of the proposed approach and demonstrate its potential as an effective optimization technique. expand
|
|
|
Efficient evaluation of parameterized pattern queries |
| |
Cédric du Mouza,
Philippe Rigaux,
Michel Scholl
|
|
Pages: 728-735 |
|
doi>10.1145/1099554.1099731 |
|
Full text: PDF
|
|
Many applications rely on sequence databases and use extensively pattern-matching queries to retrieve data of interest. This paper extends the traditional pattern-matching expressions to parameterized patterns, featuring variables. Parameterized ...
Many applications rely on sequence databases and make extensive use of pattern-matching queries to retrieve data of interest. This paper extends traditional pattern-matching expressions to parameterized patterns, featuring variables. Parameterized patterns are more expressive and allow concise definition of regular expressions that would be very complex to describe without variables. They can also be used to express additional constraints on patterns' variables. We show that they can be evaluated without additional cost with respect to traditional techniques (e.g., the Knuth-Morris-Pratt algorithm). We describe an algorithm that enjoys low memory and CPU time requirements, and provide experimental results which illustrate the gain of the optimized solution. expand
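A toy matcher for parameterized patterns follows: pattern items beginning with '?' are variables that must bind to the same value wherever they reappear within a match. The paper integrates such patterns into KMP-style sequential scanning; the naive scan below does not attempt that optimization.

    # Toy parameterized pattern matching: literal items must match exactly,
    # variable items (prefixed with '?') must bind consistently within a match.
    VAR_PREFIX = "?"          # '?x' denotes a variable named x

    def match_at(seq, start, pattern):
        bindings = {}
        for offset, item in enumerate(pattern):
            value = seq[start + offset]
            if item.startswith(VAR_PREFIX):
                if item in bindings and bindings[item] != value:
                    return None               # inconsistent re-binding of the variable
                bindings[item] = value
            elif item != value:
                return None
        return bindings

    def find_matches(seq, pattern):
        results = []
        for start in range(len(seq) - len(pattern) + 1):
            b = match_at(seq, start, pattern)
            if b is not None:
                results.append((start, b))
        return results

    trades = ["buy", "ibm", "hold", "buy", "sun", "sell", "sun"]
    print(find_matches(trades, ["buy", "?x", "sell", "?x"]))
    # [(3, {'?x': 'sun'})] -- only the buy/sell of the *same* item matches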
|
|
|
SESSION: Paper session IR-11 (information retrieval): novelty detection |
|
|
|
|
Redundant documents and search effectiveness |
| |
Yaniv Bernstein,
Justin Zobel
|
|
Pages: 736-743 |
|
doi>10.1145/1099554.1099733 |
|
Full text: PDF
|
|
The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous ...
The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques --- particularly document fingerprinting --- for detecting content equivalence. Using these techniques on the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics were accurately identifying content-equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: we found that 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant. expand
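A minimal sketch of syntactic fingerprinting for content equivalence is given below: hash every k-word window (shingle) of a document and measure resemblance as the Jaccard coefficient of the fingerprint sets. Production systems keep only a selected subset of fingerprints and apply tuned thresholds; the k value and example texts here are illustrative.

    # Shingle fingerprinting sketch: hash every k-word window of a document and
    # measure resemblance as the Jaccard coefficient of the fingerprint sets.
    import hashlib

    def fingerprints(text, k=3):
        words = text.lower().split()
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
        return {hashlib.md5(s.encode()).hexdigest()[:8] for s in shingles}

    def resemblance(doc_a, doc_b, k=3):
        fa, fb = fingerprints(doc_a, k), fingerprints(doc_b, k)
        if not fa or not fb:
            return 0.0
        return len(fa & fb) / len(fa | fb)

    a = "the committee approved the budget for the next fiscal year"
    b = "the committee approved the budget for the coming fiscal year"
    c = "quarterly earnings exceeded analyst expectations by a wide margin"
    print(resemblance(a, b))   # high: near-duplicate wording
    print(resemblance(a, c))   # ~0: unrelated content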
|
|
|
Novelty detection based on sentence level patterns |
| |
Xiaoyan Li,
W. Bruce Croft
|
|
Pages: 744-751 |
|
doi>10.1145/1099554.1099734 |
|
Full text: PDF
|
|
The detection of new information in a document stream is an important component of many potential applications. In this paper, a new novelty detection approach based on the identification of sentence level patterns is proposed. Given a user's information ...
The detection of new information in a document stream is an important component of many potential applications. In this paper, a new novelty detection approach based on the identification of sentence level patterns is proposed. Given a user's information need, some patterns in sentences such as combinations of query words, named entities and phrases, may contain more important and relevant information than single words. Therefore, the proposed novelty detection approach focuses on the identification of previously unseen query-related patterns in sentences. Specifically, a query is preprocessed and represented with patterns that include both query words and required answer types. These patterns are used to retrieve sentences, which are then determined to be novel if it is likely that a new answer is present. An analysis of patterns in sentences was performed with data from the TREC 2002 novelty track and experiments on novelty detection were carried out on data from the TREC 2003 and 2004 novelty tracks. The experimental results show that the proposed pattern-based approach significantly outperforms all three baselines in terms of precision at top ranks. expand
|
|
|
Minimal document set retrieval |
| |
Wei Dai,
Rohini Srihari
|
|
Pages: 752-759 |
|
doi>10.1145/1099554.1099735 |
|
Full text: PDF
|
|
This paper presents a novel formulation and approach to the minimal document set retrieval problem. Minimal Document Set Retrieval (MDSR) is a promising information retrieval task in which each query topic is assumed to have different subtopics; ...
This paper presents a novel formulation and approach to the minimal document set retrieval problem. Minimal Document Set Retrieval (MDSR) is a promising information retrieval task in which each query topic is assumed to have different subtopics; the task is to retrieve and rank relevant document sets with maximum coverage but minimum redundancy of subtopics in each set. For this task, we propose three document set retrieval and ranking algorithms: Novelty Based method, Cluster Based method and Subtopic Extraction Based method. In order to evaluate the system performance, we design a new evaluation framework for document set ranking which evaluates both relevance between set and query topic, and redundancy within each set. Finally, we compare the performance of the three algorithms using the TREC interactive track dataset. Experimental results show the effectiveness of our algorithms. expand
|
|
|
SESSION: Paper session IR-12 (information retrieval): IR potpourri |
|
|
|
|
A model for weighting image objects in home photographs |
| |
Jean Martinet,
Yves Chiaramella,
Philippe Mulhem
|
|
Pages: 760-767 |
|
doi>10.1145/1099554.1099737 |
|
Full text: PDF
|
|
The paper presents a contribution to image indexing consisting in a weighting model for visible objects -- or image objects -- in home photographs. To improve its effectiveness this weighting model has been designed according to human perception criteria ...
The paper presents a contribution to image indexing consisting of a weighting model for visible objects -- or image objects -- in home photographs. To improve its effectiveness, this weighting model has been designed according to human perception criteria about what is considered important in photographs. Four basic hypotheses related to human perception are presented, and their validity is estimated against actual observations from a user study. Finally, a formal definition of this weighting model is presented and its consistency with the user study is evaluated. expand
|
|
|
Automatic construction of multifaceted browsing interfaces |
| |
Wisam Dakka,
Panagiotis G. Ipeirotis,
Kenneth R. Wood
|
|
Pages: 768-775 |
|
doi>10.1145/1099554.1099738 |
|
Full text: PDF
|
|
Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Interfaces that use multifaceted ...
Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Interfaces that use multifaceted hierarchies represent a new powerful browsing paradigm which has been proven to be a successful complement to keyword searching. Thus far, multifaceted hierarchies have been created manually or semi-automatically, making it difficult to deploy multifaceted interfaces over a large number of databases. We present automatic and scalable methods for creation of multifaceted interfaces. Our methods are integrated with traditional relational databases and can scale well for large databases. Furthermore, we present methods for selecting the best portions of the generated hierarchies when the screen space is not sufficient for displaying all the hierarchy at once. We apply our technique to a range of large data sets, including annotated images, television programming schedules, and web pages. The results are promising and suggest directions for future research. expand
|
|
|
Fast on-line index construction by geometric partitioning |
| |
Nicholas Lester,
Alistair Moffat,
Justin Zobel
|
|
Pages: 776-783 |
|
doi>10.1145/1099554.1099739 |
|
Full text: PDF
|
|
Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line ...
Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction -- that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon as they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction. expand
|
|
|
SESSION: Paper session DB-10 (databases): query processing 2 |
|
|
|
|
Optimizing cursor movement in holistic twig joins |
| |
Marcus Fontoura,
Vanja Josifovski,
Eugene Shekita,
Beverly Yang
|
|
Pages: 784-791 |
|
doi>10.1145/1099554.1099741 |
|
Full text: PDF
|
|
Holistic twig join algorithms represent the state of the art for evaluating path expressions in XML queries. Using inverted indexes on XML elements, holistic twig joins move a set of index cursors in a coordinated way to quickly find structural matches. ...
Holistic twig join algorithms represent the state of the art for evaluating path expressions in XML queries. Using inverted indexes on XML elements, holistic twig joins move a set of index cursors in a coordinated way to quickly find structural matches. Because each cursor move can trigger I/O, the performance of a holistic twig join is largely determined by how many cursor moves it makes, yet, surprisingly, existing join algorithms have not been optimized along these lines. In this paper, we describe TwigOptimal, a new holistic twig join algorithm with optimal cursor movement. We sketch the proof of TwigOptimal's optimality, and describe how TwigOptimal can use information in the return clause of XQuery to boost its performance. Finally, experimental results are presented, showing TwigOptimal's superiority over existing holistic twig join algorithms. expand
|
|
|
Consistent query answering under key and exclusion dependencies: algorithms and experiments |
| |
Luca Grieco,
Domenico Lembo,
Riccardo Rosati,
Marco Ruzzi
|
|
Pages: 792-799 |
|
doi>10.1145/1099554.1099742 |
|
Full text: PDF
|
|
Research in consistent query answering studies the definition and computation of "meaningful" answers to queries posed to inconsistent databases, i.e., databases whose data do not satisfy the integrity constraints (ICs) declared on their ...
Research in consistent query answering studies the definition and computation of "meaningful" answers to queries posed to inconsistent databases, i.e., databases whose data do not satisfy the integrity constraints (ICs) declared on their schema. Computing consistent answers to conjunctive queries is generally coNP-hard in data complexity, even in the presence of very restricted forms of ICs (single, unary keys). Recent studies on consistent query answering for database schemas containing only key dependencies have analyzed the possibility of identifying classes of queries whose consistent answers can be obtained by a first-order rewriting of the query, which in turn can be easily formulated in SQL and directly evaluated through any relational DBMS. In this paper we study consistent query answering in the presence of key dependencies and exclusion dependencies. We first prove that even in the presence of only exclusion dependencies the problem is coNP-hard in data complexity, and define a general method for consistent answering of conjunctive queries under key and exclusion dependencies, based on the rewriting of the query in Datalog with negation. Then, we identify a subclass of conjunctive queries that can be first-order rewritten in the presence of key and exclusion dependencies, and define an algorithm for computing the first-order rewriting of a query belonging to such a class of queries. Finally, we compare the relative efficiency of the two methods for processing queries in the subclass above mentioned. Experimental results, conducted on a real and large database of the computer science engineering degrees of the University of Rome "La Sapienza", clearly show the computational advantage of the first-order based technique. expand
|
|
|
Balancing performance and confidentiality in air index |
| |
Qingzhao Tan,
Wang-Chien Lee,
Baihua Zheng,
Peng Liu,
Dik Lun Lee
|
|
Pages: 800-807 |
|
doi>10.1145/1099554.1099743 |
|
Full text: PDF
|
|
Studies on the performance issues (i.e., access latency and energy conservation) of wireless data broadcast have appeared in the literature. However, the important security issues have not been well addressed. This paper investigates the tradeoff between ...
Studies on the performance issues (i.e., access latency and energy conservation) of wireless data broadcast have appeared in the literature. However, the important security issues have not been well addressed. This paper investigates the tradeoff between performance and security of signature-based air index schemes in wireless data broadcast. From the performance perspective, keeping a low false drop probability helps clients retrieve information from a broadcast channel efficiently. Meanwhile, from the security perspective, achieving a high false guess probability prevents an attacker from guessing the information easily. There is a tradeoff between these two aspects. An administrator of the wireless broadcast system may balance this tradeoff by carefully configuring the signatures used in broadcast. This study provides guidance on parameter settings for the signature schemes in order to meet performance and security requirements. Experiments are performed to validate the analytical results and to obtain the optimal signature configuration corresponding to different application criteria. expand
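The tension can be illustrated with the standard superimposed-coding approximation (a textbook estimate, not the paper's exact analysis): each of D indexed terms sets k bits in an F-bit signature, and the chance that an unrelated signature matches a query term follows from the expected bit density.

    # Superimposed-coding signatures: each of D indexed terms sets k bits in an
    # F-bit signature.  A query term "matches" a signature if its k bits are all
    # set, so the false drop behaviour is governed by the fraction of set bits.
    def bit_density(F, k, D):
        """Expected fraction of set bits after superimposing D terms."""
        return 1.0 - (1.0 - 1.0 / F) ** (k * D)

    def false_match_probability(F, k, D):
        """Probability an unrelated signature happens to have all k query bits set."""
        return bit_density(F, k, D) ** k

    for F in (64, 128, 256, 512):
        p = false_match_probability(F, k=4, D=10)
        print(f"signature width {F:4d} bits -> false match probability {p:.4f}")
    # Wider signatures lower the false drop rate (better performance) but also
    # make the broadcast content easier to guess -- the tradeoff the paper balances.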
|
|
|
SESSION: Paper session IR-13 (information retrieval): context and personalization |
|
|
|
|
Context modeling and discovery using vector space bases |
| |
Massimo Melucci
|
|
Pages: 808-815 |
|
doi>10.1145/1099554.1099745 |
|
Full text: PDF
|
|
In this paper, context is modeled by vector space bases and its evolution is modeled by linear transformations from one base to another. Each document or query can be associated to a distinct base, which corresponds to one context. Also, algorithms are ...
In this paper, context is modeled by vector space bases and its evolution is modeled by linear transformations from one basis to another. Each document or query can be associated with a distinct basis, which corresponds to one context. Also, algorithms are proposed to discover contexts from documents, queries, or groups of them. Linear algebra can thus be employed in a mathematical framework to process context, its evolution, and its application. expand
|
|
|
Y!Q: contextual search at the point of inspiration |
| |
Reiner Kraft,
Farzin Maghoul,
Chi Chao Chang
|
|
Pages: 816-823 |
|
doi>10.1145/1099554.1099746 |
|
Full text: PDF
|
|
Contextual search tries to better capture a user's information need by augmenting the user's query with contextual information extracted from the search context (for example, terms from the web page the user is currently reading or a file the user is ...
Contextual search tries to better capture a user's information need by augmenting the user's query with contextual information extracted from the search context (for example, terms from the web page the user is currently reading or a file the user is currently editing). This paper presents Y!Q, a first-of-its-kind large-scale contextual search system, and provides an overview of its system design and architecture. Y!Q solves two major problems. First, how to capture high quality search context. Second, how to use that context to improve the relevancy of search queries. To address the first problem, Y!Q introduces an information widget that captures precise search context and provides convenient access to its functionality at the point of inspiration. For example, Y!Q can be easily embedded into web pages using a web API, or it can be integrated into a web browser toolbar. This paper provides an overview of Y!Q's user interaction design, highlighting its novel aspects for capturing high quality search context. To address the second problem, Y!Q uses a semantic network for analyzing search context, possibly resolving ambiguous terms, and generating a contextual digest comprising its key concepts. This digest is passed through a query planner and rewriting framework for augmenting a user's search query with relevant context terms to improve the overall search relevancy and experience. We show experimental results comparing contextual Y!Q search results side-by-side with regular Yahoo! web search results. This evaluation suggests that Y!Q results are considered significantly more relevant. The paper also identifies interesting research problems and argues that contextual search may represent the next major step in the evolution of web search engines. expand
|
|
|
Implicit user modeling for personalized search |
| |
Xuehua Shen,
Bin Tan,
ChengXiang Zhai
|
|
Pages: 824-831 |
|
doi>10.1145/1099554.1099747 |
|
Full text: PDF
|
|
Information retrieval systems (e.g., web search engines) are critical for overcoming information overload. A major deficiency of existing retrieval systems is that they generally lack user modeling and are not adaptive to individual users, resulting ...
Information retrieval systems (e.g., web search engines) are critical for overcoming information overload. A major deficiency of existing retrieval systems is that they generally lack user modeling and are not adaptive to individual users, resulting in inherently non-optimal retrieval performance. For example, a tourist and a programmer may use the same word "java" to search for different information, but the current search systems would return the same results. In this paper, we study how to infer a user's interest from the user's search context and use the inferred implicit user model for personalized search. We present a decision theoretic framework and develop techniques for implicit user modeling in information retrieval. We develop an intelligent client-side web search agent (UCAIR) that can perform eager implicit feedback, e.g., query expansion based on previous queries and immediate result reranking based on clickthrough information. Experiments on web search show that our search agent can improve search accuracy over the popular Google search engine. expand
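A tiny sketch of the reranking half of this idea follows: an implicit user model is accumulated as a term vector from previous queries and clicked result snippets, and new results are reranked by mixing the engine's score with similarity to that model. The interpolation weight and example data are invented for illustration; UCAIR's actual decision-theoretic framework is considerably richer.

    # Implicit-feedback reranking sketch: the user model is a term-frequency
    # vector accumulated from past queries and clicked snippets; new results are
    # reranked by interpolating the engine's score with similarity to the model.
    import math
    from collections import Counter

    def terms(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def build_user_model(past_queries, clicked_snippets):
        model = Counter()
        for text in past_queries + clicked_snippets:
            model.update(terms(text))
        return model

    def rerank(results, user_model, beta=0.5):
        """results: list of (snippet, original_score); higher score is better."""
        rescored = [(snip, (1 - beta) * score + beta * cosine(terms(snip), user_model))
                    for snip, score in results]
        return sorted(rescored, key=lambda x: -x[1])

    model = build_user_model(["java island travel"], ["bali and java travel guide beaches"])
    results = [("java programming language tutorial", 0.8),
               ("java island volcanoes and travel tips", 0.7)]
    print([snip for snip, _ in rerank(results, model)])
    # the travel-oriented page moves ahead of the programming tutorial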
|