SESSION: Search engineering 1
Luis Gravano

What's new on the web?: the evolution of the web from a search engine perspective
Alexandros Ntoulas, Junghoo Cho, Christopher Olston
Pages: 1-12
DOI: 10.1145/988672.988674

We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change. Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
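
As an illustration of the search-centric change measure mentioned above, the short sketch below computes TF.IDF cosine distance between two snapshots of a page. It is not the authors' implementation; the tokenizer, the toy background collection used for IDF, and the function names are assumptions made for the example.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        # Naive tokenizer (assumption): lowercase alphanumeric runs only.
        return re.findall(r"[a-z0-9]+", text.lower())

    def tfidf_vector(text, idf):
        tf = Counter(tokenize(text))
        return {t: freq * idf.get(t, 0.0) for t, freq in tf.items()}

    def cosine_distance(v1, v2):
        dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
        n1 = math.sqrt(sum(w * w for w in v1.values()))
        n2 = math.sqrt(sum(w * w for w in v2.values()))
        if n1 == 0 or n2 == 0:
            return 1.0
        return 1.0 - dot / (n1 * n2)

    # Toy background collection for IDF (assumption); a real crawl would be used in practice.
    collection = ["web search engines crawl pages", "pages change over time", "search results ranking"]
    df = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))
    idf = {t: math.log(len(collection) / d) for t, d in df.items()}

    week1 = "web search engines crawl pages every week"
    week2 = "web search engines recrawl pages and rank results"
    print(cosine_distance(tfidf_vector(week1, idf), tfidf_vector(week2, idf)))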

Understanding user goals in web search
Daniel E. Rose, Danny Levinson
Pages: 13-19
DOI: 10.1145/988672.988675

Previous work on understanding user web search behavior has focused on how people search and what they are searching for, but not why they are searching. In this paper, we describe a framework for understanding the underlying goals of user searches, and our experience in using the framework to manually classify queries from a web search engine. Our analysis suggests that so-called "navigational" searches are less prevalent than generally believed, while a previously unexplored "resource-seeking" goal may account for a large fraction of web searches. We also illustrate how this knowledge of user search goals might be used to improve future web search engines.

Impact of search engines on page popularity
Junghoo Cho, Sourashis Roy
Pages: 20-29
DOI: 10.1145/988672.988676

Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines have on the popularity evolution of Web pages. For example, given that search engines return currently "popular" pages at the top of search results, are we somehow penalizing newly created pages that are not very well known yet? Are popular pages getting even more popular and new pages completely ignored? We first show that this unfortunate trend indeed exists on the Web through an experimental study based on real Web data. We then analytically estimate how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results. Our result shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.

SESSION: Security and privacy
Patrick McDaniel

Anti-aliasing on the web
Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins
Pages: 30-39
DOI: 10.1145/988672.988678

It is increasingly common for users to interact with the web using a number of different aliases. This trend is a double-edged sword. On one hand, it is a fundamental building block in approaches to online privacy. On the other hand, there are economic and social consequences to allowing each user an arbitrary number of free aliases. Thus, there is great interest in understanding the fundamental issues in obscuring the identities behind aliases. However, most work in the area has focused on linking aliases through analysis of lower-level properties of interactions such as network routes. We show that aliases that actively post text on the web can be linked together through analysis of that text. We study a large number of users posting on bulletin boards, and develop algorithms to anti-alias those users: we can, with a high degree of success, identify when two aliases belong to the same individual. Our results show that such techniques are surprisingly effective, leading us to conclude that guaranteeing privacy among aliases that post actively requires mechanisms that do not yet exist.

Securing web application code by static analysis and runtime protection
Yao-Wen Huang, Fang Yu, Christian Hang, Chung-Hung Tsai, Der-Tsai Lee, Sy-Yen Kuo
Pages: 40-52
DOI: 10.1145/988672.988679

Security remains a major roadblock to universal acceptance of the Web for many kinds of transactions, especially since the recent sharp increase in remotely exploitable vulnerabilities has been attributed to Web application bugs. Many verification tools are discovering previously unknown vulnerabilities in legacy C programs, raising hopes that the same success can be achieved with Web applications. In this paper, we describe a sound and holistic approach to ensuring Web application security. Viewing Web application vulnerabilities as a secure information flow problem, we created a lattice-based static analysis algorithm derived from type systems and typestate, and addressed its soundness. During the analysis, sections of code considered vulnerable are instrumented with runtime guards, thus securing Web applications in the absence of user intervention. With sufficient annotations, runtime overhead can be reduced to zero. We also created a tool named WebSSARI (Web application Security by Static Analysis and Runtime Inspection) to test our algorithm, and used it to verify 230 open-source Web application projects on SourceForge.net, which were selected to represent projects of different maturity, popularity, and scale. Of these, 69 contained vulnerabilities. After notifying the developers, 38 acknowledged our findings and stated their plans to provide patches. Our statistics also show that static analysis reduced potential runtime overhead by 98.4%.
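
To make the information-flow framing above concrete, here is a minimal, hedged sketch of taint-style checking over a made-up list of statements: values derived from user input are marked tainted, taint propagates through assignments, and a guard (sanitizer) is inserted before tainted data reaches a sensitive sink. It only illustrates the general idea, not the WebSSARI algorithm or its lattice.

    # Toy program: (target, operation, sources) triples -- an assumed mini-IR, not real PHP/Java.
    program = [
        ("name",  "input",  []),            # value comes from the user (tainted)
        ("greet", "concat", ["name"]),      # taint propagates through this assignment
        ("title", "const",  []),            # constant, untainted
        (None,    "sql",    ["greet"]),     # sensitive sink reached by tainted data
        (None,    "sql",    ["title"]),     # safe use of the sink
    ]

    SINKS = {"sql"}

    def analyze(stmts):
        tainted = set()
        instrumented = []
        for target, op, sources in stmts:
            if op == "input":
                tainted.add(target)
            elif op in SINKS:
                if any(s in tainted for s in sources):
                    # Static analysis flags this statement; insert a runtime guard before it.
                    instrumented.append((None, "sanitize", sources))
            elif any(s in tainted for s in sources):
                tainted.add(target)
            instrumented.append((target, op, sources))
        return tainted, instrumented

    tainted_vars, guarded = analyze(program)
    print("tainted:", sorted(tainted_vars))
    for stmt in guarded:
        print(stmt)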

Trust-Serv: model-driven lifecycle management of trust negotiation policies for web services
Halvard Skogsrud, Boualem Benatallah, Fabio Casati
Pages: 53-62
DOI: 10.1145/988672.988680

A scalable approach to trust negotiation is required in Web service environments that have large and dynamic requester populations. We introduce Trust-Serv, a model-driven trust negotiation framework for Web services. The framework employs a model for trust negotiation that is based on state machines, extended with security abstractions. Our policy model supports lifecycle management, an important trait in the dynamic environments that characterize Web services. In particular, we provide a set of change operations to modify policies, and migration strategies that permit ongoing negotiations to be migrated to new policies without being disrupted. Experimental results show the performance benefit of these strategies. The proposed approach has been implemented as a container-centric mechanism that is transparent to the Web services and to the developers of Web services, simplifying Web service development and management as well as enabling scalable deployments.

SESSION: Usability and accessibility
Bay-Wei Chang

SmartBack: supporting users in back navigation
Natasa Milic-Frayling, Rachel Jones, Kerry Rodden, Gavin Smyth, Alan Blackwell, Ralph Sommerer
Pages: 63-71
DOI: 10.1145/988672.988682

This paper presents the design and user evaluation of SmartBack, a feature that complements the standard Back button by enabling users to jump directly to key pages in their navigation session, making common navigation activities more efficient. Defining key pages was informed by the findings of a user study that involved detailed monitoring of Web usage and analysis of Web browsing in terms of navigation trails. The pages accessible through SmartBack are determined automatically based on the structure of the user's navigation trails or page association with specific user activities, such as search or browsing bookmarked sites. We discuss implementation decisions and present results of a usability study in which we deployed the SmartBack prototype and monitored usage for a month in both corporate and home settings. The results show that the feature brings qualitative improvement to the browsing experience of individuals who use it.

Web accessibility: a broader view
John T. Richards, Vicki L. Hanson
Pages: 72-79
DOI: 10.1145/988672.988683

Web accessibility is an important goal. However, most approaches to its attainment are based on unrealistic economic models in which Web content developers are required to spend too much for which they receive too little. We believe this situation is due, in part, to the overly narrow definitions given both to those who stand to benefit from enhanced access to the Web and what is meant by this enhanced access. In this paper, we take a broader view, discussing a complementary approach that costs developers less and provides greater advantages to a larger community of users. While we have quite specific aims in our technical work, we hope it can also serve as an example of how the technical conversation regarding Web accessibility can move beyond the narrow confines of limited adaptations for small populations.

HearSay: enabling audio browsing on hypertext content
I. V. Ramakrishnan, Amanda Stent, Guizhen Yang
Pages: 80-89
DOI: 10.1145/988672.988684

In this paper we present HearSay, a system for browsing hypertext Web documents via audio. The HearSay system is based on our novel approach to automatically creating audio browsable content from hypertext Web documents. It combines two key technologies: (1) automatic partitioning of Web documents through tightly coupled structural and semantic analysis, which transforms raw HTML documents into semantic structures so as to facilitate audio browsing; and (2) VoiceXML, an already standardized technology which we adopt to represent voice dialogs automatically created from the XML output of partitioning. This paper describes the software components of HearSay and presents an initial system evaluation.

SESSION: Information extraction
Roberto Bayardo

Unsupervised learning of soft patterns for generating definitions from online news
Hang Cui, Min-Yen Kan, Tat-Seng Chua
Pages: 90-99
DOI: 10.1145/988672.988686

Breaking news often contains timely definitions and descriptions of current terms, organizations and personalities. We utilize such web sources to construct definitions for such terms. Previous work has identified definitions using hand-crafted rules or supervised learning that constructs rigid, hard text patterns. In contrast, we demonstrate a new approach that uses flexible, soft matching patterns to characterize definition sentences. Our soft patterns are able to effectively accommodate the diversity of definition sentence structure exhibited in news. We use pseudo-relevance feedback to automatically label sentences for use in soft pattern generation. The application of our unsupervised method significantly improves baseline systems on both the standardized TREC corpus as well as crawled online news articles by 27% and 30%, respectively, in terms of F measure. When applied to a state-of-the-art definition generation system recently fielded in the TREC 2003 definitional question answering task, it improves the performance by 14%.

Web-scale information extraction in KnowItAll: (preliminary results)
Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates
Pages: 100-110
DOI: 10.1145/988672.988687

Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. The paper describes preliminary experiments in which an instance of KnowItAll, running for four days on a single machine, was able to automatically extract 54,753 facts. KnowItAll associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes KnowItAll's architecture and reports on lessons learned for the design of large-scale information extraction systems.

Is question answering an acquired skill?
Ganesh Ramakrishnan, Soumen Chakrabarti, Deepa Paranjpe, Pushpak Bhattacharyya
Pages: 111-120
DOI: 10.1145/988672.988688

We present a question answering (QA) system which learns how to detect and rank answer passages by analyzing questions and their answers (QA pairs) provided as training data. We built our system in only a few person-months using off-the-shelf components: a part-of-speech tagger, a shallow parser, a lexical network, and a few well-known supervised learning algorithms. In contrast, many of the top TREC QA systems are large group efforts, using customized ontologies, question classifiers, and highly tuned ranking functions. Our ease of deployment arises from using generic, trainable algorithms that exploit simple feature extractors on QA pairs. With TREC QA data, our system achieves mean reciprocal rank (MRR) that compares favorably with the best scores in recent years, and generalizes from one corpus to another. Our key technique is to recover, from the question, fragments of what might have been posed as a structured query, had a suitable schema been available. One fragment comprises selectors: tokens that are likely to appear (almost) unchanged in an answer passage. The other fragment contains question tokens which give clues about the answer type, and are expected to be replaced in the answer passage by tokens which specialize or instantiate the desired answer type. Selectors are like constants in where-clauses in relational queries, and answer types are like column names. We present new algorithms for locating selectors and answer type clues and using them in scoring passages with respect to a question.

SESSION: Mobility
Fred Douglis

Session level techniques for improving web browsing performance on wireless links
Pablo Rodriguez, Sarit Mukherjee, Sampath Rangarajan
Pages: 121-130
DOI: 10.1145/988672.988690

Recent observations through experiments that we have performed in current third generation wireless networks have revealed that the achieved throughput over wireless links varies widely depending on the application. In particular, the throughputs achieved by a file transfer application (FTP) and a web browsing application (HTTP) are quite different. The throughput achieved over an HTTP session is much lower than that achieved over an FTP session. The reason for the lower HTTP throughput is that the HTTP protocol is affected by the large Round-Trip Time (RTT) across wireless links. HTTP transfers require multiple TCP connections and DNS lookups before an HTTP page can be displayed. Each TCP connection requires several RTTs to fully open the TCP send window and each DNS lookup requires several RTTs before resolving the domain name to IP mapping. These TCP/DNS RTTs significantly degrade the performance of HTTP over wireless links. To overcome these problems, we have developed session level optimization techniques to enhance HTTP download mechanisms. These techniques (a) minimize the number of DNS lookups over the wireless link and (b) minimize the number of TCP connections opened by the browser. These optimizations bridge the mismatch caused by wireless links between application-level protocols (such as HTTP) and transport-level protocols (such as TCP). Our solutions do not require any client-side software and can be deployed transparently on a service provider network to provide a 30-50% decrease in end-to-end user perceived latency and a 50-100% increase in data throughput across wireless links for HTTP sessions.

Flexible on-device service object replication with Replets
Dong Zhou, Nayeem Islam, Ali Ismael
Pages: 131-142
DOI: 10.1145/988672.988691

An increasingly large number of Web applications employ service objects such as Servlets to generate dynamic and personalized content. Existing caching infrastructures are not well suited for caching such content in mobile environments because of disconnection and weak connection. One possible approach to this problem is to replicate Web-related application logic to client devices. The challenges to this approach are to deal with client devices that exhibit huge divergence in resource availabilities, to support applications that have different data sharing and coherency requirements, and to accommodate the same application under different deployment environments.

The Replet system targets these challenges. It uses client, server and application capability and preference information (CPI) to direct the replication of service objects to client devices: from the selection of a device for replication and populating the device with client-specific data, to choosing an appropriate replica to serve a given request and maintaining the desired state consistency among replicas. The Replet system exploits on-device replication to enable client-, server- and application-specific cost metrics for replica invocation and synchronization. We have implemented a prototype in the context of Servlet-based Web applications. Our experiment and simulation results demonstrate the viability and significant benefits of CPI-driven on-device service object replication.

Improving web browsing performance on wireless PDAs using thin-client computing
Albert M. Lai, Jason Nieh, Bhagyashree Bohra, Vijayarka Nandikonda, Abhishek P. Surana, Suchita Varshneya
Pages: 143-154
DOI: 10.1145/988672.988692

Web applications are becoming increasingly popular for mobile wireless PDAs. However, web browsing on these systems can be quite slow. An alternative approach is handheld thin-client computing, in which the web browser and associated application logic run on a server, which then sends simple screen updates to the PDA for display. To assess the viability of this thin-client approach, we compare the web browsing performance of thin clients against fat clients that run the web browser locally on a PDA. Our results show that thin clients can provide better web browsing performance compared to fat clients, both in terms of speed and ability to correctly display web content. Surprisingly, thin clients are faster even when having to send more data over the network. We characterize and analyze different design choices in various thin-client systems and explain why these approaches can yield superior web browsing performance on mobile wireless PDAs.

SESSION: XML
Bebo White

XVM: a bridge between XML data and its behavior
Quanzhong Li, Michelle Y. Kim, Edward So, Steve Wood
Pages: 155-163
DOI: 10.1145/988672.988694

XML has become one of the core technologies for contemporary business applications, especially web-based applications. To facilitate processing of diverse XML data, we propose an extensible, integrated XML processing architecture, the XML Virtual Machine (XVM), which connects XML data with their behaviors. At the same time, the XVM is also a framework for developing and deploying XML-based applications. Using component-based techniques, the XVM supports arbitrary granularity and provides a high degree of modularity and reusability. XVM components are dynamically loaded and composed during XML data processing. Using the XVM, both client-side and server-side XML applications can be developed and deployed in an integrated way. We also present an XML application container built on top of the XVM along with several sample applications to demonstrate the applicability of the XVM framework.

SchemaPath, a minimal extension to XML Schema for conditional constraints
Claudio Sacerdoti Coen, Paolo Marinelli, Fabio Vitali
Pages: 164-174
DOI: 10.1145/988672.988695

In the past few years, a number of constraint languages for XML documents have been proposed. They are cumulatively called schema languages or validation languages and they comprise, among others, DTD, XML Schema, RELAX NG, Schematron, DSD, and xlinkit. One major point of discrimination among schema languages is the support of co-constraints, or co-occurrence constraints, e.g., requiring that attribute A is present if and only if attribute B is (or is not) present in the same element. Although there is no way in XML Schema to express these requirements, they are in fact frequently used in many XML document types, usually only expressed in plain human-readable text, and validated by means of special code modules by the relevant applications. In this paper we propose SchemaPath, a light extension of XML Schema to handle conditional constraints on XML documents. Two new constructs have been added to XML Schema: conditions -- based on XPath patterns -- on type assignments for elements and attributes; and a new simple type, xsd:error, for the direct expression of negative constraints (e.g., it is prohibited for attribute A to be present if attribute B is also present). A proof-of-concept implementation is provided. A Web interface is publicly accessible for experiments and assessments of the real expressiveness of the proposed extension.
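
For readers unfamiliar with co-constraints, the sketch below checks the example constraint from the abstract -- attribute A present if and only if attribute B is present -- over a toy document. It is plain Python rather than SchemaPath syntax, and the element and attribute names are invented for the illustration.

    import xml.etree.ElementTree as ET

    # Toy document (assumption): "discount" must appear if and only if "price" appears.
    doc = ET.fromstring("""
    <catalog>
      <item price="10" discount="2"/>
      <item/>
      <item price="10"/>
    </catalog>
    """)

    def check_co_constraint(root, tag, attr_a, attr_b):
        """Report elements violating: attr_a present <=> attr_b present."""
        violations = []
        for elem in root.iter(tag):
            if (attr_a in elem.attrib) != (attr_b in elem.attrib):
                violations.append(ET.tostring(elem, encoding="unicode").strip())
        return violations

    for bad in check_co_constraint(doc, "item", "price", "discount"):
        print("co-constraint violated:", bad)   # flags only the third <item>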

Composite events for XML
Martin Bernauer, Gerti Kappel, Gerhard Kramler
Pages: 175-183
DOI: 10.1145/988672.988696

Recently, active behavior has received attention in the XML field as a way to react automatically to events that occur. Aside from proprietary approaches for enriching XML with active behavior, the W3C standardized the Document Object Model (DOM) Event Module for the detection of events in XML documents. When using any of these approaches, however, it is often impossible to decide which event to react upon, because not a single event but a combination of multiple events, i.e., a composite event, determines a situation to react upon. The paper presents the first approach for detecting composite events in XML documents by addressing the peculiarities of XML events, which are caused by their hierarchical order in addition to their temporal order. It also provides for the detection of satisfied multiplicity constraints defined by XML schemas. Thereby the approach enables applications operating on XML documents to react to composite events which have richer semantics.

SESSION: Learning classifiers
Bing Liu

LiveClassifier: creating hierarchical text classifiers through web corpora
Chien-Chung Huang, Shui-Lung Chuang, Lee-Feng Chien
Pages: 184-192
DOI: 10.1145/988672.988698

Many Web information services utilize techniques of information extraction (IE) to collect important facts from the Web. To create more advanced services, one possible method is to discover thematic information from the collected facts through text classification. However, most conventional text classification techniques rely on manually labelled corpora and are thus ill-suited to cooperate with Web information services with open domains. In this work, we present a system named LiveClassifier that can automatically train classifiers through Web corpora based on user-defined topic hierarchies. Due to its flexibility and convenience, LiveClassifier can be easily adapted for various purposes. New Web information services can be created to fully exploit it; human users can use it to create classifiers for their personal applications. The effectiveness of classifiers created by LiveClassifier is well supported by empirical evidence.

Using URLs and table layout for web classification tasks
L. K. Shih, D. R. Karger
Pages: 193-202
DOI: 10.1145/988672.988699

We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of Web pages can be much improved if, instead of looking at their textual content, we consider each link's URL and the visual placement of those links on a referring page. These features are unusual: rather than being scalar measurements like word counts, they are tree structured---describing the position of the item in a tree. We develop a model and algorithm for machine learning using such tree-structured features. We apply our methods in automated tools for recognizing and blocking Web advertisements and for recommending "interesting" news stories to a reader. Experiments show that our algorithms are both faster and more accurate than those based on the text content of Web documents.

Learning block importance models for web pages
Ruihua Song, Haifeng Liu, Ji-Rong Wen, Wei-Ying Ma
Pages: 203-211
DOI: 10.1145/988672.988700

Previous work shows that a web page can be partitioned into multiple segments or blocks, and often the importance of those blocks in a page is not equivalent. Also, it has been proven that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. However, no uniform approach and model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view about the importance of blocks in web pages. In this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. We define the block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model to assign importance to different segments in the web page. In our experiments, the best model achieves Micro-F1 of 79% and Micro-Accuracy of 85.9%, which is quite close to a person's view.

SESSION: Web site engineering
Andreas Paepcke

Staging transformations for multimodal web interaction management
Michael Narayan, Christopher Williams, Saverio Perugini, Naren Ramakrishnan
Pages: 212-223
DOI: 10.1145/988672.988702

Multimodal interfaces are becoming increasingly ubiquitous with the advent of mobile devices, accessibility considerations, and novel software technologies that combine diverse interaction media. In addition to improving access and delivery capabilities, such interfaces enable flexible and personalized dialogs with websites, much like a conversation between humans. In this paper, we present a software framework for multimodal web interaction management that supports mixed-initiative dialogs between users and websites. A mixed-initiative dialog is one where the user and the website take turns changing the flow of interaction. The framework supports the functional specification and realization of such dialogs using staging transformations -- a theory for representing and reasoning about dialogs based on partial input. It supports multiple interaction interfaces, and offers sessioning, caching, and co-ordination functions through the use of an interaction manager. Two case studies are presented to illustrate the promise of this approach.

Enforcing strict model-view separation in template engines
Terence John Parr
Pages: 224-233
DOI: 10.1145/988672.988703

The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. This situation is due mostly to a lack of formal definition of separation and fear that enforcing separation emasculates a template's power. I show that not only is strict separation a worthy design principle, but that we can enforce separation while providing a potent template engine. I demonstrate my StringTemplate engine, used to build jGuru.com and other commercial sites, at work solving some nontrivial generational tasks. My goal is to formalize the study of template engines, thus providing a common nomenclature, a means of classifying template generational power, and a way to leverage interesting results from formal language theory. I classify three types of restricted templates analogous to Chomsky's type 1..3 grammar classes and formally define separation, including the rules that embody separation. Because this paper provides a clear definition of model-view separation, template engine designers may no longer blindly claim enforcement of separation. Moreover, given theoretical arguments and empirical evidence, programmers no longer have an excuse to entangle model and view.

A flexible framework for engineering "My" portals
Fernando Bellas, Daniel Fernández, Abel Muiño
Pages: 234-243
DOI: 10.1145/988672.988704

There exist many portal servers that support the construction of "My" portals, that is, portals that allow the user to have one or more personal pages composed of a number of personalizable services. The main drawback of current portal servers is their lack of generality and adaptability. This paper presents the design of MyPersonalizer, a J2EE-based framework for engineering My portals. The framework is structured according to the Model-View-Controller and Layers architectural patterns, providing generic, adaptable model and controller layers that implement the typical use cases of a My portal. MyPersonalizer allows for a good separation of roles in the development team: graphical designers (without programming skills) develop the portal view by writing JSP pages, while software engineers implement service plugins and specify framework configuration.

SESSION: Semantic interfaces and OWL tools
Peter Patel-Schneider

Semantic email
Luke McDowell, Oren Etzioni, Alon Halevy, Henry Levy
Pages: 244-254
DOI: 10.1145/988672.988706

This paper investigates how the vision of the Semantic Web can be carried over to the realm of email. We introduce a general notion of semantic email, in which an email message consists of an RDF query or update coupled with corresponding explanatory text. Semantic email opens the door to a wide range of automated, email-mediated applications with formally guaranteed properties. In particular, this paper introduces a broad class of semantic email processes. For example, consider the process of sending an email to a program committee asking who will attend the PC dinner, automatically collecting the responses, and tallying them up. We define both logical and decision-theoretic models where an email process is modeled as a set of updates to a data set on which we specify goals via certain constraints or utilities. We then describe a set of inference problems that arise while trying to satisfy these goals and analyze their computational tractability. In particular, we show that for the logical model it is possible to automatically infer which email responses are acceptable w.r.t. a set of constraints in polynomial time, and for the decision-theoretic model it is possible to compute the optimal message-handling policy in polynomial time. Finally, we discuss our publicly available implementation of semantic email and outline research challenges in this realm.

How to make a semantic web browser
D. A. Quan, D. R. Karger
Pages: 255-265
DOI: 10.1145/988672.988707

Two important architectural choices underlie the success of the Web: numerous, independently operated servers speak a common protocol, and a single type of client, the Web browser, provides point-and-click access to the content and services on these decentralized servers. However, because HTML marries content and presentation into a single representation, end users are often stuck with inappropriate choices made by the Web site designer of how to work with and view the content. RDF metadata on the Semantic Web does not have this limitation: users can gain direct access to information and control over how it is presented. This principle forms the basis for our Semantic Web browser, an end-user application that automatically locates metadata and assembles point-and-click interfaces from a combination of relevant information, ontological specifications, and presentation knowledge, all described in RDF and retrieved dynamically from the Semantic Web. Because data and services are accessed directly through a standalone client and not through a central point of access (e.g., a portal), new content and services can be consumed as soon as they become available. In this way we take advantage of an important sociological force that encourages the production of new Semantic Web content while remaining faithful to the decentralized nature of the Web.

Parsing OWL DL: trees or triples?
Sean K. Bechhofer, Jeremy J. Carroll
Pages: 266-275
DOI: 10.1145/988672.988708

The Web Ontology Language (OWL) defines three classes of documents: Lite, DL, and Full. All RDF/XML documents are OWL Full documents, some OWL Full documents are also OWL DL documents, and some OWL DL documents are also OWL Lite documents. This paper discusses parsing and species recognition -- that is, the process of determining whether a given document falls into the OWL Lite, DL or Full class. We describe two alternative approaches to this task, one based on abstract syntax trees, the other on RDF triples, and compare their key characteristics.

SESSION: Server performance and scalability
Irwin King

A method for transparent admission control and request scheduling in e-commerce web sites
Sameh Elnikety, Erich Nahum, John Tracey, Willy Zwaenepoel
Pages: 276-286
DOI: 10.1145/988672.988710

This paper presents a method for admission control and request scheduling for multiply-tiered e-commerce Web sites, achieving both stable behavior during overload and improved response times. Our method externally observes execution costs of requests online, distinguishing different request types, and performs overload protection and preferential scheduling using relatively simple measurements and a straightforward control mechanism. Unlike previous proposals, which require extensive changes to the server or operating system, our method requires no modifications to the host O.S., Web server, application server or database. Since our method is external, it can be implemented in a proxy. We present such an implementation, called Gatekeeper, using it with standard software components on the Linux operating system. We evaluate the proxy using the industry standard TPC-W workload generator in a typical three-tiered e-commerce environment. We show consistent performance during overload and throughput increases of up to 10 percent. Response time improves by up to a factor of 14, with only a 15 percent penalty to large jobs.

A smart hill-climbing algorithm for application server configuration
Bowei Xi, Zhen Liu, Mukund Raghavachari, Cathy H. Xia, Li Zhang
Pages: 287-296
DOI: 10.1145/988672.988711

The overwhelming success of the Web as a mechanism for facilitating information retrieval and for conducting business transactions has led to an increase in the deployment of complex enterprise applications. These applications typically run on Web Application Servers, which assume the burden of managing many tasks, such as concurrency, memory management, database access, etc., required by these applications. The performance of an Application Server depends heavily on appropriate configuration. Configuration is a difficult and error-prone task due to the large number of configuration parameters and complex interactions between them. We formulate the problem of finding an optimal configuration for a given application as a black-box optimization problem. We propose a smart hill-climbing algorithm using ideas of importance sampling and Latin Hypercube Sampling (LHS). The algorithm is efficient in both searching and random sampling. It consists of estimating a local function, and then hill-climbing in the steepest descent direction. The algorithm also learns from past searches and restarts in a smart and selective fashion using the idea of importance sampling. We have carried out extensive experiments with an on-line brokerage application running in a WebSphere environment. Empirical results demonstrate that our algorithm is more efficient than and superior to traditional heuristic methods.
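
The two ingredients named in the abstract, Latin Hypercube Sampling for global exploration and hill-climbing for local refinement, can be combined as in the rough sketch below. The objective function, parameter ranges, and step schedule are invented for the example; the actual smart hill-climbing algorithm additionally fits a local function and uses importance sampling to guide restarts.

    import random

    def latin_hypercube(n_samples, bounds, rng):
        """One sample per stratum in each dimension, randomly paired across dimensions."""
        dims = []
        for lo, hi in bounds:
            strata = [lo + (hi - lo) * (i + rng.random()) / n_samples for i in range(n_samples)]
            rng.shuffle(strata)
            dims.append(strata)
        return list(zip(*dims))

    def hill_climb(f, start, bounds, rng, step=0.1, iters=200):
        best, best_val = list(start), f(start)
        for _ in range(iters):
            cand = [min(hi, max(lo, x + rng.uniform(-step, step) * (hi - lo)))
                    for x, (lo, hi) in zip(best, bounds)]
            val = f(cand)
            if val < best_val:            # minimizing response time, say
                best, best_val = cand, val
            else:
                step *= 0.95              # shrink the neighborhood when no improvement is found
        return best, best_val

    # Toy "server response time" as a function of two configuration knobs (assumption).
    def response_time(cfg):
        pool_size, heap_frac = cfg
        return (pool_size - 42) ** 2 / 100.0 + (heap_frac - 0.6) ** 2 * 50.0

    rng = random.Random(0)
    bounds = [(1, 100), (0.1, 0.9)]                 # thread pool size, heap fraction
    samples = latin_hypercube(16, bounds, rng)      # global exploration
    start = min(samples, key=response_time)         # best LHS sample seeds the local search
    best_cfg, best_rt = hill_climb(response_time, start, bounds, rng)
    print(best_cfg, best_rt)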

Challenges and practices in deploying web acceleration solutions for distributed enterprise systems
Wen-Syan Li, Wang-Pin Hsiung, Oliver Po, Koji Hino, Kasim Selcuk Candan, Divyakant Agrawal
Pages: 297-308
DOI: 10.1145/988672.988712

For most Web-based applications, contents are created dynamically based on the current state of a business, such as product prices and inventory, stored in database systems. These applications demand personalized content and track user behavior while maintaining application integrity. Many of such practices are not compatible with Web acceleration solutions. Consequently, although many web acceleration solutions have shown promising performance improvement and scalability, architecting and engineering distributed enterprise Web applications to utilize available content delivery networks remains a challenge. In this paper, we examine the challenge to accelerate J2EE-based enterprise web applications. We list obstacles and recommend some practices to transform typical database-driven J2EE applications to cache friendly Web applications where Web acceleration solutions can be applied. Furthermore, such transformation should be done without modification to the underlying application business logic and without sacrificing functions that are essential to e-commerce. We take the J2EE reference software, the Java PetStore, as a case study. By using the proposed guideline, we are able to cache more than 90% of the content in the PetStore and scale up the Web site more than 20 times.

SESSION: Link analysis
Junghoo Cho

Ranking the web frontier
Nadav Eiron, Kevin S. McCurley, John A. Tomlin
Pages: 309-318
DOI: 10.1145/988672.988714

The celebrated PageRank algorithm has proved to be a very effective paradigm for ranking results of web search algorithms. In this paper we refine this basic paradigm to take into account several evolving prominent features of the web, and propose several algorithmic innovations. First, we analyze features of the rapidly growing "frontier" of the web, namely the part of the web that crawlers are unable to cover for one reason or another. We analyze the effect of these pages and find it to be significant. We suggest ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance. Finally we suggest new methods of ranking that are motivated by the hierarchical structure of the web, are more efficient than PageRank, and may be more resistant to direct manipulation.
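
Since this paper and others in this session build on PageRank, a minimal power-iteration sketch of the basic algorithm is included below for reference. The tiny link graph and damping factor are assumptions for the example; the paper's own contributions (handling of the frontier, link rot, and hierarchy) are not reproduced here.

    def pagerank(links, damping=0.85, iters=50):
        """Basic PageRank by power iteration; links maps a page to the pages it points to."""
        pages = sorted(set(links) | {q for targets in links.values() for q in targets})
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - damping) / n for p in pages}
            for p in pages:
                targets = links.get(p, [])
                if targets:
                    share = damping * rank[p] / len(targets)
                    for q in targets:
                        new[q] += share
                else:
                    # Dangling page: spread its rank uniformly over all pages.
                    for q in pages:
                        new[q] += damping * rank[p] / n
            rank = new
        return rank

    # Toy web graph (assumption).
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))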

Link fusion: a unified link analysis framework for multi-type interrelated data objects
Wensi Xi, Benyu Zhang, Zheng Chen, Yizhou Lu, Shuicheng Yan, Wei-Ying Ma, Edward Allan Fox
Pages: 319-327
DOI: 10.1145/988672.988715

Web link analysis has proven to be a significant enhancement for quality based web search. Most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). Unfortunately, most link analysis research only considers one type of link. In this paper, we propose a unified link analysis framework, called "link fusion", which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. The PageRank and HITS algorithms are shown to be special cases of our unified link analysis framework. Experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the HITS and DirectHit algorithms by 24.6% and 38.2% respectively.

Sic transit gloria telae: towards an understanding of the web's decay
Ziv Bar-Yossef, Andrei Z. Broder, Ravi Kumar, Andrew Tomkins
Pages: 328-337
DOI: 10.1145/988672.988716

The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.

SESSION: Optimizing encoding
Jason Nieh

Using link analysis to improve layout on mobile devices
Xinyi Yin, Wee Sun Lee
Pages: 338-344
DOI: 10.1145/988672.988718

Delivering web pages to mobile phones or personal digital assistants has become possible with the latest wireless technology. However, mobile devices have very small screen sizes and memory capacities. Converting web pages for delivery to a mobile device is an exciting new problem. In this paper, we propose to use a ranking algorithm similar to Google's PageRank algorithm to rank the content objects within a web page. This allows the extraction of only important parts of web pages for delivery to mobile devices. Experiments show that the new method is effective. In experiments on pages from randomly selected websites, the system needed to extract and deliver only 39% of the objects in a web page in order to provide 85% of a viewer's desired viewing content. This provides significant savings in the wireless traffic and downloading time while providing a satisfactory reading experience on the mobile device.

An evaluation of binary XML encoding optimizations for fast stream based XML processing
R. J. Bayardo, D. Gruhl, V. Josifovski, J. Myllymaki
Pages: 345-354
DOI: 10.1145/988672.988719

This paper provides an objective evaluation of the performance impacts of binary XML encodings, using a fast stream-based XQuery processor as our representative application. Instead of proposing one binary format and comparing it against standard XML parsers, we investigate the individual effects of several binary encoding techniques that are shared by many proposals. Our goal is to provide a deeper understanding of the performance impacts of binary XML encodings in order to clarify the ongoing and often contentious debate over their merits, particularly in the domain of high performance XML stream processing.

Optimization of HTML automatically generated by WYSIWYG programs
Jacqueline Spiesser, Les Kitchen
Pages: 355-364
DOI: 10.1145/988672.988720

Automatically generated HTML, as produced by WYSIWYG programs, typically contains much repetitive and unnecessary markup. This paper identifies aspects of such HTML that may be altered while leaving a semantically equivalent document, and proposes techniques to achieve optimizing modifications. These techniques include attribute re-arrangement via dynamic programming, the use of style classes, and dead-code removal. These techniques produce documents as small as 33% of original size. The size decreases obtained are still significant when the techniques are used in combination with conventional text-based compression.

SESSION: Semantic web applications
Amit Sheth

Building a companion website in the semantic web
Timothy J. Miles-Board, Christopher P. Bailey, Wendy Hall, Leslie A. Carr
Pages: 365-373
DOI: 10.1145/988672.988722

A problem facing many textbook authors (including one of the authors of this paper) is the inevitable delay between new advances in the subject area and their incorporation in a new (paper) edition of the textbook. This means that some textbooks are quickly considered out of date, particularly in active technological areas such as the Web, even though the ideas presented in the textbook are still valid and important to the community. This paper describes our approach to building a companion website for the textbook Hypermedia and the Web: An Engineering Approach. We use Bloom's taxonomy of educational objectives to critically evaluate a number of authoring and presentation techniques used in existing companion websites, and adapt these techniques to create our own companion website using Semantic Web technologies in order to overcome the identified weaknesses. Finally, we discuss a potential model of future companion websites, in the context of an e-publishing, e-commerce Semantic Web services scenario.
|
|
|
A hybrid approach for searching in the semantic web |
| |
Cristiano Rocha,
Daniel Schwabe,
Marcus Poggi Aragao
|
|
Pages: 374-383 |
|
doi>10.1145/988672.988723 |
|
Full text: PDF
|
|
This paper presents a search architecture that combines classical search techniques with spread activation techniques applied to a semantic model of a given domain. Given an ontology, weights are assigned to links based on certain properties of the ontology, so that they measure the strength of the relation. Spread activation techniques are used to find related concepts in the ontology given an initial set of concepts and corresponding initial activation values. These initial values are obtained from the results of classical search applied to the data associated with the concepts in the ontology. Two test cases were implemented, with very positive results. It was also observed that the proposed hybrid spread activation, combining the symbolic and the sub-symbolic approaches, achieved better results when compared to each of the approaches alone.
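As a rough illustration of the spread activation idea described above, the following Python sketch propagates activation from keyword-search hits over a weighted concept graph. The graph, weights, decay factor and stopping rule are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of spread activation over a weighted concept graph.
    # Nodes are ontology concepts, edge weights estimate relation strength,
    # and initial activations come from a classical keyword search over the
    # data attached to each concept.

    def spread_activation(graph, initial, decay=0.5, threshold=0.01, max_iters=3):
        """graph: {concept: [(neighbor, weight), ...]}; initial: {concept: score}."""
        activation = dict(initial)
        frontier = dict(initial)
        for _ in range(max_iters):
            next_frontier = {}
            for node, value in frontier.items():
                for neighbor, weight in graph.get(node, []):
                    delta = value * weight * decay
                    if delta < threshold:
                        continue
                    activation[neighbor] = activation.get(neighbor, 0.0) + delta
                    next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + delta
            if not next_frontier:
                break
            frontier = next_frontier
        return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)

    # Example: concepts related to "databases" surface via weighted ontology links.
    ontology = {
        "databases": [("xml", 0.8), ("information retrieval", 0.6)],
        "xml": [("xquery", 0.9)],
    }
    print(spread_activation(ontology, {"databases": 1.0}))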
|
|
|
CS AKTive space: representing computer science in the semantic web |
| |
m. c. schraefel,
Nigel R. Shadbolt,
Nicholas Gibbins,
Stephen Harris,
Hugh Glaser
|
|
Pages: 384-392 |
|
doi>10.1145/988672.988724 |
|
Full text: PDF
|
|
We present a Semantic Web application that we call CS AKTive Space. The application exploits a wide range of semantically heterogeneous and distributed content relating to Computer Science research in the UK. This content is gathered on a continuous basis using a variety of methods including harvesting and scraping as well as adopting a range of models for content acquisition. The content currently comprises around ten million RDF triples and we have developed storage, retrieval and maintenance methods to support its management. The content is mediated through an ontology constructed for the application domain and incorporates components from other published ontologies. CS AKTive Space supports the exploration of patterns and implications inherent in the content and exploits a variety of visualisations and multi-dimensional representations. Knowledge services supported in the application include investigating communities of practice: who is working, researching or publishing with whom. This work illustrates a number of substantial challenges for the Semantic Web. These include problems of referential integrity, tractable inference and interaction support. We review our approaches to these issues and discuss relevant related work.
|
|
|
SESSION: Reputation networks |
| |
David Pennock
|
|
|
|
|
Shilling recommender systems for fun and profit |
| |
Shyong K. Lam,
John Riedl
|
|
Pages: 393-402 |
|
doi>10.1145/988672.988726 |
|
Full text: PDF
|
|
Recommender systems have emerged in the past several years as an effective way to help people cope with the problem of information overload. One application in which they have become particularly common is in e-commerce, where recommendation of items can often help a customer find what she is interested in and, therefore, can help drive sales. Unscrupulous producers in the never-ending quest for market penetration may find it profitable to shill recommender systems by lying to the systems in order to have their products recommended more often than those of their competitors. This paper explores four open questions that may affect the effectiveness of such shilling attacks: which recommender algorithm is being used, whether the application is producing recommendations or predictions, how detectable the attacks are by the operator of the system, and what the properties are of the items being attacked. The questions are explored experimentally on a large data set of movie ratings. Taken together, the results of the paper suggest that new ways must be used to evaluate and detect shilling attacks on recommender systems.
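The following toy Python sketch illustrates the general shape of a "push" shilling attack: fake profiles that rate a target item highly are appended to a ratings data set, and the effect is shown on a naive mean-rating predictor. The attack parameters and the mean-based predictor are stand-ins; the paper studies far richer recommender algorithms and attack models.

    def inject_shills(ratings, target_item, filler_items, num_shills=10, high=5, avg=3):
        """ratings: list of (user, item, rating). Appends fake profiles pushing target_item."""
        shilled = list(ratings)
        for i in range(num_shills):
            user = f"shill_{i}"
            shilled.append((user, target_item, high))
            for item in filler_items:
                shilled.append((user, item, avg))
        return shilled

    def item_mean(ratings, item):
        values = [r for _, it, r in ratings if it == item]
        return sum(values) / len(values)

    data = [("u1", "movie_x", 2), ("u2", "movie_x", 3), ("u1", "movie_y", 4)]
    attacked = inject_shills(data, "movie_x", ["movie_y"])
    print(item_mean(data, "movie_x"), "->", item_mean(attacked, "movie_x"))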
|
|
|
Propagation of trust and distrust |
| |
R. Guha,
Ravi Kumar,
Prabhakar Raghavan,
Andrew Tomkins
|
|
Pages: 403-412 |
|
doi>10.1145/988672.988727 |
|
Full text: PDF
|
|
A (directed) network of people connected by ratings or trust scores, and a model for propagating those trust scores, is a fundamental building block in many of today's most successful e-commerce and recommendation systems. We develop a framework of trust propagation schemes, each of which may be appropriate in certain circumstances, and evaluate the schemes on a large trust network consisting of 800K trust scores expressed among 130K people. We show that a small number of expressed trusts/distrust per individual allows us to predict trust between any two people in the system with high accuracy. Our work appears to be the first to incorporate distrust in a computational trust propagation setting.
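A hedged sketch of one member of this propagation-scheme family (direct trust propagation along chains, damped per hop); the paper combines several atomic propagations and also models distrust, which this toy omits. NumPy is assumed available.

    import numpy as np

    def propagate_trust(T, steps=3, alpha=0.5):
        """T[i, j] = trust i expresses in j (0..1). Returns predicted trust after
        propagating along chains of length <= steps, damped by alpha per hop."""
        n = T.shape[0]
        result = np.zeros_like(T)
        hop = np.eye(n)
        for k in range(1, steps + 1):
            hop = hop @ T              # walks of length k through the trust graph
            result += (alpha ** (k - 1)) * hop
        return result

    T = np.array([[0.0, 0.9, 0.0],
                  [0.0, 0.0, 0.8],
                  [0.0, 0.0, 0.0]])
    print(propagate_trust(T))  # predicts some trust from node 0 to node 2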
|
|
|
A community-aware search engine |
| |
Rodrigo B. Almeida,
Virgilio A. F. Almeida
|
|
Pages: 413-421 |
|
doi>10.1145/988672.988728 |
|
Full text: PDF
|
|
Current search technologies work in a "one size fits all" fashion. Therefore, the answer to a query is independent of specific user information needs. In this paper we describe a novel ranking technique for personalized search services that combines content-based and community-based evidences. The community-based information is used in order to provide context for queries and is influenced by the current interaction of the user with the service. Our algorithm is evaluated using data derived from an actual service available on the Web, an online bookstore. We show that the quality of content-based ranking strategies can be improved by the use of community information as another evidential source of relevance. In our experiments the improvements reach up to 48% in terms of average precision.
|
|
|
SESSION: Versioning and fragmentation |
| |
Corey Anderson
|
|
|
|
|
Managing versions of web documents in a transaction-time web server |
| |
Curtis E. Dyreson,
Hui-ling Lin,
Yingxia Wang
|
|
Pages: 422-432 |
|
doi>10.1145/988672.988730 |
|
Full text: PDF
|
|
This paper presents a transaction-time HTTP server, called TTApache that supports document versioning. A document often consists of a main file formatted in HTML or XML and several included files such as images and stylesheets. A change to any of the files associated with a document creates a new version of that document. To construct a document version history, snapshots of the document's files are obtained over time. Transaction times are associated with each file version to record the version's lifetime. The transaction time is the system time of the edit that created the version. Accounting for transaction time is essential to supporting audit queries that delve into past document versions and differential queries that pinpoint differences between two versions. TTApache performs automatic versioning when a document is read thereby removing the burden of versioning from document authors. Since some versions may be created but never read, TTApache distinguishes between known and assumed versions of a document. TTApache has a simple query language to retrieve desired versions. A browser can request a specific version, or the entire history of a document. Queries can also rewrite links and references to point to current or past versions. Over time, the version history of a document continually grows. To free space, some versions can be vacuumed. Vacuuming a version however changes the semantics of requests for that version. This paper presents several policies for vacuuming versions and strategies for accounting for vacuumed versions in queries.
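A toy Python model of transaction-time versioning for a single resource, in the spirit of the description above: each stored version carries the transaction-time interval over which it was current, which supports "as of" retrieval. Class and method names are illustrative, not TTApache's API.

    import time

    class VersionedDocument:
        def __init__(self):
            self.versions = []   # list of (start_ts, end_ts_or_None, content)

        def write(self, content, now=None):
            now = now if now is not None else time.time()
            if self.versions:
                start, _, old = self.versions[-1]
                self.versions[-1] = (start, now, old)    # close the current version
            self.versions.append((now, None, content))

        def as_of(self, ts):
            for start, end, content in self.versions:
                if start <= ts and (end is None or ts < end):
                    return content
            return None

    doc = VersionedDocument()
    doc.write("v1", now=100.0)
    doc.write("v2", now=200.0)
    print(doc.as_of(150.0))  # -> 'v1'
    print(doc.as_of(250.0))  # -> 'v2'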
|
|
|
Fine-grained, structured configuration management for web projects |
| |
Tien Nhut Nguyen,
Ethan Vincent Munson,
Cheng Thao
|
|
Pages: 433-442 |
|
doi>10.1145/988672.988731 |
|
Full text: PDF
|
|
Researchers in Web engineering have regularly noted that existing Web application development environments provide little support for managing the evolution of Web applications. Key limitations of Web development environments include line-oriented change models that inadequately represent Web document semantics and an inability to model changes to link structure or the set of objects making up the Web application. Developers may find it difficult to grasp how the overall structure of the Web application has changed over time and may respond by using ad hoc solutions that lead to problems of maintainability, quality and reliability. Web applications are software artifacts, and as such, can benefit from advanced version control and software configuration management (SCM) technologies from software engineering. We have modified an integrated development environment to manage the evolution and maintenance of Web applications. The resulting environment is distinguished by its fine-grained version control framework, fine-grained Web content change management, and product versioning configuration management, in which a Web project can be organized at the logical level and its structure and components are versioned in a fine-grained manner as well. This paper describes the motivation for this environment as well as its user interfaces, features, and implementation.
|
|
|
Automatic detection of fragments in dynamically generated web pages |
| |
Lakshmish Ramaswamy,
Arun Iyengar,
Ling Liu,
Fred Douglis
|
|
Pages: 443-454 |
|
doi>10.1145/988672.988732 |
|
Full text: PDF
|
|
Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. In order for a web site to use fragment-based content generation, however, good methods are needed for dividing web pages into fragments. Manual fragmentation of web pages is expensive, error prone, and unscalable. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in web sites serving dynamic content. We consider the fragments to be interesting if they are shared among multiple documents or they have different lifetime or personalization characteristics. Our approach has three unique features. First, we propose a hierarchical and fragment-aware model of the dynamic web pages and a data structure that is compact and effective for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. We evaluate the proposed scheme through a series of experiments, showing the benefits and costs of the algorithms. We also study the impact of adopting the fragments detected by our system on disk space utilization and network bandwidth consumption.
|
|
|
SESSION: Semantic annotation and integration |
| |
Carole Goble
|
|
|
|
|
Incremental formalization of document annotations through ontology-based paraphrasing |
| |
Jim Blythe,
Yolanda Gil
|
|
Pages: 455-461 |
|
doi>10.1145/988672.988734 |
|
Full text: PDF
|
|
For the manual semantic markup of documents to become wide-spread, users must be able to express annotations that conform to ontologies (or schemas) that have shared meaning. However, a typical user is unlikely to be familiar with the details of the terms as defined by the ontology authors. In addition, the idea to be expressed may not fit perfectly within a pre-defined ontology. The ideal tool should help users find a partial formalization that closely follows the ontology where possible but deviates from the formal representation where needed. We describe an implemented approach to help users create semi-structured semantic annotations for a document according to an extensible OWL ontology. In our approach, users enter a short sentence in free text to describe all or part of a document, and the system presents a set of potential paraphrases of the sentence that are generated from valid expressions in the ontology, from which the user chooses the closest match. We use a combination of off-the-shelf parsing tools and breadth-first search of expressions in the ontology to help users create valid annotations starting from free text. The user can also define new terms to augment the ontology, so the potential matches can improve over time.
|
|
|
Towards the self-annotating web |
| |
Philipp Cimiano,
Siegfried Handschuh,
Steffen Staab
|
|
Pages: 462-471 |
|
doi>10.1145/988672.988735 |
|
Full text: PDF
|
|
The success of the Semantic Web depends on the availability of ontologies as well as on the proliferation of web pages annotated with metadata conforming to these ontologies. Thus, a crucial question is where to acquire these metadata from. In this paper we propose PANKOW (Pattern-based Annotation through Knowledge on the Web), a method which employs an unsupervised, pattern-based approach to categorize instances with regard to an ontology. The approach is evaluated against the manual annotations of two human subjects. The approach is implemented in OntoMat, an annotation tool for the Semantic Web, and shows very promising results.
|
|
|
Web taxonomy integration using support vector machines |
| |
Dell Zhang,
Wee Sun Lee
|
|
Pages: 472-481 |
|
doi>10.1145/988672.988736 |
|
Full text: PDF
|
|
We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to train a classifier for each category in the master taxonomy, and then classify objects from the source taxonomy into these categories. In this paper we attempt to use a powerful classification method, Support Vector Machine (SVM), to attack this problem. Our key insight is that the availability of the source taxonomy data could be helpful to build better classifiers in this scenario, therefore it would be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose a method, Cluster Shrinkage (CS), to further enhance the classification by exploiting such implicit knowledge. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.
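For orientation only, here is a plain inductive SVM baseline for the classify-into-master-categories setup described above (scikit-learn assumed available). The paper's actual contribution, transductive learning with Cluster Shrinkage, is not reproduced in this sketch.

    # Baseline only: train one multi-class SVM on master-taxonomy documents and
    # classify source-taxonomy documents into master categories. Documents and
    # labels below are made up for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    master_docs = ["laptop battery review", "pasta recipe with basil",
                   "gpu benchmark results", "chocolate cake baking tips"]
    master_labels = ["computing", "cooking", "computing", "cooking"]

    source_docs = ["latest gpu and laptop battery benchmark", "basil pasta baking ideas"]

    vectorizer = TfidfVectorizer()
    X_master = vectorizer.fit_transform(master_docs)
    clf = LinearSVC().fit(X_master, master_labels)
    print(clf.predict(vectorizer.transform(source_docs)))  # likely ['computing', 'cooking']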
|
|
|
SESSION: Mining new media |
| |
Krishna Bharat
|
|
|
|
|
Newsjunkie: providing personalized newsfeeds via analysis of information novelty |
| |
Evgeniy Gabrilovich,
Susan Dumais,
Eric Horvitz
|
|
Pages: 482-490 |
|
doi>10.1145/988672.988738 |
|
Full text: PDF
|
|
We present a principled methodology for filtering news stories by formal measures of information novelty, and show how the techniques can be used to custom-tailor news feeds based on information that a user has already reviewed. We review methods for analyzing novelty and then describe Newsjunkie, a system that personalizes news for users by identifying the novelty of stories in the context of stories they have already reviewed. Newsjunkie employs novelty-analysis algorithms that represent articles as words and named entities. The algorithms analyze inter- and intra-document dynamics by considering how information evolves over time from article to article, as well as within individual articles. We review the results of a user study undertaken to gauge the value of the approach over legacy time-based review of newsfeeds, and also to compare the performance of alternate distance metrics that are used to estimate the dissimilarity between candidate new articles and sets of previously reviewed articles.
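A simplified sketch of novelty scoring in this spirit: articles are represented as bags of words and a candidate is scored by its cosine distance from what the user has already read. The real system also uses named entities and compares several distance metrics.

    import math
    from collections import Counter

    def bag(text):
        return Counter(text.lower().split())

    def cosine_distance(a, b):
        dot = sum(a[t] * b.get(t, 0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return 1.0 - (dot / (na * nb) if na and nb else 0.0)

    def novelty(candidate, already_read):
        seen = Counter()
        for article in already_read:
            seen += bag(article)
        return cosine_distance(bag(candidate), seen)

    read = ["earthquake hits coastal city", "rescue teams reach earthquake zone"]
    print(novelty("earthquake relief funds announced", read))   # lower novelty
    print(novelty("new chip doubles battery life", read))       # higher novelty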
|
|
|
Information diffusion through blogspace |
| |
Daniel Gruhl,
R. Guha,
David Liben-Nowell,
Andrew Tomkins
|
|
Pages: 491-501 |
|
doi>10.1145/988672.988739 |
|
Full text: PDF
|
|
We study the dynamics of information propagation in environments of low-overhead personal publishing, using a large collection of weblogs over time as our example domain. We characterize and model this collection at two levels. First, we present a macroscopic characterization of topic propagation through our corpus, formalizing the notion of long-running "chatter" topics consisting recursively of "spike" topics generated by outside world events, or more rarely, by resonances within the community. Second, we present a microscopic characterization of propagation from individual to individual, drawing on the theory of infectious diseases to model the flow. We propose, validate, and employ an algorithm to induce the underlying propagation network from a sequence of posts, and report on the results.
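A toy cascade simulation in the spirit of this epidemic view of topic flow between blogs; the edge probabilities and the independent-cascade rule are illustrative assumptions, not the model fitted in the paper.

    import random

    def simulate_cascade(graph, seeds, rng=random.Random(0)):
        """graph: {blog: [(neighbor, infection_prob), ...]}; seeds: blogs posting first."""
        infected = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for blog in frontier:
                for neighbor, p in graph.get(blog, []):
                    if neighbor not in infected and rng.random() < p:
                        infected.add(neighbor)   # neighbor picks up the topic
                        nxt.append(neighbor)
            frontier = nxt
        return infected

    blogs = {"a": [("b", 0.7), ("c", 0.2)], "b": [("c", 0.6)], "c": []}
    print(simulate_cascade(blogs, {"a"}))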
|
|
|
Automatic web news extraction using tree edit distance |
| |
D. C. Reis,
P. B. Golgher,
A. S. Silva,
A. F. Laender
|
|
Pages: 502-511 |
|
doi>10.1145/988672.988740 |
|
Full text: PDF
|
|
The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made in order to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed to address the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
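A small Python sketch of a restricted top-down tree edit distance between DOM-like trees, broadly in the spirit of (but not identical to) the distance used in the paper; trees are represented as (label, children) tuples, and pages rendered from the same template should end up close to each other.

    def tree_distance(t1, t2):
        label1, kids1 = t1
        label2, kids2 = t2
        cost_root = 0 if label1 == label2 else 1
        # Edit distance over the two child sequences, where replacing one child
        # with another costs their (recursive) tree distance.
        m, n = len(kids1), len(kids2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + size(kids1[i - 1])
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + size(kids2[j - 1])
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j] + size(kids1[i - 1]),          # delete a subtree
                    d[i][j - 1] + size(kids2[j - 1]),          # insert a subtree
                    d[i - 1][j - 1] + tree_distance(kids1[i - 1], kids2[j - 1]),
                )
        return cost_root + d[m][n]

    def size(tree):
        return 1 + sum(size(child) for child in tree[1])

    page_a = ("html", [("body", [("div", [("h1", []), ("p", [])])])])
    page_b = ("html", [("body", [("div", [("h1", []), ("p", []), ("p", [])])])])
    print(tree_distance(page_a, page_b))  # small distance -> similar templates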
|
|
|
SESSION: Workload analysis |
| |
Alec Wolman
|
|
|
|
|
Accurate, scalable in-network identification of p2p traffic using application signatures |
| |
Subhabrata Sen,
Oliver Spatscheck,
Dongmei Wang
|
|
Pages: 512-521 |
|
doi>10.1145/988672.988742 |
|
Full text: PDF
|
|
The ability to accurately identify the network traffic associated with different P2P applications is important to a broad range of network operations including application-specific traffic engineering, capacity planning, provisioning, service differentiation, etc. However, traditional traffic to higher-level application mapping techniques such as default server TCP or UDP network-port based disambiguation is highly inaccurate for some P2P applications. In this paper, we provide an efficient approach for identifying the P2P application traffic through application level signatures. We first identify the application level signatures by examining some available documentation and packet-level traces. We then utilize the identified signatures to develop online filters that can efficiently and accurately track the P2P traffic even on high-speed network links. We examine the performance of our application-level identification approach using five popular P2P protocols. Our measurements show that our technique achieves less than 5% false positive and false negative ratios in most cases. We also show that our approach only requires the examination of the very first few packets (less than 10 packets) to identify a P2P connection, which makes our approach highly scalable. Our technique can significantly improve the P2P traffic volume estimates over what pure network port based approaches provide. For instance, we were able to identify 3 times as much traffic for the popular Kazaa P2P protocol, compared to the traditional port-based approach.
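A minimal sketch of payload-signature matching for P2P flow identification. The regular expressions below are simplified stand-ins rather than the exact signatures from the paper, and a real classifier would inspect only the first few packets of each flow.

    import re

    # Illustrative signatures keyed by protocol name; real deployments maintain
    # a larger, carefully validated signature set.
    SIGNATURES = {
        "gnutella": re.compile(rb"^GNUTELLA (CONNECT|OK)"),
        "bittorrent": re.compile(rb"^\x13BitTorrent protocol"),
    }

    def classify_payload(payload):
        for protocol, pattern in SIGNATURES.items():
            if pattern.search(payload):
                return protocol
        return "unknown"

    print(classify_payload(b"\x13BitTorrent protocol..."))  # -> bittorrent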
|
|
|
Characterization of a large web site population with implications for content delivery |
| |
L. Bent,
M. Rabinovich,
G. M. Voelker,
Z. Xiao
|
|
Pages: 522-533 |
|
doi>10.1145/988672.988743 |
|
Full text: PDF
|
|
This paper presents a systematic study of the properties of a large number of Web sites hosted by a major ISP. To our knowledge, ours is the first comprehensive study of a large server farm that contains thousands of commercial Web sites. We also perform a simulation analysis to estimate potential performance benefits of content delivery networks (CDNs) for these Web sites. We make several interesting observations about the current usage of Web technologies and Web site performance characteristics. First, compared with previous client workload studies, the Web server farm workload contains a much higher degree of uncacheable responses and responses that require mandatory cache validations. A significant reason for this is that cookie use is prevalent among our population, especially among more popular sites. However, we found an indication of wide-spread indiscriminate usage of cookies, which unnecessarily impedes the use of many content delivery optimizations. We also found that most Web sites do not utilize the cache-control features of the HTTP 1.1 protocol, resulting in suboptimal performance. Moreover, the implicit expiration time in client caches for responses is constrained by the maximum values allowed in the Squid proxy. Finally, our simulation results indicate that most Web sites benefit from the use of a CDN. The amount of the benefit depends on site popularity, and, somewhat surprisingly, a CDN may increase the peak to average request ratio at the origin server because the CDN can decrease the average request rate more than the peak request rate.
|
|
|
Analyzing client interactivity in streaming media |
| |
Cristiano P. Costa,
Italo S. Cunha,
Alex Borges,
Claudiney V. Ramos,
Marcus M. Rocha,
Jussara M. Almeida,
Berthier Ribeiro-Neto
|
|
Pages: 534-543 |
|
doi>10.1145/988672.988744 |
|
Full text: PDF
|
|
This paper provides an extensive analysis of pre-stored streaming media workloads, focusing on the client interactive behavior. We analyze four workloads that fall into three different domains, namely, education, entertainment video and entertainment audio. Our main goals are: (a) to identify qualitative similarities and differences in the typical client behavior for the three workload classes and (b) to provide data for generating realistic synthetic workloads.
|
|
|
SESSION: Semantic web services |
| |
Steffen Staab
|
|
|
|
|
Augmenting semantic web service descriptions with compositional specification |
| |
Monika Solanki,
Antonio Cau,
Hussein Zedan
|
|
Pages: 544-552 |
|
doi>10.1145/988672.988746 |
|
Full text: PDF
|
|
Current ontological specifications for semantically describing properties of Web services are limited to their static interface description. Normally for proving properties of service compositions, mapping input/output parameters and specifying the pre/post conditions are found to be sufficient. However these properties are assertions only on the initial and final states of the service respectively. They do not help in specifying/verifying ongoing behaviour of an individual service or a composed system. We propose a framework for enriching semantic service descriptions with two compositional assertions: assumption and commitment that facilitate reasoning about service composition and verification of their integration. The technique is based on Interval Temporal Logic (ITL): a sound formalism for specifying and proving temporal properties of systems. Our approach utilizes the recently proposed Semantic Web Rule Language.
|
|
|
METEOR-S web service annotation framework |
| |
Abhijit A. Patil,
Swapna A. Oundhakar,
Amit P. Sheth,
Kunal Verma
|
|
Pages: 553-562 |
|
doi>10.1145/988672.988747 |
|
Full text: PDF
|
|
The World Wide Web is emerging not only as an infrastructure for data, but also for a broader variety of resources that are increasingly being made available as Web services. Relevant current standards like UDDI, WSDL, and SOAP are in their fledgling years and form the basis of making Web services a workable and broadly adopted technology. However, realizing the fuller scope of the promise of Web services and associated service oriented architecture will require further technological advances in the areas of service interoperation, service discovery, service composition, and process orchestration. Semantics, especially as supported by the use of ontologies, and related Semantic Web technologies, are likely to provide better qualitative and scalable solutions to these requirements. Just as semantic annotation of data in the Semantic Web is the first critical step to better search, integration and analytics over heterogeneous data, semantic annotation of Web services is an equally critical first step to achieving the above promise. Our approach is to work with existing Web services technologies and combine them with ideas from the Semantic Web to create a better framework for Web service discovery and composition. In this paper we present MWSAF (METEOR-S Web Service Annotation Framework), a framework for semi-automatically marking up Web service descriptions with ontologies. We have developed algorithms to match and annotate WSDL files with relevant ontologies. We use domain ontologies to categorize Web services into domains. An empirical study of our approach is presented to help evaluate its performance.
|
|
|
Foundations for service ontologies: aligning OWL-S to dolce |
| |
Peter Mika,
Daniel Oberle,
Aldo Gangemi,
Marta Sabou
|
|
Pages: 563-572 |
|
doi>10.1145/988672.988748 |
|
Full text: PDF
|
|
Clarity in semantics and a rich formalization of this semantics are important requirements for ontologies designed to be deployed in large-scale, open, distributed systems such as the envisioned Semantic Web. This is especially important for the description of Web Services, which should enable complex tasks involving multiple agents. As one of the first initiatives of the Semantic Web community for describing Web Services, OWL-S attracts a lot of interest even though it is still under development. We identify problematic aspects of OWL-S and suggest enhancements through alignment to a foundational ontology. Another contribution of our work is the Core Ontology of Services that tries to fill the epistemological gap between the foundational ontology and OWL-S. It can be reused to align other Web Service description languages as well. Finally, we demonstrate the applicability of our work by aligning OWL-S' standard example called CongoBuy.
|
|
|
SESSION: Search engineering 2 |
| |
Nick Koudas
|
|
|
|
|
Mining models of human activities from the web |
| |
Mike Perkowitz,
Matthai Philipose,
Kenneth Fishkin,
Donald J. Patterson
|
|
Pages: 573-582 |
|
doi>10.1145/988672.988750 |
|
Full text: PDF
|
|
The ability to determine what day-to-day activity (such as cooking pasta, taking a pill, or watching a video) a person is performing is of interest in many application domains. A system that can do this requires models of the activities of interest, but model construction does not scale well: humans must specify low-level details, such as segmentation and feature selection of sensor data, and high-level structure, such as spatio-temporal relations between states of the model, for each and every activity. As a result, previous practical activity recognition systems have been content to model a tiny fraction of the thousands of human activities that are potentially useful to detect. In this paper, we present an approach to sensing and modeling activities that scales to a much larger class of activities than before. We show how a new class of sensors, based on Radio Frequency Identification (RFID) tags, can directly yield semantic terms that describe the state of the physical world. These sensors allow us to formulate activity models by translating labeled activities, such as 'cooking pasta', into probabilistic collections of object terms, such as 'pot'. Given this view of activity models as text translations, we show how to mine definitions of activities in an unsupervised manner from the web. We have used our technique to mine definitions for over 20,000 activities. We experimentally validate our approach using data gathered from actual human activity as well as simulated data.
|
|
|
TeXQuery: a full-text search extension to XQuery |
| |
S. Amer-Yahia,
C. Botev,
J. Shanmugasundaram
|
|
Pages: 583-594 |
|
doi>10.1145/988672.988751 |
|
Full text: PDF
|
|
One of the key benefits of XML is its ability to represent a mix of structured and unstructured (text) data. Although current XML query languages such as XPath and XQuery can express rich queries over structured data, they can only express very rudimentary queries over text data. We thus propose TeXQuery, which is a powerful full-text search extension to XQuery. TeXQuery provides a rich set of fully composable full-text search primitives, such as Boolean connectives, phrase matching, proximity distance, stemming and thesauri. TeXQuery also enables users to seamlessly query over both structured and text data by embedding TeXQuery primitives in XQuery, and vice versa. Finally, TeXQuery supports a flexible scoring construct that can be used to score query results based on full-text predicates. TeXQuery is the precursor of the full-text language extensions to XPath 2.0 and XQuery 1.0 currently being developed by the W3C.
|
|
|
The webgraph framework I: compression techniques |
| |
P. Boldi,
S. Vigna
|
|
Pages: 595-602 |
|
doi>10.1145/988672.988752 |
|
Full text: PDF
|
|
Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow one to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This paper presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.
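The two ideas named in the abstract can be illustrated on toy adjacency lists: intervalisation is reduced here to simple gap encoding, and referentiation to a copy-mask against a reference list. The actual WebGraph codes are considerably more refined.

    def gap_encode(successors):
        """Sorted, non-empty successor list -> first element plus gaps between neighbors."""
        return [successors[0]] + [b - a for a, b in zip(successors, successors[1:])]

    def reference_encode(successors, reference):
        """Split a list into (copy mask w.r.t. the reference list, residual extra nodes)."""
        mask = [1 if node in successors else 0 for node in reference]
        extras = [node for node in successors if node not in reference]
        return mask, extras

    prev_list = [13, 14, 15, 16, 100]
    curr_list = [13, 15, 16, 101]
    print(gap_encode(prev_list))                   # [13, 1, 1, 1, 84]
    print(reference_encode(curr_list, prev_list))  # ([1, 0, 1, 1, 0], [101])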
|
|
|
SESSION: Infrastructure for implementation |
| |
Martin Gaedke
|
|
|
|
|
XQuery at your web service |
| |
Nicola Onose,
Jerome Simeon
|
|
Pages: 603-611 |
|
doi>10.1145/988672.988754 |
|
Full text: PDF
|
|
XML messaging is at the heart of Web services, providing the flexibility required for their deployment, composition, and maintenance. Yet, current approaches to Web services development hide the messaging layer behind Java or C# APIs, preventing the application from getting direct access to the underlying XML information. To address this problem, we advocate the use of a native XML language, namely XQuery, as an integral part of the Web services development infrastructure. The main contribution of the paper is a binding between WSDL, the Web Services Description Language, and XQuery. The approach enables the use of XQuery for both Web services deployment and composition. We present a simple command-line tool that can be used to automatically deploy a Web service from a given XQuery module, and extend the XQuery language itself with a statement for accessing one or more Web services. The binding provides tight coupling between WSDL and XQuery, yielding additional benefits, notably: the ability to use WSDL as an interface language for XQuery, and the ability to perform static typing on XQuery programs that include Web service calls. Last but not least, the proposal requires only minimal changes to the existing infrastructure. We report on our experience implementing this approach in the Galax XQuery processor.
|
|
|
Adapting databases and WebDAV protocol |
| |
Bita Shadgar,
Ian Holyer
|
|
Pages: 612-620 |
|
doi>10.1145/988672.988755 |
|
Full text: PDF
|
|
The ability of the Web to share data regardless of geographical location raises a new issue called remote authoring. With the Internet and Web browsers being independent of hardware, it becomes possible to build Web-enabled database applications. Many approaches are provided to integrate databases into the Web environment, which use the Web's protocol, i.e. HTTP, to transfer the data between clients and servers. However, those methods are affected by HTTP's shortfalls with regard to remote authoring. This paper introduces and discusses a new methodology for remote authoring of databases, which is based on the WebDAV protocol. It is a seamless and effective methodology for accessing and authoring databases, particularly in that it naturally benefits from WebDAV advantages such as metadata and access control. These features establish a standard way of accessing database metadata, and increase database security, while speeding up the database connection.
|
|
|
Analysis of interacting BPEL web services |
| |
Xiang Fu,
Tevfik Bultan,
Jianwen Su
|
|
Pages: 621-630 |
|
doi>10.1145/988672.988756 |
|
Full text: PDF
|
|
This paper presents a set of tools and techniques for analyzing interactions of composite web services which are specified in BPEL and communicate through asynchronous XML messages. We model the interactions of composite web services as conversations, the global sequence of messages exchanged by the web services. As opposed to earlier work, our tool-set handles rich data manipulation via XPath expressions. This allows us to verify designs at a more detailed level and check properties about message content. We present a framework where BPEL specifications of web services are translated to an intermediate representation, followed by the translation of the intermediate representation to a verification language. As an intermediate representation we use guarded automata augmented with unbounded queues for incoming messages, where the guards are expressed as XPath expressions. As the target verification language we use Promela, input language of the model checker SPIN. Since the SPIN model checker is a finite-state verification tool we can only achieve partial verification by fixing the sizes of the input queues in the translation. We propose the concept of synchronizability to address this problem. We show that if a composite web service is synchronizable, then its conversation set remains the same when asynchronous communication is replaced with synchronous communication. We give a set of sufficient conditions that guarantee synchronizability and that can be checked statically. Based on our synchronizability results, we show that a large class of composite web services with unbounded input queues can be completely verified using a finite state model checker such as SPIN.
|
|
|
SESSION: Distributed semantic query |
| |
Frank van Harmelen
|
|
|
|
|
Index structures and algorithms for querying distributed RDF repositories |
| |
Heiner Stuckenschmidt,
Richard Vdovjak,
Geert-Jan Houben,
Jeen Broekstra
|
|
Pages: 631-639 |
|
doi>10.1145/988672.988758 |
|
Full text: PDF
|
|
A technical infrastructure for storing, querying and managing RDF data is a key element in the current semantic web development. Systems like Jena, Sesame or the ICS-FORTH RDF Suite are widely used for building semantic web applications. Currently, none of these systems supports the integrated querying of distributed RDF repositories. We consider this a major shortcoming since the semantic web is distributed by nature. In this paper we present an architecture for querying distributed RDF repositories by extending the existing Sesame system. We discuss the implications of our architecture and propose an index structure as well as algorithms for query processing and optimization in such a distributed context.
|
|
|
Remindin': semantic query routing in peer-to-peer networks based on social metaphors |
| |
Christoph Tempich,
Steffen Staab,
Adrian Wranik
|
|
Pages: 640-649 |
|
doi>10.1145/988672.988759 |
|
Full text: PDF
|
|
In peer-to-peer networks, finding the appropriate answer for an information request, such as the answer to a query for RDF(S) data, depends on selecting the right peer in the network. We here investigate how social metaphors can be exploited effectively and efficiently to solve this task. To this end, we define a method for query routing, REMINDIN', that lets peers (i) observe which queries are successfully answered by other peers, (ii) memorize this observation, and (iii) subsequently use this information in order to select peers to forward requests to. REMINDIN' has been implemented for the SWAP peer-to-peer platform as well as for a simulation environment. We have used the simulation environment in order to investigate how successful variations of REMINDIN' are and how they compare to baseline strategies in terms of number of messages forwarded in the network and statements appropriately retrieved.
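A toy sketch of observation-based peer selection: a peer remembers which peers successfully answered queries on which topic and prefers those peers for later queries on that topic. The scoring and topic matching are placeholder assumptions, not REMINDIN's actual relevance function.

    from collections import defaultdict

    class RoutingMemory:
        def __init__(self):
            # topic -> peer -> number of successful answers observed
            self.successes = defaultdict(lambda: defaultdict(int))

        def observe(self, topic, peer, answered):
            if answered:
                self.successes[topic][peer] += 1

        def select_peers(self, topic, k=2):
            ranked = sorted(self.successes[topic].items(), key=lambda kv: kv[1], reverse=True)
            return [peer for peer, _ in ranked[:k]]

    memory = RoutingMemory()
    memory.observe("wine", "peer-a", answered=True)
    memory.observe("wine", "peer-b", answered=True)
    memory.observe("wine", "peer-a", answered=True)
    print(memory.select_peers("wine"))  # ['peer-a', 'peer-b']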
|
|
|
RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network |
| |
Min Cai,
Martin Frank
|
|
Pages: 650-657 |
|
doi>10.1145/988672.988760 |
|
Full text: PDF
|
|
Centralized Resource Description Framework (RDF) repositories have limitations both in their failure tolerance and in their scalability. Existing Peer-to-Peer (P2P) RDF repositories either cannot guarantee to find query results, even if these results exist in the network, or require up-front definition of RDF schemas and designation of super peers. We present a scalable distributed RDF repository (RDFPeers) that stores each triple at three places in a multi-attribute addressable network by applying globally known hash functions to its subject, predicate and object. Thus all nodes know which node is responsible for storing triple values they are looking for and both exact-match and range queries can be efficiently routed to those nodes. RDFPeers has no single point of failure nor elevated peers and does not require the prior definition of RDF schemas. Queries are guaranteed to find matched triples in the network if the triples exist. In RDFPeers both the number of neighbors per node and the number of routing hops for inserting RDF triples and for resolving most queries are logarithmic to the number of nodes in the network. We further performed experiments that show that the triple-storing load in RDFPeers differs by less than an order of magnitude between the most and the least loaded nodes for real-world RDF data.
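A minimal sketch of the indexing idea: each triple is stored at the nodes responsible for the hashes of its subject, predicate and object on an identifier ring. The hash and successor-lookup details below are assumptions in the style of structured overlays, not RDFPeers' exact scheme.

    import hashlib

    RING_BITS = 16

    def ring_id(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** RING_BITS)

    def responsible_node(key, node_ids):
        """First node clockwise from the key (successor-style lookup)."""
        candidates = sorted(node_ids)
        return next((n for n in candidates if n >= key), candidates[0])

    def placement(triple, node_ids):
        s, p, o = triple
        return {part: responsible_node(ring_id(part), node_ids) for part in (s, p, o)}

    nodes = [ring_id(f"node-{i}") for i in range(8)]
    print(placement(("http://ex.org/alice", "foaf:knows", "http://ex.org/bob"), nodes))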
|
|
|
SESSION: Query result processing |
| |
Andrei Broder
|
|
|
|
|
A hierarchical monothetic document clustering algorithm for summarization and browsing search results |
| |
Krishna Kummamuru,
Rohit Lotlikar,
Shourya Roy,
Karan Singal,
Raghu Krishnapuram
|
|
Pages: 658-665 |
|
doi>10.1145/988672.988762 |
|
Full text: PDF
|
|
Organizing Web search results into a hierarchy of topics and sub-topics facilitates browsing the collection and locating results of interest. In this paper, we propose a new hierarchical monothetic clustering algorithm to build a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, the new algorithm progressively identifies topics in a way that maximizes the coverage while maintaining distinctiveness of the topics. We refer to the proposed algorithm as DisCover. Evaluating the quality of a topic hierarchy is a non-trivial task, the ultimate test being user judgment. We use several objective measures such as coverage and reach time for an empirical comparison of the proposed algorithm with two other monothetic clustering algorithms to demonstrate its superiority. Even though our algorithm is slightly more computationally intensive than one of the algorithms, it generates better hierarchies. Our user studies also show that the proposed algorithm is superior to the other algorithms as a summarizing and browsing tool.
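A greedy toy version of progressive topic selection: at each step pick the term that covers the most not-yet-covered documents, a crude proxy for the coverage/distinctiveness criterion described above (not the actual DisCover scoring).

    def pick_topics(docs, num_topics=3):
        """docs: {doc_id: set of terms}. Returns terms chosen as top-level topics."""
        uncovered = set(docs)
        topics = []
        for _ in range(num_topics):
            best_term, best_gain = None, 0
            candidates = {t for d in uncovered for t in docs[d]}
            for term in candidates:
                gain = sum(1 for d in uncovered if term in docs[d])
                if gain > best_gain:
                    best_term, best_gain = term, gain
            if best_term is None:
                break
            topics.append(best_term)
            uncovered -= {d for d in uncovered if best_term in docs[d]}
        return topics

    docs = {1: {"jaguar", "car"}, 2: {"jaguar", "cat"}, 3: {"car", "dealer"}, 4: {"cat", "zoo"}}
    print(pick_topics(docs))  # greedily chosen terms that cover the collection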
|
|
|
Mining anchor text for query refinement |
| |
Reiner Kraft,
Jason Zien
|
|
Pages: 666-674 |
|
doi>10.1145/988672.988763 |
|
Full text: PDF
|
|
When searching large hypertext document collections, it is often possible that there are too many results available for ambiguous queries. Query refinement is an interactive process of query modification that can be used to narrow down the scope of search results. We propose a new method for automatically generating refinements or related terms to queries by mining anchor text for a large hypertext document collection. We show that the usage of anchor text as a basis for query refinement produces high quality refinement suggestions that are significantly better in terms of perceived usefulness compared to refinements that are derived using the document content. Furthermore, our study suggests that anchor text refinements can also be used to augment traditional query refinement algorithms based on query logs, since they typically differ in coverage and produce different refinements. Our results are based on experiments on an anchor text collection of a large corporate intranet.
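An illustrative sketch only: collect anchor-text phrases that contain the query and rank them by frequency. The real system works over a large anchor-text index and uses more careful ranking.

    from collections import Counter

    def refinements(query, anchor_texts, k=5):
        query = query.lower()
        counts = Counter(
            text.lower().strip()
            for text in anchor_texts
            if query in text.lower() and text.lower().strip() != query
        )
        return [phrase for phrase, _ in counts.most_common(k)]

    anchors = ["java tutorial", "Java", "java api documentation",
               "java tutorial", "download java runtime"]
    print(refinements("java", anchors))  # e.g. ['java tutorial', 'java api documentation', ...]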
|
|
|
Adaptive web search based on user profile constructed without any effort from users |
| |
Kazunari Sugiyama,
Kenji Hatano,
Masatoshi Yoshikawa
|
|
Pages: 675-684 |
|
doi>10.1145/988672.988764 |
|
Full text: PDF
|
|
Web search engines help users find useful information on the World Wide Web (WWW). However, when the same query is submitted by different users, typical search engines return the same result regardless of who submitted the query. Generally, each user has different information needs for his/her query. Therefore, the search result should be adapted to users with different information needs. In this paper, we first propose several approaches to adapting search results according to each user's need for relevant information without any user effort, and then verify the effectiveness of our proposed approaches. Experimental results show that search systems that adapt to each user's preferences can be achieved by constructing user profiles based on modified collaborative filtering with detailed analysis of user's browsing history in one day.
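A hedged sketch of one of the simpler ideas: build a user profile as a term-frequency vector over pages browsed in one day and re-rank search results by cosine similarity to it. The paper's profiles are built with modified collaborative filtering; this is a simplification.

    import math
    from collections import Counter

    def profile(pages):
        return Counter(word for page in pages for word in page.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0) for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def rerank(results, user_profile):
        """results: {result_id: snippet text}; higher profile similarity ranks first."""
        return sorted(results,
                      key=lambda rid: cosine(profile([results[rid]]), user_profile),
                      reverse=True)

    history = ["python pandas dataframe tutorial", "numpy array broadcasting"]
    results = {"r1": "jaguar car dealership", "r2": "python dataframe groupby examples"}
    print(rerank(results, profile(history)))  # 'r2' should come first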
|
|
|
SESSION: Web site analysis and customization |
| |
Daniel Schwabe
|
|
|
|
|
Practical semantic analysis of web sites and documents |
| |
Thierry Despeyroux
|
|
Pages: 685-693 |
|
doi>10.1145/988672.988766 |
|
Full text: PDF
|
|
As Web sites are now ordinary products, it is necessary to make explicit the notion of quality of a Web site. The quality of a site may be linked to the ease of accessibility and also to other criteria such as the fact that the site is up to date and coherent. This last quality is difficult to ensure because sites may be updated very frequently, may have many authors, may be partially generated, and in this context proof-reading is very difficult. The same piece of information may be found in different occurrences, but also in data or meta-data, leading to the need for consistency checking. In this paper we make a parallel between programs and Web sites. We present some examples of semantic constraints that one would like to specify (constraints between the meaning of categories and sub-categories in a thematic directory, consistency between the organization chart and the rest of the site in an academic site). We briefly present Natural Semantics, a way to specify the semantics of programming languages that inspires our work. Natural Semantics itself comes from both operational semantics and logic programming, and its implementation uses Prolog. Then we propose a specification language for semantic constraints in Web sites that, in conjunction with the well known "make" program, makes it possible to generate site verification tools by compiling the specification into Prolog code. We apply our method to a large XML document, the scientific part of our institute's activity report, tracking errors or inconsistencies and also constructing some indicators that can be used by the management of the institute.
|
|
|
Web customization using behavior-based remote executing agents |
| |
Eugene Hung,
Joseph Pasquale
|
|
Pages: 694-703 |
|
doi>10.1145/988672.988767 |
|
Full text: PDF
|
|
ReAgents are remotely executing agents that customize Web browsing for non-standard clients. A reAgent is essentially a "one-shot" mobile agent that acts as an extension of a client, dynamically launched by the client to run on its behalf at a remote, more advantageous location. ReAgents simplify the use of mobile agent technology by transparently handling data migration and run-time network communications, and provide a general interface for programmers to more easily implement their application-specific customizing logic. This is made possible by the identification of useful remote behaviors, i.e., common patterns of actions that exploit the ability to process and communicate remotely. Examples of such behaviors are transformers, monitors, cachers, and collators. In this paper we identify a set of useful reAgent behaviors for interacting with Web services via a standard browser, describe how to program and use reAgents, and show that the overhead of using reAgents is low and outweighed by its benefits.
|
|
|
SESSION: Semantic web foundations |
| |
Jeremy Carroll
|
|
|
|
|
A possible simplification of the semantic web architecture |
| |
Bernardo Cuenca Grau
|
|
Pages: 704-713 |
|
doi>10.1145/988672.988769 |
|
Full text: PDF
|
|
In the semantic Web architecture, Web ontology languages are built on top of RDF(S). However, serious difficulties have arisen when trying to layer expressive ontology languages, like OWL, on top of RDF-Schema. Although these problems can be avoided, OWL (and the whole semantic Web architecture) becomes much more complex than it should be. In this paper, a possible simplification of the semantic Web architecture is suggested, which has several important advantages with respect to the layering currently accepted by the W3C Ontology Working Group.
|
|
|
A combined approach to checking web ontologies |
| |
J. S. Dong,
C. H. Lee,
H. B. Lee,
Y. F. Li,
H. Wang
|
|
Pages: 714-722 |
|
doi>10.1145/988672.988770 |
|
Full text: PDF
|
|
The understanding of Semantic Web documents is built upon ontologies that define concepts and relationships of data. Hence, the correctness of ontologies is vital. Ontology reasoners such as RACER and FaCT have been developed to reason about ontologies with a high degree of automation. However, complex ontology-related properties may not be expressible within the current web ontology languages, and consequently they may not be checkable by RACER and FaCT. We propose to use software engineering techniques and tools, i.e., Z/EVES and Alloy Analyzer, to complement the ontology tools for checking Semantic Web documents. In this approach, Z/EVES is first applied to remove trivial syntax and type errors of the ontologies. Next, RACER is used to identify any ontological inconsistencies, whose origins can be traced by Alloy Analyzer. Finally Z/EVES is used again to express complex ontology-related properties and reveal errors beyond the modeling capabilities of the current web ontology languages. We have successfully applied this approach to checking a set of military plan ontologies.
|
|
|
A proposal for an OWL rules language |
| |
Ian Horrocks,
Peter F. Patel-Schneider
|
|
Pages: 723-731 |
|
doi>10.1145/988672.988771 |
|
Full text: PDF
|
|
Although the OWL Web Ontology Language adds considerable expressive power to the Semantic Web, it does have expressive limitations, particularly with respect to what can be said about properties. We present ORL (OWL Rules Language), a Horn clause rules extension to OWL that overcomes many of these limitations. ORL extends OWL in a syntactically and semantically coherent manner: the basic syntax for ORL rules is an extension of the abstract syntax for OWL DL and OWL Lite; ORL rules are given formal meaning via an extension of the OWL DL model-theoretic semantics; ORL rules are given an XML syntax based on the OWL XML presentation syntax; and a mapping from ORL rules to RDF graphs is given based on the OWL RDF/XML exchange syntax. We discuss the expressive power of ORL, showing that the ontology consistency problem is undecidable, provide several examples of ORL usage, and discuss how reasoning support for ORL might be provided.
|