User-centered design for KDD
Eric Haseltine
Pages: 1-1
doi: 10.1145/1014052.1014053

During initial development, KDD solutions often focus heavily on algorithms, architectures, software, hardware, and systems engineering challenges, without first thoroughly exploring how end-users will employ the new KDD technology. As a result of such "system-centered" design, many useless features are implemented that prolong development and significantly add to life cycle cost, while making the system hard to operate and use. This presentation will describe an alternate "user-centered" approach -- borrowed from the consumer products industry -- that can produce KDD solutions with shorter development cycles, lower costs, and much better usability.

Graphical models for data mining
David Heckerman
Pages: 2-2
doi: 10.1145/1014052.1014054

I will discuss the use of graphical models for data mining. I will review key research areas including structure learning, variational methods, and relational modeling, and describe applications ranging from web traffic analysis to AIDS vaccine design.

SESSION: Research track papers

An iterative method for multi-class cost-sensitive learning
Naoki Abe, Bianca Zadrozny, John Langford
Pages: 3-11
doi: 10.1145/1014052.1014056

Cost-sensitive learning addresses the issue of classification in the presence of varying costs associated with different types of misclassification. In this paper, we present a method for solving multi-class cost-sensitive learning problems using any binary classification algorithm. This algorithm is derived using three key ideas: 1) iterative weighting; 2) expanding data space; and 3) gradient boosting with stochastic ensembles. We establish some theoretical guarantees concerning the performance of this method. In particular, we show that a certain variant possesses the boosting property, given a form of weak learning assumption on the component binary classifier. We also empirically evaluate the performance of the proposed method using benchmark data sets and verify that our method generally achieves better results than representative methods for cost-sensitive learning, in terms of predictive performance (cost minimization) and, in many cases, computational efficiency.
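
The reduction is designed to work with any off-the-shelf binary learner. The simplest related idea, cost-proportionate example weighting for the two-class case (which the same authors have studied elsewhere), can be sketched as follows; this is an illustration of that weighting idea only, not the paper's iterative multi-class procedure, and the data set and base learner are placeholders.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def costing_sample(X, y, costs, rng):
        # Keep each example with probability proportional to its misclassification
        # cost, so an ordinary cost-blind learner trained on the sample
        # approximately minimizes expected cost.
        keep = rng.random(len(X)) < costs / costs.max()
        return X[keep], y[keep]

    def costing_ensemble(X, y, costs, n_members=10, seed=0):
        rng = np.random.default_rng(seed)
        members = []
        for _ in range(n_members):
            Xs, ys = costing_sample(X, y, costs, rng)
            members.append(DecisionTreeClassifier().fit(Xs, ys))
        return members

    def predict(members, X):
        # Majority vote over the resampled ensemble.
        votes = np.mean([m.predict(X) for m in members], axis=0)
        return (votes >= 0.5).astype(int)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] > 0).astype(int)
    costs = np.where(y == 1, 5.0, 1.0)    # false negatives cost five times more
    members = costing_ensemble(X, y, costs)
    print(predict(members, X[:5]))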

Approximating a collection of frequent sets
Foto Afrati, Aristides Gionis, Heikki Mannila
Pages: 12-19
doi: 10.1145/1014052.1014057

One of the most well-studied problems in data mining is computing the collection of frequent item sets in large transactional databases. One obstacle for the applicability of frequent-set mining is that the size of the output collection can be far too large to be carefully examined and understood by the users. Even restricting the output to the border of the frequent item-set collection does not help much in alleviating the problem. In this paper we address the issue of overwhelmingly large output size by introducing and studying the following problem: What are the k sets that best approximate a collection of frequent item sets? Our measure of approximating a collection of sets by k sets is defined to be the size of the collection covered by the k sets, i.e., the part of the collection that is included in one of the k sets. We also specify a bound on the number of extra sets that are allowed to be covered. We examine different problem variants for which we demonstrate the hardness of the corresponding problems and we provide simple polynomial-time approximation algorithms. We give empirical evidence showing that the approximation methods work well in practice.
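
The coverage measure above (the number of frequent sets that are subsets of at least one of the k chosen sets) can be made concrete with a simple greedy baseline: repeatedly pick the candidate set that covers the most still-uncovered members of the collection. This is only a hedged illustration of the objective, not necessarily the paper's approximation algorithm; the candidate pool and toy collection are invented.

    def covered(candidate, collection):
        # Members of the collection that are included in (are subsets of) the candidate.
        return {s for s in collection if s <= candidate}

    def greedy_k_cover(collection, candidates, k):
        # Greedily choose k candidate sets maximizing the number of covered frequent sets.
        chosen, covered_so_far = [], set()
        for _ in range(k):
            best = max(candidates, key=lambda c: len(covered(c, collection) - covered_so_far))
            chosen.append(best)
            covered_so_far |= covered(best, collection)
        return chosen, len(covered_so_far)

    collection = {frozenset(s) for s in [("a",), ("b",), ("a", "b"), ("c",), ("b", "c")]}
    candidates = [frozenset(s) for s in [("a", "b"), ("b", "c"), ("a", "c")]]
    print(greedy_k_cover(collection, candidates, k=2))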

Mining reference tables for automatic text segmentation
Eugene Agichtein, Venkatesh Ganti
Pages: 20-29
doi: 10.1145/1014052.1014058

Automatically segmenting unstructured text strings into structured records is necessary for importing the information contained in legacy sources and text collections into a data warehouse for subsequent querying, analysis, mining and integration. In this paper, we mine tables present in data warehouses and relational databases to develop an automatic segmentation system. Thus, we overcome limitations of existing supervised text segmentation approaches, which require comprehensive manually labeled training data. Our segmentation system is robust, accurate, and efficient, and requires no additional manual effort. Thorough evaluation on real datasets demonstrates the robustness and accuracy of our system, with segmentation accuracy exceeding that of state-of-the-art supervised approaches.

Recovering latent time-series from their observed sums: network tomography with particle filters
Edoardo Airoldi, Christos Faloutsos
Pages: 30-39
doi: 10.1145/1014052.1014059

Hidden variables, evolving over time, appear in multiple settings, where it is valuable to recover them, typically from observed sums. Our driving application is 'network tomography', where we need to estimate the origin-destination (OD) traffic flows to determine, e.g., who is communicating with whom in a local area network. This information allows network engineers and managers to solve problems in design, routing, configuration debugging, monitoring and pricing. Unfortunately the direct measurement of the OD traffic is usually difficult, or even impossible; instead, we can easily measure the loads on every link, that is, sums of the desired OD flows. In this paper we propose i-FILTER, a method to solve this problem, which improves the state of the art by (a) introducing explicit time dependence, and by (b) using realistic, non-Gaussian marginals in the statistical models for the traffic flows, which has not been attempted before. We give experiments on real data, where i-FILTER scales linearly with new observations and outperforms the best existing solutions in a wide variety of settings. Specifically, on real network traffic measured at CMU and at AT&T, i-FILTER reduced the estimation errors by between 15% and 46% in all cases.

Fast nonlinear regression via eigenimages applied to galactic morphology
Brigham Anderson, Andrew Moore, Andrew Connolly, Robert Nichol
Pages: 40-48
doi: 10.1145/1014052.1014060

Astronomy increasingly faces the issue of massive, unwieldy data sets. The Sloan Digital Sky Survey (SDSS) [11] has so far generated tens of millions of images of distant galaxies, of which only a tiny fraction have been morphologically classified. Morphological classification in this context is achieved by fitting a parametric model of galaxy shape to a galaxy image. This is a nonlinear regression problem, whose challenges are threefold: 1) blurring of the image caused by atmosphere and mirror imperfections, 2) large numbers of local minima, and 3) massive data sets. Our strategy is to use the eigenimages of the parametric model to form a new feature space, and then to map both the target image and the model parameters into this feature space. In this low-dimensional space we search for the best image-to-parameter match. To search the space, we sample it by creating a database of many random parameter vectors (prototypes) and mapping them into the feature space. The search problem then becomes one of finding the best prototype match, so the fitting process becomes a nearest-neighbor search. In addition to the savings realized by decomposing the original space into an eigenspace, we can use the fact that the model is a linear sum of functions to reduce the prototypes further: the only prototypes stored are the components of the model function. A modified form of nearest neighbor is used to search among them. Additional complications arise in the form of missing data and heteroscedasticity, both of which are addressed with weighted linear regression. Compared to existing techniques, the speed-ups achieved are between 2 and 3 orders of magnitude. This should enable the analysis of the entire SDSS dataset.
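
The core numerical recipe (project both the target image and a database of model-generated prototypes onto a small eigenimage basis, then match by nearest neighbor in that space) can be sketched generically with a PCA-style decomposition. This is a hedged illustration of the feature-space matching step only, with made-up array shapes; it omits the paper's handling of blurring, missing data, heteroscedasticity, and the component-wise prototype reduction.

    import numpy as np

    rng = np.random.default_rng(0)
    prototypes = rng.normal(size=(2000, 1024))   # model images rendered at random parameter vectors
    params = rng.uniform(size=(2000, 3))         # the shape parameters that generated each prototype
    target = rng.normal(size=1024)               # an observed galaxy image, flattened

    # Build an eigenimage basis from the prototypes (top-k principal components).
    mean = prototypes.mean(axis=0)
    _, _, Vt = np.linalg.svd(prototypes - mean, full_matrices=False)
    basis = Vt[:10]                              # ten eigenimages

    # Map prototypes and the target into the low-dimensional feature space.
    proto_feats = (prototypes - mean) @ basis.T
    target_feat = (target - mean) @ basis.T

    # Nearest-neighbor search in feature space replaces the expensive nonlinear fit.
    best = np.argmin(np.linalg.norm(proto_feats - target_feat, axis=1))
    print("estimated shape parameters:", params[best])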

Clustering time series from ARMA models with clipped data
A. J. Bagnall, G. J. Janacek
Pages: 49-58
doi: 10.1145/1014052.1014061

Clustering time series is a problem that has applications in a wide variety of fields, and has recently attracted a large amount of research. In this paper we focus on clustering data derived from Autoregressive Moving Average (ARMA) models using k-means and k-medoids algorithms with the Euclidean distance between estimated model parameters. We justify our choice of clustering technique and distance metric by reproducing results obtained in related research. Our research aim is to assess the effects of discretising data into binary sequences of above and below the median, a process known as clipping, on the clustering of time series. It is known that the fitted AR parameters of clipped data tend asymptotically to the parameters for unclipped data. We exploit this result to demonstrate that for long series the clustering accuracy when using clipped data from the class of ARMA models is not significantly different from that achieved with unclipped data. Next we show that if the data contains outliers then using clipped data produces significantly better clusterings. We then demonstrate that using clipped series requires much less memory and that operations such as distance calculations can be much faster. Finally, we demonstrate these advantages on three real world data sets.
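
A minimal sketch of the pipeline described here: clip each series at its median, fit a low-order AR model to the clipped series, and cluster the fitted coefficients with k-means. It assumes AR(2) fits via Yule-Walker estimates and toy AR(1) data, and illustrates the idea rather than the authors' exact estimation or evaluation setup.

    import numpy as np
    from sklearn.cluster import KMeans

    def clip(series):
        # Binarize into above/below the median (the 'clipping' step).
        return (series > np.median(series)).astype(float)

    def ar_yule_walker(x, order=2):
        # Estimate AR coefficients of a (possibly clipped) series via Yule-Walker.
        x = x - x.mean()
        acf = np.array([np.dot(x[:len(x) - k], x[k:]) / len(x) for k in range(order + 1)])
        R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
        return np.linalg.solve(R, acf[1:order + 1])

    def cluster_series(series_list, n_clusters, order=2, clipped=True):
        feats = np.array([ar_yule_walker(clip(s) if clipped else s, order) for s in series_list])
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)

    # Toy data: two groups of AR(1) processes with different coefficients.
    rng = np.random.default_rng(1)
    def ar1(phi, n=2000):
        x = np.zeros(n)
        for t in range(1, n):
            x[t] = phi * x[t - 1] + rng.normal()
        return x

    series = [ar1(0.8) for _ in range(10)] + [ar1(-0.5) for _ in range(10)]
    print(cluster_series(series, n_clusters=2))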

A probabilistic framework for semi-supervised clustering
Sugato Basu, Mikhail Bilenko, Raymond J. Mooney
Pages: 59-68
doi: 10.1145/1014052.1014062

Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints either to modify the objective function or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
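
The flavor of such an objective, a clustering distortion term plus penalties for violated must-link and cannot-link constraints, can be written down concretely. This is a hedged sketch with squared Euclidean distortion and fixed penalty weights; the HMRF framework in the paper also learns the distortion measure and supports other divergences.

    import numpy as np

    def constrained_objective(X, labels, centroids, must_link, cannot_link, w=1.0, w_bar=1.0):
        # Posterior-energy-style objective: distortion to assigned centroids plus
        # fixed penalties for each violated pairwise constraint.
        distortion = sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))
        ml_penalty = sum(w for (i, j) in must_link if labels[i] != labels[j])
        cl_penalty = sum(w_bar for (i, j) in cannot_link if labels[i] == labels[j])
        return distortion + ml_penalty + cl_penalty

    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
    centroids = np.array([[0.05, 0.0], [5.05, 4.95]])
    labels = [0, 0, 1, 1]
    print(constrained_objective(X, labels, centroids,
                                must_link=[(0, 1)], cannot_link=[(1, 2)]))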

Data mining in metric space: an empirical analysis of supervised learning performance criteria
Rich Caruana, Alexandru Niculescu-Mizil
Pages: 69-78
doi: 10.1145/1014052.1014063

Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criteria to use. A further complication is that learning methods that perform well on one criterion may not perform well on other criteria. For example, SVMs and boosting are designed to optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low-dimensional manifold. The three metrics that are appropriate when predictions are interpreted as probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far away from metrics that depend on the relative order of the predicted values: ROC area, average precision, break-even point, and lift. In between them fall two metrics that depend on comparing predictions to a threshold: accuracy and F-score. As expected, maximum margin methods such as SVMs and boosted trees have excellent performance on metrics like accuracy, but perform poorly on probability metrics such as squared error. What was not expected was that the margin methods also have excellent performance on ordering metrics such as ROC area and average precision. We introduce a new metric, SAR, that combines squared error, accuracy, and ROC area into one metric. MDS and correlation analysis show that SAR is centrally located and correlates well with other metrics, suggesting that it is a good general-purpose metric to use when more specific criteria are not known.
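
If SAR is taken to be the equally weighted combination of accuracy, ROC area, and one minus root-mean-squared error that the description above suggests, it can be computed directly from predicted probabilities. A sketch under that assumption, using scikit-learn metrics; the 0.5 threshold and the exact weighting are conventions assumed here rather than details confirmed by the text above.

    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    def sar(y_true, p_pred, threshold=0.5):
        # SAR = (Accuracy + ROC Area + (1 - RMSE)) / 3, computed from predicted
        # probabilities p_pred of the positive class.
        y_true = np.asarray(y_true)
        p_pred = np.asarray(p_pred)
        acc = accuracy_score(y_true, (p_pred >= threshold).astype(int))
        auc = roc_auc_score(y_true, p_pred)
        rmse = np.sqrt(np.mean((y_true - p_pred) ** 2))
        return (acc + auc + (1.0 - rmse)) / 3.0

    y_true = [0, 0, 1, 1, 1]
    p_pred = [0.2, 0.4, 0.9, 0.6, 0.7]
    print(round(sar(y_true, p_pred), 3))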

Fully automatic cross-associations
Deepayan Chakrabarti, Spiros Papadimitriou, Dharmendra S. Modha, Christos Faloutsos
Pages: 79-88
doi: 10.1145/1014052.1014064

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices---e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining---require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem. Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.

Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods
William W. Cohen, Sunita Sarawagi
Pages: 89-98
doi: 10.1145/1014052.1014065

We consider the problem of improving named entity recognition (NER) systems by using external dictionaries---more specifically, the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process, which is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling high-performance NER methods and high-performance similarity functions, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance over previous methods for using external dictionaries in NER.

Adversarial classification
Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, Deepak Verma
Pages: 99-108
doi: 10.1145/1014052.1014066

Essentially all data mining algorithms assume that the data-generating process is independent of the data miner's activities. However, in many domains, including spam detection, intrusion detection, fraud detection, surveillance and counter-terrorism, this is far from the case: the data is actively manipulated by an adversary seeking to make the classifier produce false negatives. In these domains, the performance of a classifier can degrade rapidly after it is deployed, as the adversary learns to defeat it. Currently the only solution to this is repeated, manual, ad hoc reconstruction of the classifier. In this paper we develop a formal framework and algorithms for this problem. We view classification as a game between the classifier and the adversary, and produce a classifier that is optimal given the adversary's optimal strategy. Experiments in a spam detection domain show that this approach can greatly outperform a classifier learned in the standard way, and (within the parameters of the problem) automatically adapt the classifier to the adversary's evolving manipulations.

Regularized multi-task learning
Theodoros Evgeniou, Massimiliano Pontil
Pages: 109-117
doi: 10.1145/1014052.1014067

Past empirical work has shown that learning multiple related tasks from data simultaneously can be advantageous in terms of predictive performance relative to learning these tasks independently. In this paper we present an approach to multi-task learning based on the minimization of regularization functionals similar to existing ones, such as the one for Support Vector Machines (SVMs), that have been successfully used in the past for single-task learning. Our approach allows us to model the relation between tasks in terms of a novel kernel function that uses a task-coupling parameter. We implement an instance of the proposed approach similar to SVMs and test it empirically using simulated as well as real data. The experimental results show that the proposed method performs better than existing multi-task learning methods and largely outperforms single-task learning using SVMs.

Fast discovery of connection subgraphs
Christos Faloutsos, Kevin S. McCurley, Andrew Tomkins
Pages: 118-127
doi: 10.1145/1014052.1014068

We define a connection subgraph as a small subgraph of a large graph that best captures the relationship between two nodes. The primary motivation for this work is to provide a paradigm for exploration and knowledge discovery in large social network graphs. We present a formal definition of this problem, and an ideal solution based on electricity analogues. We then show how to accelerate the computations to produce approximate, but high-quality, connection subgraphs in real time on very large (disk resident) graphs. We describe our operational prototype, and we demonstrate results on a social network graph derived from the World Wide Web. Our graph contains 15 million nodes and 96 million edges, and our system still produces quality responses within seconds.

Systematic data selection to mine concept-drifting data streams
Wei Fan
Pages: 128-137
doi: 10.1145/1014052.1014069

One major problem of existing methods to mine data streams is that they make ad hoc choices to combine the most recent data with some amount of old data to search for the new hypothesis. The assumption is that the additional old data always helps produce a more accurate hypothesis than using the most recent data only. We first criticize this notion and point out that using old data blindly is not better than "gambling"; in other words, it helps increase the accuracy only if we are "lucky." We discuss and analyze the situations where old data will help and what kind of old data will help. The practical problem in choosing the right examples from old data is the formidable cost of comparing different possibilities and models. This problem will go away if we have an algorithm that is extremely efficient at comparing all sensible choices with little extra cost. Based on this observation, we propose a simple, efficient and accurate cross-validation decision tree ensemble method.

Efficient closed pattern mining in the presence of tough block constraints
Krishna Gade, Jianyong Wang, George Karypis
Pages: 138-147
doi: 10.1145/1014052.1014070

Various constrained frequent pattern mining problem formulations and associated algorithms have been developed that enable the user to specify various itemset-based constraints that better capture the underlying application requirements and characteristics. In this paper we introduce a new class of block constraints that determine the significance of an itemset pattern by considering the dense block that is formed by the pattern's items and its associated set of transactions. Block constraints provide a natural framework by which a number of important problems can be specified and make it possible to solve numerous problems on binary and real-valued datasets. However, developing computationally efficient algorithms to mine patterns satisfying these block constraints poses a number of challenges because, unlike the different itemset-based constraints studied earlier, block constraints are tough: they are neither anti-monotone, monotone, nor convertible. To overcome this problem, we introduce a new class of pruning methods that significantly reduce the overall search space and present a computationally efficient and scalable algorithm called CBMiner to find the closed itemsets that satisfy the block constraints.

Discovering complex matchings across web query interfaces: a correlation mining approach
Bin He, Kevin Chen-Chuan Chang, Jiawei Han
Pages: 148-157
doi: 10.1145/1014052.1014071

To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, most existing techniques focus on simple 1:1 matchings because of the far more complex search space of complex matchings. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., [author] corresponds to [first name, last name] in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., [first name, last name]) tend to be co-present in query interfaces and are thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. Unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm considers both positive and negative correlations, and especially the subtlety of negative correlations, because of their special importance in schema matching. This leads to the introduction of a new correlation measure, the H-measure, distinct from those proposed in previous work. We evaluate our approach extensively and the results show good accuracy for discovering complex matchings.

Cyclic pattern kernels for predictive graph mining
Tamás Horváth, Thomas Gärtner, Stefan Wrobel
Pages: 158-167
doi: 10.1145/1014052.1014072

With applications in biology, the world-wide web, and several other areas, mining of graph-structured objects has received significant interest recently. One of the major research directions in this field is concerned with predictive data mining in graph databases where each instance is represented by a graph. Some of the proposed approaches for this task rely on the excellent classification performance of support vector machines. To control the computational cost of these approaches, the underlying kernel functions are based on frequent patterns. In contrast to these approaches, we propose a kernel function based on a natural set of cyclic and tree patterns independent of their frequency, and discuss its computational aspects. To practically demonstrate the effectiveness of our approach, we use the popular NCI-HIV molecule dataset. Our experimental results show that cyclic pattern kernels can be computed quickly and offer predictive performance superior to recent graph kernels based on frequent patterns.

Mining and summarizing customer reviews
Minqing Hu, Bing Liu
Pages: 168-177
doi: 10.1145/1014052.1014073

Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in the hundreds or even thousands. This makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track of and manage customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting a subset of, or rewriting some of, the original sentences from the reviews to capture the main points, as in classic text summarization. Our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques.

Interestingness of frequent itemsets using Bayesian networks as background knowledge
Szymon Jaroszewicz, Dan A. Simovici
Pages: 178-186
doi: 10.1145/1014052.1014074

The paper presents a method for pruning frequent itemsets based on background knowledge represented by a Bayesian network. The interestingness of an itemset is defined as the absolute difference between its support estimated from data and from the Bayesian network. Efficient algorithms are presented for finding interestingness of a collection of frequent itemsets, and for finding all attribute sets with a given minimum interestingness. Practical usefulness of the algorithms and their efficiency have been verified experimentally.

Mining the space of graph properties
Glen Jeh, Jennifer Widom
Pages: 187-196
doi: 10.1145/1014052.1014075

Existing data mining algorithms on graphs look for nodes satisfying specific properties, such as specific notions of structural similarity or specific measures of link-based importance. While such analyses for predetermined properties can be effective in well-understood domains, sometimes identifying an appropriate property for analysis can be a challenge, and focusing on a single property may neglect other important aspects of the data. In this paper, we develop a foundation for mining the properties themselves. We present a theoretical framework defining the space of graph properties, a variety of mining queries enabled by the framework, techniques to handle the enormous size of the query space, and an experimental system called F-Miner that demonstrates the utility and feasibility of property mining.

Web usage mining based on probabilistic latent semantic analysis
Xin Jin, Yanzan Zhou, Bamshad Mobasher
Pages: 197-205
doi: 10.1145/1014052.1014076

The primary goal of Web usage mining is the discovery of patterns in the navigational behavior of Web users. Standard approaches, such as clustering of user sessions and discovering association rules or frequent navigational paths, do not generally provide the ability to automatically characterize or quantify the unobservable factors that lead to common navigational patterns. It is, therefore, necessary to develop techniques that can automatically discover hidden semantic relationships among users as well as between users and Web objects. Probabilistic Latent Semantic Analysis (PLSA) is particularly useful in this context, since it can uncover latent semantic associations among users and pages based on the co-occurrence patterns of these pages in user sessions. In this paper, we develop a unified framework for the discovery and analysis of Web navigational patterns based on PLSA. We show the flexibility of this framework in characterizing various relationships among users and Web objects. Since these relationships are measured in terms of probabilities, we are able to use probabilistic inference to perform a variety of analysis tasks such as user segmentation, page classification, as well as predictive tasks such as collaborative recommendations. We demonstrate the effectiveness of our approach through experiments performed on real-world data sets.

Towards parameter-free data mining
Eamonn Keogh, Stefano Lonardi, Chotirat Ann Ratanamahatana
Pages: 206-215
doi: 10.1145/1014052.1014077

Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.
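
The practical recipe the authors point to, approximating Kolmogorov-style distances with an off-the-shelf compressor, really does take only a few lines. Below is a sketch of a compression-based dissimilarity of the form C(xy) / (C(x) + C(y)) using zlib; the exact measure and compressor evaluated in the paper may differ.

    import random
    import zlib

    def csize(s: bytes) -> int:
        # Compressed size of s, a computable stand-in for Kolmogorov complexity.
        return len(zlib.compress(s, 9))

    def cdm(x: bytes, y: bytes) -> float:
        # Compression-based dissimilarity: lower when x and y share structure,
        # because their concatenation then compresses disproportionately well.
        return csize(x + y) / (csize(x) + csize(y))

    random.seed(0)
    alphabet = b"ACGT"
    seq_a = bytes(random.choice(alphabet) for _ in range(1000))
    seq_b = bytearray(seq_a)
    for i in random.sample(range(1000), 50):   # mutate 5% of positions
        seq_b[i] = random.choice(alphabet)
    seq_b = bytes(seq_b)
    seq_c = bytes(random.choice(alphabet) for _ in range(1000))   # unrelated sequence

    # The related pair should receive the lower dissimilarity.
    print(cdm(seq_a, seq_b), cdm(seq_a, seq_c))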

A graph-theoretic approach to extract storylines from search results
Ravi Kumar, Uma Mahadevan, D. Sivakumar
Pages: 216-225
doi: 10.1145/1014052.1014078

We present a graph-theoretic approach to discover storylines from search results. Storylines are windows that offer glimpses into interesting themes latent among the top search results for a query; they are different from, and complementary to, clusters obtained through traditional approaches. Our framework is axiomatically developed and combinatorial in nature, based on generalizations of the maximum induced matching problem on bipartite graphs. The core algorithmic task involved is to mine for signature structures in a robust graph representation of the search results. We present a very fast algorithm for this task based on local search. Experiments show that the collection of storylines extracted through our algorithm offers a concise organization of the wealth of information hidden beyond the first page of search results.

Incremental maintenance of quotient cube for median
Cuiping Li, Gao Cong, Anthony K. H. Tung, Shan Wang
Pages: 226-235
doi: 10.1145/1014052.1014079

Data cube pre-computation is an important concept for supporting OLAP (Online Analytical Processing) and has been studied extensively. It is often not feasible to compute a complete data cube due to the huge storage requirement. The recently proposed quotient cube addressed this issue through a partitioning method that groups cube cells into equivalence partitions. Such an approach is not only useful for distributive aggregate functions such as SUM but can also be applied to holistic aggregate functions like MEDIAN. Maintaining a data cube for holistic aggregation is a hard problem: the difficulty lies in the fact that historical tuple values must be kept in order to compute the new aggregate when tuples are inserted or deleted. The quotient cube makes the problem harder since we also need to maintain the equivalence classes. In this paper, we introduce two techniques, called addset data structure and sliding window, to deal with this problem. We develop efficient algorithms for maintaining a quotient cube with holistic aggregation functions that take up reasonably small storage space. Performance study shows that our algorithms are effective, efficient and scalable over large databases.

Mining, indexing, and querying historical spatiotemporal data
Nikos Mamoulis, Huiping Cao, George Kollios, Marios Hadjieleftheriou, Yufei Tao, David W. Cheung
Pages: 236-245
doi: 10.1145/1014052.1014080

In many applications that track and analyze spatiotemporal data, movements obey periodic patterns; the objects follow the same routes (approximately) over regular time intervals. For example, people wake up at the same time and follow more or less the same route to their work every day. The discovery of hidden periodic patterns in spatiotemporal data, apart from unveiling important information to the data analyst, can facilitate data management substantially. Based on this observation, we propose a framework that analyzes, manages, and queries object movements that follow such patterns. We define the spatiotemporal periodic pattern mining problem and propose an effective and fast mining algorithm for retrieving maximal periodic patterns. We also devise a novel, specialized index structure that can benefit from the discovered patterns to support more efficient execution of spatiotemporal queries. We evaluate our methods experimentally using datasets with object trajectories that exhibit periodicity.

Machine learning for online query relaxation
Ion Muslea
Pages: 246-255
doi: 10.1145/1014052.1014081

In this paper we provide a fast, data-driven solution to the failing query problem: given a query that returns an empty answer, how can one relax the query's constraints so that it returns a non-empty set of tuples? We introduce a novel algorithm, LOQR, which is designed to relax queries that are in disjunctive normal form and contain a mixture of discrete and continuous attributes. LOQR discovers the implicit relationships that exist among the various domain attributes and then uses this knowledge to relax the constraints from the failing query. In a first step, LOQR uses a small, randomly-chosen subset of the target database to learn a set of decision rules that predict whether an attribute's value satisfies the constraints in the failing query; this query-driven operation is performed online for each failing query. In the second step, LOQR uses nearest-neighbor techniques to find the learned rule that is most similar to the failing query; it then uses the attributes' values from this rule to relax the failing query's constraints. Our experiments on six application domains show that LOQR is both robust and fast: it successfully relaxes more than 95% of the failing queries, and it takes under a second to process queries that consist of up to 20 attributes (larger queries of up to 93 attributes are processed in several seconds).

Rapid detection of significant spatial clusters
Daniel B. Neill, Andrew W. Moore
Pages: 256-265
doi: 10.1145/1014052.1014082

Given an N x N grid of squares, where each square has a count c_ij and an underlying population p_ij, our goal is to find the rectangular region with the highest density, and to calculate its significance by randomization. An arbitrary density function D, dependent on a region's total count C and total population P, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires O(N^4) time, and is generally computationally infeasible. We present a multiresolution algorithm which partitions the grid into overlapping regions using a novel overlap-kd tree data structure, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in O((N log N)^2) time, in practice resulting in significant (20-2000x) speedups on both real and simulated datasets.
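
The baseline that the multiresolution algorithm speeds up is the exhaustive O(N^4) scan over all axis-aligned rectangles, each scored by a density function of its total count C and population P. Here is a sketch of that naive scan using 2-D prefix sums and a simple count-per-population density; Kulldorff's statistic or another D(C, P) could be substituted for the density argument.

    import numpy as np

    def prefix(a):
        # 2-D prefix sums with a zero border, so any rectangle sum is O(1).
        p = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
        p[1:, 1:] = np.cumsum(np.cumsum(a, axis=0), axis=1)
        return p

    def rect_sum(p, r1, c1, r2, c2):
        # Sum over the rectangle with inclusive corners (r1, c1) and (r2, c2).
        return p[r2 + 1, c2 + 1] - p[r1, c2 + 1] - p[r2 + 1, c1] + p[r1, c1]

    def max_density_rect(counts, pops, density=lambda C, P: C / P if P > 0 else 0.0):
        pc, pp = prefix(counts), prefix(pops)
        n, m = counts.shape
        best, best_rect = -np.inf, None
        for r1 in range(n):                        # naive O(N^4) enumeration of rectangles
            for r2 in range(r1, n):
                for c1 in range(m):
                    for c2 in range(c1, m):
                        d = density(rect_sum(pc, r1, c1, r2, c2),
                                    rect_sum(pp, r1, c1, r2, c2))
                        if d > best:
                            best, best_rect = d, (r1, c1, r2, c2)
        return best, best_rect

    rng = np.random.default_rng(0)
    pops = rng.integers(50, 100, size=(8, 8)).astype(float)
    counts = rng.poisson(pops * 0.01).astype(float)
    counts[2:4, 5:7] += 10.0                       # inject a dense cluster
    print(max_density_rect(counts, pops))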

Turning CARTwheels: an alternating algorithm for mining redescriptions
Naren Ramakrishnan, Deept Kumar, Bud Mishra, Malcolm Potts, Richard F. Helm
Pages: 266-275
doi: 10.1145/1014052.1014083

We present an unusual algorithm involving classification trees---CARTwheels---where two trees are grown in opposite directions so that they are joined at their leaves. This approach finds application in a new data mining task we formulate, called redescription mining. A redescription is a shift-of-vocabulary, or a different way of communicating information about a given subset of data; the goal of redescription mining is to find subsets of data that afford multiple descriptions. We highlight the importance of this problem in domains such as bioinformatics, which exhibit an underlying richness and diversity of data descriptors (e.g., genes can be studied in a variety of ways). CARTwheels exploits the duality between class partitions and path partitions in an induced classification tree to model and mine redescriptions. It helps integrate multiple forms of characterizing datasets, situates the knowledge gained from one dataset in the context of others, and harnesses high-level abstractions for uncovering cryptic and subtle features of data. Algorithm design decisions, implementation details, and experimental results are presented.

Selection, combination, and evaluation of effective software sensors for detecting abnormal computer usage
Jude Shavlik, Mark Shavlik
Pages: 276-285
doi: 10.1145/1014052.1014084

We present and empirically analyze a machine-learning approach for detecting intrusions on individual computers. Our Winnow-based algorithm continually monitors user and system behavior, recording such properties as the number of bytes transferred over the last 10 seconds, the programs that currently are running, and the load on the CPU. In all, hundreds of measurements are made and analyzed each second. Using this data, our algorithm creates a model that represents each particular computer's range of normal behavior. Parameters that determine when an alarm should be raised, due to abnormal activity, are set on a per-computer basis, based on an analysis of training data. A major issue in intrusion-detection systems is the need for very low false-alarm rates. Our empirical results suggest that it is possible to obtain high intrusion-detection rates (95%) and low false-alarm rates (less than one per day per computer), without "stealing" too many CPU cycles (less than 1%). We also report which system measurements are the most valuable in terms of detecting intrusions. A surprisingly large number of different measurements prove significantly useful.
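
The learning core referenced here, Winnow, is the classic mistake-driven algorithm with multiplicative weight updates, well suited to streams with many Boolean features of which only a few matter. A textbook sketch follows; the feature vectors, labels, and threshold are placeholders, and the paper's per-computer alarm thresholds and feature extraction are not reproduced.

    import numpy as np

    class Winnow:
        # Classic Winnow: promote/demote active feature weights multiplicatively
        # whenever the current prediction is wrong.
        def __init__(self, n_features, alpha=2.0):
            self.w = np.ones(n_features)
            self.alpha = alpha
            self.theta = n_features / 2.0          # standard threshold

        def predict(self, x):
            return int(self.w @ x >= self.theta)

        def update(self, x, y):
            y_hat = self.predict(x)
            if y_hat == 1 and y == 0:              # false alarm: demote active features
                self.w[x == 1] /= self.alpha
            elif y_hat == 0 and y == 1:            # miss: promote active features
                self.w[x == 1] *= self.alpha
            return y_hat

    # Toy stream: the label depends on features 0 and 1 only, out of 100 measurements.
    rng = np.random.default_rng(0)
    model = Winnow(n_features=100)
    for _ in range(2000):
        x = rng.integers(0, 2, size=100)
        y = int(x[0] == 1 and x[1] == 1)
        model.update(x, y)
    print(model.w[:4])                             # weights on the relevant features grow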

A Bayesian network framework for reject inference
Andrew Smith, Charles Elkan
Pages: 286-295
doi: 10.1145/1014052.1014085

Most learning methods assume that the training set is drawn randomly from the population to which the learned model is to be applied. However, in many applications this assumption is invalid. For example, lending institutions create models of who is likely to repay a loan from training sets consisting of people in their records to whom loans were given in the past; however, the institution approved those loan applications based on who was thought unlikely to default. Learning from only approved loans yields an incorrect model because the training set is a biased sample of the general population of applicants. The issue of including rejected samples in the learning process, or alternatively using rejected samples to adjust a model learned from accepted samples only, is called reject inference. The main contribution of this paper is a systematic analysis of different cases that arise in reject inference, with explanations of which cases arise in various real-world situations. We use Bayesian networks to formalize each case as a set of conditional independence relationships and identify eight cases, including the familiar missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) cases. For each case we present an overview of available learning algorithms. These algorithms have been published in separate fields of research, including epidemiology, econometrics, clinical trial evaluation, sociology, and credit scoring; our second major contribution is to describe these algorithms in a common framework.

Support envelopes: a technique for exploring the structure of association patterns
Michael Steinbach, Pang-Ning Tan, Vipin Kumar
Pages: 296-305
doi: 10.1145/1014052.1014086

This paper introduces support envelopes---a new tool for analyzing association patterns---and illustrates some of their properties, applications, and possible extensions. Specifically, the support envelope for a transaction data set and a specified pair of positive integers (m,n) consists of the items and transactions that need to be searched to find any association pattern involving m or more transactions and n or more items. For any transaction data set with M transactions and N items, there is a unique lattice of at most M*N support envelopes that captures the structure of the association patterns in that data set. Because support envelopes are not encumbered by a support threshold, this support lattice provides a complete view of the association structure of the data set, including association patterns that have low support. Furthermore, the boundary of the support lattice---the support boundary---has at most min(M,N) envelopes and is especially interesting since it bounds the maximum sizes of potential association patterns---not only for frequent, closed, and maximal itemsets, but also for patterns, such as error-tolerant itemsets, that are more general. The association structure can be represented graphically as a two-dimensional scatter plot of the (m,n) values associated with the support envelopes of the data set, a feature that is useful in the exploratory analysis of association patterns. Finally, the algorithm to compute support envelopes is simple and computationally efficient, and it is straightforward to parallelize the process of finding all the support envelopes.

Probabilistic author-topic models for information discovery
Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths
Pages: 306-315
doi: 10.1145/1014052.1014087

We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
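
The two-stage generative story in this abstract is easy to state as code: for each word of a paper, pick one of its authors, pick a topic from that author's topic distribution, then pick a word from that topic's word distribution. The sketch below shows only this forward generative process with made-up dimensions and random placeholder distributions; inference in the paper is done with Markov chain Monte Carlo and is not shown.

    import numpy as np

    rng = np.random.default_rng(0)
    n_authors, n_topics, vocab_size = 5, 3, 50

    # Author-topic and topic-word distributions (learned from data in the paper;
    # random placeholders here).
    theta = rng.dirichlet(np.ones(n_topics), size=n_authors)   # P(topic | author)
    phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)    # P(word | topic)

    def generate_document(author_ids, n_words):
        words = []
        for _ in range(n_words):
            a = rng.choice(author_ids)                 # choose one of the paper's authors
            z = rng.choice(n_topics, p=theta[a])       # choose a topic from that author
            w = rng.choice(vocab_size, p=phi[z])       # choose a word from that topic
            words.append(w)
        return words

    print(generate_document(author_ids=[0, 3], n_words=12))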

Scalable mining of large disk-based graph databases
Chen Wang, Wei Wang, Jian Pei, Yongtai Zhu, Baile Shi
Pages: 316-325
doi: 10.1145/1014052.1014088

Mining frequent structural patterns from graph databases is an interesting problem with broad applications. Most of the previous studies focus on pruning unfruitful search subspaces effectively, but few of them address the mining of large, disk-based databases. As many graph databases in applications cannot be held in main memory, scalable mining of large, disk-based graph databases remains a challenging problem. In this paper, we develop an effective index structure, ADI (for adjacency index), to support mining various graph patterns over large databases that cannot be held in main memory. The index is simple and efficient to build. Moreover, the new index structure can be easily adopted in various existing graph pattern mining algorithms. As an example, we adapt the well-known gSpan algorithm by using the ADI structure. The experimental results show that the new index structure enables scalable graph pattern mining over large databases. In one set of the experiments, the new disk-based method can mine graph databases with one million graphs, while the original gSpan algorithm can only handle databases of up to 300 thousand graphs. Moreover, our new method is faster than gSpan when both can run in main memory.
|
|
|
Incorporating prior knowledge with weighted margin support vector machines |
| |
Xiaoyun Wu,
Rohini Srihari
|
|
Pages: 326-333 |
|
doi>10.1145/1014052.1014089 |
|
Full text: PDF
|
|
Like many purely data-driven machine learning methods, Support Vector Machine (SVM) classifiers are learned exclusively from the evidence presented in the training dataset; thus a larger training dataset is required for better performance. In some applications, there might be human knowledge available that, in principle, could compensate for the lack of data. In this paper, we propose a simple generalization of the SVM, the Weighted Margin SVM (WMSVM), which permits the incorporation of prior knowledge. We show that Sequential Minimal Optimization can be used to train WMSVM. We discuss the issues of incorporating prior knowledge using this rather general formulation. The experimental results show that the proposed method of incorporating prior knowledge is effective.
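The paper's WMSVM weights each example's margin requirement; as a rough, hedged approximation of that idea, the sketch below uses scikit-learn's per-sample weights so that low-confidence prior-knowledge examples count less than observed data. The data, the weights, and the equivalence to the paper's formulation are all our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Pretend the first ten points are "pseudo" examples derived from domain
# rules; per-example weights soften their influence on the margin.
confidence = np.ones(len(y))
confidence[:10] = 0.2

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=confidence)
print(clf.score(X, y))
```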
|
|
|
Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs |
| |
Hui Xiong,
Shashi Shekhar,
Pang-Ning Tan,
Vipin Kumar
|
|
Pages: 334-343 |
|
doi>10.1145/1014052.1014090 |
|
Full text: PDF
|
|
Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the numbers of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation quERy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning are independent of, or improve with, the number of items in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
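The support-based bound follows from supp(AB) ≤ min(supp(A), supp(B)); a minimal filter-and-refine sketch is given below. The brute-force pair loop and toy data are ours, and the paper's additional monotone-property pruning, which avoids computing some bounds altogether, is not shown.

```python
import numpy as np
from itertools import combinations

def phi_upper_bound(supp_a, supp_b):
    """Support-based upper bound on the phi (Pearson) correlation of two
    binary items, derived from supp(AB) <= min(supp(A), supp(B))."""
    lo, hi = min(supp_a, supp_b), max(supp_a, supp_b)
    return np.sqrt((lo / hi) * ((1 - hi) / (1 - lo)))

def phi(col_a, col_b):
    """Exact phi correlation from 0/1 transaction columns."""
    sa, sb, sab = col_a.mean(), col_b.mean(), (col_a & col_b).mean()
    return (sab - sa * sb) / np.sqrt(sa * sb * (1 - sa) * (1 - sb))

def all_strong_pairs(T, theta):
    """Filter-and-refine over a 0/1 transaction matrix T (rows are
    transactions, columns are items): compute the exact correlation only
    for pairs whose cheap upper bound already exceeds theta."""
    supports = T.mean(axis=0)
    result = []
    for i, j in combinations(range(T.shape[1]), 2):
        if phi_upper_bound(supports[i], supports[j]) >= theta:   # filter
            if phi(T[:, i], T[:, j]) >= theta:                   # refine
                result.append((i, j))
    return result

rng = np.random.default_rng(1)
T = (rng.random((1000, 30)) < 0.2).astype(int)
print(all_strong_pairs(T, theta=0.3))
```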
|
|
|
The complexity of mining maximal frequent itemsets and maximal frequent patterns |
| |
Guizhen Yang
|
|
Pages: 344-353 |
|
doi>10.1145/1014052.1014091 |
|
Full text: PDF
|
|
Mining maximal frequent itemsets is one of the most fundamental problems in data mining. In this paper we study the complexity-theoretic aspects of maximal frequent itemset mining from the perspective of counting the number of solutions. We present the first formal proof that the problem of counting the number of distinct maximal frequent itemsets in a database of transactions, given an arbitrary support threshold, is #P-complete, thereby providing strong theoretical evidence that the problem of mining maximal frequent itemsets is NP-hard. This result is of particular interest since the associated decision problem of checking the existence of a maximal frequent itemset is in P. We also extend our complexity analysis to other similar data mining problems dealing with complex data structures, such as sequences, trees, and graphs, which have attracted intensive research interest in recent years. Normally, in these problems a partial order among frequent patterns can be defined in such a way as to preserve the downward closure property, with maximal frequent patterns being those without any successor with respect to this partial order. We investigate several variants of these mining problems in which the patterns of interest are subsequences, subtrees, or subgraphs, and show that the associated problems of counting the number of maximal frequent patterns are all either #P-complete or #P-hard.
|
|
|
GPCA: an efficient dimension reduction scheme for image compression and retrieval |
| |
Jieping Ye,
Ravi Janardan,
Qi Li
|
|
Pages: 354-363 |
|
doi>10.1145/1014052.1014092 |
|
Full text: PDF
|
|
Recent years have witnessed a dramatic increase in the quantity of image data collected, due to advances in fields such as medical imaging, reconnaissance, surveillance, astronomy, and multimedia. With this increase has come the need to store, transmit, and query large volumes of image data efficiently. A common operation on image databases is the retrieval of all images that are similar to a query image. For this, the images in the database are often represented as vectors in a high-dimensional space, and a query is answered by retrieving all image vectors that are proximal to the query image in this space under a suitable similarity metric. To overcome problems associated with high dimensionality, such as high storage and retrieval times, a dimension reduction step is usually applied to the vectors to concentrate relevant information in a small number of dimensions. Principal Component Analysis (PCA) is a well-known dimension reduction scheme. However, since it works with vectorized representations of images, PCA does not take into account the spatial locality of pixels in images. In this paper, a new dimension reduction scheme, called Generalized Principal Component Analysis (GPCA), is presented. This scheme works directly with images in their native state, as two-dimensional matrices, by projecting the images to a vector space that is the tensor product of two lower-dimensional vector spaces. Experiments on databases of face images show that, for the same amount of storage, GPCA is superior to PCA in terms of quality of the compressed images, query precision, and computational cost.
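One common way to realize such a two-sided projection is to learn a left and a right transformation by alternating eigen-decompositions; the sketch below follows that general recipe, but the initialization, iteration count, and stopping rule are our assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def two_sided_projection(images, l1, l2, n_iter=10):
    """Alternating scheme for projecting image matrices from both sides.
    Returns L (m x l1) and R (n x l2) so that each image A is compressed
    to L.T @ (A - mean) @ R -- a sketch of working on images as 2-D
    matrices rather than long vectors."""
    A = np.asarray(images, dtype=float)        # shape (k, m, n)
    M = A.mean(axis=0)
    D = A - M
    _, m, n = D.shape
    R = np.eye(n, l2)                          # initial right projection
    for _ in range(n_iter):
        # Fix R, take the top-l1 eigenvectors of sum_i D_i R R^T D_i^T.
        SL = sum(Di @ R @ R.T @ Di.T for Di in D)
        L = np.linalg.eigh(SL)[1][:, -l1:]
        # Fix L, take the top-l2 eigenvectors of sum_i D_i^T L L^T D_i.
        SR = sum(Di.T @ L @ L.T @ Di for Di in D)
        R = np.linalg.eigh(SR)[1][:, -l2:]
    return M, L, R

rng = np.random.default_rng(0)
faces = rng.random((20, 32, 32))               # toy "image" stack
M, L, R = two_sided_projection(faces, l1=8, l2=8)
compressed = [L.T @ (A - M) @ R for A in faces]
print(compressed[0].shape)
```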
|
|
|
IDR/QR: an incremental dimension reduction algorithm via QR decomposition |
| |
Jieping Ye,
Qi Li,
Hui Xiong,
Haesun Park,
Ravi Janardan,
Vipin Kumar
|
|
Pages: 364-373 |
|
doi>10.1145/1014052.1014093 |
|
Full text: PDF
|
|
Dimension reduction is critical for many database and data mining applications, such as efficient storage and retrieval of high-dimensional data. In the literature, a well-known dimension reduction scheme is Linear Discriminant Analysis (LDA). The common aspect of previously proposed LDA-based algorithms is the use of Singular Value Decomposition (SVD). Due to the difficulty of designing an incremental solution for the eigenvalue problem on the product of scatter matrices in LDA, there has been little work on designing incremental LDA algorithms. In this paper, we propose an LDA-based incremental dimension reduction algorithm, called IDR/QR, which applies QR decomposition rather than SVD. Unlike other LDA-based algorithms, this algorithm does not require the whole data matrix in main memory, which is desirable for large data sets. More importantly, as new data items are inserted, the IDR/QR algorithm can constrain the computational cost by applying efficient QR-updating techniques. Finally, we evaluate the effectiveness of the IDR/QR algorithm in terms of classification accuracy in the reduced-dimensional space. Our experiments on several real-world data sets reveal that the accuracy achieved by the IDR/QR algorithm is very close to the best possible accuracy achieved by other LDA-based algorithms. However, the IDR/QR algorithm has much lower computational cost, especially when new data items are dynamically inserted.
|
|
|
On the discovery of significant statistical quantitative rules |
| |
Hong Zhang,
Balaji Padmanabhan,
Alexander Tuzhilin
|
|
Pages: 374-383 |
|
doi>10.1145/1014052.1014094 |
|
Full text: PDF
|
|
In this paper we study market share rules, rules that have a certain market share statistic associated with them. Such rules are particularly relevant for decision making from a business perspective. Motivated by market share rules, we consider statistical quantitative rules (SQ rules): quantitative rules in which the RHS can be any statistic that is computed for the segment satisfying the LHS of the rule. Building on prior work, we present a statistical approach for learning all significant SQ rules, i.e., SQ rules for which a desired statistic lies outside a confidence interval computed for this rule. In particular, we show how resampling techniques can be effectively used to learn significant rules. Since our method considers the significance of a large number of rules in parallel, it is susceptible to learning a certain number of "false" rules. To address this, we present a technique that can determine the number of significant SQ rules that can be expected by chance alone, and suggest that this number can be used to determine a "false discovery rate" for the learning procedure. We apply our methods to online consumer purchase data and report the results.
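As a hedged illustration of the resampling idea, the sketch below tests a single rule whose RHS statistic is the segment mean: a null interval is built from random pseudo-segments of the same size drawn from the whole population. The statistic, the interval construction, and the toy data are our simplifications; the paper's procedure, including the expected-by-chance count behind the false discovery rate, is more involved.

```python
import numpy as np

def rule_is_significant(segment_values, population_values, n_boot=2000,
                        alpha=0.05, rng=np.random.default_rng(0)):
    """The rule's RHS statistic here is the mean over the segment defined
    by the LHS.  A null confidence interval is formed by repeatedly drawing
    random pseudo-segments of the same size from the population; the rule
    is flagged significant if the observed statistic falls outside it."""
    k = len(segment_values)
    observed = np.mean(segment_values)
    null_stats = np.array([
        rng.choice(population_values, size=k, replace=False).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(null_stats, [alpha / 2, 1 - alpha / 2])
    return observed < lo or observed > hi

rng = np.random.default_rng(1)
population = rng.exponential(scale=10.0, size=5000)   # e.g. customer spending
segment = population[:300] * 1.3                      # toy "segment"
print(rule_is_significant(segment, population))
```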
|
|
|
Fast mining of spatial collocations |
| |
Xin Zhang,
Nikos Mamoulis,
David W. Cheung,
Yutao Shou
|
|
Pages: 384-393 |
|
doi>10.1145/1014052.1014095 |
|
Full text: PDF
|
|
Spatial collocation patterns capture the co-existence of non-spatial features in a spatial neighborhood. An example of such a pattern can associate contaminated water reservoirs with certain diseases in their spatial neighborhood. Previous work on discovering collocation patterns converts neighborhoods of feature instances to itemsets and applies mining techniques for transactional data to discover the patterns. We propose a method that combines the discovery of spatial neighborhoods with the mining process. Our technique is an extension of a spatial join algorithm that operates on multiple inputs and counts long pattern instances. As demonstrated by experimentation, it yields significant performance improvements compared to previous approaches.
|
|
|
SESSION: Industry/government track papers |
|
|
|
|
TiVo: making show recommendations using a distributed collaborative filtering architecture |
| |
Kamal Ali,
Wijnand van Stam
|
|
Pages: 394-401 |
|
doi>10.1145/1014052.1014097 |
|
Full text: PDF
|
|
We describe the TiVo television show collaborative recommendation system, which has been fielded in over one million TiVo clients for four years. Over this install base, TiVo currently has approximately 100 million ratings by users over approximately 30,000 distinct TV shows and movies. TiVo uses an item-item (show-to-show) form of collaborative filtering which obviates the need to keep any persistent memory of each user's viewing preferences at the TiVo server. Taking advantage of TiVo's client-server architecture has produced a novel collaborative filtering system in which the server does a minimum of work and most work is delegated to the numerous clients. Nevertheless, the server-side processing is also highly scalable and parallelizable. Although we have not performed formal empirical evaluations of its accuracy, internal studies have shown its recommendations to be useful even for multiple-user households. TiVo's architecture also allows for throttling of the server, so if more server-side resources become available, more correlations can be computed on the server, allowing TiVo to make recommendations for niche audiences.
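The split of work between server and clients can be pictured with a small item-item sketch: the server computes show-to-show similarities from anonymous aggregate ratings, and each client combines those similarities with its own thumbs ratings. Everything below (Pearson similarity, the scoring rule, the toy data) is our own stand-in; the fielded system is not specified at this level of detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Server side: item-item similarities from aggregate thumbs ratings ---
# ratings: users x shows matrix of thumbs ratings (-3..+3).
ratings = rng.integers(-3, 4, size=(500, 40)).astype(float)

def item_item_correlations(R):
    """Pearson correlation between show columns -- a stand-in for whatever
    show-to-show similarity the fielded system computes."""
    Rc = R - R.mean(axis=0)
    cov = Rc.T @ Rc
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

C = item_item_correlations(ratings)          # shipped to clients

# --- Client side: score unseen shows from this user's own ratings --------
def recommend(user_ratings, C, top_k=5):
    rated = np.flatnonzero(user_ratings)
    scores = C[:, rated] @ user_ratings[rated]
    scores[rated] = -np.inf                  # do not re-recommend rated shows
    return np.argsort(scores)[::-1][:top_k]

my_thumbs = np.zeros(40)
my_thumbs[[2, 7, 11]] = [3, 2, -1]
print(recommend(my_thumbs, C))
```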
|
|
|
Predicting customer shopping lists from point-of-sale purchase data |
| |
Chad Cumby,
Andrew Fano,
Rayid Ghani,
Marko Krema
|
|
Pages: 402-409 |
|
doi>10.1145/1014052.1014098 |
|
Full text: PDF
|
|
This paper describes a prototype that predicts the shopping lists of customers in a retail store. Shopping list prediction is one aspect of a larger system we have developed for retailers to provide individual and personalized interactions with customers as they navigate through the retail store. Instead of using traditional personalization approaches, such as clustering or segmentation, we learn separate classifiers for each customer from historical transactional data. This allows us to make very fine-grained and accurate predictions about which items a particular individual customer will buy on a given shopping trip. We formally frame shopping list prediction as a classification problem, describe the algorithms and methodology behind our system and its impact on the business case in which we frame it, and explore some of the properties of the data source that make it an interesting testbed for KDD algorithms. Our results show that we can predict a shopper's shopping list with high levels of accuracy, precision, and recall. We believe that this work impacts both the data mining and the retail business communities. The formulation of shopping list prediction as a machine learning problem results in algorithms that should be useful beyond retail shopping list prediction. For retailers, the result is not only a practical system that increases revenues by up to 11%, but also one that enhances customer experience and loyalty by giving retailers the tools to interact individually with customers and anticipate their needs.
|
|
|
A rank sum test method for informative gene discovery |
| |
Lin Deng,
Jian Pei,
Jinwen Ma,
Dik Lun Lee
|
|
Pages: 410-419 |
|
doi>10.1145/1014052.1014099 |
|
Full text: PDF
|
|
Finding informative genes from microarray data is an important problem in bioinformatics research and applications. Most existing methods rank features according to their discriminative capability and then find a subset of discriminative genes (usually the top k genes). In particular, the t-statistic criterion and its variants have been adopted extensively. These methods rely on the statistical principle of the t-test, which requires that the data follow a normal distribution. However, according to our investigation, the normality condition often cannot be met in real data sets. To avoid the assumption of normality, in this paper we propose a rank sum test method for informative gene discovery. The method uses a rank-sum statistic as the ranking criterion. Moreover, we propose using a significance level threshold, instead of the number of informative genes, as the parameter; the significance level threshold carries a quality specification in statistics. We follow the Pitman efficiency theory to show that the rank sum method is more accurate and more robust than the t-statistic method in theory. To verify the effectiveness of the rank sum method, we use support vector machines (SVM) to construct classifiers based on the identified informative genes on two well-known data sets, namely the colon data and the leukemia data. The prediction accuracy reaches 96.2% on the colon data and 100% on the leukemia data. These results are clearly better than those from previous feature ranking methods. By experiments, we also verify that using a significance level threshold is more effective than directly specifying an arbitrary k.
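A minimal version of rank-sum screening is easy to express with SciPy's Wilcoxon rank-sum test; the sketch below keeps every gene whose two-group test is significant at a chosen level. The toy data, the absence of a multiple-testing adjustment, and the exact thresholding are our assumptions, and the downstream SVM step is omitted.

```python
import numpy as np
from scipy.stats import ranksums

def select_informative_genes(X, y, alpha=0.01):
    """Rank-sum (Wilcoxon) screening: X is samples x genes, y is a binary
    phenotype label.  A gene is kept when the two-sided rank-sum test
    between the two sample groups is significant at level alpha."""
    group0, group1 = X[y == 0], X[y == 1]
    selected = []
    for g in range(X.shape[1]):
        _, p = ranksums(group0[:, g], group1[:, g])
        if p < alpha:
            selected.append((g, p))
    return sorted(selected, key=lambda t: t[1])   # most significant first

rng = np.random.default_rng(0)
X = rng.lognormal(size=(40, 200))        # 40 samples, 200 genes, skewed data
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] *= 2.0                     # make the first five genes informative
print(select_informative_genes(X, y)[:10])
```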
|
|
|
Early detection of insider trading in option markets |
| |
Steve Donoho
|
|
Pages: 420-429 |
|
doi>10.1145/1014052.1014100 |
|
Full text: PDF
|
|
"Inside information" comes in many forms: knowledge of a corporate takeover, a terrorist attack, unexpectedly poor earnings, the FDA's acceptance of a new drug, etc. Anyone who knows some piece of soon-to-break news possesses inside information. Historically, ...
"Inside information" comes in many forms: knowledge of a corporate takeover, a terrorist attack, unexpectedly poor earnings, the FDA's acceptance of a new drug, etc. Anyone who knows some piece of soon-to-break news possesses inside information. Historically, insider trading has been detected after the news is public, but this is often too late: fraud has been perpetrated, innocent investors have been disadvantaged, or terrorist acts have been carried out. This paper explores early detection of insider trading - detection before the news breaks. Data mining holds great promise for this emerging application, but the problem also poses significant challenges. We present the specific problem of insider trading in option markets, compare decision tree, logistic regression, and neural net results to results from an expert model, and discuss insights that knowledge discovery techniques shed upon this problem. expand
|
|
|
Mining coherent gene clusters from gene-sample-time microarray data |
| |
Daxin Jiang,
Jian Pei,
Murali Ramanathan,
Chun Tang,
Aidong Zhang
|
|
Pages: 430-439 |
|
doi>10.1145/1014052.1014101 |
|
Full text: PDF
|
|
Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene-sample-time microarray data set, which records the expression levels of various genes under a set of samples during a series of time points. In particular, we propose the mining of coherent gene clusters from such data sets. Each cluster contains a subset of genes and a subset of samples such that the genes are coherent on the samples along the time series. The coherent gene clusters may identify the samples corresponding to some phenotypes (e.g., diseases), and suggest the candidate genes correlated to the phenotypes. We present two efficient algorithms, namely the Sample-Gene Search and the Gene-Sample Search, to mine the complete set of coherent gene clusters. We empirically evaluate the performance of our approaches on both a real microarray data set and synthetic data sets. The results show that our approaches are both efficient and effective in finding meaningful coherent gene clusters.
|
|
|
Eigenspace-based anomaly detection in computer systems |
| |
Tsuyoshi IDÉ,
Hisashi KASHIMA
|
|
Pages: 440-449 |
|
doi>10.1145/1014052.1014102 |
|
Full text: PDF
|
|
We report on an automated runtime anomaly detection method at the application layer of multi-node computer systems. Although several network management systems are available in the market, none of them have sufficient capabilities to detect faults in multi-tier Web-based systems with redundancy. We model a Web-based system as a weighted graph, where each node represents a "service" and each edge represents a dependency between services. Since the edge weights vary greatly over time, the problem we address is that of anomaly detection from a time sequence of graphs. In our method, we first extract a feature vector from the adjacency matrix that represents the activities of all of the services. The heart of our method is to use the principal eigenvector of the eigenclusters of the graph. Then we derive a probability distribution for an anomaly measure defined for a time series of directional data derived from the graph sequence. Given a critical probability, the threshold value is adaptively updated using a novel online algorithm. We demonstrate that a fault in a Web application can be automatically detected and the faulty services identified without using detailed knowledge of the behavior of the system.
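A stripped-down version of the eigenvector idea is sketched below: each graph snapshot is summarized by the principal eigenvector of its dependency matrix (an "activity vector"), and the anomaly measure is the directional deviation of the current vector from the dominant direction of recent history. The toy dependency matrices are ours, and the paper additionally fits a probability model to the scores and adapts the alarm threshold online rather than using a fixed cut-off.

```python
import numpy as np

rng = np.random.default_rng(0)

def activity_vector(W):
    """Principal eigenvector of a symmetric, non-negative service-dependency
    matrix: a unit-norm summary of current service activity."""
    _, vecs = np.linalg.eigh(W)
    u = np.abs(vecs[:, -1])
    return u / np.linalg.norm(u)

def anomaly_score(history, current):
    """1 - cosine similarity between the current activity vector and the
    dominant direction (top right-singular vector) of recent history."""
    _, _, Vt = np.linalg.svd(np.array(history), full_matrices=False)
    typical = np.abs(Vt[0])
    return 1.0 - float(current @ typical)

def noisy(W, eps=0.05):
    N = rng.random(W.shape)
    return W + eps * (N + N.T) / 2

n = 6
base = rng.random((n, n)); base = (base + base.T) / 2
history = [activity_vector(noisy(base)) for _ in range(20)]

normal = activity_vector(noisy(base))
faulty = base.copy(); faulty[3, :] = 0; faulty[:, 3] = 0   # service 3 stops responding
print(anomaly_score(history, normal), anomaly_score(history, activity_vector(faulty)))
```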
|
|
|
Effective localized regression for damage detection in large complex mechanical structures |
| |
Aleksandar Lazarevic,
Ramdev Kanapady,
Chandrika Kamath
|
|
Pages: 450-459 |
|
doi>10.1145/1014052.1014103 |
|
Full text: PDF
|
|
In this paper, we propose a novel data mining technique for efficient damage detection within large-scale complex mechanical structures. Every mechanical structure is defined by a set of finite elements called structure elements. Large-scale complex structures may have an extremely large number of structure elements, and predicting the failure in every single element using the original set of natural frequencies as features is an exceptionally time-consuming task. Traditional data mining techniques simply predict failure in each structure element individually using global prediction models that are built considering all data records. In order to reduce the time complexity of these models, we propose a localized clustering-regression approach that consists of two phases: (1) building a local cluster around a data record of interest and (2) predicting the intensity of damage only in those structure elements that correspond to data records from the built cluster. For each test data record, we first build a cluster of training data records around it. Then, for each data record that belongs to the discovered cluster, we identify the corresponding structure elements and build a localized regression model for each of these structure elements. These regression models for specific structure elements are constructed using only a specific set of relevant natural frequencies and only those data records that correspond to the failure of that structure element. Experiments performed on the problem of damage prediction in a large electric transmission tower frame indicate that the proposed localized clustering-regression approach is significantly more accurate and more computationally efficient than our previous hierarchical clustering approach, as well as global prediction models.
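The two-phase idea can be sketched compactly with scikit-learn: treat the k nearest training records around a query as the local cluster, then fit one small regression per structure element on that local data only. The neighborhood size, ridge regression, and toy data below are our assumptions; the paper additionally restricts each element's model to its relevant natural frequencies and to records where that element failed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import Ridge

def local_damage_prediction(X_train, Y_train, x_query, k=50):
    """(1) Form a local 'cluster' as the k nearest training records around
    the query; (2) fit one regression per output (structure element) on
    that local data and predict damage intensities for the query."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    Xl, Yl = X_train[idx[0]], Y_train[idx[0]]

    preds = np.empty(Y_train.shape[1])
    for e in range(Y_train.shape[1]):              # one model per element
        model = Ridge(alpha=1.0).fit(Xl, Yl[:, e])
        preds[e] = model.predict(x_query.reshape(1, -1))[0]
    return preds

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))                    # natural-frequency features
Y = np.clip(X @ rng.normal(size=(12, 30)) + 0.1 * rng.normal(size=(2000, 30)), 0, None)
print(local_damage_prediction(X, Y, X[0])[:5])
```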
|
|
|
Visually mining and monitoring massive time series |
| |
Jessica Lin,
Eamonn Keogh,
Stefano Lonardi,
Jeffrey P. Lankford,
Donna M. Nystrom
|
|
Pages: 460-469 |
|
doi>10.1145/1014052.1014104 |
|
Full text: PDF
|
|
Moments before the launch of every space vehicle, engineering discipline specialists must make a critical go/no-go decision. The cost of a false positive, allowing a launch in spite of a fault, or a false negative, stopping a potentially successful launch, can be measured in the tens of millions of dollars, not including the cost in morale and other more intangible detriments. The Aerospace Corporation is responsible for providing engineering assessments critical to the go/no-go decision for every Department of Defense space vehicle. These assessments are made by constantly monitoring streaming telemetry data in the hours before launch. We will introduce VizTree, a novel time-series visualization tool to aid the Aerospace analysts who must make these engineering assessments. VizTree was developed at the University of California, Riverside and is unique in that the same tool is used for mining archival data and monitoring incoming live telemetry. The use of a single tool for both aspects of the task allows a natural and intuitive transfer of mined knowledge to the monitoring task. Our visualization approach works by transforming the time series into a symbolic representation, and encoding the data in a modified suffix tree in which the frequency and other properties of patterns are mapped onto colors and other visual properties. We demonstrate the utility of our system by comparing it with state-of-the-art batch algorithms on several real and synthetic datasets.
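VizTree's display is built from a symbolic representation of the time series; the usual way to obtain such symbols is SAX-style discretization, sketched below. The segment count, alphabet size, and breakpoint scheme are illustrative choices on our part, not necessarily the tool's settings.

```python
import numpy as np
from scipy.stats import norm

def sax(series, n_segments=8, alphabet_size=4):
    """Symbolic Aggregate approXimation sketch: z-normalize, reduce the
    series to piecewise-aggregate means, then map each mean to a letter
    using equiprobable breakpoints under a standard normal."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    # Piecewise Aggregate Approximation: mean of each of n_segments chunks.
    paa = np.array([chunk.mean() for chunk in np.array_split(x, n_segments)])
    # Breakpoints that cut N(0,1) into alphabet_size equal-probability bins.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(chr(ord("a") + s) for s in symbols)

t = np.linspace(0, 4 * np.pi, 256)
print(sax(np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)))
```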
|
|
|
Learning to detect malicious executables in the wild |
| |
Jeremy Z. Kolter,
Marcus A. Maloof
|
|
Pages: 470-478 |
|
doi>10.1145/1014052.1014105 |
|
Full text: PDF
|
|
In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed the other methods with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
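The pipeline of byte n-gram extraction, feature selection, and boosted trees can be sketched as follows. The random toy "executables", the frequency-based vocabulary (the paper selects n-grams by relevance rather than frequency), and the use of scikit-learn's gradient boosting as a stand-in for the boosting used in the paper are all assumptions.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import GradientBoostingClassifier

def byte_ngrams(data: bytes, n: int = 4):
    """All length-n byte sequences in an executable, as hex strings."""
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]

def featurize(executables, vocabulary):
    """Boolean presence features over a fixed n-gram vocabulary."""
    X = np.zeros((len(executables), len(vocabulary)))
    index = {g: j for j, g in enumerate(vocabulary)}
    for i, data in enumerate(executables):
        for g in set(byte_ngrams(data)):
            if g in index:
                X[i, index[g]] = 1.0
    return X

# Toy stand-ins for benign / malicious executables (random bytes).
rng = np.random.default_rng(0)
benign = [rng.bytes(512) for _ in range(30)]
malicious = [rng.bytes(256) + b"\xde\xad\xbe\xef" * 8 for _ in range(30)]
y = np.array([0] * 30 + [1] * 30)

counts = Counter(g for ex in benign + malicious for g in set(byte_ngrams(ex)))
vocab = [g for g, _ in counts.most_common(500)]
X = featurize(benign + malicious, vocab)

clf = GradientBoostingClassifier().fit(X, y)   # boosted decision trees
print(clf.score(X, y))
```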
|
|
|
Predicting prostate cancer recurrence via maximizing the concordance index |
| |
Lian Yan,
David Verbel,
Olivier Saidi
|
|
Pages: 479-485 |
|
doi>10.1145/1014052.1014106 |
|
Full text: PDF
|
|
In order to effectively use machine learning algorithms, e.g., neural networks, for the analysis of survival data, the correct treatment of censored data is crucial. The concordance index (CI) is a typical metric for quantifying the predictive ability of a survival model. We propose a new algorithm that directly uses the CI as the objective function to train a model, which predicts whether an event will eventually occur or not. Directly optimizing the CI allows the model to make complete use of the information from both censored and non-censored observations. In particular, we approximate the CI via a differentiable function so that gradient-based methods can be used to train the model. We applied the new algorithm to predict the eventual recurrence of prostate cancer following radical prostatectomy. Compared with the traditional Cox proportional hazards model and several other algorithms based on neural networks and support vector machines, our algorithm achieves a significant improvement in being able to identify high-risk and low-risk groups of patients.
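A minimal version of the idea: replace the pairwise indicator in the concordance index with a sigmoid so the objective becomes differentiable, then do gradient ascent. The sketch below uses a linear risk score and synthetic survival data for brevity, whereas the paper trains a neural network; the smoothing width and learning rate are arbitrary.

```python
import numpy as np
from scipy.special import expit

def smooth_concordance(w, X, time, event, sigma=0.5):
    """Differentiable surrogate of the CI for a linear risk score s = Xw:
    over comparable pairs (i had an observed event before j), the indicator
    1[s_i > s_j] is replaced by a sigmoid, so the CI can be maximized by
    gradient ascent."""
    s = X @ w
    comp = event[:, None] & (time[:, None] < time[None, :])   # usable pairs
    p = expit((s[:, None] - s[None, :]) / sigma)
    den = comp.sum()
    ci = (p * comp).sum() / den
    coef = p * (1 - p) * comp / sigma
    grad = X.T @ (coef.sum(axis=1) - coef.sum(axis=0)) / den
    return ci, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -0.5, 0.2, 0.0, 0.0])
time = rng.exponential(np.exp(-X @ true_w))     # higher risk -> shorter time
event = rng.random(200) < 0.7                   # roughly 30% censored

w = np.zeros(5)
for _ in range(200):                            # plain gradient ascent
    ci, grad = smooth_concordance(w, X, time, event)
    w += 0.5 * grad
print(round(float(ci), 3))
```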
|
|
|
Density-based spam detector |
| |
Kenichi YOSHIDA,
Fuminori ADACHI,
Takashi WASHIO,
Hiroshi MOTODA,
Teruaki HOMMA,
Akihiro NAKASHIMA,
Hiromitsu FUJIKAWA,
Katsuyuki YAMAZAKI
|
|
Pages: 486-493 |
|
doi>10.1145/1014052.1014107 |
|
Full text: PDF
|
|
The volume of mass unsolicited electronic mail, often known as spam, has recently increased enormously and has become a serious threat not only to the Internet but also to society. This paper proposes a new spam detection method which uses document-space density information. Although it requires extensive e-mail traffic to acquire the necessary information, an unsupervised learning engine with a short white list can achieve a 98% recall rate and 100% precision. A direct-mapped cache method makes it possible to handle over 13,000 e-mails per second. Experimental results, obtained on traffic of over 50 million actual e-mails, are also reported in this paper.
|
|
|
V-Miner: using enhanced parallel coordinates to mine product design and test data |
| |
Kaidi Zhao,
Bing Liu,
Thomas M. Tirpak,
Andreas Schaller
|
|
Pages: 494-502 |
|
doi>10.1145/1014052.1014108 |
|
Full text: PDF
|
|
Analyzing data to find trends, correlations, and stable patterns is an important task in many industrial applications. This paper proposes a new technique based on parallel coordinate visualization. Previous work on parallel coordinate methods has shown that they are effective only when variables that are correlated and/or show similar patterns are displayed adjacently. Although current parallel coordinate tools allow the user to manually rearrange the order of variables, this process is very time-consuming when the number of variables is large; automated assistance is required. This paper introduces an edit-distance based technique to rearrange variables so that interesting change patterns can be easily detected visually. The Visual Miner (V-Miner) software includes both automated methods for visualizing common patterns and a query tool that enables the user to describe specific target patterns to be mined or displayed by the system. In addition, the system can filter data according to rule sets imported from other data mining tools. This feature has proven very helpful in practice, because it enables decision makers to visually identify interesting rules and data segments for further analysis or data mining. This paper begins with an introduction to the proposed techniques and the V-Miner system. Next, a case study illustrates how V-Miner has been used at Motorola to guide product design and test decisions.
|
|
|
POSTER SESSION: Research track posters |
|
|
|
|
On demand classification of data streams |
| |
Charu C. Aggarwal,
Jiawei Han,
Jianyong Wang,
Philip S. Yu
|
|
Pages: 503-508 |
|
doi>10.1145/1014052.1014110 |
|
Full text: PDF
|
|
Current models of the classification problem do not effectively handle bursts of particular classes coming in at different times. In fact, the current model of the classification problem simply concentrates on methods for one-pass classification modeling of very large data sets. Our model views data stream classification as a dynamic process in which simultaneous training and test streams are used for dynamic classification of data sets. This model reflects real-life situations effectively, since it is desirable to classify test streams in real time over an evolving training and test stream. The aim here is to create a classification system in which the training model can adapt quickly to changes in the underlying data stream. In order to achieve this goal, we propose an on-demand classification process which can dynamically select the appropriate window of past training data to build the classifier. The empirical results indicate that the system maintains a high classification accuracy in an evolving data stream, while providing an efficient solution to the classification task.
|
|
|
A generalized maximum entropy approach to Bregman co-clustering and matrix approximation |
| |
Arindam Banerjee,
Inderjit Dhillon,
Joydeep Ghosh,
Srujana Merugu,
Dharmendra S. Modha
|
|
Pages: 509-514 |
|
doi>10.1145/1014052.1014111 |
|
Full text: PDF
|
|
Co-clustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an information-theoretic co-clustering approach applicable to empirical joint probability distributions was proposed. In many situations, co-clustering of more general matrices is desired. In this paper, we present a substantially generalized co-clustering framework wherein any Bregman divergence can be used in the objective function, and various conditional expectation based constraints can be considered based on the statistics that need to be preserved. Analysis of the co-clustering problem leads to the minimum Bregman information principle, which generalizes the maximum entropy principle, and yields an elegant meta algorithm that is guaranteed to achieve local optimality. Our methodology yields new algorithms and also encompasses several previously known clustering and co-clustering algorithms based on alternate minimization.
|
|
|
An objective evaluation criterion for clustering |
| |
Arindam Banerjee,
John Langford
|
|
Pages: 515-520 |
|
doi>10.1145/1014052.1014112 |
|
Full text: PDF
|
|
We propose and test an objective criterion for evaluation of clustering performance: how well does a clustering algorithm, run on unlabeled data, aid a classification algorithm? The accuracy is quantified using the PAC-MDL bound [3] in a semi-supervised setting. Clustering algorithms which naturally separate the data according to (hidden) labels with a small number of clusters perform well. A simple extension of the argument leads to an objective model selection method. Experimental results on text analysis datasets demonstrate that this approach empirically results in very competitive bounds on test set performance on natural datasets.
|
|
|
Column-generation boosting methods for mixture of kernels |
| |
Jinbo Bi,
Tong Zhang,
Kristin P. Bennett
|
|
Pages: 521-526 |
|
doi>10.1145/1014052.1014113 |
|
Full text: PDF
|
|
We devise a boosting approach to classification and regression based on column generation using a mixture of kernels. Traditional kernel methods construct models based on a single positive semi-definite kernel, with the type of kernel predefined and kernel parameters chosen according to cross-validation performance. Our approach creates models that are mixtures of a library of kernel models, and our algorithm automatically determines the kernels to be used in the final model. The 1-norm and 2-norm regularization methods are employed to restrict the ensemble of kernel models. The proposed method produces sparser solutions, and thus significantly reduces the testing time. By extending the column generation (CG) optimization, previously developed for linear programs with 1-norm regularization, to quadratic programs with 2-norm regularization, we are able to solve many learning formulations by leveraging various algorithms for constructing single kernel models. By giving different priorities to the columns to be generated, we are able to scale CG boosting to large datasets. Experimental results on benchmark data are included to demonstrate its effectiveness.
|
|
|
IncSpan: incremental mining of sequential patterns in large database |
| |
Hong Cheng,
Xifeng Yan,
Jiawei Han
|
|
Pages: 527-532 |
|
doi>10.1145/1014052.1014114 |
|
Full text: PDF
|
|
Many real-life sequence databases grow incrementally. It is undesirable to mine sequential patterns from scratch each time a small set of sequences grows, or when some new sequences are added into the database. Incremental algorithms should be developed for sequential pattern mining so that mining can be adapted to incremental database updates. However, it is nontrivial to mine sequential patterns incrementally, especially when the existing sequences grow incrementally, because such growth may lead to the generation of many new patterns due to the interactions of the growing subsequences with the original ones. In this study, we develop an efficient algorithm, IncSpan, for incremental mining of sequential patterns, by exploring some interesting properties. Our performance study shows that IncSpan outperforms some previously proposed incremental algorithms, as well as a non-incremental one, by a wide margin.
|
|
|
Parallel computation of high dimensional robust correlation and covariance matrices |
| |
James Chilson,
Raymond Ng,
Alan Wagner,
Ruben Zamar
|
|
Pages: 533-538 |
|
doi>10.1145/1014052.1014115 |
|
Full text: PDF
|
|
The computation of covariance and correlation matrices is critical to many data mining applications and processes. Unfortunately, the classical covariance and correlation matrices are very sensitive to outliers. Robust methods, such as QC and the Maronna method, have been proposed. However, existing algorithms for QC only give acceptable performance when the dimensionality of the matrix is in the hundreds, and the Maronna method is rarely used in practice because of its high computational cost. In this paper, we develop parallel algorithms for both QC and the Maronna method. We evaluate these parallel algorithms using a real data set of the gene expression of over 6,000 genes, giving rise to a matrix of over 18 million entries. In our experimental evaluation, we explore scalability in dimensionality and in the number of processors. We also compare the parallel behaviours of the two methods. After thorough experimentation, we conclude that for many data mining applications, both QC and Maronna are viable options. The less robust, but faster, QC is the recommended choice for small parallel platforms. On the other hand, the Maronna method is the recommended choice when a high degree of robustness is required, or when the parallel platform features a high number of processors.
|
|
|
Belief state approaches to signaling alarms in surveillance systems |
| |
Kaustav Das,
Andrew Moore,
Jeff Schneider
|
|
Pages: 539-544 |
|
doi>10.1145/1014052.1014116 |
|
Full text: PDF
|
|
Surveillance systems have long been used to monitor industrial processes and are becoming increasingly popular in public health and anti-terrorism applications. Most early detection systems produce a time series of p-values or some other statistic as their output. Typically, the decision to signal an alarm is based on a threshold or other simple algorithm, such as CUSUM, that accumulates detection information temporally. We formulate a POMDP model of underlying events and observations from a detector. We solve the model and show how it is used for single-output detectors. When dealing with spatio-temporal data, scan statistics are a popular method of building detectors. We describe the use of scan statistics in surveillance and how our POMDP model can be used to perform alarm signaling with them. We compare the results obtained by our method with simple thresholding and CUSUM on synthetic and semi-synthetic health data.
|
|
|
Locating secret messages in images |
| |
Ian Davidson,
Goutam Paul
|
|
Pages: 545-550 |
|
doi>10.1145/1014052.1014117 |
|
Full text: PDF
|
|
Steganography involves hiding messages in innocuous media such as images, while steganalysis is the field of detecting these secret messages. The ultimate goal of steganalysis is two-fold: first, making a binary classification of a file as stego-bearing or innocent, and second, locating the hidden message with an aim to extracting, sterilizing, or manipulating it. Almost all steganalysis approaches (known as attacks) focus on the first of these two issues. In this paper, we explore the difficult related problem: given that we know an image file contains steganography, locate which pixels contain the message. We treat the hidden message location problem as outlier detection using probability/energy measures of images motivated by the image restoration community. Pixels contributing the most to the energy calculations of an image are deemed outliers. Typically, among the top third of one percent of the most energized pixels (outliers), we find that 87% are stego-bearing in color images and 61% in grayscale images. In all image types only 1% of all pixels are stego-bearing, indicating that our technique provides a substantial lift over random guessing.
|
|
|
Kernel k-means: spectral clustering and normalized cuts |
| |
Inderjit S. Dhillon,
Yuqiang Guan,
Brian Kulis
|
|
Pages: 551-556 |
|
doi>10.1145/1014052.1014118 |
|
Full text: PDF
|
|
Kernel k-means and spectral clustering have both been used to identify clusters that are non-linearly separable in input space. Despite significant research, these methods have remained only loosely related. In this paper, we give an explicit theoretical connection between them. We show the generality of the weighted kernel k-means objective function, and derive the spectral clustering objective of normalized cut as a special case. Given a positive definite similarity matrix, our results lead to a novel weighted kernel k-means algorithm that monotonically decreases the normalized cut. This has important implications: a) eigenvector-based algorithms, which can be computationally prohibitive, are not essential for minimizing normalized cuts, b) various techniques, such as local search and acceleration schemes, may be used to improve the quality as well as speed of kernel k-means. Finally, we present results on several interesting data sets, including diametrical clustering of large gene-expression matrices and a handwriting recognition data set.
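For reference, here is a minimal unweighted kernel k-means sketch that computes distances to cluster means purely through the kernel matrix; the paper's weighted variant, which is what recovers the normalized-cut objective, adds per-point weights to these averages. The RBF kernel, bandwidth, and ring data below are our toy choices.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=20, rng=np.random.default_rng(0)):
    """Plain kernel k-means on a precomputed kernel matrix K, using
    ||phi(x_i) - m_c||^2 = K_ii - 2*mean_{j in c} K_ij + mean_{j,l in c} K_jl."""
    n = K.shape[0]
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        dist = np.empty((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:                      # re-seed an empty cluster
                idx = np.array([rng.integers(n)])
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric rings: a standard example that is not linearly separable.
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.normal(size=200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                              # RBF kernel
print(np.bincount(kernel_kmeans(K, k=2)))
```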
|
|
|
A microeconomic data mining problem: customer-oriented catalog segmentation |
| |
Martin Ester,
Rong Ge,
Wen Jin,
Zengjian Hu
|
|
Pages: 557-562 |
|
doi>10.1145/1014052.1014119 |
|
Full text: PDF
|
|
The microeconomic framework for data mining [7] assumes that an enterprise chooses a decision maximizing the overall utility over all customers, where the contribution of a customer is a function of the data available on that customer. In Catalog Segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. However, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. Therefore, in this paper, we investigate an alternative problem formulation, which we call Customer-Oriented Catalog Segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. We formally introduce the Customer-Oriented Catalog Segmentation problem and discuss its complexity. Then we investigate two different paradigms for designing efficient, approximate algorithms for the Customer-Oriented Catalog Segmentation problem: greedy (deterministic) and randomized algorithms. Since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we also explore a combination of these two paradigms. Our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical Catalog Segmentation algorithms.
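To make the objective concrete, the sketch below implements a greedy heuristic for the single-catalog case (k = 1): it repeatedly adds the product that most increases the number of customers with at least t interesting catalog products. The tie-breaking rule and toy interest matrix are our own; the paper studies greedy and randomized algorithms, and their combination, in more generality.

```python
import numpy as np

def greedy_catalog(interest, r, t):
    """interest: 0/1 customer x product matrix; r: catalog size; t: minimum
    number of catalog products a customer must like to count as covered.
    Each step adds the product with the largest coverage gain (ties broken
    by total interest)."""
    n_customers, n_products = interest.shape
    catalog = []
    hits = np.zeros(n_customers, dtype=int)     # liked catalog products so far
    for _ in range(r):
        best, best_gain = None, (-1, -1)
        for p in range(n_products):
            if p in catalog:
                continue
            covered = int(((hits + interest[:, p]) >= t).sum())
            gain = (covered, int(interest[:, p].sum()))
            if gain > best_gain:
                best, best_gain = p, gain
        catalog.append(best)
        hits += interest[:, best]
    return catalog, int((hits >= t).sum())

rng = np.random.default_rng(0)
interest = (rng.random((500, 40)) < 0.1).astype(int)
print(greedy_catalog(interest, r=10, t=2))
```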
|
|
|
k-TTP: a new privacy model for large-scale distributed environments |
| |
Bobi Gilburd,
Assaf Schuster,
Ran Wolff
|
|
Pages: 563-568 |
|
doi>10.1145/1014052.1014120 |
|
Full text: PDF
|
|
Secure multiparty computation allows parties to jointly compute a function of their private inputs without revealing anything but the output. Theoretical results [2] provide a general construction of such protocols for any function. Protocols obtained in this way are, however, inefficient, and thus, practically speaking, useless when a large number of participants are involved. The contribution of this paper is to define a new privacy model -- k-privacy -- by means of an innovative, yet natural generalization of the accepted trusted third party model. This allows implementing cryptographically secure efficient primitives for real-world large-scale distributed systems. As an example of the usefulness of the proposed model, we employ k-privacy to introduce a technique for obtaining knowledge -- by way of an association-rule mining algorithm -- from large-scale Data Grids, while ensuring that the privacy is cryptographically secure.
|
|
|
Diagnosing extrapolation: tree-based density estimation |
| |
Giles Hooker
|
|
Pages: 569-574 |
|
doi>10.1145/1014052.1014121 |
|
Full text: PDF
|
|
There has historically been very little concern with extrapolation in machine learning, yet extrapolation can be critical to diagnose. Predictor functions are almost always learned on a set of highly correlated data comprising a very small segment of predictor space. Moreover, flexible predictors, by their very nature, are not controlled at points of extrapolation. This becomes a problem for diagnostic tools that require evaluation on a product distribution. It is also an issue when we are trying to optimize a response over some variable in the input space. Finally, it can be a problem in non-static systems in which the underlying predictor distribution gradually drifts with time, or when typographical errors misrecord the values of some predictors. We present a diagnosis for extrapolation as a statistical test for a point originating from the data distribution as opposed to a null-hypothesis uniform distribution. This allows us to employ general classification methods for estimating such a test statistic. Further, we observe that CART can be modified to accept an exact distribution as an argument, providing a better classification tool which becomes our extrapolation-detection procedure. We explore some of the advantages of this approach and present examples of its practical application.
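The classification-based diagnosis can be sketched quickly: label the observed data against uniform samples drawn over its bounding box, train any classifier, and read off each new point's probability of looking "uniform" as an extrapolation score. The paper's refinement is a modified CART that takes the uniform reference distribution exactly instead of by sampling; the random forest, sampling ratio, and toy data below are our substitutes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extrapolation_scorer(X_train, n_fake_per_real=1, rng=np.random.default_rng(0)):
    """Label the observed data 1 and uniform samples over its bounding box 0,
    train a classifier, and return a scorer giving P(point looks uniform)."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    n_fake = n_fake_per_real * len(X_train)
    X_fake = rng.uniform(lo, hi, size=(n_fake, X_train.shape[1]))
    X_all = np.vstack([X_train, X_fake])
    y_all = np.r_[np.ones(len(X_train)), np.zeros(n_fake)]
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y_all)
    return lambda X_new: clf.predict_proba(X_new)[:, 0]

rng = np.random.default_rng(1)
# Highly correlated training data occupying a thin diagonal of the box.
z = rng.normal(size=(1000, 1))
X = np.hstack([z, z + 0.05 * rng.normal(size=(1000, 1))])
score = extrapolation_scorer(X)
print(score(np.array([[0.0, 0.0], [2.0, -2.0]])))   # on-manifold vs. off-manifold
```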
|
|
|
Discovering additive structure in black box functions |
| |
Giles Hooker
|
|
Pages: 575-580 |
|
doi>10.1145/1014052.1014122 |
|
Full text: PDF
|
|
Many automated learning procedures lack interpretability, operating effectively as a black box: providing a prediction tool but no explanation of the underlying dynamics that drive it. A common approach to interpretation is to plot the dependence of a learned function on one or two predictors. We present a method that seeks not to display the behavior of a function, but to evaluate the importance of non-additive interactions within any set of variables. Should the function be close to a sum of low-dimensional components, these components can be viewed and even modeled parametrically. Alternatively, the work here provides an indication of where intrinsically high-dimensional behavior takes place. The calculations used in this paper correspond closely to the functional ANOVA decomposition, a well-developed construction in statistics. In particular, the proposed score of interaction importance measures the loss associated with the projection of the prediction function onto a space of additive models. The algorithm runs in linear time, and we present displays of the output as a graphical model of the function for interpretation purposes.
|
|
|
SPIN: mining maximal frequent subgraphs from graph databases |
| |
Jun Huan,
Wei Wang,
Jan Prins,
Jiong Yang
|
|
Pages: 581-586 |
|
doi>10.1145/1014052.1014123 |
|
Full text: PDF
|
|
One fundamental challenge for mining recurring subgraphs from semi-structured data sets is the overwhelming abundance of such patterns. In large graph databases, the total number of frequent subgraphs can become too large to allow a full enumeration using reasonable computational resources. In this paper, we propose a new algorithm that mines only maximal frequent subgraphs, i.e., subgraphs that are not part of any other frequent subgraph. This may exponentially decrease the size of the output set in the best case; in our experiments on practical data sets, mining maximal frequent subgraphs reduces the total number of mined patterns by two to three orders of magnitude. Our method first mines all frequent trees from a general graph database and then reconstructs all maximal subgraphs from the mined trees. Using two chemical structure benchmarks and a set of synthetic graph data sets, we demonstrate that, in addition to decreasing the output size, our algorithm can achieve a five-fold speed-up over the current state-of-the-art subgraph mining algorithms.
|
|
|
On detecting space-time clusters |
| |
Vijay S. Iyengar
|
|
Pages: 587-592 |
|
doi>10.1145/1014052.1014124 |
|
Full text: PDF
|
|
Detection of space-time clusters is an important function in various domains (e.g., epidemiology and public health). The pioneering work on the spatial scan statistic is often used as the basis to detect and evaluate such clusters. State-of-the-art systems based on this approach detect clusters with restrictive shapes that cannot model growth and shifts in location over time. We extend these methods significantly by using the flexible square pyramid shape to model such effects. A heuristic search method is developed to detect the most likely clusters using a randomized algorithm in combination with geometric shape processing. The use of Monte Carlo methods in the original scan statistic formulation is continued in our work to address the multiple hypothesis testing issues. Our method is applied to a real data set on brain cancer occurrences over a 19-year period. The cluster detected by our method shows both growth and movement, which could not have been modeled with the simpler cylindrical shapes used earlier. Our general framework can be extended quite easily to handle other flexible shapes for the space-time clusters.
|
|
|
Why collective inference improves relational classification |
| |
David Jensen,
Jennifer Neville,
Brian Gallagher
|
|
Pages: 593-598 |
|
doi>10.1145/1014052.1014125 |
|
Full text: PDF
|
|
Procedures for collective inference make simultaneous statistical judgments about the same variables for a set of related data instances. For example, collective inference could be used to simultaneously classify a set of hyperlinked documents or infer the legitimacy of a set of related financial transactions. Several recent studies indicate that collective inference can significantly reduce classification error when compared with traditional inference techniques. We investigate the underlying mechanisms for this error reduction by reviewing past work on collective inference and characterizing different types of statistical models used for making inference in relational data. We show important differences among these models, and we characterize the necessary and sufficient conditions for reduced classification error based on experiments with real and simulated data.
|
|
|
When do data mining results violate privacy? |
| |
Murat Kantarcioǧlu,
Jiashun Jin,
Chris Clifton
|
|
Pages: 599-604 |
|
doi>10.1145/1014052.1014126 |
|
Full text: PDF
|
|
Privacy-preserving data mining has concentrated on obtaining valid results when the input data is private. An extreme example is Secure Multiparty Computation-based methods, where only the results are revealed. However, this still leaves a potential privacy breach: Do the results themselves violate privacy? This paper explores this issue, developing a framework under which this question can be addressed. Metrics are proposed, along with analysis that those metrics are consistent in the face of apparent problems.

Improved robustness of signature-based near-replica detection via lexicon randomization
Aleksander Kołcz, Abdur Chowdhury, Joshua Alspector
Pages: 605-610
doi>10.1145/1014052.1014127
Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to small changes to document content. We focus on approaches to near-replica detection that are based upon large-collection statistics and present a general technique of increasing their robustness via multiple lexicon randomization. In experiments with large web-page and spam-email datasets the proposed method is shown to consistently outperform traditional I-Match, with the relative improvement in duplicate-document recall reaching as high as 40-60%. The large gains in detection accuracy are offset by only small increases in computational requirements. expand
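As a rough illustration of the signature scheme discussed above, the sketch below computes an I-Match-style fingerprint as a hash of the document terms that fall inside a lexicon, and derives several randomized lexicons so that a near-replica only needs to collide on one of its signatures. The lexicon-perturbation scheme and the parameters are assumptions for illustration, not the authors' exact construction.

```python
# Minimal sketch of I-Match-style signatures with K randomized lexicons.
# Lexicon construction (random subsets of a base vocabulary) is an
# illustrative assumption, not the exact scheme used in the paper.
import hashlib
import random

def imatch_signature(doc_terms, lexicon):
    """Hash of the sorted terms the document shares with the lexicon."""
    kept = sorted(set(doc_terms) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

def randomized_lexicons(base_lexicon, k, drop_fraction=0.3, seed=0):
    """Derive k perturbed lexicons by randomly dropping a fraction of terms."""
    rng = random.Random(seed)
    base = sorted(base_lexicon)
    keep = int(len(base) * (1.0 - drop_fraction))
    return [set(rng.sample(base, keep)) for _ in range(k)]

def near_duplicates(docs, base_lexicon, k=5):
    """Flag two documents as near-replicas if any of their k signatures match."""
    lexicons = randomized_lexicons(base_lexicon, k)
    seen, pairs = {}, set()
    for name, terms in docs.items():
        for lex in lexicons:
            sig = imatch_signature(terms, lex)
            if sig in seen and seen[sig] != name:
                pairs.add(tuple(sorted((seen[sig], name))))
            seen.setdefault(sig, name)
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog".split(),
    "b": "the quick brown fox leaps over the lazy dog".split(),
    "c": "completely unrelated message about stock tips".split(),
}
vocab = {w for terms in docs.values() for w in terms}
print(near_duplicates(docs, vocab, k=5))
```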

Learning spatially variant dissimilarity (SVaD) measures
Krishna Kummamuru, Raghu Krishnapuram, Rakesh Agrawal
Pages: 611-616
doi>10.1145/1014052.1014128
Clustering algorithms typically operate on a feature vector representation of the data and find clusters that are compact with respect to an assumed (dis)similarity measure between the data points in feature space. This makes the type of clusters identified highly dependent on the assumed similarity measure. Building on recent work in this area, we formally define a class of spatially varying dissimilarity measures and propose algorithms to learn the dissimilarity measure automatically from the data. The idea is to identify clusters that are compact with respect to the unknown spatially varying dissimilarity measure. Our experiments show that the proposed algorithms are more stable and achieve better accuracy on various textual data sets when compared with similar algorithms proposed in the literature. expand

Clustering moving objects
Yifan Li, Jiawei Han, Jiong Yang
Pages: 617-622
doi>10.1145/1014052.1014129
Due to advances in positioning technologies, real-time information about moving objects is becoming increasingly available, which has posed new challenges to database research. As a long-standing technique for identifying overall distribution patterns in data, clustering has achieved considerable success in analyzing static datasets. In this paper, we study the problem of clustering moving objects, which can capture interesting pattern changes during the motion process and provide better insight into the essence of the mobile data points. To capture the spatio-temporal regularities of moving objects and handle large amounts of data, micro-clustering [20] is employed. Efficient techniques are proposed to keep the moving micro-clusters geographically small. Important events, such as collisions among moving micro-clusters, are also identified. In this way, high-quality moving micro-clusters are dynamically maintained, which leads to fast and competitive clustering results at any given time instant. We validate our approaches with a thorough experimental evaluation, in which orders-of-magnitude improvements in running time are observed over the standard K-Means clustering method [14].

A framework for ontology-driven subspace clustering
Jinze Liu, Wei Wang, Jiong Yang
Pages: 623-628
doi>10.1145/1014052.1014130
Traditional clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes. While domain knowledge is always the best way to justify clustering, few clustering algorithms have ever taken domain knowledge into consideration. In this paper, domain knowledge is represented by a hierarchical ontology. We develop a framework that directly incorporates domain knowledge into the clustering process, yielding a set of clusters with strong ontological implications. During the clustering process, ontology information is used to efficiently prune the exponential search space of subspace clustering algorithms. In addition, the algorithm generates an automatic interpretation of the clustering result by mapping the naturally and hierarchically organized subspace clusters with significant categorical enrichment onto the ontology hierarchy. Our experiments on a set of gene expression data using the gene ontology demonstrate that our ontology-driven pruning technique significantly improves clustering performance with minimal degradation of cluster quality. Moreover, many hierarchical organizations of gene clusters corresponding to sub-hierarchies in the gene ontology were also successfully captured.

The IOC algorithm: efficient many-class non-parametric classification for high-dimensional data
Ting Liu, Ke Yang, Andrew W. Moore
Pages: 629-634
doi>10.1145/1014052.1014131
This paper is about a variant of k nearest neighbor classification on large, many-class, high-dimensional datasets. K nearest neighbor remains a popular classification technique, especially in areas such as computer vision, drug activity prediction and astrophysics. Furthermore, many more modern classifiers, such as kernel-based Bayes classifiers or the prediction phase of SVMs, require computational regimes similar to k-NN. We believe that tractable k-NN algorithms therefore continue to be important. This paper relies on the insight that even with many classes, the task of finding the majority class among the k nearest neighbors of a query need not require us to explicitly find those k nearest neighbors. This insight was previously used in (Liu et al., 2003) in two algorithms called KNS2 and KNS3, which dealt with fast classification in the case of two classes. In this paper we show how a different approach, IOC (standing for the International Olympic Committee), can apply to the case of n classes where n > 2. IOC assumes a slightly different processing of the datapoints in the neighborhood of the query. This allows it to search a set of metric trees, one for each class. During the searches it is possible to quickly prune away classes that cannot possibly be the majority. We give experimental results on datasets of up to 5.8 × 10^5 records and 1.5 × 10^3 attributes, frequently showing an order of magnitude acceleration compared with each of (i) conventional linear scan, (ii) a well-known independent SR-tree implementation of conventional k-NN, and (iii) a highly optimized conventional k-NN metric tree search.

Sleeved coclustering
Avraham A. Melkman, Eran Shaham
Pages: 635-640
doi>10.1145/1014052.1014132
A coCluster of an m × n matrix X is a submatrix determined by a subset of the rows and a subset of the columns. The problem of finding coClusters with specific properties is of interest, in particular, in the analysis of microarray experiments. In that case the entries of the matrix X are the expression levels of m genes in each of n tissue samples. One goal of the analysis is to extract a subset of the samples and a subset of the genes such that the expression levels of the chosen genes behave similarly across the subset of the samples, presumably reflecting an underlying regulatory mechanism governing the expression levels of the genes. We propose to base the similarity of the genes in a coCluster on a simple biological model, in which the strength of the regulatory mechanism in sample j is H_j, and the response strength of gene i to the regulatory mechanism is G_i. In other words, every two genes participating in a good coCluster should have expression values, in each of the participating samples, whose ratio is a constant depending only on the two genes. Noise in the expression levels of genes is taken into account by allowing a deviation from the model, measured by a relative-error criterion. The sleeve-width of the coCluster reflects the extent to which entry (i, j) of the coCluster is allowed to deviate, relatively, from being expressed as the product G_i H_j. We present a polynomial-time Monte Carlo algorithm which outputs a list of coClusters whose sleeve-widths do not exceed a prespecified value. Moreover, we prove that the list includes, with fixed probability, a coCluster which is near-optimal in its dimensions. Extensive experimentation with synthetic data shows that the algorithm performs well.
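The multiplicative model described above can be checked directly on a candidate submatrix: fit G_i and H_j and measure the worst relative deviation. The sketch below does this with a simple log-space least-squares fit, which is an illustrative choice rather than the paper's estimator.

```python
# Sketch: how well does a candidate coCluster fit X[i][j] ~ G[i] * H[j]
# within a relative-error "sleeve"?  Fitting G and H via row/column means in
# log space is an illustrative assumption, not the paper's algorithm.
import math

def sleeve_width(X, rows, cols):
    """Max relative deviation of submatrix entries from a fitted G_i * H_j model.

    Fit: log X[i][j] ~ g_i + h_j via row/column means (least squares for the
    additive model in log space).  Requires strictly positive entries.
    """
    L = {(i, j): math.log(X[i][j]) for i in rows for j in cols}
    overall = sum(L.values()) / len(L)
    rmean = {i: sum(L[i, j] for j in cols) / len(cols) for i in rows}
    cmean = {j: sum(L[i, j] for i in rows) / len(rows) for j in cols}
    worst = 0.0
    for i in rows:
        for j in cols:
            pred = math.exp(rmean[i] + cmean[j] - overall)   # estimate of G_i * H_j
            worst = max(worst, abs(X[i][j] - pred) / pred)
    return worst

# Toy expression matrix: rows = genes, columns = samples.
X = [[2.0, 4.0, 8.0],
     [1.0, 2.1, 3.9],
     [3.0, 6.2, 12.5]]
print(sleeve_width(X, rows=[0, 1, 2], cols=[0, 1, 2]))  # small value => good coCluster
```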

Semantic representation: search and mining of multimedia content
Apostol (Paul) Natsev, Milind R. Naphade, John R. Smith
Pages: 641-646
doi>10.1145/1014052.1014133
Semantic understanding of multimedia content is critical in enabling effective access to all forms of digital media data. By making large media repositories searchable, semantic content descriptions greatly enhance the value of such data. Automatic semantic understanding is a very challenging problem and most media databases resort to describing content in terms of low-level features or using manually ascribed annotations. Recent techniques focus on detecting semantic concepts in video, such as indoor, outdoor, face, people, nature, etc. This approach works for a fixed lexicon for which annotated training examples exist. In this paper we consider the problem of using such semantic concept detection to map the video clips into semantic spaces. This is done by constructing a model vector that acts as a compact semantic representation of the underlying content. We then present experiments in the semantic spaces leveraging such information for enhanced semantic retrieval, classification, visualization, and data mining purposes. We evaluate these ideas using a large video corpus and demonstrate significant performance gains in retrieval effectiveness. expand

A quickstart in frequent structure mining can make a difference
Siegfried Nijssen, Joost N. Kok
Pages: 647-652
doi>10.1145/1014052.1014134
Given a database, structure mining algorithms search for substructures that satisfy constraints such as minimum frequency, minimum confidence, minimum interest and maximum frequency. Examples of substructures include graphs, trees and paths. For these substructures many mining algorithms have been proposed. In order to make graph mining more efficient, we investigate the use of the "quickstart principle", which is based on the fact that these classes of structures are contained in each other, thus allowing for the development of structure mining algorithms that split the search into steps of increasing complexity. We introduce the GrAph/Sequence/Tree extractiON (Gaston) algorithm that implements this idea by searching first for frequent paths, then frequent free trees and finally cyclic graphs. We investigate two alternatives for computing the frequency of structures and present experimental results to relate these alternatives. expand

Automatic multimedia cross-modal correlation discovery
Jia-Yu Pan, Hyung-Jeong Yang, Christos Faloutsos, Pinar Duygulu
Pages: 653-658
doi>10.1145/1014052.1014135
Given an image (or video clip, or audio song), how do we automatically assign keywords to it? The general problem is to find correlations across the media in a collection of multimedia objects such as video clips, with colors, and/or motion, and/or audio, and/or text scripts. We propose a novel, graph-based approach, "MMG", to discover such cross-modal correlations. Our "MMG" method requires no tuning, no clustering, and no user-determined constants; it can be applied to any multimedia collection, as long as we have a similarity function for each medium; and it scales linearly with the database size. We report auto-captioning experiments on the "standard" Corel image database of 680 MB, where it outperforms domain-specific, fine-tuned methods by up to 10 percentage points in captioning accuracy (a 50% relative improvement).

Estimating the size of the telephone universe: a Bayesian Mark-recapture approach
David Poole
Pages: 659-664
doi>10.1145/1014052.1014136
Mark-recapture models have for many years been used to estimate the unknown sizes of animal and bird populations. In this article we adapt a finite mixture mark-recapture model in order to estimate the number of active telephone lines in the USA. The idea is to use the calling patterns of lines that are observed on the long distance network to estimate the number of lines that do not appear on the network. We present a Bayesian approach and use Markov chain Monte Carlo methods to obtain inference from the posterior distributions of the model parameters. At the state level, our results are in fairly good agreement with recent published reports on line counts. For lines that are easily classified as business or residence, the estimates have low variance. When the classification is unknown, the variability increases considerably. Results are insensitive to changes in the prior distributions. We discuss the significant computational and data mining challenges caused by the scale of the data, approximately 350 million call-detail records per day observed over a number of weeks. expand
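For readers unfamiliar with mark-recapture, the classical two-sample (Chapman) estimator below conveys the basic idea of estimating an unobserved population from the overlap between two observation periods; the paper itself fits a richer Bayesian finite-mixture model by MCMC, which is not reproduced here, and the counts used are hypothetical.

```python
# Background sketch: the classical two-sample capture-recapture (Chapman)
# estimate of population size.  This is only the simplest member of the
# mark-recapture family, not the paper's Bayesian mixture model.
def chapman_estimate(n1, n2, m):
    """n1: lines seen in period 1, n2: lines seen in period 2, m: seen in both."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical counts of phone lines observed on the network in two weeks.
print(round(chapman_estimate(n1=120_000, n2=135_000, m=90_000)))  # roughly 180,000 active lines
```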

Cluster-based concept invention for statistical relational learning
Alexandrin Popescul, Lyle H. Ungar
Pages: 665-670
doi>10.1145/1014052.1014137
We use clustering to derive new relations which augment database schema used in automatic generation of predictive features in statistical relational learning. Entities derived from clusters increase the expressivity of feature spaces by creating new first-class concepts which contribute to the creation of new features. For example, in CiteSeer, papers can be clustered based on words or citations giving "topics", and authors can be clustered based on documents they co-author giving "communities". Such cluster-derived concepts become part of more complex feature expressions. Out of the large number of generated features, those which improve predictive accuracy are kept in the model, as decided by statistical feature selection criteria. We present results demonstrating improved accuracy on two tasks, venue prediction and link prediction, using CiteSeer data. expand

Identifying early buyers from purchase data
Paat Rusmevichientong, Shenghuo Zhu, David Selinger
Pages: 671-677
doi>10.1145/1014052.1014138
Market research has shown that consumers exhibit a variety of different purchasing behaviors; specifically, some tend to purchase products earlier than other consumers. Identifying such early buyers can help personalize marketing strategies, potentially improving their effectiveness. In this paper, we present a non-parametric approach to the problem of identifying early buyers from purchase data. Our formulation takes as input the detailed purchase information of each consumer, from which we construct a weighted directed graph whose nodes correspond to consumers and whose edges correspond to purchases consumers have in common; the edge weights indicate how frequently consumers purchase products earlier than other consumers. Identifying early buyers corresponds to the problem of finding a subset of nodes in the graph with maximum difference between the weights of the outgoing and incoming edges. This problem is a variation of the maximum cut problem in a directed graph. We provide an approximation algorithm based on the semidefinite programming (SDP) relaxations pioneered by Goemans and Williamson, and analyze its performance. We apply the algorithm to real purchase data from Amazon.com, providing new insights into consumer behaviors.
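The combinatorial objective described above is easy to state in code: score a candidate subset by its outgoing minus incoming edge weight. The sketch below pairs that objective with a naive hill-climbing search; the paper's actual algorithm is an SDP relaxation, which is not reproduced here, and the toy graph is invented for illustration.

```python
# Sketch of the objective: find a node subset S maximizing the total weight of
# edges leaving S minus the weight of edges entering S, with a simple
# hill-climbing baseline (not the paper's SDP-based algorithm).
def score(S, edges):
    """edges: dict (u, v) -> weight, meaning u tended to buy before v."""
    out_w = sum(w for (u, v), w in edges.items() if u in S and v not in S)
    in_w = sum(w for (u, v), w in edges.items() if u not in S and v in S)
    return out_w - in_w

def greedy_early_buyers(nodes, edges, iters=100):
    S = set()
    for _ in range(iters):
        improved = False
        for v in nodes:
            T = S ^ {v}                      # toggle membership of v
            if score(T, edges) > score(S, edges):
                S, improved = T, True
        if not improved:
            break
    return S

nodes = {"ann", "bob", "cat", "dan"}
edges = {("ann", "bob"): 3, ("ann", "cat"): 2, ("bob", "dan"): 1, ("dan", "ann"): 1}
print(greedy_early_buyers(nodes, edges))     # likely {'ann'}
```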

Privacy preserving regression modelling via distributed computation
Ashish P. Sanil, Alan F. Karr, Xiaodong Lin, Jerome P. Reiter
Pages: 677-682
doi>10.1145/1014052.1014139
Reluctance of data owners to share their possibly confidential or proprietary data with others who own related databases is a serious impediment to conducting a mutually beneficial data mining analysis. We address the case of vertically partitioned data, in which multiple data owners/agencies each possess a few attributes of every data record. We focus on the case of agencies wanting to conduct a linear regression analysis with complete records without disclosing the values of their own attributes. This paper describes an algorithm that enables such agencies to compute the exact regression coefficients of the global regression equation and also perform some basic goodness-of-fit diagnostics while protecting the confidentiality of their data. In more general settings beyond the privacy scenario, this algorithm can also be viewed as a method for the distributed computation of regression analyses.

Dense itemsets
Jouni K. Seppänen, Heikki Mannila
Pages: 683-688
doi>10.1145/1014052.1014140
Frequent itemset mining has been the subject of a great deal of work in data mining research ever since association rules were introduced. In this paper we address a problem with frequent itemsets: they only count rows where all their attributes are present, and do not allow for any noise. We show that generalizing the concept of frequency while preserving the performance of mining algorithms is nontrivial, and introduce a generalization of frequent itemsets, dense itemsets. Dense itemsets do not require all attributes to be present at the same time; instead, the itemset needs to define a sufficiently large submatrix that exceeds a given density threshold of attributes present. We consider the problem of computing all dense itemsets in a database. We give a levelwise algorithm for this problem, and also study the top-k variations, i.e., finding the k densest sets with a given support, or the k best-supported sets with a given density. These algorithms select the other parameter automatically, which simplifies mining dense itemsets in an explorative way. We show that the concept captures natural facets of data sets, and give extensive empirical results on the performance of the algorithms. Combining the concept of dense itemsets with set cover ideas, we also show that dense itemsets can be used to obtain succinct descriptions of large datasets. We also discuss some variations of dense itemsets.
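The density notion above can be made concrete in a few lines: the sketch below computes the fraction of present attributes in the submatrix spanned by a chosen row set and an itemset. The toy database is invented for illustration.

```python
# Sketch of the density behind dense itemsets: for an itemset I and a set of
# supporting rows R, density is the fraction of (row, attribute) cells in the
# R x I submatrix that are actually present.
def density(rows, itemset, db):
    """db: list of sets of attributes; rows: indices into db."""
    if not rows or not itemset:
        return 0.0
    present = sum(len(db[r] & itemset) for r in rows)
    return present / (len(rows) * len(itemset))

db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}]
I = {"a", "b", "c"}
print(density(rows=[0, 1, 2, 3], itemset=I, db=db))   # 9/12 = 0.75
print(density(rows=[0, 1], itemset=I, db=db))         # 5/6 ~ 0.83
```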

Generalizing the notion of support
Michael Steinbach, Pang-Ning Tan, Hui Xiong, Vipin Kumar
Pages: 689-694
doi>10.1145/1014052.1014141
The goal of this paper is to show that generalizing the notion of support can be useful in extending association analysis to non-traditional types of patterns and non-binary data. To that end, we describe a framework for generalizing support that is based on the simple, but useful observation that support can be viewed as the composition of two functions: a function that evaluates the strength or presence of a pattern in each object (transaction) and a function that summarizes these evaluations with a single number. A key goal of any framework is to allow people to more easily express, explore, and communicate ideas, and hence, we illustrate how our support framework can be used to describe support for a variety of commonly used association patterns, such as frequent itemsets, general Boolean patterns, and error-tolerant itemsets. We also present two examples of the practical usefulness of generalized support. One example shows the usefulness of support functions for continuous data. Another example shows how the hyperclique pattern---an association pattern originally defined for binary data---can be extended to continuous data by generalizing a support function. expand
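The composition view of support described above translates directly into code: treat support as summarize(evaluate(pattern, object)). The sketch below instantiates it for binary transactions and for continuous records; the particular evaluation and summarization functions are illustrative choices, not the only ones the framework allows.

```python
# Sketch of the "support = summarize(evaluate(...))" view: evaluate scores the
# presence/strength of a pattern in one object, summarize collapses those
# scores into a single support value.
def generalized_support(pattern, data, evaluate, summarize):
    return summarize([evaluate(pattern, obj) for obj in data])

# Classical frequent-itemset support: all-or-nothing evaluation, mean summary.
binary_eval = lambda itemset, txn: 1.0 if itemset <= txn else 0.0
mean = lambda scores: sum(scores) / len(scores)

# A continuous variant: minimum attribute value within the record.
min_eval = lambda attrs, record: min(record[a] for a in attrs)

transactions = [{"a", "b"}, {"a"}, {"a", "b", "c"}]
print(generalized_support({"a", "b"}, transactions, binary_eval, mean))   # 2/3

records = [{"x": 0.9, "y": 0.4}, {"x": 0.2, "y": 0.8}]
print(generalized_support({"x", "y"}, records, min_eval, mean))           # (0.4 + 0.2)/2 = 0.3
```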

Ordering patterns by combining opinions from multiple sources
Pang-Ning Tan, Rong Jin
Pages: 695-700
doi>10.1145/1014052.1014142
Pattern ordering is an important task in data mining because the number of patterns extracted by standard data mining algorithms often exceeds our capacity to manually analyze them. In this paper, we present an effective approach to address the pattern ordering problem by combining the rank information gathered from disparate sources. Although rank aggregation techniques have been developed for applications such as meta-search engines, they are not directly applicable to pattern ordering for two reasons. First, the techniques are mostly supervised, i.e., they require a sufficient amount of labeled data. Second, the objects to be ranked are assumed to be independent and identically distributed (i.i.d), an assumption that seldom holds in pattern ordering. The method proposed in this paper is an adaptation of the original Hedge algorithm, modified to work in an unsupervised learning setting. Techniques for addressing the i.i.d. violation in pattern ordering are also presented. Experimental results demonstrate that our unsupervised Hedge algorithm outperforms many alternative techniques such as those based on weighted average ranking and singular value decomposition. expand
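A minimal sketch of a Hedge-style multiplicative-weights aggregator is given below, assuming disagreement with the current combined ranking as the unsupervised loss; this is meant only to convey the flavor of the update rule, not the paper's exact loss or its corrections for non-i.i.d. patterns.

```python
# Sketch of a Hedge-style multiplicative-weights scheme for combining pattern
# rankings from several sources.  The unsupervised loss (disagreement with the
# current weighted-Borda aggregate) is an illustrative assumption.
def combine(rankings, weights):
    """Weighted Borda: lower average weighted rank = better."""
    items = rankings[0]
    score = {x: sum(w * r.index(x) for r, w in zip(rankings, weights)) for x in items}
    return sorted(items, key=lambda x: score[x])

def hedge_rounds(rankings, beta=0.7, rounds=5):
    weights = [1.0] * len(rankings)
    for _ in range(rounds):
        agg = combine(rankings, weights)
        for k, r in enumerate(rankings):
            # loss in [0, 1]: normalized rank disagreement with the aggregate
            loss = sum(abs(r.index(x) - agg.index(x)) for x in agg)
            loss /= max(1, len(agg) ** 2 / 2)
            weights[k] *= beta ** loss
        total = sum(weights)
        weights = [w / total for w in weights]
    return combine(rankings, weights), weights

r1 = ["p1", "p2", "p3", "p4"]          # three sources ranking four patterns
r2 = ["p2", "p1", "p3", "p4"]
r3 = ["p4", "p3", "p2", "p1"]          # an outlier source, down-weighted over time
print(hedge_rounds([r1, r2, r3]))
```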

A generative probabilistic approach to visualizing sets of symbolic sequences
Peter Tiño, Ata Kabán, Yi Sun
Pages: 701-706
doi>10.1145/1014052.1014143
There is a notable interest in extending probabilistic generative modeling principles to accommodate for more complex structured data types. In this paper we develop a generative probabilistic model for visualizing sets of discrete symbolic sequences. The model, a constrained mixture of discrete hidden Markov models, is a generalization of density-based visualization methods previously developed for static data sets. We illustrate our approach on sequences representing web-log data and chorals by J.S. Bach. expand

Rotation invariant distance measures for trajectories
Michail Vlachos, D. Gunopulos, Gautam Das
Pages: 707-712
doi>10.1145/1014052.1014144
For the discovery of similar patterns in 1D time-series, it is very typical to perform a normalization of the data (for example, a transformation so that the data follow a zero mean and unit standard deviation). Such transformations can reveal latent patterns and are very commonly used in data mining applications. However, when dealing with multidimensional time-series, which appear naturally in applications such as video tracking and motion capture, similar motion patterns can also be expressed at different orientations. It is therefore imperative to provide support for additional transformations, such as rotation. In this work, we transform the positional information of moving data into a space that is translation, scale and rotation invariant. Our distance measure in the new space is able to detect elastic matches and can be efficiently lower bounded, thus being computationally tractable. The proposed methods are easy to implement, fast to compute and can have many applications for real-world problems, in areas such as handwriting recognition and posture estimation in motion-capture data. Finally, we empirically demonstrate the accuracy and the efficiency of the technique using real and synthetic handwriting data.

Privacy-preserving Bayesian network structure computation on distributed heterogeneous data
Rebecca Wright, Zhiqiang Yang
Pages: 713-718
doi>10.1145/1014052.1014145
As more and more activities are carried out using computers and computer networks, the amount of potentially sensitive data stored by businesses, governments, and other parties increases. Different parties may wish to benefit from cooperative use of their data, but privacy regulations and other privacy concerns may prevent the parties from sharing their data. Privacy-preserving data mining provides a solution by creating distributed data mining algorithms in which the underlying data is not revealed. In this paper, we present a privacy-preserving protocol for a particular data mining task: learning the Bayesian network structure for distributed heterogeneous data. In this setting, two parties owning confidential databases wish to learn the structure of a Bayesian network on the combination of their databases without revealing anything about their data to each other. We give an efficient and privacy-preserving version of the K2 algorithm to construct the structure of a Bayesian network for the parties' joint data.

Mining scale-free networks using geodesic clustering
Andrew Y. Wu, Michael Garland, Jiawei Han
Pages: 719-724
doi>10.1145/1014052.1014146
Many real-world graphs have been shown to be scale-free: vertex degrees follow power law distributions, vertices tend to cluster, and the average length of all shortest paths is small. We present a new model for understanding scale-free networks based on multilevel geodesic approximation, using a new data structure called a multilevel mesh. Using this multilevel framework, we propose a new kind of graph clustering for data reduction of very large graph systems such as social, biological, or electronic networks. Finally, we apply our algorithms to real-world social networks and protein interaction graphs to show that they can reveal knowledge embedded in underlying graph structures. We also demonstrate how our data structures can be used to quickly answer approximate distance and shortest path queries on scale-free networks.

IMMC: incremental maximum margin criterion
Jun Yan, Benyu Zhang, Shuicheng Yan, Qiang Yang, Hua Li, Zheng Chen, Wensi Xi, Weiguo Fan, Wei-Ying Ma, Qiansheng Cheng
Pages: 725-730
doi>10.1145/1014052.1014147
Subspace learning approaches have attracted much attention in academia recently. However, classical batch algorithms no longer meet the needs of applications involving streaming or large-scale data. To address this need, the Incremental Principal Component Analysis (IPCA) algorithm has been well established, but it is an unsupervised subspace learning approach and is not optimal for general classification tasks such as face recognition and Web document categorization. In this paper, we propose an incremental supervised subspace learning algorithm, called Incremental Maximum Margin Criterion (IMMC), which infers an adaptive subspace by optimizing the Maximum Margin Criterion. We also present a proof of convergence for the proposed algorithm. Experimental results on both synthetic and real-world datasets show that IMMC converges to a subspace similar to that of the batch approach.
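For orientation, the batch form of the Maximum Margin Criterion that IMMC optimizes incrementally can be computed directly as the top eigenvectors of the between-class minus within-class scatter. The sketch below shows only that batch computation on toy data, not the paper's incremental update.

```python
# Batch sketch of the Maximum Margin Criterion: project onto the top
# eigenvectors of (S_b - S_w).  The paper's contribution is the incremental
# (streaming) update, which is not reproduced here.
import numpy as np

def mmc_projection(X, y, dim):
    """X: (n_samples, n_features), y: integer labels. Returns (n_features, dim)."""
    mean = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    Sw = np.zeros_like(Sb)
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
        Sw += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    vals, vecs = np.linalg.eigh(Sb - Sw)          # symmetric matrix, so eigh
    return vecs[:, np.argsort(vals)[::-1][:dim]]  # top-`dim` eigenvectors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = mmc_projection(X, y, dim=1)
print((X @ W).shape)   # (100, 1) supervised embedding
```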

2PXMiner: an efficient two pass mining of frequent XML query patterns
Liang Huai Yang, Mong Li Lee, Wynne Hsu, Xinyu Guo
Pages: 731-736
doi>10.1145/1014052.1014148
Caching the results of frequent query patterns can improve the performance of query evaluation. This paper describes a two-pass mining algorithm called 2PXMiner to discover frequent XML query patterns. We design three data structures to expedite the mining process. Experimental results indicate that 2PXMiner is both efficient and scalable.

Redundancy based feature selection for microarray data
Lei Yu, Huan Liu
Pages: 737-742
doi>10.1145/1014052.1014149
In gene expression microarray data analysis, selecting a small number of discriminative genes from thousands of genes is an important problem for accurate classification of diseases or phenotypes. The problem becomes particularly challenging due to the large number of features (genes) and small sample size. Traditional gene selection methods often select the top-ranked genes according to their individual discriminative power without handling the high degree of redundancy among the genes. Latest research shows that removing redundant genes among selected ones can achieve a better representation of the characteristics of the targeted phenotypes and lead to improved classification accuracy. Hence, we study in this paper the relationship between feature relevance and redundancy and propose an efficient method that can effectively remove redundant genes. The efficiency and effectiveness of our method in comparison with representative methods has been demonstrated through an empirical study using public microarray data sets. expand
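A generic relevance-then-redundancy filter in the spirit of this approach is sketched below: rank genes by correlation with the class and greedily drop genes that are highly correlated with an already selected one. The correlation measure, threshold and synthetic data are assumptions for illustration; the paper develops its own relevance/redundancy analysis.

```python
# Illustrative relevance-then-redundancy filter (not the paper's exact method):
# keep the most class-relevant genes while skipping those that are nearly
# duplicates of genes already selected.
import numpy as np

def select_non_redundant(X, y, k=10, redundancy_threshold=0.9):
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    selected = []
    for j in np.argsort(relevance)[::-1]:          # most relevant first
        if all(abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) < redundancy_threshold
               for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return selected

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 40)
informative = y + 0.3 * rng.normal(size=40)
X = np.column_stack([informative, informative * 1.01, rng.normal(size=(40, 8))])
print(select_non_redundant(X, y, k=3))   # keeps only one of the two redundant copies
```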

A cross-collection mixture model for comparative text mining
ChengXiang Zhai, Atulya Velivelli, Bei Yu
Pages: 743-748
doi>10.1145/1014052.1014150
In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experiment results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model. expand

A data mining approach to modeling relationships among categories in image collection
Ruofei Zhang, Zhongfei (Mark) Zhang, Sandeep Khanzode
Pages: 749-754
doi>10.1145/1014052.1014151
This paper proposes a data mining approach to modeling relationships among categories in image collection. In our approach, with image feature grouping, a visual dictionary is created for color, texture, and shape feature attributes respectively. Labeling each training image with the keywords in the visual dictionary, a classification tree is built. Based on the statistical properties of the feature space we define a structure, called α-Semantics Graph, to discover the hidden semantic relationships among the semantic categories embodied in the image collection. With the α-Semantics Graph, each semantic category is modeled as a unique fuzzy set to explicitly address the semantic uncertainty and semantic overlap among the categories in the feature space. The model is utilized in the semantics-intensive image retrieval application. An algorithm using the classification accuracy measures is developed to combine the built classification tree with the fuzzy set modeling method to deliver semantically relevant image retrieval for a given query image. The experimental evaluations have demonstrated that the proposed approach models the semantic relationships effectively and the image retrieval prototype system utilizing the derived model is promising both in effectiveness and efficiency. expand

A DEA approach for model combination
Zhiqiang Zheng, Balaji Padmanabhan, Haoqiang Zheng
Pages: 755-760
doi>10.1145/1014052.1014152
This paper proposes a novel Data Envelopment Analysis (DEA) based approach for model combination. We first prove that for the 2-class classification problems DEA models identify the same convex hull as the popular ROC analysis used for model combination. For general k-class classifiers, we then develop a DEA-based method to combine multiple classifiers. Experiments show that the method outperforms other benchmark methods and suggest that DEA can be a promising tool for model combination. expand

Optimal randomization for privacy preserving data mining
Yu Zhu, Lei Liu
Pages: 761-766
doi>10.1145/1014052.1014153
Randomization is an economical and efficient approach for privacy preserving data mining (PPDM). In order to guarantee the performance of data mining and the protection of individual privacy, optimal randomization schemes need to be employed. This paper demonstrates the construction of optimal randomization schemes for privacy preserving density estimation. We propose a general framework for randomization using mixture models. The impact of randomization on data mining is quantified by performance degradation and mutual information loss, while privacy and privacy loss are quantified by interval-based metrics. Two different types of problems are defined to identify optimal randomization for PPDM. Illustrative examples and simulation results are reported. expand

POSTER SESSION: Industry/government track posters

Cross channel optimized marketing by reinforcement learning
Naoki Abe, Naval Verma, Chid Apte, Robert Schroko
Pages: 767-772
doi>10.1145/1014052.1016912
The issues of cross channel integration and customer life time value modeling are two of the most important topics surrounding customer relationship management (CRM) today. In the present paper, we describe and evaluate a novel solution that treats these two important issues in a unified framework of Markov Decision Processes (MDP). In particular, we report on the results of a joint project between IBM Research and Saks Fifth Avenue to investigate the applicability of this technology to real world problems. The business problem we use as a testbed for our evaluation is that of optimizing direct mail campaign mailings for maximization of profits in the store channel. We identify a problem common to cross-channel CRM, which we call the Cross-Channel Challenge, due to the lack of explicit linking between the marketing actions taken in one channel and the customer responses obtained in another. We provide a solution for this problem based on old and new techniques in reinforcement learning. Our in-laboratory experimental evaluation using actual customer interaction data show that as much as 7 to 8 per cent increase in the store profits can be expected, by employing a mailing policy automatically generated by our methodology. These results confirm that our approach is valid in dealing with the cross channel CRM scenarios in the real world. expand

Interactive training of advanced classifiers for mining remote sensing image archives
Selim Aksoy, Krzysztof Koperski, Carsten Tusk, Giovanni Marchisio
Pages: 773-782
doi>10.1145/1014052.1016913
Advances in satellite technology and the availability of downloaded images constantly increase the sizes of remote sensing image archives. Automatic content extraction, classification and content-based retrieval have become highly desired goals for the development of intelligent remote sensing databases. The common approach for mining these databases uses rules created by analysts. However, incorporating GIS information and human expert knowledge with digital image processing improves remote sensing image analysis. We developed a system that uses decision tree classifiers for interactive learning of land cover models and mining of image archives. Decision trees provide a promising solution for this problem because they can operate on both numerical (continuous) and categorical (discrete) data sources, and they do not require any assumptions about either the distributions or the independence of attribute values. This is especially important for the fusion of measurements from different sources such as spectral data, DEM data and other ancillary GIS data. Furthermore, using surrogate splits provides the capability of dealing with missing data during both training and classification, and enables handling instrument malfunctions or cases where one or more measurements do not exist for some locations. Quantitative and qualitative performance evaluation showed that decision trees provide powerful tools for modeling both pixel and region contents of images and for mining remote sensing image archives.

Exploring the community structure of newsgroups
Christian Borgs, Jennifer Chayes, Mohammad Mahdian, Amin Saberi
Pages: 783-787
doi>10.1145/1014052.1016914
We propose to use the community structure of Usenet for organizing and retrieving the information stored in newsgroups. In particular, we study the network formed by cross-posts, messages that are posted to two or more newsgroups simultaneously. We present what is, to our knowledge, by far the most detailed data that has been collected on Usenet cross-postings. We analyze this network to show that it is a small-world network with significant clustering. We also present a spectral algorithm which clusters newsgroups based on the cross-post matrix. The result of our clustering provides a topical classification of newsgroups. Our clustering gives many examples of significant relationships that would be missed by semantic clustering methods. expand

Feature selection in scientific applications
Erick Cantú-Paz, Shawn Newsam, Chandrika Kamath
Pages: 788-793
doi>10.1145/1014052.1016915
Numerous applications of data mining to scientific data involve the induction of a classification model. In many cases, the collection of data is not performed with this task in mind, and therefore, the data might contain irrelevant or redundant features that affect negatively the accuracy of the induction algorithms. The size and dimensionality of typical scientific data make it difficult to use any available domain information to identify features that discriminate between the classes of interest. Similarly, exploratory data analysis techniques have limitations on the amount and dimensionality of the data they can process effectively. In this paper, we describe applications of efficient feature selection methods to data sets from astronomy, plasma physics, and remote sensing. We use variations of recently proposed filter methods as well as traditional wrapper approaches, where practical. We discuss the general challenges of feature selection in scientific datasets, the strategies for success that were common among our diverse applications, and the lessons learned in solving these problems. expand

A general approach to incorporate data quality matrices into data mining algorithms
Ian Davidson, Ashish Grover, Ashwin Satyanarayana, Giri K. Tayi
Pages: 794-798
doi>10.1145/1014052.1016916
Data quality is a central issue for many information-oriented organizations. Recent advances in the data quality field reflect the view that a database is the product of a manufacturing process. While routine errors, such as non-existent zip codes, can be detected and corrected using traditional data cleansing tools, many errors systemic to the manufacturing process cannot be addressed. Therefore, the product of the data manufacturing process is an imprecise recording of information about the entities of interest (i.e. customers, transactions or assets). In this way, the database is only one (flawed) version of the entities it is supposed to represent. Quality assurance systems such as Motorola's Six-Sigma and other continuous improvement methods document the data manufacturing process's shortcomings. A widespread method of documentation is quality matrices. In this paper, we explore the use of the readily available data quality matrices for the data mining classification task. We first illustrate that if we do not factor in these quality matrices, then our results for prediction are sub-optimal. We then suggest a general-purpose ensemble approach that perturbs the data according to these quality matrices to improve the predictive accuracy and show the improvement is due to a reduction in variance. expand
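The perturbation-ensemble idea can be sketched as follows, assuming a simple per-attribute confusion (quality) matrix and a decision tree as the base learner; both, together with the synthetic data, are illustrative stand-ins rather than the paper's exact setup.

```python
# Sketch of a quality-matrix-driven ensemble: each member is trained on a copy
# of the data in which a noisy attribute has been re-sampled according to its
# documented quality (confusion) matrix; predictions are combined by vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def perturb_column(X, col, quality, rng):
    """quality[v]: dict mapping a recorded value v to a distribution over plausible true values."""
    Xp = X.copy()
    for i, v in enumerate(X[:, col]):
        values, probs = zip(*quality[int(v)].items())
        Xp[i, col] = rng.choice(values, p=probs)
    return Xp

def quality_ensemble(X, y, X_test, col, quality, n_members=25, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(X_test), 2))
    for _ in range(n_members):
        model = DecisionTreeClassifier().fit(perturb_column(X, col, quality, rng), y)
        preds = model.predict(X_test)
        votes[np.arange(len(X_test)), preds.astype(int)] += 1
    return votes.argmax(axis=1)

# Attribute 0 is categorical (0/1/2) and assumed to be mis-recorded 20% of the time.
quality = {0: {0: 0.8, 1: 0.1, 2: 0.1}, 1: {0: 0.1, 1: 0.8, 2: 0.1}, 2: {0: 0.1, 1: 0.1, 2: 0.8}}
rng = np.random.default_rng(2)
X = np.column_stack([rng.integers(0, 3, 200), rng.normal(size=200)])
y = (X[:, 0] == 2).astype(int)
print(quality_ensemble(X[:150], y[:150], X[150:], col=0, quality=quality)[:10])
```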

ANN quality diagnostic models for packaging manufacturing: an industrial data mining case study
Nicolás de Abajo, Alberto B. Diez, Vanesa Lobato, Sergio R. Cuesta
Pages: 799-804
doi>10.1145/1014052.1016917
World steel trade becomes more competitive every day and new high international quality standards and productivity levels can only be achieved by applying the latest computational technologies. Data driven analysis of complex processes is necessary in many industrial applications where analytical modeling is not possible. This paper presents the deployment of KDD technology in one real industrial problem: the development of new tinplate quality diagnostic models.The electrodeposition of tin on steel strips is the most critical stage of a complex process that involves a great amount of variables and operating conditions. Its optimization is not only a great commercial and economic challenge but also a compulsion due to the social impact of the tinplate product-more than 90% of the production is used for food packaging. The necessary certification with standards, like ISO 9000, requires the use of diagnostic models to minimize the costs and the environmental impact. This aim has been achieved following the multi-stage DM methodology CRISP-DM and a novel application of pro-active maintenance methods, as FMEA, for the identification of the specific process anomalies. Three DM tools have been used for the development of the models. The final results include two ANN tinplate quality diagnostic models, that provide the estimated quality of the final product just seconds after its production and only based on the process data. The results have much better performance than the classical Faraday's models widely used for the estimation. expand

A system for automated mapping of bill-of-materials part numbers
Jayant Kalagnanam, Moninder Singh, Sudhir Verma, Michael Patek, Yuk Wah Wong
Pages: 805-810
doi>10.1145/1014052.1016918
Part numbers are widely used within an enterprise throughout the manufacturing process. The point of entry of such part numbers into this process is normally via a Bill of Materials, or BOM, sent by a contact manufacturer or supplier. Each line of the BOM provides information about one part such as the supplier part number, the BOM receiver's corresponding internal part number, an unstructured textual part description, the supplier name, etc. However, in a substantial number of cases, the BOM receiver's internal part number is absent. Hence, before this part can be incorporated into the receiver's manufacturing process, it has to be mapped to an internal part (of the BOM receiver) based on the information of the part in the BOM. Historically, this mapping process has been done manually which is a highly time-consuming, labor intensive and error-prone process. This paper describes a system for automating the mapping of BOM part numbers. The system uses a two step modeling and mapping approach. First, the system uses historical BOM data, receiver's part specifications data and receiver's part taxonomic data along with domain knowledge to automatically learn classification models for mapping a given BOM part description to successively lower levels of the receiver's part taxonomy to reduce the set of potential internal parts to which the BOM part could map to. Then, information about various part parameters is extracted from the BOM part description and compared to the specifications data of the potential internal parts to choose the final mapped internal part. Mappings done by the system are very accurate, and the system is currently being deployed within IBM for mapping BOMs received by the corporate procurement/manufacturing divisions. expand

Tracking dynamics of topic trends using a finite mixture model
Satoshi Morinaga, Kenji Yamanishi
Pages: 811-816
doi>10.1145/1014052.1016919
In a wide range of business areas dealing with text data streams, including CRM, knowledge management, and Web monitoring services, it is an important issue to discover topic trends and analyze their dynamics in real-time. Specifically we consider the following three tasks in topic trend analysis: 1)Topic Structure Identification; identifying what kinds of main topics exist and how important they are, 2)Topic Emergence Detection; detecting the emergence of a new topic and recognizing how it grows, 3)Topic Characterization; identifying the characteristics for each of main topics. For real topic analysis systems, we may require that these three tasks be performed in an on-line fashion rather than in a retrospective way, and be dealt with in a single framework. This paper proposes a new topic analysis framework which satisfies this requirement from a unifying viewpoint that a topic structure is modeled using a finite mixture model and that any change of a topic trend is tracked by learning the finite mixture model dynamically. In this framework we propose the usage of a time-stamp based discounting learning algorithm in order to realize real-time topic structure identification. This enables tracking the topic structure adaptively by forgetting out-of-date statistics. Further we apply the theory of dynamic model selection to detecting changes of main components in the finite mixture model in order to realize topic emergence detection. We demonstrate the effectiveness of our framework using real data collected at a help desk to show that we are able to track dynamics of topic trends in a timely fashion. expand
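The time-stamp-based discounting can be illustrated with a small online tracker that decays all sufficient statistics by gamma raised to the elapsed time before absorbing each new document. The single-step responsibility update, the smoothing scheme and the toy help-desk documents below are simplifying assumptions, not the paper's algorithm.

```python
# Sketch of time-stamp-based discounting for an online multinomial mixture:
# statistics are multiplied by gamma ** (t - t_last) so stale documents fade.
import math
from collections import defaultdict

class DiscountedTopicTracker:
    """Online multinomial mixture with time-stamp based forgetting (sketch)."""

    def __init__(self, n_topics, gamma=0.99, smooth=0.1, vocab_size=1000):
        self.gamma, self.smooth, self.vsize = gamma, smooth, vocab_size  # vocab_size: nominal, for smoothing
        self.t_last = None
        self.weight = [1.0] * n_topics                        # discounted document mass per topic
        self.counts = [defaultdict(float) for _ in range(n_topics)]

    def _loglik(self, k, doc):
        total = sum(self.counts[k].values()) + self.smooth * self.vsize
        return sum(c * math.log((self.counts[k][w] + self.smooth) / total)
                   for w, c in doc.items())

    def update(self, timestamp, doc):
        if self.t_last is not None:                           # fade out-of-date statistics
            d = self.gamma ** (timestamp - self.t_last)
            self.weight = [w * d for w in self.weight]
            for cnt in self.counts:
                for w in cnt:
                    cnt[w] *= d
        self.t_last = timestamp
        logs = [math.log(w) + self._loglik(k, doc) for k, w in enumerate(self.weight)]
        m = max(logs)
        resp = [math.exp(l - m) for l in logs]
        z = sum(resp)
        for k, r in enumerate(resp):                          # absorb responsibility-weighted counts
            r /= z
            self.weight[k] += r
            for w, c in doc.items():
                self.counts[k][w] += r * c

tracker = DiscountedTopicTracker(n_topics=2)
tracker.update(0.0, {"printer": 3, "jam": 2})
tracker.update(1.0, {"password": 4, "reset": 2})
tracker.update(2.0, {"printer": 2, "toner": 1})
print([round(w, 2) for w in tracker.weight])
```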

Mining traffic data from probe-car system for travel time prediction
Takayuki Nakata, Jun-ichi Takeuchi
Pages: 817-822
doi>10.1145/1014052.1016920
We are developing a technique to predict travel time of a vehicle for an objective road section, based on real time traffic data collected through a probe-car system. In the area of Intelligent Transport System (ITS), travel time prediction is an important subject. Probe-car system is an upcoming data collection method, in which a number of vehicles are used as moving sensors to detect actual traffic situation. It can collect data concerning much larger area, compared with traditional fixed detectors. Our prediction technique is based on statistical analysis using AR model with seasonal adjustment and MDL (Minimum Description Length) criterion. Seasonal adjustment is used to handle periodicities of 24 hours in traffic data. Alternatively, we employ state space model, which can handle time series with periodicities. It is important to select really effective data for prediction, among the data from widespread area, which are collected via probe-car system. We do this using MDL criterion. That is, we find the explanatory variables that really have influence on the future travel time. In this paper, we experimentally show effectiveness of our method using probe-car data collected in Nagoya Metropolitan Area in 2002. expand
|
|
|
Programming the K-means clustering algorithm in SQL |
| |
Carlos Ordonez
|
|
Pages: 823-828 |
|
doi>10.1145/1014052.1016921 |
|
Full text: PDF
|
|
Using SQL has not been considered an efficient or feasible way to implement data mining algorithms. Although this is true for many data mining, machine learning, and statistical algorithms, this work shows that it is feasible to obtain an efficient SQL implementation of the well-known K-means clustering algorithm that can work on top of a relational DBMS. The article emphasizes both correctness and performance. From a correctness point of view, the article explains how to compute Euclidean distances, answer nearest-cluster queries, and update clustering results in SQL. From a performance point of view, it explains how to cluster large data sets by defining and indexing tables to store and retrieve intermediate and final results, optimizing and avoiding joins, optimizing and simplifying clustering aggregations, and taking advantage of sufficient statistics. Experiments evaluate scalability with synthetic data sets of varying size and dimensionality. The proposed K-means implementation can cluster large data sets and exhibits linear scalability.
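To make the idea concrete, here is a minimal sketch of one K-means iteration expressed in SQL, run through Python's sqlite3 for self-containment. Table and column names are assumptions, and the paper's optimizations (sufficient statistics, indexing, join avoidance) are not reproduced.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE points (pid INTEGER, x REAL, y REAL)")
cur.execute("CREATE TABLE centroids (cid INTEGER, x REAL, y REAL)")
cur.executemany("INSERT INTO points VALUES (?, ?, ?)",
                [(1, 0.0, 0.1), (2, 0.2, 0.0), (3, 5.0, 5.1), (4, 4.9, 5.0)])
cur.executemany("INSERT INTO centroids VALUES (?, ?, ?)",
                [(1, 0.0, 0.0), (2, 5.0, 5.0)])

# Assignment step: nearest centroid by squared Euclidean distance
cur.execute("""
    CREATE TABLE assign AS
    SELECT p.pid,
           (SELECT c.cid FROM centroids c
            ORDER BY (p.x - c.x)*(p.x - c.x) + (p.y - c.y)*(p.y - c.y)
            LIMIT 1) AS cid
    FROM points p
""")

# Update step: recompute each centroid as the mean of its assigned points
cur.execute("DELETE FROM centroids")
cur.execute("""
    INSERT INTO centroids
    SELECT a.cid, AVG(p.x), AVG(p.y)
    FROM assign a JOIN points p ON a.pid = p.pid
    GROUP BY a.cid
""")
print(cur.execute("SELECT * FROM centroids").fetchall())

Repeating the two steps until the centroids stop moving gives the full algorithm; everything stays inside the DBMS, which is the point the paper makes.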
|
|
|
Document preprocessing for naive Bayes classification and clustering with mixture of multinomials |
| |
Dmitry Pavlov,
Ramnath Balasubramanyan,
Byron Dom,
Shyam Kapur,
Jignashu Parikh
|
|
Pages: 829-834 |
|
doi>10.1145/1014052.1016922 |
|
Full text: PDF
|
|
The naive Bayes classifier has long been used for text categorization tasks. Its sibling from the unsupervised world, the probabilistic mixture of multinomial models, has likewise been successfully applied to text clustering problems. Despite the strong independence assumptions that these models make, their attractiveness comes from low computational cost, relatively low memory consumption, the ability to handle heterogeneous features and multiple classes, and often competitiveness with top-of-the-line models. Recently, there have been several attempts to alleviate the problems of naive Bayes by performing heuristic feature transformations, such as IDF weighting, normalization by document length, and taking logarithms of the counts. We justify the use of these techniques and apply them to two problems: classification of products in Yahoo! Shopping and clustering the vectors of collocated terms in user queries to Yahoo! Search. The experimental evaluation allows us to draw conclusions about the promise that these transformations carry with regard to alleviating the strong assumptions of the multinomial model.
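A hedged sketch of this style of preprocessing, using scikit-learn: log-transform the term counts, length-normalize the documents, and feed the result to a multinomial naive Bayes classifier. The tiny toy corpus and the particular pipeline choices are illustrative only, not the paper's setup.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

docs = ["cheap usb cable", "usb charger cable cheap", "red cotton shirt", "cotton shirt dress"]
labels = ["electronics", "electronics", "apparel", "apparel"]

# Dampen repeated terms by taking log(1 + count) on the sparse count matrix
log_counts = FunctionTransformer(lambda X: X.log1p())

model = make_pipeline(
    CountVectorizer(),
    log_counts,
    Normalizer(norm="l1"),     # normalize by (transformed) document length
    MultinomialNB(),
)
model.fit(docs, labels)
print(model.predict(["blue cotton cable"]))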
|
|
|
Learning a complex metabolomic dataset using random forests and support vector machines |
| |
Young Truong,
Xiaodong Lin,
Chris Beecher
|
|
Pages: 835-840 |
|
doi>10.1145/1014052.1016923 |
|
Full text: PDF
|
|
Metabolomics is the "omics" science of biochemistry. The associated data include quantitative measurements of all small-molecule metabolites in a biological sample. These datasets provide a window into dynamic biochemical networks and, conjointly with other "omic" data such as genes and proteins, have great potential to unravel complex human diseases. The dataset used in this study contains 63 individuals, normal and diseased, with the diseased either drug-treated or not, giving three classes. The goal is to classify these individuals using the observed levels of 317 measured metabolites. There are a number of statistical challenges: the data are non-normal; the number of samples is smaller than the number of metabolites; data are missing, and the fact that data are missing is informative (assay values below detection limits can point to a specific class); and there are high correlations among the metabolites. We investigate support vector machines (SVM) and random forests (RF) for outlier detection, variable selection, and classification. We use the variables selected with RF in SVM and vice versa. The benefit of this study is insight into the interplay of variable selection and classification methods. We link our selected predictors to the biochemistry of the disease.
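An illustrative sketch of this interplay on synthetic data (not the study's dataset): treat values below the detection limit as informative by imputing zero and adding a missingness indicator, rank metabolites with random-forest importances, then classify on the selected subset with an SVM.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.lognormal(size=(63, 317))              # 63 samples, 317 metabolites (synthetic)
X[rng.random(X.shape) < 0.1] = np.nan          # some assays below detection limit
y = rng.integers(0, 3, size=63)                # three classes

missing = np.isnan(X).astype(float)            # informative missingness flags
X_filled = np.nan_to_num(X, nan=0.0)
features = np.hstack([X_filled, missing])

# Variable selection via random-forest importances
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(features, y)
top = np.argsort(rf.feature_importances_)[::-1][:20]   # keep the 20 most important features

# Classification with an SVM on the RF-selected variables
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(features[:, top], y)
print(svm.score(features[:, top], y))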
|
|
|
1-dimensional splines as building blocks for improving accuracy of risk outcomes models |
| |
David S. Vogel,
Morgan C. Wang
|
|
Pages: 841-846 |
|
doi>10.1145/1014052.1016924 |
|
Full text: PDF
|
|
Transformation of both the response variable and the predictors is commonly used in fitting regression models. However, these transformation methods do not always provide the maximum linear correlation between the response variable and the predictors, especially when there are non-linear relationships between predictors and the response, as in the medical data set used in this study. A spline-based transformation method is proposed that is continuous, second-order smooth, and minimizes the mean squared error between the response and each predictor. Since the computation time for generating this spline is O(n), the processing time is reasonable even with massive data sets. In contrast to cubic smoothing splines, the resulting transformation equations also display a high level of efficiency for scoring. Data used for predicting health outcomes contain an abundance of non-linear relationships between predictors and outcomes, requiring an algorithm that models them accurately. Thus, a transformation that fits an adaptive cubic spline to each of a set of variables is proposed. These curves are used as a set of transformation functions on the predictors. A case study shows how the transformed variables can be fed into a simple linear regression model to predict risk outcomes. The results show significant improvement over the performance of the original variables in both linear and non-linear models.
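A minimal sketch of the general idea (using SciPy's cubic smoothing spline rather than the authors' O(n) construction): fit a spline of the response against each predictor, use the fitted curve as that predictor's transformation, and feed the transformed columns to an ordinary linear regression. The data are synthetic and all names are illustrative.

import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000
X = rng.uniform(-3, 3, size=(n, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)

def spline_transform(x_col, y_vals):
    # Fit response vs. a single predictor with a cubic smoothing spline,
    # then return the fitted values as the transformed predictor.
    order = np.argsort(x_col)
    spl = UnivariateSpline(x_col[order], y_vals[order], k=3, s=len(x_col))
    return spl(x_col)

Z = np.column_stack([spline_transform(X[:, j], y) for j in range(X.shape[1])])
model = LinearRegression().fit(Z, y)
print("R^2 on transformed predictors:", model.score(Z, y))

Because each transformed column is already (approximately) linearly related to the response, the downstream linear model captures the non-linear structure without itself becoming non-linear.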
|
|
|
Analytical view of business data |
| |
Adam Yeh,
Jonathan Tang,
Youxuan Jin,
Sam Skrivan
|
|
Pages: 847-852 |
|
doi>10.1145/1014052.1016925 |
|
Full text: PDF
|
|
This paper describes a logical extension to the Microsoft Business Framework (MBF) called Analytical View (AV). AV consists of three components: Model Service for design time, Business Intelligence Entity (BIE) for the programming model, and IntelliDrill for runtime navigation between OLTP and OLAP data sources. The AV feature set fulfills enterprise application requirements for analysis and decision support, complementing the transactional feature set currently provided by MBF. Model Service automatically transforms an "object-oriented model (transactional view)" into a "multi-dimensional model (analytical view)" without the traditional Extraction/Transformation/Loading (ETL) overhead and complexity. It infers dimensionality from the object layer, where richer metadata is stored, eliminating the "guesswork" that a traditional data warehousing process requires when working from the physical database schema. BI Entities are classes code-generated by Model Service. As an intrinsic part of the framework, BI Entities enable a consistent, object-oriented programming model with strong types and rich semantics for OLAP, similar to what MBF object persistence technology does for OLTP data. More importantly, data contained in BI Entities have a higher degree of "application awareness," such as integrated application-level security and customizability. IntelliDrill links together all the information islands in MBF using metadata. Because of the automatic transformation from transactional view to analytical view enabled by Model Service, the framework can natively determine what drill capabilities an object has, making information navigation in MBF fully discoverable through a built-in ontology.
|