SESSION: Statistical methods I
|
|
|
Bayesian analysis of massive datasets via particle filters
Greg Ridgeway, David Madigan
Pages: 5-13
DOI: 10.1145/775047.775049

Markov Chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need for statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, eliminating their candidacy as feasible data mining techniques. In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling. Computation of the importance weights requires a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the "rejuvenation" step used in particle filters for dynamic systems models, sidesteps the loss of efficiency with only a slight increase in the number of data accesses. To show proof of concept, we demonstrate the method on a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is not affected while offering a 95% reduction in data accesses.
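The reweighting step this abstract describes can be sketched in a few lines. This is an illustrative toy (made-up normal-mean model, not the authors' code): draws are simulated from the posterior given a small subsample, and the remaining observations are folded in through importance weights computed in a single scan.

```python
import numpy as np

def importance_weights(draws, remaining, loglik):
    """loglik(thetas, x) -> log p(x | theta) for an array of thetas."""
    logw = np.zeros(len(draws))
    for x in remaining:                 # one scan of the remaining data
        logw += loglik(draws, x)
    logw -= logw.max()                  # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

# Toy model: normal data with unknown mean, known variance 1, flat prior.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)
sub, rest = data[:100], data[100:]
# Posterior of the mean given only the subsample is N(mean(sub), 1/100).
draws = rng.normal(sub.mean(), 1.0 / np.sqrt(len(sub)), size=5000)
w = importance_weights(draws, rest, lambda thetas, x: -0.5 * (x - thetas) ** 2)
post_mean = float(np.sum(w * draws))    # estimate now conditions on all data
```

The weighted draws approximate the full-data posterior without ever re-scanning the subsample; the paper's "rejuvenation" step then guards against weight degeneracy.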
|
|
Scalable robust covariance and correlation estimates for data mining
Fatemah A. Alqallaf, Kjell P. Konis, R. Douglas Martin, Ruben H. Zamar
Pages: 14-23
DOI: 10.1145/775047.775050

Covariance and correlation estimates have important applications in data mining. In the presence of outliers, classical estimates of covariance and correlation matrices are not reliable. A small fraction of outliers, in some cases even a single outlier, can distort the classical covariance and correlation estimates, making them virtually useless. That is, correlations for the vast majority of the data can be erroneously reported; principal components transformations can be misleading; and multidimensional outlier detection via Mahalanobis distances can fail to detect outliers. There is a substantial statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine-equivariant estimators that possess high breakdown points and small worst-case biases. All such estimators have unacceptable exponential complexity in the number of variables and quadratic complexity in the number of observations. In this paper we focus on several variants of robust covariance and correlation matrix estimates with quadratic complexity in the number of variables and linear complexity in the number of observations. These estimators are based on several forms of pairwise robust covariance and correlation estimates. The estimators studied include two fast estimators based on coordinate-wise robust transformations embedded in an overall procedure recently proposed by [14]. We show that the estimators have attractive robustness properties, and give an example that uses one of the estimators in the new Insightful Miner data mining product.
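To see why coordinate-wise robust transformations help, here is a minimal sketch of a pairwise robust correlation (not the paper's estimators, which are more refined): each column is centered by its median, scaled by its MAD, and winsorized before computing the classical correlation, so each pair costs only O(n).

```python
import numpy as np

def robust_corr(x, y, c=2.0):
    def transform(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med)) * 1.4826   # consistency factor
        z = (v - med) / mad
        return np.clip(z, -c, c)                    # winsorize gross outliers
    u, w = transform(x), transform(y)
    return float(np.corrcoef(u, w)[0, 1])

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + 0.5 * rng.normal(size=500)          # true correlation ~0.89
x_out, y_out = x.copy(), y.copy()
x_out[:5], y_out[:5] = 50.0, -50.0          # five gross outliers
classical = float(np.corrcoef(x_out, y_out)[0, 1])
robust = robust_corr(x_out, y_out)
```

Five outliers are enough to drive the classical correlation negative, while the winsorized version stays close to the correlation of the clean data, illustrating the breakdown problem the abstract describes.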
|
|
MARK: a boosting algorithm for heterogeneous kernel models
Kristin P. Bennett, Michinari Momma, Mark J. Embrechts
Pages: 24-31
DOI: 10.1145/775047.775051

Support Vector Machines and other kernel methods have proven to be very effective for nonlinear inference. Practical issues include how to select the type of kernel and its parameters, and how to deal with the computational burden caused by the kernel matrix growing quadratically with the number of data points. Inspired by ensemble and boosting methods such as MART, we propose the Multiple Additive Regression Kernels (MARK) algorithm to address these issues. MARK considers a large (potentially infinite) library of kernel matrices formed by different kernel functions and parameters. Using gradient boosting/column generation, MARK constructs columns of the heterogeneous kernel matrix (the base hypotheses) on the fly and adds them into the kernel ensemble. Regularization methods such as those used in SVM, kernel ridge regression, and MART prevent overfitting. We investigate how MARK is applied to heterogeneous kernel ridge regression. The resulting algorithm is simple to implement and efficient. Kernel parameter selection is handled within MARK. Sampling and "weak" kernels further enhance the computational efficiency of the resulting additive algorithm. The user can incorporate and potentially extract domain knowledge by restricting the kernel library to interpretable kernels. MARK compares very favorably with SVM and kernel ridge regression on several benchmark datasets.
|
|
SESSION: Frequent patterns I
|
|
|
Selecting the right interestingness measure for association patterns
Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava
Pages: 32-41
DOI: 10.1145/775047.775053

Many techniques for association rule mining and feature selection require a suitable metric to capture the dependencies among variables in a data set. For example, metrics such as support, confidence, lift, correlation, and collective strength are often used to determine the interestingness of association patterns. However, many such measures provide conflicting information about the interestingness of a pattern, and the best metric to use for a given application domain is rarely known. In this paper, we present an overview of various measures proposed in the statistics, machine learning, and data mining literature. We describe several key properties one should examine in order to select the right measure for a given application domain. A comparative study of these properties is made using twenty-one of the existing measures. We show that each measure has different properties which make it useful for some application domains but not for others. We also present two scenarios in which most of the existing measures agree with each other, namely support-based pruning and table standardization. Finally, we present an algorithm to select a small set of tables such that an expert can select a desirable measure by looking at just this small set of tables.
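Three of the measures named in this abstract can be computed from the 2x2 contingency table of a rule A -> B. A small worked example with hypothetical counts:

```python
def measures(n_ab, n_a, n_b, n):
    """Support, confidence, and lift for a rule A -> B from basket counts."""
    support = n_ab / n                 # fraction of baskets with both A and B
    confidence = n_ab / n_a            # P(B | A)
    lift = confidence / (n_b / n)      # >1 indicates positive dependence
    return support, confidence, lift

# 1000 baskets: 200 contain A, 250 contain B, 150 contain both.
s, c, l = measures(n_ab=150, n_a=200, n_b=250, n=1000)
# support = 0.15, confidence = 0.75, lift = 0.75 / 0.25 = 3.0
```

Already in this tiny example the measures rank differently: a rule can have modest support yet high lift, which is exactly the kind of conflict the paper's property analysis is meant to resolve.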
|
|
DualMiner: a dual-pruning algorithm for itemsets with constraints
Cristian Bucila, Johannes Gehrke, Daniel Kifer, Walker White
Pages: 42-51
DOI: 10.1145/775047.775054

Constraint-based mining of itemsets for questions such as "find all frequent itemsets where the total price is at least $50" has received much attention recently. Two classes of constraints, monotone and antimonotone, have been identified as very useful. There are algorithms that efficiently take advantage of either one of these two classes, but no previous algorithm can efficiently handle both types of constraints simultaneously. In this paper, we present the first algorithm (called DualMiner) that uses both monotone and antimonotone constraints to prune its search space. We complement a theoretical analysis and proof of correctness of DualMiner with an experimental study that shows the efficacy of DualMiner compared to previous work.
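The two constraint classes can be illustrated on a toy basket database with hypothetical prices (this brute-force check is not DualMiner itself, which prunes rather than enumerates). An antimonotone constraint that fails on a set X rules out every superset of X; a monotone constraint that fails on X rules out every subset of X, and DualMiner exploits both directions at once.

```python
from itertools import combinations

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
price = {"a": 30, "b": 25, "c": 10}     # hypothetical item prices
items = sorted(price)

def support(itemset):
    return sum(itemset <= t for t in transactions)

def anti_ok(itemset):                   # antimonotone: support >= 2
    return support(set(itemset)) >= 2

def mono_ok(itemset):                   # monotone: total price >= 50
    return sum(price[i] for i in itemset) >= 50

answers = [frozenset(c)
           for k in range(1, len(items) + 1)
           for c in combinations(items, k)
           if anti_ok(c) and mono_ok(c)]
# Only {"a","b"} (support 2, price 55) satisfies both constraints.
```

Note how every single item fails the price constraint while the full set {"a","b","c"} fails the support constraint, so the answer set is squeezed between the two pruning frontiers.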
|
|
Querying multiple sets of discovered rules
Alexander Tuzhilin, Bing Liu
Pages: 52-60
DOI: 10.1145/775047.775055

Rule mining is an important data mining task that has been applied to numerous real-world applications. Often a rule mining system generates a large number of rules, and only a small subset of them is really useful in applications. Although there exist some systems that allow the user to query the discovered rules, they are less suitable for complex ad hoc querying of multiple data mining rulebases to retrieve interesting rules. In this paper, we propose Rule-QL, a powerful new language for querying multiple rulebases that is modeled after SQL and has the rigorous theoretical foundation of a rule-based calculus. In particular, we first propose a rule-based calculus RC based on first-order logic, and then present the language Rule-QL, which is at least as expressive as the safe fragment of RC. We also propose a number of efficient query evaluation techniques for Rule-QL and test them experimentally on some representative queries to demonstrate the feasibility of Rule-QL.
|
|
SESSION: Graphs and trees
|
|
|
Mining knowledge-sharing sites for viral marketing
Matthew Richardson, Pedro Domingos
Pages: 61-70
DOI: 10.1145/775047.775057

Viral marketing takes advantage of networks of influence among customers to inexpensively achieve large changes in behavior. Our research seeks to put it on a firmer footing by mining these networks from data, building probabilistic models of them, and using these models to choose the best viral marketing plan. Knowledge-sharing sites, where customers review products and advise each other, are a fertile source for this type of data mining. In this paper we extend our previous techniques, achieving a large reduction in computational cost, and apply them to data from a knowledge-sharing site. We optimize the amount of marketing funds spent on each customer, rather than just making a binary decision on whether to market to him. We take into account the fact that knowledge of the network is partial, and that gathering that knowledge can itself have a cost. Our results show the robustness and utility of our approach.
|
|
Efficiently mining frequent trees in a forest
Mohammed J. Zaki
Pages: 71-80
DOI: 10.1145/775047.775058

Mining frequent trees is very useful in domains like bioinformatics, web mining, mining semistructured data, and so on. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TREEMINER, a novel algorithm to discover all frequent subtrees in a forest, using a new data structure called scope-list. We contrast TREEMINER with a pattern matching tree mining algorithm (PATTERNMATCHER). We conduct detailed experiments to test the performance and scalability of these methods. We find that TREEMINER outperforms the pattern matching approach by a factor of 4 to 20, and has good scaleup properties. We also present an application of tree mining to analyze real web logs for usage patterns.
|
|
ANF: a fast and scalable tool for data mining in massive graphs
Christopher R. Palmer, Phillip B. Gibbons, Christos Faloutsos
Pages: 81-90
DOI: 10.1145/775047.775059

Graphs are an increasingly important data source; prominent examples include the Internet and the Web. Other familiar graphs include CAD circuits, phone records, gene sequences, city streets, social networks, and academic citations. Any kind of relationship, such as actors appearing in movies, can be represented as a graph. This work presents a data mining tool, called ANF, that can quickly answer a number of interesting questions on graph-represented data, such as the following. How robust is the Internet to failures? What are the most influential database papers? Are there gender differences in movie appearance patterns? At its core, ANF is based on a fast and memory-efficient approach for approximating the complete "neighbourhood function" of a graph. For the Internet graph (268K nodes), ANF's highly accurate approximation is more than 700 times faster than the exact computation. This reduces the running time from nearly a day to a minute or two, allowing users to perform ad hoc drill-down tasks and to repeatedly answer questions about changing data sources. To enable this drill-down, ANF employs new techniques for approximating neighbourhood-type functions for graphs with distinguished nodes and/or edges. When compared to the best existing approximation, ANF's approach is both faster and more accurate, given the same resources. Additionally, unlike previous approaches, ANF scales gracefully to handle disk-resident graphs. Finally, we present some of our results from mining large graphs using ANF.
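For concreteness, the neighbourhood function N(h) that ANF approximates counts the node pairs (u, v) with distance at most h. Computed exactly by BFS on a toy graph (ANF instead replaces the per-node reachable sets with probabilistic counters so that massive, disk-resident graphs become feasible):

```python
from collections import deque

def neighbourhood_function(adj, hmax):
    """Exact N(h) for h = 0..hmax: number of ordered pairs within distance h."""
    counts = [0] * (hmax + 1)
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                        # plain BFS from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for d in dist.values():
            for h in range(d, hmax + 1):
                counts[h] += 1          # pair (src, node) counts for every h >= d
    return counts

# 4-node path graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nf = neighbourhood_function(adj, 3)
# N(0)=4 (self-pairs), N(1)=10, N(2)=14, N(3)=16
```

The exact version costs a BFS per node, which is what makes the Internet-scale computation take "nearly a day" and motivates ANF's approximation.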
|
|
SESSION: Streams and time series
|
|
|
Bursty and hierarchical structure in streams
Jon Kleinberg
Pages: 91-101
DOI: 10.1145/775047.775061

A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise: the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
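A minimal two-state version of this automaton conveys the idea (the paper's model has an infinite state hierarchy, which is what yields the nested bursts): state 0 emits inter-arrival gaps at a base rate, state 1 at a faster "burst" rate, and each state switch costs gamma; the cheapest state sequence is found by dynamic programming.

```python
import math

def burst_states(gaps, base_rate, burst_rate, gamma=1.0):
    """Viterbi over two states; returns the 0/1 label of each gap."""
    rates = (base_rate, burst_rate)
    cost = [0.0, gamma]                 # starting in the burst state costs gamma
    back = []
    for g in gaps:
        # negative log-density of an exponential inter-arrival gap
        emit = [-math.log(r) + r * g for r in rates]
        new_cost, prev = [], []
        for s in (0, 1):
            stay, switch = cost[s], cost[1 - s] + gamma
            if stay <= switch:
                new_cost.append(stay + emit[s]); prev.append(s)
            else:
                new_cost.append(switch + emit[s]); prev.append(1 - s)
        cost = new_cost
        back.append(prev)
    state = 0 if cost[0] <= cost[1] else 1      # best final state
    states = [state]
    for prev in reversed(back[1:]):             # trace the path backwards
        state = prev[state]
        states.append(state)
    return states[::-1]

# Five quiet gaps (~1.0), five bursty gaps (~0.1), five quiet again.
gaps = [1.0] * 5 + [0.1] * 5 + [1.0] * 5
labels = burst_states(gaps, base_rate=1.0, burst_rate=4.0)
```

The switch cost gamma is what makes detected bursts contiguous rather than flickering on every short gap.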
|
|
On the need for time series data mining benchmarks: a survey and empirical demonstration
Eamonn Keogh, Shruti Kasetty
Pages: 102-111
DOI: 10.1145/775047.775062

In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster, and segment time series. In this work we make the following claim: much of this work has very little utility, because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an "improvement" that would have been completely dwarfed by the variance observed when testing on many real-world datasets, or when changing minor (unstated) implementation details. To illustrate our point, we have undertaken the most exhaustive set of time series experiments ever attempted, re-implementing the contributions of more than two dozen papers and testing them on 50 real-world, highly diverse datasets. Our empirical results strongly support our assertion, and suggest the need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.
|
|
SESSION: Visualization
|
|
|
Query, analysis, and visualization of hierarchically structured data using Polaris
Chris Stolte, Diane Tang, Pat Hanrahan
Pages: 112-122
DOI: 10.1145/775047.775064

In the last several years, large OLAP databases have become common in a variety of applications such as corporate data warehouses and scientific computing. To support interactive analysis, many of these databases are augmented with hierarchical structures that provide meaningful levels of abstraction that can be leveraged by both the computer and the analyst. This hierarchical structure generates many challenges and opportunities in the design of systems for the query, analysis, and visualization of these databases. In this paper, we present an interactive visual exploration tool that facilitates exploratory analysis of data warehouses with rich hierarchical structure, such as might be stored in data cubes. We base this tool on Polaris, a system for rapidly constructing table-based graphical displays of multidimensional databases. Polaris builds visualizations using an algebraic formalism derived from the interface and interpreted as a set of queries to a database. We extend the user interface, algebraic formalism, and generation of data queries in Polaris to expose and take advantage of hierarchical structure. In the resulting system, analysts can navigate through the hierarchical projections of a database, rapidly and incrementally generating visualizations for each projection.
|
|
On interactive visualization of high-dimensional data using the hyperbolic plane
Jörg A. Walter, Helge Ritter
Pages: 123-132
DOI: 10.1145/775047.775065

We propose a novel projection-based visualization method for high-dimensional datasets that combines concepts from MDS with the geometry of hyperbolic spaces. Our approach, Hyperbolic Multi-Dimensional Scaling (H-MDS), extends earlier work [7] that used hyperbolic spaces for the visualization of tree-structured data (the "hyperbolic tree browser"). By borrowing concepts from multi-dimensional scaling, we map proximity data directly into the two-dimensional hyperbolic space (H2). This removes the restriction to "quasi-hierarchical", graph-based data that limited previous work. Since a suitable distance function can convert all kinds of data to proximity (or distance-based) data, this type of data can be considered the most general. We use the circular Poincaré model of H2, which allows effective human-computer interaction: by moving the "focus" with the mouse, the user can navigate the data without losing the "context". In H2 the "fish-eye" behavior originates not simply from a non-linear view transformation but from the extraordinary, non-Euclidean properties of H2. In particular, the exponential growth of length and area in the underlying space makes H2 a prime target for mapping hierarchical and (now also) high-dimensional data. We present several high-dimensional mapping examples, including synthetic and real-world data and a successful application to unstructured text. By analyzing and integrating multiple film critiques from news:rec.art.movies.reviews and the Internet Movie Database, each movie is placed within H2. The idea is that related films share more words in their reviews than unrelated ones; their semantic proximity leads to a closer arrangement. The result is a kind of high-level, content-structured display that allows the user to explore the "space of movies".
|
|
SESSION: Web search and navigation
|
|
|
Optimizing search engines using clickthrough data
Thorsten Joachims
Pages: 133-142
DOI: 10.1145/775047.775067

This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.
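The key data-extraction idea behind this approach is that clicks are relative, not absolute, judgments: if a user clicked a result but skipped one ranked above it, the clicked document is inferred to be preferred over the skipped one. A sketch of that pair extraction (the pairs then become the constraints of the ranking SVM; the SVM training itself is omitted here):

```python
def preference_pairs(ranking, clicked):
    """ranking: docs in presented order; clicked: set of clicked docs.
    Returns (preferred, over) pairs inferred from skips above a click."""
    pairs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            for skipped in ranking[:i]:
                if skipped not in clicked:
                    pairs.append((doc, skipped))   # doc preferred over skipped
    return pairs

pairs = preference_pairs(["d1", "d2", "d3", "d4"], clicked={"d3"})
# d3 was clicked while d1 and d2 above it were skipped:
# pairs == [("d3", "d1"), ("d3", "d2")]
```

Because the pairs only compare documents the user actually inspected, they sidestep the position bias that would corrupt absolute click counts.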
|
|
Relational Markov models and their application to adaptive web navigation
Corin R. Anderson, Pedro Domingos, Daniel S. Weld
Pages: 143-152
DOI: 10.1145/775047.775068

Relational Markov models (RMMs) are a generalization of Markov models where states can be of different types, with each type described by a different set of variables. The domain of each variable can be hierarchically structured, and shrinkage is carried out over the cross product of these hierarchies. RMMs make effective learning possible in domains with very large and heterogeneous state spaces, given only sparse data. We apply them to modeling the behavior of web site users, improving prediction in our PROTEUS architecture for personalizing web sites. We present experiments on an e-commerce and an academic web site showing that RMMs are substantially more accurate than alternative methods, and make good predictions even when applied to previously-unvisited parts of the site.
|
|
SESSION: Sequences and strings
|
|
|
Pattern discovery in sequences under a Markov assumption
Darya Chudova, Padhraic Smyth
Pages: 153-162
DOI: 10.1145/775047.775070

In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. We investigate the fundamental aspects of this data mining problem that can make discovery "easy" or "hard." We present a general framework for characterizing learning in this context by deriving the Bayes error rate for this problem under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.
|
|
On effective classification of strings with wavelets
Charu C. Aggarwal
Pages: 163-172
DOI: 10.1145/775047.775071

In recent years, the technological advances in mapping genes have made it increasingly easy to store and use a wide variety of biological data. Such data are usually in the form of very long strings for which it is difficult to determine the most relevant features for a classification task. For example, a typical DNA string may be millions of characters long, and there may be thousands of such strings in a database. In many cases, the classification behavior of the data may be hidden in the compositional behavior of certain segments of the string which cannot be easily determined a priori. Another problem which complicates the classification task is that in some cases the classification behavior is reflected in the global behavior of the string, whereas in others it is reflected in local patterns. Given the enormous variation in the behavior of the strings over different data sets, it is useful to develop an approach which is sensitive to both the global and local behavior of the strings for the purpose of classification. For this purpose, we exploit the multi-resolution property of wavelet decomposition in order to create a scheme which can mine classification characteristics at different levels of granularity. The resulting scheme turns out to be very effective in practice on a wide range of problems.
|
|
SESSION: Statistical methods II
|
|
|
Shrinkage estimator generalizations of Proximal Support Vector Machines
Deepak K. Agarwal
Pages: 173-182
DOI: 10.1145/775047.775073

We give a statistical interpretation of Proximal Support Vector Machines (PSVM), proposed at KDD2001 as linear approximators to (nonlinear) Support Vector Machines (SVM). We prove that PSVM using a linear kernel is identical to ridge regression, a biased-regression method known in the statistical community for more than thirty years. Techniques from the statistical literature to estimate the tuning constant that appears in the SVM and PSVM framework are discussed. Better shrinkage strategies that incorporate more than one tuning constant are suggested. For nonlinear kernels, the minimization problem posed in the PSVM framework is equivalent to finding the posterior mode of a Bayesian model defined through a Gaussian process on the predictor space. Apart from providing new insights, these interpretations help us attach an estimate of uncertainty to our predictions and enable us to build richer classes of models. In particular, we propose a new algorithm called PSVMMIX which is a combination of ridge regression and a Gaussian process model. Extension to the case of continuous response is straightforward and illustrated with example datasets.
|
|
Hierarchical model-based clustering of large datasets through fractionation and refractionation
Jeremy Tantrum, Alejandro Murua, Werner Stuetzle
Pages: 183-190
DOI: 10.1145/775047.775074

The goal of clustering is to identify distinct groups in a dataset. Compared to non-parametric clustering methods like complete linkage, hierarchical model-based clustering has the advantage of offering a way to estimate the number of groups present in the data. However, its computational cost is quadratic in the number of items to be clustered, and it is therefore not applicable to large problems. We review an idea called Fractionation, originally conceived by Cutting, Karger, Pedersen and Tukey for non-parametric hierarchical clustering of large datasets, and describe an adaptation of Fractionation to model-based clustering. A further extension, called Refractionation, leads to a procedure that can be successful even in the difficult situation where there are large numbers of small groups.
|
|
SESSION: Text classification
|
|
|
Enhanced word clustering for hierarchical text classification |
| |
Inderjit S. Dhillon,
Subramanyam Mallela,
Rahul Kumar
|
|
Pages: 191-200 |
|
doi>10.1145/775047.775076 |
|
Full text: PDF
|
|
In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower numbers of features [2, 28]. However, the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm achieves higher classification accuracy, especially at lower numbers of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Dmoz Open Directory.
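The "within-cluster Jensen-Shannon divergence" objective named in this abstract is, in its generalized form, the weighted mean KL divergence of each member distribution to their mixture. A minimal sketch of that quantity (not the paper's implementation; toy distributions):

```python
import numpy as np

def kl(p, q):
    """KL divergence in bits, with 0 * log(0/q) taken as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(dists, weights=None):
    """Generalized Jensen-Shannon divergence of a set of distributions:
    the weighted mean KL of each member to the weighted mixture."""
    dists = np.asarray(dists, dtype=float)
    if weights is None:
        weights = np.full(len(dists), 1.0 / len(dists))
    mixture = weights @ dists
    return float(sum(w * kl(p, mixture) for w, p in zip(weights, dists)))

# Two "word distributions" over a 3-word vocabulary.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
```

Clustering similar word distributions together keeps this quantity small within each cluster, which is the intuition behind the global criterion.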

A parallel learning algorithm for text classification
Canasai Kruengkrai, Chuleerat Jaruskulchai
Pages: 201-206
DOI: 10.1145/775047.775077
Full text: PDF
Text classification is the process of classifying documents into predefined categories based on their content. Existing supervised learning algorithms for automatic text classification need sufficient labeled documents to learn accurately. Applying the Expectation-Maximization (EM) algorithm to this problem is an alternative approach that utilizes a large pool of unlabeled documents to augment the available labeled documents. Unfortunately, the time needed to learn from these large collections of unlabeled documents is too high. This paper introduces a novel parallel learning algorithm for the text classification task. The parallel algorithm is based on the combination of the EM algorithm and the naive Bayes classifier. Our goal is to improve the computational time of the learning and classifying process. We studied the performance of our parallel algorithm on a large Linux PC cluster called PIRUN Cluster. We report both timing and accuracy results. These results indicate that the proposed parallel algorithm is capable of handling large document collections.
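The EM-plus-naive-Bayes combination described in this abstract can be sketched serially (the paper's contribution is the parallelization, which is omitted here; the two-word vocabulary and all counts are hypothetical):

```python
import numpy as np

def nb_fit(X, resp, alpha=1.0):
    """Multinomial naive Bayes from soft class responsibilities.
    X: (n_docs, n_words) term counts; resp: (n_docs, n_classes) weights."""
    log_prior = np.log(resp.sum(axis=0) / resp.sum())
    counts = resp.T @ X + alpha                      # Laplace smoothing
    log_cond = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_cond

def nb_predict_proba(X, log_prior, log_cond):
    logp = log_prior + X @ log_cond.T
    logp -= logp.max(axis=1, keepdims=True)          # stabilize exp
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unlab, n_classes=2, iters=5):
    """E-step: soft-label the unlabeled docs with the current model.
    M-step: refit naive Bayes on labeled + pseudo-labeled docs together."""
    resp_lab = np.eye(n_classes)[y_lab]
    X_all = np.vstack([X_lab, X_unlab])
    params = nb_fit(X_lab, resp_lab)
    for _ in range(iters):
        resp_unlab = nb_predict_proba(X_unlab, *params)
        params = nb_fit(X_all, np.vstack([resp_lab, resp_unlab]))
    return params

# Class 0 docs use word 0, class 1 docs use word 1.
X_lab = np.array([[5, 0], [0, 5]])
y_lab = np.array([0, 1])
X_unlab = np.array([[4, 1], [1, 4], [5, 0], [0, 5]])
probs = nb_predict_proba(np.array([[3, 0], [0, 3]]), *em_nb(X_lab, y_lab, X_unlab))
```

The E-step over the unlabeled pool is embarrassingly parallel, which is presumably what the cluster implementation exploits.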

A refinement approach to handling model misfit in text categorization
Haoran Wu, Tong Heng Phang, Bing Liu, Xiaoli Li
Pages: 207-216
DOI: 10.1145/775047.775078
Full text: PDF
Text categorization or classification is the automated assignment of text documents to pre-defined classes based on their contents. This problem has been studied in information retrieval, machine learning and data mining. So far, many effective techniques have been proposed. However, most techniques are based on some underlying models and/or assumptions. When the data fits the model well, the classification accuracy will be high. However, when the data does not fit the model well, the classification accuracy can be very low. In this paper, we propose a refinement approach to dealing with this problem of model misfit. We show that we do not need to change the classification technique itself (or its underlying model) to make it more flexible. Instead, we propose to use successive refinements of classification on the training data to correct the model misfit. We apply the proposed technique to improve the classification performance of two simple and efficient text classifiers, the Rocchio classifier and the naïve Bayesian classifier. These techniques are suitable for very large text collections because they allow the data to reside on disk and need only one scan of the data to build a text classifier. Extensive experiments on two benchmark document corpora show that the proposed technique is able to improve the text categorization accuracy of the two techniques dramatically. In particular, our refined model is able to improve the naïve Bayesian or Rocchio classifier's prediction performance by 45% on average.

SESSION: Frequent patterns II

Privacy preserving mining of association rules
Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke
Pages: 217-228
DOI: 10.1145/775047.775080
Full text: PDF
We present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve the privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward "uniform" randomization, the discovered rules can unfortunately be exploited to find privacy breaches. We analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches. We derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. Finally, we present experimental results that validate the algorithm by applying it to real datasets.
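For a single binary item, the flavor of randomization and unbiased support estimation described in this abstract can be sketched with a Warner-style randomized-response stand-in (not the paper's actual operators; parameters and data are hypothetical):

```python
import numpy as np

def randomize(bits, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    keep = rng.random(len(bits)) < p
    return np.where(keep, bits, 1 - bits)

def estimate_support(randomized_bits, p):
    """Unbiased estimate of the true support s, inverting
    E[observed rate] = p * s + (1 - p) * (1 - s)."""
    r = randomized_bits.mean()
    return (r - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
true_support = 0.2
bits = (rng.random(200_000) < true_support).astype(int)
s_hat = estimate_support(randomize(bits, p=0.8, rng=rng), p=0.8)
```

The miner never sees the original bits, yet the aggregate support is recoverable; the estimator's variance grows as p approaches 1/2 (full randomization), which is the privacy/accuracy trade-off.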

Mining frequent item sets by opportunistic projection
Junqiang Liu, Yunhe Pan, Ke Wang, Jiawei Han
Pages: 229-238
DOI: 10.1145/775047.775081
Full text: PDF
In this paper, we present a novel algorithm, Opportune Project, for mining the complete set of frequent item sets by projecting databases to grow a frequent item set tree. Our algorithm is fundamentally different from those proposed in the past in that it opportunistically chooses between two different structures, array-based or tree-based, to represent projected transaction subsets, and heuristically decides whether to build an unfiltered pseudo projection or to make a filtered copy according to features of the subsets. More importantly, we propose novel methods to build tree-based pseudo projections and array-based unfiltered projections for projected transaction subsets, which makes our algorithm both CPU-time efficient and memory saving. Basically, the algorithm grows the frequent item set tree by depth-first search, whereas breadth-first search is used to build the upper portion of the tree if necessary. We test our algorithm against several other algorithms on real-world datasets, such as BMS-POS, and on IBM artificial datasets. The empirical results show that our algorithm is not only the most efficient on both sparse and dense databases at all levels of support threshold, but also highly scalable to very large databases.

SESSION: Web page classification

PEBL: positive example based learning for Web page classification using SVM
Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang
Pages: 239-248
DOI: 10.1145/775047.775083
Full text: PDF
Web page classification is one of the essential techniques for Web mining. Specifically, classifying Web pages of a user-interesting class is the first step of mining interesting information from the Web. However, constructing a classifier for an interesting class requires laborious pre-processing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of non-homepages (negative examples). In particular, collecting negative training examples requires arduous work and special caution to avoid biasing them. We introduce in this paper the Positive Example Based Learning (PEBL) framework for Web page classification which eliminates the need for manually collecting negative training examples in pre-processing. We present an algorithm called Mapping-Convergence (M-C) that achieves classification accuracy (with positive and unlabeled data) as high as that of traditional SVM (with positive and negative data). Our experiments show that when the M-C algorithm uses the same amount of positive examples as that of traditional SVM, the M-C algorithm performs as well as traditional SVM.

Web site mining: a new way to spot competitors, customers and suppliers in the world wide web
Martin Ester, Hans-Peter Kriegel, Matthias Schubert
Pages: 249-258
DOI: 10.1145/775047.775084
Full text: PDF
When automatically extracting information from the world wide web, most established methods focus on spotting single HTML documents. However, the problem of spotting complete web sites is not yet handled adequately, in spite of its importance for various applications. Therefore, this paper discusses the classification of complete web sites. First, we point out the main differences from page classification by discussing a very intuitive approach and its weaknesses. This approach treats a web site as one large HTML document and applies the well-known methods for page classification. Next, we show how accuracy can be improved by employing a preprocessing step which assigns each occurring web page to its most likely topic. The determined topics now represent the information the web site contains and can be used to classify it more accurately. We accomplish this by following two directions. First, we apply well-established classification algorithms to a feature space of occurring topics. The second direction treats a site as a tree of occurring topics and uses a Markov tree model for further classification. To improve the efficiency of this approach, we additionally introduce a powerful pruning method reducing the number of considered web pages. Our experiments show the superiority of the Markov tree approach regarding classification accuracy. In particular, we demonstrate that the use of our pruning method not only reduces the processing time, but also improves the classification accuracy.

SESSION: Learning methods

Sequential cost-sensitive decision making with reinforcement learning
Edwin Pednault, Naoki Abe, Bianca Zadrozny
Pages: 259-268
DOI: 10.1145/775047.775086
Full text: PDF
Recently, there has been increasing interest in the issues of cost-sensitive learning and decision making in a variety of applications of data mining. A number of approaches have been developed that are effective at optimizing cost-sensitive decisions when each decision is considered in isolation. However, the issue of sequential decision making, with the goal of maximizing total benefits accrued over a period of time instead of immediate benefits, has rarely been addressed. In the present paper, we propose a novel approach to sequential decision making based on the reinforcement learning framework. Our approach attempts to learn decision rules that optimize a sequence of cost-sensitive decisions so as to maximize the total benefits accrued over time. We use the domain of targeted marketing as a testbed for empirical evaluation of the proposed method. We conducted experiments using approximately two years of monthly promotion data derived from the well-known KDD Cup 1998 donation data set. The experimental results show that the proposed method for optimizing total accrued benefits outperforms the usual targeted-marketing methodology of optimizing each promotion in isolation. We also analyze the behavior of the targeting rules that were obtained and discuss their appropriateness to the application domain.

Interactive deduplication using active learning
Sunita Sarawagi, Anuradha Bhamidipaty
Pages: 269-278
DOI: 10.1145/775047.775087
Full text: PDF
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.

SESSION: Intrusion and privacy

Transforming data to satisfy privacy constraints
Vijay S. Iyengar
Pages: 279-288
DOI: 10.1145/775047.775089
Full text: PDF
Data on individuals and entities are being collected widely. These data can contain information that explicitly identifies the individual (e.g., social security number). Data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. Data are often shared for business or legal reasons. This paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. We explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. We extend earlier works in this area along various dimensions. First, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. This allows us to optimize the process of preserving privacy for the specified usage. In particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. Second, our work improves on previous approaches by allowing more flexible generalizations for the data. Lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. These extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints.

Exploiting unlabeled data in ensemble methods
Kristin P. Bennett, Ayhan Demiriz, Richard Maclin
Pages: 289-296
DOI: 10.1145/775047.775090
Full text: PDF
An adaptive semi-supervised ensemble method, ASSEMBLE, is proposed that constructs classification ensembles based on both labeled and unlabeled data. ASSEMBLE alternates between assigning "pseudo-classes" to the unlabeled data using the existing ensemble and constructing the next base classifier using both the labeled and pseudo-labeled data. Mathematically, this intuitive algorithm corresponds to maximizing the classification margin in hypothesis space as measured on both the labeled and unlabeled data. Unlike alternative approaches, ASSEMBLE does not require a semi-supervised learning method for the base classifier. ASSEMBLE can be used in conjunction with any cost-sensitive classification algorithm for both two-class and multi-class problems. ASSEMBLE using decision trees won the NIPS 2001 Unlabeled Data Competition. In addition, strong results on several benchmark datasets using both decision trees and neural networks support the proposed method.
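The alternation this abstract describes can be sketched with unweighted threshold stumps on 1-D data. This is a simplification (the actual method is margin-based and cost-sensitive, with reweighting across rounds); all data here is hypothetical:

```python
import numpy as np

def fit_stump(x, y, w):
    """Best weighted threshold stump on 1-D data."""
    best = None
    for t in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x >= t, sign, -sign)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda z: np.where(z >= t, sign, -sign)

def assemble(x_lab, y_lab, x_unlab, rounds=5):
    """ASSEMBLE-style alternation: pseudo-label the unlabeled points with
    the current ensemble, then fit the next base classifier on everything."""
    w0 = np.full(len(x_lab), 1.0 / len(x_lab))
    ensemble = [fit_stump(x_lab, y_lab, w0)]        # start from labeled data
    x_all = np.concatenate([x_lab, x_unlab])
    w = np.full(len(x_all), 1.0 / len(x_all))
    for _ in range(rounds):
        scores = sum(h(x_unlab) for h in ensemble)
        y_pseudo = np.where(scores >= 0, 1.0, -1.0)  # assign "pseudo-classes"
        ensemble.append(fit_stump(x_all, np.concatenate([y_lab, y_pseudo]), w))
    def predict(z):
        return np.where(sum(h(z) for h in ensemble) >= 0, 1, -1)
    return predict

x_lab = np.array([0.0, 1.0, 9.0, 10.0])
y_lab = np.array([-1.0, -1.0, 1.0, 1.0])
x_unlab = np.array([0.5, 2.0, 8.0, 9.5])
predict = assemble(x_lab, y_lab, x_unlab)
```

Note the base learner here is fully supervised; only the pseudo-labeling loop touches the unlabeled pool, which mirrors the abstract's claim that no semi-supervised base classifier is required.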

SESSION: Ensembles and boosting

Predicting rare classes: can boosting make any weak learner strong?
Mahesh V. Joshi, Ramesh C. Agarwal, Vipin Kumar
Pages: 297-306
DOI: 10.1145/775047.775092
Full text: PDF
Boosting is a strong ensemble-based learning algorithm with the promise of iteratively improving the classification accuracy using any base learner, as long as it satisfies the condition of yielding weighted accuracy > 0.5. In this paper, we analyze boosting with respect to this basic condition on the base learner, to see if boosting ensures prediction of rarely occurring events with high recall and precision. First we show that a base learner can satisfy the required condition even for poor recall or precision levels, especially for very rare classes. Furthermore, we show that the intelligent weight updating mechanism in boosting, even in its strong cost-sensitive form, does not prevent cases where the base learner always achieves high precision but poor recall or high recall but poor precision, when mapped to the original distribution. In either of these cases, we show that the voting mechanism of boosting fails to achieve good overall recall and precision for the ensemble. In effect, our analysis indicates that one cannot be blind to the base learner's performance, and just rely on the boosting mechanism to take care of its weakness. We validate our arguments empirically on a variety of real and synthetic rare class problems. In particular, using AdaCost as the boosting algorithm, and variations of PNrule and RIPPER as the base learners, we show that if algorithm A achieves a better recall-precision balance than algorithm B, then using A as the base learner in AdaCost yields significantly better performance than using B as the base learner.

Efficient handling of high-dimensional feature spaces by randomized classifier ensembles
Aleksander Kołcz, Xiaomei Sun, Jugal Kalita
Pages: 307-313
DOI: 10.1145/775047.775093
Full text: PDF
Handling massive datasets is a difficult problem not only due to prohibitively large numbers of entries but in some cases also due to the very high dimensionality of the data. Often, severe feature selection is performed to limit the number of attributes to a manageable size, which unfortunately can lead to a loss of useful information. Feature space reduction may well be necessary for many stand-alone classifiers, but recent advances in the area of ensemble classifier techniques indicate that accurate classifier aggregates can be learned even if each individual classifier operates on incomplete "feature view" training data, i.e., data from which certain input attributes are excluded. In fact, by using only small random subsets of features to build individual component classifiers, surprisingly accurate and robust models can be created. In this work we demonstrate how these types of architectures effectively reduce the feature space for sub-models and groups of sub-models, which lends itself to efficient sequential and/or parallel implementations. Experiments with a randomized version of AdaBoost are used to support our arguments, using the text classification task as an example.

SESSION: Industry track papers

From run-time behavior to usage scenarios: an interaction-pattern mining approach
Mohammad El-Ramly, Eleni Stroulia, Paul Sorenson
Pages: 315-324
DOI: 10.1145/775047.775095
Full text: PDF
A key challenge facing IT organizations today is their evolution towards adopting e-business practices, which gives rise to the need for reengineering their underlying software systems. Any reengineering effort has to be aware of the functional requirements of the subject system, in order not to violate the integrity of its intended uses. However, as software systems get regularly maintained throughout their lifecycle, the documentation of their requirements often becomes obsolete or gets lost. To address this problem of "software requirements loss", we have developed an interaction-pattern mining method for the recovery of functional requirements as usage scenarios. Our method analyzes traces of the run-time system-user interaction to discover frequently recurring patterns; these patterns correspond to the functionality currently exercised by the system users, represented as usage scenarios. The discovered scenarios provide the basis for reengineering the software system into web-accessible components, each one supporting one of the discovered scenarios. In this paper, we describe IPM2, our interaction-pattern discovery algorithm, illustrate it with a case study from a real application, and give an overview of the reengineering process in the context of which it is employed.

Exploiting response models: optimizing cross-sell and up-sell opportunities in banking
Andrew Storey, Marc-david Cohen
Pages: 325-331
DOI: 10.1145/775047.775096
Full text: PDF
The banking industry regularly mounts campaigns to improve customer value by offering new products to existing customers. In recent years this approach has gained significant momentum because of the increasing availability of customer data and the improved analysis capabilities in data mining. Typically, response models based on historical data are used to estimate the probability of a customer purchasing an additional product and the expected return from that additional purchase. Even with these computational improvements and accurate models of customer behavior, the problem of efficiently using marketing resources to maximize the return on marketing investment is a challenge. This problem is compounded because of the capability to launch multiple campaigns through several distribution channels over multiple time periods. The combination of alternatives creates a complicated array of possible actions. This paper presents a solution that answers the question of what products, if any, to offer to each customer in a way that maximizes the marketing return on investment. The solution is an improvement over the usual approach of picking the customers that have the largest expected value for a particular product because it is a global maximization from the viewpoint of the bank and allows for the effective implementation of business constraints across customers and business units. The approach accounts for limited resources, multiple sequential campaigns, and other business constraints. Furthermore, the solution provides insight into the cost of these constraints, in terms of decreased profits, and thus is an effective tool for both tactical campaign execution and strategic planning.

Customer lifetime value modeling and its use for customer retention planning
Saharon Rosset, Einat Neumann, Uri Eick, Nurit Vatnik, Yizhak Idan
Pages: 332-340
DOI: 10.1145/775047.775097
Full text: PDF
We present and discuss the important business problem of estimating the effect of retention efforts on the Lifetime Value of a customer in the Telecommunications industry. We discuss the components of this problem, in particular customer value and length of service (or tenure) modeling, and present a novel segment-based approach, motivated by the segment-level view marketing analysts usually employ. We then describe how we build on this approach to estimate the effects of retention on Lifetime Value. Our solution has been successfully implemented in Amdocs' Business Insight (BI) platform, and we illustrate its usefulness in real-world scenarios.

Mining product reputations on the Web
Satoshi Morinaga, Kenji Yamanishi, Kenji Tateishi, Toshikazu Fukushima
Pages: 341-349
DOI: 10.1145/775047.775098
Full text: PDF
Knowing the reputations of your own and/or competitors' products is important for marketing and customer relationship management. It is, however, very costly to collect and analyze survey data manually. This paper presents a new framework for mining product reputations on the Internet. It automatically collects people's opinions about target products from Web pages, and it uses text mining techniques to obtain the reputations of those products. On the basis of human-test samples, we generate in advance syntactic and linguistic rules to determine whether any given statement is an opinion or not, as well as whether any such opinion is positive or negative in nature. We first collect statements regarding target products using a general search engine, and then, using the rules, extract opinions from among them and attach three labels to each opinion: labels indicating the positive/negative determination, the product name itself, and a numerical value expressing the degree of system confidence that the statement is, in fact, an opinion. The labeled opinions are then input into an opinion database. The mining of reputations, i.e., the finding of statistically meaningful information included in the database, is then conducted. We specify target categories using label values (such as positive opinions of product A) and perform four types of text mining: extraction of 1) characteristic words, 2) co-occurrence words, and 3) typical sentences for individual target categories, and 4) correspondence analysis among multiple target categories. Actual marketing data is used to demonstrate the validity and effectiveness of the framework, which offers a drastic reduction in the overall cost of reputation analysis over that of conventional survey approaches and supports the discovery of knowledge from the pool of opinions on the web.

Learning domain-independent string transformation weights for high accuracy object identification
Sheila Tejada, Craig A. Knoblock, Steven Minton
Pages: 350-359
DOI: 10.1145/775047.775099
Full text: PDF
The task of object identification occurs when integrating information from multiple websites. The same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. Previous methods of object identification have required manual construction of domain-specific string transformations or manual setting of general transformation parameter weights for recognizing format inconsistencies. This manual process can be time consuming and error-prone. We have developed an object identification system called Active Atlas [18], which applies a set of domain-independent string transformations to compare the objects' shared attributes in order to identify matching objects. In this paper, we discuss extensions to the Active Atlas system, which allow it to learn to tailor the weights of a set of general transformations to a specific application domain through limited user input. The experimental results demonstrate that this approach achieves higher accuracy and requires less user involvement than previous methods across various application domains.

A system for real-time competitive market intelligence
Sholom M. Weiss, Naval K. Verma
Pages: 360-365
DOI: 10.1145/775047.775100
Full text: PDF
A method is described for real-time market intelligence and competitive analysis. News stories are collected online for a designated group of companies. The goal is to detect critical differences in the text written about a company versus the text for its competitors. A solution is found by mapping the task into a non-stationary text categorization model. The overall design consists of the following components: (a) a real-time crawler that monitors newswires for stories about the competitors (b) a conditional document retriever that selects only those documents that meet the indicated conditions (c) text analysis techniques that convert the documents to a numerical format (d) rule induction methods for finding patterns in data (e) presentation techniques for displaying results. The method is extended to combine text with numerical measures, such as those based on stock prices and market capitalizations, that allow for more objective evaluations and projections. expand
|
|
|
Mining intrusion detection alarms for actionable knowledge
Klaus Julisch, Marc Dacier
Pages: 366-375
doi>10.1145/775047.775101
Full text: PDF

In response to attacks against enterprise networks, administrators increasingly deploy intrusion detection systems. These systems monitor hosts, networks, and other resources for signs of security violations. The use of intrusion detection has given rise to another difficult problem, namely the handling of a generally large number of alarms. In this paper, we mine historical alarms to learn how future alarms can be handled more efficiently. First, we investigate episode rules with respect to their suitability in this approach. We report the difficulties encountered and the unexpected insights gained. In addition, we introduce a new conceptual clustering technique, and use it in extensive experiments with real-world data to show that intrusion detection alarms can be handled efficiently by using previously mined knowledge.

Learning nonstationary models of normal network traffic for detecting novel attacks
Matthew V. Mahoney, Philip K. Chan
Pages: 376-385
doi>10.1145/775047.775102
Full text: PDF

Traditional intrusion detection systems (IDS) detect attacks by comparing current behavior to signatures of known attacks. One main drawback is their inability to detect new attacks that have no known signatures. In this paper we propose a learning algorithm that constructs models of normal behavior from attack-free network traffic. Behavior that deviates from the learned normal model signals possible novel attacks. Our IDS is unique in two respects. First, it is nonstationary, modeling probabilities based on the time since the last event rather than on average rate; this prevents alarm floods. Second, the IDS learns protocol vocabularies (at the data link through application layers) in order to detect unknown attacks that attempt to exploit implementation errors in poorly tested features of the target software. On the 1999 DARPA IDS evaluation data set [9], we detect 70 of 180 attacks (with 100 false alarms), about evenly divided between user behavioral anomalies (IP addresses and ports, as modeled by most other systems) and protocol anomalies. Because our methods are unconventional, there is significant non-overlap between our IDS and the original DARPA participants, which implies that they could be combined to increase coverage.

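As a hedged illustration of "probabilities based on the time since the last event rather than on average rate": one simple nonstationary scoring rule (the class, method names, and exact formula below are assumptions for illustration, not taken from the paper) scores a novel attribute value by t·n/r, where t is the time since that attribute last produced an anomaly, n is the number of training observations, and r is the number of distinct values seen. Long quiet periods thus make a novel value more surprising, while a burst of novelties scores low after the first, damping alarm floods.

```python
class NonstationaryAnomalyModel:
    """Sketch of a nonstationary anomaly score for one attribute:
    novel values are scored by t * n / r rather than by average rate."""

    def __init__(self):
        self.values = set()          # distinct values seen so far
        self.n = 0                   # number of training observations
        self.last_anomaly_time = 0.0

    def train(self, value):
        self.values.add(value)
        self.n += 1

    def score(self, value, now):
        # Previously seen values are not anomalous.
        if value in self.values:
            return 0.0
        # t = time since the last anomaly; consecutive novelties score low.
        t = now - self.last_anomaly_time
        self.last_anomaly_time = now
        self.values.add(value)       # the novel value joins the vocabulary
        return t * self.n / len(self.values)
```

A second novel value arriving immediately after the first gets t near zero, which is exactly the flood-damping behavior the abstract describes.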
ADMIT: anomaly-based data mining for intrusions
Karlton Sequeira, Mohammed Zaki
Pages: 386-395
doi>10.1145/775047.775103
Full text: PDF

Security of computer systems is essential to their acceptance and utility. Computer security analysts use intrusion detection systems to assist them in maintaining computer system security. This paper deals with the problem of differentiating between masqueraders and the true user of a computer terminal. Prior efficient solutions are less suited to real-time application: they often require all training data to be labeled, and do not inherently provide an intuitive idea of what the data model means. Our system, called ADMIT, relaxes these constraints by creating user profiles using semi-incremental techniques. It is a real-time intrusion detection system with host-based data collection and processing. Our method also suggests ideas for dealing with concept drift, and affords a detection rate as high as 80.3% and a false positive rate as low as 15.3%.

Handling very large numbers of association rules in the analysis of microarray data
Alexander Tuzhilin, Gediminas Adomavicius
Pages: 396-404
doi>10.1145/775047.775104
Full text: PDF

The problem of analyzing microarray data has become one of the important topics in bioinformatics over the past several years, and different data mining techniques have been proposed for the analysis of such data. In this paper, we propose to use association rule discovery methods for determining associations among expression levels of different genes. One of the main problems related to the discovery of these associations is scalability. Microarrays usually contain very large numbers of genes, sometimes measured in the 10,000s. Therefore, analysis of such data can generate a very large number of associations, often measured in millions. The paper addresses this problem by presenting a method that enables biologists to evaluate these very large numbers of discovered association rules during the post-analysis stage of the data mining process. This is achieved by providing several rule evaluation operators, including rule grouping, filtering, browsing, and data inspection operators, that allow biologists to validate multiple individual gene regulation patterns at a time. By iteratively applying these operators, biologists can explore a significant part of all the initially generated rules in an acceptable period of time and thus answer biological questions that are of particular interest to them. To validate our method, we tested our system on microarray data pertaining to studies of environmental hazards and their influence on gene expression processes. As a result, we managed to answer several questions that were of interest to the biologists who had collected this data.

On the potential of domain literature for clustering and Bayesian network learning
Peter Antal, Patrick Glenisson, Geert Fannes
Pages: 405-414
doi>10.1145/775047.775105
Full text: PDF

Thanks to its increasing availability, electronic literature can now be a major source of information when developing complex statistical models where data is scarce or contains much noise. This raises the question of how to integrate information from domain literature with statistical data. Because quantifying similarities or dependencies between variables is a basic building block in knowledge discovery, we consider here the following question: which vector representations of text and which statistical scores of similarity or dependency best support the use of literature in statistical models? For the text source, we assume we have annotations for the domain variables as short free-text descriptions and, optionally, a large literature repository from which we can further expand the annotations. For evaluation, we contrast the variable similarities or dependencies obtained from text using different annotation sources and vector representations with those obtained from measurement data or expert assessments. Specifically, we consider two learning problems: clustering and Bayesian network learning. First, we report performance (against an expert reference) for clustering yeast genes from textual annotations. Second, we assess the agreement between text-based and data-based scores of variable dependencies when learning Bayesian network substructures for the task of modeling the joint distribution of clinical measurements of ovarian tumors.

Mining heterogeneous gene expression data with time lagged recurrent neural networks
Yulan Liang, Arpad Kelemen
Pages: 415-421
doi>10.1145/775047.775106
Full text: PDF

Heterogeneous types of gene expressions may provide better insight into the biological role of gene interaction with the environment, disease development, and drug effect at the molecular level. In this paper, for both exploration and prediction purposes, a Time-Lagged Recurrent Neural Network with trajectory learning is proposed for identifying and classifying gene functional patterns from heterogeneous nonlinear time series microarray experiments. The proposed procedures identify gene functional patterns from the dynamics of a state trajectory learned in the heterogeneous time series and the gradient information over time. Moreover, trajectory learning with the back-propagation-through-time algorithm can recognize gene expression patterns that vary over time, which may reveal much more information about the regulatory network underlying gene expressions. The analyzed data were extracted from spotted DNA microarrays of budding yeast expression measurements produced by Eisen et al. The gene matrix contained 79 experiments over a variety of heterogeneous experimental conditions. The number of recognized gene patterns in our study ranged from two to ten, divided into three cases. Optimal network architectures with different memory structures were selected based on the Akaike and Bayesian information criteria using a two-way factorial design. The optimal model performance was compared to other popular gene classification algorithms such as Nearest Neighbor, Support Vector Machine, and Self-Organizing Map. The reliability of the performance was verified with multiple iterated runs.

POSTER SESSION: Poster papers

Collaborative crawling: mining user experiences for topical resource discovery
Charu C. Aggarwal
Pages: 423-428
doi>10.1145/775047.775108
Full text: PDF

The rapid growth of the world wide web has made the problem of topic-specific resource discovery an important one in recent years. In this problem, it is desired to find web pages which satisfy a predicate specified by the user. Such a predicate could be a keyword query, a topical query, or some arbitrary constraint. Several techniques such as focused crawling and intelligent crawling have recently been proposed for topic-specific resource discovery. All these crawlers are linkage based, since they use hyperlink behavior in order to perform resource discovery. Recent studies have shown that the topical correlations in hyperlinks are quite noisy and may not always show the consistency necessary for a reliable resource discovery process. In this paper, we approach the problem of resource discovery from an entirely different perspective: we mine the significant browsing patterns of world wide web users in order to model the likelihood of web pages belonging to a specified predicate. This user behavior can be mined from the freely available traces of large public-domain proxies on the world wide web. We refer to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources. Such a strategy is extremely effective because the topical consistency in world wide web browsing patterns turns out to be very reliable. In addition, the user-centered crawling system can be combined with linkage-based systems to create an overall system which works more effectively than a system based purely on either user behavior or hyperlinks.

Sequential PAttern mining using a bitmap representation
Jay Ayres, Jason Flannick, Johannes Gehrke, Tomi Yiu
Pages: 429-435
doi>10.1145/775047.775109
Full text: PDF

We introduce a new algorithm for mining sequential patterns. Our algorithm is especially efficient when the sequential patterns in the database are very long. We introduce a novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with efficient support counting. A salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion. In a thorough experimental evaluation of our algorithm on standard benchmark data from the literature, our algorithm outperforms previous work by up to an order of magnitude.

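The core of a vertical bitmap representation can be illustrated on plain transaction data (the paper's machinery for extending full sequences is more involved): each item gets one bitmap with bit i set when the item occurs in transaction i, and the support of an itemset is the popcount of the AND of its items' bitmaps. A minimal sketch using Python integers as bitmaps; `build_bitmaps` and `support` are illustrative names, not the authors' API:

```python
def build_bitmaps(transactions):
    """Vertical layout: one integer bitmap per item; bit i is set
    when the item occurs in transaction i."""
    bitmaps = {}
    for i, t in enumerate(transactions):
        for item in t:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << i)
    return bitmaps

def support(itemset, bitmaps, n):
    """Support of an itemset = popcount of the AND of its bitmaps."""
    acc = (1 << n) - 1          # start with all n transactions
    for item in itemset:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")
```

Counting support this way is a bitwise AND plus a popcount per candidate, which is what makes depth-first extension of long patterns cheap.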
Frequent term-based text clustering
Florian Beil, Martin Ester, Xiaowei Xu
Pages: 436-442
doi>10.1145/775047.775110
Full text: PDF

Text clustering methods can be used to structure large sets of text or hypertext documents. The well-known methods of text clustering, however, do not really address the special problems of text clustering: very high dimensionality of the data, very large size of the databases and understandability of the cluster description. In this paper, we introduce a novel approach which uses frequent item (term) sets for text clustering. Such frequent sets can be efficiently discovered using algorithms for association rule mining. To cluster based on frequent term sets, we measure the mutual overlap of frequent sets with respect to the sets of supporting documents. We present two algorithms for frequent term-based text clustering, FTC which creates flat clusterings and HFTC for hierarchical clustering. An experimental evaluation on classical text documents as well as on web documents demonstrates that the proposed algorithms obtain clusterings of comparable quality significantly more efficiently than state-of-the-art text clustering algorithms. Furthermore, our methods provide an understandable description of the discovered clusters by their frequent term sets.

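The greedy, overlap-driven selection a flat clustering of this kind performs can be sketched with one simple overlap measure, the fraction of a candidate's supporting documents already covered (this measure and all names below are illustrative assumptions, not the paper's exact definitions):

```python
def ftc_cluster(freq_sets, docs):
    """Greedy flat clustering sketch: repeatedly pick the frequent term
    set whose supporting documents overlap least with those already
    covered; its uncovered support becomes the next cluster."""
    # cov(F) = indices of documents containing every term of F
    covers = {fs: {i for i, d in enumerate(docs) if fs <= d}
              for fs in freq_sets}
    covered, clusters = set(), []
    remaining = list(freq_sets)
    while len(covered) < len(docs) and remaining:
        best = min(remaining,
                   key=lambda fs: len(covers[fs] & covered)
                                  / max(len(covers[fs]), 1))
        clusters.append((best, covers[best] - covered))
        covered |= covers[best]
        remaining.remove(best)
    return clusters
```

The frequent term set chosen for each cluster doubles as its human-readable description, which is the understandability benefit the abstract emphasizes.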
A theoretical framework for learning from a pool of disparate data sources
Shai Ben-David, Johannes Gehrke, Reba Schuller
Pages: 443-449
doi>10.1145/775047.775111
Full text: PDF

Many enterprises incorporate information gathered from a variety of data sources into an integrated input for some learning task. For example, aiming towards the design of an automated diagnostic tool for some disease, one may wish to integrate data gathered in many different hospitals. A major obstacle to such endeavors is that different data sources may vary considerably in the way they choose to represent related data. In practice, the problem is usually solved by a manual construction of semantic mappings and translations between the different sources. Recently there have been attempts to introduce automated algorithms based on machine learning tools for the construction of such translations. In this work we propose a theoretical framework for making classification predictions from a collection of different data sources, without creating explicit translations between them. Our framework allows a precise mathematical analysis of the complexity of such tasks, and it provides a tool for the development and comparison of different learning algorithms. Our main objective, at this stage, is to demonstrate the usefulness of computational learning theory to this practically important area and to stimulate further theoretical and experimental research of questions related to this framework.

Topics in 0--1 data
Ella Bingham, Heikki Mannila, Jouni K. Seppänen
Pages: 450-455
doi>10.1145/775047.775112
Full text: PDF

Large 0--1 datasets arise in various applications, such as market basket analysis and information retrieval. We concentrate on the study of topic models, aiming at results which indicate why certain methods succeed or fail. We describe simple algorithms for finding topic models from 0--1 data. We give theoretical results showing that the algorithms can discover the epsilon-separable topic models of Papadimitriou et al. We present empirical results showing that the algorithms find natural topics in real-world data sets. We also briefly discuss the connections to matrix approaches, including nonnegative matrix factorization and independent component analysis.

Extracting decision trees from trained neural networks
Olcay Boz
Pages: 456-461
doi>10.1145/775047.775113
Full text: PDF

Neural Networks are successful in acquiring hidden knowledge from datasets. Their biggest weakness is that the knowledge they acquire is represented in a form not understandable to humans. Researchers have tried to address this problem by extracting rules from trained Neural Networks. Most of the proposed rule extraction methods require a specialized type of Neural Network; some require binary inputs and some are computationally expensive. Craven proposed extracting MofN-type Decision Trees from Neural Networks. We believe MofN-type Decision Trees are only good for MofN-type problems, and trees created for regular high-dimensional real-world problems may be very complex. In this paper, we introduce a new method for extracting regular C4.5-like Decision Trees from trained Neural Networks. We show that the new method (DecText) is effective in extracting high-fidelity trees from trained networks. We also introduce a new discretization technique to enable DecText to handle continuous features, and a new pruning technique for finding the simplest tree with the highest fidelity.

A new two-phase sampling based algorithm for discovering association rules
Bin Chen, Peter Haas, Peter Scheuermann
Pages: 462-468
doi>10.1145/775047.775114
Full text: PDF

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90--95% accuracy using a final sample having a size of only 15--33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database --- even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

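One way to picture the Phase II trimming idea, as a sketch rather than the authors' algorithm: greedily drop the transaction whose removal brings the final sample's item supports closest, in L1 distance, to the Phase I support estimates. All names and the O(n²) greedy loop are illustrative assumptions:

```python
from collections import Counter

def item_supports(transactions):
    """Fraction of transactions containing each item."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}

def dist(s1, s2):
    """L1 distance between two support vectors."""
    items = set(s1) | set(s2)
    return sum(abs(s1.get(i, 0.0) - s2.get(i, 0.0)) for i in items)

def trim(sample, reference, final_size):
    """Greedily drop the transaction whose removal moves the sample's
    supports closest to the Phase I reference supports."""
    sample = list(sample)
    while len(sample) > final_size:
        best_i = min(range(len(sample)),
                     key=lambda i: dist(item_supports(sample[:i] + sample[i + 1:]),
                                        reference))
        sample.pop(best_i)
    return sample
```

The point of the design is that the expensive rule mining then runs only on the small trimmed sample, while the single Phase I pass anchors its statistics to the whole database.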
CVS: a Correlation-Verification based Smoothing technique on information retrieval and term clustering
Christina Yip Chung, Bin Chen
Pages: 469-474
doi>10.1145/775047.775115
Full text: PDF

As information volume in enterprise systems and in the Web grows rapidly, how to accurately retrieve information is an important research area. Several corpus based smoothing techniques have been proposed to address the data sparsity and synonym problems faced by information retrieval systems. Such smoothing techniques are often unable to discover and utilize the correlations among terms. We propose CVS, a Correlation-Verification based Smoothing method, that considers co-occurrence information in smoothing. Strongly correlated terms in a document are identified by their co-occurrence frequencies in the document. To avoid missing correlated terms with low co-occurrence frequencies but specific to the theme of the document, the joint distributions of terms in the document are compared with those in the corpus for statistical significance. A common approach to apply corpus based smoothing techniques to information retrieval is by refining the vector representations of documents. This paper investigates the effects of corpus based smoothing on information retrieval by query expansion using term clusters generated from a term clustering process. The results can also be viewed in light of the effects of smoothing on clustering. Empirical studies show that our approach outperforms previous corpus based smoothing techniques. It improves retrieval effectiveness by 14.6%. The results demonstrate that corpus based smoothing can be used for query expansion by term clustering.

Learning to match and cluster large high-dimensional data sets for data integration
William W. Cohen, Jacob Richman
Pages: 475-480
doi>10.1145/775047.775116
Full text: PDF

Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.

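Scalable name matching of this kind is commonly built on TF-IDF weighted token vectors compared by cosine similarity; the sketch below shows that basis only (it is an assumption that this matches the paper's representation, and the adaptive, trainable layer is omitted):

```python
import math
from collections import Counter

def tfidf_vectors(names):
    """Unit-normalized TF-IDF token vectors for a list of names.
    Rare tokens get high weight, so they dominate the match score."""
    docs = [name.lower().split() for name in names]
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))  # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

Because the vectors are sparse, candidate pairs can be generated from an inverted index on tokens rather than by comparing all pairs, which is where the scalability comes from.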
SECRET: a scalable linear regression tree algorithm
Alin Dobra, Johannes Gehrke
Pages: 481-487
doi>10.1145/775047.775117
Full text: PDF

Developing regression models for large datasets that are both accurate and easy to interpret is a very important data mining problem. Regression trees with linear models in the leaves satisfy both these requirements, but thus far, no truly scalable regression tree algorithm is known. This paper proposes a novel regression tree construction algorithm (SECRET) that produces trees of high quality and scales to very large datasets. At every node, SECRET uses the EM algorithm for Gaussian mixtures to find two clusters in the data and to locally transform the regression problem into a classification problem based on closeness to these clusters. Goodness of split measures, like the Gini gain, can then be used to determine the split variable and the split point much like in classification tree construction. Scalability of the algorithm can be achieved by employing scalable versions of the EM and classification tree construction algorithms. An experimental evaluation on real and artificial data shows that SECRET has accuracy comparable to other linear regression tree algorithms but takes orders of magnitude less computation time for large datasets.

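The classification-tree step borrowed here can be sketched in isolation: once the two EM clusters have induced binary labels on the node's points, the split is the threshold maximizing Gini gain. This illustrates the criterion only, under the assumption of a single numeric variable and binary labels; it is not the SECRET implementation:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, labels):
    """Threshold on one variable maximizing Gini gain, as in
    classification-tree construction."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    parent = gini(labels)
    best_gain, best_thr = -1.0, None
    for k in range(1, len(xs)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # cannot split between equal values
        left = [labels[order[i]] for i in range(k)]
        right = [labels[order[i]] for i in range(k, len(xs))]
        w = len(left) / len(xs)
        gain = parent - (w * gini(left) + (1 - w) * gini(right))
        if gain > best_gain:
            best_gain = gain
            best_thr = (xs[order[k - 1]] + xs[order[k]]) / 2
    return best_thr, best_gain
```

In the full algorithm this search runs per candidate variable at each node, with the EM-derived labels standing in for the regression targets.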
Statistical modeling of large-scale simulation data
Tina Eliassi-Rad, Terence Critchlow, Ghaleb Abdulla
Pages: 488-494
doi>10.1145/775047.775118
Full text: PDF

With the advent of fast computer systems, scientists are now able to generate terabytes of simulation data. Unfortunately, the sheer size of these data sets has made efficient exploration of them impossible. To aid scientists in gleaning insight from their simulation data, we have developed an ad-hoc query infrastructure. Our system, called AQSim (short for Ad-hoc Queries for Simulation), reduces the data storage requirements and query access times in two stages. First, it creates and stores mathematical and statistical models of the data at multiple resolutions. Second, it evaluates queries on the models of the data instead of on the entire data set. In this paper, we present two simple but effective statistical modeling techniques for simulation data. Our first modeling technique computes the "true" (unbiased) mean of systematic partitions of the data. It makes no assumptions about the distribution of the data and uses a variant of the root mean square error to evaluate a model. Our second statistical modeling technique uses the Anderson-Darling goodness-of-fit method on systematic partitions of the data. This method evaluates a model by how well it passes the normality test on the data. Both of our statistical models effectively answer range queries. At each resolution of the data, we compute the precision of our answer to the user's query by scaling the one-sided Chebyshev Inequalities with the original mesh's topology. We combine precisions at different resolutions by calculating their weighted average. Our experimental evaluations on two scientific simulation data sets illustrate the value of using these statistical modeling techniques on multiple resolutions of large simulation data sets.

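The precision computation rests on the one-sided Chebyshev (Cantelli) inequality, P(X ≥ μ + kσ) ≤ 1/(1 + k²), which holds for any distribution with finite mean and variance. A minimal sketch of the bound itself; the paper's scaling by mesh topology and weighted averaging across resolutions are omitted, and the function name is illustrative:

```python
def cantelli_bound(mean, std, threshold):
    """One-sided Chebyshev (Cantelli) bound on P(X >= threshold),
    distribution-free given only the mean and standard deviation."""
    if threshold <= mean:
        return 1.0   # the one-sided bound is uninformative below the mean
    if std == 0:
        return 0.0   # a degenerate distribution never exceeds its mean
    k = (threshold - mean) / std
    return 1.0 / (1.0 + k * k)
```

Because the bound needs only a partition's stored mean and standard deviation, a range query can be answered entirely from the model, with the bound serving as the reported precision.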
Tumor cell identification using features rules
Bin Fang, Wynne Hsu, Mong Li Lee
Pages: 495-500
doi>10.1145/775047.775119
Full text: PDF

Advances in imaging techniques have led to large repositories of images. There is an increasing demand for automated systems that can analyze complex medical images and extract meaningful information for mining patterns. Here, we describe a real-life image mining application to the problem of tumor cell counting. The quantitative analysis of tumor cells is fundamental to characterizing their activity. Existing approaches are mostly manual, time-consuming, and subjective. Efforts to automate the process of cell counting have largely focused on using image processing techniques only. Our studies indicate that image processing alone is unable to give accurate results. In this paper, we examine the use of extracted features rules to aid in the process of tumor cell counting. We propose robust local adaptive thresholding and dynamic water immersion algorithms to segment regions of interest from the background. Meaningful features are then extracted from the segmented regions. A number of base classifiers are built to generate features rules that help identify tumor cells. Two voting strategies are implemented to combine the base classifiers into a meta-classifier. Experimental results indicate that this process of using extracted features rules to help identify tumor cells leads to better accuracy than pure image processing techniques alone.

Integrating feature and instance selection for text classification |
| |
Dimitris Fragoudis,
Dimitris Meretakis,
Spiros Likothanassis
|
|
Pages: 501-506 |
|
doi>10.1145/775047.775120 |
|
Full text: PDF
|
|
Instance selection and feature selection are two orthogonal methods for reducing the amount and complexity of data. Feature selection aims at the reduction of redundant features in a dataset whereas instance selection aims at the reduction of the number ...
Instance selection and feature selection are two orthogonal methods for reducing the amount and complexity of data. Feature selection aims at the reduction of redundant features in a dataset whereas instance selection aims at the reduction of the number of instances. So far, these two methods have mostly been considered in isolation. In this paper, we present a new algorithm, which we call FIS (Feature and Instance Selection) that targets both problems simultaneously in the context of text classificationOur experiments on the Reuters and 20-Newsgroups datasets show that FIS considerably reduces both the number of features and the number of instances. The accuracy of a range of classifiers including Naïve Bayes, TAN and LB considerably improves when using the FIS preprocessed datasets, matching and exceeding that of Support Vector Machines, which is currently considered to be one of the best text classification methods. In all cases the results are much better compared to Mutual Information based feature selection. The training and classification speed of all classifiers is also greatly improved. expand

SyMP: an efficient clustering approach to identify clusters of arbitrary shapes in large data sets
Hichem Frigui
Pages: 507-512
doi: 10.1145/775047.775121
Full text: PDF

We propose a new clustering algorithm, called SyMP, which is based on synchronization of pulse-coupled oscillators. SyMP represents each data point by an Integrate-and-Fire oscillator and uses the relative similarity between the points to model the interaction between the oscillators. SyMP is robust to noise and outliers, determines the number of clusters in an unsupervised manner, identifies clusters of arbitrary shapes, and can handle very large data sets. The robustness of SyMP is an intrinsic property of the synchronization mechanism. To determine the optimum number of clusters, SyMP uses a dynamic resolution parameter. To identify clusters of various shapes, SyMP models each cluster by multiple Gaussian components. The number of components is automatically determined using a dynamic intra-cluster resolution parameter. Clusters with simple shapes are modeled by few components, while clusters with more complex shapes require a larger number of components. The scalable version of SyMP uses an efficient incremental approach that requires a single pass through the data set. The proposed clustering approach is empirically evaluated on several synthetic and real data sets, and its performance is compared with CURE.

Scaling multi-class support vector machines using inter-class confusion
Shantanu Godbole, Sunita Sarawagi, Soumen Chakrabarti
Pages: 513-518
doi: 10.1145/775047.775122
Full text: PDF

Support vector machines (SVMs) excel at two-class discriminative learning problems. They often outperform generative classifiers, especially those that use inaccurate generative models, such as the naïve Bayes (NB) classifier. On the other hand, generative classifiers have no trouble in handling an arbitrary number of classes efficiently, and NB classifiers train much faster than SVMs owing to their extreme simplicity. In contrast, SVMs handle multi-class problems by learning redundant yes/no (one-vs-others) classifiers for each class, further worsening the performance gap. We propose a new technique for multi-way classification which exploits the accuracy of SVMs and the speed of NB classifiers. We first use a NB classifier to quickly compute a confusion matrix, which is used to reduce the number and complexity of the two-class SVMs that are built in the second stage. During testing, we first get the prediction of a NB classifier and use that to selectively apply only a subset of the two-class SVMs. On standard benchmarks, our algorithm is 3 to 6 times faster than SVMs and yet matches or even exceeds their accuracy.
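The confusion-matrix pruning idea can be sketched as follows. The class names, counts, and threshold are hypothetical, and the paper's actual reduction of the second-stage SVMs is more involved; this shows only the selection logic:

```python
def confused_classes(confusion, cls, threshold=0.05):
    """From a confusion matrix (dict: true class -> dict: predicted class ->
    count), return the classes that `cls` is confused with often enough that
    their one-vs-rest SVMs must still be consulted at test time."""
    row = confusion[cls]
    total = sum(row.values())
    return {c for c, n in row.items() if c != cls and n / total >= threshold}

# Hypothetical NB confusion counts on a held-out set.
confusion = {
    "sports":   {"sports": 90, "politics": 2,  "finance": 8},
    "politics": {"sports": 1,  "politics": 85, "finance": 14},
    "finance":  {"sports": 3,  "politics": 12, "finance": 85},
}
# A test document NB labels "sports" only needs the SVM(s) for
# {"finance"} (plus "sports" itself) rather than one per class.
suspects = confused_classes(confusion, "sports")
```

The speedup comes from the fact that NB rarely confuses most class pairs, so most two-class SVMs never need to be evaluated for a given test document.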

Visualization support for a user-centered KDD process
TuBao Ho, TrongDung Nguyen, DungDuc Nguyen
Pages: 519-524
doi: 10.1145/775047.775123
Full text: PDF

Viewing knowledge discovery as a user-centered process that requires effective collaboration between the user and the discovery system, our work aims to support an active role for the user in that process through synergistic visualization tools integrated in our discovery system D2MS. These tools visualize the entire process of knowledge discovery in order to help the user with data preprocessing, selecting mining algorithms and parameters, evaluating and comparing discovered models, and taking control of the whole discovery process. Our case studies with two medical datasets, on meningitis and stomach cancer, show that with the visualization tools in D2MS the user gains better insight into each step of the knowledge discovery process as well as the relationship between data and discovered knowledge.

Mining complex models from arbitrarily large databases in constant time
Geoff Hulten, Pedro Domingos
Pages: 525-531
doi: 10.1145/775047.775124
Full text: PDF

In this paper we propose a scaling-up method that is applicable to essentially any induction algorithm based on discrete search. The result of applying the method to an algorithm is that its running time becomes independent of the size of the database, while the decisions made are essentially identical to those that would be made given infinite data. The method works within pre-specified memory limits and, as long as the data is iid, only requires accessing it sequentially. It gives anytime results, and can be used to produce batch, stream, time-changing and active-learning versions of an algorithm. We apply the method to learning Bayesian networks, developing an algorithm that is faster than previous ones by orders of magnitude, while achieving essentially the same predictive performance. We observe these gains on a series of large databases generated from benchmark networks, on the KDD Cup 2000 e-commerce data, and on a Web log containing 100 million requests.
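The statistical machinery behind "decisions essentially identical to those made given infinite data" is, in this line of work, typically the Hoeffding bound; a minimal sketch (the paper's exact decision criterion may differ):

```python
import math

def hoeffding_bound(value_range, confidence, n):
    """Width epsilon such that, after n iid observations of a random variable
    with the given range, the true mean lies within epsilon of the sample
    mean with probability at least `confidence` (Hoeffding's inequality)."""
    delta = 1.0 - confidence
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# After 10,000 examples, a score averaged over [0, 1] is pinned down to
# within ~0.015 at 99% confidence, so two candidate search steps whose
# sample scores differ by more than that can be separated without
# scanning the rest of the database.
eps = hoeffding_bound(1.0, 0.99, 10_000)
```

Because epsilon shrinks as 1/sqrt(n), the search can stop reading data as soon as the leading candidate's margin exceeds epsilon, which is what makes the running time independent of database size.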

A model for discovering customer value for E-content
Srinivasan Jagannathan, Jayanth Nayak, Kevin Almeroth, Markus Hofmann
Pages: 532-537
doi: 10.1145/775047.775125
Full text: PDF

There exists a huge demand for multimedia goods and services on the Internet. Currently available bandwidth can support the sale of downloadable content such as CDs, e-books, etc., as well as services like video-on-demand. In the future, such services will be prevalent on the Internet. Since costs are typically fixed, maximizing revenue can maximize profits. A primary determinant of revenue in such e-content markets is how much value customers associate with the content. Though marketing surveys are useful, they cannot adapt to the dynamic nature of the Internet market. In this work, we examine how to learn customer valuations in close to real time. Our contributions in this paper are threefold: (1) we develop a probabilistic model to describe customer behavior, (2) we develop a framework for pricing e-content based on basic economic principles, and (3) we propose a price-discovering algorithm that learns customer behavior parameters and suggests prices to an e-content provider. We validate our algorithm using simulations. Our simulations indicate that our algorithm generates revenue close to the maximum expected revenue. Further, they also indicate that the algorithm is robust to transient customer behavior.

SimRank: a measure of structural-context similarity
Glen Jeh, Jennifer Widom
Pages: 538-543
doi: 10.1145/775047.775126
Full text: PDF

The problem of measuring "similarity" of objects arises in many applications, and many domain-specific measures have been developed, e.g., matching text across documents or computing overlap among item-sets. We propose a complementary approach, applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, we compute a measure that says "two objects are similar if they are related to similar objects." This general similarity measure, called SimRank, is based on a simple and intuitive graph-theoretic model. For a given domain, SimRank can be combined with other domain-specific similarity measures. We suggest techniques for efficient computation of SimRank scores, and provide experimental results on two application domains showing the computational feasibility and effectiveness of our approach.
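The recursive definition lends itself to a naive fixed-point iteration; a sketch, assuming the textbook formulation with decay factor C (the graph and node names below are illustrative):

```python
def simrank(graph, C=0.8, iters=10):
    """Naive SimRank on a directed graph given as {node: [in-neighbours]}.
    s(a, a) = 1; otherwise s(a, b) = C / (|I(a)| * |I(b)|) * sum of s(i, j)
    over in-neighbour pairs; nodes without in-neighbours score 0 vs others."""
    nodes = list(graph)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iters):
        new = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b or not graph[a] or not graph[b]:
                    continue
                total = sum(sim[i][j] for i in graph[a] for j in graph[b])
                new[a][b] = C * total / (len(graph[a]) * len(graph[b]))
        sim = new
    return sim

# Two nodes pointed to by the same node come out similar.
g = {"univ": [], "profA": ["univ"], "profB": ["univ"], "student": ["profA"]}
s = simrank(g)
```

This O(n^2) formulation is only for intuition; the paper's efficient-computation techniques exist precisely because the naive iteration does not scale.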

Similarity measure based on partial information of time series
Xiaoming Jin, Yuchang Lu, Chunyi Shi
Pages: 544-549
doi: 10.1145/775047.775127
Full text: PDF

Similarity measure of time series is an important subroutine in many KDD applications. Previous similarity models mainly focus on the prominent series behaviors by considering the whole information of a time series. In this paper, we address the question of which portion of the information is most suitable for similarity measurement of data collected from a given field. We propose a model for the retrieval and representation of partial information in time series data, and a methodology for evaluating similarity measurements based on partial information. The methodology is to retrieve various portions of information from the raw data and represent them in a concise form, then cluster the time series using the partial information and evaluate the similarity measurements by comparing the results with a standard classification. Experiments on a stock market data set yield some interesting observations and justify the usefulness of our approach.

Finding surprising patterns in a time series database in linear time and space
Eamonn Keogh, Stefano Lonardi, Bill 'Yuan-chi' Chiu
Pages: 550-556
doi: 10.1145/775047.775128
Full text: PDF

The problem of finding a specified pattern in a time series database (i.e. query by content) has received much attention and is now a relatively mature field. In contrast, the important problem of enumerating all surprising or interesting patterns has received far less attention. This problem requires a meaningful definition of "surprise", and an efficient search technique. All previous attempts at finding surprising patterns in time series use a very limited notion of surprise, and/or do not scale to massive datasets. To overcome these limitations we introduce a novel technique that defines a pattern as surprising if the frequency of its occurrence differs substantially from that expected by chance, given some previously seen data.
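As a crude stand-in for the paper's notion (which estimates expected frequencies from a model learned on previously seen data), one can score a pattern in a discretized series by how far its observed frequency deviates from its frequency in reference data:

```python
def surprise(pattern, new_seq, ref_seq):
    """Ratio of a pattern's frequency in newly seen data to its frequency in
    previously seen reference data; both sequences are symbol strings from a
    discretized time series. Values far from 1 flag the pattern as
    surprising. Add-one smoothing keeps unseen patterns finite."""
    def freq(seq):
        n = len(seq) - len(pattern) + 1
        hits = sum(1 for i in range(n) if seq[i:i + len(pattern)] == pattern)
        return (hits + 1) / (n + 1)
    return freq(new_seq) / freq(ref_seq)

# "aa" never occurs in the reference alternation, so its appearance in the
# new data scores well above 1.
score = surprise("aa", "aabaab", "ababababab")
```

The paper's method replaces this direct count with a model-based expectation and a linear-time, linear-space search over all patterns, but the "observed vs. expected frequency" comparison is the same idea.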

Clustering seasonality patterns in the presence of errors
Mahesh Kumar, Nitin R. Patel, Jonathan Woo
Pages: 557-563
doi: 10.1145/775047.775129
Full text: PDF

Clustering is a very well studied problem that attempts to group similar data points. Most traditional clustering algorithms assume that the data is provided without measurement error. Often, however, real world data sets have such errors and one can obtain estimates of these errors. We present a clustering method that incorporates information contained in these error estimates. We present a new distance function that is based on the distribution of errors in data. Using a Gaussian model for errors, the distance function follows a Chi-Square distribution and is easy to compute. This distance function is used in hierarchical clustering to discover meaningful clusters. The distance function is scale-invariant so that clustering results are independent of units of measuring data. In the special case when the error distribution is the same for each attribute of data points, the rank order of pair-wise distances is the same for our distance function and the Euclidean distance function. The clustering method is applied to the seasonality estimation problem and experimental results are presented for the retail industry data as well as for simulated data, where it outperforms classical clustering methods.
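The error-weighted distance can be sketched directly from the abstract: with independent Gaussian errors on each coordinate, each squared difference is normalized by the combined error variance, giving a statistic that is chi-square distributed under the hypothesis that the two points coincide. The function below is a sketch of that construction, not necessarily the paper's exact definition:

```python
def error_distance(x, y, sx, sy):
    """Squared distance between points x and y, weighted by per-coordinate
    Gaussian error standard deviations sx and sy. Under the error model the
    statistic is chi-square distributed with len(x) degrees of freedom."""
    return sum((a - b) ** 2 / (ea ** 2 + eb ** 2)
               for a, b, ea, eb in zip(x, y, sx, sy))

# Scale invariance: rescaling a coordinate together with its error
# estimates (e.g. metres -> decimetres) leaves the distance unchanged.
d = error_distance([1.0, 2.0], [2.0, 2.0], [1.0, 1.0], [1.0, 1.0])
d_scaled = error_distance([10.0, 20.0], [20.0, 20.0], [10.0, 10.0], [10.0, 10.0])
```

Plugged into ordinary hierarchical clustering in place of Euclidean distance, this down-weights coordinates whose measurements are noisy, which is the abstract's central point.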

Construct robust rule sets for classification
Jiuyong Li, Rodney Topor, Hong Shen
Pages: 564-569
doi: 10.1145/775047.775130
Full text: PDF

We study the problem of computing classification rule sets from relational databases so that accurate predictions can be made on test data with missing attribute values. Traditional classifiers perform badly when test data are not as complete as the training data, because they are tailored too closely to the training database. We introduce the concept of one rule set being more robust than another, that is, able to make more accurate predictions on test data with missing attribute values. We show that the optimal class association rule set is as robust as the complete class association rule set. We then introduce the k-optimal rule set, which gives exactly the same predictions as the optimal class association rule set on test data with up to k missing attribute values. This leads to a hierarchy of k-optimal rule sets in which decreasing size corresponds to decreasing robustness, and all of them are more robust than a traditional classification rule set. We introduce two methods to find k-optimal rule sets: an optimal association rule mining approach and a heuristic approximate approach. We show experimentally that a k-optimal rule set generated by the optimal association rule mining approach performs better than one generated by the heuristic approximate approach, and both rule sets perform significantly better than a typical classification rule set (C4.5Rules) on incomplete test data.

Instability of decision tree classification algorithms
Ruey-Hsia Li, Geneva G. Belford
Pages: 570-575
doi: 10.1145/775047.775131
Full text: PDF

The instability problem of decision tree classification algorithms is that small changes in the input training samples may cause dramatically large changes in the output classification rules. Different rules generated from almost the same training samples run counter to human intuition and complicate the process of decision making. In this paper, we present fundamental theorems for the instability problem of decision tree classifiers. The first theorem gives the relationship between a data change and the resulting tree structure change (i.e. split change). The second theorem, the Instability Theorem, identifies the cause of the instability problem. Based on the two theorems, algorithmic improvements can be made to lessen the instability problem. Empirical results illustrate the theorem statements. The trees constructed by the proposed algorithm are more stable, noise-tolerant, informative, expressive, and concise. Our proposed sensitivity measure can be used as a metric to evaluate the stability of splitting predicates. The tree sensitivity is an indicator of the confidence level in rules and the effective lifetime of rules.

Distributed data mining in a chain store database of short transactions
Cheng-Ru Lin, Chang-Hung Lee, Ming-Syan Chen, Philip S. Yu
Pages: 576-581
doi: 10.1145/775047.775132
Full text: PDF

In this paper, we broaden the horizon of traditional rule mining by introducing a new framework of causality rule mining in a distributed chain store database. Specifically, the causality rule explored in this paper consists of a sequence of triggering events and a set of consequential events, and is designed with the capability of mining non-sequential, inter-transaction information. Hence, causality rule mining provides a very general framework for rule derivation. Note, however, that the procedure of causality rule mining is very costly, particularly in the presence of a huge number of candidate sets and a distributed database, and in our opinion cannot be dealt with by direct extensions of existing rule mining methods. Consequently, we devise in this paper a series of level matching algorithms, including Level Matching (abbreviated as LM), Level Matching with Selective Scan (abbreviated as LMS), and Distributed Level Matching (abbreviated as Distributed LM), to minimize the computing cost needed for the distributed mining of causality rules. In addition, time window constraints are also taken into consideration in the development of our algorithms. As a result of properly employing the technologies of level matching and selective scan, the proposed algorithms exhibit good efficiency and scalability in the mining of local and global causality rules. Scale-up experiments show that the proposed algorithms scale well with the number of sites and the number of customer transactions.

Index Terms: knowledge discovery, distributed data mining, causality rules, triggering events, consequential events

A robust and efficient clustering algorithm based on cohesion self-merging
Cheng-Ru Lin, Ming-Syan Chen
Pages: 582-587
doi: 10.1145/775047.775133
Full text: PDF

Data clustering has attracted a lot of research attention in the field of computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids, or the distance between their two closest (or farthest) data points. However, all of these measurements are vulnerable to outliers, and removing the outliers precisely is yet another difficult task. In view of this, we propose a new similarity measurement, referred to as cohesion, to measure inter-cluster distances. Using this new measurement of cohesion, we design a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear in the size of the input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase, and then continuously merges the subclusters based on cohesion in a hierarchical manner in the second phase. As shown by our performance studies, cohesion-based clustering is very robust and exhibits excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to be able to cluster data sets of arbitrary shapes very efficiently, and to provide better clustering results than prior methods.

Index Terms: data mining, data clustering, hierarchical clustering, partitional clustering

Discovering informative content blocks from Web documents
Shian-Hua Lin, Jan-Ming Ho
Pages: 588-593
doi: 10.1145/775047.775134
Full text: PDF

In this paper, we propose a new approach to discovering informative content from a set of tabular documents (or Web pages) of a Web site. Our system, InfoDiscoverer, first partitions a page into several content blocks according to the HTML tag <TABLE>. Based on the occurrence of features (terms) in the set of pages, it calculates the entropy value of each feature. From the entropy values of the features in a content block, the entropy value of the block is defined. By analyzing this information measure, we propose a method to dynamically select an entropy threshold that classifies blocks as either informative or redundant. Informative content blocks are the distinguished parts of a page, whereas redundant content blocks are the common parts. Based on the answer set generated from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision rates are greater than 0.956. That is, using this approach, informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, news categories, etc. By adopting InfoDiscoverer as the preprocessor of information retrieval and extraction applications, retrieval and extraction precision will increase, and indexing size and extraction complexity will be reduced.
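The feature-entropy idea can be sketched as follows; the normalization to [0, 1] is an assumption made here for readability, not necessarily the paper's exact definition:

```python
import math

def feature_entropy(counts):
    """Normalized entropy of a term's occurrence counts across the pages of
    a site: 1.0 for a term spread evenly over every page (boilerplate such
    as navigation links), near 0 for a term concentrated in a single page
    (informative article content)."""
    total = sum(counts)
    probs = [c / total for c in counts if c]
    h = -sum(p * math.log(p, 2) for p in probs)
    max_h = math.log(len(counts), 2)
    return h / max_h if max_h else 0.0

# "home" appears once on every page of a 4-page site; "earthquake" only on
# one page. The first is redundant, the second informative.
nav_term = feature_entropy([1, 1, 1, 1])      # -> 1.0
article_term = feature_entropy([4, 0, 0, 0])  # -> 0.0
```

A block's entropy can then be taken as the mean entropy of its features, and the dynamically chosen threshold splits blocks into informative and redundant, as the abstract describes.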

Collusion in the U.S. crop insurance program: applied data mining
Bertis B. Little, Walter L. Johnston, Ashley C. Lovell, Roderick M. Rejesus, Steve A. Steed
Pages: 594-598
doi: 10.1145/775047.775135
Full text: PDF

This paper quantitatively analyzes indicators of Agent (policy seller), Adjuster (indemnity claim adjuster), and Producer (policy purchaser/holder) indemnity behavior suggestive of collusion in the United States Department of Agriculture (USDA) Risk Management Agency (RMA) national crop insurance program. Following guidance from federal law and using six indicator variables of indemnity behavior, entities equal to or exceeding 150% of the county mean (computed using a simple jackknife procedure) on all entity-relevant indicators were flagged as "anomalous." Log-linear analysis was used to test (1) hierarchical node-node arrangements and (2) a non-recursive model of node information sharing. A chi-square-distributed deviance statistic identified the optimal log-linear model. The results of the applied data mining technique used here suggest that non-recursive triplet and Agent-Producer doublet collusion probabilistically accounts for the greatest proportion of waste, fraud, and abuse in the federal crop insurance program. Triplets and Agent-Producer doublets need detailed investigation for possible collusion. Hence, this data mining technique provided a high level of confidence when 24 million records were quantitatively analyzed for possible fraud, waste, or other abuse of the crop insurance program administered by the USDA RMA, and suspect entities were reported to USDA. The technique can be applied wherever vast amounts of data are available to detect patterns of collusion or conspiracy, as may be of interest to criminal justice or intelligence agencies.

Incremental context mining for adaptive document classification
Rey-Long Liu, Yun-Ling Lu
Pages: 599-604
doi: 10.1145/775047.775136
Full text: PDF

Automatic document classification (DC) is essential for the management of information and knowledge. This paper explores two practical issues in DC: (1) each document has its context of discussion, and (2) both the content and vocabulary of the document database are intrinsically evolving. These issues call for adaptive document classification (ADC), which adapts a DC system to the evolving contextual requirement of each document category, so that input documents may be classified based on their contexts of discussion. We present an incremental context mining technique to tackle the challenges of ADC. Theoretical analyses and empirical results show that, given a text hierarchy, the mining technique is efficient in incrementally maintaining the evolving contextual requirement of each category. Based on the contextual requirements mined by the system, higher-precision DC may be achieved with better efficiency.

Evaluating classifiers' performance in a constrained environment
Anna Olecka
Pages: 605-612
doi: 10.1145/775047.775137
Full text: PDF

In this paper, we focus on the methodology for finding a classifier with minimal cost in the presence of additional performance constraints. ROCCH analysis, where accuracy and cost are intertwined in the solution space, was a revolutionary tool for two-class problems. We propose an alternative formulation, as an optimization problem of the kind commonly used in Operations Research. This approach extends ROCCH analysis to allow for locating optimal solutions while outside constraints are present. Similarly to ROCCH analysis, we combine cost and class distribution in defining the objective function. Rather than focusing on the slopes of the edges in the convex hull of the solution space, however, we treat cost as an objective function to be minimized over the solution space by selecting the best-performing classifier(s) (one or more vertices of the solution space). The Linear Programming framework provides a theoretical and computational methodology for finding the vertex (classifier) that minimizes the objective function.
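For a small finite set of candidate classifiers, the constrained cost minimization can be sketched by brute force over the vertices; the paper's LP machinery matters when the solution space is a convex hull rather than a short list. The cost model and numbers below are illustrative assumptions:

```python
def best_classifier(classifiers, costs, p_pos, min_recall):
    """Pick the vertex (classifier) minimizing expected misclassification
    cost subject to a recall constraint -- a brute-force stand-in for the
    paper's linear-programming formulation. Each classifier is a dict with
    a true-positive rate `tpr` and a false-positive rate `fpr`."""
    c_fn, c_fp = costs  # cost of a missed positive, cost of a false alarm
    feasible = [c for c in classifiers if c["tpr"] >= min_recall]
    def expected_cost(c):
        return (c_fn * (1 - c["tpr"]) * p_pos        # missed positives
                + c_fp * c["fpr"] * (1 - p_pos))     # false alarms
    return min(feasible, key=expected_cost)

# Hypothetical candidates; missing a positive costs 10x a false alarm,
# positives are 20% of the population, and recall must be at least 0.8.
cands = [{"name": "a", "tpr": 0.90, "fpr": 0.30},
         {"name": "b", "tpr": 0.70, "fpr": 0.05},   # infeasible: recall < 0.8
         {"name": "c", "tpr": 0.95, "fpr": 0.50}]
best = best_classifier(cands, costs=(10.0, 1.0), p_pos=0.2, min_recall=0.8)
```

As in ROCCH analysis, changing the cost ratio or class distribution can change which vertex wins without retraining any classifier.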

Discovering word senses from text
Patrick Pantel, Dekang Lin
Pages: 613-619
doi: 10.1145/775047.775138
Full text: PDF

Inventories of manually compiled dictionaries usually serve as a source for word senses. However, they often include many rare senses while missing corpus/domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers word senses from text. It initially discovers a set of tight clusters called committees that are well scattered in the similarity space. The centroid of the members of a committee is used as the feature vector of the cluster. We proceed by assigning words to their most similar clusters. After assigning an element to a cluster, we remove their overlapping features from the element. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. Each cluster that a word belongs to represents one of its senses. We also present an evaluation methodology for automatically measuring the precision and recall of discovered senses.

Combining clustering and co-training to enhance text classification using unlabelled data
Bhavani Raskutti, Herman Ferrá, Adam Kowalczyk
Pages: 620-625
doi: 10.1145/775047.775139
Full text: PDF

In this paper, we present a new co-training strategy that makes use of unlabelled data. It trains two predictors in parallel, with each predictor labelling the unlabelled data for training the other predictor in the next round. Both predictors are support vector machines, one trained using data from the original feature space, the other trained with new features derived by clustering both the labelled and unlabelled data. Hence, unlike standard co-training methods, our method does not require a priori the existence of two redundant views, either of which can be used for classification, nor is it dependent on the availability of two different supervised learning algorithms that complement each other. We evaluated our method with two classifiers and three text benchmarks: WebKB, Reuters newswire articles and 20 NewsGroups. Our evaluation shows that our co-training technique improves text classification accuracy, especially when the number of labelled examples is very small.
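The training loop can be sketched as follows. The paper uses SVMs and the second view comes from clustering; here a toy nearest-centroid classifier and unranked batches stand in, so this shows only the shape of the two-view loop, not the paper's setup:

```python
def centroid_fit(pairs):
    """Toy stand-in classifier: predict the class of the nearest centroid.
    `pairs` is a list of (feature_vector, label)."""
    sums, counts = {}, {}
    for x, y in pairs:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0.0] * len(x)), x)]
    cents = {y: [s / counts[y] for s in sums[y]] for y in sums}
    def predict(x):
        return min(cents, key=lambda y: sum((a - b) ** 2
                                            for a, b in zip(x, cents[y])))
    return predict

def co_train(labelled, unlabelled, rounds=3, batch=2):
    """Two-view co-training skeleton: each round, the view-A model labels a
    batch of unlabelled points to extend the view-B training set, and vice
    versa. Labelled items are ((view_a, view_b), label); unlabelled items
    are (view_a, view_b) pairs."""
    train_a = [(xa, y) for (xa, xb), y in labelled]
    train_b = [(xb, y) for (xa, xb), y in labelled]
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        pred_a, pred_b = centroid_fit(train_a), centroid_fit(train_b)
        take, pool = pool[:batch], pool[batch:]
        for xa, xb in take:
            train_b.append((xb, pred_a(xa)))   # the view-A model teaches B
            train_a.append((xa, pred_b(xb)))   # the view-B model teaches A
    return centroid_fit(train_a), centroid_fit(train_b)
```

With two tiny one-dimensional views, two labelled seeds are enough for the mutually-labelled pool to pull both predictors toward the right decision boundary.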

Single-shot detection of multiple categories of text using parametric mixture models
Naonori Ueda, Kazumi Saito
Pages: 626-631
doi: 10.1145/775047.775140
Full text: PDF

In this paper, we address the problem of detecting multiple topics or categories of text where each text is not assumed to belong to one of a number of mutually exclusive categories. Conventionally, the binary classification approach has been employed, in which whether or not text belongs to a category is judged by the binary classifier for every category. In this paper, we propose a more sophisticated approach to simultaneously detect multiple categories of text using parametric mixture models (PMMs), newly presented in this paper. PMMs are probabilistic generative models for text that has multiple categories. Our PMMs are essentially different from the conventional mixture of multinomial distributions in the sense that in the former several basis multinomial parameters are mixed in the parameter space, while in the latter several multinomial components are mixed. We derive efficient learning algorithms for PMMs within the framework of the maximum a posteriori estimate. We also empirically show that our method can outperform the conventional binary approach when applied to multitopic detection of World Wide Web pages, focusing on those from the "yahoo.com" domain.

What's the code?: automatic classification of source code archives
Secil Ugurel, Robert Krovetz, C. Lee Giles
Pages: 632-638
doi: 10.1145/775047.775141
Full text: PDF

There are various source code archives on the World Wide Web. These archives are usually organized by application categories and programming languages. However, manually organizing source code repositories is not a trivial task, since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatically classifying archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and assigned to a specific programming language.

Privacy preserving association rule mining in vertically partitioned data
Jaideep Vaidya, Chris Clifton
Pages: 639-644
doi: 10.1145/775047.775142

Privacy considerations often constrain data mining projects. This paper addresses the problem of association rule mining where transactions are distributed across sources. Each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. However, the sites must not reveal individual transaction data. We present a two-party algorithm for efficiently discovering frequent itemsets that meet a minimum support level, without either site revealing individual transaction values.
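The quantity the two sites must compute jointly can be seen with a small, deliberately insecure sketch: each site reduces its share of an itemset to an indicator vector over transactions, and the global support count is the dot product of the two vectors. The protocol in the paper evaluates this dot product without either site revealing its vector; the names and data below are illustrative only.

```python
# With vertically partitioned boolean data, the support count of an itemset
# split across two sites is a dot product of per-site indicator vectors.
# This sketch computes it IN THE CLEAR; the paper's contribution is a
# protocol that computes the same number without exchanging the vectors.

def site_indicator(rows, itemset):
    """1 if the transaction contains every attribute this site holds for
    the itemset, else 0. rows: list of {attribute: 0/1} dicts."""
    return [int(all(row[a] for a in itemset)) for row in rows]

def support_count(vec_a, vec_b):
    """Dot product of the two sites' indicator vectors = joint support."""
    return sum(x * y for x, y in zip(vec_a, vec_b))
```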

Non-linear dimensionality reduction techniques for classification and visualization
Michail Vlachos, Carlotta Domeniconi, Dimitrios Gunopulos, George Kollios, Nick Koudas
Pages: 645-651
doi: 10.1145/775047.775143

In this paper we address the issue of using local embeddings for data visualization in two and three dimensions, and for classification. We advocate their use on the basis that they provide an efficient mapping from the original dimension of the data to a lower intrinsic dimension. We show how they can accurately capture the user's perception of similarity in high-dimensional data for visualization purposes. Moreover, we exploit the low-dimensional mapping provided by these embeddings to develop new classification techniques, and we show experimentally that the classification accuracy is comparable (albeit using fewer dimensions) to that of a number of other classification procedures.

Item selection by "hub-authority" profit ranking
Ke Wang, Ming-Yen Thomas Su
Pages: 652-657
doi: 10.1145/775047.775144

A fundamental problem in business and other applications is ranking items with respect to some notion of profit based on historical transactions. The difficulty is that the profit of one item comes not only from its own sales, but also from its influence on the sales of other items, i.e., the "cross-selling effect". In this paper, we draw an analogy between this influence and the mutual reinforcement of hub/authority web pages. Based on this analogy, we present a novel approach to the item ranking problem. We apply this ranking approach to solve two selection problems. In size-constrained selection, the maximum number of items that can be selected is fixed. In cost-constrained selection, there is no maximum number of items to be selected, but there is some cost associated with the selection of each item. In both cases, the question is which items should be selected to maximize the profit. Empirically, we show that this method finds profitable items in the presence of the cross-selling effect.
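One way to picture the hub/authority analogy is a mutual-reinforcement iteration over item co-occurrence links, sketched below. This is a hypothetical illustration, not the paper's profit-weighted algorithm: here the "links" are raw co-occurrence counts and profit is ignored.

```python
# HITS-flavoured illustration of cross-selling reinforcement: an item's rank
# grows with the ranks of items it co-occurs with in transactions, iterated
# to a fixed point. (Simplified: co-occurrence counts replace the paper's
# profit-weighted links.)

def cross_sell_rank(transactions, items, iters=20):
    # co-occurrence weights play the role of hyperlinks
    w = {(i, j): 0 for i in items for j in items if i != j}
    for t in transactions:
        for i in t:
            for j in t:
                if i != j:
                    w[(i, j)] += 1
    rank = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {i: sum(w[(j, i)] * rank[j] for j in items if j != i)
               for i in items}
        norm = sum(new.values()) or 1.0  # L1-normalise each sweep
        rank = {i: v / norm for i, v in new.items()}
    return rank
```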

Discovery net: towards a grid of knowledge discovery
V. Ćurčin, M. Ghanem, Y. Guo, M. Köhler, A. Rowe, J. Syed, P. Wendel
Pages: 658-663
doi: 10.1145/775047.775145

This paper provides a blueprint for constructing collaborative and distributed knowledge discovery systems within Grid-based computing environments. The need for such systems is driven by the quest for sharing knowledge, information and computing resources within the boundaries of single large distributed organisations or within complex Virtual Organisations (VOs) created to tackle specific projects. The proposed architecture is built on top of a resource federation management layer and is composed of a set of different resources. We show how this architecture behaves during a typical KDD process design and deployment, how it enables the execution of complex and distributed data mining tasks with high performance, and how it provides a community of e-scientists with the means to collaborate on, retrieve, and reuse KDD algorithms, discovery processes, and knowledge in a visual analytical environment.

Making every bit count: fast nonlinear axis scaling
Leejay Wu, Christos Faloutsos
Pages: 664-669
doi: 10.1145/775047.775146

Existing axis scaling and dimensionality reduction methods focus on preserving structure, usually determined via the Euclidean distance. In other words, they inherently assume that the Euclidean distance is already correct. We instead propose a novel nonlinear approach driven by an information-theoretic viewpoint, which we show is also strongly linked to intrinsic dimensionality (degrees of freedom) and to uniformity. Nonlinear transformations based on common probability distributions, combined with information-driven selection, simultaneously reduce the number of dimensions required and increase the value of those we retain. Experiments on real data confirm that this approach reveals correlations, finds novel attributes, and scales well.

B-EM: a classifier incorporating bootstrap with EM approach for data mining
Xintao Wu, Jianping Fan, Kalpathi R. Subramanian
Pages: 670-675
doi: 10.1145/775047.775147

This paper investigates the problem of augmenting labeled data with unlabeled data to improve classification accuracy. This is significant for many applications, such as image classification, where obtaining classification labels is expensive while large numbers of unlabeled examples are easily available. We investigate an Expectation Maximization (EM) algorithm for learning from labeled and unlabeled data. Unlabeled data boosts learning accuracy because it provides information about the joint probability distribution. A theoretical argument shows that the more unlabeled examples are combined in learning, the more accurate the result. We then introduce the B-EM algorithm, which combines EM with the bootstrap method, to exploit large volumes of unlabeled data while avoiding prohibitive I/O cost. Experimental results over both synthetic and real data sets show that the proposed approach performs satisfactorily.
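The underlying labeled-plus-unlabeled EM idea can be sketched for a toy one-dimensional, two-class Gaussian model. The fixed, equal variances are a simplification of ours, and the paper's B-EM additionally wraps such a learner in bootstrap resampling to bound I/O; this sketch shows only the EM core.

```python
# Semi-supervised EM sketch: labeled points keep their hard labels, while
# unlabeled points receive soft labels (responsibilities) in the E-step;
# the M-step re-estimates the class means from both. Variance is held
# fixed for brevity.
import math

def em_semi(labeled, unlabeled, iters=10, var=1.0):
    """labeled: list of (x, class in {0,1}); unlabeled: list of x values.
    Returns the two estimated class means."""
    mu = [sum(x for x, c in labeled if c == k) /
          max(1, sum(1 for _, c in labeled if c == k)) for k in (0, 1)]
    for _ in range(iters):
        # E-step: responsibilities for the unlabeled data only
        resp = []
        for x in unlabeled:
            p = [math.exp(-(x - m) ** 2 / (2 * var)) for m in mu]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: labeled points enter with weight 1 on their true class
        for k in (0, 1):
            num = sum(x for x, c in labeled if c == k) + \
                  sum(r[k] * x for r, x in zip(resp, unlabeled))
            den = sum(1 for _, c in labeled if c == k) + \
                  sum(r[k] for r in resp)
            mu[k] = num / den
    return mu
```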

A unifying framework for detecting outliers and change points from non-stationary time series data
Kenji Yamanishi, Jun-ichi Takeuchi
Pages: 676-681
doi: 10.1145/775047.775148

We are concerned with the issues of outlier detection and change point detection from a data stream. In the area of data mining, there has been increasing interest in these issues, since the former is related to fraud detection, rare event discovery, etc., while the latter is related to event/trend change detection, activity monitoring, etc. Specifically, it is important to consider the situation where the data source is non-stationary, since its nature may change over time in real applications. Although in most previous work outlier detection and change point detection have not been related explicitly, this paper presents a unifying framework for dealing with both of them on the basis of the theory of on-line learning of non-stationary time series. In this framework a probabilistic model of the data source is incrementally learned using an on-line discounting learning algorithm, which can track the changing data source adaptively by gradually forgetting the effect of past data. The score for any given datum is then calculated to measure its deviation from the learned model, with a higher score indicating a higher possibility of being an outlier. Further, change points in a data stream are detected by applying this scoring method to a time series of moving-averaged prediction losses under the learned model. Specifically, we develop efficient algorithms for on-line discounting learning of auto-regression models from time series data, and demonstrate the validity of our framework through simulations and experimental applications to stock market data analysis.
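A stripped-down illustration of the discounting idea, using an order-zero Gaussian model instead of the paper's auto-regression models: a forgetting factor lets the running estimates track a drifting source, and each point is scored by its deviation under the current model before the update.

```python
# On-line discounting sketch: running mean and variance with forgetting
# factor r; the score of each incoming point is its negative log-likelihood
# under the model learned so far. A sudden jump in the stream produces a
# large score (an outlier / change candidate).
import math

def discounting_scores(xs, r=0.1):
    mu, var, scores = 0.0, 1.0, []
    for x in xs:
        # score BEFORE updating: deviation from the current model
        scores.append((x - mu) ** 2 / (2 * var)
                      + 0.5 * math.log(2 * math.pi * var))
        mu = (1 - r) * mu + r * x                 # discounted mean
        var = (1 - r) * var + r * (x - mu) ** 2   # discounted variance
    return scores
```

The paper applies the same discounting principle to full AR models (its SDAR-style algorithm) and detects change points from moving averages of these scores.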

CLOPE: a fast and effective clustering algorithm for transactional data
Yiling Yang, Xudong Guan, Jinyuan You
Pages: 682-687
doi: 10.1145/775047.775149

This paper studies the problem of categorical data clustering, especially for transactional data characterized by high dimensionality and large volume. Starting from a heuristic method of increasing the height-to-width ratio of the cluster histogram, we develop a novel algorithm -- CLOPE -- which is very fast and scalable, while being quite effective. We demonstrate the performance of our algorithm on two real-world datasets, and compare CLOPE with state-of-the-art algorithms.
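A simplified one-pass sketch of the histogram heuristic: each cluster keeps an item histogram with size S (total item occurrences) and width W (distinct items), and a transaction goes to the cluster whose gain in S/W^r * n is largest, preferring tall, narrow histograms. The exact profit delta here is our reading of the criterion; the published algorithm also iterates over the data until assignments stabilise.

```python
# One-pass sketch of CLOPE-style placement. profit contribution of a
# cluster C: S(C) / W(C)**r * n(C), where S = total item occurrences,
# W = distinct items, n = number of transactions; r controls the
# preference for tall, narrow histograms.
from collections import Counter

def clope_pass(transactions, r=2.0):
    clusters = []  # each: {"hist": Counter, "n": int}

    def gain(cl, t):
        hist = cl["hist"] + Counter(t)
        new_profit = sum(hist.values()) / len(hist) ** r * (cl["n"] + 1)
        if cl["n"] == 0:          # candidate new cluster: its whole profit
            return new_profit
        old_profit = (sum(cl["hist"].values())
                      / len(cl["hist"]) ** r * cl["n"])
        return new_profit - old_profit

    assign = []
    for t in transactions:
        # existing clusters plus one empty candidate cluster
        candidates = clusters + [{"hist": Counter(), "n": 0}]
        best = max(range(len(candidates)),
                   key=lambda i: gain(candidates[i], t))
        if best == len(clusters):
            clusters.append({"hist": Counter(), "n": 0})
        clusters[best]["hist"].update(t)
        clusters[best]["n"] += 1
        assign.append(best)
    return assign
```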

Topic-conditioned novelty detection
Yiming Yang, Jian Zhang, Jaime Carbonell, Chun Jin
Pages: 688-693
doi: 10.1145/775047.775150

Automated detection of the first document reporting each new event in temporally-sequenced streams of documents is an open challenge. In this paper we propose a new approach which addresses this problem in two stages: 1) using a supervised learning algorithm to classify the on-line document stream into pre-defined broad topic categories, and 2) performing topic-conditioned novelty detection for the documents in each topic. We also focus on exploiting named entities for event-level novelty detection and on using feature-based heuristics derived from the topic histories. Evaluating these methods on a set of broadcast news stories, our results show substantial performance gains over the traditional one-level approach to the novelty detection problem.
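The two-stage structure can be sketched as follows, with a plain cosine-similarity threshold standing in for the paper's named-entity and topic-history features, and with topic labels assumed given rather than produced by the supervised classifier:

```python
# Topic-conditioned novelty sketch: each document is compared only against
# earlier documents of the SAME topic; low maximum similarity means the
# document is flagged as reporting a new event.
import math
from collections import Counter, defaultdict

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def novelty_stream(docs_with_topics, threshold=0.5):
    """docs_with_topics: iterable of (topic, token list), in time order.
    Returns a True/False novelty flag per document."""
    seen = defaultdict(list)  # topic -> term-count vectors seen so far
    flags = []
    for topic, tokens in docs_with_topics:
        vec = Counter(tokens)
        sim = max((cosine(vec, old) for old in seen[topic]), default=0.0)
        flags.append(sim < threshold)
        seen[topic].append(vec)
    return flags
```

Conditioning on the topic is the point: a story similar to earlier documents in a *different* topic still counts as novel within its own.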

Transforming classifier scores into accurate multiclass probability estimates
Bianca Zadrozny, Charles Elkan
Pages: 694-699
doi: 10.1145/775047.775151

Class membership probability estimates are important for many applications of data mining in which classification outputs are combined with other sources of information for decision-making, such as example-dependent misclassification costs, the outputs of other classifiers, or domain knowledge. Previous calibration methods apply only to two-class problems. Here, we show how to obtain accurate probability estimates for multiclass problems by combining calibrated binary probability estimates. We also propose a new method for obtaining calibrated two-class probability estimates that can be applied to any classifier that produces a ranking of examples. Using naive Bayes and support vector machine classifiers, we give experimental results from a variety of two-class and multiclass domains, including direct marketing, text categorization and digit recognition.
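The simplest instance of the combination step is normalizing calibrated one-vs-rest estimates, sketched below. This is an illustration only; the paper evaluates this and more refined coupling schemes, and obtains the binary estimates themselves via a ranking-based calibration method.

```python
# Combine calibrated one-vs-rest binary probabilities into a multiclass
# distribution by normalization. Each input is an already-calibrated
# estimate of P(class k vs rest) for one example.

def combine_one_vs_rest(binary_probs):
    total = sum(binary_probs)
    if total == 0:  # degenerate case: fall back to the uniform distribution
        return [1.0 / len(binary_probs)] * len(binary_probs)
    return [p / total for p in binary_probs]
```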