Apprenticeship learning via inverse reinforcement learning
Pieter Abbeel, Andrew Y. Ng
Page: 1 | doi: 10.1145/1015330.1015430
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where here performance is measured with respect to the expert's unknown reward function.

Learning to track 3D human motion from silhouettes
Ankur Agarwal, Bill Triggs
Page: 2 | doi: 10.1145/1015330.1015343
We describe a sparse Bayesian regression method for recovering 3D human body motion directly from silhouettes extracted from monocular video sequences. No detailed body shape model is needed, and realism is ensured by training on real human motion capture data. The tracker estimates 3D body pose by using Relevance Vector Machine regression to combine a learned autoregressive dynamical model with robust shape descriptors extracted automatically from image silhouettes. We studied several different combination methods, the most effective being to learn a nonlinear observation-update correction based on joint regression with respect to the predicted state and the observations. We demonstrate the method on a 54-parameter full body pose model, both quantitatively using motion capture based test sequences, and qualitatively on a test video sequence.

A multiplicative up-propagation algorithm
Jong-Hoon Ahn, Seungjin Choi, Jong-Hoon Oh
Page: 3 | doi: 10.1145/1015330.1015379
We present a generalization of the nonnegative matrix factorization (NMF), where a multilayer generative network with nonnegative weights is used to approximate the observed nonnegative data. The multilayer generative network with nonnegativity constraints is learned by a multiplicative up-propagation algorithm, where the weights in each layer are updated in a multiplicative fashion while the mismatch ratio is propagated from the bottom to the top layer. The monotonic convergence of the multiplicative up-propagation algorithm is shown. In contrast to NMF, the multiplicative up-propagation is an algorithm that can learn hierarchical representations, where complex higher-level representations are defined in terms of less complex lower-level representations. The interesting behavior of our algorithm is demonstrated with face image data.
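
The base case that up-propagation generalizes is single-layer NMF with multiplicative updates. The NumPy sketch below shows only that standard single-layer update (under a squared-error mismatch), not the paper's multi-layer algorithm; the matrix sizes and iteration count are arbitrary illustration choices.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Standard single-layer NMF via multiplicative updates for ||V - WH||_F^2.
    The multiplicative up-propagation algorithm extends this style of update
    to a stack of nonnegative layers."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled by a ratio of nonnegative terms,
        # so nonnegativity is preserved automatically.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

if __name__ == "__main__":
    V = np.abs(np.random.default_rng(1).normal(size=(50, 30)))
    W, H = nmf_multiplicative(V, rank=5)
    print("reconstruction error:", np.linalg.norm(V - W @ H))
```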

Gaussian process classification for segmenting and annotating sequences
Yasemin Altun, Thomas Hofmann, Alexander J. Smola
Page: 4 | doi: 10.1145/1015330.1015433
Many real-world classification tasks involve the prediction of multiple, inter-dependent class labels. A prototypical case of this sort deals with prediction of a sequence of labels for a sequence of observations. Such problems arise naturally in the context of annotating and segmenting observation sequences. This paper generalizes Gaussian Process classification to predict multiple labels by taking dependencies between neighboring labels into account. Our approach is motivated by the desire to retain rigorous probabilistic semantics, while overcoming limitations of parametric methods like Conditional Random Fields, which exhibit conceptual and computational difficulties in high-dimensional input spaces. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.

Redundant feature elimination for multi-class problems
Annalisa Appice, Michelangelo Ceci, Simon Rawles, Peter Flach
Page: 5 | doi: 10.1145/1015330.1015397
We consider the problem of eliminating redundant Boolean features for a given data set, where a feature is redundant if it separates the classes less well than another feature or set of features. Lavrač et al. proposed the algorithm REDUCE that works by pairwise comparison of features, i.e., it eliminates a feature if it is redundant with respect to another feature. Their algorithm operates in an ILP setting and is restricted to two-class problems. In this paper we improve their method and extend it to multiple classes. Central to our approach is the notion of a neighbourhood of examples: a set of examples of the same class where the number of different features between examples is relatively small. Redundant features are eliminated by applying a revised version of the REDUCE method to each pair of neighbourhoods of different class. We analyse the performance of our method on a range of data sets.

Multiple kernel learning, conic duality, and the SMO algorithm
Francis R. Bach, Gert R. G. Lanckriet, Michael I. Jordan
Page: 6 | doi: 10.1145/1015330.1015424
While classical kernel-based classifiers are based on a single kernel, in practice it is often desirable to base classifiers on combinations of multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for the support vector machine (SVM), and showed that the optimization of the coefficients of such a combination reduces to a convex optimization problem known as a quadratically-constrained quadratic program (QCQP). Unfortunately, current convex optimization toolboxes can solve this problem only for a small number of kernels and a small number of data points; moreover, the sequential minimal optimization (SMO) techniques that are essential in large-scale implementations of the SVM cannot be applied because the cost function is non-differentiable. We propose a novel dual formulation of the QCQP as a second-order cone programming problem, and show how to exploit the technique of Moreau-Yosida regularization to yield a formulation to which SMO techniques can be applied. We present experimental results that show that our SMO-based algorithm is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.
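
As a concrete picture of the object being optimized, the sketch below forms a conic (nonnegative) combination of three Gram matrices and feeds it to an SVM with a precomputed kernel. The toy data and the combination weights `eta` are made-up assumptions; learning those weights jointly with the SVM, via the paper's SMO-style algorithm, is the actual contribution and is not shown.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Toy data; the weights `eta` are fixed by hand purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.3).astype(int)

grams = [linear_kernel(X), polynomial_kernel(X, degree=3), rbf_kernel(X, gamma=0.5)]
eta = np.array([0.2, 0.3, 0.5])                      # nonnegative coefficients
K = sum(e * G for e, G in zip(eta, grams))           # conic combination of Gram matrices

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training accuracy with the combined kernel:", clf.score(K, y))
```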

Feature subset selection for learning preferences: a case study
Antonio Bahamonde, Gustavo F. Bayón, Jorge Díez, José Ramón Quevedo, Oscar Luaces, Juan José del Coz, Jaime Alonso, Félix Goyache
Page: 7 | doi: 10.1145/1015330.1015378
In this paper we tackle a real world problem, the search for a function to evaluate the merits of beef cattle as meat producers. The independent variables represent a set of live animals' measurements, while the outputs cannot be captured with a single number, since the available experts tend to assess each animal in a relative way, comparing animals with the other partners in the same batch. Therefore, this problem cannot be solved by means of regression methods; our approach is to learn the preferences of the experts when they order small groups of animals. Thus, the problem can be reduced to a binary classification, and can be dealt with using a Support Vector Machine (SVM) improved with the use of a feature subset selection (FSS) method. We develop a method based on Recursive Feature Elimination (RFE) that employs an adaptation of a metric based method devised for model selection (ADJ). Finally, we discuss the extension of the resulting method to more general settings, and provide a comparison with other possible alternatives.
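
A minimal sketch of the kind of reduction the abstract describes, using the usual pairwise-difference trick for preference learning: each expert judgment "animal i is preferred to animal j" becomes a binary example on the feature difference. The data, the pair list, and the plain linear SVM are illustrative assumptions; the paper's RFE/ADJ feature-selection stage is omitted.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical data: X holds live-animal measurements; `preferences` lists
# pairs (i, j) meaning the expert ranked animal i above animal j in a batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
preferences = [(i, j) for i in range(20) for j in range(20, 40)][:100]

# Pairwise reduction: each judgment "i preferred to j" becomes the example
# x_i - x_j with label +1, and x_j - x_i with label -1.
diffs, labels = [], []
for i, j in preferences:
    diffs.append(X[i] - X[j]); labels.append(1)
    diffs.append(X[j] - X[i]); labels.append(-1)

clf = LinearSVC(C=1.0, max_iter=10000).fit(np.array(diffs), np.array(labels))
w = clf.coef_.ravel()        # learned utility function u(x) = w . x
scores = X @ w               # higher score = preferred animal
print("top-ranked animals:", np.argsort(-scores)[:5])
```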

An information theoretic analysis of maximum likelihood mixture estimation for exponential families
Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu
Page: 8 | doi: 10.1145/1015330.1015431
An important task in unsupervised learning is maximum likelihood mixture estimation (MLME) for exponential families. In this paper, we prove a mathematical equivalence between this MLME problem and the rate distortion problem for Bregman divergences. We also present new theoretical results in rate distortion theory for Bregman divergences. Further, an analysis of the problems as a trade-off between compression and preservation of information is presented that yields the information bottleneck method as an interesting special case.

Unifying collaborative and content-based filtering
Justin Basilico, Thomas Hofmann
Page: 9 | doi: 10.1145/1015330.1015394
Collaborative and content-based filtering are two paradigms that have been applied in the context of recommender systems and user preference prediction. This paper proposes a novel, unified approach that systematically integrates all available training information such as past user-item ratings as well as attributes of items or users to learn a prediction function. The key ingredient of our method is the design of a suitable kernel or similarity function between user-item pairs that allows simultaneous generalization across the user and item dimensions. We propose an on-line algorithm (JRank) that generalizes perceptron learning. Experimental results on the EachMovie data set show significant improvements over standard approaches.

C4.5 competence map: a phase transition-inspired approach
Nicolas Baskiotis, Michèle Sebag
Page: 10 | doi: 10.1145/1015330.1015398
How to determine a priori whether a learning algorithm is suited to a learning problem instance is a major scientific and technological challenge. A first step toward this goal, inspired by the Phase Transition (PT) paradigm developed in the Constraint Satisfaction domain, is presented in this paper. Based on the PT paradigm, extensive and principled experiments allow for constructing the Competence Map associated to a learning algorithm, describing the regions where this algorithm on average fails or succeeds. The approach is illustrated on the long and widely used C4.5 algorithm. A non-trivial failure region in the landscape of k-term DNF languages is observed and some interpretations are offered for the experimental results.

Integrating constraints and metric learning in semi-supervised clustering
Mikhail Bilenko, Sugato Basu, Raymond J. Mooney
Page: 11 | doi: 10.1145/1015330.1015360
Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.

Variational methods for the Dirichlet process
David M. Blei, Michael I. Jordan
Page: 12 | doi: 10.1145/1015330.1015439
Variational inference methods, including mean field methods and loopy belief propagation, have been widely used for approximate probabilistic inference in graphical models. While often less accurate than MCMC, variational methods provide a fast deterministic approximation to marginal and conditional probabilities. Such approximations can be particularly useful in high dimensional problems where sampling methods are too slow to be effective. A limitation of current methods, however, is that they are restricted to parametric probabilistic models. MCMC does not have such a limitation; indeed, MCMC samplers have been developed for the Dirichlet process (DP), a nonparametric distribution on distributions (Ferguson, 1973) that is the cornerstone of Bayesian nonparametric statistics (Escobar & West, 1995; Neal, 2000). In this paper, we develop a mean-field variational approach to approximate inference for the Dirichlet process, where the approximate posterior is based on the truncated stick-breaking construction (Ishwaran & James, 2001). We compare our approach to DP samplers for Gaussian DP mixture models.
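
The truncated stick-breaking construction mentioned in the abstract can be sketched in a few lines of NumPy: draw Beta sticks, let the last stick take all remaining mass, and multiply out the weights. The concentration parameter, truncation level, and Gaussian base measure below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def truncated_stick_breaking(alpha, T, rng):
    """Mixture weights from a DP via stick-breaking, truncated at T components
    (the form of the variational posterior used in the paper)."""
    v = rng.beta(1.0, alpha, size=T)
    v[-1] = 1.0                      # truncation: last stick takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining             # weights sum to 1

rng = np.random.default_rng(0)
weights = truncated_stick_breaking(alpha=2.0, T=20, rng=rng)
atoms = rng.normal(size=20)          # atoms drawn from a N(0, 1) base measure
print("weights sum to", weights.sum())

# A draw from the (truncated) DP mixture: pick a component, then read its atom.
z = rng.choice(20, p=weights)
print("sampled atom:", atoms[z])
```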

Semi-supervised learning using randomized mincuts
Avrim Blum, John Lafferty, Mugizi Robert Rwebangira, Rajashekar Reddy
Page: 13 | doi: 10.1145/1015330.1015429
In many application domains there is a large amount of unlabeled data but only a very limited amount of labeled training data. One general approach that has been explored for utilizing this unlabeled data is to construct a graph on all the data points based on distance relationships among examples, and then to use the known labels to perform some type of graph partitioning. One natural partitioning to use is the minimum cut that agrees with the labeled data (Blum & Chawla, 2001), which can be thought of as giving the most probable label assignment if one views labels as generated according to a Markov Random Field on the graph. Zhu et al. (2003) propose a cut based on a relaxation of this field, and Joachims (2003) gives an algorithm based on finding an approximate min-ratio cut. In this paper, we extend the mincut approach by adding randomness to the graph structure. The resulting algorithm addresses several shortcomings of the basic mincut approach, and can be given theoretical justification from both a Markov random field perspective and from sample complexity considerations. In cases where the graph does not have small cuts for a given classification problem, randomization may not help. However, our experiments on several datasets show that when the structure of the graph supports small cuts, this can result in highly accurate classifiers with good accuracy/coverage tradeoffs. In addition, we are able to achieve good performance with a very simple graph-construction procedure.
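
A rough sketch of one randomized mincut step in the spirit of the abstract, assuming a k-nearest-neighbour graph over toy two-moons data, randomly perturbed edge capacities, and networkx's minimum cut between a source tied to the positively labeled points and a sink tied to the negatively labeled ones. The paper repeats such perturbed cuts many times and aggregates them; only a single draw is shown here.

```python
import numpy as np
import networkx as nx
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=100, noise=0.08, random_state=0)
labeled = np.r_[np.where(y == 0)[0][:2], np.where(y == 1)[0][:2]]  # few known labels

# kNN similarity graph with randomly perturbed capacities (a single draw;
# the paper averages over many such perturbed graphs).
A = kneighbors_graph(X, n_neighbors=8).toarray()
A = np.maximum(A, A.T)
G = nx.Graph()
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        if A[i, j] > 0:
            G.add_edge(i, j, capacity=1.0 + rng.uniform(0.0, 0.2))

# Tie labeled points to a source (class 1) or sink (class 0) with very
# large capacities, then take the minimum s-t cut as the labeling.
for i in labeled:
    G.add_edge("s" if y[i] == 1 else "t", int(i), capacity=1e6)
cut_value, (side_s, _) = nx.minimum_cut(G, "s", "t")
pred = np.array([1 if i in side_s else 0 for i in range(len(X))])
print("agreement with true labels:", (pred == y).mean())
```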

Nonparametric classification with polynomial MPMC cascades
Sander M. Bohte, Markus Breitenbach, Gregory Z. Grudic
Page: 14 | doi: 10.1145/1015330.1015416
A new class of nonparametric algorithms for high-dimensional binary classification is proposed using cascades of low dimensional polynomial structures. Construction of polynomial cascades is based on Minimax Probability Machine Classification (MPMC), which results in direct estimates of classification accuracy, and provides a simple stopping criterion that does not require expensive cross-validation measures. This Polynomial MPMC Cascade (PMC) algorithm is constructed in linear time with respect to the input space dimensionality, and linear time in the number of examples, making it a potentially attractive alternative to algorithms like support vector machines and standard MPMC. Experimental evidence is given showing that, compared to state-of-the-art classifiers, PMCs are competitive; inherently fast to compute; not prone to overfitting; and generally yield accurate estimates of the maximum error rate on unseen data.

Estimating replicability of classifier learning experiments
Remco R. Bouckaert
Page: 15 | doi: 10.1145/1015330.1015338
Replicability of machine learning experiments measures how likely it is that the outcome of one experiment is repeated when performed with a different randomization of the data. In this paper, we present an estimator of replicability of an experiment that is efficient. More precisely, the estimator is unbiased and has lowest variance in the class of estimators formed by a linear combination of outcomes of experiments on a given data set. We gathered empirical data for comparing experiments consisting of different sampling schemes and hypothesis tests. Both factors are shown to have an impact on replicability of experiments. The data suggests that sign tests should not be used due to low replicability. Ranked sum tests show better performance, but the combination of a sorted runs sampling scheme with a t-test gives the most desirable performance judged on Type I and II error and replicability.

Co-EM support vector learning
Ulf Brefeld, Tobias Scheffer
Page: 16 | doi: 10.1145/1015330.1015350
Multi-view algorithms, such as co-training and co-EM, utilize unlabeled data when the available attributes can be split into independent and compatible subsets. Co-EM outperforms co-training for many problems, but it requires the underlying learner to estimate class probabilities, and to learn from probabilistically labeled data. Therefore, co-EM has so far only been studied with naive Bayesian learners. We cast linear classifiers into a probabilistic framework and develop a co-EM version of the Support Vector Machine. We conduct experiments on text classification problems and compare the family of semi-supervised support vector algorithms under different conditions, including violations of the assumptions underlying multi-view learning. For some problems, such as course web page classification, we observe the most accurate results reported so far.

Active learning of label ranking functions
Klaus Brinker
Page: 17 | doi: 10.1145/1015330.1015331
The effort necessary to construct labeled sets of examples in a supervised learning scenario is often disregarded, though in many applications, it is a time-consuming and expensive procedure. While this already constitutes a major issue in classification learning, it becomes an even more serious problem when dealing with the more complex target domain of total orders over a set of alternatives. Considering both the pairwise decomposition and the constraint classification technique to represent label ranking functions, we introduce a novel generalization of pool-based active learning to address this problem.

Ensemble selection from libraries of models
Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, Alex Ksikes
Page: 18 | doi: 10.1145/1015330.1015432
We present a method for constructing ensembles from libraries of thousands of models. Model libraries are generated using different learning algorithms and parameter settings. Forward stepwise selection is used to add to the ensemble the models that maximize its performance. Ensemble selection allows ensembles to be optimized to a performance metric such as accuracy, cross entropy, mean precision, or ROC area. Experiments with seven test problems and ten metrics demonstrate the benefit of ensemble selection.
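
A toy sketch of the forward stepwise selection step: build a small library of fitted models, then greedily add whichever model most improves a chosen metric (here validation accuracy over averaged predicted probabilities). The library contents and selection budget are illustrative assumptions; the paper's much larger libraries and further refinements are not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# A tiny stand-in for a "library of thousands of models".
library = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
           for d in (1, 3, 5, None)]
library += [LogisticRegression(max_iter=1000, C=c).fit(X_tr, y_tr) for c in (0.1, 1.0, 10.0)]
library += [KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr) for k in (1, 5, 15)]

# Cache each model's predicted probability of class 1 on the hillclimbing set.
probs = np.array([m.predict_proba(X_val)[:, 1] for m in library])

ensemble, best = [], -np.inf
for _ in range(10):                      # forward stepwise selection (with replacement)
    scores = [(np.where(probs[ensemble + [i]].mean(axis=0) > 0.5, 1, 0) == y_val).mean()
              for i in range(len(library))]
    i_best = int(np.argmax(scores))
    if scores[i_best] < best:            # stop once the metric no longer improves
        break
    ensemble.append(i_best)
    best = scores[i_best]

print("selected model indices:", ensemble, "validation accuracy:", round(best, 3))
```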

A comparative study on methods for reducing myopia of hill-climbing search in multirelational learning
Lourdes Peña Castillo, Stefan Wrobel
Page: 19 | doi: 10.1145/1015330.1015334
Hill-climbing search is the most commonly used search algorithm in ILP systems because it permits the generation of theories in short running times. However, a well known drawback of this greedy search strategy is its myopia. Macro-operators (or macros for short), a recently proposed technique to reduce the search space explored by exhaustive search, can also be argued to reduce the myopia of hill-climbing search by automatically performing a variable-depth look-ahead in the search space. Surprisingly, macros have not been employed in a greedy learner. In this paper, we integrate macros into a hill-climbing learner. In a detailed comparative study in several domains, we show that indeed a hill-climbing learner using macros performs significantly better than current state-of-the-art systems involving other techniques for reducing myopia, such as fixed-depth look-ahead, template-based look-ahead, beam-search, or determinate literals. In addition, macros, in contrast to some of the other approaches, can be computed fully automatically and require neither user involvement nor special domain properties such as determinacy.

Locally linear metric adaptation for semi-supervised clustering
Hong Chang, Dit-Yan Yeung
Page: 20 | doi: 10.1145/1015330.1015391
Many supervised and unsupervised learning algorithms are very sensitive to the choice of an appropriate distance metric. While classification tasks can make use of class label information for metric learning, such information is generally unavailable in conventional clustering tasks. Some recent research sought to address a variant of the conventional clustering problem called semi-supervised clustering, which performs clustering in the presence of some background knowledge or supervisory information expressed as pairwise similarity or dissimilarity constraints. However, existing metric learning methods for semi-supervised clustering mostly perform global metric learning through a linear transformation. In this paper, we propose a new metric learning method which performs nonlinear transformation globally but linear transformation locally. In particular, we formulate the learning problem as an optimization problem and present two methods for solving it. Through some toy data sets, we show empirically that our locally linear metric adaptation (LLMA) method can handle some difficult cases that cannot be handled satisfactorily by previous methods. We also demonstrate the effectiveness of our method on some real data sets.

A graphical model for protein secondary structure prediction
Wei Chu, Zoubin Ghahramani, David L. Wild
Page: 21 | doi: 10.1145/1015330.1015354
In this paper, we present a graphical model for protein secondary structure prediction. This model extends segmental semi-Markov models (SSMM) to exploit multiple sequence alignment profiles which contain information from evolutionarily related sequences. A novel parameterized model is proposed as the likelihood function for the SSMM to capture the segmental conformation. By incorporating the information from long range interactions in β-sheets, this model is capable of carrying out inference on contact maps. The numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising.

Take a walk and cluster genes: a TSP-based approach to optimal rearrangement clustering
Sharlee Climer, Weixiong Zhang
Page: 22 | doi: 10.1145/1015330.1015419
Cluster analysis is a fundamental problem and technique in many areas related to machine learning. In this paper, we consider rearrangement clustering, which is the problem of finding sets of objects that share common or similar features by arranging the rows (objects) of a matrix (specifying object features) in such a way that adjacent objects are similar to each other (based on a similarity measure of the features) so as to maximize the overall similarity. Based on formulating this problem as the Traveling Salesman Problem (TSP), we develop a new TSP-based optimal clustering algorithm called TSPCluster. We overcome a flaw that is inherent in previous approaches by relaxing restrictions on dissimilarities between clusters. Our new algorithm has three important features: finding the optimal k clusters for a given k, automatically detecting cluster borders, and ascertaining a set of the most viable clustering results that strike a good balance between maximizing the overall similarity within clusters and the dissimilarity between clusters. We apply TSPCluster to cluster and display ~500 genes of the flowering plant Arabidopsis which are regulated under various abiotic stress conditions. We compare TSPCluster to the bond energy algorithm and two existing clustering algorithms. Our TSPCluster code is available at (Climer & Zhang, 2004).

Links between perceptrons, MLPs and SVMs
Ronan Collobert, Samy Bengio
Page: 23 | doi: 10.1145/1015330.1015415
We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show that it can be computationally expensive to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterward to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better generalization performance and speed than the other models.

Communication complexity as a lower bound for learning in games
Vincent Conitzer, Tuomas Sandholm
Page: 24 | doi: 10.1145/1015330.1015351
A fast-growing body of research in the AI and machine learning communities addresses learning in games, where there are multiple learners with different interests. This research adds to more established research on learning in games conducted in economics. In part because of a clash of fields, there are widely varying requirements on learning algorithms in this domain. The goal of this paper is to demonstrate how communication complexity can be used as a lower bound on the required learning time or cost. Because this lower bound does not assume any requirements on the learning algorithm, it is universal, applying under any set of requirements on the learning algorithm. We characterize exactly the communication complexity of various solution concepts from game theory, namely Nash equilibrium, iterated dominant strategies (both strict and weak), and backwards induction. This gives the tightest lower bounds on learning in games that can be obtained with this method.

Distribution kernels based on moments of counts
Corinna Cortes, Mehryar Mohri
Page: 25 | doi: 10.1145/1015330.1015434
Many applications in text and speech processing require the analysis of distributions of variable-length sequences. We recently introduced a general kernel framework, rational kernels, to extend kernel methods to the analysis of such variable-length sequences or more generally weighted automata. These kernels are efficient to compute and have been successfully used in applications such as spoken-dialog classification using Support Vector Machines. However, the rational kernels previously introduced do not fully encompass distributions over alternate sequences. Prior similarity measures between two weighted automata are based only on the expected counts of co-occurring subsequences and ignore similarities (or dissimilarities) in higher order moments of the distributions of these counts. In this paper, we introduce a new family of rational kernels, moment kernels, that precisely exploit this additional information. These kernels are distribution kernels based on moments of counts of strings. We describe efficient algorithms to compute moment kernels and apply them to several difficult spoken-dialog classification tasks. Our experiments show that using the second moment of the counts of n-gram sequences consistently improves the classification accuracy in these tasks.
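
To make the idea concrete, the sketch below computes first and second moments of character n-gram counts for toy distributions over alternate sequences, given explicitly as {sequence: probability} dictionaries, and combines them into a simple similarity score. This is only an illustration of "moments of counts"; the paper represents such distributions compactly as weighted automata and computes its moment kernels with rational-kernel (transducer) algorithms, which are not reproduced here.

```python
from collections import Counter

def ngram_counts(s, n=2):
    """Character n-gram counts of one sequence."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def count_moments(dist, n=2):
    """First and second moments of each n-gram count under a distribution
    given explicitly as {sequence: probability}."""
    moments = {}
    for seq, p in dist.items():
        for g, c in ngram_counts(seq, n).items():
            m1, m2 = moments.get(g, (0.0, 0.0))
            moments[g] = (m1 + p * c, m2 + p * c * c)
    return moments

def moment_similarity(dist_a, dist_b, n=2):
    """Sum over shared n-grams of products of first and second count moments."""
    ma, mb = count_moments(dist_a, n), count_moments(dist_b, n)
    shared = ma.keys() & mb.keys()
    return sum(ma[g][0] * mb[g][0] + ma[g][1] * mb[g][1] for g in shared)

# Toy "distributions over alternate sequences", e.g. two ASR n-best lists.
A = {"book a flight": 0.7, "book the flight": 0.3}
B = {"book a flight now": 0.6, "cancel the flight": 0.4}
print(moment_similarity(A, B))
```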

A needle in a haystack: local one-class optimization
Koby Crammer, Gal Chechik
Page: 26 | doi: 10.1145/1015330.1015399
This paper addresses the problem of finding a small and coherent subset of points in a given data set. This problem, sometimes referred to as one-class or set covering, requires finding a small-radius ball that covers as many data points as possible. It arises naturally in a wide range of applications, from finding gene-modules to extracting documents' topics, where many data points are irrelevant to the task at hand, or in applications where only positive examples are available. Most previous approaches to this problem focus on identifying and discarding a possible set of outliers. In this paper we adopt an opposite approach which directly aims to find a small set of coherently structured regions, by using a loss function that focuses on local properties of the data. We formalize the learning task as an optimization problem using the Information-Bottleneck principle. An algorithm to solve this optimization problem is then derived and analyzed. Experiments on gene expression data and a text document corpus demonstrate the merits of our approach.

Large margin hierarchical classification
Ofer Dekel, Joseph Keshet, Yoram Singer
Page: 27 | doi: 10.1145/1015330.1015374
We present an algorithmic framework for supervised classification learning where the set of labels is organized in a predefined hierarchical structure. This structure is encoded by a rooted tree which induces a metric over the label set. Our approach combines ideas from large margin kernel methods and Bayesian analysis. Following the large margin principle, we associate a prototype with each label in the tree and formulate the learning task as an optimization problem with varying margin constraints. In the spirit of Bayesian methods, we impose similarity requirements between the prototypes corresponding to adjacent labels in the hierarchy. We describe new online and batch algorithms for solving the constrained optimization problem. We derive a worst case loss-bound for the online algorithm and provide generalization analysis for its batch counterpart. We demonstrate the merits of our approach with a series of experiments on synthetic, text and speech data.

Training conditional random fields via gradient tree boosting
Thomas G. Dietterich, Adam Ashenfelter, Yaroslav Bulatov
Page: 28 | doi: 10.1145/1015330.1015428
Conditional Random Fields (CRFs; Lafferty, McCallum, & Pereira, 2001) provide a flexible and powerful model for learning to assign labels to elements of sequences in such applications as part-of-speech tagging, text-to-speech mapping, protein and DNA sequence analysis, and information extraction from web pages. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features. This paper describes a new method for training CRFs by applying Friedman's (1999) gradient tree boosting method. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees. Regression trees are learned by stage-wise optimizations similar to AdaBoost, but with the objective of maximizing the conditional likelihood P(Y|X) of the CRF model. By growing regression trees, interactions among features are introduced only as needed, so although the parameter space is potentially immense, the search algorithm does not explicitly consider the large space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially like previous algorithms based on iterative scaling and gradient descent.

K-means clustering via principal component analysis
Chris Ding, Xiaofeng He
Page: 29 | doi: 10.1145/1015330.1015408
Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for the K-means objective function are derived, namely the total variance minus the leading eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.
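
The relationship stated in the abstract can be checked numerically: on toy data, the K-means objective (within-cluster sum of squared distances) should stay above the total sum of squares minus the K-1 leading eigenvalues of the centered scatter matrix. The sketch below assumes that reading of the bound and uses scikit-learn's KMeans on synthetic blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

K = 3
X, _ = make_blobs(n_samples=300, centers=K, cluster_std=1.0, random_state=0)
Y = X - X.mean(axis=0)                       # centered data

# K-means objective: within-cluster sum of squared distances to centroids.
J = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Y).inertia_

# PCA-based lower bound (as read from the abstract): total sum of squares
# minus the K-1 leading eigenvalues of the scatter matrix Y^T Y.
T = (Y ** 2).sum()
eigvals = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]
bound = T - eigvals[:K - 1].sum()

print(f"K-means objective {J:.1f} >= PCA lower bound {bound:.1f}")
```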

Linearized cluster assignment via spectral ordering
Chris Ding, Xiaofeng He
Page: 30 | doi: 10.1145/1015330.1015407
Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix and is most conveniently applied to 2-way clustering problems. When applied to multi-way clustering, either the 2-way spectral clustering is recursively applied or an embedding to spectral space is done and some other methods are used to cluster the points. Here we propose and study a K-way cluster assignment method. The method transforms the problem to find valleys and peaks of a 1-D quantity called cluster crossing, which measures the symmetric cluster overlap across a cut point along a linear ordering of the data points. The method can either determine K clusters in one shot or recursively split a current cluster into several smaller ones. We show that a linear ordering based on a distance-sensitive objective has a continuous solution which is the eigenvector of the Laplacian, showing the close relationship between clustering and ordering. The method relies on the connectivity matrix constructed as the truncated spectral expansion of the similarity matrix, useful for revealing cluster structure. The method is applied to newsgroups to illustrate the introduced concepts; experiments show it outperforms the recursive 2-way clustering and the standard K-means clustering.
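
The linear-ordering ingredient can be sketched directly: sort the points by the second-smallest eigenvector (the Fiedler vector) of the graph Laplacian of a similarity matrix, so that similar points become adjacent. The RBF similarity and toy blobs below are illustrative assumptions; the paper's cluster-crossing curve and the K-way assignment procedure built on top of this ordering are not shown.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_blobs(n_samples=150, centers=3, random_state=0)
W = rbf_kernel(X, gamma=0.1)                 # similarity matrix
L = np.diag(W.sum(axis=1)) - W               # (unnormalized) graph Laplacian

# The second-smallest eigenvector of L (the Fiedler vector) induces a linear
# ordering of the points in which similar points end up adjacent; the paper
# then locates cluster borders along this ordering.
vals, vecs = np.linalg.eigh(L)
order = np.argsort(vecs[:, 1])
print("labels along the spectral ordering:")
print(y[order])
```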

The Bayesian backfitting relevance vector machine
Aaron D'Souza, Sethu Vijayakumar, Stefan Schaal
Page: 31 | doi: 10.1145/1015330.1015358
Traditional non-parametric statistical learning techniques are often computationally attractive, but lack the same generalization and model selection abilities as state-of-the-art Bayesian algorithms which, however, are usually computationally prohibitive. This paper makes several important contributions that allow Bayesian learning to scale to more complex, real-world learning scenarios. Firstly, we show that backfitting --- a traditional non-parametric, yet highly efficient regression tool --- can be derived in a novel formulation within an expectation maximization (EM) framework and thus can finally be given a probabilistic interpretation. Secondly, we show that the general framework of sparse Bayesian learning and in particular the relevance vector machine (RVM) can be derived as a highly efficient algorithm using a Bayesian version of backfitting at its core. As we demonstrate on several regression and classification benchmarks, Bayesian backfitting offers a compelling alternative to current regression methods, especially when the size and dimensionality of the data challenge computational resources.

Learning probabilistic motion models for mobile robots
Austin I. Eliazar, Ronald Parr
Page: 32 | doi: 10.1145/1015330.1015413
Machine learning methods are often applied to the problem of learning a map from a robot's sensor data, but they are rarely applied to the problem of learning a robot's motion model. The motion model, which can be influenced by robot idiosyncrasies and terrain properties, is a crucial aspect of current algorithms for Simultaneous Localization and Mapping (SLAM). In this paper we concentrate on generating the correct motion model for a robot by applying EM methods in conjunction with a current SLAM algorithm. In contrast to previous calibration approaches, we not only estimate the mean of the motion, but also the interdependencies between motion terms, and the variances in these terms. This can be used to provide a more focused proposal distribution to a particle filter used in a SLAM algorithm, which can reduce the resources needed for localization while decreasing the chance of losing track of the robot's position. We validate this approach by recovering a good motion model despite initialization with a poor one. Further experiments validate the generality of the learned model in similar circumstances.

Lookahead-based algorithms for anytime induction of decision trees
Saher Esmeir, Shaul Markovitch
Page: 33 | doi: 10.1145/1015330.1015373
The majority of the existing algorithms for learning decision trees are greedy---a tree is induced top-down, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Furthermore, the greedy algorithms require a fixed amount of time and are not able to generate a better tree if additional time is available. To overcome this problem, we present two lookahead-based algorithms for anytime induction of decision trees, thus allowing a tradeoff between tree quality and learning time. The first one is depth-k lookahead, where a larger time allocation permits a larger k. The second algorithm uses a novel strategy for evaluating candidate splits; a stochastic version of ID3 is repeatedly invoked to estimate the size of the tree in which each split results, and the one that minimizes the expected size is preferred. Experimental results indicate that for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available.

A Monte Carlo analysis of ensemble classification
Roberto Esposito, Lorenza Saitta
Page: 34 | doi: 10.1145/1015330.1015386
In this paper we extend previous results providing a theoretical analysis of a new Monte Carlo ensemble classifier. The framework allows us to characterize the conditions under which the ensemble approach can be expected to outperform the single hypothesis classifier. Moreover, we provide a closed form expression for the distribution of the true ensemble accuracy, as well as of its mean and variance. We then exploit this result in order to analyze the expected error behavior in a particularly interesting case.

Relational sequential inference with reliable observations
Alan Fern, Robert Givan
Page: 35 | doi: 10.1145/1015330.1015420
We present a trainable sequential-inference technique for processes with large state and observation spaces and relational structure. Our method assumes "reliable observations", i.e. that each process state persists long enough to be reliably inferred from the observations it generates. We introduce the idea of a "state-inference function" (from observation sequences to underlying hidden states) for representing knowledge about a process and develop an efficient sequential-inference algorithm, utilizing this function, that is correct for processes that generate reliable observations consistent with the state-inference function. We describe a representation for state-inference functions in relational domains and give a corresponding supervised learning algorithm. Experiments, in relational video interpretation, show that our technique provides significantly improved accuracy and speed relative to a variety of recent, hand-coded, non-trainable systems.

Solving cluster ensemble problems by bipartite graph partitioning
Xiaoli Zhang Fern, Carla E. Brodley
Page: 36 | doi: 10.1145/1015330.1015414
A critical problem in cluster ensemble research is how to combine multiple clusterings to yield a final superior clustering result. Leveraging advanced graph partitioning techniques, we solve this problem by reducing it to a graph partitioning problem. We introduce a new reduction method that constructs a bipartite graph from a given cluster ensemble. The resulting graph models both instances and clusters of the ensemble simultaneously as vertices in the graph. Our approach retains all of the information provided by a given ensemble, allowing the similarity among instances and the similarity among clusters to be considered collectively in forming the final clustering. Further, the resulting graph partitioning problem can be solved efficiently. We empirically evaluate the proposed approach against two commonly used graph formulations and show that it is more robust and achieves comparable or better performance in comparison to its competitors.

Delegating classifiers
César Ferri, Peter Flach, José Hernández-Orallo
Page: 37 | doi: 10.1145/1015330.1015395
A sensible use of classifiers must be based on the estimated reliability of their predictions. A cautious classifier would delegate the difficult or uncertain predictions to other, possibly more specialised, classifiers. In this paper we analyse and develop this idea of delegating classifiers in a systematic way. First, we design a two-step scenario where a first classifier chooses which examples to classify and delegates the difficult examples to train a second classifier. Secondly, we present an iterated scenario involving an arbitrary number of chained classifiers. We compare these scenarios to classical ensemble methods, such as bagging and boosting. We show experimentally that our approach is not far behind these methods in terms of accuracy, but with several advantages: (i) improved efficiency, since each classifier learns from fewer examples than the previous one; (ii) improved comprehensibility, since each classification derives from a single classifier; and (iii) the possibility to simplify the overall multi-classifier by removing the parts that lead to delegation.

A pitfall and solution in multi-class feature selection for text classification
George Forman
Page: 38 | doi: 10.1145/1015330.1015356
Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate that this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
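
A sketch of the round-robin remedy on synthetic data: score features separately for each class (one-vs-rest) and pick features by cycling over the classes, so that features needed for harder classes are not crowded out by a surplus of strong features for easier ones. Mutual information is used here as a stand-in for the paper's exact scoring and scheduling details, and the data generator is made up.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_docs, n_feats, n_classes = 300, 50, 3
y = rng.integers(0, n_classes, size=n_docs)

# Synthetic binary term-document matrix: features 5c..5c+4 are predictive
# of class c, everything else is background noise.
probs = np.full((n_docs, n_feats), 0.05)
for c in range(n_classes):
    probs[y == c, c * 5:(c + 1) * 5] = 0.4
X = (rng.random((n_docs, n_feats)) < probs).astype(int)

def round_robin_select(X, y, k):
    """Rank features separately per class (one-vs-rest) and pick them in
    round-robin order, so that no class's useful features are crowded out."""
    rankings = [np.argsort(-mutual_info_classif(X, (y == c).astype(int),
                                                discrete_features=True, random_state=0))
                for c in np.unique(y)]
    selected = []
    while len(selected) < k:
        for ranking in rankings:             # one pick per class, then repeat
            f = int(next(f for f in ranking if f not in selected))
            selected.append(f)
            if len(selected) == k:
                break
    return selected

print("selected feature indices:", round_robin_select(X, y, k=9))
```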

Ensembles of nested dichotomies for multi-class problems
Eibe Frank, Stefan Kramer
Page: 39 | doi: 10.1145/1015330.1015363
Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multi-class classification task into a system of dichotomies and provide a statistically sound way of applying two-class learning algorithms to multi-class problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multi-class problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to error-correcting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good general-purpose method for applying binary classifiers to multi-class problems.
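
A minimal sketch of a single randomly drawn nested dichotomy: recursively split the class set in two, train a binary probabilistic classifier (here logistic regression) at each internal node, and multiply branch probabilities down to the leaf classes. The ensemble in the paper averages the class probability estimates of many such random trees; only one tree on the iris data is shown.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def build_dichotomy(X, y, classes, rng):
    """Recursively split `classes` into two random subsets and train a binary
    probabilistic classifier at each internal node (one nested dichotomy)."""
    if len(classes) == 1:
        return {"leaf": int(classes[0])}
    split = rng.permutation(classes)
    left, right = list(split[:len(classes) // 2]), list(split[len(classes) // 2:])
    mask = np.isin(y, classes)
    clf = LogisticRegression(max_iter=1000).fit(X[mask], np.isin(y[mask], left).astype(int))
    return {"clf": clf, "left": build_dichotomy(X, y, left, rng),
            "right": build_dichotomy(X, y, right, rng)}

def class_probs(node, x, n_classes, p=1.0):
    """Multiply branch probabilities along each path to obtain class probabilities."""
    if "leaf" in node:
        out = np.zeros(n_classes)
        out[node["leaf"]] = p
        return out
    p_left = node["clf"].predict_proba(x.reshape(1, -1))[0, 1]
    return (class_probs(node["left"], x, n_classes, p * p_left)
            + class_probs(node["right"], x, n_classes, p * (1.0 - p_left)))

X, y = load_iris(return_X_y=True)
tree = build_dichotomy(X, y, list(range(3)), np.random.default_rng(0))
pred = np.array([np.argmax(class_probs(tree, x, 3)) for x in X])
print("training accuracy of one nested dichotomy:", (pred == y).mean())
```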

A fast iterative algorithm for fisher discriminant using heterogeneous kernels
Glenn Fung, Murat Dundar, Jinbo Bi, Bharat Rao
Page: 40 | doi: 10.1145/1015330.1015409
We propose a fast iterative classification algorithm for Kernel Fisher Discriminant (KFD) using heterogeneous kernel models. In contrast with the standard KFD that requires the user to predefine a kernel function, we incorporate the task of choosing an appropriate kernel into the optimization problem to be solved. The choice of kernel is defined as a linear combination of kernels belonging to a potentially large family of different positive semidefinite kernels. The complexity of our algorithm does not increase significantly with respect to the number of kernels in the kernel family. Experiments on several benchmark datasets demonstrate that the generalization performance of the proposed algorithm is not significantly different from that achieved by the standard KFD in which the kernel parameters have been tuned using cross validation. We also present results on a real-life colon cancer dataset that demonstrate the efficiency of the proposed method.
|
|
|
Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5 |
| |
Evgeniy Gabrilovich,
Shaul Markovitch
|
|
Page: 41 |
|
doi>10.1145/1015330.1015388 |
|
Full text: PDF
|
|
Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed. We describe a class of text categorization problems that are characterized by many redundant features. Even though most of these features are relevant, the underlying concepts can be concisely captured using only a few features, while keeping all of them has a substantially detrimental effect on categorization accuracy. We develop a novel measure that captures feature redundancy, and use it to analyze a large collection of datasets. We show that for problems plagued with numerous redundant features, the performance of C4.5 is significantly superior to that of SVM, while aggressive feature selection allows SVM to beat C4.5 by a narrow margin.
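As a rough illustration of this kind of aggressive selection, the sketch below keeps only a handful of bag-of-words features before training a linear SVM with scikit-learn. The chi-squared score is one common choice of selection criterion and is not necessarily the measure proposed in the paper; the toy documents and the value of k are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder documents and labels standing in for a real text collection.
docs = ["grain exports rose", "oil prices fell", "wheat harvest up", "crude futures slid"]
labels = ["grain", "oil", "grain", "oil"]

# Aggressively keep only a few features before training the SVM.
model = make_pipeline(
    CountVectorizer(),           # bag-of-words counts
    SelectKBest(chi2, k=3),      # keep the 3 highest-scoring terms
    LinearSVC(),
)
model.fit(docs, labels)
print(model.predict(["wheat exports rose"]))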
|
|
|
A MFoM learning approach to robust multiclass multi-label text categorization |
| |
Sheng Gao,
Wen Wu,
Chin-Hui Lee,
Tat-Seng Chua
|
|
Page: 42 |
|
doi>10.1145/1015330.1015361 |
|
Full text: PDF
|
|
We propose a multiclass (MC) classification approach to text categorization (TC). To fully take advantage of both positive and negative training examples, a maximal figure-of-merit (MFoM) learning algorithm is introduced to train high performance MC classifiers. In contrast to conventional binary classification, the proposed MC scheme assigns a uniform score function to each category for each given test sample, and thus the classical Bayes decision rules can now be applied. Since all the MC MFoM classifiers are simultaneously trained, we expect them to be more robust and work better than the binary MFoM classifiers, which are trained separately and are known to give the best TC performance. Experimental results on the Reuters-21578 TC task indicate that the MC MFoM classifiers achieve a micro-averaging F1 value of 0.377, which is significantly better than 0.138, obtained with the binary MFoM classifiers, for the categories with less than 4 training samples. Furthermore, for all 90 categories, most with large training sizes, the MC MFoM classifiers give a micro-averaging F1 value of 0.888, better than 0.884, obtained with the binary MFoM classifiers. expand
|
|
|
Margin based feature selection - theory and algorithms |
| |
Ran Gilad-Bachrach,
Amir Navot,
Naftali Tishby
|
|
Page: 43 |
|
doi>10.1145/1015330.1015352 |
|
Full text: PDF
|
|
Feature selection is the task of choosing a small set out of a given set of features that captures the relevant properties of the data. In the context of supervised classification problems, the relevance is determined by the given labels on the training data. A good choice of features is key to building compact and accurate classifiers. In this paper we introduce a margin-based feature selection criterion and apply it to measure the quality of sets of features. Using margins, we devise novel selection algorithms for multi-class classification problems and provide a theoretical generalization bound. We also study the well-known Relief algorithm and show that it resembles a gradient ascent over our margin criterion. We apply our new algorithm to various datasets and show that our new Simba algorithm, which directly optimizes the margin, outperforms Relief.
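Since the paper shows that Relief resembles a gradient ascent on its margin criterion, a minimal Relief-style sketch may help fix ideas. This is plain Relief with nearest hits and misses, not the Simba algorithm itself, and the hyperparameters are illustrative.

import numpy as np

def relief_weights(X, y, n_iter=200, rng=None):
    # Relief-style margin weights: reward features on which a sampled point is
    # farther from its nearest miss (other class) than from its nearest hit (same class).
    rng = rng or np.random.default_rng(0)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                               # never pick the point itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        miss = np.argmin(np.where(y != y[i], dist, np.inf))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter

Features with the largest weights are retained; Simba instead performs gradient ascent directly on the margin objective.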
|
|
|
Tractable learning of large Bayes net structures from sparse data |
| |
Anna Goldenberg,
Andrew Moore
|
|
Page: 44 |
|
doi>10.1145/1015330.1015406 |
|
Full text: PDF
|
|
This paper addresses three questions. Is it useful to attempt to learn a Bayesian network structure with hundreds of thousands of nodes? How should such structure search proceed practically? The third question arises out of our approach to the second: how can Frequent Sets (Agrawal et al., 1993), which are extremely popular in the area of descriptive data mining, be turned into a probabilistic model? Large sparse datasets with hundreds of thousands of records and attributes appear in social networks, warehousing, supermarket transactions and web logs. The complexity of structural search has made learning of factored probabilistic models on such datasets infeasible. We propose to use Frequent Sets to significantly speed up the structural search. Unlike previous approaches, we not only cache n-way sufficient statistics, but also exploit their local structure. We also present an empirical evaluation of our algorithm applied to several massive datasets.
|
|
|
Parameter space exploration with Gaussian process trees |
| |
Robert B. Gramacy,
Herbert K. H. Lee,
William G. Macready
|
|
Page: 45 |
|
doi>10.1145/1015330.1015367 |
|
Full text: PDF
|
|
Computer experiments often require dense sweeps over input parameters to obtain a qualitative understanding of their response. Such sweeps can be prohibitively expensive, and are unnecessary in regions where the response is easily predicted; well-chosen designs could allow a mapping of the response with far fewer simulation runs. Thus, there is a need for computationally inexpensive surrogate models and an accompanying method for selecting small designs. We explore a general methodology for addressing this need that uses non-stationary Gaussian processes. Binary trees partition the input space to facilitate non-stationarity, and a Bayesian interpretation provides an explicit measure of predictive uncertainty that can be used to guide sampling. Our methods are illustrated on several examples, including a motivating example involving computational fluid dynamics simulation of a NASA reentry vehicle.
|
|
|
Learning Bayesian network classifiers by maximizing conditional likelihood |
| |
Daniel Grossman,
Pedro Domingos
|
|
Page: 46 |
|
doi>10.1145/1015330.1015339 |
|
Full text: PDF
|
|
Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. However, they tend to perform poorly when learned in the standard way. This is attributable to a mismatch between the objective function used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or conditional likelihood). Unfortunately, the computational cost of optimizing structure and parameters for conditional likelihood is prohibitive. In this paper we show that a simple approximation---choosing structures by maximizing conditional likelihood while setting parameters by maximum likelihood---yields good results. On a large suite of benchmark datasets, this approach produces better class probability estimates than naive Bayes, TAN, and generatively-trained Bayesian networks. expand
|
|
|
A kernel view of the dimensionality reduction of manifolds |
| |
Jihun Ham,
Daniel D. Lee,
Sebastian Mika,
Bernhard Schölkopf
|
|
Page: 47 |
|
doi>10.1145/1015330.1015417 |
|
Full text: PDF
|
|
We interpret several well-known algorithms for dimensionality reduction of manifolds as kernel methods. Isomap, graph Laplacian eigenmap, and locally linear embedding (LLE) all utilize local neighborhood information to construct a global embedding of the manifold. We show how all three algorithms can be described as kernel PCA on specially constructed Gram matrices, and illustrate the similarities and differences between the algorithms with representative examples. expand
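For the Isomap case, the paper's viewpoint can be sketched in a few lines, assuming scipy and scikit-learn utilities: geodesic graph distances are double-centered into a Gram matrix whose top eigenvectors give the embedding, which is exactly the kernel PCA computation. The function name and parameters below are illustrative.

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap_as_kernel_pca(X, n_neighbors=10, n_components=2):
    # Geodesic distances along the k-nearest-neighbour graph
    # (n_neighbors must be large enough for the graph to be connected).
    G = kneighbors_graph(X, n_neighbors, mode="distance")
    D = shortest_path(G, method="D", directed=False)
    # Double centering turns squared geodesic distances into a Gram matrix.
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * H @ (D ** 2) @ H
    # Kernel PCA step: the leading eigenvectors of K give the embedding.
    vals, vecs = np.linalg.eigh(K)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))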
|
|
|
A theoretical characterization of linear SVM-based feature selection |
| |
Douglas Hardin,
Ioannis Tsamardinos,
Constantin F. Aliferis
|
|
Page: 48 |
|
doi>10.1145/1015330.1015421 |
|
Full text: PDF
|
|
Most prevalent techniques in Support Vector Machine (SVM) feature selection are based on the intuition that the weights of features that are close to zero are not required for optimal classification. In this paper we show that indeed, in the sample limit, the irrelevant variables (in a theoretical and optimal sense) will be given zero weight by a linear SVM, both in the soft and the hard margin case. However, SVM-based methods have certain theoretical disadvantages too. We present examples where the linear SVM may assign zero weights to strongly relevant variables (i.e., variables required for optimal estimation of the distribution of the target variable) and where weakly relevant features (i.e., features that are superfluous for optimal feature selection given other features) may get non-zero weights. We contrast and theoretically compare with Markov-Blanket based feature selection algorithms that do not have such disadvantages in a broad class of distributions and could also be used for causal discovery. expand
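The weight-based selection heuristic analyzed here can be illustrated in a few lines: train a linear SVM and rank features by the magnitude of their weights. The synthetic data below, with three relevant features and pure noise for the rest, is only for illustration.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d_relevant, d_noise = 200, 3, 20
X_rel = rng.normal(size=(n, d_relevant))
y = (X_rel.sum(axis=1) > 0).astype(int)          # the label depends only on the first 3 features
X = np.hstack([X_rel, rng.normal(size=(n, d_noise))])

svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
ranking = np.argsort(-np.abs(svm.coef_[0]))      # features sorted by |weight|
print("top-ranked features:", ranking[:5])       # indices 0-2 should dominate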
|
|
|
Optimising area under the ROC curve using gradient descent |
| |
Alan Herschtal,
Bhavani Raskutti
|
|
Page: 49 |
|
doi>10.1145/1015330.1015366 |
|
Full text: PDF
|
|
This paper introduces RankOpt, a linear binary classifier which optimises the area under the ROC curve (the AUC). Unlike standard binary classifiers, RankOpt adopts the AUC statistic as its objective function, and optimises it directly using gradient descent. The problems with using the AUC statistic as an objective function are that it is non-differentiable, and of complexity O(n^2) in the number of data observations. RankOpt uses a differentiable approximation to the AUC which is accurate and computationally efficient, being of complexity O(n). This enables the gradient descent to be performed in reasonable time. The performance of RankOpt is compared with a number of other linear binary classifiers, over a number of different classification problems. In almost all cases it is found that the performance of RankOpt is significantly better than the other classifiers tested.
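A common way to make the AUC differentiable is to replace the pairwise indicator with a sigmoid of the score difference, and the sketch below maximizes that surrogate for a linear scorer by gradient ascent. It uses the straightforward O(n^2) pairwise form rather than the paper's O(n) approximation, so it conveys the idea but not RankOpt's efficiency.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_auc_fit(X, y, lr=0.1, n_steps=500):
    # Gradient ascent on a sigmoid approximation of the AUC of the linear scorer x -> w.x
    X, y = np.asarray(X, float), np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        diff = pos[:, None, :] - neg[None, :, :]        # score differences for all (pos, neg) pairs
        margins = diff @ w
        s = sigmoid(margins)
        grad = (s * (1 - s))[..., None] * diff          # d/dw of sigmoid(w . diff)
        w += lr * grad.mean(axis=(0, 1))                # soft AUC being maximized: s.mean()
    return w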
|
|
|
Boosting margin based distance functions for clustering |
| |
Tomer Hertz,
Aharon Bar-Hillel,
Daphna Weinshall
|
|
Page: 50 |
|
doi>10.1145/1015330.1015389 |
|
Full text: PDF
|
|
The performance of graph based clustering methods critically depends on the quality of the distance function used to compute similarities between pairs of neighboring nodes. In this paper we learn distance functions by training binary classifiers with margins. The classifiers are defined over the product space of pairs of points and are trained to distinguish whether two points come from the same class or not. The signed margin is used as the distance value. Our main contribution is a distance learning method (DistBoost), which combines boosting hypotheses over the product space with a weak learner based on partitioning the original feature space. Each weak hypothesis is a Gaussian mixture model computed using a semi-supervised constrained EM algorithm, which is trained using both unlabeled and labeled data. We also consider SVM and decision trees boosting as margin based classifiers in the product space. We experimentally compare the margin based distance functions with other existing metric learning methods, and with existing techniques for the direct incorporation of constraints into various clustering algorithms. Clustering performance is measured on some benchmark databases from the UCI repository, a sample from the MNIST database, and a data set of color images of animals. In most cases the DistBoost algorithm significantly and robustly outperformed its competitors. expand
|
|
|
Learning large margin classifiers locally and globally |
| |
Kaizhu Huang,
Haiqin Yang,
Irwin King,
Michael R. Lyu
|
|
Page: 51 |
|
doi>10.1145/1015330.1015365 |
|
Full text: PDF
|
|
A new large margin classifier, named the Maxi-Min Margin Machine (M4), is proposed in this paper. This new classifier is constructed based on both a "local" and a "global" view of data, while the most popular large margin classifier, the Support Vector Machine (SVM), and the recently proposed Minimax Probability Machine (MPM) consider data only either locally or globally. This new model is theoretically important in the sense that SVM and MPM can both be considered as its special cases. Furthermore, the optimization of M4 can be cast as a sequential conic programming problem, which can be solved efficiently. We describe the M4 model definition, provide a clear geometrical interpretation, present theoretical justifications, propose efficient solving methods, and perform a series of evaluations on both synthetic data sets and real-world benchmark data sets. Its comparison with SVM and MPM also demonstrates the advantages of our new model.
|
|
|
Testing the significance of attribute interactions |
| |
Aleks Jakulin,
Ivan Bratko
|
|
Page: 52 |
|
doi>10.1145/1015330.1015377 |
|
Full text: PDF
|
|
Attribute interactions are the irreducible dependencies between attributes. Interactions underlie feature relevance and selection, and the structure of joint probability and classification models: if and only if the attributes interact, they should be connected. While the issue of 2-way interactions, especially of those between an attribute and the label, has already been addressed, we introduce an operational definition of a generalized n-way interaction by highlighting two models: the reductionistic part-to-whole approximation, where the model of the whole is reconstructed from models of the parts, and the holistic reference model, where the whole is modelled directly. An interaction is deemed significant if these two models are significantly different. In this paper, we propose the Kirkwood superposition approximation for constructing part-to-whole approximations. To model data, we do not assume a particular structure of interactions, but instead construct the model by testing for the presence of interactions. The resulting map of significant interactions is a graphical model learned from the data. We confirm that the P-values computed with the assumption of the asymptotic χ2 distribution closely match those obtained with the bootstrap.
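For the 3-way case, the quantity behind such a test can be illustrated with the plug-in interaction information I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C). The sketch below estimates it from counts; the Kirkwood superposition approximation and the significance testing described in the paper are not reproduced.

import numpy as np
from collections import Counter

def mutual_info(xs, ys):
    # Plug-in estimate of I(X;Y) in bits from two discrete sequences.
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * np.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def interaction_info(a, b, c):
    # I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C)
    return mutual_info(list(zip(a, b)), c) - mutual_info(a, c) - mutual_info(b, c)

# XOR-style example: neither attribute alone predicts the label, but together they
# determine it, so the interaction information is 1 bit.
a, b, c = [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]
print(interaction_info(a, b, c))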
|
|
|
Learning and discovery of predictive state representations in dynamical systems with reset |
| |
Michael R. James,
Satinder Singh
|
|
Page: 53 |
|
doi>10.1145/1015330.1015359 |
|
Full text: PDF
|
|
Predictive state representations (PSRs) are a recently proposed way of modeling controlled dynamical systems. PSR-based models use predictions of observable outcomes of tests that could be done on the system as their state representation, and have model parameters that define how the predictive state representation changes over time as actions are taken and observations noted. Learning PSR-based models requires solving two subproblems: 1) discovery of the tests whose predictions constitute state, and 2) learning the model parameters that define the dynamics. So far, there have been no results available on the discovery subproblem while for the learning subproblem an approximate-gradient algorithm has been proposed (Singh et al., 2003) with mixed results (it works on some domains and not on others). In this paper, we provide the first discovery algorithm and a new learning algorithm for linear PSRs for the special class of controlled dynamical systems that have a reset operation. We provide experimental verification of our algorithms. Finally, we also distinguish our work from prior work by Jaeger (2000) on observable operator models (OOMs). expand
|
|
|
Boosting grammatical inference with confidence oracles |
| |
Jean-Christophe Janodet,
Richard Nock,
Marc Sebban,
Henri-Maxime Suchier
|
|
Page: 54 |
|
doi>10.1145/1015330.1015336 |
|
Full text: PDF
|
|
In this paper we focus on the adaptation of boosting to grammatical inference. We aim at improving the performance of state merging algorithms in the presence of noisy data by using, in the update rule, additional information provided by an oracle. This strategy requires the construction of a new weighting scheme that takes into account the confidence in the labels of the examples. We prove that our new framework preserves the theoretical properties of boosting. Using the state merging algorithm RPNI*, we describe an experimental study on various datasets, showing a dramatic improvement of performances. expand
|
|
|
Multi-task feature and kernel selection for SVMs |
| |
Tony Jebara
|
|
Page: 55 |
|
doi>10.1145/1015330.1015426 |
|
Full text: PDF
|
|
We compute a common feature selection or kernel selection configuration for multiple support vector machines (SVMs) trained on different yet inter-related datasets. The method is advantageous when multiple classification tasks and differently labeled datasets exist over a common input space. Different datasets can mutually reinforce a common choice of representation or relevant features for their various classifiers. We derive a multi-task representation learning approach using the maximum entropy discrimination formalism. The resulting convex algorithms maintain the global solution properties of support vector machines. However, in addition to multiple SVM classification/regression parameters they also jointly estimate an optimal subset of features or optimal combination of kernels. Experiments are shown on standardized datasets. expand
|
|
|
A spatio-temporal extension to Isomap nonlinear dimension reduction |
| |
Odest Chadwicke Jenkins,
Maja J. Matarić
|
|
Page: 56 |
|
doi>10.1145/1015330.1015357 |
|
Full text: PDF
|
|
We present an extension of Isomap nonlinear dimension reduction (Tenenbaum et al., 2000) for data with both spatial and temporal relationships. Our method, ST-Isomap, augments the existing Isomap framework to consider temporal relationships in local neighborhoods that can be propagated globally via a shortest-path mechanism. Two instantiations of ST-Isomap are presented for sequentially continuous and segmented data. Results from applying ST-Isomap to real-world data collected from human motion performance and humanoid robot teleoperation are also presented. expand
|
|
|
Robust feature induction for support vector machines |
| |
Rong Jin,
Huan Liu
|
|
Page: 57 |
|
doi>10.1145/1015330.1015370 |
|
Full text: PDF
|
|
The goal of feature induction is to automatically create nonlinear combinations of existing features as additional input features to improve classification accuracy. Typically, nonlinear features are introduced into a support vector machine (SVM) through a nonlinear kernel function. One disadvantage of such an approach is that the feature space induced by a kernel function is usually of high dimension and therefore substantially increases the chance of over-fitting the training data. Another disadvantage is that nonlinear features are induced implicitly, which makes it difficult to understand which induced features are critical to classification performance. In this paper, we propose a boosting-style algorithm that explicitly induces important nonlinear features for SVMs. We present empirical studies with discussion to show that this approach is effective in improving classification accuracy for SVMs. The comparison with an SVM model using nonlinear kernels also indicates that this approach is effective and robust, particularly when the amount of training data is small.
|
|
|
Kernel-based discriminative learning algorithms for labeling sequences, trees, and graphs |
| |
Hisashi Kashima,
Yuta Tsuboi
|
|
Page: 58 |
|
doi>10.1145/1015330.1015383 |
|
Full text: PDF
|
|
We introduce a new perceptron-based discriminative learning algorithm for labeling structured data such as sequences, trees, and graphs. Since it is fully kernelized and uses pointwise label prediction, large features, including an arbitrary number of hidden variables, can be incorporated with polynomial time complexity. This is in contrast to existing labelers that can handle only features over a small number of hidden variables, such as Maximum Entropy Markov Models and Conditional Random Fields. We also introduce several kernel functions for labeling sequences, trees, and graphs, together with efficient algorithms for them.
|
|
|
Bellman goes relational |
| |
Kristian Kersting,
Martijn Van Otterlo,
Luc De Raedt
|
|
Page: 59 |
|
doi>10.1145/1015330.1015401 |
|
Full text: PDF
|
|
Motivated by the interest in relational reinforcement learning, we introduce a novel relational Bellman update operator called REBEL. It employs a constraint logic programming language to compactly represent Markov decision processes over relational domains. Using REBEL, a novel value iteration algorithm is developed in which abstraction (over states and actions) plays a major role. This framework provides new insights into relational reinforcement learning. Convergence results as well as experiments are presented. expand
|
|
|
Gradient LASSO for feature selection |
| |
Yongdai Kim,
Jinseog Kim
|
|
Page: 60 |
|
doi>10.1145/1015330.1015364 |
|
Full text: PDF
|
|
LASSO (Least Absolute Shrinkage and Selection Operator) is a useful tool for achieving shrinkage and variable selection simultaneously. Since LASSO uses the L1 penalty, the optimization must rely on quadratic programming (QP) or general nonlinear programming, which is known to be computationally intensive. In this paper, we propose a gradient descent algorithm for LASSO. Even though the final result is slightly less accurate, the proposed algorithm is computationally simpler than QP or nonlinear programming, and so can be applied to large-scale problems. We provide the convergence rate of the algorithm, and illustrate it with simulated models as well as real data sets.
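The paper's specific gradient procedure is not reproduced here; as a generic point of reference, the sketch below minimizes the same objective 0.5*||y - Xw||^2 + lam*||w||_1 with standard proximal-gradient (ISTA) updates, which likewise avoids a general QP solver.

import numpy as np

def lasso_ista(X, y, lam=0.1, n_steps=1000):
    # Proximal-gradient (ISTA) minimization of 0.5*||y - Xw||^2 + lam*||w||_1.
    X, y = np.asarray(X, float), np.asarray(y, float)
    step = 1.0 / np.linalg.norm(X, 2) ** 2         # 1 / Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)                   # gradient of the squared-error term
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return w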
|
|
|
Sparse cooperative Q-learning |
| |
Jelle R. Kok,
Nikos Vlassis
|
|
Page: 61 |
|
doi>10.1145/1015330.1015410 |
|
Full text: PDF
|
|
Learning in multiagent systems suffers from the fact that both the state and the action space scale exponentially with the number of agents. In this paper we are interested in using Q-learning to learn the coordinated actions of a group of cooperative agents, using a sparse representation of the joint state-action space of the agents. We first examine a compact representation in which the agents need to explicitly coordinate their actions only in a predefined set of states. Next, we use a coordination-graph approach in which we represent the Q-values by value rules that specify the coordination dependencies of the agents at particular states. We show how Q-learning can be efficiently applied to learn a coordinated policy for the agents in the above framework. We demonstrate the proposed method on the predator-prey domain, and we compare it with other related multiagent Q-learning methods. expand
|
|
|
Authorship verification as a one-class classification problem |
| |
Moshe Koppel,
Jonathan Schler
|
|
Page: 62 |
|
doi>10.1145/1015330.1015448 |
|
Full text: PDF
|
|
In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the "depth of difference" between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. expand
|
|
|
Leveraging the margin more carefully |
| |
Nir Krause,
Yoram Singer
|
|
Page: 63 |
|
doi>10.1145/1015330.1015344 |
|
Full text: PDF
|
|
Boosting is a popular approach for building accurate classifiers. Despite the initial popular belief, boosting algorithms do exhibit overfitting and are sensitive to label noise. Part of the sensitivity of boosting algorithms to outliers and noise can be attributed to the unboundedness of the margin-based loss functions that they employ. In this paper we describe two leveraging algorithms that build on boosting techniques and employ a bounded loss function of the margin. The first algorithm interleaves the expectation maximization (EM) algorithm with boosting steps. The second algorithm decomposes a non-convex loss into a difference of two convex losses. We prove that both algorithms converge to a stationary point. We also analyze the generalization properties of the algorithms using the Rademacher complexity. We describe experiments with both synthetic data and natural data (OCR and text) that demonstrate the merits of our framework, in particular robustness to outliers. expand
|
|
|
Kernel conditional random fields: representation and clique selection |
| |
John Lafferty,
Xiaojin Zhu,
Yan Liu
|
|
Page: 64 |
|
doi>10.1145/1015330.1015337 |
|
Full text: PDF
|
|
Kernel conditional random fields (KCRFs) are introduced as a framework for discriminative modeling of graph-structured data. A representer theorem for conditional graphical models is given which shows how kernel conditional random fields arise from risk minimization procedures defined using Mercer kernels on labeled graphs. A procedure for greedily selecting cliques in the dual representation is then proposed, which allows sparse representations. By incorporating kernels and implicit feature spaces into conditional graphical models, the framework enables semi-supervised learning algorithms for structured data through the use of graph kernels. The framework and clique selection methods are demonstrated in synthetic data experiments, and are also applied to the problem of protein secondary structure prediction. expand
|
|
|
Learning to learn with the informative vector machine |
| |
Neil D. Lawrence,
John C. Platt
|
|
Page: 65 |
|
doi>10.1145/1015330.1015382 |
|
Full text: PDF
|
|
This paper describes an efficient method for learning the parameters of a Gaussian process (GP). The parameters are learned from multiple tasks, which are assumed to have been drawn independently from the same GP prior. An efficient algorithm is obtained by extending the informative vector machine (IVM) algorithm to handle the multi-task learning case. The multi-task IVM (MT-IVM) saves computation by greedily selecting the most informative examples from the separate tasks. The MT-IVM is also shown to be more efficient than random sub-sampling on an artificial dataset and more effective than the traditional IVM in a speaker-dependent phoneme recognition task.
|
|
|
Hyperplane margin classifiers on the multinomial manifold |
| |
Guy Lebanon,
John Lafferty
|
|
Page: 66 |
|
doi>10.1145/1015330.1015333 |
|
Full text: PDF
|
|
The assumptions behind linear classifiers for categorical data are examined and reformulated in the context of the multinomial manifold, the simplex of multinomial models furnished with the Riemannian structure induced by the Fisher information. This leads to a new view of hyperplane classifiers which, together with a generalized margin concept, shows how to adapt existing margin-based hyperplane models to multinomial geometry. Experiments show the new classification framework to be effective for text classification, where the categorical structure of the data is modeled naturally within the multinomial family. expand
|
|
|
Probabilistic tangent subspace: a unified view |
| |
Jianguo Lee,
Jingdong Wang,
Changshui Zhang,
Zhaoqi Bian
|
|
Page: 67 |
|
doi>10.1145/1015330.1015362 |
|
Full text: PDF
|
|
Tangent Distance (TD) is a classical method for invariant pattern classification. However, conventional TD requires tangent vectors to be obtained in advance, which is difficult except for image data. This paper extends TD to more general pattern classification tasks. The basic assumption is that tangent vectors can be approximately represented by the pattern variations. We propose three probabilistic subspace models to encode the variations: the linear subspace, nonlinear subspace, and manifold subspace models. These three models are addressed in a unified view, namely Probabilistic Tangent Subspace (PTS). Experiments show that PTS can achieve promising classification performance on non-image data sets.
|
|
|
Entropy-based criterion in categorical clustering |
| |
Tao Li,
Sheng Ma,
Mitsunori Ogihara
|
|
Page: 68 |
|
doi>10.1145/1015330.1015404 |
|
Full text: PDF
|
|
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients. An iterative Monte Carlo procedure is then presented to search for the partitions minimizing the criterion. Experiments are conducted to show the effectiveness of the proposed procedure.
|
|
|
Decision trees with minimal costs |
| |
Charles X. Ling,
Qiang Yang,
Jianning Wang,
Shichao Zhang
|
|
Page: 69 |
|
doi>10.1145/1015330.1015369 |
|
Full text: PDF
|
|
We propose a simple, novel and yet effective method for building and testing decision trees that minimizes the sum of the misclassification and test costs. More specifically, we first put forward an original and simple splitting criterion for attribute selection in tree building. Our tree-building algorithm has many desirable properties for a cost-sensitive learning system that must account for both types of costs. Then, assuming that the test cases may have a large number of missing values, we design several intelligent test strategies that can suggest ways of obtaining the missing values at a cost in order to minimize the total cost. We experimentally compare these strategies and C4.5, and demonstrate that our new algorithms significantly outperform C4.5 and its variations. In addition, our algorithm's complexity is similar to that of C4.5, and is much lower than that of previous work. Our work is useful for many diagnostic tasks which must factor in the misclassification and test costs for obtaining missing information. expand
|
|
|
Extensions of marginalized graph kernels |
| |
Pierre Mahé,
Nobuhisa Ueda,
Tatsuya Akutsu,
Jean-Luc Perret,
Jean-Philippe Vert
|
|
Page: 70 |
|
doi>10.1145/1015330.1015446 |
|
Full text: PDF
|
|
Positive definite kernels between labeled graphs have recently been proposed. They enable the application of kernel methods, such as support vector machines, to the analysis and classification of graphs, for example, chemical compounds. These graph kernels are obtained by marginalizing a kernel between paths with respect to a random walk model on the graph vertices along the edges. We propose two extensions of these graph kernels, with the double goal to reduce their computation time and increase their relevance as measure of similarity between graphs. First, we propose to modify the label of each vertex by automatically adding information about its environment with the use of the Morgan algorithm. Second, we suggest a modification of the random walk model to prevent the walk from coming back to a vertex that was just visited. These extensions are then tested on benchmark experiments of chemical compounds classification, with promising results. expand
|
|
|
Dynamic abstraction in reinforcement learning via clustering |
| |
Shie Mannor,
Ishai Menache,
Amit Hoze,
Uri Klein
|
|
Page: 71 |
|
doi>10.1145/1015330.1015355 |
|
Full text: PDF
|
|
We consider a graph theoretic approach for automatic construction of options in a dynamic environment. A map of the environment is generated on-line by the learning agent, representing the topological structure of the state transitions. A clustering algorithm is then used to partition the state space to different regions. Policies for reaching the different parts of the space are separately learned and added to the model in a form of options (macro-actions). The options are used for accelerating the Q-Learning algorithm. We extend the basic algorithm and consider building a map that includes preliminary indication of the location of "interesting" regions of the state space, where the value gradient is significant and additional exploration might be beneficial. Experiments indicate significant speedups, especially in the initial learning phase. expand
|
|
|
Bias and variance in value function estimation |
| |
Shie Mannor,
Duncan Simester,
Peng Sun,
John N. Tsitsiklis
|
|
Page: 72 |
|
doi>10.1145/1015330.1015402 |
|
Full text: PDF
|
|
We consider the bias and variance of value function estimation that are caused by using an empirical model instead of the true model. We analyze these bias and variance for Markov processes from a classical (frequentist) statistical point of view, and in a Bayesian setting. Using a second order approximation, we provide explicit expressions for the bias and variance in terms of the transition counts and the reward statistics. We present supporting experiments with artificial Markov chains and with a large transactional database provided by a mail-order catalog firm. expand
|
|
|
The multiple multiplicative factor model for collaborative filtering |
| |
Benjamin Marlin,
Richard S. Zemel
|
|
Page: 73 |
|
doi>10.1145/1015330.1015437 |
|
Full text: PDF
|
|
We describe a class of causal, discrete latent variable models called Multiple Multiplicative Factor models (MMFs). A data vector is represented in the latent space as a vector of factors that have discrete, non-negative expression levels. Each factor proposes a distribution over the data vector. The distinguishing feature of MMFs is that they combine the factors' proposed distributions multiplicatively, taking into account factor expression levels. The product formulation of MMFs allows factors to specialize to a subset of the items, while the causal generative semantics mean that MMFs can readily accommodate missing data. This makes MMFs distinct from both directed models with mixture semantics and undirected product models. In this paper we present empirical results from the collaborative filtering domain showing that a binary/multinomial MMF model matches the performance of the best existing models while learning an interesting latent space description of the users.
|
|
|
Diverse ensembles for active learning |
| |
Prem Melville,
Raymond J. Mooney
|
|
Page: 74 |
|
doi>10.1145/1015330.1015385 |
|
Full text: PDF
|
|
Query by Committee is an effective approach to selective sampling in which disagreement amongst an ensemble of hypotheses is used to select data for labeling. Query by Bagging and Query by Boosting are two practical implementations of this approach that use Bagging and Boosting, respectively, to build the committees. For effective active learning, it is critical that the committee be made up of consistent hypotheses that are very different from each other. DECORATE is a recently developed method that directly constructs such diverse committees using artificial training data. This paper introduces ACTIVE-DECORATE, which uses DECORATE committees to select good training examples. Extensive experimental results demonstrate that, in general, ACTIVE-DECORATE outperforms both Query by Bagging and Query by Boosting. expand
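The committee-disagreement step shared by these methods can be sketched with an ordinary bagged committee and vote entropy as the disagreement measure. This stands in for, and is not, the artificially diversified DECORATE committee of the paper; all names and parameters are illustrative.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def most_disagreed(X_labeled, y_labeled, X_pool, n_queries=5, n_members=10):
    # Return indices of pool points on which a committee disagrees most (vote entropy).
    committee = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_members,
                                  random_state=0).fit(X_labeled, y_labeled)
    votes = np.array([m.predict(X_pool) for m in committee.estimators_])   # (members, pool)
    entropies = []
    for col in votes.T:                                # one column of votes per pool point
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log(p)).sum())       # high entropy = informative query
    return np.argsort(entropies)[::-1][:n_queries]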
|
|
|
Convergence of synchronous reinforcement learning with linear function approximation |
| |
Artur Merke,
Ralf Schoknecht
|
|
Page: 75 |
|
doi>10.1145/1015330.1015390 |
|
Full text: PDF
|
|
Synchronous reinforcement learning (RL) algorithms with linear function approximation are representable as inhomogeneous matrix iterations of a special form (Schoknecht & Merke, 2003). In this paper we state conditions of convergence for general inhomogeneous matrix iterations and prove that they are both necessary and sufficient. This result extends the work presented in (Schoknecht & Merke, 2003), where only a sufficient condition of convergence was proved. As the condition of convergence is necessary and sufficient, the new result is suitable to prove convergence and divergence of RL algorithms with function approximation. We use the theorem to deduce a new concise proof of convergence for the synchronous residual gradient algorithm (Baird, 1995). Moreover, we derive a counterexample for which the uniform RL algorithm (Merke & Schoknecht, 2002) diverges. This yields a negative answer to the open question if the uniform RL algorithm converges for arbitrary multiple transitions. expand
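For the textbook iteration w <- Aw + b, convergence from every starting point is governed by the eigenvalues of A. The quick check below tests the classical condition that the spectral radius is strictly below one; it is only a stand-in for the paper's sharper necessary-and-sufficient condition for its special class of inhomogeneous iterations.

import numpy as np

def iteration_converges(A, tol=1e-12):
    # Classical test for w <- A w + b: spectral radius strictly below one guarantees
    # convergence from every start; the boundary cases need the finer analysis of the paper.
    return np.max(np.abs(np.linalg.eigvals(A))) < 1.0 - tol

print(iteration_converges(np.array([[0.5, 0.1], [0.0, 0.9]])))   # True: a contraction
print(iteration_converges(np.array([[0.0, -1.0], [1.0, 0.0]])))  # False: a pure rotation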
|
|
|
Learning to fly by combining reinforcement learning with behavioural cloning |
| |
Eduardo F. Morales,
Claude Sammut
|
|
Page: 76 |
|
doi>10.1145/1015330.1015384 |
|
Full text: PDF
|
|
Reinforcement learning deals with learning optimal or near optimal policies while interacting with the environment. Application domains with many continuous variables are difficult to solve with existing reinforcement learning methods due to the large search space. In this paper, we use a relational representation to define powerful abstractions that allow us to incorporate domain knowledge and re-use previously learned policies in other similar problems. We also describe how to learn useful actions from human traces using a behavioural cloning approach combined with an exploration phase. Since several conflicting actions may be induced for the same abstract state, reinforcement learning is used to learn an optimal policy over this reduced space. It is shown experimentally how a combination of behavioural cloning and reinforcement learning using a relational representation is powerful enough to learn how to fly an aircraft through different points in space and different turbulence conditions. expand
|
|
|
Learning first-order rules from data with multiple parts: applications on mining chemical compound data |
| |
Cholwich Nattee,
Sukree Sinthupinyo,
Masayuki Numao,
Takashi Okada
|
|
Page: 77 |
|
doi>10.1145/1015330.1015447 |
|
Full text: PDF
|
|
Inductive learning of first-order theories from examples has a serious bottleneck in the enormous hypothesis search space required, which makes existing learning approaches perform poorly compared to the propositional approach. Moreover, in order to choose appropriate candidates, all Inductive Logic Programming (ILP) systems use only quantitative information, e.g., the number of examples covered and the length of rules, which is insufficient for a search space containing many similar candidates. This paper introduces a novel approach that improves ILP by incorporating qualitative information into the search heuristics, focusing on the kind of data where one instance consists of several parts, as well as relations among parts. The approach aims to find the hypothesis describing each class by using both individual and relational characteristics of the parts of examples. This kind of data can be found in various domains, and is especially common in representing chemical compound structure: each compound is composed of atoms as parts, and bonds as relations between two atoms. We apply the proposed approach to discover rules describing the activity of compounds from their structures on two real-world datasets: mutagenicity in nitroaromatic compounds and dopamine antagonist compounds. The results were compared to an existing method using ten-fold cross validation, and we found that the proposed method produced significantly more accurate predictions.
|
|
|
Feature selection, L1 vs. L2 regularization, and rotational invariance |
| |
Andrew Y. Ng
|
|
Page: 78 |
|
doi>10.1145/1015330.1015435 |
|
Full text: PDF
|
|
We consider supervised learning in the presence of very many irrelevant features, and study two different regularization methods for preventing overfitting. Focusing on logistic regression, we show that using L1 regularization of the parameters, the sample complexity (i.e., the number of training examples required to learn "well") grows only logarithmically in the number of irrelevant features. This logarithmic rate matches the best known bounds for feature selection, and indicates that L1 regularized logistic regression can be effective even if there are exponentially more irrelevant features than training examples. We also give a lower bound showing that any rotationally invariant algorithm---including logistic regression with L2 regularization, SVMs, and neural networks trained by backpropagation---has a worst-case sample complexity that grows at least linearly in the number of irrelevant features.
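The contrast between the two penalties is easy to reproduce empirically. The sketch below fits L1- and L2-regularized logistic regression on synthetic data with many irrelevant features and counts non-zero weights; the data, solver, and regularization strength are illustrative choices, not those of the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_relevant, d_irrelevant = 100, 5, 500
X = rng.normal(size=(n, d_relevant + d_irrelevant))
y = (X[:, :d_relevant].sum(axis=1) > 0).astype(int)     # only the first 5 features matter

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.5).fit(X, y)
print("non-zero weights with L1:", int(np.sum(l1.coef_ != 0)))   # typically a handful
print("non-zero weights with L2:", int(np.sum(l2.coef_ != 0)))   # essentially all 505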
|
|
|
Active learning using pre-clustering |
| |
Hieu T. Nguyen,
Arnold Smeulders
|
|
Page: 79 |
|
doi>10.1145/1015330.1015349 |
|
Full text: PDF
|
|
The paper is concerned with two-class active learning. While the common approach for collecting data in active learning is to select samples close to the classification boundary, better performance can be achieved by taking into account the prior data distribution. The main contribution of the paper is a formal framework that incorporates clustering into active learning. The algorithm first constructs a classifier on the set of the cluster representatives, and then propagates the classification decision to the other samples via a local noise model. The proposed model makes it possible to select the most representative samples and to avoid repeatedly labeling samples in the same cluster. During the active learning process, the clustering is adjusted using a coarse-to-fine strategy in order to balance the advantage of large clusters against the accuracy of the data representation. The results of experiments on image databases show that our algorithm performs better than current methods.
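A minimal sketch of the pre-clustering idea, assuming scikit-learn's KMeans: query the label of the point nearest each cluster centre and propagate it to the rest of the cluster. The local noise model and the coarse-to-fine refinement of the paper are omitted, and oracle is a hypothetical labeling function (e.g., a human annotator).

import numpy as np
from sklearn.cluster import KMeans

def cluster_then_query(X_pool, oracle, n_clusters=10, random_state=0):
    # Label one representative per cluster and propagate that label to the cluster members.
    X_pool = np.asarray(X_pool, float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_pool)
    y_pred = np.empty(len(X_pool), dtype=int)
    for k in range(n_clusters):
        members = np.where(km.labels_ == k)[0]
        d = np.linalg.norm(X_pool[members] - km.cluster_centers_[k], axis=1)
        rep = members[np.argmin(d)]            # representative = member closest to the centre
        y_pred[members] = oracle(rep)          # one oracle query per cluster
    return y_pred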
|
|
|
Decentralized detection and classification using kernel methods |
| |
XuanLong Nguyen,
Martin J. Wainwright,
Michael I. Jordan
|
|
Page: 80 |
|
doi>10.1145/1015330.1015438 |
|
Full text: PDF
|
|
We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework of empirical risk minimization and marginalized kernels, and analyze its computational and statistical properties both theoretically and empirically. We provide an efficient implementation of the algorithm, and demonstrate its performance on both simulated and real data sets. expand
|
|
|
Learning with non-positive kernels |
| |
Cheng Soon Ong,
Xavier Mary,
Stéphane Canu,
Alexander J. Smola
|
|
Page: 81 |
|
doi>10.1145/1015330.1015443 |
|
Full text: PDF
|
|
In this paper we show that many kernel methods can be adapted to deal with indefinite kernels, that is, kernels which are not positive semidefinite. They do not satisfy Mercer's condition and they induce associated functional spaces called Reproducing Kernel Kreĭn Spaces (RKKS), a generalization of Reproducing Kernel Hilbert Spaces (RKHS). Machine learning in RKKS shares many "nice" properties of learning in RKHS, such as orthogonality and projection. However, since the kernels are indefinite, we can no longer minimize the loss; instead we stabilize it. We show a general representer theorem for constrained stabilization and prove generalization bounds by computing the Rademacher averages of the kernel class. We list several examples of indefinite kernels and investigate regularization methods to solve spline interpolation. Some preliminary experiments with indefinite kernels for spline smoothing are reported for truncated spectral factorization, Landweber-Fridman iterations, and MR-II.
|
|
|
Sequential information bottleneck for finite data |
| |
Jaakko Peltonen,
Janne Sinkkonen,
Samuel Kaski
|
|
Page: 82 |
|
doi>10.1145/1015330.1015375 |
|
Full text: PDF
|
|
The sequential information bottleneck (sIB) algorithm clusters co-occurrence data such as text documents vs. words. We introduce a variant that models sparse co-occurrence data by a generative process. This turns the objective function of sIB, mutual information, into a Bayes factor, while keeping it intact asymptotically, for non-sparse data. Experimental performance of the new algorithm is comparable to the original sIB for large data sets, and better for smaller, sparse sets. expand
|
|
|
A maximum entropy approach to species distribution modeling |
| |
Steven J. Phillips,
Miroslav Dudík,
Robert E. Schapire
|
|
Page: 83 |
|
doi>10.1145/1015330.1015412 |
|
Full text: PDF
|
|
We study the problem of modeling species geographic distributions, a critical problem in conservation biology. We propose the use of maximum-entropy techniques for this problem, specifically, sequential-update algorithms that can handle a very large number of features. We describe experiments comparing maxent with a standard distribution-modeling tool, called GARP, on a dataset containing observation data for North American breeding birds. We also study how well maxent performs as a function of the number of training examples and training time, analyze the use of regularization to avoid overfitting when the number of examples is small, and explore the interpretability of models constructed using maxent. expand
|
|
|
Incremental learning of linear model trees |
| |
Duncan Potts
|
|
Page: 84 |
|
doi>10.1145/1015330.1015372 |
|
Full text: PDF
|
|
A linear model tree is a decision tree with a linear functional model in each leaf. Previous model tree induction algorithms have operated on the entire training set; however, there are many situations in which an incremental learner is advantageous. In this paper we demonstrate that model trees can be induced incrementally using an algorithm that scales linearly with the number of examples. An incremental node splitting rule is presented, together with incremental methods for stopping the growth of the tree and pruning. Empirical testing in three domains, where the emphasis is on learning a dynamic model of the environment, shows that the algorithm can learn a more accurate approximation from fewer examples than other incremental methods. In addition, the induced models are smaller, and the learner requires less prior knowledge about the domain.
|
|
|
Predictive automatic relevance determination by expectation propagation |
| |
Yuan (Alan) Qi,
Thomas P. Minka,
Rosalind W. Picard,
Zoubin Ghahramani
|
|
Page: 85 |
|
doi>10.1145/1015330.1015418 |
|
Full text: PDF
|
|
In many real-world classification problems the input contains a large number of potentially irrelevant features. This paper proposes a new Bayesian framework for determining the relevance of input features. This approach extends one of the most successful Bayesian methods for feature selection and sparse learning, known as Automatic Relevance Determination (ARD). ARD finds the relevance of features by optimizing the model marginal likelihood, also known as the evidence. We show that this can lead to overfitting. To address this problem, we propose Predictive ARD based on estimating the predictive performance of the classifier. While the actual leave-one-out predictive performance is generally very costly to compute, the expectation propagation (EP) algorithm proposed by Minka provides an estimate of this predictive performance as a side-effect of its iterations. We exploit this in our algorithm to do feature selection, and to select data points in a sparse Bayesian kernel classifier. Moreover, we provide two other improvements to previous algorithms, by replacing Laplace's approximation with the generally more accurate EP, and by incorporating the fast optimization algorithm proposed by Faul and Tipping. Our experiments show that our method based on the EP estimate of predictive performance is more accurate on test data than relevance determination by optimizing the evidence.
|
|
|
Sequential skewing: an improved skewing algorithm |
| |
Soumya Ray,
David Page
|
|
Page: 86 |
|
doi>10.1145/1015330.1015392 |
|
Full text: PDF
|
|
This paper extends previous work on the Skewing algorithm, a promising approach that allows greedy decision tree induction algorithms to handle problematic functions such as parity functions with a lower run-time penalty than Lookahead. A deficiency of the previously proposed algorithm is its inability to scale up to high dimensional problems. In this paper, we describe a modified algorithm that scales better with increasing numbers of variables. We present experiments with randomly generated Boolean functions that evaluate the algorithm's response to increasing dimensions. We also evaluate the algorithm on a challenging real-world biomedical problem, that of SH3 domain binding. Our results indicate that our algorithm almost always outperforms an information gain-based decision tree learner.
|
|
|
Learning to cluster using local neighborhood structure |
| |
Rómer Rosales,
Kannan Achan,
Brendan Frey
|
|
Page: 87 |
|
doi>10.1145/1015330.1015403 |
|
Full text: PDF
|
|
This paper introduces an approach for clustering/classification which is based on the use of local, high-order structure present in the data. For some problems, this local structure might be more relevant for classification than other measures of point similarity used by popular unsupervised and semi-supervised clustering methods. Under this approach, changes in the class label are associated with changes in the local properties of the data. Using this idea, we also seek to learn how to cluster given examples of clustered data (including from different datasets). We make these concepts formal by presenting a probability model that captures their fundamentals and show that in this setting, learning to cluster is a well-defined and tractable task. Based on probabilistic inference methods, we then present an algorithm for computing the posterior probability distribution of class labels for each data point. Experiments in the domain of spatial grouping and functional gene classification are used to illustrate and test these concepts.
|
|
|
Learning low dimensional predictive representations |
| |
Matthew Rosencrantz,
Geoff Gordon,
Sebastian Thrun
|
|
Page: 88 |
|
doi>10.1145/1015330.1015441 |
|
Full text: PDF
|
|
Predictive state representations (PSRs) have recently been proposed as an alternative to partially observable Markov decision processes (POMDPs) for representing the state of a dynamical system (Littman et al., 2001). We present a learning algorithm that learns a PSR from observational data. Our algorithm produces a variant of PSRs called transformed predictive state representations (TPSRs). We provide an efficient principal-components-based algorithm for learning a TPSR, and show that TPSRs can perform well in comparison to Hidden Markov Models learned with Baum-Welch in a real-world robot tracking task for low dimensional representations and long prediction horizons.
|
|
|
Model selection via the AUC |
| |
Saharon Rosset
|
|
Page: 89 |
|
doi>10.1145/1015330.1015400 |
|
Full text: PDF
|
|
We present a statistical analysis of the AUC as an evaluation criterion for classification scoring models. First, we consider significance tests for the difference between AUC scores of two algorithms on the same test set. We derive exact moments under simplifying assumptions and use them to examine approximate practical methods from the literature. We then compare AUC to empirical misclassification error when the prediction goal is to minimize future error rate. We show that the AUC may be preferable to empirical error even in this case and discuss the tradeoff between approximation error and estimation error underlying this phenomenon.
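For reference, the AUC itself is the Mann-Whitney U statistic normalized by the number of positive-negative pairs; the sketch below computes it that way, with midranks handling ties. This covers only the evaluation criterion being analyzed, not the paper's significance tests, and the example scores are invented.

```python
import numpy as np
from scipy.stats import rankdata

def auc(scores, labels):
    """AUC as the normalized Mann-Whitney U statistic (midranks handle ties)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    ranks = rankdata(scores)                      # average ranks for tied scores
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))   # 0.75: one of four pairs mis-ordered
```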
|
|
|
Towards tight bounds for rule learning |
| |
Ulrich Rückert,
Stefan Kramer
|
|
Page: 90 |
|
doi>10.1145/1015330.1015387 |
|
Full text: PDF
|
|
While there is a lot of empirical evidence showing that traditional rule learning approaches work well in practice, it is nearly impossible to derive analytical results about their predictive accuracy. In this paper, we investigate rule-learning from a theoretical perspective. We show that the application of McAllester's PAC-Bayesian bound to rule learning yields a practical learning algorithm, which is based on ensembles of weighted rule sets. Experiments with the resulting learning algorithm show not only that it is competitive with state-of-the-art rule learners, but also that its error rate can often be bounded tightly. In fact, the bound turns out to be tighter than one of the "best" bounds for a practical learning scheme known so far (the Set Covering Machine). Finally, we prove that the bound can be further improved by allowing the learner to abstain from uncertain predictions.
|
|
|
Adaptive cognitive orthotics: combining reinforcement learning and constraint-based temporal reasoning |
| |
Matthew Rudary,
Satinder Singh,
Martha E. Pollack
|
|
Page: 91 |
|
doi>10.1145/1015330.1015411 |
|
Full text: PDF
|
|
Reminder systems support people with impaired prospective memory and/or executive function by providing them with reminders of their functional daily activities. We integrate temporal constraint reasoning with reinforcement learning (RL) to build an adaptive reminder system, and in a simulated environment we demonstrate that it can personalize to a user and adapt to both short- and long-term changes. In addition to advancing the application domain, our integrated algorithm contributes to research on temporal constraint reasoning by showing how RL can select an optimal policy from amongst a set of temporally consistent ones, and it contributes to the work on RL by showing how temporal constraint reasoning can be used to dramatically reduce the space of actions from which an RL agent needs to learn.
|
|
|
Online learning of conditionally I.I.D. data |
| |
Daniil Ryabko
|
|
Page: 92 |
|
doi>10.1145/1015330.1015340 |
|
Full text: PDF
|
|
In this work we consider the task of relaxing the i.i.d. assumption in online pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Online pattern recognition is predicting a sequence of labels based on objects given for each label and on examples (pairs of objects and labels) learned so far. Traditionally, this task is considered under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a much weaker assumption: only the objects are assumed to be conditionally independent and identically distributed, while the only condition on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of a classification error achieved under the classical i.i.d. assumption can be generalised to similar estimates for the case of conditionally i.i.d. distributed examples.
|
|
|
Coalition calculation in a dynamic agent environment |
| |
Ted Scully,
Michael G. Madden,
Gerard Lyons
|
|
Page: 93 |
|
doi>10.1145/1015330.1015380 |
|
Full text: PDF
|
|
We consider a dynamic market-place of self-interested agents with differing capabilities. A task to be completed is proposed to the agent population. An agent attempts to form a coalition of agents to perform the task. Before proposing a coalition, the agent must determine the optimal set of agents with whom to enter into a coalition for this task; we refer to this activity as coalition calculation. To determine the optimal coalition, the agent must have a means of calculating the value of any given coalition. Multiple metrics (cost, time, quality, etc.) determine the true value of a coalition. However, because of conflicting metrics, differing metric importance and the tendency of metric importance to vary over time, it is difficult to obtain a true valuation of a given coalition. Previous work has not addressed these issues. We present a solution based on the adaptation of a multi-objective optimization evolutionary algorithm. In order to obtain a true valuation of any coalition, we use the concept of Pareto dominance coupled with a distance weighting algorithm. We determine the Pareto optimal set of coalitions and then use an instance-based learning algorithm to select the optimal coalition. We show through empirical evaluation that the proposed technique is capable of eliciting metric importance and adapting to metric variation over time.
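The Pareto-dominance step can be made concrete: given candidate coalitions scored on several metrics (all oriented so that smaller is better), keep the non-dominated ones. The sketch below does exactly that on invented coalition scores; the distance-weighting and instance-based selection stages of the paper are not shown.

```python
import numpy as np

def pareto_optimal(scores):
    """Return indices of non-dominated rows; all objectives are minimized."""
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, row in enumerate(scores):
        dominated = any(
            np.all(other <= row) and np.any(other < row)
            for j, other in enumerate(scores) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Candidate coalitions scored as (cost, time, -quality); all minimized.
coalitions = np.array([[5.0, 2.0, -0.9],
                       [4.0, 3.0, -0.7],
                       [6.0, 2.5, -0.6],   # dominated by the first row
                       [3.0, 4.0, -0.8]])
print(pareto_optimal(coalitions))           # [0, 1, 3]
```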
|
|
|
Online and batch learning of pseudo-metrics |
| |
Shai Shalev-Shwartz,
Yoram Singer,
Andrew Y. Ng
|
|
Page: 94 |
|
doi>10.1145/1015330.1015376 |
|
Full text: PDF
|
|
We describe and analyze an online algorithm for supervised learning of pseudo-metrics. The algorithm receives pairs of instances and predicts their similarity according to a pseudo-metric. The pseudo-metrics we use are quadratic forms parameterized by positive semi-definite matrices. The core of the algorithm is an update rule that is based on successive projections onto the positive semi-definite cone and onto half-space constraints imposed by the examples. We describe an efficient procedure for performing these projections, derive a worst case mistake bound on the similarity predictions, and discuss a dual version of the algorithm in which it is simple to incorporate kernel operators. The online algorithm also serves as a building block for deriving a large-margin batch algorithm. We demonstrate the merits of the proposed approach by conducting experiments on the MNIST dataset and on document filtering.
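The central geometric operation, projection onto the positive semi-definite cone, can be sketched by eigendecomposing the symmetric matrix and clipping negative eigenvalues at zero. The toy update below nudges the quadratic-form matrix along one "similar" pair and re-projects; it illustrates only the PSD projection, not the paper's half-space projections or mistake-bound analysis, and the data and step size are invented.

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2.0)
    return (V * np.maximum(w, 0.0)) @ V.T

def pseudo_metric(A, x, z):
    d = x - z
    return float(d @ A @ d)          # squared pseudo-distance under quadratic form A

# Toy online step: if a "similar" pair is too far apart, shrink A along that
# direction (illustrative gradient step), then re-project onto the PSD cone.
rng = np.random.default_rng(0)
A = np.eye(3)
x, z = rng.normal(size=3), rng.normal(size=3)
d = x - z
A -= 0.2 * np.outer(d, d)            # move the similar pair's distance downward
A = project_psd(A)
print("distance after update:", pseudo_metric(A, x, z))
print("min eigenvalue:", np.linalg.eigvalsh(A).min())   # non-negative up to roundoff
```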
|
|
|
Using relative novelty to identify useful temporal abstractions in reinforcement learning |
| |
Özgür Şimşek,
Andrew G. Barto
|
|
Page: 95 |
|
doi>10.1145/1015330.1015353 |
|
Full text: PDF
|
|
We present a new method for automatically creating useful temporal abstractions in reinforcement learning. We argue that states that allow the agent to transition to a different region of the state space are useful subgoals, and propose a method for identifying them using the concept of relative novelty. When such a state is identified, a temporally-extended activity (e.g., an option) is generated that takes the agent efficiently to this state. We illustrate the utility of the method in a number of tasks.
|
|
|
Generative modeling for continuous non-linearly embedded visual inference |
| |
Cristian Sminchisescu,
Allan Jepson
|
|
Page: 96 |
|
doi>10.1145/1015330.1015371 |
|
Full text: PDF
|
|
Many difficult visual perception problems, like 3D human motion estimation, can be formulated in terms of inference using complex generative models, defined over high-dimensional state spaces. Despite progress, optimizing such models is difficult because prior knowledge cannot be flexibly integrated in order to reshape an initially designed representation space. Nonlinearities, the inherent sparsity of high-dimensional training sets, and lack of global continuity make dimensionality reduction challenging and low-dimensional search inefficient. To address these problems, we present a learning and inference algorithm that restricts visual tracking to automatically extracted, non-linearly embedded, low-dimensional spaces. This formulation produces a layered generative model with a reduced state representation that can be estimated using efficient continuous optimization methods. Our prior flattening method allows a simple analytic treatment of low-dimensional intrinsic curvature constraints, and allows consistent interpolation operations. We analyze reduced manifolds for human interaction activities, and demonstrate that the algorithm learns continuous generative models that are useful for tracking and for the reconstruction of 3D human motion in monocular video.
|
|
|
Efficient hierarchical MCMC for policy search |
| |
Malcolm Strens
|
|
Page: 97 |
|
doi>10.1145/1015330.1015381 |
|
Full text: PDF
|
|
Many inference and optimization tasks in machine learning can be solved by sampling approaches such as Markov Chain Monte Carlo (MCMC) and simulated annealing. These methods can be slow if a single target density query requires many runs of a simulation (or a complete sweep of a training data set). We introduce a hierarchy of MCMC samplers that allow most steps to be taken in the solution space using only a small sample of simulation runs (or training examples). This is shown to accelerate learning in a policy search optimization task.
|
|
|
Automated hierarchical mixtures of probabilistic principal component analyzers |
| |
Ting Su,
Jennifer G. Dy
|
|
Page: 98 |
|
doi>10.1145/1015330.1015393 |
|
Full text: PDF
|
|
Many clustering algorithms fail when dealing with high dimensional data. Principal component analysis (PCA) is a popular dimensionality reduction algorithm. However, it assumes a single multivariate Gaussian model, which provides a global linear projection of the data. A mixture of probabilistic principal component analyzers (PPCA) provides a better model for clustering: it provides a local linear PCA projection for each multivariate Gaussian cluster component. We extend this model to build hierarchical mixtures of PPCA. Hierarchical clustering provides a flexible representation showing relationships among clusters at various perceptual levels. We introduce an automated hierarchical mixture of PPCA algorithm, which utilizes the integrated classification likelihood as a criterion for splitting and for stopping the addition of hierarchical levels. An automated approach requires automated methods for initialization, determining the number of principal component dimensions, and determining when to split clusters. We address each of these in the paper. This automated approach results in a coarse-to-fine local component model with varying projections and with a different number of dimensions for each cluster.
|
|
|
Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data |
| |
Charles Sutton,
Khashayar Rohanimanesh,
Andrew McCallum
|
|
Page: 99 |
|
doi>10.1145/1015330.1015422 |
|
Full text: PDF
|
|
In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges---a distributed state representation as in dynamic Bayesian networks (DBNs)---and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data.
|
|
|
Interpolation-based Q-learning |
| |
Csaba Szepesvári,
William D. Smart
|
|
Page: 100 |
|
doi>10.1145/1015330.1015445 |
|
Full text: PDF
|
|
We consider a variant of Q-learning in continuous state spaces under the total expected discounted cost criterion combined with local function approximation methods. Provided that the function approximator satisfies certain interpolation properties, the resulting algorithm is shown to converge with probability one. The limit function is shown to satisfy a fixed point equation of the Bellman type, where the fixed point operator depends on the stationary distribution of the exploration policy and the function approximation method. The basic algorithm is extended in several ways. In particular, a variant of the algorithm is obtained that is shown to converge in probability to the optimal Q function. Preliminary computer simulations are presented that confirm the validity of the approach.
|
|
|
SVM-based generalized multiple-instance learning via approximate box counting |
| |
Qingping Tao,
Stephen Scott,
N. V. Vinodchandran,
Thomas Takeo Osugi
|
|
Page: 101 |
|
doi>10.1145/1015330.1015405 |
|
Full text: PDF
|
|
The multiple-instance learning (MIL) model has been very successful in application areas such as drug discovery and content-based image-retrieval. Recently, a generalization of this model and an algorithm for this generalization were introduced, showing significant advantages over the conventional MIL model in certain application areas. Unfortunately, this algorithm is inherently inefficient, preventing scaling to high dimensions. We reformulate this algorithm using a kernel for a support vector machine, reducing its time complexity from exponential to polynomial. Computing the kernel is equivalent to counting the number of axis-parallel boxes in a discrete, bounded space that contain at least one point from each of two multisets P and Q. We show that this problem is #P-complete, but then give a fully polynomial randomized approximation scheme (FPRAS) for it. Finally, we empirically evaluate our kernel.
|
|
|
Learning associative Markov networks |
| |
Ben Taskar,
Vassil Chatalbashev,
Daphne Koller
|
|
Page: 102 |
|
doi>10.1145/1015330.1015444 |
|
Full text: PDF
|
|
Markov networks are extensively used to model complex sequential, spatial, and relational interactions in fields as diverse as image processing, natural language analysis, and bioinformatics. However, inference and learning in general Markov networks are intractable. In this paper, we focus on learning a large subclass of such models (called associative Markov networks) that are tractable or closely approximable. This subclass contains networks of discrete variables with K labels each and clique potentials that favor the same labels for all variables in the clique. Such networks capture the "guilt by association" pattern of reasoning present in many domains, in which connected ("associated") variables tend to have the same label. Our approach exploits a linear programming relaxation for the task of finding the best joint assignment in such networks, which provides an approximate quadratic program (QP) for the problem of learning a margin-maximizing Markov network. We show that for associative Markov networks over binary-valued variables, this approximate QP is guaranteed to return an optimal parameterization for Markov networks of arbitrary topology. For the nonbinary case, optimality is not guaranteed, but the relaxation produces good solutions in practice. Experimental results with hypertext and newswire classification show significant advantages over standard approaches.
|
|
|
Learning random walk models for inducing word dependency distributions |
| |
Kristina Toutanova,
Christopher D. Manning,
Andrew Y. Ng
|
|
Page: 103 |
|
doi>10.1145/1015330.1015442 |
|
Full text: PDF
|
|
Many NLP tasks rely on accurately estimating word dependency probabilities P(w1|w2), where the words w1 and w2 have a particular relationship (such as verb-object). Because of the sparseness of counts of such dependencies, smoothing and the ability to use multiple sources of knowledge are important challenges. For example, if the probability P(N|V) of noun N being the subject of verb V is high, and V takes similar objects to V', and V' is synonymous with V", then we want to conclude that P(N|V") should also be reasonably high---even when those words did not co-occur in the training data. To capture these higher order relationships, we propose a Markov chain model, whose stationary distribution is used to give word probability estimates. Unlike the manually defined random walks used in some link analysis algorithms, we show how to automatically learn a rich set of parameters for the Markov chain's transition probabilities. We apply this model to the task of prepositional phrase attachment, obtaining an accuracy of 87.54%.
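The stationary distribution at the heart of such a random-walk model can be obtained by straightforward power iteration on the transition matrix, as in the sketch below; the tiny three-state chain is purely illustrative and unrelated to the learned walks in the paper.

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Stationary distribution of a row-stochastic transition matrix by power iteration."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = pi @ P                      # one step of the chain
        if np.abs(new - pi).sum() < tol:
            break
        pi = new
    return pi / pi.sum()

# Illustrative 3-state chain (e.g. three words linked by smoothing steps).
P = np.array([[0.80, 0.10, 0.10],
              [0.20, 0.60, 0.20],
              [0.25, 0.25, 0.50]])
pi = stationary_distribution(P)
print(np.round(pi, 3), "check:", np.allclose(pi, pi @ P))
```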
|
|
|
Support vector machine learning for interdependent and structured output spaces |
| |
Ioannis Tsochantaridis,
Thomas Hofmann,
Thorsten Joachims,
Yasemin Altun
|
|
Page: 104 |
|
doi>10.1145/1015330.1015341 |
|
Full text: PDF
|
|
Learning general functional dependencies is one of the main goals in machine learning. Recent progress in kernel-based methods has focused on designing flexible and powerful input representations. This paper addresses the complementary issue of problems involving complex outputs such as multiple dependent output variables and structured output spaces. We propose to generalize multiclass Support Vector Machine learning in a formulation that involves features extracted jointly from inputs and outputs. The resulting optimization problem is solved efficiently by a cutting plane algorithm that exploits the sparseness and structural decomposition of the problem. We demonstrate the versatility and effectiveness of our method on problems ranging from supervised grammar learning and named-entity recognition, to taxonomic text classification and sequence alignment.
|
|
|
A hierarchical method for multi-class support vector machines |
| |
Volkan Vural,
Jennifer G. Dy
|
|
Page: 105 |
|
doi>10.1145/1015330.1015427 |
|
Full text: PDF
|
|
We introduce a framework, which we call Divide-by-2 (DB2), for extending support vector machines (SVM) to multi-class problems. DB2 offers an alternative to the standard one-against-one and one-against-rest algorithms. For an N class problem, DB2 produces an N − 1 node binary decision tree where nodes represent decision boundaries formed by N − 1 SVM binary classifiers. This tree structure allows us to present a generalization and a time complexity analysis of DB2. Our analysis and related experiments show that DB2 is faster than the one-against-one and one-against-rest algorithms in terms of testing time, significantly faster than one-against-rest in terms of training time, and that the cross-validation accuracy of DB2 is comparable to these two methods.
|
|
|
Learning a kernel matrix for nonlinear dimensionality reduction |
| |
Kilian Q. Weinberger,
Fei Sha,
Lawrence K. Saul
|
|
Page: 106 |
|
doi>10.1145/1015330.1015345 |
|
Full text: PDF
|
|
We investigate how to learn a kernel matrix for high dimensional data that lies on or near a low dimensional manifold. Noting that the kernel matrix implicitly maps the data into a nonlinear feature space, we show how to discover a mapping that "unfolds" the underlying manifold from which the data was sampled. The kernel matrix is constructed by maximizing the variance in feature space subject to local constraints that preserve the angles and distances between nearest neighbors. The main optimization involves an instance of semidefinite programming---a fundamentally different computation than previous algorithms for manifold learning, such as Isomap and locally linear embedding. The optimized kernels perform better than polynomial and Gaussian kernels for problems in manifold learning, but worse for problems in large margin classification. We explain these results in terms of the geometric properties of different kernels and comment on various interpretations of other manifold learning algorithms as kernel methods.
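The optimization described here can be written down directly as a semidefinite program: maximize the trace of a centered kernel matrix subject to preserving distances between nearest neighbors. The cvxpy sketch below does this on a tiny noisy circle; it keeps only the neighbor-distance constraints (not the angle-preserving ones), and the data, neighborhood size, and solver defaults are illustrative rather than the paper's setup.

```python
import numpy as np
import cvxpy as cp
from sklearn.neighbors import NearestNeighbors

# Tiny noisy circle in 3-D as stand-in data.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.normal(size=t.size)]
n, k = X.shape[0], 3

nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nbrs.kneighbors(X)                       # idx[:, 0] is the point itself

K = cp.Variable((n, n), PSD=True)                 # the kernel (Gram) matrix to learn
constraints = [cp.sum(K) == 0]                    # center the embedding
for i in range(n):
    for j in idx[i, 1:]:
        d2 = float(np.sum((X[i] - X[j]) ** 2))    # preserve local distances
        constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d2)

prob = cp.Problem(cp.Maximize(cp.trace(K)), constraints)   # maximize variance
prob.solve()

w, V = np.linalg.eigh(K.value)
embedding = V[:, -2:] * np.sqrt(np.maximum(w[-2:], 0))     # top-2 "unfolded" coordinates
print("top eigenvalues:", np.round(w[-4:], 3))
```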
|
|
|
Approximate inference by Markov chains on union spaces |
| |
Max Welling,
Michal Rosen-Zvi,
Yee Whye Teh
|
|
Page: 107 |
|
doi>10.1145/1015330.1015396 |
|
Full text: PDF
|
|
A standard method for approximating averages in probabilistic models is to construct a Markov chain in the product space of the random variables with the desired equilibrium distribution. Since the number of configurations in this space grows exponentially with the number of random variables we often need to represent the distribution with samples. In this paper we show that if one is interested in averages over single variables only, an alternative Markov chain defined on the much smaller "union space", which can be evolved exactly, becomes feasible. The transition kernel of this Markov chain is based on conditional distributions for pairs of variables and we present ways to approximate them using approximate inference algorithms such as mean field, factorized neighbors and belief propagation. Robustness to these approximations and error bounds on the estimates follow from stability analysis for Markov chains. We also present ideas on a new class of algorithms that iterate between increasingly accurate estimates for conditional and marginal distributions. Experiments validate the proposed methods.
|
|
|
Utile distinction hidden Markov models |
| |
Daan Wierstra,
Marco Wiering
|
|
Page: 108 |
|
doi>10.1145/1015330.1015346 |
|
Full text: PDF
|
|
This paper addresses the problem of constructing good action selection policies for agents acting in partially observable environments, a class of problems generally known as Partially Observable Markov Decision Processes. We present a novel approach that uses a modification of the well-known Baum-Welch algorithm for learning a Hidden Markov Model (HMM) to predict both percepts and utility in a non-deterministic world. This enables an agent to make decisions based on its previous history of actions, observations, and rewards. Our algorithm, called Utile Distinction Hidden Markov Models (UDHMM), handles the creation of memory well in that it tends to create perceptual and utility distinctions only when needed, while it can still discriminate states based on histories of arbitrary length. The experimental results in highly stochastic problem domains show very good performance.
|
|
|
P3VI: a partitioned, prioritized, parallel value iterator |
| |
David Wingate,
Kevin D. Seppi
|
|
Page: 109 |
|
doi>10.1145/1015330.1015440 |
|
Full text: PDF
|
|
We present an examination of the state-of-the-art for using value iteration to solve large-scale discrete Markov Decision Processes. We introduce an architecture which combines three independent performance enhancements (the intelligent prioritization of computation, state partitioning, and massively parallel processing) into a single algorithm. We show that each idea improves performance in a different way, meaning that algorithm designers do not have to trade one improvement for another. We give special attention to parallelization issues, discussing how to efficiently partition states, distribute partitions to processors, minimize message passing and ensure high scalability. We present experimental results which demonstrate that this approach solves large problems in reasonable time.
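Setting partitioning and parallelism aside, the prioritization component alone can be sketched as Bellman backups driven by a priority queue, where a state's priority is how much its value last changed and updates are propagated to predecessors. The toy chain MDP below is invented for illustration and is not the paper's P3VI architecture.

```python
import heapq
import numpy as np

def prioritized_value_iteration(P, R, gamma=0.95, theta=1e-8):
    """P[a][s] -> list of (next_state, prob); R[s, a] rewards.
    Backs up states in order of how much their value last changed."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    preds = [set() for _ in range(n_states)]
    for a in range(n_actions):
        for s in range(n_states):
            for s2, _ in P[a][s]:
                preds[s2].add(s)

    def backup(s):
        return max(R[s, a] + gamma * sum(p * V[s2] for s2, p in P[a][s])
                   for a in range(n_actions))

    heap = [(-np.inf, s) for s in range(n_states)]   # everything starts high priority
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)
        new = backup(s)
        delta = abs(new - V[s])
        V[s] = new
        if delta > theta:
            for sp in preds[s]:                      # predecessors may now change too
                heapq.heappush(heap, (-delta, sp))
    return V

# Toy 4-state chain: action 0 stays, action 1 moves right; reward 1 at the last state.
n = 4
P = [[[(s, 1.0)] for s in range(n)],
     [[(min(s + 1, n - 1), 1.0)] for s in range(n)]]
R = np.zeros((n, 2))
R[n - 1, :] = 1.0
print(np.round(prioritized_value_iteration(P, R), 3))
```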
|
|
|
Improving SVM accuracy by training on auxiliary data sources |
| |
Pengcheng Wu,
Thomas G. Dietterich
|
|
Page: 110 |
|
doi>10.1145/1015330.1015436 |
|
Full text: PDF
|
|
The standard model of supervised learning assumes that training and test data are drawn from the same underlying distribution. This paper explores an application in which a second, auxiliary, source of data is available drawn from a different distribution. This auxiliary data is more plentiful, but of significantly lower quality, than the training and test data. In the SVM framework, a training example has two roles: (a) as a data point to constrain the learning process and (b) as a candidate support vector that can form part of the definition of the classifier. The paper considers using the auxiliary data in either (or both) of these roles. This auxiliary data framework is applied to a problem of classifying images of leaves of maple and oak trees using a kernel derived from the shapes of the leaves. Experiments show that when the training data set is very small, training with auxiliary data can produce large improvements in accuracy, even when the auxiliary data is significantly different from the training (and test) data. The paper also introduces techniques for adjusting the kernel scores of the auxiliary data points to make them more comparable to the training data points.
|
|
|
Bayesian haplotype inference via the Dirichlet process |
| |
Eric Xing,
Roded Sharan,
Michael I. Jordan
|
|
Page: 111 |
|
doi>10.1145/1015330.1015423 |
|
Full text: PDF
|
|
The problem of inferring haplotypes from genotypes of single nucleotide polymorphisms (SNPs) is essential for the understanding of genetic variation within and among populations, with important applications to the genetic analysis of disease propensities and other complex traits. The problem can be formulated as a mixture model, where the mixture components correspond to the pool of haplotypes in the population. The size of this pool is unknown; indeed, knowing the size of the pool would correspond to knowing something significant about the genome and its history. Thus methods for fitting the genotype mixture must crucially address the problem of estimating a mixture with an unknown number of mixture components. In this paper we present a Bayesian approach to this problem based on a nonparametric prior known as the Dirichlet process. The model also incorporates a likelihood that captures statistical errors in the haplotype/genotype relationship. We apply our approach to the analysis of both simulated and real genotype data, and compare to extant methods.
|
|
|
Generalized low rank approximations of matrices |
| |
Jieping Ye
|
|
Page: 112 |
|
doi>10.1145/1015330.1015347 |
|
Full text: PDF
|
|
We consider the problem of computing low rank approximations of matrices. The novelty of our approach is that the low rank approximations are computed on a sequence of matrices. Unlike the problem of low rank approximations of a single matrix, which has been well studied in the past, the algorithm proposed in this paper does not admit a closed form solution in general. We performed extensive experiments on face image data to evaluate the effectiveness of the proposed algorithm and to compare the computed low rank approximations with those obtained from the traditional Singular Value Decomposition based method.
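For a collection of matrices A_1, ..., A_m, one natural formulation of this problem seeks shared projections L and R maximizing the sum of ||L^T A_i R||_F^2; lacking a closed-form solution, it can be attacked by alternating eigen-updates as in the sketch below. This is a generic stand-in consistent with the description above, not necessarily the exact algorithm of the paper, and the synthetic "images" are invented.

```python
import numpy as np

def low_rank_collection(As, l, r, iters=20):
    """Find shared projections L (rows) and R (columns) maximizing
    sum_i ||L.T @ A_i @ R||_F^2 by alternating eigen-updates."""
    rows, cols = As[0].shape
    R = np.eye(cols, r)                              # simple initialization
    for _ in range(iters):
        ML = sum(A @ R @ R.T @ A.T for A in As)
        L = np.linalg.eigh(ML)[1][:, -l:]            # top-l eigenvectors
        MR = sum(A.T @ L @ L.T @ A for A in As)
        R = np.linalg.eigh(MR)[1][:, -r:]
    cores = [L.T @ A @ R for A in As]                # compressed representations
    return L, R, cores

# Toy collection of 10 "images" sharing low-rank row/column structure plus noise.
rng = np.random.default_rng(0)
L0, R0 = rng.normal(size=(32, 3)), rng.normal(size=(24, 3))
As = [L0 @ rng.normal(size=(3, 3)) @ R0.T + 0.05 * rng.normal(size=(32, 24))
      for _ in range(10)]
L, R, cores = low_rank_collection(As, l=3, r=3)
recon_err = np.mean([np.linalg.norm(A - L @ C @ R.T) / np.linalg.norm(A)
                     for A, C in zip(As, cores)])
print("mean relative reconstruction error:", round(float(recon_err), 3))
```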
|
|
|
Feature extraction via generalized uncorrelated linear discriminant analysis |
| |
Jieping Ye,
Ravi Janardan,
Qi Li,
Haesun Park
|
|
Page: 113 |
|
doi>10.1145/1015330.1015348 |
|
Full text: PDF
|
|
Feature extraction is important in many applications, such as text and image retrieval, because of high dimensionality. Uncorrelated Linear Discriminant Analysis (ULDA) was recently proposed for feature extraction. The features extracted via ULDA were shown to be statistically uncorrelated, which is desirable for many applications. In this paper, we first propose the ULDA/QR algorithm to simplify the previous implementation of ULDA. Then we propose the ULDA/GSVD algorithm, based on a novel optimization criterion, to address the singularity problem. It is applicable to undersampled problems, where the data dimension is much larger than the data size, such as text and image retrieval. The novel criterion used in ULDA/GSVD is a perturbed version of the one from ULDA/QR, while, surprisingly, the solution to ULDA/GSVD is shown to be independent of the amount of perturbation applied. We performed extensive experiments on text and face image data to show the effectiveness of ULDA/GSVD and to compare it with other popular feature extraction algorithms.
|
|
|
Learning and evaluating classifiers under sample selection bias |
| |
Bianca Zadrozny
|
|
Page: 114 |
|
doi>10.1145/1015330.1015425 |
|
Full text: PDF
|
|
Classifier learning methods commonly assume that the training data consist of randomly drawn examples from the same distribution as the test examples about which the learned model is expected to make predictions. In many practical situations, however, this assumption is violated, in a problem known in econometrics as sample selection bias. In this paper, we formalize the sample selection bias problem in machine learning terms and study analytically and experimentally how a number of well-known classifier learning methods are affected by it. We also present a bias correction method that is particularly useful for classifier evaluation under sample selection bias.
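One widely used correction for selection that depends on the inputs, shown here as a generic sketch rather than the paper's specific method, is to model P(selected | x) and reweight selected examples by its inverse (inverse propensity weighting); the synthetic data, the logistic propensity model, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Population: two features; selection into the training sample depends on x[:, 0],
# so the selected sample is biased with respect to the population.
X = rng.normal(size=(20_000, 2))
y = ((X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=len(X))) > 0).astype(int)
p_select = 1 / (1 + np.exp(-2.0 * X[:, 0]))
selected = rng.random(len(X)) < p_select

# Model the selection mechanism, then weight each selected example by
# 1 / P(selected | x).
sel_model = LogisticRegression().fit(X, selected.astype(int))
w = 1.0 / sel_model.predict_proba(X[selected])[:, 1]

print("population mean of x0:        ", X[:, 0].mean().round(3))
print("selected-sample mean of x0:   ", X[selected, 0].mean().round(3))
print("weighted selected mean of x0: ", np.average(X[selected, 0], weights=w).round(3))

# The same weights can be passed to a learner to approximate training on an
# unbiased sample.
clf = LogisticRegression().fit(X[selected], y[selected], sample_weight=w)
```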
|
|
|
Probabilistic score estimation with piecewise logistic regression |
| |
Jian Zhang,
Yiming Yang
|
|
Page: 115 |
|
doi>10.1145/1015330.1015335 |
|
Full text: PDF
|
|
Well-calibrated probabilities are necessary in many applications like probabilistic frameworks or cost-sensitive tasks. Based on the previous success of the asymmetric Laplace method in calibrating text classifiers' scores, we propose to use piecewise logistic regression, a simple extension of standard logistic regression, as an alternative method in the discriminative family. We show that both methods have the flexibility to be piecewise linear functions in log-odds, but they are based on quite different assumptions. We evaluated the asymmetric Laplace method, piecewise logistic regression and standard logistic regression over standard text categorization collections (Reuters-21578 and TREC-AP) with three classifiers (SVM, Naive Bayes and Logistic Regression Classifier), and observed that piecewise logistic regression performs significantly better than the other two methods in the log-loss metric.
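One simple way to obtain log-odds that are piecewise linear in a classifier's score is to feed the score together with hinge features max(0, s - c) at a few knots into an ordinary logistic regression, as sketched below. This is only one plausible reading of "piecewise logistic regression"; the knots, the synthetic scores, and the helper hinge_features are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def hinge_features(scores, knots):
    """Score plus hinge terms max(0, s - c): log-odds become piecewise linear in s."""
    s = np.asarray(scores, dtype=float).reshape(-1, 1)
    return np.hstack([s] + [np.maximum(0.0, s - c) for c in knots])

# Synthetic classifier scores whose true calibration curve bends at s = 0.
rng = np.random.default_rng(0)
scores = rng.normal(size=4000)
true_logodds = np.where(scores < 0, 0.5 * scores, 3.0 * scores)
y = (rng.random(4000) < 1 / (1 + np.exp(-true_logodds))).astype(int)

knots = [-1.0, 0.0, 1.0]
calibrator = LogisticRegression(C=10.0).fit(hinge_features(scores, knots), y)
probs = calibrator.predict_proba(hinge_features(np.array([-2.0, 0.0, 2.0]), knots))[:, 1]
print(np.round(probs, 3))   # calibrated probabilities at three example scores
```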
|
|
|
Solving large scale linear prediction problems using stochastic gradient descent algorithms |
| |
Tong Zhang
|
|
Page: 116 |
|
doi>10.1145/1015330.1015332 |
|
Full text: PDF
|
|
Linear prediction methods, such as least squares for regression, logistic regression and support vector machines for classification, have been extensively used in statistics and machine learning. In this paper, we study stochastic gradient descent (SGD) algorithms on regularized forms of linear prediction methods. This class of methods, related to online algorithms such as the perceptron, is both efficient and very simple to implement. We obtain numerical rates of convergence for such algorithms and discuss their implications. Experiments on text data are provided to demonstrate the numerical and statistical consequences of our theoretical findings.
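As a concrete instance of the class of methods studied, the sketch below runs plain SGD on the L2-regularized logistic loss with a decaying step size; the dataset, step-size schedule, and regularization constant are illustrative and are not the settings analyzed in the paper.

```python
import numpy as np

def sgd_logistic(X, y, lam=1e-3, eta0=0.5, epochs=5, seed=0):
    """Plain SGD on the L2-regularized logistic loss; y takes values in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = eta0 / np.sqrt(t)                       # decaying step size
            margin = y[i] * (X[i] @ w)
            grad = -y[i] * X[i] / (1 + np.exp(margin)) + lam * w
            w -= eta * grad
    return w

# Synthetic, roughly linearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
w_true = rng.normal(size=20)
y = np.sign(X @ w_true + 0.5 * rng.normal(size=2000))
w = sgd_logistic(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```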
|
|
|
Surrogate maximization/minimization algorithms for AdaBoost and the logistic regression model |
| |
Zhihua Zhang,
James T. Kwok,
Dit-Yan Yeung
|
|
Page: 117 |
|
doi>10.1145/1015330.1015342 |
|
Full text: PDF
|
|
Surrogate maximization (or minimization) (SM) algorithms are a family of algorithms that can be regarded as a generalization of expectation-maximization (EM) algorithms. There are three major approaches to the construction of surrogate functions, all relying on the convexity of some function. In this paper, we solve the boosting problem by proposing SM algorithms for the corresponding optimization problem. Specifically, for AdaBoost, we derive an SM algorithm that can be shown to be identical to the algorithm proposed by Collins et al. (2002) based on Bregman distances. More importantly, for LogitBoost (or logistic boosting), we use several methods to construct different surrogate functions which result in different SM algorithms. By combining multiple methods, we are able to derive an SM algorithm that is also the same as an algorithm derived by Collins et al. (2002). Our approach based on SM algorithms is much simpler, and convergence results follow naturally.
|
|
|
Bayesian inference for transductive learning of kernel matrix using the Tanner-Wong data augmentation algorithm |
| |
Zhihua Zhang,
Dit-Yan Yeung,
James T. Kwok
|
|
Page: 118 |
|
doi>10.1145/1015330.1015368 |
|
Full text: PDF
|
|
In kernel methods, an interesting recent development seeks to learn a good kernel from empirical data automatically. In this paper, by regarding the transductive learning of the kernel matrix as a missing data problem, we propose a Bayesian hierarchical model for the problem and devise the Tanner-Wong data augmentation algorithm for making inference on the model. The Tanner-Wong algorithm is closely related to Gibbs sampling, and it also bears a strong resemblance to the expectation-maximization (EM) algorithm. For an efficient implementation, we propose a simplified Bayesian hierarchical model and the corresponding Tanner-Wong algorithm. We express the relationship between the kernel on the input space and the kernel on the output space as a symmetric-definite generalized eigenproblem. Based on this eigenproblem, an efficient approach to choosing the base kernel matrices is presented. The effectiveness of our Bayesian model with the Tanner-Wong algorithm is demonstrated through some classification experiments showing promising results.
|