Exploration and apprenticeship learning in reinforcement learning
Pieter Abbeel, Andrew Y. Ng
Pages: 1-8
DOI: 10.1145/1102351.1102352

We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E3 (Kearns and Singh, 2002) learn near-optimal policies by using "exploration policies" to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing "exploitation policies" that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses.

Active learning for Hidden Markov Models: objective functions and algorithms
Brigham Anderson, Andrew Moore
Pages: 9-16
DOI: 10.1145/1102351.1102353

Hidden Markov Models (HMMs) model sequential data in many fields such as text/speech processing and biosignal analysis. Active learning algorithms learn faster and/or better by closing the data-gathering loop, i.e., they choose the examples most informative with respect to their learning objectives. We introduce a framework and objective functions for active learning in three fundamental HMM problems: model learning, state estimation, and path estimation. In addition, we describe a new set of algorithms for efficiently finding optimal greedy queries using these objective functions. The algorithms are fast, i.e., linear in the number of time steps to select the optimal query, and we present empirical results showing that these algorithms can significantly reduce the need for labelled training data.

Tempering for Bayesian C&RT
Nicos Angelopoulos, James Cussens
Pages: 17-24
DOI: 10.1145/1102351.1102354

This paper concerns the experimental assessment of tempering as a technique for improving Bayesian inference for C&RT models. Full Bayesian inference requires the computation of a posterior over all possible trees. Since exact computation is not possible, Markov chain Monte Carlo (MCMC) methods are used to produce an approximation. C&RT posteriors have many local modes: tempering aims to prevent the Markov chain getting stuck in these modes. Our results show that a clear improvement is achieved using tempering.

Fast condensed nearest neighbor rule
Fabrizio Angiulli
Pages: 25-32
DOI: 10.1145/1102351.1102355

We present a novel algorithm for computing a training set consistent subset for the nearest neighbor decision rule. The algorithm, called the FCNN rule, has several desirable properties: it is order independent, has subquadratic worst-case time complexity, requires few iterations to converge, and is likely to select points very close to the decision boundary. We compare the FCNN rule with state-of-the-art competence preservation algorithms on large multidimensional training sets, showing that it outperforms existing methods in terms of learning speed and learning scaling behavior, and in terms of size of the model, while it guarantees a comparable prediction accuracy.
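
For readers unfamiliar with training-set consistent subsets, the sketch below shows the classic Hart-style condensed nearest neighbor selection that the FCNN rule improves upon. It is an illustrative baseline under an assumed Euclidean distance, not the FCNN algorithm itself; note that, unlike FCNN, its result depends on the presentation order.

```python
import numpy as np

def condensed_subset(X, y, rng=None):
    """Hart-style condensed nearest neighbor (illustrative baseline, not the FCNN rule).

    Grows a subset S of the training set until every training point is
    correctly classified by its nearest neighbor in S (training-set consistency).
    """
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(X))         # result depends on this order (unlike FCNN)
    S = [order[0]]                          # start from one arbitrary point
    changed = True
    while changed:
        changed = False
        for i in order:
            # nearest neighbor of X[i] among the current subset S
            d = np.linalg.norm(X[S] - X[i], axis=1)
            nn = S[int(np.argmin(d))]
            if y[nn] != y[i]:               # misclassified -> absorb the point
                S.append(i)
                changed = True
    return np.sort(np.array(S))

# toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(condensed_subset(X, y, rng=0))
```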

Predictive low-rank decomposition for kernel methods
Francis R. Bach, Michael I. Jordan
Pages: 33-40
DOI: 10.1145/1102351.1102356

Low-rank matrix decompositions are essential tools in the application of kernel methods to large-scale learning problems. These decompositions have generally been treated as black boxes---the decomposition of the kernel matrix that they deliver is independent of the specific learning task at hand---and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side information (e.g., classification labels, regression responses) in the computation of low-rank decompositions for kernel matrices. Our algorithm has the same favorable scaling as state-of-the-art methods such as incomplete Cholesky decomposition---it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algorithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.
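
The label-agnostic baseline the paper compares against, pivoted incomplete Cholesky decomposition of the kernel matrix, can be sketched in a few lines; the predictive variant additionally scores pivots using side information, which is omitted here. A minimal sketch assuming a dense precomputed kernel matrix:

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-8):
    """Pivoted incomplete Cholesky of a PSD kernel matrix K (n x n).

    Returns G (n x k), k <= rank, with K approximately equal to G @ G.T.
    This is the label-agnostic baseline; the predictive variant in the paper
    additionally uses side information when choosing pivots (not shown here).
    """
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.diag(K).astype(float).copy()     # residual diagonal
    for k in range(rank):
        i = int(np.argmax(d))               # pivot with largest residual variance
        if d[i] < tol:
            return G[:, :k]
        G[:, k] = (K[:, i] - G @ G[i, :]) / np.sqrt(d[i])
        d -= G[:, k] ** 2
    return G

# toy usage: RBF kernel matrix of random points
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
G = incomplete_cholesky(K, rank=10)
print(np.linalg.norm(K - G @ G.T))          # approximation error of the rank-10 factor
```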

Multi-way distributional clustering via pairwise interactions
Ron Bekkerman, Ran El-Yaniv, Andrew McCallum
Pages: 41-48
DOI: 10.1145/1102351.1102357

We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. In this scheme, multiple clustering systems are generated aiming at maximizing an objective function that measures multiple pairwise mutual information between cluster variables. To implement this idea, we propose an algorithm that interleaves top-down clustering of some variables and bottom-up clustering of the other variables, with a local optimization correction routine. Focusing on document clustering, we present an extensive empirical study of two-way, three-way and four-way applications of our scheme using six real-world datasets including the 20 Newsgroups (20NG) and the Enron email collection. Our multi-way distributional clustering (MDC) algorithms consistently and significantly outperform previous state-of-the-art information theoretic clustering algorithms.

Error limiting reductions between classification tasks
Alina Beygelzimer, Varsha Dani, Tom Hayes, John Langford, Bianca Zadrozny
Pages: 49-56
DOI: 10.1145/1102351.1102358

We introduce a reduction-based model for analyzing supervised learning tasks. We use this model to devise a new reduction from multi-class cost-sensitive classification to binary classification with the following guarantee: If the learned binary classifier has error rate at most ε then the cost-sensitive classifier has cost at most 2ε times the expected sum of costs of all possible labels. Since cost-sensitive classification can embed any bounded loss finite choice supervised learning task, this result shows that any such task can be solved using a binary classification oracle. Finally, we present experimental results showing that our new reduction outperforms existing algorithms for multi-class cost-sensitive learning.

Multi-instance tree learning
Hendrik Blockeel, David Page, Ashwin Srinivasan
Pages: 57-64
DOI: 10.1145/1102351.1102359

We introduce a novel algorithm for decision tree learning in the multi-instance setting as originally defined by Dietterich et al. It differs from existing multi-instance tree learners in a few crucial, well-motivated details. Experiments on synthetic and real-life datasets confirm the beneficial effect of these differences and show that the resulting system outperforms the existing multi-instance decision tree learners.

Action respecting embedding
Michael Bowling, Ali Ghodsi, Dana Wilkinson
Pages: 65-72
DOI: 10.1145/1102351.1102360

Dimensionality reduction is the problem of finding a low-dimensional representation of high-dimensional input data. This paper examines the case where additional information is known about the data. In particular, we assume the data are given in a sequence with action labels associated with adjacent data points, such as might come from a mobile robot. The goal is a variation on dimensionality reduction, where the output should be a representation of the input data that is both low-dimensional and respects the actions (i.e., actions correspond to simple transformations in the output representation). We show how this variation can be solved with a semidefinite program. We evaluate the technique in a synthetic, robot-inspired domain, demonstrating qualitatively superior representations and quantitative improvements on a data prediction task.

Clustering through ranking on manifolds
Markus Breitenbach, Gregory Z. Grudic
Pages: 73-80
DOI: 10.1145/1102351.1102361

Clustering aims to find useful hidden structures in data. In this paper we present a new clustering algorithm that builds upon the consistency method (Zhou et al., 2003), a semi-supervised learning technique with the property of learning very smooth functions with respect to the intrinsic structure revealed by the data. Other methods, e.g. Spectral Clustering, obtain good results on data that reveals such a structure. However, unlike Spectral Clustering, our algorithm effectively detects both global and within-class outliers, and the most representative examples in each class. Furthermore, we specify an optimization framework that estimates all learning parameters, including the number of clusters, directly from data. Finally, we show that the learned cluster-models can be used to add previously unseen points to clusters without re-learning the original cluster model. Encouraging experimental results are obtained on a number of real world problems.
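
The consistency method that the clustering algorithm builds on can be summarized by its closed-form propagation F = (1 - α)(I - αS)^(-1) Y over a normalized affinity matrix S. A minimal sketch with dense matrices and Gaussian affinities; the clustering, outlier detection and parameter-estimation machinery from the paper is not shown, and all parameter values are illustrative.

```python
import numpy as np

def consistency_ranking(X, seed_idx, sigma=1.0, alpha=0.99):
    """Ranking on a manifold via the consistency method (Zhou et al., 2003).

    Propagates mass from seed points through a normalized affinity matrix and
    returns a score per point; the clustering algorithm in the paper builds on
    this kind of propagation (only the basic ranking step is shown here).
    """
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                          # no self-affinity
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))                   # D^{-1/2} W D^{-1/2}
    Y = np.zeros(n)
    Y[seed_idx] = 1.0
    F = np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * Y)  # closed-form limit
    return F

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
scores = consistency_ranking(X, seed_idx=[0])
print(scores[:3], scores[-3:])   # points in the seed's cluster score higher
```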

Reducing overfitting in process model induction
Will Bridewell, Narges Bani Asadi, Pat Langley, Ljupčo Todorovski
Pages: 81-88
DOI: 10.1145/1102351.1102362

In this paper, we review the paradigm of inductive process modeling, which uses background knowledge about possible component processes to construct quantitative models of dynamical systems. We note that previous methods for this task tend to overfit the training data, which suggests ensemble learning as a likely response. However, such techniques combine models in ways that reduce comprehensibility, making their output much less accessible to domain scientists. As an alternative, we introduce a new approach that induces a set of process models from different samples of the training data and uses them to guide a final search through the space of model structures. Experiments with synthetic and natural data suggest this method reduces error and decreases the chance of including unnecessary processes in the model. We conclude by discussing related work and suggesting directions for additional research.

Learning to rank using gradient descent
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, Greg Hullender
Pages: 89-96
DOI: 10.1145/1102351.1102363

We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.
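
The probabilistic cost is a cross-entropy on score differences: for a pair where item i should rank above item j, the modeled probability of that event is the logistic of s_i - s_j. A minimal sketch with a linear scorer standing in for the paper's neural network; the function names and toy data are illustrative.

```python
import numpy as np

def ranknet_pairwise_loss(w, X, pairs):
    """RankNet-style probabilistic pairwise cost for a linear scorer s(x) = w.x.

    `pairs` lists (i, j) meaning item i should be ranked above item j; the
    modeled probability of that event is the logistic of the score difference.
    """
    s = X @ w
    diff = np.array([s[i] - s[j] for i, j in pairs])
    p = 1.0 / (1.0 + np.exp(-diff))
    loss = -np.log(p + 1e-12).mean()                 # cross-entropy with target 1
    grad = np.zeros_like(w)
    for (i, j), pij in zip(pairs, p):
        grad += (pij - 1.0) * (X[i] - X[j]) / len(pairs)
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_w = rng.normal(size=5)
order = np.argsort(-X @ true_w)
pairs = [(order[k], order[k + 1]) for k in range(len(order) - 1)]

w = np.zeros(5)
for _ in range(200):                                  # plain gradient descent
    loss, grad = ranknet_pairwise_loss(w, X, pairs)
    w -= 0.5 * grad
print("final pairwise loss:", loss)
```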

Learning class-discriminative dynamic Bayesian networks
John Burge, Terran Lane
Pages: 97-104
DOI: 10.1145/1102351.1102364

In many domains, a Bayesian network's topological structure is not known a priori and must be inferred from data. This requires a scoring function to measure how well a proposed network topology describes a set of data. Many commonly used scores such as BD, BDE, BDEU, etc., are not well suited for class discrimination. Instead, scores such as the class-conditional likelihood (CCL) should be employed. Unfortunately, CCL does not decompose and its application to large domains is not feasible. We introduce a decomposable score, approximate conditional likelihood (ACL), that is capable of identifying class discriminative structures. We show that dynamic Bayesian networks (DBNs) trained with ACL have classification efficacies competitive to those trained with CCL on a set of simulated data experiments. We also show that ACL-trained DBNs outperform BDE-trained DBNs, Gaussian naïve Bayes networks and support vector machines within a neuroscience domain too large for CCL.

Recognition and reproduction of gestures using a probabilistic framework combining PCA, ICA and HMM
Sylvain Calinon, Aude Billard
Pages: 105-112
DOI: 10.1145/1102351.1102365

This paper explores the issue of recognizing, generalizing and reproducing arbitrary gestures. We aim at extracting a representation that encapsulates only the key aspects of the gesture and discards the variability intrinsic to each person's motion. We compare a decomposition into principal components (PCA) and independent components (ICA) as a first step of preprocessing in order to decorrelate and denoise the data, as well as to reduce the dimensionality of the dataset to make it tractable. In a second stage of processing, we explore the use of a probabilistic encoding through continuous Hidden Markov Models (HMMs), as a way to encapsulate the sequential nature and intrinsic variability of the motions in stochastic finite state automata. Finally, the method is validated on a humanoid robot that reproduces a variety of gestures performed by a human demonstrator.

Predicting probability distributions for surf height using an ensemble of mixture density networks
Michael Carney, Pádraig Cunningham, Jim Dowling, Ciaran Lee
Pages: 113-120
DOI: 10.1145/1102351.1102366

There is a range of potential applications of Machine Learning where it would be more useful to predict the probability distribution for a variable rather than simply the most likely value for that variable. In meteorology and in finance it is often important to know the probability of a variable falling within (or outside) different ranges. In this paper we consider the prediction of surf height with the objective of predicting if it will fall within a given 'surfable' range. Prediction problems such as this are considerably more difficult if the distribution of the phenomenon is significantly different from a normal distribution. This is the case with the surf data we have studied. To address this we use an ensemble of mixture density networks to predict the probability density function. Our evaluation shows that this is an effective solution. We also describe a web-based application that presents these predictions in a usable manner.

Hedged learning: regret-minimization with learning experts
Yu-Han Chang, Leslie Pack Kaelbling
Pages: 121-128
DOI: 10.1145/1102351.1102367

In non-cooperative multi-agent situations, there cannot exist a globally optimal, yet opponent-independent learning algorithm. Regret-minimization over a set of strategies optimized for potential opponent models is proposed as a good framework for deciding how to behave in such situations. Using longer playing horizons and experts that learn as they play, the regret-minimization framework can be extended to overcome several shortcomings of earlier approaches to the problem of multi-agent learning.
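
For context, the classical exponential-weights (Hedge) update that regret minimization over a fixed set of experts builds on looks like the sketch below; the paper's contribution, experts that themselves learn over longer playing horizons, is not reproduced here. A minimal sketch assuming losses in [0, 1] and an illustrative learning rate.

```python
import numpy as np

def hedge(loss_matrix, eta=0.5):
    """Exponentially weighted forecaster (Hedge) over a fixed set of experts.

    loss_matrix[t, k] is the loss of expert k at round t (assumed in [0, 1]).
    Returns the sequence of weight vectors used to mix the experts.
    """
    T, K = loss_matrix.shape
    w = np.ones(K) / K
    history = []
    for t in range(T):
        history.append(w.copy())
        w = w * np.exp(-eta * loss_matrix[t])   # multiplicative update
        w /= w.sum()
    return np.array(history)

rng = np.random.default_rng(0)
losses = rng.uniform(size=(100, 3))
losses[:, 1] *= 0.3                              # expert 1 is reliably better
weights = hedge(losses)
print(weights[-1])                               # mass concentrates on expert 1
```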

Variational Bayesian image modelling
Li Cheng, Feng Jiao, Dale Schuurmans, Shaojun Wang
Pages: 129-136
DOI: 10.1145/1102351.1102368

We present a variational Bayesian framework for performing inference, density estimation and model selection in a special class of graphical models---Hidden Markov Random Fields (HMRFs). HMRFs are particularly well suited to image modelling, and in this paper we apply them to the problem of image segmentation. Unfortunately, HMRFs are notoriously hard to train and use because the exact inference problems they create are intractable. Our main contribution is to introduce an efficient variational approach for performing approximate inference of the Bayesian formulation of HMRFs, which we can then apply to the density estimation and model selection problems that arise when learning image models from data. With this variational approach, we can conveniently tackle the problem of image segmentation. We present experimental results which show that our technique outperforms recent HMRF-based segmentation methods on real world images.

Preference learning with Gaussian processes
Wei Chu, Zoubin Ghahramani
Pages: 137-144
DOI: 10.1145/1102351.1102369

In this paper, we propose a probabilistic kernel approach to preference learning based on Gaussian processes. A new likelihood function is proposed to capture the preference relations in the Bayesian framework. The generalized formulation is also applicable to tackle many multiclass problems. The overall approach has the advantages of Bayesian methods for model selection and probabilistic prediction. Experimental results compared against the constraint classification approach on several benchmark datasets verify the usefulness of this algorithm.

New approaches to support vector ordinal regression
Wei Chu, S. Sathiya Keerthi
Pages: 145-152
DOI: 10.1145/1102351.1102370

In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on benchmark datasets verify the usefulness of these approaches.
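
Once the weight vector and the ordered thresholds are learned, prediction amounts to locating the latent score among the thresholds. A minimal sketch of that prediction step only; the parameter values below are made up for illustration, and learning them (with the ordering guaranteed) is the subject of the paper.

```python
import numpy as np

def ordinal_predict(w, thresholds, X):
    """Map latent scores w.x to ordinal labels using ordered thresholds.

    With r ranks there are r-1 thresholds b_1 <= ... <= b_{r-1}; a point gets
    rank j if its score falls between b_{j-1} and b_j.
    """
    scores = X @ w
    # np.searchsorted counts how many thresholds lie below each score,
    # which is exactly the (0-based) rank index.
    return np.searchsorted(thresholds, scores)

w = np.array([1.0, -0.5])
thresholds = np.array([-1.0, 0.0, 1.5])          # 4 ordinal ranks
X = np.array([[-2.0, 0.0], [0.2, 0.1], [3.0, 0.0]])
print(ordinal_predict(w, thresholds, X))         # -> [0 2 3]
```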

A general regression technique for learning transductions
Corinna Cortes, Mehryar Mohri, Jason Weston
Pages: 153-160
DOI: 10.1145/1102351.1102371

The problem of learning a transduction, that is, a string-to-string mapping, is a common problem arising in natural language processing and computational biology. Previous methods proposed for learning such mappings are based on classification techniques. This paper presents a new and general regression technique for learning transductions and reports the results of experiments showing its effectiveness. Our transduction learning consists of two phases: the estimation of a set of regression coefficients and the computation of the pre-image corresponding to this set of coefficients. A novel and conceptually cleaner formulation of kernel dependency estimation provides a simple framework for estimating the regression coefficients, and an efficient algorithm for computing the pre-image from the regression coefficients extends the applicability of kernel dependency estimation to output sequences. We report the results of a series of experiments illustrating the application of our regression technique for learning transductions.

Learning to compete, compromise, and cooperate in repeated general-sum games
Jacob W. Crandall, Michael A. Goodrich
Pages: 161-168
DOI: 10.1145/1102351.1102372

Learning algorithms often obtain relatively low average payoffs in repeated general-sum games against other learning agents due to a focus on myopic best-response and one-shot Nash equilibrium (NE) strategies. A less myopic approach places focus on NEs of the repeated game, which suggests that (at the least) a learning agent should possess two properties. First, an agent should never learn to play a strategy that produces average payoffs less than the minimax value of the game. Second, an agent should learn to cooperate/compromise when beneficial. No learning algorithm from the literature is known to possess both of these properties. We present a reinforcement learning algorithm (M-Qubed) that provably satisfies the first property and empirically displays (in self play) the second property in a wide range of games.

Learning as search optimization: approximate large margin methods for structured prediction
Hal Daumé III, Daniel Marcu
Pages: 169-176
DOI: 10.1145/1102351.1102373

Mappings to structured output spaces (strings, trees, partitions, etc.) are typically learned using extensions of classification algorithms to simple graphical structures (e.g., linear chains) in which search and parameter estimation can be performed exactly. Unfortunately, in many complex problems, it is rare that exact search or parameter estimation is tractable. Instead of learning exact models and searching via heuristic means, we embrace this difficulty and treat the structured output problem in terms of approximate search. We present a framework for learning as search optimization, and two parameter updates with convergence theorems and bounds. Empirical evidence shows that our integrated approach to learning and decoding can outperform exact models at smaller computational cost.

Multimodal oriented discriminant analysis
Fernando De la Torre, Takeo Kanade
Pages: 177-184
DOI: 10.1145/1102351.1102374

Linear discriminant analysis (LDA) has been an active topic of research during the last century. However, the existing algorithms have several limitations when applied to visual data. LDA is only optimal for Gaussian distributed classes with equal covariance matrices, and only (number of classes - 1) features can be extracted. On the other hand, LDA does not scale well to high dimensional data (overfitting), and it cannot optimally handle multimodal distributions. In this paper, we introduce Multimodal Oriented Discriminant Analysis (MODA), an LDA extension which can overcome these drawbacks. A new formulation and several novelties are proposed:
• An optimal dimensionality reduction for multimodal Gaussian classes with different covariances is derived. The new criterion allows for extracting more than (number of classes - 1) features.
• A covariance approximation is introduced to improve generalization and avoid overfitting when dealing with high dimensional data.
• A linear time iterative majorization method is suggested in order to find a local optimum.
Several synthetic and real experiments on face recognition show that MODA outperforms existing linear techniques.

A practical generalization of Fourier-based learning
Adam Drake, Dan Ventura
Pages: 185-192
DOI: 10.1145/1102351.1102375

This paper presents a search algorithm for finding functions that are highly correlated with an arbitrary set of data. The functions found by the search can be used to approximate the unknown function that generated the data. A special case of this approach is a method for learning Fourier representations. Empirical results demonstrate that on typical real-world problems the most highly correlated functions can be found very quickly, while combinations of these functions provide good approximations of the unknown function.
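
In the Fourier special case, the "functions correlated with the data" are parity (Walsh) basis functions, and each coefficient is simply the empirical correlation between a parity and the labels. A brute-force sketch over low-order parities, standing in for the paper's search procedure; the function names and the toy target are illustrative.

```python
import numpy as np
from itertools import combinations

def parity(X, subset):
    """Walsh/Fourier basis function chi_S, applied row-wise to a 0/1 matrix."""
    return 1 - 2 * (X[:, subset].sum(axis=1) % 2)

def top_fourier_approx(X, y, max_order=2, k=5):
    """Estimate Fourier coefficients over low-order parities and keep the k
    most correlated ones (a brute-force stand-in for the paper's search).

    X is a 0/1 matrix, y is +/-1.  The returned (subset, coefficient) pairs
    define sign(sum_S coef_S * chi_S(x)) as an approximation of y.
    """
    n_bits = X.shape[1]
    scored = []
    for order in range(max_order + 1):
        for subset in combinations(range(n_bits), order):
            chi = parity(X, list(subset)) if subset else np.ones(len(X))
            coef = float(np.mean(y * chi))          # empirical correlation
            scored.append((abs(coef), list(subset), coef))
    scored.sort(reverse=True)
    return [(s, c) for _, s, c in scored[:k]]

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = 1 - 2 * (X[:, 0] ^ X[:, 2])                     # target is a parity of bits 0 and 2
print(top_fourier_approx(X, y))                     # the subset [0, 2] should dominate
```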

Combining model-based and instance-based learning for first order regression
Kurt Driessens, Sašo Džeroski
Pages: 193-200
DOI: 10.1145/1102351.1102376

The introduction of relational reinforcement learning and the RRL algorithm gave rise to the development of several first order regression algorithms. So far, these algorithms have employed either a model-based approach or an instance-based approach. As a consequence, they suffer from the typical drawbacks of model-based learning such as coarse function approximation or those of lazy learning such as high computational intensity. In this paper we develop a new regression algorithm that combines the strong points of both approaches and tries to avoid the normally inherent drawbacks. By combining model-based and instance-based learning, we produce an incremental first order regression algorithm that is both computationally efficient and produces better predictions earlier in the learning experiment.

Reinforcement learning with Gaussian processes
Yaakov Engel, Shie Mannor, Ron Meir
Pages: 201-208
DOI: 10.1145/1102351.1102377

Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.

Experimental comparison between bagging and Monte Carlo ensemble classification
Roberto Esposito, Lorenza Saitta
Pages: 209-216
DOI: 10.1145/1102351.1102378

Properties of ensemble classification can be studied using the framework of Monte Carlo stochastic algorithms. Within this framework it is also possible to define a new ensemble classifier, whose accuracy probability distribution can be computed exactly. This paper has two goals: first, an experimental comparison between the theoretical predictions and experimental results; second, a systematic comparison between bagging and Monte Carlo ensemble classification.

Supervised clustering with support vector machines
Thomas Finley, Thorsten Joachims
Pages: 217-224
DOI: 10.1145/1102351.1102379

Supervised clustering is the problem of training a clustering algorithm to produce desirable clusterings: given sets of items and complete clusterings over these sets, we learn how to cluster future sets of items. Example applications include noun-phrase coreference clustering, and clustering news articles by whether they refer to the same topic. In this paper we present an SVM algorithm that trains a clustering algorithm by adapting the item-pair similarity measure. The algorithm may optimize a variety of different clustering functions to a variety of clustering performance measures. We empirically evaluate the algorithm for noun-phrase and news article clustering.

Optimal assignment kernels for attributed molecular graphs
Holger Fröhlich, Jörg K. Wegner, Florian Sieker, Andreas Zell
Pages: 225-232
DOI: 10.1145/1102351.1102380

We propose a new kernel function for attributed molecular graphs, which is based on the idea of computing an optimal assignment from the atoms of one molecule to those of another one, including information on neighborhood, membership to a certain structural element and other characteristics for each atom. As a byproduct this leads to a new class of kernel functions. We demonstrate how the necessary computations can be carried out efficiently. Compared to marginalized graph kernels our method in some cases leads to a significant reduction of the prediction error. Further improvement can be gained if expert knowledge is combined with our method. We also investigate a reduced graph representation of molecules by collapsing certain structural elements, such as rings, into a single node of the molecular graph.
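
The core computation, an optimal assignment between the atoms of two molecules under some base atom similarity, can be carried out with the Hungarian algorithm. A minimal sketch using a plain RBF atom similarity as a stand-in for the paper's richer atom descriptors (neighborhood and structural-element information are not modeled here).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment_similarity(A, B, atom_sim):
    """Optimal assignment similarity between two molecules.

    A and B are arrays of per-atom feature vectors; atom_sim(a, b) is a base
    similarity on atoms.  The smaller molecule's atoms are assigned to atoms
    of the larger one so that the summed atom similarities are maximal.
    """
    if len(A) > len(B):
        A, B = B, A                                  # assign smaller onto larger
    S = np.array([[atom_sim(a, b) for b in B] for a in A])
    rows, cols = linear_sum_assignment(-S)           # maximize total similarity
    return S[rows, cols].sum()

rbf = lambda a, b: float(np.exp(-0.5 * np.sum((a - b) ** 2)))
mol1 = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])   # toy per-atom features
mol2 = np.array([[0.1, 0.9], [1.1, 0.1]])
print(optimal_assignment_similarity(mol1, mol2, rbf))
```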

Closed-form dual perturb and combine for tree-based models
Pierre Geurts, Louis Wehenkel
Pages: 233-240
DOI: 10.1145/1102351.1102381

This paper studies the aggregation of predictions made by tree-based models for several perturbed versions of the attribute vector of a test case. A closed-form approximation of this scheme combined with cross-validation to tune the level of perturbation is proposed. This yields soft-tree models in a parameter-free way and preserves their interpretability. Empirical evaluations, on classification and regression problems, show that accuracy and bias/variance tradeoff are improved significantly at the price of an acceptable computational overhead. The method is further compared and combined with tree bagging.

Hierarchic Bayesian models for kernel learning
Mark Girolami, Simon Rogers
Pages: 241-248
DOI: 10.1145/1102351.1102382

The integration of diverse forms of informative data by learning an optimal combination of base kernels in classification or regression problems can provide enhanced performance when compared to that obtained from any single data source. We present a Bayesian hierarchical model which enables kernel learning and present effective variational Bayes estimators for regression and classification. Illustrative experiments demonstrate the utility of the proposed method. Matlab code replicating the reported results is available at http://www.dcs.gla.ac.uk/~srogers/kernel_comb.html.

Online feature selection for pixel classification
Karen Glocer, Damian Eads, James Theiler
Pages: 249-256
DOI: 10.1145/1102351.1102383

Online feature selection (OFS) provides an efficient way to sort through a large space of features, particularly in a scenario where the feature space is large and features take a significant amount of memory to store. Image processing operators, and especially combinations of image processing operators, provide a rich space of potential features for use in machine learning for image processing tasks but they are expensive to generate and store. In this paper we apply OFS to the problem of edge detection in grayscale imagery. We use a standard data set and compare our results to those obtained with traditional edge detectors, as well as with results obtained more recently using "statistical edge detection." We compare several different OFS approaches, including hill climbing, best first search, and grafting.

Learning strategies for story comprehension: a reinforcement learning approach
Eugene Grois, David C. Wilkins
Pages: 257-264
DOI: 10.1145/1102351.1102384

This paper describes the use of machine learning to improve the performance of natural language question answering systems. We present a model for improving story comprehension through inductive generalization and reinforcement learning, based on classified examples. In the process, the model selects the most relevant and useful pieces of lexical information to be used by the inference procedure. We compare our approach to three prior non-learning systems, and evaluate the conditions under which learning is effective. We demonstrate that a learning-based approach can improve upon "matching and extraction"-only techniques.

Near-optimal sensor placements in Gaussian processes
Carlos Guestrin, Andreas Krause, Ajit Paul Singh
Pages: 265-272
DOI: 10.1145/1102351.1102385

When monitoring spatial phenomena, which are often modeled as Gaussian Processes (GPs), choosing sensor locations is a fundamental task. A common strategy is to place sensors at the points of highest entropy (variance) in the GP model. We propose a mutual information criterion, and show that it produces better placements. Furthermore, we prove that finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1 - 1/e) of the optimum by exploiting the submodularity of our criterion. This algorithm is extended to handle local structure in the GP, yielding significant speedups. We demonstrate the advantages of our approach on two real-world data sets.
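
A sketch of a greedy selection rule in the style of this criterion: each step adds the location whose variance given the already chosen sensors is large relative to its variance given the remaining unselected locations. This is a simplified dense-matrix illustration under an assumed GP covariance; the paper's exact setup and local-structure speedups are omitted.

```python
import numpy as np

def conditional_variance(K, y, cond):
    """Variance of location y in a GP with covariance K, conditioned on `cond`."""
    if len(cond) == 0:
        return K[y, y]
    Kcc = K[np.ix_(cond, cond)]
    kyc = K[y, cond]
    return K[y, y] - kyc @ np.linalg.solve(Kcc, kyc)

def greedy_mi_placement(K, n_sensors):
    """Greedy sensor placement by a mutual-information-style gain.

    At each step, pick the unselected location y maximizing the ratio of its
    variance given the chosen set A to its variance given the remaining
    unselected locations; submodularity is what underlies the (1 - 1/e)
    guarantee for greedy selection.
    """
    n = K.shape[0]
    A = []
    for _ in range(n_sensors):
        best, best_gain = None, -np.inf
        for y in range(n):
            if y in A:
                continue
            rest = [v for v in range(n) if v != y and v not in A]
            gain = conditional_variance(K, y, A) / conditional_variance(K, y, rest)
            if gain > best_gain:
                best, best_gain = y, gain
        A.append(best)
    return A

rng = np.random.default_rng(0)
pts = rng.uniform(size=(30, 2))                      # candidate sensor locations
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 0.1) + 1e-6 * np.eye(30)            # GP covariance over locations
print(greedy_mi_placement(K, 4))
```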

Robust one-class clustering using hybrid global and local search
Gunjan Gupta, Joydeep Ghosh
Pages: 273-280
DOI: 10.1145/1102351.1102386

Unsupervised learning methods often involve summarizing the data using a small number of parameters. In certain domains, only a small subset of the available data is relevant for the problem. One-Class Classification or One-Class Clustering attempts to find a useful subset by locating a dense region in the data. In particular, a recently proposed algorithm called One-Class Information Ball (OC-IB) shows the advantage of modeling a small set of highly coherent points as opposed to pruning outliers. We present several modifications to OC-IB and integrate it with a global search that results in several improvements such as deterministic results, optimality guarantees, control over cluster size and extension to other cost functions. Empirical studies yield significantly better results on various real and artificial data.

Statistical and computational analysis of locality preserving projection
Xiaofei He, Deng Cai, Wanli Min
Pages: 281-288
DOI: 10.1145/1102351.1102387

Recently, several manifold learning algorithms have been proposed, such as ISOMAP (Tenenbaum et al., 2000), Locally Linear Embedding (Roweis & Saul, 2000), Laplacian Eigenmap (Belkin & Niyogi, 2001), Locality Preserving Projection (LPP) (He & Niyogi, 2003), etc. All of them aim at discovering the meaningful low dimensional structure of the data space. In this paper, we present a statistical analysis of the LPP algorithm. Different from Principal Component Analysis (PCA), which obtains a subspace spanned by the largest eigenvectors of the global covariance matrix, we show that LPP obtains a subspace spanned by the smallest eigenvectors of the local covariance matrix. We applied PCA and LPP to a real-world document clustering task. Experimental results show that the performance can be significantly improved in the subspace, and especially LPP works much better than PCA.
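
A compact way to see the contrast with PCA is the generalized eigenproblem LPP solves, X L X^T a = λ X D X^T a with the smallest eigenvalues kept, where L is the Laplacian of a neighborhood graph. A dense sketch under assumed heat-kernel weights; the neighborhood size, bandwidth and ridge term are illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, n_components=2, n_neighbors=5, sigma=1.0):
    """Locality Preserving Projection (He & Niyogi, 2003), dense sketch.

    Builds a heat-kernel affinity W on a k-NN graph and solves the generalized
    eigenproblem for the smallest eigenvalues, so projections keep neighboring
    points close (contrast with PCA, which keeps the largest eigenvectors of
    the global covariance).
    """
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    knn = np.argsort(sq, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        W[i, knn[i]] = np.exp(-sq[i, knn[i]] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                            # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                         # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])       # small ridge for stability
    vals, vecs = eigh(A, B)                           # ascending eigenvalues
    return vecs[:, :n_components]                     # projection directions

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
P = lpp(X)
print((X @ P).shape)                                  # (100, 2) embedded coordinates
```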

Intrinsic dimensionality estimation of submanifolds in R^d
Matthias Hein, Jean-Yves Audibert
Pages: 289-296
DOI: 10.1145/1102351.1102388

We present a new method to estimate the intrinsic dimensionality of a submanifold M in R^d from random samples. The method is based on the convergence rates of a certain U-statistic on the manifold. We solve at least partially the question of the choice of the scale of the data. Moreover the proposed method is easy to implement, can handle large data sets and performs very well even for small sample sizes. We compare the proposed method to two standard estimators on several artificial as well as real data sets.

Bayesian hierarchical clustering
Katherine A. Heller, Zoubin Ghahramani
Pages: 297-304
DOI: 10.1145/1102351.1102389

We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distance-based agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which can be used to compute the predictive distribution of a test point and the probability of it belonging to any of the existing clusters in the tree. (2) It uses a model-based criterion to decide on merging clusters rather than an ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide which merges are advantageous and to output the recommended depth of the tree. (4) The algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). It provides a new lower bound on the marginal likelihood of a DPM by summing over exponentially many clusterings of the data in polynomial time. We describe procedures for learning the model hyperparameters, computing the predictive distribution, and extensions to the algorithm. Experimental results on synthetic and real-world data sets demonstrate useful properties of the algorithm.

Online learning over graphs
Mark Herbster, Massimiliano Pontil, Lisa Wainer
Pages: 305-312
DOI: 10.1145/1102351.1102390

We apply classic online learning techniques similar to the perceptron algorithm to the problem of learning a function defined on a graph. The benefit of our approach includes simple algorithms and performance guarantees that we naturally interpret in terms of structural properties of the graph, such as the algebraic connectivity or the diameter of the graph. We also discuss how these methods can be modified to allow active learning on a graph. We present preliminary experiments with encouraging results.

Adapting two-class support vector classification methods to many class problems
Simon I. Hill, Arnaud Doucet
Pages: 313-320
DOI: 10.1145/1102351.1102391

A geometric construction is presented which is shown to be an effective tool for understanding and implementing multi-category support vector classification. It is demonstrated how this construction can be used to extend many other existing two-class kernel-based classification methodologies in a straightforward way while still preserving attractive properties of individual algorithms. Reducing training times through incorporating the results of pairwise classification is also discussed and experimental results presented.

A martingale framework for concept change detection in time-varying data streams
Shen-Shyang Ho
Pages: 321-327
DOI: 10.1145/1102351.1102392

In a data streaming setting, data points are observed one by one. The concepts to be learned from the data points may change infinitely often as the data is streaming. In this paper, we extend the idea of testing exchangeability online (Vovk et al., 2003) to a martingale framework to detect concept changes in time-varying data streams. Two martingale tests are developed to detect concept changes using: (i) martingale values, a direct consequence of Doob's maximal inequality, and (ii) the martingale difference, justified using the Hoeffding-Azuma inequality. Under some assumptions, the second test theoretically has a lower probability than the first test of rejecting the null hypothesis, "no concept change in the data stream", when it is in fact correct. Experiments show that both martingale tests are effective in detecting concept changes in time-varying data streams simulated using two synthetic data sets and three benchmark data sets.
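
The first kind of test can be sketched with a randomized power martingale over conformal-style p-values. In the sketch below, the strangeness measure (distance to the running mean) and the threshold are illustrative choices, not the ones used in the paper; any strangeness from a conformal predictor could be substituted.

```python
import numpy as np

def martingale_change_detector(stream, eps=0.92, threshold=10.0, seed=0):
    """Randomized power martingale test (in the spirit of the paper's first test).

    A randomized p-value is computed from ranks of strangeness scores, the
    martingale is multiplied by eps * p**(eps - 1), and a change is declared
    when the martingale value exceeds the threshold.
    """
    rng = np.random.default_rng(seed)
    window, alarms = [], []
    M = 1.0
    for t, x in enumerate(stream):
        window.append(x)
        pts = np.array(window)
        s = np.linalg.norm(pts - pts.mean(axis=0), axis=1)    # strangeness scores
        theta = rng.uniform()
        p = ((s > s[-1]).sum() + theta * (s == s[-1]).sum()) / len(s)
        M *= eps * p ** (eps - 1.0)
        if M >= threshold:                                    # Doob-style threshold test
            alarms.append(t)
            window, M = [], 1.0                               # restart after an alarm
    return alarms

rng = np.random.default_rng(1)
stream = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
print(martingale_change_detector(stream))   # expect an alarm shortly after the change at t = 200
```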

Multi-class protein fold recognition using adaptive codes
Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie
Pages: 329-336
DOI: 10.1145/1102351.1102393

We develop a novel multi-class classification method based on output codes for the problem of classifying a sequence of amino acids into one of many known protein structural classes, called folds. Our method learns relative weights between one-vs-all classifiers and encodes information about the protein structural hierarchy for multi-class prediction. Our code weighting approach significantly improves on the standard one-vs-all method for the fold recognition problem. In order to compare against widely used methods in protein sequence analysis, we also test nearest neighbor approaches based on the PSI-BLAST algorithm. Our code weight learning algorithm strongly outperforms these PSI-BLAST methods on every structure recognition problem we consider.

Learning approximate preconditions for methods in hierarchical plans
Okhtay Ilghami, Héctor Muñoz-Avila, Dana S. Nau, David W. Aha
Pages: 337-344
DOI: 10.1145/1102351.1102394

A significant challenge in developing planning systems for practical applications is the difficulty of acquiring the domain knowledge needed by such systems. One method for acquiring this knowledge is to learn it from plan traces, but this method typically requires a huge number of plan traces to converge. In this paper, we show that the problem with slow convergence can be circumvented by having the learner generate solution plans even before the planning domain is completely learned. Our empirical results show that these improvements reduce the size of the training set that is needed to find correct answers to a large percentage of planning problems in the test set.

Evaluating machine learning for information extraction
Neil Ireson, Fabio Ciravegna, Mary Elaine Califf, Dayne Freitag, Nicholas Kushmerick, Alberto Lavelli
Pages: 345-352
DOI: 10.1145/1102351.1102395

Comparative evaluation of Machine Learning (ML) systems used for Information Extraction (IE) has suffered from various inconsistencies in experimental procedures. This paper reports on the results of the Pascal Challenge on Evaluating Machine Learning for Information Extraction, which provides a standardised corpus, set of tasks, and evaluation methodology. The challenge is described and the systems submitted by the ten participants are briefly introduced and their performance is analysed.

Learn to weight terms in information retrieval using category information
Rong Jin, Joyce Y. Chai, Luo Si
Pages: 353-360
DOI: 10.1145/1102351.1102396

How to assign appropriate weights to terms is one of the critical issues in information retrieval. Many term weighting schemes are unsupervised. They are either based on the empirical observation in information retrieval, or based on generative approaches for language modeling. As a result, the existing term weighting schemes are usually insufficient in distinguishing informative words from the uninformative ones, which is crucial to the performance of information retrieval. In this paper, we present supervised term weighting schemes that automatically learn term weights based on the correlation between word frequency and category information of documents. Empirical studies with the ImageCLEF dataset have indicated that the proposed methods perform substantially better than the state-of-the-art approaches for term weighting and other alternatives that exploit category information for information retrieval.

A smoothed boosting algorithm using probabilistic output codes
Rong Jin, Jian Zhang
Pages: 361-368
DOI: 10.1145/1102351.1102397

AdaBoost.OC has been shown to be an effective method in boosting "weak" binary classifiers for multi-class learning. It employs the Error Correcting Output Code (ECOC) method to convert a multi-class learning problem into a set of binary classification problems, and applies the AdaBoost algorithm to solve them efficiently. In this paper, we propose a new boosting algorithm that improves the AdaBoost.OC algorithm in two aspects: 1) It introduces a smoothing mechanism into the boosting algorithm to alleviate the potential overfitting problem with the AdaBoost algorithm, and 2) It introduces a probabilistic coding scheme to generate binary codes for multiple classes such that training errors can be efficiently reduced. Empirical studies with seven UCI datasets have indicated that the proposed boosting algorithm is more robust and effective than the AdaBoost.OC algorithm for multi-class learning.

Efficient discriminative learning of Bayesian network classifier via boosted augmented naive Bayes
Yushi Jing, Vladimir Pavlović, James M. Rehg
Pages: 369-376
DOI: 10.1145/1102351.1102398

The use of Bayesian networks for classification problems has received significant recent attention. Although computationally efficient, the standard maximum likelihood learning method tends to be suboptimal due to the mismatch between its optimization criterion (data likelihood) and the actual goal for classification (label prediction). Recent approaches to optimizing the classification performance during parameter or structure learning show promise, but lack the favorable computational properties of maximum likelihood learning. In this paper we present the Boosted Augmented Naive Bayes (BAN) classifier. We show that a combination of discriminative data-weighting with generative training of intermediate models can yield a computationally efficient method for discriminative parameter learning and structure selection.
|
|
|
A support vector method for multivariate performance measures |
| |
Thorsten Joachims
|
|
Pages: 377 - 384 |
|
doi>10.1145/1102351.1102399 |
|
Full text: Pdf
|
|
This paper presents a Support Vector Method for optimizing multivariate nonlinear performance measures like the F1-score. Taking a multivariate prediction approach, we give an algorithm with which such multivariate SVMs can be trained ...
This paper presents a Support Vector Method for optimizing multivariate nonlinear performance measures like the F1-score. Taking a multivariate prediction approach, we give an algorithm with which such multivariate SVMs can be trained in polynomial time for large classes of potentially non-linear performance measures, in particular ROCArea and all measures that can be computed from the contingency table. The conventional classification SVM arises as a special case of our method. expand
|
|
|
Error bounds for correlation clustering |
| |
Thorsten Joachims,
John Hopcroft
|
|
Pages: 385 - 392 |
|
doi>10.1145/1102351.1102400 |
|
Full text: Pdf
|
|
This paper presents a learning theoretical analysis of correlation clustering (Bansal et al., 2002). In particular, we give bounds on the error with which correlation clustering recovers the correct partition in a planted partition model (Condon & Karp, ...
This paper presents a learning theoretical analysis of correlation clustering (Bansal et al., 2002). In particular, we give bounds on the error with which correlation clustering recovers the correct partition in a planted partition model (Condon & Karp, 2001; McSherry, 2001). Using these bounds, we analyze how the accuracy of correlation clustering scales with the number of clusters and the sparsity of the graph. We also propose a statistical test that analyzes the significance of the clustering found by correlation clustering. expand
|
|
|
Interactive learning of mappings from visual percepts to actions |
| |
Sébastien Jodogne,
Justus H. Piater
|
|
Pages: 393 - 400 |
|
doi>10.1145/1102351.1102401 |
|
Full text: Pdf
|
|
We introduce flexible algorithms that can automatically learn mappings from images to actions by interacting with their environment. They work by introducing an image classifier in front of a Reinforcement Learning algorithm. The classifier partitions ...
We introduce flexible algorithms that can automatically learn mappings from images to actions by interacting with their environment. They work by introducing an image classifier in front of a Reinforcement Learning algorithm. The classifier partitions the visual space according to the presence or absence of highly informative local descriptors. The image classifier is incrementally refined by selecting new local descriptors when perceptual aliasing is detected. Thus, we reduce the visual input domain down to a size manageable by Reinforcement Learning, permitting us to learn direct percept-to-action mappings. Experimental results on a continuous visual navigation task illustrate the applicability of the framework. expand
|
|
|
A causal approach to hierarchical decomposition of factored MDPs |
| |
Anders Jonsson,
Andrew Barto
|
|
Pages: 401 - 408 |
|
doi>10.1145/1102351.1102402 |
|
Full text: Pdf
|
|
We present Variable Influence Structure Analysis, an algorithm that dynamically performs hierarchical decomposition of factored Markov decision processes. Our algorithm determines causal relationships between state variables and introduces temporally-extended ...
We present Variable Influence Structure Analysis, an algorithm that dynamically performs hierarchical decomposition of factored Markov decision processes. Our algorithm determines causal relationships between state variables and introduces temporally-extended actions that cause the values of state variables to change. Each temporally-extended action corresponds to a subtask that is significantly easier to solve than the overall task. Results from experiments show great promise in scaling to larger tasks. expand
|
|
|
A comparison of tight generalization error bounds |
| |
Matti Kääriäinen,
John Langford
|
|
Pages: 409 - 416 |
|
doi>10.1145/1102351.1102403 |
|
Full text: Pdf
|
|
We investigate the empirical applicability of several bounds (a number of which are new) on the true error rate of learned classifiers which hold whenever the examples are chosen independently at random from a fixed distribution.The collection of tricks ...
We investigate the empirical applicability of several bounds (a number of which are new) on the true error rate of learned classifiers which hold whenever the examples are chosen independently at random from a fixed distribution.The collection of tricks we use includes:1. A technique using unlabeled data for a tight derandomization of randomized bounds.2. A tight form of the progressive validation bound.3. The exact form of the test set bound.The bounds are implemented in the semibound package and are freely available. expand
|
|
|
Generalized LARS as an effective feature selection tool for text classification with SVMs |
| |
S. Sathiya Keerthi
|
|
Pages: 417 - 424 |
|
doi>10.1145/1102351.1102404 |
|
Full text: Pdf
|
|
In this paper we generalize the LARS feature selection method to the linear SVM model, derive an efficient algorithm for it, and empirically demonstrate its usefulness as a feature selection tool for text classification.
In this paper we generalize the LARS feature selection method to the linear SVM model, derive an efficient algorithm for it, and empirically demonstrate its usefulness as a feature selection tool for text classification. expand
|
|
|
Ensembles of biased classifiers |
| |
Rinat Khoussainov,
Andreas Heß,
Nicholas Kushmerick
|
|
Pages: 425 - 432 |
|
doi>10.1145/1102351.1102405 |
|
Full text: Pdf
|
|
We propose a novel ensemble learning algorithm called Triskel, which has two interesting features. First, Triskel learns an ensemble of classifiers, each biased to have high precision on instances from a single class (as opposed to, for example, boosting, ...
We propose a novel ensemble learning algorithm called Triskel, which has two interesting features. First, Triskel learns an ensemble of classifiers, each biased to have high precision on instances from a single class (as opposed to, for example, boosting, where the ensemble members are biased to maximise accuracy over a subset of instances from all classes). Second, the ensemble members' voting weights are assigned so that certain pairs of biased classifiers outweigh the rest of the ensemble, if their predictions agree. Our experiments demonstrate that Triskel often outperforms boosting, in terms of both accuracy and training time. We also present an ROC analysis, which shows that Triskel's iterative structure corresponds to a sequence of nested ROC spaces. The analysis predicts that Triskel works best when there are concavities in the ROC curves; this prediction agrees with our empirical results. expand
|
|
|
Computational aspects of Bayesian partition models |
| |
Mikko Koivisto,
Kismat Sood
|
|
Pages: 433 - 440 |
|
doi>10.1145/1102351.1102406 |
|
Full text: Pdf
|
|
The conditional distribution of a discrete variable y, given another discrete variable x, is often specified by assigning one multinomial distribution to each state of x. The cost of this rich parametrization is the loss of statistical ...
The conditional distribution of a discrete variable y, given another discrete variable x, is often specified by assigning one multinomial distribution to each state of x. The cost of this rich parametrization is the loss of statistical power in cases where the data actually fits a model with much fewer parameters. In this paper, we consider a model that partitions the state space of x into disjoint sets, and assigns a single Dirichlet-multinomial to each set. We treat the partition as an unknown variable which is to be integrated away when the interest is in a coarser level task, e.g., variable selection or classification. Based on two different computational approaches, we present two exact algorithms for integration over partitions. Respective complexity bounds are derived in terms of detailed problem characteristics, including the size of the data and the size of the state space of x. Experiments on synthetic data demonstrate the applicability of the algorithms. expand
|
|
|
Learning the structure of Markov logic networks |
| |
Stanley Kok,
Pedro Domingos
|
|
Pages: 441 - 448 |
|
doi>10.1145/1102351.1102407 |
|
Full text: Pdf
|
|
Markov logic networks (MLNs) combine logic and probability by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. In this paper we develop an algorithm for learning the structure of MLNs from relational ...
Markov logic networks (MLNs) combine logic and probability by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. In this paper we develop an algorithm for learning the structure of MLNs from relational databases, combining ideas from inductive logic programming (ILP) and feature induction in Markov networks. The algorithm performs a beam or shortest-first search of the space of clauses, guided by a weighted pseudo-likelihood measure. This requires computing the optimal weights for each candidate structure, but we show how this can be done efficiently. The algorithm can be used to learn an MLN from scratch, or to refine an existing knowledge base. We have applied it in two real-world domains, and found that it outperforms using off-the-shelf ILP systems to learn the MLN structure, as well as pure ILP, purely probabilistic and purely knowledge-based approaches. expand
|
|
|
Using additive expert ensembles to cope with concept drift |
| |
Jeremy Z. Kolter,
Marcus A. Maloof
|
|
Pages: 449 - 456 |
|
doi>10.1145/1102351.1102408 |
|
Full text: Pdf
|
|
We consider online learning where the target concept can change over time. Previous work on expert prediction algorithms has bounded the worst-case performance on any subsequence of the training data relative to the performance of the best expert. However, ...
We consider online learning where the target concept can change over time. Previous work on expert prediction algorithms has bounded the worst-case performance on any subsequence of the training data relative to the performance of the best expert. However, because these "experts" may be difficult to implement, we take a more general approach and bound performance relative to the actual performance of any online learner on this single subsequence. We present the additive expert ensemble algorithm AddExp, a new, general method for using any online learner for drifting concepts. We adapt techniques for analyzing expert prediction algorithms to prove mistake and loss bounds for a discrete and a continuous version of AddExp. Finally, we present pruning methods and empirical results for data sets with concept drift. expand
|
|
|
Semi-supervised graph clustering: a kernel approach |
| |
Brian Kulis,
Sugato Basu,
Inderjit Dhillon,
Raymond Mooney
|
|
Pages: 457 - 464 |
|
doi>10.1145/1102351.1102409 |
|
Full text: Pdf
|
|
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are ...
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective. A recent theoretical connection between kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets. expand
|
|
|
A brain computer interface with online feedback based on magnetoencephalography |
| |
Thomas Navin Lal,
Michael Schröder,
N. Jeremy Hill,
Hubert Preissl,
Thilo Hinterberger,
Jürgen Mellinger,
Martin Bogdan,
Wolfgang Rosenstiel,
Thomas Hofmann,
Niels Birbaumer,
Bernhard Schölkopf
|
|
Pages: 465 - 472 |
|
doi>10.1145/1102351.1102410 |
|
Full text: Pdf
|
|
The aim of this paper is to show that machine learning techniques can be used to derive a classifying function for human brain signal data measured by magnetoencephalography (MEG), for the use in a brain computer interface (BCI). This is especially helpful ...
The aim of this paper is to show that machine learning techniques can be used to derive a classifying function for human brain signal data measured by magnetoencephalography (MEG), for the use in a brain computer interface (BCI). This is especially helpful for evaluating quickly whether a BCI approach based on electroencephalography, on which training may be slower due to lower signal-to-noise ratio, is likely to succeed. We apply RCE and regularized SVMs to the experimental data of ten healthy subjects performing a motor imagery task. Four subjects were able to use a trained classifier to write a short name. Further analysis gives evidence that the proposed imagination task is suboptimal for the possible extension to a multiclass interface. To the best of our knowledge this paper is the first working online MEG-based BCI and is therefore a "proof of concept". expand
|
|
|
Relating reinforcement learning performance to classification performance |
| |
John Langford,
Bianca Zadrozny
|
|
Pages: 473 - 480 |
|
doi>10.1145/1102351.1102411 |
|
Full text: Pdf
|
|
We prove a quantitative connection between the expected sum of rewards of a policy and binary classification performance on created subproblems. This connection holds without any unobservable assumptions (no assumption of independence, small mixing time, ...
We prove a quantitative connection between the expected sum of rewards of a policy and binary classification performance on created subproblems. This connection holds without any unobservable assumptions (no assumption of independence, small mixing time, fully observable states, or even hidden states) and the resulting statement is independent of the number of states or actions. The statement is critically dependent on the size of the rewards and prediction performance of the created classifiers.We also provide some general guidelines for obtaining good classification performance on the created subproblems. In particular, we discuss possible methods for generating training examples for a classifier learning algorithm. expand
|
|
|
PAC-Bayes risk bounds for sample-compressed Gibbs classifiers |
| |
François Laviolette,
Mario Marchand
|
|
Pages: 481 - 488 |
|
doi>10.1145/1102351.1102412 |
|
Full text: Pdf
|
|
We extend the PAC-Bayes theorem to the sample-compression setting where each classifier is represented by two independent sources of information: a compression set which consists of a small subset of the training data, and a message string ...
We extend the PAC-Bayes theorem to the sample-compression setting where each classifier is represented by two independent sources of information: a compression set which consists of a small subset of the training data, and a message string of the additional information needed to obtain a classifier. The new bound is obtained by using a prior over a data-independent set of objects where each object gives a classifier only when the training data is provided. The new PAC-Bayes theorem states that a Gibbs classifier defined on a posterior over sample-compressed classifiers can have a smaller risk bound than any such (deterministic) sample-compressed classifier. expand
|
|
|
Heteroscedastic Gaussian process regression |
| |
Quoc V. Le,
Alex J. Smola,
Stéphane Canu
|
|
Pages: 489 - 496 |
|
doi>10.1145/1102351.1102413 |
|
Full text: Pdf
|
|
This paper presents an algorithm to estimate simultaneously both mean and variance of a non parametric regression problem. The key point is that we are able to estimate variance locally unlike standard Gaussian Process regression or SVMs. This ...
This paper presents an algorithm to estimate simultaneously both mean and variance of a non parametric regression problem. The key point is that we are able to estimate variance locally unlike standard Gaussian Process regression or SVMs. This means that our estimator adapts to the local noise. The problem is cast in the setting of maximum a posteriori estimation in exponential families. Unlike previous work, we obtain a convex optimization problem which can be solved via Newton's method. expand
|
|
|
Predicting relative performance of classifiers from samples |
| |
Rui Leite,
Pavel Brazdil
|
|
Pages: 497 - 503 |
|
doi>10.1145/1102351.1102414 |
|
Full text: Pdf
|
|
This paper is concerned with the problem of predicting relative performance of classification algorithms. It focusses on methods that use results on small samples and discusses the shortcomings of previous approaches. A new variant is proposed that exploits, ...
This paper is concerned with the problem of predicting relative performance of classification algorithms. It focusses on methods that use results on small samples and discusses the shortcomings of previous approaches. A new variant is proposed that exploits, as some previous approaches, meta-learning. The method requires that experiments be conducted on few samples. The information gathered is used to identify the nearest learning curve for which the sampling procedure was carried out fully. This in turn permits to generate a prediction regards the relative performance of algorithms. Experimental evaluation shows that the method competes well with previous approaches and provides quite good and practical solution to this problem. expand
|
|
|
Logistic regression with an auxiliary data source |
| |
Xuejun Liao,
Ya Xue,
Lawrence Carin
|
|
Pages: 505 - 512 |
|
doi>10.1145/1102351.1102415 |
|
Full text: Pdf
|
|
To achieve good generalization in supervised learning, the training and testing examples are usually required to be drawn from the same source distribution. In this paper we propose a method to relax this requirement in the context of logistic regression. ...
To achieve good generalization in supervised learning, the training and testing examples are usually required to be drawn from the same source distribution. In this paper we propose a method to relax this requirement in the context of logistic regression. Assuming Dp and Da are two sets of examples drawn from two mismatched distributions, where Da are fully labeled and Dp partially labeled, our objective is to complete the labels of Dp. We introduce an auxiliary variable μ for each example in Da to reflect its mismatch with Dp. Under an appropriate constraint the μ's are estimated as a byproduct, along with the classifier. We also present an active learning approach for selecting the labeled examples in Dp. The proposed algorithm, called "Migratory-Logit" or M-Logit, is demonstrated successfully on simulated as well as real data sets. expand
|
|
|
Predicting protein folds with structural repeats using a chain graph model |
| |
Yan Liu,
Eric P. Xing,
Jaime Carbonell
|
|
Pages: 513 - 520 |
|
doi>10.1145/1102351.1102416 |
|
Full text: Pdf
|
|
Protein fold recognition is a key step towards inferring the tertiary structures from amino-acid sequences. Complex folds such as those consisting of interacting structural repeats are prevalent in proteins involved in a wide spectrum of biological functions. ...
Protein fold recognition is a key step towards inferring the tertiary structures from amino-acid sequences. Complex folds such as those consisting of interacting structural repeats are prevalent in proteins involved in a wide spectrum of biological functions. However, extant approaches often perform inadequately due to their inability to capture long-range interactions between structural units and to handle low sequence similarities across proteins (under 25% identity). In this paper, we propose a chain graph model built on a causally connected series of segmentation conditional random fields (SCRFs) to address these issues. Specifically, the SCRF model captures long-range interactions within recurring structural units and the Bayesian network backbone decomposes cross-repeat interactions into locally computable modules consisting of repeat-specific SCRFs and a model for sequence motifs. We applied this model to predict β-helices and leucine-rich repeats, and found it significantly outperforms extant methods in predictive accuracy and/or computational efficiency. expand
|
|
|
Unsupervised evidence integration |
| |
Philip M. Long,
Vinay Varadan,
Sarah Gilman,
Mark Treshock,
Rocco A. Servedio
|
|
Pages: 521 - 528 |
|
doi>10.1145/1102351.1102417 |
|
Full text: Pdf
|
|
Many biological propositions can be supported by a variety of different types of evidence. It is often useful to collect together large numbers of such propositions, together with the evidence supporting them, into databases to be used in other analyses. ...
Many biological propositions can be supported by a variety of different types of evidence. It is often useful to collect together large numbers of such propositions, together with the evidence supporting them, into databases to be used in other analyses. Methods that automatically make preliminary choices about which propositions to include can be helpful, if they are accurate enough. This can involve weighing evidence of varying strength.We describe a method for learning a scoring function to weigh evidence of different types. The algorithm evaluates each source of evidence by the extent to which other sources tend to support it. The details are guided by a probabilistic formulation of the problem, building on previous theoretical work. We evaluate our method by applying it to predict protein-protein interactions in yeast, and using synthetic data. expand
|
|
|
Naive Bayes models for probability estimation |
| |
Daniel Lowd,
Pedro Domingos
|
|
Pages: 529 - 536 |
|
doi>10.1145/1102351.1102418 |
|
Full text: Pdf
|
|
Naive Bayes models have been widely used for clustering and classification. However, they are seldom used for general probabilistic learning and inference (i.e., for estimating and computing arbitrary joint, conditional and marginal distributions). In ...
Naive Bayes models have been widely used for clustering and classification. However, they are seldom used for general probabilistic learning and inference (i.e., for estimating and computing arbitrary joint, conditional and marginal distributions). In this paper we show that, for a wide range of benchmark datasets, naive Bayes models learned using EM have accuracy and learning time comparable to Bayesian networks with context-specific independence. Most significantly, naive Bayes inference is orders of magnitude faster than Bayesian network inference using Gibbs sampling and belief propagation. This makes naive Bayes models a very attractive alternative to Bayesian networks for general probability estimation, particularly in large or real-time domains. expand
|
|
|
ROC confidence bands: an empirical evaluation |
| |
Sofus A. Macskassy,
Foster Provost,
Saharon Rosset
|
|
Pages: 537 - 544 |
|
doi>10.1145/1102351.1102419 |
|
Full text: Pdf
|
|
This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well they perform. Such confidence bands represent the region ...
This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well they perform. Such confidence bands represent the region where the "true" ROC curve is expected to reside, with the designated confidence level. To assess the containment of the bands we begin with a synthetic world where we know the true ROC curve---specifically, where the class-conditional model scores are normally distributed. The only method that attains reasonable containment out-of-the-box produces non-parametric, "fixed-width" bands (FWBs). Next we move to a context more appropriate for machine learning evaluations: bands that with a certain confidence level will bound the performance of the model on future data. We introduce a correction to account for the larger uncertainty, and the widened FWBs continue to have reasonable containment. Finally, we assess the bands on 10 relatively large benchmark data sets. We conclude by recommending these FWBs, noting that being non-parametric they are especially attractive for machine learning studies, where the score distributions (1) clearly are not normal, and (2) even for the same data set vary substantially from learning method to learning method. expand
|
|
|
Modeling word burstiness using the Dirichlet distribution |
| |
Rasmus E. Madsen,
David Kauchak,
Charles Elkan
|
|
Pages: 545 - 552 |
|
doi>10.1145/1102351.1102420 |
|
Full text: Pdf
|
|
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose ...
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model. expand
|
|
|
Proto-value functions: developmental reinforcement learning |
| |
Sridhar Mahadevan
|
|
Pages: 553 - 560 |
|
doi>10.1145/1102351.1102421 |
|
Full text: Pdf
|
|
This paper presents a novel framework called proto-reinforcement learning (PRL), based on a mathematical model of a proto-value function: these are task-independent basis functions that form the building blocks of all value functions on ...
This paper presents a novel framework called proto-reinforcement learning (PRL), based on a mathematical model of a proto-value function: these are task-independent basis functions that form the building blocks of all value functions on a given state space manifold. Proto-value functions are learned not from rewards, but instead from analyzing the topology of the state space. Formally, proto-value functions are Fourier eigenfunctions of the Laplace-Beltrami diffusion operator on the state space manifold. Proto-value functions facilitate structural decomposition of large state spaces, and form geodesically smooth orthonormal basis functions for approximating any value function. The theoretical basis for proto-value functions combines insights from spectral graph theory, harmonic analysis, and Riemannian manifolds. Proto-value functions enable a novel generation of algorithms called representation policy iteration, unifying the learning of representation and behavior. expand
|
|
|
The cross entropy method for classification |
| |
Shie Mannor,
Dori Peleg,
Reuven Rubinstein
|
|
Pages: 561 - 568 |
|
doi>10.1145/1102351.1102422 |
|
Full text: Pdf
|
|
We consider support vector machines for binary classification. As opposed to most approaches we use the number of support vectors (the "L0 norm") as a regularizing term instead of the L1 or L2 norms. ...
We consider support vector machines for binary classification. As opposed to most approaches we use the number of support vectors (the "L0 norm") as a regularizing term instead of the L1 or L2 norms. In order to solve the optimization problem we use the cross entropy method to search over the possible sets of support vectors. The algorithm consists of solving a sequence of efficient linear programs. We report experiments where our method produces generalization errors that are similar to support vector machines, while using a considerably smaller number of support vectors. expand
|
|
|
Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees |
| |
H. Brendan McMahan,
Maxim Likhachev,
Geoffrey J. Gordon
|
|
Pages: 569 - 576 |
|
doi>10.1145/1102351.1102423 |
|
Full text: Pdf
|
|
MDPs are an attractive formalization for planning, but realistic problems often have intractably large state spaces. When we only need a partial policy to get from a fixed start state to a goal, restricting computation to states relevant to this task ...
MDPs are an attractive formalization for planning, but realistic problems often have intractably large state spaces. When we only need a partial policy to get from a fixed start state to a goal, restricting computation to states relevant to this task can make much larger problems tractable. We introduce a new algorithm, Bounded RTDP, which can produce partial policies with strong performance guarantees while only touching a fraction of the state space, even on problems where other algorithms would have to visit the full state space. To do so, Bounded RTDP maintains both upper and lower bounds on the optimal value function. The performance of Bounded RTDP is greatly aided by the introduction of a new technique to efficiently find suitable upper bounds; this technique can also be used to provide informed initialization to a wide range of other planning algorithms. expand
|
|
|
Comparing clusterings: an axiomatic view |
| |
Marina Meilǎ
|
|
Pages: 577 - 584 |
|
doi>10.1145/1102351.1102424 |
|
Full text: Pdf
|
|
This paper views clusterings as elements of a lattice. Distances between clusterings are analyzed in their relationship to the lattice. From this vantage point, we first give an axiomatic characterization of some criteria for comparing clusterings, including ...
This paper views clusterings as elements of a lattice. Distances between clusterings are analyzed in their relationship to the lattice. From this vantage point, we first give an axiomatic characterization of some criteria for comparing clusterings, including the variation of information and the unadjusted Rand index. Then we study other distances between partitions w.r.t these axioms and prove an impossibility result: there is no "sensible" criterion for comparing clusterings that is simultaneously (1) aligned with the lattice of partitions, (2) convexely additive, and (3) bounded. expand
|
|
|
Weighted decomposition kernels |
| |
Sauro Menchetti,
Fabrizio Costa,
Paolo Frasconi
|
|
Pages: 585 - 592 |
|
doi>10.1145/1102351.1102425 |
|
Full text: Pdf
|
|
We introduce a family of kernels on discrete data structures within the general class of decomposition kernels. A weighted decomposition kernel (WDK) is computed by dividing objects into substructures indexed by a selector. Two substructures are then ...
We introduce a family of kernels on discrete data structures within the general class of decomposition kernels. A weighted decomposition kernel (WDK) is computed by dividing objects into substructures indexed by a selector. Two substructures are then matched if their selectors satisfy an equality predicate, while the importance of the match is determined by a probability kernel on local distributions fitted on the substructures. Under reasonable assumptions, a WDK can be computed efficiently and can avoid combinatorial explosion of the feature space. We report experimental evidence that the proposed kernel is highly competitive with respect to more complex state-of-the-art methods on a set of problems in bioinformatics. expand
|
|
|
High speed obstacle avoidance using monocular vision and reinforcement learning |
| |
Jeff Michels,
Ashutosh Saxena,
Andrew Y. Ng
|
|
Pages: 593 - 600 |
|
doi>10.1145/1102351.1102426 |
|
Full text: Pdf
|
|
We consider the task of driving a remote control car at high speeds through unstructured outdoor environments. We present an approach in which supervised learning is first used to estimate depths from single monocular images. The learning algorithm can ...
We consider the task of driving a remote control car at high speeds through unstructured outdoor environments. We present an approach in which supervised learning is first used to estimate depths from single monocular images. The learning algorithm can be trained either on real camera images labeled with ground-truth distances to the closest obstacles, or on a training set consisting of synthetic graphics images. The resulting algorithm is able to learn monocular vision cues that accurately estimate the relative depths of obstacles in a scene. Reinforcement learning/policy search is then applied within a simulator that renders synthetic scenes. This learns a control policy that selects a steering direction as a function of the vision system's output. We present results evaluating the predictive ability of the algorithm both on held out test data, and in actual autonomous driving experiments. expand
|
|
|
Dynamic preferences in multi-criteria reinforcement learning |
| |
Sriraam Natarajan,
Prasad Tadepalli
|
|
Pages: 601 - 608 |
|
doi>10.1145/1102351.1102427 |
|
Full text: Pdf
|
|
The current framework of reinforcement learning is based on maximizing the expected returns based on scalar rewards. But in many real world situations, tradeoffs must be made among multiple objectives. Moreover, the agent's preferences between different ...
The current framework of reinforcement learning is based on maximizing the expected returns based on scalar rewards. But in many real world situations, tradeoffs must be made among multiple objectives. Moreover, the agent's preferences between different objectives may vary with time. In this paper, we consider the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance. We propose a method that allows us to store a finite number of policies, choose an appropriate policy for any weight vector and improve upon it. The idea is that although there are infinitely many weight vectors, they may be well-covered by a small number of optimal policies. We show this empirically in two domains: a version of the Buridan's ass problem and network routing. expand
|
|
|
Learning first-order probabilistic models with combining rules |
| |
Sriraam Natarajan,
Prasad Tadepalli,
Eric Altendorf,
Thomas G. Dietterich,
Alan Fern,
Angelo Restificar
|
|
Pages: 609 - 616 |
|
doi>10.1145/1102351.1102428 |
|
Full text: Pdf
|
|
First-order probabilistic models allow us to model situations in which a random variable in the first-order model may have a large and varying numbers of parent variables in the ground ("unrolled") model. One approach to compactly describing such models ...
First-order probabilistic models allow us to model situations in which a random variable in the first-order model may have a large and varying numbers of parent variables in the ground ("unrolled") model. One approach to compactly describing such models is to independently specify the probability of a random variable conditioned on each individual parent (or small sets of parents) and then combine these conditional distributions via a combining rule (e.g., Noisy-OR). This paper presents algorithms for learning with combining rules. Specifically, algorithms based on gradient descent and expectation maximization are derived, implemented, and evaluated on synthetic data and on a real-world task. The results demonstrate that the algorithms are able to learn the parameters of both the individual parent-target distributions and the combining rules. expand
|
|
|
An efficient method for simplifying support vector machines |
| |
DucDung Nguyen,
TuBao Ho
|
|
Pages: 617 - 624 |
|
doi>10.1145/1102351.1102429 |
|
Full text: Pdf
|
|
In this paper we describe a new method to reduce the complexity of support vector machines by reducing the number of necessary support vectors included in their solutions. The reduction process iteratively selects two nearest support vectors belonging ...
In this paper we describe a new method to reduce the complexity of support vector machines by reducing the number of necessary support vectors included in their solutions. The reduction process iteratively selects two nearest support vectors belonging to the same class and replaces them by a newly constructed vector. Through the analysis of relation between vectors in the input and feature spaces, we present the construction of new vectors that requires to find the unique maximum point of a one-variable function on the interval (0, 1), not to minimize a function of many variables with local minimums in former reduced set methods. Experimental results on real life datasets show that the proposed method is effective in reducing number of support vectors and preserving machine's generalization performance. expand
|
|
|
Predicting good probabilities with supervised learning |
| |
Alexandru Niculescu-Mizil,
Rich Caruana
|
|
Pages: 625 - 632 |
|
doi>10.1145/1102351.1102430 |
|
Full text: Pdf
|
|
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding ...
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models such as neural nets and bagged trees do not have these biases and predict well calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities. expand
|
|
|
Recycling data for multi-agent learning |
| |
Santiago Ontañón,
Enric Plaza
|
|
Pages: 633 - 640 |
|
doi>10.1145/1102351.1102431 |
|
Full text: Pdf
|
|
Learning agents can improve performance cooperating with other agents, particularly learning agents forming a committee outperform individual agents. This "ensemble effect" is well known for multi-classifier systems in Machine Learning. However, multi-classifier ...
Learning agents can improve performance cooperating with other agents, particularly learning agents forming a committee outperform individual agents. This "ensemble effect" is well known for multi-classifier systems in Machine Learning. However, multi-classifier systems assume all data is known to all classifiers while we focus on agents that learn from cases (examples) that are owned and stored individually. In this article we focus on how individual agents can engage in bargaining activities that improve the performance of both individual agents and the committee. The agents are capable of self-evaluation and determining that some data used for learning is unnecessary. This "refuse" data can then be exploited by other agents that might found some part of it profitable to improve their performance. The experiments we performed show that this approach improves both individual and committee performance and we analyze how these results in terms of the "ensemble effect". expand
|
|
|
A graphical model for chord progressions embedded in a psychoacoustic space |
| |
Jean-François Paiement,
Douglas Eck,
Samy Bengio,
David Barber
|
|
Pages: 641 - 648 |
|
doi>10.1145/1102351.1102432 |
|
Full text: Pdf
|
|
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed ...
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies. expand
|
|
|
Q-learning of sequential attention for visual object recognition from informative local descriptors |
| |
Lucas Paletta,
Gerald Fritz,
Christin Seifert
|
|
Pages: 649 - 656 |
|
doi>10.1145/1102351.1102433 |
|
Full text: Pdf
|
|
This work provides a framework for learning sequential attention in real-world visual object recognition, using an architecture of three processing stages. The first stage rejects irrelevant local descriptors based on an information theoretic saliency ...
This work provides a framework for learning sequential attention in real-world visual object recognition, using an architecture of three processing stages. The first stage rejects irrelevant local descriptors based on an information theoretic saliency measure, providing candidates for foci of interest (FOI). The second stage investigates the information in the FOI using a codebook matcher and providing weak object hypotheses. The third stage integrates local information via shifts of attention, resulting in chains of descriptor-action pairs that characterize object discrimination. A Q-learner adapts then from explorative search and evaluative feedback from entropy decreases on the attention sequences, eventually prioritizing shifts that lead to a geometry of descriptor-action scanpaths that is highly discriminative with respect to object recognition. The methodology is successfully evaluated on indoors (COIL-20 database) and outdoors (TSG-20 database) imagery, demonstrating significant impact by learning, outperforming standard local descriptor based methods both in recognition accuracy and processing time. expand
|
|
|
Discriminative versus generative parameter and structure learning of Bayesian network classifiers |
| |
Franz Pernkopf,
Jeff Bilmes
|
|
Pages: 657 - 664 |
|
doi>10.1145/1102351.1102434 |
|
Full text: Pdf
|
|
In this paper, we compare both discriminative and generative parameter learning on both discriminatively and generatively structured Bayesian network classifiers. We use either maximum likelihood (ML) or conditional maximum likelihood (CL) ...
In this paper, we compare both discriminative and generative parameter learning on both discriminatively and generatively structured Bayesian network classifiers. We use either maximum likelihood (ML) or conditional maximum likelihood (CL) to optimize network parameters. For structure learning, we use either conditional mutual information (CMI), the explaining away residual (EAR), or the classification rate (CR) as objective functions. Experiments with the naive Bayes classifier (NB), the tree augmented naive Bayes classifier (TAN), and the Bayesian multinet have been performed on 25 data sets from the UCI repository (Merz et al., 1997) and from (Kohavi & John, 1997). Our empirical study suggests that discriminative structures learnt using CR produces the most accurate classifiers on almost half the data sets. This approach is feasible, however, only for rather small problems since it is computationally expensive. Discriminative parameter learning produces on average a better classifier than ML parameter learning. expand
|
|
|
Optimizing abstaining classifiers using ROC analysis |
| |
Tadeusz Pietraszek
|
|
Pages: 665 - 672 |
|
doi>10.1145/1102351.1102435 |
|
Full text: Pdf
|
|
Classifiers that refrain from classification in certain cases can significantly reduce the misclassification cost. However, the parameters for such abstaining classifiers are often set in a rather ad-hoc manner. We propose a method to optimally build ...
Classifiers that refrain from classification in certain cases can significantly reduce the misclassification cost. However, the parameters for such abstaining classifiers are often set in a rather ad-hoc manner. We propose a method to optimally build a specific type of abstaining binary classifiers using ROC analysis. These classifiers are built based on optimization criteria in the following three models: cost-based, bounded-abstention and bounded-improvement. We demonstrate the usage and applications of these models to effectively reduce misclassification cost in real classification systems. The method has been validated with a ROC building algorithm and cross-validation on 15 UCI KDD datasets. expand
|
|
|
Independent subspace analysis using geodesic spanning trees |
| |
Barnabás Póczos,
András Lõrincz
|
|
Pages: 673 - 680 |
|
doi>10.1145/1102351.1102436 |
|
Full text: Pdf
|
|
A novel algorithm for performing Independent Subspace Analysis, the estimation of hidden independent subspaces is introduced. This task is a generalization of Independent Component Analysis. The algorithm works by estimating the multi-dimensional differential ...
A novel algorithm for performing Independent Subspace Analysis, the estimation of hidden independent subspaces is introduced. This task is a generalization of Independent Component Analysis. The algorithm works by estimating the multi-dimensional differential entropy. The estimation utilizes minimal geodesic spanning trees matched to the sample points. Numerical studies include (i) illustrative examples, (ii) a generalization of the cocktail-party problem to songs played by bands, and (iii) an example on mixed independent subspaces, where subspaces have dependent sources, which are pairwise independent. expand
|
|
|
A model for handling approximate, noisy or incomplete labeling in text classification |
| |
Ganesh Ramakrishnan,
Krishna Prasad Chitrapura,
Raghu Krishnapuram,
Pushpak Bhattacharyya
|
|
Pages: 681 - 688 |
|
doi>10.1145/1102351.1102437 |
|
Full text: Pdf
|
|
We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of text documents, the model estimates the joint distribution of training ...
We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of text documents, the model estimates the joint distribution of training documents and class labels by using a generalization of the Expectation Maximization algorithm. The estimates can be used in standard classification models to reduce error rates. Since uncertainties in the labeling are taken into account, the model provides an elegant mechanism to deal with noisy labels. We provide an intuitive modification to the EM iterations by re-estimating the empirical. distribution in order to reinforce feature values in unlabeled data and to reduce the influence of noisily labeled examples. Considerable improvement in the classification accuracies of two popular classification algorithms on standard labeled data-sets with and without artificially introduced noise, as well as in the presence and absence of unlabeled data, indicates that this may be a promising method to reduce the burden of manual labeling. expand
|
|
|
Healing the relevance vector machine through augmentation |
| |
Carl Edward Rasmussen,
Joaquin Quiñonero-Candela
|
|
Pages: 689 - 696 |
|
doi>10.1145/1102351.1102438 |
|
Full text: Pdf
|
|
The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel method. It provides full predictive distributions for test cases. However, the predictive uncertainties have the unintuitive property, that they get smaller the further you move ...
The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel method. It provides full predictive distributions for test cases. However, the predictive uncertainties have the unintuitive property, that they get smaller the further you move away from the training cases. We give a thorough analysis. Inspired by the analogy to non-degenerate Gaussian Processes, we suggest augmentation to solve the problem. The purpose of the resulting model, RVM*, is primarily to corroborate the theoretical and experimental analysis. Although RVM* could be used in practical applications, it is no longer a truly sparse model. Experiments show that sparsity comes at the expense of worse predictive. distributions. expand
|
|
|
Supervised versus multiple instance learning: an empirical comparison |
| |
Soumya Ray,
Mark Craven
|
|
Pages: 697 - 704 |
|
doi>10.1145/1102351.1102439 |
|
Full text: Pdf
|
|
We empirically study the relationship between supervised and multiple instance (MI) learning. Algorithms to learn various concepts have been adapted to the MI representation. However, it is also known that concepts that are PAC-learnable with one-sided ...
We empirically study the relationship between supervised and multiple instance (MI) learning. Algorithms to learn various concepts have been adapted to the MI representation. However, it is also known that concepts that are PAC-learnable with one-sided noise can be learned from MI data. A relevant question then is: how well do supervised learners do on MI data? We attempt to answer this question by looking at a cross section of MI data sets from various domains coupled with a number of learning algorithms including Diverse Density, Logistic Regression, nonlinear Support Vector Machines and FOIL. We consider a supervised and MI version of each learner. Several interesting conclusions emerge from our work: (1) no MI algorithm is superior across all tested domains, (2) some MI algorithms are consistently superior to their supervised counterparts, (3) using high false-positive costs can improve a supervised learner's performance in MI domains, and (4) in several domains, a supervised algorithm is superior to any MI algorithm we tested. expand
|
|
|
Generalized skewing for functions with continuous and nominal attributes |
| |
Soumya Ray,
David Page
|
|
Pages: 705 - 712 |
|
doi>10.1145/1102351.1102440 |
|
Full text: Pdf
|
|
This paper extends previous work on skewing, an approach to problematic functions in decision tree induction. The previous algorithms were applicable only to functions of binary variables. In this paper, we extend skewing to directly handle functions ...
This paper extends previous work on skewing, an approach to problematic functions in decision tree induction. The previous algorithms were applicable only to functions of binary variables. In this paper, we extend skewing to directly handle functions of continuous and nominal variables. We present experiments with randomly generated functions and a number of real world datasets to evaluate the algorithm's accuracy. Our results indicate that our algorithm almost always outperforms an Information Gain-based decision tree learner. expand
|
|
|
Fast maximum margin matrix factorization for collaborative prediction |
| |
Jasson D. M. Rennie,
Nathan Srebro
|
|
Pages: 713 - 719 |
|
doi>10.1145/1102351.1102441 |
|
Full text: Pdf
|
|
Maximum Margin Matrix Factorization (MMMF) was recently suggested (Srebro et al., 2005) as a convex, infinite dimensional alternative to low-rank approximations and standard factor models. MMMF can be formulated as a semi-definite programming (SDP) and ...
Maximum Margin Matrix Factorization (MMMF) was recently suggested (Srebro et al., 2005) as a convex, infinite dimensional alternative to low-rank approximations and standard factor models. MMMF can be formulated as a semi-definite programming (SDP) and learned using standard SDP solvers. However, current SDP solvers can only handle MMMF problems on matrices of dimensionality up to a few hundred. Here, we investigate a direct gradient-based optimization method for MMMF and demonstrate it on large collaborative prediction problems. We compare against results obtained by Marlin (2004) and find that MMMF substantially outperforms all nine methods he tested. expand
|
|
|
Coarticulation: an approach for generating concurrent plans in Markov decision processes |
| |
Khashayar Rohanimanesh,
Sridhar Mahadevan
|
|
Pages: 720 - 727 |
|
doi>10.1145/1102351.1102442 |
|
Full text: Pdf
|
|
We study an approach for performing concurrent activities in Markov decision processes (MDPs) based on the coarticulation framework. We assume that the agent has multiple degrees of freedom (DOF) in the action space which enables it to perform activities ...
We study an approach for performing concurrent activities in Markov decision processes (MDPs) based on the coarticulation framework. We assume that the agent has multiple degrees of freedom (DOF) in the action space which enables it to perform activities simultaneously. We demonstrate that one natural way for generating concurrency in the system is by coarticulating among the set of learned activities available to the agent. In general due to the multiple DOF in the system, often there exists a redundant set of admissible sub-optimal policies associated with each learned activity. Such flexibility enables the agent to concurrently commit to several subgoals according to their priority levels, given a new task defined in terms of a set of prioritized subgoals. We present efficient approximate algorithms for computing such policies and for generating concurrent plans. We also evaluate our approach in a simulated domain. expand
|
|
|
Why skewing works: learning difficult Boolean functions with greedy tree learners |
| |
Bernard Rosell,
Lisa Hellerstein,
Soumya Ray,
David Page
|
|
Pages: 728 - 735 |
|
doi>10.1145/1102351.1102443 |
|
Full text: Pdf
|
|
We analyze skewing, an approach that has been empirically observed to enable greedy decision tree learners to learn "difficult" Boolean functions, such as parity, in the presence of irrelevant variables. We prove tha, in an idealized setting, ...
We analyze skewing, an approach that has been empirically observed to enable greedy decision tree learners to learn "difficult" Boolean functions, such as parity, in the presence of irrelevant variables. We prove tha, in an idealized setting, for any function and choice of skew parameters, skewing finds relevant variables with probability 1. We present experiments exploring how different parameter choices affect the success of skewing in empirical settings. Finally, we analyze a variant of skewing called Sequential Skewing. expand
|
|
|
Integer linear programming inference for conditional random fields |
| |
Dan Roth,
Wen-tau Yih
|
|
Pages: 736 - 743 |
|
doi>10.1145/1102351.1102444 |
|
Full text: Pdf
|
|
Inference in Conditional Random Fields and Hidden Markov Models is done using the Viterbi algorithm, an efficient dynamic programming algorithm. In many cases, general (non-local and non-sequential) constraints may exist over the output sequence, but ...
Inference in Conditional Random Fields and Hidden Markov Models is done using the Viterbi algorithm, an efficient dynamic programming algorithm. In many cases, general (non-local and non-sequential) constraints may exist over the output sequence, but cannot be incorporated and exploited in a natural way by this inference procedure. This paper proposes a novel inference procedure based on integer linear programming (ILP) and extends CRF models to naturally and efficiently support general constraint structures. For sequential constraints, this procedure reduces to simple linear programming as the inference process. Experimental evidence is supplied in the context of an important NLP problem, semantic role labeling. expand
|
|
|
Learning hierarchical multi-category text classification models |
| |
Juho Rousu,
Craig Saunders,
Sandor Szedmak,
John Shawe-Taylor
|
|
Pages: 744 - 751 |
|
doi>10.1145/1102351.1102445 |
|
Full text: Pdf
|
|
We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. The algorithm's predictive accuracy is competitive with other recently introduced hierarchical multi-category or multilabel classification learning algorithms. expand
|
|
|
Expectation maximization algorithms for conditional likelihoods |
| |
Jarkko Salojärvi,
Kai Puolamäki,
Samuel Kaski
|
|
Pages: 752 - 759 |
|
doi>10.1145/1102351.1102446 |
|
Full text: Pdf
|
|
We introduce an expectation maximization-type (EM) algorithm for maximum likelihood optimization of conditional densities. It is applicable to hidden variable models where the distributions are from the exponential family. The algorithm can alternatively be viewed as automatic step size selection for gradient ascent, where the amount of computation is traded off to guarantees that each step increases the likelihood. The tradeoff makes the algorithm computationally more feasible than the earlier conditional EM. The method gives a theoretical basis for extended Baum Welch algorithms used in discriminative hidden Markov models in speech recognition, and compares favourably with the current best method in the experiments. expand
|
|
|
Estimating and computing density based distance metrics |
| |
Sajama,
Alon Orlitsky
|
|
Pages: 760 - 767 |
|
doi>10.1145/1102351.1102447 |
|
Full text: Pdf
|
|
Density-based distance metrics have applications in semi-supervised learning, nonlinear interpolation and clustering. We consider density-based metrics induced by Riemannian manifold structures and estimate them using kernel density estimators for the underlying data distribution. We lower bound the rate of convergence of these plug-in path-length estimates and hence of the metric, as the sample size increases. We present an upper bound on the rate of convergence of all estimators of the metric. We also show that the metric can be consistently computed using the shortest path algorithm on a suitably constructed graph on the data samples and lower bound the convergence rate of the computation error. We present experiments illustrating the use of the metrics for semi-supervised classification and non-linear interpolation. expand
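As a rough illustration of the computational recipe described above (not the authors' estimator), one can plug a kernel density estimate into the edge lengths of a nearest-neighbour graph and read the metric off shortest paths; the rescaling of each edge by the density at its midpoint is an assumption made only for this sketch.

    # A minimal sketch (not the paper's exact estimator): (1) estimate the data
    # density with a kernel density estimator, (2) build a k-nearest-neighbour
    # graph whose edge lengths are Euclidean lengths rescaled by the density at
    # the edge midpoint, (3) run a shortest-path algorithm on that graph.
    import numpy as np
    from scipy.stats import gaussian_kde
    from scipy.sparse import lil_matrix
    from scipy.sparse.csgraph import shortest_path

    def density_based_distances(X, k=5, alpha=1.0):
        n = X.shape[0]
        kde = gaussian_kde(X.T)                      # kernel density estimate of the data
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        order = np.argsort(d2, axis=1)
        W = lil_matrix((n, n))
        for i in range(n):
            for j in order[i, 1:k + 1]:              # k nearest neighbours of point i
                mid = 0.5 * (X[i] + X[j])
                dens = kde(mid)[0]
                # edge length: Euclidean length inflated in low-density regions
                W[i, j] = W[j, i] = np.sqrt(d2[i, j]) / (dens ** alpha + 1e-12)
        return shortest_path(W.tocsr(), method="D", directed=False)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
        D = density_based_distances(X)
        print(D.shape, D[0, -1])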
|
|
|
Supervised dimensionality reduction using mixture models |
| |
Sajama,
Alon Orlitsky
|
|
Pages: 768 - 775 |
|
doi>10.1145/1102351.1102448 |
|
Full text: Pdf
|
|
Given a classification problem, our goal is to find a low-dimensional linear transformation of the feature vectors which retains information needed to predict the class labels. We present a method based on maximum conditional likelihood estimation of mixture models. Use of mixture models allows us to approximate the distributions to any desired accuracy while use of conditional likelihood as the contrast function ensures that the selected subspace retains maximum possible mutual information between feature vectors and class labels. Classification experiments using Gaussian mixture components show that this method compares favorably to related dimension reduction techniques. Other distributions belonging to the exponential family can be used to reduce dimensions when data is of a special type, for example binary or integer valued data. We provide an EM-like algorithm for model estimation and present visualization experiments using Gaussian and Bernoulli mixture models. expand
|
|
|
Object correspondence as a machine learning problem |
| |
Bernhard Schölkopf,
Florian Steinke,
Volker Blanz
|
|
Pages: 776 - 783 |
|
doi>10.1145/1102351.1102449 |
|
Full text: Pdf
|
|
We propose machine learning methods for the estimation of deformation fields that transform two given objects into each other, thereby establishing a dense point to point correspondence. The fields are computed using a modified support vector machine containing a penalty enforcing that points of one object will be mapped to "similar" points on the other one. Our system, which contains little engineering or domain knowledge, delivers state of the art performance. We present application results including close to photorealistic morphs of 3D head models. expand
|
|
|
Analysis and extension of spectral methods for nonlinear dimensionality reduction |
| |
Fei Sha,
Lawrence K. Saul
|
|
Pages: 784 - 791 |
|
doi>10.1145/1102351.1102450 |
|
Full text: Pdf
|
|
Many unsupervised algorithms for nonlinear dimensionality reduction, such as locally linear embedding (LLE) and Laplacian eigenmaps, are derived from the spectral decompositions of sparse matrices. While these algorithms aim to preserve certain proximity relations on average, their embeddings are not explicitly designed to preserve local features such as distances or angles. In this paper, we show how to construct a low dimensional embedding that maximally preserves angles between nearby data points. The embedding is derived from the bottom eigenvectors of LLE and/or Laplacian eigenmaps by solving an additional (but small) problem in semidefinite programming, whose size is independent of the number of data points. The solution obtained by semidefinite programming also yields an estimate of the data's intrinsic dimensionality. Experimental results on several data sets demonstrate the merits of our approach. expand
|
|
|
Non-negative tensor factorization with applications to statistics and computer vision |
| |
Amnon Shashua,
Tamir Hazan
|
|
Pages: 792 - 799 |
|
doi>10.1145/1102351.1102451 |
|
Full text: Pdf
|
|
We derive algorithms for finding a non-negative n-dimensional tensor factorization (n-NTF) which includes the non-negative matrix factorization (NMF) as a particular case when n = 2. We motivate the use of n-NTF in three areas of data analysis: (i) connection to latent class models in statistics, (ii) sparse image coding in computer vision, and (iii) model selection problems. We derive a "direct" positive-preserving gradient descent algorithm and an alternating scheme based on repeated multiple rank-1 problems. expand
|
|
|
Fast inference and learning in large-state-space HMMs |
| |
Sajid M. Siddiqi,
Andrew W. Moore
|
|
Pages: 800 - 807 |
|
doi>10.1145/1102351.1102452 |
|
Full text: Pdf
|
|
For Hidden Markov Models (HMMs) with fully connected transition models, the three fundamental problems of evaluating the likelihood of an observation sequence, estimating an optimal state sequence for the observations, and learning the model parameters, all have quadratic time complexity in the number of states. We introduce a novel class of non-sparse Markov transition matrices called Dense-Mostly-Constant (DMC) transition matrices that allow us to derive new algorithms for solving the basic HMM problems in sub-quadratic time. We describe the DMC HMM model and algorithms and attempt to convey some intuition for their usage. Empirical results for these algorithms show dramatic speedups for all three problems. In terms of accuracy, the DMC model yields strong results and outperforms the baseline algorithms even in domains known to violate the DMC assumption. expand
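The following minimal sketch illustrates why a dense-mostly-constant structure yields sub-quadratic recursions: if row i of the transition matrix equals a row constant c[i] except for K explicitly stored entries, one forward step costs O(NK) rather than O(N^2). The parameterisation and function names below are illustrative only, not the paper's implementation.

    # Sketch of a DMC-style forward step under the above assumption.
    import numpy as np

    def dmc_forward_step(alpha, c, sparse_rows, emission_col):
        """alpha: current forward vector (N,); c: per-row constant (N,);
        sparse_rows: list of (cols, vals) giving the K non-constant entries of
        each row; emission_col: P(observation | state) for the next symbol (N,)."""
        N = alpha.shape[0]
        base = float(alpha @ c)                 # contribution if every entry were the row constant
        nxt = np.full(N, base)
        for i, (cols, vals) in enumerate(sparse_rows):
            # correct only the K entries of row i that deviate from c[i]
            nxt[cols] += alpha[i] * (vals - c[i])
        return nxt * emission_col               # fold in the emission probabilities

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        N, K = 6, 2
        c = np.full(N, 0.05)
        sparse_rows = []
        for i in range(N):
            cols = rng.choice(N, size=K, replace=False)
            vals = rng.uniform(0.2, 0.4, size=K)
            sparse_rows.append((cols, vals))
        alpha = np.full(N, 1.0 / N)              # toy example; rows are not exactly normalised
        emission = rng.uniform(0.1, 1.0, size=N)
        print(dmc_forward_step(alpha, c, sparse_rows, emission))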
|
|
|
New d-separation identification results for learning continuous latent variable models |
| |
Ricardo Silva,
Richard Scheines
|
|
Pages: 808 - 815 |
|
doi>10.1145/1102351.1102453 |
|
Full text: Pdf
|
|
Learning the structure of graphical models is an important task, but one of considerable difficulty when latent variables are involved. Because conditional independences using hidden variables cannot be directly observed, one has to rely on alternative methods to identify the d-separations that define the graphical structure. This paper describes new distribution-free techniques for identifying d-separations in continuous latent variable models when non-linear dependencies are allowed among hidden variables. expand
|
|
|
Identifying useful subgoals in reinforcement learning by local graph partitioning |
| |
Özgür Şimşek,
Alicia P. Wolfe,
Andrew G. Barto
|
|
Pages: 816 - 823 |
|
doi>10.1145/1102351.1102454 |
|
Full text: Pdf
|
|
We present a new subgoal-based method for automatically creating useful skills in reinforcement learning. Our method identifies subgoals by partitioning local state transition graphs---those that are constructed using only the most recent experiences of the agent. The local scope of our subgoal discovery method allows it to successfully identify the type of subgoals we seek---states that lie between two densely-connected regions of the state space while producing an algorithm with low computational cost. expand
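A rough sketch of the flavour of such a method (not the authors' algorithm): build a graph from a recent window of transitions, bipartition it with a spectral approximation to the normalized cut, and flag states with edges crossing the cut as candidate subgoals, i.e., states lying between two densely connected regions.

    # Simplified local-graph partitioning for subgoal candidates.
    import numpy as np

    def candidate_subgoals(transitions, n_states):
        # adjacency from the recent transition window
        A = np.zeros((n_states, n_states))
        for s, s_next in transitions:
            A[s, s_next] += 1.0
            A[s_next, s] += 1.0
        d = A.sum(1)
        d[d == 0] = 1.0
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(n_states) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
        # Fiedler vector (second-smallest eigenvector) approximates the min normalized cut
        vals, vecs = np.linalg.eigh(L)
        side = vecs[:, 1] >= 0
        # states with an edge crossing the cut are the candidate subgoals
        cands = set()
        for s in range(n_states):
            for t in range(n_states):
                if A[s, t] > 0 and side[s] != side[t]:
                    cands.add(s)
                    cands.add(t)
        return sorted(cands)

    if __name__ == "__main__":
        # two 3-state "rooms" (0-2 and 3-5) joined through states 2 and 3
        trans = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
        print(candidate_subgoals(trans, 6))   # expect the doorway states 2 and 3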
|
|
|
Beyond the point cloud: from transductive to semi-supervised learning |
| |
Vikas Sindhwani,
Partha Niyogi,
Mikhail Belkin
|
|
Pages: 824 - 831 |
|
doi>10.1145/1102351.1102455 |
|
Full text: Pdf
|
|
Due to its occurrence in engineering domains and implications for natural learning, the problem of utilizing unlabeled data is attracting increasing attention in machine learning. A large body of recent literature has focussed on the transductive setting where labels of unlabeled examples are estimated by learning a function defined only over the point cloud data. In a truly semi-supervised setting however, a learning machine has access to labeled and unlabeled examples and must make predictions on data points never encountered before. In this paper, we show how to turn transductive and standard supervised learning algorithms into semi-supervised learners. We construct a family of data-dependent norms on Reproducing Kernel Hilbert Spaces (RKHS). These norms allow us to warp the structure of the RKHS to reflect the underlying geometry of the data. We derive explicit formulas for the corresponding new kernels. Our approach demonstrates state of the art performance on a variety of classification tasks. expand
|
|
|
Active learning for sampling in time-series experiments with application to gene expression analysis |
| |
Rohit Singh,
Nathan Palmer,
David Gifford,
Bonnie Berger,
Ziv Bar-Joseph
|
|
Pages: 832 - 839 |
|
doi>10.1145/1102351.1102456 |
|
Full text: Pdf
|
|
Many time-series experiments seek to estimate some signal as a continuous function of time. In this paper, we address the sampling problem for such experiments: determining which time-points ought to be sampled in order to minimize the cost of data collection. We restrict our attention to a growing class of experiments which measure multiple signals at each time-point and where raw materials/observations are archived initially, and selectively analyzed later, this analysis being the more expensive step. We present an active learning algorithm for iteratively choosing time-points to sample, using the uncertainty in the quality of the currently estimated time-dependent curve as the objective function. Using simulated data as well as gene expression data, we show that our algorithm performs well, and can significantly reduce experimental cost without loss of information. expand
|
|
|
Compact approximations to Bayesian predictive distributions |
| |
Edward Snelson,
Zoubin Ghahramani
|
|
Pages: 840 - 847 |
|
doi>10.1145/1102351.1102457 |
|
Full text: Pdf
|
|
We provide a general framework for learning precise, compact, and fast representations of the Bayesian predictive distribution for a model. This framework is based on minimizing the KL divergence between the true predictive density and a suitable compact approximation. We consider various methods for doing this, both sampling based approximations, and deterministic approximations such as expectation propagation. These methods are tested on a mixture of Gaussians model for density estimation and on binary linear classification, with both synthetic data sets for visualization and several real data sets. Our results show significant reductions in prediction time and memory footprint. expand
|
|
|
Large scale genomic sequence SVM classifiers |
| |
Sören Sonnenburg,
Gunnar Rätsch,
Bernhard Schölkopf
|
|
Pages: 848 - 855 |
|
doi>10.1145/1102351.1102458 |
|
Full text: Pdf
|
|
In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree kernel (WD). In particular, we suggest several extensions using Suffix Trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the spectrum kernel and WD kernel, large scale SVM training can be accelerated by factors of 20 and 4, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of Support Vectors). Our method allows us to train on sets as large as one million sequences.
|
|
|
A theoretical analysis of Model-Based Interval Estimation |
| |
Alexander L. Strehl,
Michael L. Littman
|
|
Pages: 856 - 863 |
|
doi>10.1145/1102351.1102459 |
|
Full text: Pdf
|
|
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents the first theoretical analysis of MBIE, proving its efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less "online" cousins from the literature. expand
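To give a feel for the optimism that MBIE formalises, here is a simplified sketch in which confidence-interval widths enter only as a reward bonus during value iteration; the algorithm analysed in the paper additionally optimises the transition model over an L1 confidence set, which this sketch omits, and all names below are illustrative.

    # Optimistic value iteration on an empirical model with a 1/sqrt(n) bonus.
    import numpy as np

    def optimistic_values(counts, reward_sums, gamma=0.95, beta=1.0, iters=200):
        """counts[s, a, s']: observed transition counts; reward_sums[s, a]: summed rewards."""
        S, A, _ = counts.shape
        n_sa = counts.sum(axis=2)
        # empirical transition model, with unvisited pairs treated as self-loops
        P = np.where(n_sa[..., None] > 0, counts / np.maximum(n_sa, 1)[..., None], 0.0)
        for s in range(S):
            for a in range(A):
                if n_sa[s, a] == 0:
                    P[s, a, s] = 1.0
        R = reward_sums / np.maximum(n_sa, 1)
        bonus = beta / np.sqrt(np.maximum(n_sa, 1))      # optimism for rarely tried pairs
        V = np.zeros(S)
        for _ in range(iters):
            Q = R + bonus + gamma * P @ V                # (S, A) optimistic Bellman backup
            V = Q.max(axis=1)
        return V, Q

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        S, A = 4, 2
        counts = rng.integers(0, 10, size=(S, A, S)).astype(float)
        reward_sums = rng.uniform(0, 5, size=(S, A))
        V, Q = optimistic_values(counts, reward_sums)
        print(V)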
|
|
|
Explanation-Augmented SVM: an approach to incorporating domain knowledge into SVM learning |
| |
Qiang Sun,
Gerald DeJong
|
|
Pages: 864 - 871 |
|
doi>10.1145/1102351.1102460 |
|
Full text: Pdf
|
|
We introduce a novel approach to incorporating domain knowledge into Support Vector Machines to improve their example efficiency. Domain knowledge is used in an Explanation Based Learning fashion to build justifications or explanations for why the training examples are assigned their given class labels. Explanations bias the large margin classifier through the interaction of training examples and domain knowledge. We develop a new learning algorithm for this Explanation-Augmented SVM (EA-SVM). It naturally extends to imperfect knowledge, a stumbling block to conventional EBL. Experimental results confirm desirable properties predicted by the analysis and demonstrate the approach on three domains. expand
|
|
|
Unifying the error-correcting and output-code AdaBoost within the margin framework |
| |
Yijun Sun,
Sinisa Todorovic,
Jian Li,
Dapeng Wu
|
|
Pages: 872 - 879 |
|
doi>10.1145/1102351.1102461 |
|
Full text: Pdf
|
|
In this paper, we present a new interpretation of AdaBoost.ECC and AdaBoost.OC. We show that AdaBoost.ECC performs stage-wise functional gradient descent on a cost function, defined in the domain of margin values, and that AdaBoost.OC is a shrinkage version of AdaBoost.ECC. These findings strictly explain some properties of the two algorithms. The gradient-minimization formulation of AdaBoost.ECC allows us to derive a new algorithm, referred to as AdaBoost.SECC, by explicitly exploiting shrinkage as regularization in AdaBoost.ECC. Experiments on diverse databases confirm our theoretical findings. Empirical results show that AdaBoost.SECC performs significantly better than AdaBoost.ECC and AdaBoost.OC. expand
|
|
|
Finite time bounds for sampling based fitted value iteration |
| |
Csaba Szepesvári,
Rémi Munos
|
|
Pages: 880 - 887 |
|
doi>10.1145/1102351.1102462 |
|
Full text: Pdf
|
|
In this paper we consider sampling based fitted value iteration for discounted, large (possibly infinite) state space, finite action Markovian Decision Problems where only a generative model of the transition probabilities and rewards is available. At each step the image of the current estimate of the optimal value function under a Monte-Carlo approximation to the Bellman-operator is projected onto some function space. PAC-style bounds on the weighted Lp-norm approximation error are obtained as a function of the covering number and the approximation power of the function space, the iteration number and the sample size. expand
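A minimal sketch of the procedure being analysed, under simplifying assumptions (a toy generative model, a linear function class, and plain least-squares projection; all names are illustrative rather than the paper's setup):

    # Sampled fitted value iteration: Monte-Carlo Bellman backups at a set of
    # base states, followed by projection onto a linear function class.
    import numpy as np

    def fitted_value_iteration(sample_states, actions, gen_model, features,
                               gamma=0.95, n_mc=20, iters=30, rng=None):
        rng = rng or np.random.default_rng(0)
        Phi = np.array([features(s) for s in sample_states])      # design matrix
        theta = np.zeros(Phi.shape[1])
        for _ in range(iters):
            targets = []
            for s in sample_states:
                q = []
                for a in actions:
                    # Monte-Carlo estimate of the Bellman backup for (s, a)
                    total = 0.0
                    for _ in range(n_mc):
                        s2, r = gen_model(s, a, rng)
                        total += r + gamma * features(s2) @ theta
                    q.append(total / n_mc)
                targets.append(max(q))
            # projection step: least-squares fit of the backed-up values
            theta, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
        return theta

    if __name__ == "__main__":
        # toy 1-d chain on [0, 1]: the action moves the state with noise,
        # and the reward is 1 near the right end
        def gen_model(s, a, rng):
            s2 = float(np.clip(s + 0.1 * a + 0.02 * rng.normal(), 0.0, 1.0))
            return s2, float(s2 > 0.9)
        feats = lambda s: np.array([1.0, s, s * s])
        states = list(np.linspace(0.0, 1.0, 21))
        print(fitted_value_iteration(states, [-1, 1], gen_model, feats))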
|
|
|
TD(λ) networks: temporal-difference networks with eligibility traces |
| |
Brian Tanner,
Richard S. Sutton
|
|
Pages: 888 - 895 |
|
doi>10.1145/1102351.1102463 |
|
Full text: Pdf
|
|
Temporal-difference (TD) networks have been introduced as a formalism for expressing and learning grounded world knowledge in a predictive form (Sutton & Tanner, 2005). Like conventional TD(0) methods, the learning algorithm for TD networks uses 1-step backups to train prediction units about future events. In conventional TD learning, the TD(λ) algorithm is often used to do more general multi-step backups of future predictions. In our work, we introduce a generalization of the 1-step TD network specification that is based on the TD(λ) learning algorithm, creating TD(λ) networks. We present experimental results that show TD(λ) networks can learn solutions in more complex environments than TD networks. We also show that in problems that can be solved by TD networks, TD(λ) networks generally learn solutions much faster than their 1-step counterparts. Finally, we present an analysis of our algorithm that shows that the computational cost of TD(λ) networks is only slightly more than that of TD networks. expand
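For readers unfamiliar with eligibility traces, the standard TD(λ) update for a single linear prediction is shown below; the paper generalises this style of multi-step backup to networks of interrelated predictions, which is not reproduced here.

    # Textbook TD(lambda) with an accumulating eligibility trace.
    import numpy as np

    def td_lambda(episodes, n_features, alpha=0.1, gamma=0.9, lam=0.8):
        """episodes: list of trajectories [(phi_t, r_{t+1}), ...] with feature
        vectors phi_t and the reward received on leaving that state."""
        w = np.zeros(n_features)
        for episode in episodes:
            e = np.zeros(n_features)                    # eligibility trace
            for t in range(len(episode) - 1):
                phi, r = episode[t]
                phi_next, _ = episode[t + 1]
                delta = r + gamma * (phi_next @ w) - (phi @ w)   # TD error
                e = gamma * lam * e + phi               # decay the trace, add current features
                w += alpha * delta * e                  # credit earlier states via the trace
        return w

    if __name__ == "__main__":
        # 3-state chain with terminal reward 1; tabular (one-hot) features
        phi = np.eye(3)
        episode = [(phi[0], 0.0), (phi[1], 0.0), (phi[2], 1.0), (np.zeros(3), 0.0)]
        print(td_lambda([episode] * 200, 3))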
|
|
|
Learning structured prediction models: a large margin approach |
| |
Ben Taskar,
Vassil Chatalbashev,
Daphne Koller,
Carlos Guestrin
|
|
Pages: 896 - 903 |
|
doi>10.1145/1102351.1102464 |
|
Full text: Pdf
|
|
We consider large margin estimation in a broad range of prediction models where inference involves solving combinatorial optimization problems, for example, weighted graph-cuts or matchings. Our goal is to learn parameters such that inference using the model reproduces correct answers on the training data. Our method relies on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured prediction models. Directly embedding this structure within the learning formulation produces concise convex problems for efficient estimation of very complex and diverse models. We describe experimental results on a matching task, disulfide connectivity prediction, showing significant improvements over state-of-the-art methods. expand
|
|
|
Learning discontinuities with products-of-sigmoids for switching between local models |
| |
Marc Toussaint,
Sethu Vijayakumar
|
|
Pages: 904 - 911 |
|
doi>10.1145/1102351.1102465 |
|
Full text: Pdf
|
|
Sensorimotor data from many interesting physical interactions comprises discontinuities. While existing locally weighted learning approaches aim at learning smooth functions, we propose a model that learns how to switch discontinuously between local models. The local responsibilities, usually represented by Gaussian kernels, are learned by a product of local sigmoidal classifiers that can represent complex shaped and sharply bounded regions. Local models are incrementally added. A locality prior constrains them to learn only local data---which is the key ingredient for incremental learning with local models. expand
|
|
|
Core Vector Regression for very large regression problems |
| |
Ivor W. Tsang,
James T. Kwok,
Kimo T. Lai
|
|
Pages: 912 - 919 |
|
doi>10.1145/1102351.1102466 |
|
Full text: Pdf
|
|
In this paper, we extend the recently proposed Core Vector Machine algorithm to the regression setting by generalizing the underlying minimum enclosing ball problem. The resultant Core Vector Regression (CVR) algorithm can be used with any linear/nonlinear kernels and can obtain provably approximately optimal solutions. Its asymptotic time complexity is linear in the number of training patterns m, while its space complexity is independent of m. Experiments show that CVR has comparable performance with SVR, but is much faster and produces much fewer support vectors on very large data sets. It is also successfully applied to large 3D point sets in computer graphics for the modeling of implicit surfaces. expand
|
|
|
Propagating distributions on a hypergraph by dual information regularization |
| |
Koji Tsuda
|
|
Pages: 920 - 927 |
|
doi>10.1145/1102351.1102467 |
|
Full text: Pdf
|
|
In the information regularization framework by Corduneanu and Jaakkola (2005), the distributions of labels are propagated on a hypergraph for semi-supervised learning. The learning is efficiently done by a Blahut-Arimoto-like two step algorithm, but, unfortunately, one of the steps cannot be solved in a closed form. In this paper, we propose a dual version of information regularization, which is considered as more natural in terms of information geometry. Our learning algorithm has two steps, each of which can be solved in a closed form. Also it can be naturally applied to exponential family distributions such as Gaussians. In experiments, our algorithm is applied to protein classification based on a metabolic network and known functional categories. expand
|
|
|
Hierarchical Dirichlet model for document classification |
| |
Sriharsha Veeramachaneni,
Diego Sona,
Paolo Avesani
|
|
Pages: 928 - 935 |
|
doi>10.1145/1102351.1102468 |
|
Full text: Pdf
|
|
The proliferation of text documents on the web as well as within institutions necessitates their convenient organization to enable efficient retrieval of information. Although text corpora are frequently organized into concept hierarchies or taxonomies, the classification of the documents into the hierarchy is expensive in terms of human effort. We present a novel and simple hierarchical Dirichlet generative model for text corpora and derive an efficient algorithm for the estimation of model parameters and the unsupervised classification of text documents into a given hierarchy. The class conditional feature means are assumed to be inter-related due to the hierarchical Bayesian structure of the model. We show that the algorithm provides robust estimates of the classification parameters by performing smoothing or regularization. We present experimental evidence on real web data that our algorithm achieves significant gains in accuracy over simpler models.
|
|
|
Implicit surface modelling as an eigenvalue problem |
| |
Christian Walder,
Olivier Chapelle,
Bernhard Schölkopf
|
|
Pages: 936 - 939 |
|
doi>10.1145/1102351.1102469 |
|
Full text: Pdf
|
|
We discuss the problem of fitting an implicit shape model to a set of points sampled from a co-dimension one manifold of arbitrary topology. The method solves a non-convex optimisation problem in the embedding function that defines the implicit by way of its zero level set. By assuming that the solution is a mixture of radial basis functions of varying widths we attain the globally optimal solution by way of an equivalent eigenvalue problem, without using or constructing as an intermediate step the normal vectors of the manifold at each data point. We demonstrate the system on two and three dimensional data, with examples of missing data interpolation and set operations on the resultant shapes. expand
|
|
|
New kernels for protein structural motif discovery and function classification |
| |
Chang Wang,
Stephen D. Scott
|
|
Pages: 940 - 947 |
|
doi>10.1145/1102351.1102470 |
|
Full text: Pdf
|
|
We present new, general-purpose kernels for protein structure analysis, and describe how to apply them to structural motif discovery and function classification. Experiments show that our new methods are faster than conventional techniques, are capable of finding structural motifs, and are very effective in function classification. In addition to strong cross-validation results, we found possible new oxidoreductases and cytochrome P450 reductases and a possible new structural motif in cytochrome P450 reductases. expand
|
|
|
Exploiting syntactic, semantic and lexical regularities in language modeling via directed Markov random fields |
| |
Shaojun Wang,
Shaomin Wang,
Russell Greiner,
Dale Schuurmans,
Li Cheng
|
|
Pages: 948 - 955 |
|
doi>10.1145/1102351.1102471 |
|
Full text: Pdf
|
|
We present a directed Markov random field (MRF) model that combines n-gram models, probabilistic context free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. Even though the composite directed MRF model potentially has an exponential number of loops and becomes a context sensitive grammar, we are nevertheless able to estimate its parameters in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models. We generalize various smoothing techniques to alleviate the sparseness of n-gram counts in cases where there are hidden variables. We also derive an analogous algorithm to calculate the probability of an initial subsequence of a sentence generated by the composite language model. Our experimental results on the Wall Street Journal corpus show that we obtain significant reductions in perplexity compared to the state-of-the-art baseline trigram model with Good-Turing and Kneser-Ney smoothing.
|
|
|
Bayesian sparse sampling for on-line reward optimization |
| |
Tao Wang,
Daniel Lizotte,
Michael Bowling,
Dale Schuurmans
|
|
Pages: 956 - 963 |
|
doi>10.1145/1102351.1102472 |
|
Full text: Pdf
|
|
We present an efficient "sparse sampling" technique for approximating Bayes optimal decision making in reinforcement learning, addressing the well known exploration versus exploitation tradeoff. Our approach combines sparse sampling with Bayesian exploration to achieve improved decision making while controlling computational cost. The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior---rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information). The outcome is a flexible, practical technique for improving action selection in simple reinforcement learning scenarios. expand
|
|
|
Learning predictive representations from a history |
| |
Eric Wiewiora
|
|
Pages: 964 - 971 |
|
doi>10.1145/1102351.1102473 |
|
Full text: Pdf
|
|
Predictive State Representations (PSRs) have shown a great deal of promise as an alternative to Markov models. However, learning a PSR from a single stream of data generated from an environment remains a challenge. In this work, we present a formalism of PSRs and the domains they model. This formalization suggests an algorithm for learning PSRs that will (almost surely) converge to a globally optimal model given sufficient training data. expand
|
|
|
Incomplete-data classification using logistic regression |
| |
David Williams,
Xuejun Liao,
Ya Xue,
Lawrence Carin
|
|
Pages: 972 - 979 |
|
doi>10.1145/1102351.1102474 |
|
Full text: Pdf
|
|
A logistic regression classification algorithm is developed for problems in which the feature vectors may be missing data (features). Single or multiple imputation for the missing data is avoided by performing analytic integration with an estimated conditional density function (conditioned on the non-missing data). Conditional density functions are estimated using a Gaussian mixture model (GMM), with parameter estimation performed using both expectation maximization (EM) and Variational Bayesian EM (VB-EM). Using widely available real data, we demonstrate the general advantage of the VB-EM GMM estimation for handling incomplete data, vis-à-vis the EM algorithm. Moreover, it is demonstrated that the approach proposed here is generally superior to standard imputation procedures. expand
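A simplified illustration of the underlying idea, with two deliberate departures from the paper: a single Gaussian stands in for the mixture model, and Monte-Carlo averaging stands in for the analytic integration. Function names and parameters are assumptions made for the sketch.

    # Average the logistic output over the conditional distribution of the
    # missing features given the observed ones, instead of imputing them.
    import numpy as np

    def predict_with_missing(x, w, b, mu, Sigma, n_mc=500, rng=None):
        rng = rng or np.random.default_rng(0)
        miss = np.isnan(x)
        if not miss.any():
            return 1.0 / (1.0 + np.exp(-(w @ x + b)))
        o, m = ~miss, miss
        # conditional Gaussian of the missing block given the observed block
        Soo, Smo = Sigma[np.ix_(o, o)], Sigma[np.ix_(m, o)]
        Smm, Som = Sigma[np.ix_(m, m)], Sigma[np.ix_(o, m)]
        K = Smo @ np.linalg.inv(Soo)
        cond_mu = mu[m] + K @ (x[o] - mu[o])
        cond_cov = Smm - K @ Som
        probs = []
        for _ in range(n_mc):
            xm = rng.multivariate_normal(cond_mu, cond_cov)
            xf = x.copy()
            xf[m] = xm
            probs.append(1.0 / (1.0 + np.exp(-(w @ xf + b))))
        return float(np.mean(probs))

    if __name__ == "__main__":
        mu = np.zeros(3)
        Sigma = np.array([[1.0, 0.6, 0.2], [0.6, 1.0, 0.3], [0.2, 0.3, 1.0]])
        w, b = np.array([1.5, -1.0, 0.5]), 0.0
        x = np.array([0.8, np.nan, -0.2])      # second feature unobserved
        print(predict_with_missing(x, w, b, mu, Sigma))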
|
|
|
Learning predictive state representations in dynamical systems without reset |
| |
Britton Wolfe,
Michael R. James,
Satinder Singh
|
|
Pages: 980 - 987 |
|
doi>10.1145/1102351.1102475 |
|
Full text: Pdf
|
|
Predictive state representations (PSRs) are a recently-developed way to model discrete-time, controlled dynamical systems. We present and describe two algorithms for learning a PSR model: a Monte Carlo algorithm and a temporal difference (TD) algorithm. Both of these algorithms can learn models for systems without requiring a reset action as was needed by the previously available general PSR-model learning algorithm. We present empirical results that compare our two algorithms and also compare their performance with that of existing algorithms, including an EM algorithm for learning POMDP models. expand
|
|
|
Linear Asymmetric Classifier for cascade detectors |
| |
Jianxin Wu,
Matthew D. Mullin,
James M. Rehg
|
|
Pages: 988 - 995 |
|
doi>10.1145/1102351.1102476 |
|
Full text: Pdf
|
|
The detection of faces in images is fundamentally a rare event detection problem. Cascade classifiers provide an efficient computational solution, by leveraging the asymmetry in the distribution of faces vs. non-faces. Training a cascade classifier in turn requires a solution for the following subproblems: Design a classifier for each node in the cascade with very high detection rate but only moderate false positive rate. While there are a few strategies in the literature for indirectly addressing this asymmetric node learning goal, none of them are based on a satisfactory theoretical framework. We present a mathematical characterization of the node-learning problem and describe an effective closed form approximation to the optimal solution, which we call the Linear Asymmetric Classifier (LAC). We first use AdaBoost or AsymBoost to select features, and use LAC to learn a linear discriminant function to achieve the node learning goal. Experimental results on face detection show that LAC can improve the detection performance in comparison to standard methods. We also show that Fisher Discriminant Analysis on the features selected by AdaBoost yields better performance than AdaBoost itself. expand
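The comparison point mentioned at the end of the abstract can be sketched directly: Fisher Discriminant Analysis applied to already-selected features gives a closed-form linear node classifier. The LAC solution itself is a related closed-form discriminant derived for the asymmetric node-learning goal and is not reproduced here; the code below shows only the standard FDA step on a toy asymmetric data set.

    # Fisher Discriminant Analysis on pre-selected features.
    import numpy as np

    def fisher_discriminant(X_pos, X_neg):
        """Linear discriminant w and threshold b from positive and negative
        training examples (rows are examples, columns are selected features)."""
        mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
        Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)   # within-class scatter
        w = np.linalg.solve(Sw, mu_p - mu_n)
        b = -0.5 * w @ (mu_p + mu_n)         # threshold midway between the projected means
        return w, b

    if __name__ == "__main__":
        rng = np.random.default_rng(4)
        X_pos = rng.normal(1.0, 1.0, size=(200, 5))
        X_neg = rng.normal(0.0, 1.0, size=(2000, 5))    # rare-event asymmetry: many negatives
        w, b = fisher_discriminant(X_pos, X_neg)
        scores = X_pos @ w + b
        print("detection rate at threshold 0:", float((scores > 0).mean()))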
|
|
|
Building Sparse Large Margin Classifiers |
| |
Mingrui Wu,
Bernhard Schölkopf,
Gökhan Bakir
|
|
Pages: 996 - 1003 |
|
doi>10.1145/1102351.1102477 |
|
Full text: Pdf
|
|
This paper presents an approach to build Sparse Large Margin Classifiers (SLMC) by adding one more constraint to the standard Support Vector Machine (SVM) training problem. The added constraint explicitly controls the sparseness of the classifier, and an approach is provided to solve the formulated problem. When considering the dual of this problem, it can be seen that building an SLMC is equivalent to constructing an SVM with a modified kernel function. Further analysis of this kernel function indicates that the proposed approach essentially finds a discriminating subspace that can be spanned by a small number of vectors, and in this subspace different classes of data are linearly well separated. Experimental results over several classification benchmarks show that in most cases the proposed approach outperforms the state-of-the-art sparse learning algorithms.
|
|
|
Dirichlet enhanced relational learning |
| |
Zhao Xu,
Volker Tresp,
Kai Yu,
Shipeng Yu,
Hans-Peter Kriegel
|
|
Pages: 1004 - 1011 |
|
doi>10.1145/1102351.1102478 |
|
Full text: Pdf
|
|
We apply nonparametric hierarchical Bayesian modelling to relational learning. In a hierarchical Bayesian approach, model parameters can be "personalized", i.e., owned by entities or relationships, and are coupled via a common prior distribution. Flexibility is added in a nonparametric hierarchical Bayesian approach, such that the learned knowledge can be truthfully represented. We apply our approach to a medical domain where we form a nonparametric hierarchical Bayesian model for relations involving hospitals, patients, procedures and diagnosis. The experiments show that the additional flexibility in a nonparametric hierarchical Bayes approach results in a more accurate model of the dependencies between procedures and diagnosis and gives significantly improved estimates of the probabilities of future procedures. expand
|
|
|
Learning Gaussian processes from multiple tasks |
| |
Kai Yu,
Volker Tresp,
Anton Schwaighofer
|
|
Pages: 1012 - 1019 |
|
doi>10.1145/1102351.1102479 |
|
Full text: Pdf
|
|
We consider the problem of multi-task learning, that is, learning multiple related functions. Our approach is based on a hierarchical Bayesian framework, that exploits the equivalence between parametric linear models and nonparametric Gaussian processes (GPs). The resulting models can be learned easily via an EM-algorithm. Empirical studies on multi-label text categorization suggest that the presented models allow accurate solutions of these multi-task problems. expand
|
|
|
Augmenting naive Bayes for ranking |
| |
Harry Zhang,
Liangxiao Jiang,
Jiang Su
|
|
Pages: 1020 - 1027 |
|
doi>10.1145/1102351.1102480 |
|
Full text: Pdf
|
|
Naive Bayes is an effective and efficient learning algorithm in classification. In many applications, however, an accurate ranking of instances based on the class probability is more desirable. Unfortunately, naive Bayes has been found to produce poor probability estimates. Numerous techniques have been proposed to extend naive Bayes for better classification accuracy, of which selective Bayesian classifiers (SBC) (Langley & Sage, 1994), tree-augmented naive Bayes (TAN) (Friedman et al., 1997), NBTree (Kohavi, 1996), boosted naive Bayes (Elkan, 1997), and AODE (Webb et al., 2005) achieve remarkable improvement over naive Bayes in terms of classification accuracy. An interesting question is: Do these techniques also produce accurate ranking? In this paper, we first conduct a systematic experimental study on their efficacy for ranking. Then, we propose a new approach to augmenting naive Bayes for generating accurate ranking, called hidden naive Bayes (HNB). In an HNB, a hidden parent is created for each attribute to represent the influences from all other attributes, and thus a more accurate ranking is expected. HNB inherits the structural simplicity of naive Bayes and can be easily learned without structure learning. Our experiments show that HNB outperforms naive Bayes, SBC, boosted naive Bayes, NBTree, and TAN significantly, and performs slightly better than AODE in ranking. expand
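A loose sketch of the hidden-parent idea on discrete data: each attribute's conditional is an average of its conditionals given every other attribute and the class, so pairwise influences can sharpen the class-probability estimates used for ranking. The uniform weights and the class and function names below are simplifying assumptions made for this sketch; the paper learns the hidden-parent weights (no structure search is needed in either case).

    # Tiny hidden-parent-style classifier with uniform weights and Laplace smoothing.
    import numpy as np
    from collections import defaultdict

    class TinyHNB:
        def fit(self, X, y, n_vals):
            self.classes = sorted(set(y))
            self.n_vals = n_vals
            n, d = X.shape
            self.prior = {c: (np.sum(y == c) + 1) / (n + len(self.classes)) for c in self.classes}
            # counts for P(x_i = v | x_j = u, c); missing keys default to 1 (add-1 smoothing)
            self.cond = defaultdict(lambda: 1.0)
            for t in range(n):
                for i in range(d):
                    for j in range(d):
                        if i != j:
                            self.cond[(i, X[t, i], j, X[t, j], y[t])] += 1.0
            return self

        def predict_proba(self, x):
            d = len(x)
            scores = {}
            for c in self.classes:
                logp = np.log(self.prior[c])
                for i in range(d):
                    # hidden parent: average of P(x_i | x_j, c) over the other attributes
                    p = np.mean([
                        self.cond[(i, x[i], j, x[j], c)] /
                        sum(self.cond[(i, v, j, x[j], c)] for v in range(self.n_vals))
                        for j in range(d) if j != i])
                    logp += np.log(p)
                scores[c] = logp
            z = np.exp(np.array([scores[c] for c in self.classes]))
            return dict(zip(self.classes, z / z.sum()))

    if __name__ == "__main__":
        X = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]])
        y = np.array([0, 0, 1, 1, 0, 1])
        model = TinyHNB().fit(X, y, n_vals=2)
        print(model.predict_proba(np.array([1, 0, 1])))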
|
|
|
A new Mallows distance based metric for comparing clusterings |
| |
Ding Zhou,
Jia Li,
Hongyuan Zha
|
|
Pages: 1028 - 1035 |
|
doi>10.1145/1102351.1102481 |
|
Full text: Pdf
|
|
Despite the large number of algorithms developed for clustering, the study of how to compare clustering results remains limited. In this paper, we propose a measure for comparing clustering results to tackle two issues insufficiently addressed or even overlooked by existing methods: (a) taking into account the distance between cluster representatives when assessing the similarity of clustering results; (b) constructing a unified framework for defining a distance based on either hard or soft clustering and ensuring the triangle inequality under the definition. Our measure is derived from a complete and globally optimal matching between clusters in two clustering results. It is shown that the distance is an instance of the Mallows distance---a metric between probability distributions in statistics. As a result, the defined distance inherits desirable properties from the Mallows distance. Experiments show that our clustering distance measure successfully handles cases difficult for other measures.
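A simplified sketch restricted to hard clusterings: represent each clustering by its cluster centroids weighted by the cluster proportions, and compute the optimal-transport (Mallows-type) cost between the two weighted sets; the measure proposed in the paper also covers soft clusterings and has further structure not shown here.

    # Transportation cost between two hard clusterings of the same data.
    import numpy as np
    from scipy.optimize import linprog

    def clustering_distance(X, labels_a, labels_b):
        def summarize(labels):
            ks = sorted(set(labels))
            cents = np.array([X[labels == k].mean(axis=0) for k in ks])
            weights = np.array([np.mean(labels == k) for k in ks])
            return cents, weights
        (Ca, pa), (Cb, pb) = summarize(labels_a), summarize(labels_b)
        na, nb = len(pa), len(pb)
        # squared-Euclidean cost between every pair of cluster representatives
        cost = ((Ca[:, None, :] - Cb[None, :, :]) ** 2).sum(-1).ravel()
        # transportation constraints: rows ship pa, columns receive pb
        A_eq, b_eq = [], []
        for i in range(na):
            row = np.zeros(na * nb)
            row[i * nb:(i + 1) * nb] = 1.0
            A_eq.append(row)
            b_eq.append(pa[i])
        for j in range(nb):
            col = np.zeros(na * nb)
            col[j::nb] = 1.0
            A_eq.append(col)
            b_eq.append(pb[j])
        res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=[(0, None)] * (na * nb))
        return float(res.fun)

    if __name__ == "__main__":
        rng = np.random.default_rng(3)
        X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
        a = np.array([0] * 20 + [1] * 20)
        b = np.array([0] * 18 + [1] * 22)       # a slightly perturbed clustering
        print(clustering_distance(X, a, b))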
|
|
|
Learning from labeled and unlabeled data on a directed graph |
| |
Dengyong Zhou,
Jiayuan Huang,
Bernhard Schölkopf
|
|
Pages: 1036 - 1043 |
|
doi>10.1145/1102351.1102482 |
|
Full text: Pdf
|
|
We propose a general framework for learning from labeled and unlabeled data on a directed graph in which the structure of the graph including the directionality of the edges is considered. The time complexity of the algorithm derived from this framework is nearly linear due to recently developed numerical techniques. In the absence of labeled instances, this framework can be utilized as a spectral clustering method for directed graphs, which generalizes the spectral clustering approach for undirected graphs. We have applied our framework to real-world web classification problems and obtained encouraging results. expand
|
|
|
2D Conditional Random Fields for Web information extraction |
| |
Jun Zhu,
Zaiqing Nie,
Ji-Rong Wen,
Bo Zhang,
Wei-Ying Ma
|
|
Pages: 1044 - 1051 |
|
doi>10.1145/1102351.1102483 |
|
Full text: Pdf
|
|
The Web contains an abundance of useful semistructured information about real world objects, and our empirical study shows that strong sequence characteristics exist for Web information about objects of the same type across different Web sites. Conditional Random Fields (CRFs) are state-of-the-art approaches that exploit these sequence characteristics to do better labeling. However, as the information on a Web page is two-dimensionally laid out, previous linear-chain CRFs have their limitations for Web information extraction. To better incorporate the two-dimensional neighborhood interactions, this paper presents a two-dimensional CRF model to automatically extract object information from the Web. We empirically compare the proposed model with existing linear-chain CRF models for product information extraction, and the results show the effectiveness of our model.
|
|
|
Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning |
| |
Xiaojin Zhu,
John Lafferty
|
|
Pages: 1052 - 1059 |
|
doi>10.1145/1102351.1102484 |
|
Full text: Pdf
|
|
Graph-based methods for semi-supervised learning have recently been shown to be promising for combining labeled and unlabeled data in classification problems. However, inference for graph-based methods often does not scale well to very large data sets, since it requires inversion of a large matrix or solution of a large linear program. Moreover, such approaches are inherently transductive, giving predictions for only those points in the unlabeled set, and not for an arbitrary test point. In this paper a new approach is presented that preserves the strengths of graph-based semi-supervised learning while overcoming the limitations of scalability and non-inductive inference, through a combination of generative mixture models and discriminative regularization using the graph Laplacian. Experimental results show that this approach preserves the accuracy of purely graph-based transductive methods when the data has "manifold structure," and at the same time achieves inductive learning with significantly reduced computational cost. expand
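For context, the purely graph-based harmonic solution that this work builds on can be written in a few lines: values on labeled nodes are held fixed, and values on unlabeled nodes are set so the function is harmonic on the graph (each value is the weighted average of its neighbours). The mixture-model component that supplies inductive predictions and scalability is the paper's contribution and is not shown.

    # Harmonic-function solution on a weighted graph.
    import numpy as np

    def harmonic_solution(W, labeled_idx, labels, unlabeled_idx):
        D = np.diag(W.sum(axis=1))
        L = D - W                                        # graph Laplacian
        Luu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
        Wul = W[np.ix_(unlabeled_idx, labeled_idx)]
        # f_u = (D_uu - W_uu)^{-1} W_ul f_l
        return np.linalg.solve(Luu, Wul @ labels)

    if __name__ == "__main__":
        # chain graph 0-1-2-3-4 with node 0 labeled 0 and node 4 labeled 1
        n = 5
        W = np.zeros((n, n))
        for i in range(n - 1):
            W[i, i + 1] = W[i + 1, i] = 1.0
        f_u = harmonic_solution(W, labeled_idx=[0, 4], labels=np.array([0.0, 1.0]),
                                unlabeled_idx=[1, 2, 3])
        print(f_u)       # values interpolate smoothly between 0 and 1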
|
|
|
Large margin non-linear embedding |
| |
Alexander Zien,
Joaquin Quiñonero Candela
|
|
Pages: 1060 - 1067 |
|
doi>10.1145/1102351.1102485 |
|
Full text: Pdf
|
|
It is common in classification methods to first place data in a vector space and then learn decision boundaries. We propose reversing that process: for fixed decision boundaries, we "learn" the location of the data. This way we (i) do not need a metric (or even stronger structure) - pairwise dissimilarities suffice; and additionally (ii) produce low-dimensional embeddings that can be analyzed visually. We achieve this by combining an entropy-based embedding method with an entropy-based version of semi-supervised logistic regression. We present results for clustering and semi-supervised classification. expand
|