NIPS 2005 Workshop on the

Accuracy-Regularization Frontier

Friday, December 9th, 2005
Westin Resort and Spa, Whistler, BC, Canada

Abstracts of Contributed Presentations


Grouped and Hierarchical Model Selection through Composite Absolute Penalties
Peng Zhao, Guilherme V. Rocha, and Bin Yu
Recently, much attention has been devoted to model selection through regularization methods in regression and classification, where the penalty is capable of selecting some of the regressors (e.g. the Lasso of Tibshirani, 1996). While the resulting sparsity leads to more interpretable models, one may want to further incorporate natural groupings or a particular hierarchical structure present within the predictors. The proposed Composite Absolute Penalties (CAP) method allows both grouping and hierarchical structures to be enforced on the estimated coefficients for the predictors, and the groups may overlap. Given k pre-defined groups of parameters, it uses L_{\gamma_i} norms, i = 1, ..., k, as penalties within each group and puts an overall L_{\gamma_0} penalty across groups. By properly choosing the \gamma_i, it is possible to obtain sparse model estimates with grouped and hierarchical structures. Among many interesting choices, one special case makes selection operate across the groups as in the Lasso while, within a group, all coefficients are either zero or nonzero together, which enforces the natural grouping. Furthermore, by allowing the groups to overlap, this property also provides a mechanism for expressing hierarchical relationships between the fitted coefficients. When CAP with L_1 and L_\infty norms is used for least squares regression, the exact regularization frontier can be computed efficiently by exploiting its piecewise linearity (similarly to LARS; Efron et al., 2004). For general cases, the BLasso algorithm of Zhao & Yu (2004) is used. Simulations illustrate the method in terms of sparsity and prediction performance relative to the Lasso.
[PDF]
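
As a concrete illustration of the penalty family described above, the CAP value of a coefficient vector can be sketched as follows (the group indices, \gamma values, and function names are illustrative, not the authors' code):

```python
import numpy as np

def cap_penalty(beta, groups, gammas, gamma0):
    """Composite Absolute Penalty: an L_{gamma0} norm across groups of
    within-group L_{gamma_i} norms.  `groups` is a list of index arrays
    (groups may overlap); `gammas[i]` is the within-group norm order."""
    group_norms = np.array([
        np.linalg.norm(beta[idx], ord=g) for idx, g in zip(groups, gammas)
    ])
    return np.linalg.norm(group_norms, ord=gamma0)

# With gamma_i = infinity within groups and gamma0 = 1 across groups,
# selection acts group-wise, as in the special case described above:
# the first group is entirely zero and contributes nothing to the penalty.
beta = np.array([0.0, 0.0, 2.0, -1.0])
penalty = cap_penalty(beta,
                      groups=[np.array([0, 1]), np.array([2, 3])],
                      gammas=[np.inf, np.inf], gamma0=1)
```

Here `penalty` is the sum of the within-group max-absolute values, so only groups with at least one nonzero coefficient are charged.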

A Convex Approach to Learning the Ridge based on CV
K. Pelckmans, J.A.K. Suykens, B. De Moor
This paper advances results in model selection by recasting the task of optimally tuning the regularization parameter in a number of algorithms, with respect to the classical cross-validation performance criterion, as a convex optimization problem. The proposed strategy differs in scope from e.g. generalized cross-validation (GCV): it concerns the efficient optimization, not merely the individual evaluation, of the model selection criterion.
[PDF]
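
The criterion being optimized here, cross-validation error as a function of the ridge parameter, can be sketched for the leave-one-out case using the standard hat-matrix shortcut (a generic illustration of the criterion, not the paper's convex reformulation; names are illustrative):

```python
import numpy as np

def loo_cv_error(X, y, lam):
    """Exact leave-one-out CV error of ridge regression, via the standard
    identity e_i = r_i / (1 - H_ii) with H = X (X'X + lam I)^{-1} X'."""
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return np.mean(loo_residuals ** 2)
```

Classically one would evaluate this criterion on a grid of `lam` values and pick the minimizer; the paper's point is to avoid that grid search by convex optimization.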

Basis Pursuit Learning and Multi-Objective Optimization
Martin Brown, Nick Costen, Georgios Papadopoulos

Computation of the entire regularization path for SVM in practice
Alexandre Belloni and Katya Scheinberg
We will discuss an implementation of a parametric active-set method which computes the entire regularization path for SVMs. The method is similar to that described in the paper by Hastie, Rosset, Tibshirani and Zhu. We will compare this method to a dual active-set method which computes just one solution on the regularization path. In theory, this is as hard as computing the entire regularization path, but in practice it is not. We will describe the challenges of the parametric active-set method, present a computational comparison of the two methods on large-scale classification problems, and discuss a possible approach to reducing the computational time by computing an approximate regularization path.

Short Presentations

Considering cost asymmetry in learning SVMs
Francis Bach, David Heckerman, Eric Horvitz
Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding the estimated slope constant, and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. In this paper, we consider the alternative of varying the asymmetry of the cost function used for training. We present a path-following algorithm for the support vector machine (SVM) that efficiently computes the entire ROC curve and has the same computational cost as training a single classifier.
[PDF]
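
The training criterion whose asymmetry is varied, a hinge loss with separate costs for the two classes, can be sketched as follows (a generic cost-asymmetric SVM objective; the names and regularization weighting are illustrative, not the authors' path algorithm):

```python
import numpy as np

def asymmetric_svm_objective(w, b, X, y, C_pos, C_neg, lam=1.0):
    """Cost-asymmetric linear SVM objective: errors on positive examples
    are weighted by C_pos, errors on negatives by C_neg.  y is in {-1, +1}.
    Sweeping the ratio C_pos / C_neg traces out different ROC operating
    points, rather than merely shifting the intercept b."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    costs = np.where(y > 0, C_pos, C_neg)
    return lam * 0.5 * (w @ w) + np.sum(costs * hinge)

# At w = 0, b = 0 every point has margin 0 and hinge loss 1, so the
# objective is just the total class-weighted cost.
obj = asymmetric_svm_objective(np.array([0.0]), 0.0,
                               np.array([[1.0], [-1.0]]),
                               np.array([1.0, -1.0]),
                               C_pos=2.0, C_neg=1.0)
```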

Exploring the regularization path for adaptive Gaussian kernel SVMs
Roland Memisevic, Nathan Srebro, Sam Roweis

Training-set methods for choosing a regularization/accuracy tradeoff
Saharon Rosset, Nathan Srebro
The standard, and often most appropriate, approach to choosing a regularization/accuracy tradeoff (i.e., regularization parameter) is to evaluate models along the frontier on held-out data, using validation sets or cross-validation. Nevertheless, it is interesting to consider methods that use only the regularization/accuracy frontier as calculated on the training set, without access to held-out data. The main question we ask is, can a good model be chosen from along the frontier based only on the frontier itself? This may be practically useful when data is scarce and/or computational complexity is an issue. It may also lead to interesting insights about the frontier and its properties: What is a good measure of the true complexity of models on the frontier? What are the "correct" parameterizations we should use for the regularization and accuracy measures when visualizing the frontier? Etc.

The Kernel LARS algorithm
Stéphane Canu, Vincent Guigue, Alain Rakotomamonjy and Gilles Gasso

Computing regularization paths for learning multiple kernels
Francis R. Bach, Romain Thibaux, Michael I. Jordan
The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be addressed via regularization by a block 1-norm (Bach et al., 2004). In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained using numerical continuation techniques, with a running time that is a constant factor times the cost of solving the problem for a single value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case.
[PDF]
http://cmm.ensmp.fr/~bach/path/
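
The block 1-norm itself, the sum of unsquared Euclidean norms of the per-kernel coefficient blocks, can be sketched as follows (names are illustrative):

```python
import numpy as np

def block_l1_norm(w_blocks):
    """Block 1-norm: sum of (non-squared) Euclidean norms of per-kernel
    weight blocks.  Unlike a plain 1-norm on all coordinates, it drives
    entire blocks to zero, yielding a sparse combination of kernels."""
    return sum(np.linalg.norm(wj) for wj in w_blocks)

# The all-zero middle block contributes nothing: only two kernels are
# "selected", and the penalty is ||(3,4)||_2 + 0 + ||(0,0,5)||_2.
blocks = [np.array([3.0, 4.0]), np.zeros(3), np.array([0.0, 0.0, 5.0])]
val = block_l1_norm(blocks)
```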

The support vector decomposition machine
Francisco Pereira and Geoff Gordon
In machine learning problems with tens of thousands of features and only dozens or hundreds of independent training examples, dimensionality reduction is essential for good learning performance. In previous work, many researchers have treated the learning problem in two separate phases: first use an algorithm such as singular value decomposition to reduce the dimensionality of the data set, and then use a classification algorithm such as naïve Bayes or support vector machines to learn a classifier. We demonstrate that it is possible to combine the two goals of dimensionality reduction and classification into a single learning objective, and present a novel and efficient algorithm which optimizes this objective directly. We present experimental results in two domains, fMRI analysis and cell image processing, which show that we can achieve better learning performance and lower-dimensional representations than two-phase approaches can.
[PDF]
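
A combined objective of this flavor, a low-rank data fit plus a classification loss on the shared low-dimensional representation, can be sketched as follows (a minimal sketch under assumed squared-reconstruction and hinge losses; the names, weighting `mu`, and exact losses are illustrative assumptions, not the authors' formulation):

```python
import numpy as np

def combined_objective(U, V, w, b, X, y, mu=1.0):
    """Joint dimensionality-reduction + classification objective (sketch):
    factor X ~ U V as in an SVD, while an SVM-like hinge loss classifies
    the rows of the reduced representation U.  y is in {-1, +1}."""
    reconstruction = np.sum((X - U @ V) ** 2)       # SVD-like data fit
    margins = y * (U @ w + b)                       # classify in reduced space
    hinge = np.sum(np.maximum(0.0, 1.0 - margins))  # SVM-like loss
    return reconstruction + mu * hinge

# Tiny example: a rank-1 factorization of a 2x2 identity plus a linear
# classifier on the 1-dimensional representation.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
U = np.array([[1.0], [0.0]])
V = np.array([[1.0, 0.0]])
obj = combined_objective(U, V, np.array([1.0]), 0.0, X, np.array([1.0, -1.0]))
```

Optimizing such an objective jointly over (U, V, w, b), rather than fixing U and V from a separate SVD phase, is the design choice the abstract argues for.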


Last modified: Tue Nov 29 14:30:51 Eastern Standard Time 2005