Actuarial Expertise

An Introduction to Lasso Regression for Actuaries

The linear and generalized linear models are standard statistical tools for actuaries. However, improvements over these standard tools have been developed that offer promising results. This article introduces actuaries to lasso regression, a relatively new tool that effectively controls overfitting in models with a large number of predictors. Lasso stands for “least absolute shrinkage and selection operator,” the name given in the original academic paper on the method. Lasso regression is first presented in the context of linear modeling and is later extended to the generalized linear case.

What are the components of a lasso regression?

The typical regression data set consists of $n$ observed response values $y_1, y_2, \ldots, y_n$ and $p$ predictors $x_j$, $j = 1, \ldots, p$, arranged in an $n \times p$ matrix. The goal is to estimate the coefficients $\beta_j$ of the $p$ predictors by minimizing the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij}\,\beta_j \right)^2 .$$

The intercept term has been omitted without loss of generality. Also assume the predictors have been centered by their means and scaled by their standard deviations (standardized).
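
As a small concrete illustration, the following R sketch (with simulated data and an arbitrary candidate coefficient vector, both invented purely for illustration) standardizes the predictors, centers the response and evaluates the RSS:

# A minimal sketch: standardize simulated predictors, center the response and
# evaluate the RSS for a candidate coefficient vector. All quantities here are
# invented for illustration.
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)     # n x p predictor matrix
y <- drop(X %*% c(2, -1, 0)) + rnorm(n)           # simulated response

X_std <- scale(X)       # center each column by its mean, scale by its sd
y_ctr <- y - mean(y)    # centering y lets us omit the intercept

rss <- function(beta, X, y) sum((y - drop(X %*% beta))^2)
rss(c(2, -1, 0), X_std, y_ctr)   # RSS at one candidate coefficient vector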

In lasso regression, the coefficients are estimated by minimizing:

$$\mathrm{RSS} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t^2 .$$

The term $\sum_{j=1}^{p} |\beta_j| \le t^2$ is a “budget” term. In the statistical literature, the terms “constraint” and “penalty” are also used to describe the budget term. The terms constraint, penalty and budget will be used interchangeably in this article.

The reader may be familiar with ridge regression, an established method in which the budget term takes the form $\sum_{j=1}^{p} \beta_j^2 \le t^2$. Ridge regression is often used to obtain stable estimates of coefficients in the presence of highly correlated predictors (multicollinearity). The lasso budget replaces the squared coefficients with the absolute values of the coefficients.

At first glance, it is not at all clear why adding the lasso budget should change anything in the regression, but it turns out to have a profound effect.

Before discussing this effect, one might ask why a budget term is needed. There are two major reasons: (1) to combat overfitting and (2) to obtain a parsimonious, and thus more interpretable, model. To achieve these two goals, we can either attempt to eliminate predictors through some form of variable selection, or we can reduce the predictors’ effects by controlling the sizes of their coefficients.

How can we control the coefficients’ sizes?

We can put a budget of $t^2$ on the total size of the coefficients. This budgeting controls the coefficients’ sizes by not allowing them to get too big. In fact, lasso and ridge regression are sometimes referred to as shrinkage methods because these methods shrink the coefficients’ sizes.
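
To see the shrinkage in action, one can trace the entire path of coefficient estimates as the budget grows. Below is a minimal R sketch using the glmnet package (discussed later in this article); the data are simulated purely for illustration, and glmnet's alpha = 1 setting corresponds to the lasso:

# A minimal sketch of the lasso coefficient path using the glmnet package.
# Simulated data; only the first three predictors truly matter.
library(glmnet)
set.seed(2)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))
y <- drop(X %*% beta_true) + rnorm(n)

fit <- glmnet(X, y, alpha = 1)          # alpha = 1 requests the lasso penalty
plot(fit, xvar = "norm", label = TRUE)  # coefficient profiles versus the L1 norm,
                                        # i.e., the "budget" spent on the coefficients

Reading the plot from left to right, a growing budget lets coefficients enter the model one by one; with a small budget, most of them are exactly zero.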


The most profound effect of the lasso budget is that lasso regression can shrink the sizes of the coefficients all the way to zero, effectively eliminating predictors and performing automatic variable selection. Reducing the number of predictors reduces the chances of overfitting. Furthermore, a smaller and more interpretable model is obtained. In view of this important property of lasso regression, the moniker “lasso” is an apt one: A lasso is a rope used by cowboys to snare cattle from a herd. Similarly, lasso regression is a method a statistician uses to pull variables from a larger group of variables.
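
The contrast with ridge regression is easy to demonstrate in R. In the sketch below (simulated data; the penalty value s = 0.5 passed to coef() is arbitrary), the lasso fit sets several coefficients exactly to zero, while the ridge fit only shrinks them toward zero:

# A sketch contrasting lasso (exact zeros) with ridge (shrinkage only) on the
# same simulated data. The penalty value s = 0.5 is arbitrary.
library(glmnet)
set.seed(3)
n <- 200; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(2, -1.5, 1, rep(0, p - 3))) + rnorm(n)

lasso_fit <- glmnet(X, y, alpha = 1)   # budget on the sum of |beta_j|
ridge_fit <- glmnet(X, y, alpha = 0)   # budget on the sum of beta_j^2

coef(lasso_fit, s = 0.5)   # several coefficients are exactly zero
coef(ridge_fit, s = 0.5)   # coefficients are small but not exactly zero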

The impact of the budgets imposed by lasso and ridge regression is shown in Figure 1 for $p = 2$ (two predictors). On the right, the small open point represents the unconstrained solution that minimizes the RSS. When the ridge budget $\beta_1^2 + \beta_2^2 \le t^2$ (a circle of radius $t$) is imposed, we have to move away from the open point until the first contour of the RSS intersects the constraint region. This is indicated by a solid black point. Something interesting happens in the lasso case on the left. The lasso constraint $|\beta_1| + |\beta_2| \le t^2$ (a rhombus) has corners. When the first contour of the RSS intersects the lasso constraint region, it does so at a corner where $\beta_1 = 0$, and thus predictor $x_1$ is eliminated. The argument also applies when $p \ge 3$: the lasso constraint region will have sharp corners and edges, which increases the chances of eliminating variables.

Why would one prefer lasso over well-established variable selection methods that can combat overfitting and produce a parsimonious model?

In high-dimensional situations (a large number of predictors), lasso regression offers substantial computational advantages over many existing variable selection methods. Although lasso was first proposed in the mid-1990s, these computational advantages were not realized until a new implementation of lasso took off in 2008. The new implementation uses a fast and efficient coordinate descent algorithm, an optimization algorithm popular in the machine-learning community, to estimate the lasso regression coefficients. The $t^2$ value is chosen by cross validation as described next.
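
To give a flavor of the algorithm, here is a bare-bones coordinate descent sketch for the lasso written in R. It is a teaching illustration, not glmnet's implementation; it uses the equivalent penalty (Lagrangian) form with a tuning parameter lambda rather than the budget $t^2$, and it assumes the predictors are standardized and the response is centered:

# A bare-bones coordinate descent sketch for the lasso (teaching illustration,
# not glmnet's implementation). It minimizes
#   (1/(2n)) * RSS + lambda * sum(|beta_j|),
# the penalty form that is equivalent to the budget form.
soft_threshold <- function(z, lam) sign(z) * pmax(abs(z) - lam, 0)

lasso_cd <- function(X, y, lambda, n_iter = 100) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  for (it in 1:n_iter) {
    for (j in 1:p) {
      # partial residual: remove the fit from every predictor except the j-th
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(X[, j] * r_j) / n
      # update one coordinate at a time via soft-thresholding
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

On standardized data, this sketch should track glmnet's lasso solution at the same value of lambda, apart from small differences due to glmnet's internal standardization conventions and convergence thresholds.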

In cross validation, the data are randomly divided into $G > 1$ groups; common choices for $G$ are 5 and 10. One group is set aside as the validation group, and the data in the remaining $G-1$ groups are used to fit the lasso model across a range of $t^2$ values. Each fitted model, with its own $t^2$ value, is then used to predict the response values in the validation group, and the prediction accuracy, typically measured by mean squared error, is recorded for each $t^2$ value. This process is repeated $G$ times, with each group serving once as the validation group and the remaining groups used for fitting and predicting. The result is $G$ prediction accuracy measures for each $t^2$ value, which are averaged to give one mean prediction accuracy for each $t^2$ value. Finally, the $t^2$ value with the best mean prediction accuracy is chosen as the model $t^2$ value. This process may seem quite time-consuming, but the new computational implementation of lasso performs well.
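
With glmnet, this entire procedure is handled by cv.glmnet. Note that glmnet expresses the budget through an equivalent penalty parameter, lambda, and cross-validates over a grid of lambda values; the sketch below uses simulated data purely for illustration:

# A minimal sketch of 10-fold cross validation with cv.glmnet.
library(glmnet)
set.seed(4)
n <- 500; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(rep(1, 5), rep(0, p - 5))) + rnorm(n)

cv_fit <- cv.glmnet(X, y, nfolds = 10)   # fits the lasso path within each fold
plot(cv_fit)                             # mean-squared error versus the penalty
cv_fit$lambda.min                        # penalty with the best mean CV error
coef(cv_fit, s = "lambda.min")           # coefficients at that penalty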

Just as a quick example, it took about 31 minutes to perform a lasso regression on a Dell laptop with 8 GB of RAM and a 2.2 GHz processor. The run used R’s glmnet package (created by the foremost researchers in lasso regression) on simulated data with n = 500,000 and p = 500, of which 100 were noise variables. The data file was about 4.4 GB in CSV format. The software’s default settings were used, including a 10-fold cross validation to determine the best value of $t^2$. The lasso regression correctly eliminated all 100 noise variables.
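
Readers who want to try something similar at a much smaller scale can run a sketch like the one below (the dimensions are deliberately modest so it finishes in seconds; timings and the exact number of eliminated noise variables will differ from the example above):

# A much smaller version of the experiment described above, sized to run in
# seconds on an ordinary laptop. Results vary by machine and random seed.
library(glmnet)
set.seed(5)
n <- 10000; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, 10), rep(0, p - 10))   # the last 40 predictors are noise
y <- drop(X %*% beta_true) + rnorm(n)

timing <- system.time(cv_fit <- cv.glmnet(X, y, nfolds = 10))
timing["elapsed"]                            # wall-clock seconds

b_hat <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, 1]  # drop the intercept
sum(b_hat[11:p] == 0)                        # noise coefficients set exactly to zero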

The lasso budget can be applied in many situations with similar effects. One important application is to generalized linear models. Here the minimization is:

$$\text{negative log-likelihood} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t^2 ,$$

where the $\beta_j$ are now the parameters of the generalized linear model.
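
In R, the same glmnet call handles the generalized linear case through its family argument. The sketch below fits a lasso-penalized Poisson regression to simulated claim counts; the data and rate structure are invented purely for illustration:

# A sketch of a lasso-penalized Poisson GLM for simulated claim counts, fitted
# with glmnet's family argument.
library(glmnet)
set.seed(6)
n <- 2000; p <- 15
X <- matrix(rnorm(n * p), n, p)
eta <- -1 + drop(X %*% c(0.5, -0.3, rep(0, p - 2)))  # log of the Poisson mean
claims <- rpois(n, exp(eta))                          # simulated claim counts

cv_pois <- cv.glmnet(X, claims, family = "poisson", nfolds = 10)
coef(cv_pois, s = "lambda.min")   # penalized Poisson coefficients (log scale)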

An important variant of the lasso budget, also used in generalized linear modeling, is the elastic net, which is a weighted average of the lasso and ridge budgets:

$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \le t^2 , \qquad 0 \le \alpha \le 1 .$$

Note that when $\alpha = 1$, we are back to the lasso, and when $\alpha = 0$, we get the ridge. Like the $t^2$ value, $\alpha$ is generally determined by cross validation. The elastic net budget is the recommended approach when dealing with many correlated predictors.
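
In glmnet, the mixing weight is the alpha argument. cv.glmnet tunes the penalty strength for a fixed alpha, so alpha itself is usually chosen by looping over a small grid and comparing the cross-validated errors, as in the sketch below (simulated correlated predictors; the alpha grid is arbitrary):

# A sketch of choosing the elastic net mixing weight alpha by cross validation.
# The same fold assignment is reused across alpha values for a fair comparison.
library(glmnet)
set.seed(7)
n <- 500; p <- 20
Z <- rnorm(n)
X <- sapply(1:p, function(j) Z + rnorm(n))   # predictors share a common factor
y <- drop(X %*% c(rep(1, 4), rep(0, p - 4))) + rnorm(n)

foldid <- sample(rep(1:10, length.out = n))  # common fold assignment
alphas <- c(0, 0.25, 0.5, 0.75, 1)
cv_err <- sapply(alphas, function(a)
  min(cv.glmnet(X, y, alpha = a, foldid = foldid)$cvm))  # best mean CV error per alpha
alphas[which.min(cv_err)]                    # chosen mixing weight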

What are the disadvantages to using lasso?

There are no closed-form solutions for the coefficients in lasso regression; they must be computed numerically. Also, lasso regression tends to produce biased estimates of the coefficients. However, this bias is countered by a reduction in the variance of the coefficient estimates.

Where can you go from here if you need to learn more about lasso?

An excellent, clear description of the method (without suffocating equations!) is found in Chapter 6 of An Introduction to Statistical Learning with Applications in R, published in 2013, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Tibshirani (the original creator of lasso) and Hastie are the leading researchers in lasso regression. The book can be downloaded for free from http://www-bcf.usc.edu/~gareth/ISL/.

Those seeking more details and mathematics should download the 2015 book, Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani and Martin Wainwright from http://web.stanford.edu/~hastie/StatLearnSparsity/. This book is a tour de force of lasso regression.

When I first got the idea to write this article, I had planned on submitting an R tutorial on lasso regression using R’s glmnet package. However, I discovered a great tutorial maintained by Trevor Hastie and Junyang Qian at http://web.stanford.edu/~hastie/glmnet/glmnet_beta.html. This comprehensive tutorial shows how to do linear, logistic, multinomial, Poisson, multivariate and Cox hazard lasso regressions using the glmnet package.

For a specific example of a lasso regression using actuarial data and the glmnet package, see pages 189 to 193 of Computational Actuarial Science with R (2014), edited by Arthur Charpentier.


Kam Hamidieh, Ph.D., is a lecturer in the department of statistics and Jones Business School at Rice University. He can be reached at kh1@rice.edu.