LASSO stands for Least Absolute Shrinkage and Selection Operator. It was first introduced by Robert Tibshirani in 1996 (Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B). In 2004 four statistical masters (Efron, Hastie, Johnstone and Tibshirani) joined forces to write the paper Least angle regression, published in the Annals of Statistics. It is that paper that sent the LASSO to the podium. The reason? They removed a computational barrier. Armed with an ingenious new geometric interpretation, they presented an algorithm for solving the LASSO problem. The algorithm is as simple as solving an OLS problem, and with computer code accompanying their paper, the LASSO was set for its liftoff*.
Overall, the LASSO reduces model complexity. It does so by excluding some variables entirely, keeping only a subset of the original candidate explanatory variables. Since this can add to the story the model tells, the reduction in complexity is a desirable property. The clarity of the authors' exposition and the well-polished accompanying computer code are further reasons for the fully justified, full-fledged LASSO flare-up.
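To make the selection property concrete, here is a minimal sketch (toy data and penalty level are my own choices) showing that on a simple problem the LASSO shrinks the irrelevant coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only the first three variables matter
y = X @ beta + 0.1 * rng.standard_normal(n)

# on this toy data, the seven irrelevant coefficients come out exactly zero
fit = Lasso(alpha=0.1).fit(X, y)
```

Inspecting `fit.coef_` shows the reduced model: a handful of nonzero coefficients, the rest excluded outright.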
This is not a LASSO tutorial. Google-search results, undoubtedly refined over years of growing popularity, are clear enough by now. Also, if you are still reading this, I imagine you already know what the LASSO is and how it works. What follows is a selective list of milestones from the academic literature: some theoretical and some practical extensions.
A short comment about the term consistency in the context of the LASSO. The LASSO method simultaneously selects and estimates the covariates. The term consistency can then mean two things: (1) recovery of the real values of the coefficients, or (2) recovery of what is called the sparsity pattern: which variables should be included and which should be excluded. I like to call this the structure of the model. I wish all papers would clearly make this distinction between the two options, but they very rarely do. Unless stated otherwise, it is (2) that is most often discussed in this context, in contrast to other statistical theories where (1) takes center stage.
- Zou, Hui & Hastie, Trevor (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301-320.
- Wang, Sijian, Nan, Bin, Rosset, Saharon & Zhu, Ji (2011). Random lasso. The Annals of Applied Statistics, 5, 468.
- Tran, Minh-Ngoc, Nott, David J & Leng, Chenlei (2012). The predictive lasso. Statistics and Computing, 22, 1069-1084.
- Park, Trevor & Casella, George (2008). The Bayesian lasso. Journal of the American Statistical Association, 103, 681-686.
- Wang, Hansheng, Li, Guodong & Tsai, Chih-Ling (2007). Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69, 63-78.
- Yuan, Ming & Lin, Yi (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 49-67.
- Zou, Hui (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.
- Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji & Knight, Keith (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 91-108.
- Meinshausen, Nicolai (2007). Relaxed lasso. Computational Statistics & Data Analysis, 52, 374-393.
- Bach, Francis R. (2008). Bolasso: model consistent lasso estimation through the bootstrap. Proceedings of the 25th International Conference on Machine Learning, ACM.
- Witten, Daniela M. & Tibshirani, Robert (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105, 713-726.
In high cross-sectional dimension, for example when we have more names in the portfolio than we have data points (say 500 names with yearly data, or microarray data), the LASSO can only pick as many variables as the data points allow. If the number of potential explanatory variables is 100 but we only have 20 time points, then at most 20 variables can be chosen by the LASSO. This is simply an algebraic limitation of the LASSO solution. Apart from that, the LASSO has difficulty retrieving the real structure when it is given highly correlated regressors. To counter those two problems the authors propose adding a ridge penalty to the LASSO penalty; equation (3) in the published version. There are many advantages, not least better prediction accuracy. In my opinion, this is the most interesting extension on this list.
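The algebraic ceiling is easy to demonstrate. A minimal sketch (toy data; the penalty levels are my own choices) with 100 candidate variables but only 20 observations, comparing the LASSO, fitted here via the LARS path, against the elastic net:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LassoLars

rng = np.random.default_rng(1)
n, p = 20, 100                       # far more candidate variables than observations
X = rng.standard_normal((n, p))
y = X[:, :30].sum(axis=1) + 0.1 * rng.standard_normal(n)   # 30 variables truly matter

# the LASSO can keep at most n = 20 variables, even though 30 are relevant
lasso = LassoLars(alpha=0.05).fit(X, y)

# the elastic net's added ridge penalty removes that algebraic ceiling
enet = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=50_000).fit(X, y)
```

Counting the nonzero entries of `lasso.coef_` versus `enet.coef_` shows the difference in how many variables each method can carry.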
The LASSO is not without limitations. In the presence of highly correlated variables, the method tends to include all, or to exclude all, of those correlated variables. This is not wise from an information-theoretic perspective: if two variables are highly correlated we are better off using only one of them, since they have a lot of “overlap” in terms of information, and the LASSO solution does not follow this reasoning. To overcome this particular limitation, the authors propose a computationally intensive method, the random lasso, for variable selection in linear models. It is computationally heavy because it is a two-step procedure where both steps require bootstrapping. In step 1, the lasso is applied to many bootstrap samples, each using a set of randomly selected covariates; this step yields an importance measure for each covariate. In step 2, a similar procedure is implemented, except that for each bootstrap sample a subset of covariates is randomly selected with unequal selection probabilities determined by the covariates’ importance.
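The two-step procedure can be sketched in a few lines. This is my own minimal reading of it, not the authors' code; the defaults for `B` (bootstrap replications), `q1`, `q2` (subset sizes) and `alpha` are my choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

def random_lasso(X, y, B=30, q1=None, q2=None, alpha=0.1, seed=0):
    """Two-step random-lasso sketch; B, q1, q2 and alpha are illustrative defaults."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q1 = q1 or p // 2
    q2 = q2 or p // 2
    # step 1: lasso on bootstrap samples, each with a random subset of covariates
    coefs = np.zeros((B, p))
    for b in range(B):
        rows = rng.integers(0, n, size=n)              # bootstrap rows
        cols = rng.choice(p, size=q1, replace=False)   # random covariate subset
        fit = Lasso(alpha=alpha).fit(X[np.ix_(rows, cols)], y[rows])
        coefs[b, cols] = fit.coef_
    importance = np.abs(coefs.mean(axis=0)) + 1e-6     # epsilon keeps probabilities valid
    # step 2: repeat, but draw covariates with probability tied to their importance
    probs = importance / importance.sum()
    coefs = np.zeros((B, p))
    for b in range(B):
        rows = rng.integers(0, n, size=n)
        cols = rng.choice(p, size=q2, replace=False, p=probs)
        fit = Lasso(alpha=alpha).fit(X[np.ix_(rows, cols)], y[rows])
        coefs[b, cols] = fit.coef_
    return coefs.mean(axis=0)

# toy check: only the first two of ten covariates matter
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = 2 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(100)
est = random_lasso(X, y)
```

The final estimate averages the bootstrap coefficients, so relevant covariates keep sizable values while noise covariates are averaged toward zero.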
This is a fairly recent paper, but long overdue. It was published in Statistics and Computing, a fairly math-hungry journal, so the paper is not easy to follow. It is also not easy because everything is framed in terms of densities rather than point forecasts. However, the idea of the predictive LASSO (pLASSO) is straightforward: “The pLasso in this form differs from the original Lasso only in the way it replaces the observed responses by the predictions” (under their equation (9)). Which predictions do you place instead of the original Y? If I understand correctly, the predictions are made simply using the full model.
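Under that reading, the recipe is short enough to sketch. This is my own interpretation on toy data, with OLS on all covariates standing in for the “full model”:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n)

# the "full model" here is plain OLS on all eight covariates (my reading of the idea)
y_full = LinearRegression().fit(X, y).predict(X)

# an ordinary lasso, but fitted to the full-model predictions instead of y itself
plasso = Lasso(alpha=0.1).fit(X, y_full)
```

The sparse fit then targets what the full model would predict, rather than the noisy observed responses.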
Estimation of the beta parameters in a Bayesian fashion (using a Gibbs sampler). A nice by-product of the Bayesian approach is that you directly get confidence intervals (ok ok, Bayesian credible intervals). They also discuss an interesting extension, something they call the “Huberized LASSO” or robust LASSO, where instead of the usual squared loss function we can use a more outlier-resilient loss function like Least Absolute Deviation (LAD).
This is an extension for time series analysis. Another LASSO penalty is introduced in order to choose the number of lags in a regression model with autoregressive errors.
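The flavor of lag selection via an L1 penalty can be illustrated with a plain lasso on a lagged design matrix. This is a simplified sketch of the general idea, not the paper's estimator (which also shrinks the regression coefficients); the simulated AR(1) series and penalty level are my own choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
T, max_lag = 500, 6

# simulate an AR(1) series: only the first lag should survive the penalty
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.8 * y[t - 1] + rng.standard_normal()

# column k of the design holds lag k+1 of the series
X = np.column_stack([y[max_lag - 1 - k: T - 1 - k] for k in range(max_lag)])
target = y[max_lag:]

fit = Lasso(alpha=0.15).fit(X, target)
```

The penalty zeroes out the deep lags, so the selected autoregressive order falls out of the sparsity pattern.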
The title of this paper is apt. Some variables are naturally grouped (e.g. companies from the same industry). Perhaps we would like to exclude all the names of a particular industry, meaning we use single-name data but would like to decide on groups eventually. Instead of the same penalty term for all variables, we define a penalty term for each group separately.
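A minimal group-lasso solver fits in a few lines of numpy. This is my own proximal-gradient sketch of the penalty (without the per-group size weights the paper uses), not the authors' algorithm:

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=1000):
    """Proximal-gradient (ISTA) sketch minimizing
    0.5 * ||y - X b||^2 + lam * sum_g ||b_g||_2."""
    groups = np.asarray(groups)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y))   # gradient step on the squared loss
        for g in np.unique(groups):          # block soft-thresholding, group by group
            m = groups == g
            nrm = np.linalg.norm(z[m])
            z[m] = 0.0 if nrm <= step * lam else (1 - step * lam / nrm) * z[m]
        b = z
    return b

# toy check: two groups of three variables; the second group is pure noise
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 6))
y = X @ np.array([1.0, 2.0, -1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)
b = group_lasso(X, y, groups=[0, 0, 0, 1, 1, 1], lam=5.0)
```

The block soft-threshold either keeps a group or zeroes it as a whole, which is exactly the in-or-out-by-group behavior the paper is after.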
The paper derives a necessary condition for LASSO variable selection to be consistent; consequently, there exist scenarios where the LASSO is inconsistent for variable selection. The authors propose a new version, the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the LASSO penalty. This is an important paper; it is downloadable, well written and fairly simple to understand. The idea is to rescale the LASSO penalty so that each variable gets a different penalty according to its presumed importance, hence the word adaptive in the title: a high penalty for unimportant variables and a low penalty for important ones. Generally speaking, the importance is estimated using a simple OLS; a high value of beta signals high importance. Equation (4) is key.
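The rescaling trick means the adaptive lasso can be run with ordinary lasso software: absorb the weights into the design, fit, then undo the scaling. A minimal sketch on toy data (penalty level and weight choice `w_j = 1/|beta_ols_j|` follow the simplest variant):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(5)
n, p = 150, 6
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] + 1.5 * X[:, 1] + 0.5 * rng.standard_normal(n)

# step 1: a pilot OLS fit supplies the weights w_j = 1 / |beta_ols_j|
# (assumes no pilot coefficient is exactly zero, which holds for continuous data)
w = 1.0 / np.abs(LinearRegression().fit(X, y).coef_)

# step 2: absorb the weights into the design, run a plain LASSO, then undo the scaling
fit = Lasso(alpha=0.1).fit(X / w, y)
beta_adaptive = fit.coef_ / w
```

Columns with small pilot coefficients get shrunk hard (high effective penalty) while important columns are barely penalized.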
Very creative paper. The idea is to smooth the penalty term for variables which have a natural ordering. For a time-series example, you don’t want the coefficient of lag 3 to be zero while the coefficient of lag 4 is large (seasonality aside). Equation (3) in the paper adds another penalty term for the size of the difference between the coefficients of adjacent explanatory variables (so this is only relevant when you can naturally order the variables).
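The criterion itself is easy to write down. A sketch of my reading of it (not a solver), showing that the extra difference penalty makes a smooth coefficient vector cheaper than a jumpy one of the same L1 size:

```python
import numpy as np

def fused_lasso_objective(beta, X, y, lam1, lam2):
    """Fused-lasso criterion: squared loss, the usual L1 sparsity penalty,
    plus an L1 penalty on differences of adjacent (ordered) coefficients."""
    fit = 0.5 * np.sum((y - X @ beta) ** 2)
    sparsity = lam1 * np.sum(np.abs(beta))
    smoothness = lam2 * np.sum(np.abs(np.diff(beta)))  # jumps between neighbors
    return fit + sparsity + smoothness

# two coefficient vectors with the same L1 norm; only the smooth one is cheap
X, y = np.zeros((1, 4)), np.zeros(1)
smooth = np.array([1.0, 1.0, 1.0, 0.0])
jumpy = np.array([0.0, 3.0, 0.0, 0.0])
```

With `lam1 = lam2 = 1` both vectors pay the same sparsity penalty (3), but the jumpy one pays 6 for its differences versus 1 for the smooth one.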
The setting is very high cross-sectional dimension. This is not a very easy paper to understand, since the author speaks about the norm of the coefficients instead of directly about the penalty parameter like everyone else. But the idea itself is clear: in high cross-sectional dimension, where many explanatory variables are available but only a few are truly relevant, you can improve prediction using a higher penalty. A higher penalty better prevents the LASSO from including too many variables which are actually irrelevant. In that sense the title is not very intuitive. What we relax is the norm of the coefficients, but we eventually end up with a stiffer penalty, so a better title in my opinion would be “Stiffer LASSO” (or “Stiff as an old man on a cold winter’s night LASSO”).
Detailed asymptotic analysis of model consistency of the Lasso (see initial comment on consistency).
They use the LASSO for clustering. Again, this is most useful in high cross-sectional dimension settings.
* In 2008 there was another suggestion, by Kim, Jinseog, Kim, Yuwon & Kim, Yongdai (2008). A gradient-based optimization algorithm for lasso. Journal of Computational and Graphical Statistics, 17, 994-1009. But that suggestion did not take hold.
There is also a post concerning inference for the LASSO, hence not included in the list above: A significance test for the lasso.