The Power and Flexibility of Bayesian Variable Selection Methods

Naveen N. Narisetty, Department of Statistics, University of Illinois at Urbana-Champaign. 

High dimensional data are ubiquitous and are encountered on a regular basis by statistical scientists both in academia and in industry. A majority of the early research in statistics dealt with the settings where there are just a few covariates. With the advent of advanced genetic technology and cheap memory, the high dimensional data revolution has significantly occupied mainstream statistical research. While the statistical models with many parameters were familiar to statisticians in the name of nonparametric or semi-parametric models, the dimension considered was often still not larger than the sample size.

In gene expression datasets for instance, it is not uncommon to encounter datasets with just a few dozen observations but tens of thousands of genes if not more. This is also now very prevalent in business applications. An example is consumer behavior data. An important and common question that comes up pretty soon to the data analyst is –  “which of the available covariates are relevant for the outcome of interest?” This concerns the problem of variable selection (and more generally model selection) in Statistics. In a private conversation, Prof. Jim Berger, Duke University, called this the “evergreen problem” of Statistics as it is a fundamental and yet an ever challenging problem. The continuous research and the new insights regularly reported in all the reputed statistical journals stands as a testimony for this comment.

One of the most classical approaches to variable selection is best subset selection, where all possible combinations of the variables are considered and are evaluated. This is not feasible to be implemented even in moderately large data sets. As a compromise, stepwise algorithms such as the forward selection and backward elimination algorithms helped deal with slightly larger dimensional datasets. However, their performance was found to be not satisfactory in many common situations as they tend to get stuck in local solutions and not explore the model space efficiently.

The advent of LASSO due to Tibshirani (1996, JRSSB) changed the game of variable selection quite dramatically. By doing both variable selection and estimation, LASSO became a standard tool among users of large dimensional datasets. Further generalizations such as the adaptive LASSO, Elastic Net successfully followed the general strategy of penalization to provide more precise variable selection and estimation. A major appeal of these methods is that the objective functions involved for optimization are convex if the likelihood function is convex. This facilitates computations as their computations are quite scalable to large datasets and typically have computational complexity to be linear order in the number of parameters.

As we all know, there is no statistical method without any limitations. It was noted by several authors that the performance of convex penalty methods such as the LASSO are not well suited for variable selection especially when there are high correlations. In fact, the conditions under which LASSO is variable selection consistent are quite restrictive in terms of the possible correlations between the covariates. Nonconvex penalization methods such as SCAD (Fan and Li 2001, JASA) and MCP (Zhang 2010, Annals of Statistics) gave up the comfort of convex optimization in favor of better variable selection performance. While it is not possible to guarantee that the global minimizers of the SCAD and MCP objective functions be found, their better empirical and theoretical performance compared to LASSO, especially in settings with high correlation between covariates, proved them to be potential alternatives.

Bayesian approaches for variable selection have also been developed in parallel. Spike and slab priors are at the core of variable selection in the Bayesian framework. In their original form, a spike prior referred to the point mass prior at zero whereas a slab prior referred to a uniform prior quite pictorially. However, one disadvantage of a point mass spike prior is that it requires computational algorithms which can switch between models of different sizes such as the reversible jump MCMC algorithms or algorithms which are somewhat similar to stochastic versions of the stepwise algorithms. To avoid this, continuous versions of the spike priors are first proposed by George and McCulloch (1993, JASA) which facilitated computations using Gibbs sampling algorithms. In my thesis, I studied the theoretical properties associated with the Gaussian spike and slab priors and provided precise conditions under which the posterior distribution concentrates on the true model (for more details, see Narisetty and He 2014, Annals of Statistics).

While continuous spike and slab priors are nice theoretically and are easy to implement using standard software such as R, their computations have been quite slow as they involved large matrix operations such as matrix inverses. Recent work by Bhattacharya et al. (2015, Biometrika) has eased this computational burden to a large extent by intelligently avoiding large matrix operations using matrix identities. In our recent work, we further reduced the computational burden by exploring the sparsity structure in high dimensional settings, using a scalable Gibbs sample algorithm called Skinny Gibbs (see our recent paper Narisetty et al. (2018, JASA). In contrast with the standard Gibbs sampling algorithms used for Bayesian variable selection, Skinny Gibbs does not require large matrix operations and is much more scalable to high-dimensional problems both in memory and in computational efficiency. At the same time, it still retains to be model selection consistent under much weaker conditions than LASSO and works under higher correlation between the covariates. However, since Gibbs sampling algorithms attempt to explore the whole posterior distribution, the Skinny Gibbs is still not as LASSO.

One can argue that instead of attempting to compute the full posterior distribution, one may be satisfied with a posterior summary such as the posterior mode. In that case, the computations for Bayesian approaches for variable selection can also be done quite fast by using EM algorithms. This approach has gained popularity in recent years following the EMVS paper by Ročková and George (2014, JASA). One of my students Lingrui Gan (advised jointly with Feng Liang) has utilized this strategy successfully for graphical models (see an upcoming paper based on this work in JASA). Very interestingly, in many cases the optimization for computing the posterior mode turns out to be a convex problem (under some constraints) in spite of the highly non-convex shrinkage induced by the Bayesian priors. In that sense, the Bayesian regularization theoretically achieves variable selection consistency under much general situations compared to LASSO due to the non-convex nature of the Bayesian shrinkage, and yet the computations are as fast as LASSO.

There is an increasing attention on using the Bayesian framework for high dimensional problems due to the flexibility of the modeling paradigm and the recent advances, particularly in terms of computation, are quite promising. I have only mentioned a few such examples and of course there are a lot of other exciting developments in this direction which I will be glad to discuss about in a different note. I hope that the power and flexibility of Bayesian approaches demonstrated in some of the recent works I discussed will fuel more interest in using them in practice.

An important responsibility on the academic community is to effectively communicate the utility of the Bayesian framework for high dimensional data to practitioners and to teach the relevant methodological as well as computational techniques to our students. I wish that all the undergraduate students majoring in statistics get to learn at least some basics of Bayesian statistics, which I think is currently not the case in some programs. More importantly, our graduate students would greatly benefit from having an exposure to the Bayesian framework at least so that they will have a different perspective. In the graduate courses I teach, I make it a point to discuss the Bayesian methodology and especially its use for variable selection. This semester I am offering a special topics course at the University of Illinois which covers some advanced Bayesian modeling techniques that I believe will prepare our graduate students for careers in both academia and industry.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s