COS 33-6
Stochastic variable selection methods for ecological data
Increasingly, ecologists are analyzing multivariate datasets in which a suitable subset of covariates must be selected from many candidate variables. In the ecological literature, the typical approaches to this problem are stepwise regression and model comparison using information criteria. However, stepwise regression is prone to biased parameter estimates and reproducibility issues, while comparison by information criteria becomes impractical when the number of candidate models is large. Stochastic variable selection (SVS) methods, which rely on posterior probabilities computed from Markov chain Monte Carlo (MCMC) samples, have gained popularity in the statistical literature. Our objective was to compare the predictive performance of several SVS methods on four ecological datasets. We considered four methods: ‘spike and slab’ variable selection (SSVS), Gibbs variable selection (GVS), Bayesian lasso regression (BLR), and ‘reversible jump’ MCMC (RJMCMC). We also selected variables with the deviance information criterion (DIC) to provide a comparison with a traditional method. For each dataset, we held out 25% of the observations for validation and applied each routine to the remaining data. We then applied model averaging using Occam’s window to predict values of the response variable for the validation data, and compared performance by computing the root mean squared prediction error (RMSPE) for each variable selection method.
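The validation workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the routine used in the study: the function names, the random seed, and the Occam's window cutoff (`ratio=20`) are hypothetical, and only the first Occam's window rule (dropping models much less probable than the best one) is shown.

```python
import numpy as np

def holdout_split(n, frac=0.25, seed=0):
    """Return (train_idx, val_idx) with about `frac` of rows held out.
    The seed and the use of a simple random split are assumptions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(round(frac * n))
    return idx[n_val:], idx[:n_val]

def occams_window_weights(post_probs, ratio=20.0):
    """Keep models whose posterior probability is at least max/ratio,
    then renormalize. `ratio` is a hypothetical cutoff, not taken from
    the abstract."""
    p = np.asarray(post_probs, dtype=float)
    keep = p >= p.max() / ratio
    w = np.where(keep, p, 0.0)
    return w / w.sum()

def model_averaged_prediction(preds, weights):
    """Weighted average of per-model predictions.
    `preds` has shape (n_models, n_validation_obs)."""
    return np.asarray(weights, dtype=float) @ np.asarray(preds, dtype=float)

def rmspe(y_true, y_pred):
    """Root mean squared prediction error on the validation data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy usage with made-up numbers (not data from the study).
train_idx, val_idx = holdout_split(100)
weights = occams_window_weights([0.6, 0.3, 0.01, 0.09])
preds = np.array([[1.0, 2.0], [1.2, 2.2], [5.0, 5.0], [0.8, 1.8]])
y_hat = model_averaged_prediction(preds, weights)
error = rmspe([1.0, 2.0], y_hat)
```

In this toy case the third model falls outside the window and receives zero weight, so its poor predictions do not affect the averaged forecast.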
Results/Conclusions
Our preliminary results suggest that our approach, variable selection based on posterior probability of inclusion followed by model averaging of the ‘best’ subset of models, reduces RMSPE by approximately 15% compared with a routine that selects a single best model using DIC. Compared with model averaging using DIC, the results are mixed. RJMCMC and BLR perform as well as or slightly better than DIC-based averaging for all datasets, while SSVS and GVS perform better for some datasets and worse for others. SSVS and GVS require careful tuning of pseudo-priors to ensure full visitation of the model space, which likely explains their inconsistent performance. However, all of these methods facilitate the consideration of all possible models, while variable selection with DIC is only practical for datasets with few covariates.
SVS allows for simultaneous testing of many variables, possesses natural penalties for model complexity, and avoids the bias associated with stepwise procedures. Further, these methods can be easily implemented with widely used software such as OpenBUGS or JAGS. We conclude that variable selection based on posterior probability, particularly using the reversible jump algorithm or lasso minimization functions, provides an appealing alternative to traditional methods and merits consideration by ecologists working with large, multivariate datasets.
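As an illustration of selection by posterior probability, the sketch below estimates inclusion and model probabilities from MCMC draws of binary inclusion indicators. It assumes the sampler (for example, a model fitted in OpenBUGS or JAGS) has already produced a draws-by-covariates 0/1 matrix; the name `gamma`, the function names, and the toy values are hypothetical.

```python
import numpy as np
from collections import Counter

def inclusion_probabilities(gamma):
    """Posterior inclusion probability of each covariate: the fraction
    of MCMC draws in which its indicator equals 1."""
    return np.asarray(gamma, dtype=float).mean(axis=0)

def model_probabilities(gamma):
    """Posterior model probabilities estimated as visit frequencies,
    with each model keyed by its tuple of indicators."""
    rows = [tuple(int(g) for g in row) for row in np.asarray(gamma)]
    counts = Counter(rows)
    total = len(rows)
    return {model: c / total for model, c in counts.most_common()}

# Toy indicator draws: 4 MCMC iterations, 3 candidate covariates.
gamma = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [1, 1, 1],
                  [1, 0, 1]])
pip = inclusion_probabilities(gamma)
models = model_probabilities(gamma)
```

Visit frequencies are only reliable when the chain has explored the model space well, which is why poorly tuned samplers can give misleading probabilities.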