SYMP 4-4 - Bridging the two cultures: Latent variable statistical modeling with boosted regression trees

Tuesday, August 7, 2012: 9:20 AM
Portland Blrm 251, Oregon Convention Center
Thomas G. Dietterich and Rebecca A. Hutchinson, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR
Background/Question/Methods

In Breiman's "Two Cultures" paper, he contrasted statistical modeling (such as logistic regression) with prediction algorithms (such as random forests) and argued that inferences about traditional statistical models are risky unless the models also demonstrate high predictive accuracy on independent data. He discussed several applications where the goal is to make accurate predictions and showed that predictive algorithms gave substantially better results than statistical models and provided better insight into the relative importance of covariates.

However, prediction is not the only goal of statistical analysis. Consider the problem of mapping the distribution of a hard-to-detect species based on field surveys. A purely predictive approach to these data can only predict what an observer will detect, not the true distribution of the species. To model the true distribution, one must adopt a latent variable statistical model such as the Occupancy-Detection (OD) Model (MacKenzie et al.). The Occupancy-Detection Model contains two submodels: the Species Distribution Model predicts the probability that a site will be occupied by the species (based on site covariates), and the Detection Model predicts the probability that the species, when present, will be detected by the observer (based on detection covariates). Whether a site is occupied or not is the latent variable. A standard statistical modeling approach is to fit logistic regressions for these two submodels. However, that approach suffers from the drawbacks identified by Breiman.
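
The two-submodel structure described above can be made concrete with a small sketch of the OD likelihood. This is an illustration rather than the authors' code: it assumes each site is visited several times, that occupancy does not change between visits, and that there are no false positives (an unoccupied site always yields an all-zero detection history). The function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def od_log_likelihood(occ_logit, det_logit, y):
    """Log-likelihood of an occupancy-detection model (illustrative sketch).

    occ_logit : shape (n_sites,), logit of the occupancy probability per site
    det_logit : shape (n_sites, n_visits), logit of the detection probability
                per visit, conditional on the site being occupied
    y         : shape (n_sites, n_visits), 0/1 detection records
    """
    psi = sigmoid(occ_logit)   # P(site occupied)
    p = sigmoid(det_logit)     # P(detected on a visit | occupied)
    # Probability of the observed detection history if the site is occupied.
    hist_given_occupied = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    # Assumption: an unoccupied site can only produce an all-zero history.
    never_detected = (y.sum(axis=1) == 0).astype(float)
    # Marginalize over the latent occupancy state of each site.
    site_likelihood = psi * hist_given_occupied + (1.0 - psi) * never_detected
    return np.sum(np.log(site_likelihood))
```

In the standard logistic regression formulation, `occ_logit` would be a linear function of the site covariates and `det_logit` a linear function of the detection covariates; any other function of the covariates could be substituted without changing the likelihood itself.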

We describe a hybrid approach that adopts the statistical structure of the Occupancy-Detection Model, but fits the two submodels using a prediction algorithm: boosted regression trees. The algorithm extends Friedman's L2-tree-boosting framework to latent variable models. In a simulation study, we compare this approach, which we call OD-BRT, to the standard logistic regression formulation of the OD model and to a purely predictive approach: boosted regression trees (BRT) applied to predict the raw observations.
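
To give a sense of how tree boosting might be applied to a latent variable likelihood, the sketch below computes the functional gradient of an OD log-likelihood with respect to the occupancy logits and the detection logits. In a gradient boosting scheme, a regression tree would be fit by least squares to each set of these pseudo-residuals at every iteration. This is one natural formulation under the assumptions of repeated visits and no false positives; it is not necessarily the exact update used in OD-BRT, and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def od_log_lik(f, g, y):
    """OD log-likelihood for occupancy logits f (n_sites,),
    detection logits g (n_sites, n_visits), and 0/1 detections y."""
    psi, p = sigmoid(f), sigmoid(g)
    a = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    z = (y.sum(axis=1) == 0).astype(float)  # 1 if never detected
    return np.sum(np.log(psi * a + (1.0 - psi) * z))

def od_pseudo_residuals(f, g, y):
    """Gradient of the OD log-likelihood with respect to f and g.

    Returns (r_occ, r_det):
      r_occ[i]    = q[i] - psi[i]
      r_det[i, t] = q[i] * (y[i, t] - p[i, t])
    where q[i] is the posterior probability that site i is occupied
    given its detection history.
    """
    psi, p = sigmoid(f), sigmoid(g)
    a = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    z = (y.sum(axis=1) == 0).astype(float)
    # Posterior occupancy: equals 1 wherever the species was detected.
    q = psi * a / (psi * a + (1.0 - psi) * z)
    r_occ = q - psi
    r_det = q[:, None] * (y - p)
    return r_occ, r_det
```

The residuals have an intuitive reading: the occupancy submodel is pushed toward the posterior occupancy probability, and the detection submodel is updated only to the extent that a site is believed to be occupied.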

Results/Conclusions

The results show that when predicting the raw observations, there is little difference in AUC scores between BRT and OD-BRT. This supports the claim that predictive accuracy is not the only goal of ecological analysis. However, when predicting the true distribution of the species, our hybrid method, OD-BRT, outperforms the logistic regression formulation of the OD model. This demonstrates that the hybrid OD-BRT method can provide more accurate species distribution models in situations where there are unknown interactions and non-linearities in either the occupancy or the detection processes.