Monday, August 6, 2007: 4:20 PM
J3, San Jose McEnery Convention Center
We have demonstrated that the accuracy of non-standard regression and classification methods (CART and Random Forest) depends on two shape attributes of response patterns, diagonality and threshold strength, at large sample sizes (N=10,000). Diagonality measures how obliquely a response pattern is oriented relative to predictor gradients, and threshold strength measures the degree to which a response topology resembles a step. More recently, we tested how the relationships between model accuracy and these shape attributes changed with successively smaller sample sizes (N=1,000, 500, 250, 100, 50, and 25). We sub-sampled four response patterns: a diagonal Gaussian ridge, a non-diagonal Gaussian ridge, a diagonal threshold, and a non-diagonal threshold. We also tested a kernel smoother (NPMR) alongside CART and Random Forest. Using external validation, we calculated the mean AUC, a measure of classification accuracy, and the mean R², a measure of regression accuracy, for models built from 1,000 bootstrap replicates at each sample size. For example, the rank order of the mean AUC for each pattern and sample size revealed that the accuracy of classification Random Forest depended on diagonality and threshold strength across sample sizes. Highly diagonal structure at small sample sizes posed the greatest hazard for classification Random Forest, with a minimum AUC of 0.6 and a mean AUC of 0.73 at the smallest sample size (N=25). We advocate the use of response pattern shape attributes not only to bring statistical methods closer to the true underlying data shapes, but also to bring empirical data shapes closer to their theoretical underpinnings. We provide examples of how diagonality and threshold strength can be used to better understand the ecology of response patterns, using tree species distribution models for the Pacific United States.
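The mean-AUC evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `auc` and `mean_bootstrap_auc`, the `fit` callable (any routine that takes a list of `(x, y)` training pairs and returns a prediction function), and the resampling of the training set with replacement are assumptions made for illustration. AUC is computed here as the proportion of positive/negative pairs that the model ranks correctly, with ties counted as one half.

```python
import random


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_bootstrap_auc(fit, train, test, n_boot=1000, seed=0):
    """Mean external-validation AUC over bootstrap replicates of the training data.

    `fit` is a hypothetical callable: it takes a list of (x, y) pairs and
    returns a function mapping x to a score. `test` is a held-out list of
    (x, y) pairs that is never resampled.
    """
    rng = random.Random(seed)
    labels = [y for _, y in test]
    aucs = []
    for _ in range(n_boot):
        # Resample the training set with replacement (one bootstrap replicate).
        replicate = [rng.choice(train) for _ in train]
        predict = fit(replicate)
        scores = [predict(x) for x, _ in test]
        aucs.append(auc(scores, labels))
    return sum(aucs) / len(aucs)
```

Calling `mean_bootstrap_auc` once per sample size and response pattern would give a table of mean AUC values whose rank order can then be compared across patterns.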