Species distribution models are increasingly important for investigating the habitat requirements of introduced or native species, planning management policies for invasive species and conservation programs for endangered species, and understanding the essential determinants of biodiversity patterns. For many taxa and regions, limited data availability restricts the sample size that can be used for predictive modeling. Determining how sample size and the shape of the species response affect model performance is therefore critical for making reliable predictions. However, it is difficult for ecologists to evaluate species distribution models using field data, because the true species-environment relationship is unknown. To serve as reliable tools, predictive models must be able to approximate that true relationship. In our study, simulated data derived from a generalized beta function were used to develop and validate models. Because simulated data provide known species-environment relationships, we can evaluate model predictions against the true values and gain insight into the effects of specific factors on model performance. Simulation thus offers a way to control the true relationship, so that recommendations drawn from model comparisons can better address sampling issues and support studies of invasive or endangered species.
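As an illustration of how a known species-environment relationship can be simulated, the following minimal sketch uses the commonly cited form of the generalized beta function, y = k (x - a)^alpha (b - x)^gamma on the interval (a, b), rescaled so that the peak probability of occurrence is 1. The parameter names (a, b, alpha, gamma), the single environmental gradient drawn uniformly on [0, 1], and the Bernoulli draw of presence/absence are assumptions made for this example, not details reported in the study.

```python
import random

def generalized_beta(x, a, b, alpha, gamma):
    """Generalized beta response, rescaled to peak at 1 inside (a, b).

    a, b         : lower and upper limits of the species' range on the gradient
    alpha, gamma : shape parameters (alpha == gamma gives a symmetric curve,
                   unequal values give a skewed curve)
    """
    if x <= a or x >= b:
        return 0.0
    # Location of the maximum of (x - a)^alpha * (b - x)^gamma.
    x_opt = (alpha * b + gamma * a) / (alpha + gamma)
    peak = (x_opt - a) ** alpha * (b - x_opt) ** gamma
    return (x - a) ** alpha * (b - x) ** gamma / peak

def simulate_true_dataset(n, a, b, alpha, gamma, seed=0):
    """Return n tuples (x, true_probability, presence) along one gradient."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)              # assumed uniform gradient
        p = generalized_beta(x, a, b, alpha, gamma)
        presence = 1 if rng.random() < p else 0  # assumed Bernoulli draw
        data.append((x, p, presence))
    return data

# Hypothetical parameter values: one symmetric and one skewed response.
symmetric = simulate_true_dataset(1000, a=0.1, b=0.9, alpha=2.0, gamma=2.0)
skewed = simulate_true_dataset(1000, a=0.1, b=0.9, alpha=1.0, gamma=3.0)
```

Because the true probability is stored alongside each simulated observation, any model prediction can later be compared directly with the value that generated the data.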
Results/Conclusions
In this study, the generalized beta function was used to describe different shapes of species responses, and the true dataset was generated from it. Samples of different sizes were randomly drawn from the true dataset and then used to fit models. Modeling approaches including linear discriminant analysis, multiple logistic regression, random forests and artificial neural networks were developed on the sampled datasets. The quality of model predictions was evaluated using the area under the receiver operating characteristic curve (AUC) and several other performance metrics. Our results indicated that with increasing sample size, model accuracy increased and variability decreased across species response shapes and among models. For complex species response shapes, modeling approaches that can accommodate nonlinear associations and interactions among predictor variables, such as artificial neural networks, performed much better as sample size increased, whereas traditional approaches such as linear discriminant analysis and multiple logistic regression were much less sensitive to increasing sample size. Relative to the other models, random forests performed more consistently across sample sizes and were often among the best performers. Whether the models were developed on more than half or on less than half of the entire dataset also significantly influenced model performance.
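The sampling-and-evaluation loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the analysis used in the study: the true response is a simple bell-shaped curve, 4x(1 - x), on one gradient; the "model" is a deliberately simple centroid scorer standing in for the four approaches named above; and the sample sizes, dataset size, and number of replicates are hypothetical. AUC is computed as the proportion of presence/absence pairs in which the presence receives the higher score, with ties counted as one half.

```python
import random
import statistics

def auc(scores, labels):
    """AUC as the fraction of (presence, absence) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

def make_true_dataset(n, rng):
    """Toy true dataset: presence drawn with probability 4x(1 - x)."""
    data = []
    for _ in range(n):
        x = rng.random()
        p = 4.0 * x * (1.0 - x)
        data.append((x, 1 if rng.random() < p else 0))
    return data

def fit_centroid(sample):
    """Stand-in model: mean gradient value of the sampled presences."""
    presences = [x for x, y in sample if y == 1]
    xs = presences if presences else [x for x, _ in sample]
    return statistics.mean(xs)

def evaluate(sample_sizes, n_true=400, reps=10, seed=1):
    """Mean and SD of AUC on the true dataset for each sample size."""
    rng = random.Random(seed)
    true_data = make_true_dataset(n_true, rng)
    labels = [y for _, y in true_data]
    summary = {}
    for n in sample_sizes:
        aucs = []
        for _ in range(reps):
            sample = rng.sample(true_data, n)   # draw without replacement
            center = fit_centroid(sample)
            # Score every true observation: closer to the centroid = higher.
            scores = [-abs(x - center) for x, _ in true_data]
            aucs.append(auc(scores, labels))
        summary[n] = (statistics.mean(aucs), statistics.pstdev(aucs))
    return summary

summary = evaluate([20, 100, 300])
for n, (mean_auc, sd_auc) in summary.items():
    print(f"n={n:4d}  mean AUC={mean_auc:.3f}  SD={sd_auc:.3f}")
```

Repeating the draw several times per sample size is what allows both accuracy (mean AUC) and variability (its standard deviation) to be compared across sample sizes; replacing the stand-in scorer with an actual model would follow the same pattern.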