Species distribution models are increasingly important for investigating the habitat requirements of introduced or native species, planning management programs for invasive species and conservation programs for endangered species, and understanding the essential determinants of biodiversity patterns. In practice, the sample size available for predictive modeling is often limited because data availability differs among species. Determining how sample size and the shape of the species response affect model performance is therefore important for making reliable predictions. However, it is difficult for ecologists to evaluate species distribution models with field data because the true species-environment relationship is unknown, and a model can serve as a reliable tool only if it approximates that relationship. In our study, simulated data derived from the generalized beta function were used to develop and validate models. Because simulated data provide known species-environment relationships, we can evaluate model predictions against the true values and gain insight into how specific factors affect model performance. Simulation thus offers a way to control the true relationship, so that recommendations drawn from model comparisons can better address sampling issues and assist studies of invasive or endangered species.
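To make the simulation idea concrete, the following is a minimal sketch of how presence/absence data with a known response could be generated. The abstract does not state the functional form or the parameter values used; the form k (x - a)^alpha (b - x)^gamma is the one commonly given for the generalized beta response in the ecological literature and is assumed here, and all function names, parameter values, and the uniform sampling of the gradient are hypothetical.

```python
import random


def generalized_beta(x, a, b, alpha, gamma, k=1.0):
    """Assumed form: k * (x - a)**alpha * (b - x)**gamma for a < x < b, else 0.

    a and b bound the species' range on the gradient; alpha and gamma control
    the shape (equal values give a symmetric curve, unequal values a skewed one).
    """
    if x <= a or x >= b:
        return 0.0
    return k * (x - a) ** alpha * (b - x) ** gamma


def optimum(a, b, alpha, gamma):
    """Gradient value at the peak, from setting d/dx of the log response to zero."""
    return (alpha * b + gamma * a) / (alpha + gamma)


def occurrence_probability(x, a, b, alpha, gamma):
    """Response rescaled so that the peak equals 1 (a hypothetical choice)."""
    peak = generalized_beta(optimum(a, b, alpha, gamma), a, b, alpha, gamma)
    return generalized_beta(x, a, b, alpha, gamma) / peak


def simulate_true_dataset(n, a, b, alpha, gamma, seed=0):
    """Draw n gradient values uniformly on [a, b] and Bernoulli presence/absence."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(a, b)
        p = occurrence_probability(x, a, b, alpha, gamma)
        data.append((x, 1 if rng.random() < p else 0))
    return data


if __name__ == "__main__":
    # Symmetric (alpha == gamma) versus skewed (alpha != gamma) response shapes.
    for alpha, gamma in [(2.0, 2.0), (1.0, 3.0)]:
        data = simulate_true_dataset(1000, 0.0, 1.0, alpha, gamma)
        prevalence = sum(y for _, y in data) / len(data)
        print(alpha, gamma, optimum(0.0, 1.0, alpha, gamma), prevalence)
```

Because the probability of presence is known at every point of the gradient, any fitted model can be compared with the truth rather than with noisy field observations.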
Results/Conclusions
In this study, the generalized beta function was used to describe different shapes of species responses and to generate the true dataset. Samples of different sizes were randomly drawn from the true dataset and used to develop models. Modeling approaches including linear discriminant analysis, multiple logistic regression, random forests, and artificial neural networks were applied to the sampled datasets. Model predictions were evaluated using the area under the receiver operating characteristic curve (AUC) and several other performance metrics. Our results indicated that, as sample size increased, model accuracy increased and variability decreased across species response shapes and among models. The prediction success of artificial neural networks increased markedly with sample size, whereas random forests performed more consistently across sample sizes. In addition, whether models were developed on more or less than half of the entire dataset may significantly influence model performance.
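The workflow described above (generate a true dataset, draw samples of different sizes, fit a model, score it with AUC) can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's implementation: the response form, the parameter values, the sample sizes, the number of replicates, and the small gradient-descent logistic regression with a quadratic term are all hypothetical stand-ins, and AUC is computed here against the labels of the full true dataset.

```python
import math
import random
import statistics


def occurrence_probability(x, alpha=2.0, gamma=2.0):
    """Assumed generalized beta response on [0, 1], rescaled so its peak is 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    x_opt = alpha / (alpha + gamma)
    peak = x_opt ** alpha * (1.0 - x_opt) ** gamma
    return x ** alpha * (1.0 - x) ** gamma / peak


def features(x):
    """Intercept, linear and quadratic terms on a gradient rescaled to [-1, 1]."""
    z = 2.0 * x - 1.0
    return (1.0, z, z * z)


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(sample, steps=200, lr=1.0):
    """Logistic regression fitted by plain batch gradient descent (a stand-in model)."""
    w = [0.0, 0.0, 0.0]
    n = len(sample)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, y in sample:
            f = features(x)
            err = sigmoid(sum(wk * fk for wk, fk in zip(w, f))) - y
            for k in range(3):
                grad[k] += err * f[k]
        w = [wk - lr * gk / n for wk, gk in zip(w, grad)]
    return w


def predict(w, x):
    return sigmoid(sum(wk * fk for wk, fk in zip(w, features(x))))


def auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC; tied scores receive their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def run_experiment(sample_sizes=(20, 100, 500), replicates=5, n_true=2000, seed=1):
    """Return {sample size: (mean AUC, standard deviation of AUC)} over replicates."""
    rng = random.Random(seed)
    true_data = []
    for _ in range(n_true):
        x = rng.random()
        true_data.append((x, 1 if rng.random() < occurrence_probability(x) else 0))
    labels = [y for _, y in true_data]
    summary = {}
    for n in sample_sizes:
        aucs = []
        for _ in range(replicates):
            sample = rng.sample(true_data, n)  # random draw without replacement
            w = fit_logistic(sample)
            aucs.append(auc(labels, [predict(w, x) for x, _ in true_data]))
        summary[n] = (statistics.mean(aucs), statistics.stdev(aucs))
    return summary


if __name__ == "__main__":
    for n, (mean_auc, sd_auc) in run_experiment().items():
        print(n, round(mean_auc, 3), round(sd_auc, 3))
```

The mean of the replicate AUC values tracks accuracy and their standard deviation tracks variability, the two quantities the results compare across sample sizes. Other model types could be substituted for `fit_logistic` and `predict` without changing the rest of the loop.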