COS 145-7 - Evaluating new approaches to modeling data sets with many zeros: An example using anadromous fish counts

Thursday, August 9, 2012: 10:10 AM
C120, Oregon Convention Center
John T. Finn (1), Martha E. Mather (2), Matthew K. Burak (1), Robert M. Muth (1), John Kim (3) and Mike Sutherland (4); (1) Environmental Conservation, University of Massachusetts, Amherst, MA; (2) U.S. Geological Survey, Kansas Cooperative Fish and Wildlife Research Unit, Kansas State University, Manhattan, KS; (3) Corvallis Forestry Sciences Laboratory, USDA Forest Service Pacific Northwest Research Station, Corvallis, OR; (4) Mathematics and Statistics, University of Massachusetts, Amherst, MA
Background/Question/Methods

Effective environmental decision making requires statistics usable for real-world conservation problems. Ecological field data frequently violate the assumptions of classical statistics, including normally distributed errors and homogeneous variance. Although alternative analyses are available for data with heterogeneous variance and many zeros, practical guidelines for choosing, fitting, and evaluating extended generalized linear models are limited. Using upriver counts of a migratory fish species (alewife) as the response and time (linear and squared terms) as the predictor, we compared six regression models: Poisson, zero-inflated Poisson (ZIP), negative binomial (NB), zero-inflated negative binomial (ZINB), generalized Poisson (GP), and zero-inflated generalized Poisson (ZIGP). We evaluated the models using the Akaike Information Criterion (AIC) and Pearson’s goodness-of-fit test (the sum of squared Pearson residuals). In addition, we ran 1,000 parametric simulations of each model (using its maximum likelihood parameter estimates) to see whether the model reproduced the maximum count and the proportion of zero counts that we observed. We used the Vuong test to compare non-nested models. Finally, we displayed the likelihood surface of each model over time (the predictor variable) and fish count (the response variable) in two ways: colored ‘image’ plots, in which color represents likelihood, and 3-D surface plots, in which height on the z-axis represents likelihood.
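The evaluation steps described above (AIC, the Pearson statistic, and a parametric simulation check on the zero proportion and the maximum) can be sketched for two of the six models. This is a minimal illustration under stated assumptions, not the authors' code: it uses synthetic counts rather than the alewife data, fits intercept-only Poisson and NB models (no time terms), finds the NB size parameter with a crude grid search instead of a proper optimizer, and runs 100 simulations instead of 1,000. All function names and parameter values are hypothetical.

```python
import math
import random

def poisson_logpmf(y, mu):
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def nb_logpmf(y, mu, k):
    # NB2 form: mean mu, variance mu + mu**2 / k (k is the size parameter).
    return (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
            + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))

def rpois(lam, rng):
    # Knuth's multiplication method, applied in chunks so exp(-lam) never underflows.
    total = 0
    while lam > 0:
        chunk = min(lam, 500.0)
        limit = math.exp(-chunk)
        prod = rng.random()
        while prod > limit:
            total += 1
            prod *= rng.random()
        lam -= chunk
    return total

def rnb(mu, k, rng):
    # Negative binomial as a gamma mixture of Poissons (shape k, scale mu / k).
    return rpois(rng.gammavariate(k, mu / k), rng)

def fit_nb_size(counts, mu):
    # Crude grid search over k on a log scale (assumption: stands in for an optimizer).
    grid = [10 ** (-2 + 4 * i / 400) for i in range(401)]
    return max(grid, key=lambda k: sum(nb_logpmf(y, mu, k) for y in counts))

def summarize(sim_sets):
    # Mean proportion of zeros and mean maximum across simulated data sets.
    zeros = [sum(y == 0 for y in s) / len(s) for s in sim_sets]
    maxima = [max(s) for s in sim_sets]
    return sum(zeros) / len(zeros), sum(maxima) / len(maxima)

rng = random.Random(2012)
# Synthetic stand-in data (NOT the alewife counts): hypothetical mean 20, size 0.3.
counts = [rnb(20.0, 0.3, rng) for _ in range(300)]
n = len(counts)

# For intercept-only models the sample mean is the ML estimate of the mean.
mu_hat = sum(counts) / n
k_hat = fit_nb_size(counts, mu_hat)

ll_pois = sum(poisson_logpmf(y, mu_hat) for y in counts)
ll_nb = sum(nb_logpmf(y, mu_hat, k_hat) for y in counts)
aic_pois = 2 * 1 - 2 * ll_pois   # one parameter (mean)
aic_nb = 2 * 2 - 2 * ll_nb       # two parameters (mean, size)

# Pearson statistic: sum of squared Pearson residuals under each fitted model.
pearson_pois = sum((y - mu_hat) ** 2 / mu_hat for y in counts)
pearson_nb = sum((y - mu_hat) ** 2 / (mu_hat + mu_hat ** 2 / k_hat) for y in counts)

obs_zero = sum(y == 0 for y in counts) / n
obs_max = max(counts)

n_sims = 100  # the abstract reports 1,000 simulations; fewer here to keep the sketch fast
sims_pois = [[rpois(mu_hat, rng) for _ in range(n)] for _ in range(n_sims)]
sims_nb = [[rnb(mu_hat, k_hat, rng) for _ in range(n)] for _ in range(n_sims)]
zero_pois, max_pois = summarize(sims_pois)
zero_nb, max_nb = summarize(sims_nb)

print(f"AIC      Poisson {aic_pois:.1f}   NB {aic_nb:.1f}")
print(f"Pearson  Poisson {pearson_pois:.1f}   NB {pearson_nb:.1f}")
print(f"Zeros    observed {obs_zero:.3f}   Poisson {zero_pois:.3f}   NB {zero_nb:.3f}")
print(f"Maximum  observed {obs_max}   Poisson {max_pois:.1f}   NB {max_nb:.1f}")
```

On overdispersed synthetic data like this, the Poisson model should show a much higher AIC and Pearson statistic and should generate almost no zeros, which is the kind of mismatch the simulation check is meant to expose. Extending the sketch to the zero-inflated and generalized Poisson models, or to a time predictor, would require a real fitting routine.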

Results/Conclusions

In 2008, fish were counted in 611 randomly selected 10-min video segments from Town Brook in Plymouth, Massachusetts. Counts ranged from zero (37.5% of the observations) to a maximum of 1,180 fish. Based on AIC, the ZIGP model fit best, although the ZINB model was similar. Adding a zero-inflation factor improved some models but not others. Parametric simulations showed that only the GP and ZIGP models reproduced the observed maximum, although several models reproduced the observed proportion of zero counts. Visualization of the likelihood surfaces clearly showed that the Poisson and ZIP models did not fit this data set well, whereas the NB and GP distributions did a better (and similar) job of representing the observed variability. The shape of the best-fitting models was not at all normal: the most likely count at any one time was zero, and higher counts were progressively less likely. There was no increase in likelihood around the expected value. We encountered problems finding a solution for the ZINB model, but the other models were relatively well behaved. Lessons learned from this research can assist field ecologists in selecting and appropriately using models for highly variable, zero-dominated data sets across a range of organisms.
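The shape described above, with zero as the most likely count and no rise in likelihood near the expected value, can be illustrated with a negative binomial distribution whose size parameter is below 1. This is a sketch under assumptions: the mean and size values are hypothetical, not the fitted estimates, and the NB is used here as a convenient stand-in for the best-fitting models. A Poisson distribution with the same mean is shown for contrast.

```python
import math

def nb_pmf(y, mu, k):
    # NB2 form: mean mu, variance mu + mu**2 / k.
    return math.exp(math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
                    + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))

def poisson_pmf(y, mu):
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

mu, k = 50.0, 0.4  # hypothetical values chosen for illustration only
ys = range(0, 201)
nb = [nb_pmf(y, mu, k) for y in ys]
pois = [poisson_pmf(y, mu) for y in ys]

# Location of the most likely count under each distribution.
nb_mode = max(ys, key=lambda y: nb[y])
pois_mode = max(ys, key=lambda y: pois[y])

# With k <= 1 the NB probabilities fall monotonically from y = 0 upward.
nb_decreasing = all(nb[y] > nb[y + 1] for y in range(len(nb) - 1))

print("NB mode:", nb_mode, " Poisson mode:", pois_mode)
print("NB monotonically decreasing:", nb_decreasing)
print("NB P(0) vs P(50):", nb[0], nb[50])
```

The Poisson probabilities peak near the mean, whereas the NB probabilities with size below 1 peak at zero and decline from there, even though both distributions share the same expected value.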