PS 56-106
Analysis of in situ monitoring data for near-shore microalgae communities obtained by imaging flow-cytometry: integrating machine-learning into a state-space model of phytoplankton abundance
Highly resolved data on phytoplankton community composition can now be collected in situ using the Imaging FlowCytobot (IFCB), a submersible flow-cytometer that images thousands of individual cells and colonies in water samples collected at 30-minute intervals. At the Martha’s Vineyard Coastal Observatory (MVCO), an IFCB deployed continuously since mid-2006 has recorded over 400 million images, providing unprecedented views of a natural phytoplankton assemblage. Analyses of species diversity and community composition within this data set require taxonomic classification of the imaged cells, but the quantity of images makes expert identification impractical. Supervised machine learning algorithms can increase data throughput but also introduce much higher error rates than expert identification. To integrate the output of classification algorithms into standard methods for statistical inference, classification errors must be accurately represented by a probabilistic model.
Results/Conclusions
Ecological survey data that include classification errors have previously been modeled with multinomial distributions over the predicted class. Ensemble-based classification algorithms, in particular “random forests” of classification trees, produce richer, multivariate output that is incompatible with this approach. We describe a novel application of Dirichlet–multinomial mixture models to this multivariate output and demonstrate how to estimate model parameters without overfitting. We compare models for the univariate and multivariate output of classification algorithms using goodness-of-fit between the modeled distributions and separate model-validation data at two scales. In a small-scale problem involving just three classes of phytoplankton, the multivariate model fit the empirical distribution of random forest output better than the univariate model fit the empirical distribution over a single predicted class. This result suggests that the random forest algorithm, combined with our multivariate data model, should better represent uncertainty in the phytoplankton time series. However, in tests on a large-scale problem involving 50 classes of phytoplankton observed at MVCO, both methods provided similar point estimates of community composition. We will present more systematic model validation, which may yet reveal a superior method for the large-scale problem.
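To make the multivariate data model concrete, the following is a minimal sketch of how random forest vote counts for a single image could be scored under a Dirichlet–multinomial mixture. It is an illustration under stated assumptions, not the parameterization used in this work: the function names, the concentration parameters in ALPHAS, and the mixture weights in WEIGHTS are all hypothetical, and parameter estimation is not shown. Each true class is assumed to have its own Dirichlet–multinomial distribution over the vector of tree votes, and Bayes' rule then gives a posterior over true classes.

```python
from math import exp, lgamma, log


def dirichlet_multinomial_logpmf(votes, alpha):
    """Log-probability of a vote-count vector under a Dirichlet-multinomial.

    votes: number of trees voting for each class (non-negative integers).
    alpha: positive concentration parameters, one per class.
    """
    n = sum(votes)
    a0 = sum(alpha)
    # Multinomial coefficient: n! / prod(x_k!)
    log_coef = lgamma(n + 1) - sum(lgamma(v + 1) for v in votes)
    # Gamma(a0) / Gamma(n + a0)
    log_norm = lgamma(a0) - lgamma(n + a0)
    # prod(Gamma(x_k + alpha_k) / Gamma(alpha_k))
    log_terms = sum(lgamma(v + a) - lgamma(a) for v, a in zip(votes, alpha))
    return log_coef + log_norm + log_terms


def class_posterior(votes, alphas, weights):
    """Posterior over true classes for one image, given its vote counts.

    alphas: one concentration vector per true class (mixture components).
    weights: prior probability of each true class (mixture weights).
    """
    log_joint = [log(w) + dirichlet_multinomial_logpmf(votes, a)
                 for a, w in zip(alphas, weights)]
    top = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [exp(lj - top) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical parameters for a three-class problem (illustration only).
ALPHAS = [(8.0, 1.0, 1.0), (1.0, 8.0, 1.0), (1.0, 1.0, 8.0)]
WEIGHTS = [0.5, 0.3, 0.2]

if __name__ == "__main__":
    # A five-tree forest casting 3, 2, and 0 votes for the three classes.
    print(class_posterior((3, 2, 0), ALPHAS, WEIGHTS))
```

By contrast, the univariate approach keeps only the class with the most votes and models that single label with a multinomial distribution; the sketch above retains the full vote vector, which is the extra information the multivariate model is meant to exploit.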