COS 167-6 - Automating tropical pollen analysis using layered machine learning

Thursday, August 9, 2012: 3:20 PM
B117, Oregon Convention Center

ABSTRACT WITHDRAWN

Surangi W. Punyasena, University of Illinois; David K. Tcheng, University of Illinois/National Center for Supercomputing Applications; Derek S. Haselhorst, University of Illinois

Background/Question/Methods

Contained within the fossil pollen and spore record is one of the most comprehensive histories of terrestrial vegetation and its response to long-term environmental change. However, the inefficiencies inherent in the traditional collection of palynological data have meant that this record remains one of the few areas of scientific inquiry where automated data acquisition and image analysis have made little headway. Because of the higher taxonomic diversity and more limited number of experts, the difficulty of palynological data collection has caused tropical palynology, in particular, to lag behind temperate communities in the development of large spatial and long temporal pollen datasets.

To address the problem of data collection and data availability for tropical systems, we present a radical alternative to traditional palynological counts – an automated counting system capable of dealing with high-diversity pollen samples. Our instance-based machine learning system employs full bias optimization, plus two-tiered cross-validation to address the problem of overfitting common to machine learning studies. This high-throughput system is being developed and tested using 18 years of pollen rain samples from Barro Colorado Island (BCI) and Parque Nacional San Lorenzo, Panama. Our preliminary trials focus on 76 BCI pollen samples collected between 1996 and 2005.
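As an illustration of how a two-tiered (nested) cross-validation can guard against overfitting in an instance-based learner, the following minimal Python sketch tunes a k-nearest-neighbour classifier in an inner loop and scores it in an outer loop. It is not the authors' system: the k-nearest-neighbour classifier, the function names, the fold counts, and the synthetic two-taxon data are all assumptions made for illustration, and tuning k stands in only loosely for the bias optimization described above.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training instances."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test instances whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def folds(data, n_folds):
    """Yield (train, test) splits; fold i holds every n_folds-th item."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [item for j, item in enumerate(data) if j % n_folds != i]
        yield train, test


def nested_cv(data, k_values, outer=5, inner=4):
    """Return mean outer-fold accuracy, choosing k inside each outer fold."""
    outer_scores = []
    for train, test in folds(data, outer):
        # Inner tier: pick k using only the outer-training data.
        def inner_score(k):
            scores = [accuracy(tr, va, k) for tr, va in folds(train, inner)]
            return sum(scores) / len(scores)

        best_k = max(k_values, key=inner_score)
        # Outer tier: score the chosen k on data never used for tuning.
        outer_scores.append(accuracy(train, test, best_k))
    return sum(outer_scores) / len(outer_scores)


def make_toy_data(n_per_class=40, seed=0):
    """Synthetic 2-D features for two hypothetical taxa (illustration only)."""
    rng = random.Random(seed)
    data = []
    for label, centre in (("taxon_a", 0.0), ("taxon_b", 3.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    score = nested_cv(make_toy_data(), k_values=[1, 3, 5])
    print(f"nested cross-validation accuracy: {score:.2f}")
```

Because the value of k is selected without ever seeing the outer test fold, the reported accuracy is not inflated by the tuning step, which is the property a two-tiered design is meant to provide.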

Results/Conclusions

Although this research is ongoing, there are some key preliminary findings that we can present at this time. First, in a smaller related analysis of five temperate modern taxa, our machine-based classification system achieved 93% accuracy in identifications, based on a 641-fold cross-validation. Second, in scaling up our analysis, we have developed a customized image analysis system that takes advantage of commercial scanning microscopes to produce high-resolution images of palynological slides. There is an enormous amount of image data in these scans, with a single slide averaging ~500 GB of information. Our current preliminary trials use 50 randomly subsampled images from our 76 samples. Using these images, we have established the range of variability in human expert identifications (based on seven participants) and demonstrated that the range of machine uncertainty falls within that range.
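One simple way to compare machine and expert consistency is sketched below in Python. The abstract does not state which measure was used, so the pairwise-agreement metric, the function names, and the toy labels are all assumptions for illustration: the sketch computes the lowest and highest agreement between any two experts, then checks whether the machine's mean agreement with the experts lies inside that range.

```python
from itertools import combinations


def agreement(labels_a, labels_b):
    """Fraction of images on which two label sequences give the same taxon."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def expert_agreement_range(expert_labels):
    """Lowest and highest pairwise agreement among the human experts."""
    scores = [agreement(a, b) for a, b in combinations(expert_labels, 2)]
    return min(scores), max(scores)


def machine_within_expert_range(machine_labels, expert_labels):
    """Compare the machine's mean agreement with experts to the expert range."""
    low, high = expert_agreement_range(expert_labels)
    scores = [agreement(machine_labels, e) for e in expert_labels]
    mean_score = sum(scores) / len(scores)
    return low <= mean_score <= high, mean_score, (low, high)


if __name__ == "__main__":
    # Hypothetical identifications of six images by three experts and a machine.
    experts = [
        ["A", "A", "B", "B", "C", "C"],
        ["A", "A", "B", "C", "C", "C"],
        ["A", "B", "B", "B", "C", "A"],
    ]
    machine = ["A", "A", "B", "B", "C", "A"]
    inside, mean_score, (low, high) = machine_within_expert_range(machine, experts)
    print(f"expert range: {low:.2f} to {high:.2f}; machine mean: {mean_score:.2f}")
    print("machine within expert range:", inside)
```

Raw agreement ignores agreement expected by chance; a chance-corrected statistic could be substituted in `agreement` without changing the rest of the sketch.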

The use of machine-based classifications has the potential to advance the quantity and quality of tropical palynological research by providing objective measures of the consistency and reliability of palynological identifications. Although preliminary, our results demonstrate that automation is a feasible and desirable alternative to standard counting practices – particularly in the case of high-diversity samples.