PS 45-177 - pKluster: A tool for scalable k-means analysis of geospatiotemporal data sets

Wednesday, August 9, 2017
Exhibit Hall, Oregon Convention Center
Richard Tran Mills (1), Forrest M. Hoffman (2,3), Jitendra Kumar (4), Vamsi Sripathi (1), Sarat Sreepathi (5), and William Hargrove (6); (1) Technical Computing Engineering, Intel Corporation, Hillsboro, OR; (2) Department of Earth System Science, University of California, Irvine, CA; (3) Climate Change Science Institute (CCSI), Oak Ridge National Laboratory, Oak Ridge, TN; (4) Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN; (5) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN; (6) Southern Research Station, USDA Forest Service, Eastern Forest Environmental Threat Assessment Center, Asheville, NC
Background/Question/Methods

The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Standard tools that run on individual workstations are impractical for analyzing and synthesizing data sets this large; however, new algorithmic approaches that effectively exploit the complex memory hierarchies and extremely high levels of parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe pKluster, an open-source tool we have developed for accelerated k-means clustering of geospatiotemporal data, and discuss its use in several ecological applications.
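
For readers unfamiliar with k-means, the following is a minimal serial sketch of the standard (Lloyd's) iteration in plain Python. It is an illustration only and is not pKluster's implementation; the function names `squared_distance` and `kmeans`, the explicit `initial_centroids` argument, and the toy data are all assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iters=100):
    """Illustrative Lloyd's k-means (hypothetical helper, not pKluster code).

    points: list of equal-length numeric tuples (one per observation).
    initial_centroids: list of k starting centroids, supplied by the caller.
    Returns (centroids, assignments).
    """
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignments = [-1] * len(points)
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no label changed, so the iteration has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, [points[0], points[3]])
```

The assignment step dominates the cost, since every point is compared against every centroid in each iteration. Note also that the result depends on the starting centroids: Lloyd's iteration converges to a local optimum, not necessarily the best clustering.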

Results/Conclusions

pKluster runs on systems ranging from laptops to massively parallel machines, which allows it to scale to massive geospatial data sets. Recently, it has been further enhanced with optimizations that increase computational intensity and improve utilization of the wide SIMD lanes on state-of-the-art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture. We describe some of these optimizations in detail and present performance studies that demonstrate their impact and the size of data sets that can be practically analyzed with the tool. We also demonstrate application of the tool in ecological studies, including quantitative delineation of ecoregions, forest cover change detection, and classification of forest canopy structures from LiDAR point clouds, and we speculate on new kinds of analysis of climatic and ecological data sets that these capabilities could enable.
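
The abstract does not state which optimizations are used. As an assumption for illustration, one widely used way to raise the computational intensity of k-means is to expand the squared distance as ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, so that the dominant cost becomes a dense matrix-matrix product of the observations with the centroids, an operation that vectorizes well. The sketch below shows only the arithmetic in plain Python; the function name `squared_distances_via_products` is hypothetical, and a performance-oriented code would presumably hand the cross-term product to an optimized linear algebra routine.

```python
def squared_distances_via_products(points, centroids):
    """Return D with D[i][j] = ||points[i] - centroids[j]||^2.

    Uses the expansion ||x||^2 - 2 x.c + ||c||^2 instead of subtracting
    vectors pair by pair. Hypothetical helper for illustration only.
    """
    # Squared norms: computed once per point and once per centroid.
    point_norms = [sum(x * x for x in p) for p in points]
    centroid_norms = [sum(c * c for c in ctr) for ctr in centroids]
    # Cross terms: cross[i][j] = points[i] . centroids[j]. Taken together,
    # this is the matrix-matrix product of the point matrix with the
    # transposed centroid matrix.
    cross = [
        [sum(x * c for x, c in zip(p, ctr)) for ctr in centroids]
        for p in points
    ]
    return [
        [point_norms[i] - 2.0 * cross[i][j] + centroid_norms[j]
         for j in range(len(centroids))]
        for i in range(len(points))
    ]


# Small check against distances worked out by hand.
demo_points = [(1.0, 2.0), (3.0, 4.0)]
demo_centroids = [(0.0, 0.0), (1.0, 1.0)]
distances = squared_distances_via_products(demo_points, demo_centroids)
```

In floating-point arithmetic the expanded form can lose accuracy when a point lies very close to a centroid, because two large, nearly equal quantities are subtracted. That trade-off is worth keeping in mind when adopting this formulation.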