Wednesday, August 5, 2009

PS 49-93: DataONE: A virtual data center for biology, ecology, and the environmental sciences

William Michener1, Suzie Allard2, Paul Allen3, Peter Buneman4, Randy Butler5, John Cobb6, Robert Cook6, Patricia Cruse7, Ewa Deelman8, David DeRoure9, Cliff Duke10, Mike Frame11, Carole Goble12, Stephanie Hampton13, Donald Hobern14, Peter Honeyman15, Jeffery Horsburgh16, Viv Hutchison17, Matt Jones13, Steve Kelling18, Jeremy Kranowitz19, John Kunze7, Bertram Ludaescher20, Maribeth Manoff2, Ricardo Pereira21, Line Pouchard6, Robert Sandusky22, Ryan Scherle23, Mark S. Servilla1, Kathleen Smith23, Carol Tenopir2, Dave Vieglais24, Von Welch5, Jake Weltzin25, and Bruce Wilson6. (1) University of New Mexico, (2) University of Tennessee, (3) Cornell University, (4) University of Edinburgh, (5) University of Illinois - Urbana Champaign, (6) Oak Ridge National Laboratory, (7) University of California - California Digital Library, (8) University of Southern California, (9) University of Southampton, (10) Ecological Society of America, (11) U.S. Geological Survey - National Biological Information Infrastructure, (12) University of Manchester, (13) National Center for Ecological Analysis and Synthesis, (14) Atlas of Living Australia, (15) University of Michigan, (16) Utah State University, (17) US Geological Survey, (18) Cornell Lab of Ornithology, (19) The Keystone Center, (20) University of California - Davis, (21) Taxonomic Databases Working Group (Campinas, Brazil), (22) University of Illinois - Chicago, (23) National Evolutionary Synthesis Center, (24) University of Kansas, (25) USA National Phenology Network

Background/Question/Methods

Data about life on Earth and the environment are often unavailable or unusable for numerous reasons. Those data that are available are broadly dispersed and can be difficult to discover and use. Because multiple data and metadata standards are employed, integration and analysis have been difficult to achieve. Moreover, once analyses are completed, sharing and replicating workflows and results pose further challenges.

DataONE is being designed and constructed to address four key challenges:

1.    Data loss—by preserving at-risk (orphaned) biological/ecological/environmental data from individual scientists

2.    Scattered data sources—by facilitating discovery and access of data through a single easy-to-use portal

3.    Data deluge—by providing a toolbox that empowers scientists and organizations to more easily and effectively manage, analyze, and synthesize data

4.    Poor data practices—by creating an informatics-literate workforce through innovative outreach and training efforts (e.g., best-practice videos, podcasts, online certificate programs, downloadable best-practice guides, and exemplars of data management plans)

Results/Conclusions

DataONE will enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it.

The system is designed around a nucleus of three existing data centers (coordinating nodes) and a broad array of data holdings such as those maintained by libraries, research networks, and academic and governmental organizations (member nodes). The cyberinfrastructure promotes discovery of and access to data by providing one-stop shopping for data and metadata (information about the data that enables its use) concerning Earth’s biota and environments. DataONE provides tools (e.g., metadata management and scientific visualization tools as part of an “investigator’s toolbox”), training, and outreach to scientists and students in a concerted effort to enable and promote data preservation, data stewardship, and data sharing. Through a series of working group meetings, computer and information scientists are engaged in developing and promulgating ontologies that will facilitate data integration and simplify the creation of complex scientific workflows. The DataONE portal simplifies the process of acquiring and using appropriate scientific workflow software such as Kepler and Taverna, as well as publishing and sharing new workflows via mechanisms such as myExperiment, which allow workflows to be reused and potentially adapted for other purposes.