Big Data Curation
Renee Miller, University of Toronto/IBM
More than a decade ago, Peter Buneman used the term curated databases to refer to databases that are created and maintained through the (often substantial) effort and domain expertise of humans. These human experts clean the data, integrate it with new sources, prepare it for analysis, and share it with other experts in their field. Data curation seeks to support human curators in all of the activities needed to maintain and enhance the value of their data over time. Curation includes data provenance: the process of understanding the origins of data and how it was created, cleaned, or integrated.

Big Data offers opportunities to solve curation problems in new ways. The availability of massive data is making it possible to infer semantic connections among data, connections that are central to solving difficult integration, cleaning, and analysis problems. Some of the nuanced semantic differences that eluded enterprise-scale curation solutions can now be understood using evidence from Big Data. Big Data Curation leverages the human expertise that has been embedded in Big Data, whether in general knowledge data created through mass collaboration or in specialized knowledge bases created by incentivized user communities that value the creation and maintenance of high-quality data.

In this talk, I describe our experience in Big Data Curation, including our work over the last five years curating NIH Clinical Trials data, which we have published as Linked Open Data at linkedCT.org. I give an overview of how we have adapted some of the traditional solutions for data curation to account for, and take advantage of, Big Data.
@inproceedings{miller,
  author    = {Renee Miller},
  title     = {Big Data Curation},
  year      = {2015},
  address   = {Edinburgh, Scotland},
  publisher = {USENIX Association},
  month     = jul,
}