Authors: James W. McNally
Increasingly, organizations in both the public and private sectors that fund the collection of research information expect, and in many cases mandate, the public sharing of these data as a condition of support. While this represents a positive expression of the idea that information is a public good, it has also produced a veritable flood of new studies, surveys, and administrative records entering the public domain, and with it the emergence of the Big Data model in the secondary analysis of research information. Many of these data are managed by data repositories that ingest, process, clean, and enhance the files so that they are accurate and consistent and introduce as little error as possible into the analysis stream of information. At the same time, this huge influx of Big Data resources has raised expectations for the rapid release of data, which complicates the need for a thorough review and cleaning of files before distribution. This paper reviews emerging approaches within a research data repository that seek to maintain high quality control while managing Big Data streams as they enter the system.
Keywords: Big Data, repository, distribution, data error, processing