Fast data analysis with integrated statistical metadata in scientific datasets

Jialin Liu, Yong Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, have been used widely in many data-intensive applications. These libraries have their own file formats and I/O functions to provide efficient access to large datasets. Recent studies have started to utilize indexing, subsetting, and data reorganization to manage increasingly large datasets. In this work, we present Fast Analysis with Statistical Metadata (FASM), an approach that boosts data analysis performance by subsetting the data and integrating a small amount of statistics into the original datasets. The added statistical information describes the data shape and provides knowledge of the data distribution; the original I/O libraries can therefore use this statistical metadata to perform fast queries and analyses. Different subsetting schemes affect the access pattern and the I/O performance. We present a comparison study of different subsetting schemes that focuses on three dominant factors: the shape, the concurrency, and the locality. The added statistical metadata slightly increases the original data size, and we evaluate this cost and the resulting trade-off as well. This work is the first study that utilizes statistical metadata with various subsetting schemes to perform fast queries and analyses on large datasets. The proposed FASM approach is currently evaluated with PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. FASM can potentially lead to a new dataset design and can have an impact on big data analysis.
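The idea of answering queries from per-subset statistics can be illustrated with a minimal sketch. This is not the authors' FASM implementation: it assumes a one-dimensional in-memory list in place of a PnetCDF variable, fixed-size contiguous blocks as the only subsetting scheme, and minimum/maximum as the only stored statistics. The names `BlockStats`, `build_stats`, and `range_query` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockStats:
    """Hypothetical per-subset statistics stored alongside the data."""
    start: int      # index of the first element in the block
    stop: int       # one past the index of the last element in the block
    minimum: float  # smallest value in the block
    maximum: float  # largest value in the block


def build_stats(data: Sequence[float], block_size: int) -> List[BlockStats]:
    """Split the data into fixed-size blocks and record min/max for each."""
    stats = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stats.append(BlockStats(start, start + len(block), min(block), max(block)))
    return stats


def range_query(
    data: Sequence[float], stats: List[BlockStats], low: float, high: float
) -> Tuple[List[int], int]:
    """Return indices whose value lies in [low, high] and the number of blocks scanned."""
    hits = []
    blocks_read = 0
    for s in stats:
        # The statistics alone prove that this block holds no match, so skip it.
        if s.maximum < low or s.minimum > high:
            continue
        blocks_read += 1
        for i in range(s.start, s.stop):
            if low <= data[i] <= high:
                hits.append(i)
    return hits, blocks_read


data = [1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23]
stats = build_stats(data, block_size=4)
hits, blocks_read = range_query(data, stats, low=11, high=12)
```

In this toy case only one of the three blocks has to be scanned. In a real dataset the skipped blocks would correspond to I/O requests that are never issued, at the price of storing one small statistics record per block.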

Original language: English
Title of host publication: 2013 IEEE International Conference on Cluster Computing, CLUSTER 2013
DOI: https://doi.org/10.1109/CLUSTER.2013.6702623
State: Published - 2013
Event: 15th IEEE International Conference on Cluster Computing, CLUSTER 2013 - Indianapolis, IN, United States
Duration: Sep 23 2013 → Sep 27 2013

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
ISSN (Print): 1552-5244

Conference

Conference: 15th IEEE International Conference on Cluster Computing, CLUSTER 2013
Country: United States
City: Indianapolis, IN
Period: 09/23/13 → 09/27/13

Keywords

  • FASM
  • big data
  • data-intensive computing
  • high performance computing
  • statistical techniques
  • storage systems

  • Cite this

    Liu, J., & Chen, Y. (2013). Fast data analysis with integrated statistical metadata in scientific datasets. In 2013 IEEE International Conference on Cluster Computing, CLUSTER 2013 [6702623] (Proceedings - IEEE International Conference on Cluster Computing, ICCC). https://doi.org/10.1109/CLUSTER.2013.6702623