Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets

Jialin Liu, Yong Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

8 Scopus citations

Abstract

Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, have been used widely in many data-intensive applications. These libraries have their own file formats and I/O functions to provide efficient access to large datasets. As data sizes keep increasing, these high-level I/O libraries face new challenges. Recent studies have started to utilize database techniques, such as indexing and subsetting, as well as data reorganization, to manage the growing datasets. In this work, we present a new approach to boost data analysis performance, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and integrating a small amount of statistics into the original datasets. The added statistical information illustrates the data shape and provides knowledge of the data distribution; therefore, the original I/O libraries can utilize this statistical metadata to perform fast queries and analyses. The proposed FASM approach is currently evaluated with PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. FASM can potentially lead to a new dataset design and can have an impact on big data analysis.
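The general idea described in the abstract, subsetting the data and storing a few statistics per subset so that queries can skip irrelevant parts, can be illustrated with a minimal sketch. This is not the FASM implementation and does not use PnetCDF: the names BlockStats, build_stats, range_query, and fast_mean are hypothetical, the choice of min/max/sum as the stored statistics is an assumption, and an in-memory list stands in for a dataset on disk.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockStats:
    """Hypothetical per-subset metadata; the paper's actual layout is not given here."""
    start: int       # index of the first element in the block
    stop: int        # one past the last element in the block
    minimum: float
    maximum: float
    total: float     # sum of the block's values


def build_stats(data: Sequence[float], block_size: int) -> List[BlockStats]:
    """Split the data into fixed-size blocks and record min/max/sum for each one."""
    stats = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stats.append(
            BlockStats(start, start + len(block), min(block), max(block), sum(block))
        )
    return stats


def range_query(
    data: Sequence[float], stats: List[BlockStats], low: float, high: float
) -> Tuple[List[float], int]:
    """Return the values in [low, high] and the number of blocks that had to be read."""
    hits: List[float] = []
    blocks_read = 0
    for s in stats:
        if s.maximum < low or s.minimum > high:
            continue  # the metadata shows no value in this block can match
        blocks_read += 1
        hits.extend(v for v in data[s.start:s.stop] if low <= v <= high)
    return hits, blocks_read


def fast_mean(stats: List[BlockStats]) -> float:
    """Compute the global mean from the metadata alone, without reading raw values."""
    count = sum(s.stop - s.start for s in stats)
    return sum(s.total for s in stats) / count


# Toy dataset: three blocks of four values each.
data = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0]
stats = build_stats(data, block_size=4)
hits, blocks_read = range_query(data, stats, 10.5, 12.5)
print(hits, blocks_read)
print(fast_mean(stats))
```

In this toy run only one of the three blocks needs to be scanned for the range query, and the mean comes entirely from the stored sums. In a real I/O library the skipped blocks would correspond to reads that are never issued, which is where any saving would come from.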

Original language: English
Title of host publication: Proceedings - 2012 SC Companion
Subtitle of host publication: High Performance Computing, Networking Storage and Analysis, SCC 2012
Pages: 1292-1295
Number of pages: 4
State: Published - 2012
Event: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012 - Salt Lake City, UT, United States
Duration: Nov 10 2012 to Nov 16 2012

Publication series

Name: Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012

Conference

Conference: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012
Country/Territory: United States
City: Salt Lake City, UT
Period: 11/10/12 to 11/16/12

Keywords

  • FASM
  • big data
  • data intensive computing
  • high performance computing
  • statistical techniques
  • storage systems
