Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets

Jialin Liu; Yong Chen

doi:10.1109/SC.Companion.2012.156

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, have been used widely in many data-intensive applications. These libraries have their own file formats and I/O functions to provide efficient access to large datasets. As data sizes keep increasing, these high-level I/O libraries face new challenges. Recent studies have started to use database techniques, such as indexing, subsetting, and data reorganization, to manage the growing datasets. In this work, we present a new approach to boost data analysis performance, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and integrating a small amount of statistics into the original datasets. The added statistical information describes the data shape and provides knowledge of the data distribution; the original I/O libraries can therefore use this statistical metadata to perform fast queries and analyses. The proposed FASM approach is currently evaluated with PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. FASM can potentially lead to a new dataset design and can have an impact on big data analysis.
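The general idea of answering queries from a small amount of stored statistics can be illustrated with a minimal sketch. This is not the FASM implementation and does not use PnetCDF; it assumes a one-dimensional in-memory list split into fixed-size blocks, and the names `BlockStats`, `build_stats`, `range_query`, and `mean_from_stats` are hypothetical. The choice of minimum, maximum, and sum as the per-block statistics is also an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockStats:
    """Hypothetical per-block summary; the choice of fields is an assumption."""
    start: int       # index of the first element in the block
    stop: int        # one past the index of the last element
    minimum: float
    maximum: float
    total: float     # sum of the values in the block


def build_stats(data: Sequence[float], block_size: int) -> List[BlockStats]:
    """Scan the data once and record summary statistics for each block."""
    stats = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stats.append(
            BlockStats(start, start + len(block), min(block), max(block), sum(block))
        )
    return stats


def range_query(
    data: Sequence[float], stats: List[BlockStats], lo: float, hi: float
) -> Tuple[List[float], int]:
    """Return the values in [lo, hi] and the number of blocks actually read."""
    hits: List[float] = []
    blocks_read = 0
    for s in stats:
        if s.maximum < lo or s.minimum > hi:
            continue  # the metadata shows no value in this block can match
        blocks_read += 1
        hits.extend(v for v in data[s.start:s.stop] if lo <= v <= hi)
    return hits, blocks_read


def mean_from_stats(stats: List[BlockStats]) -> float:
    """Compute the global mean from metadata alone (assumes non-empty data)."""
    count = sum(s.stop - s.start for s in stats)
    return sum(s.total for s in stats) / count


# Example data: four blocks of three values each.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 4.0, 5.0, 6.0, 20.0, 21.0, 22.0]
stats = build_stats(data, block_size=3)
hits, blocks_read = range_query(data, stats, 10.0, 12.0)
overall_mean = mean_from_stats(stats)
```

In this example the range query reads only one of the four blocks, and the mean is computed without touching the raw values at all. In a real dataset the statistics would be stored alongside the data in the file, and the saving would come from avoided I/O instead of avoided list scans.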

Original language	English
Title of host publication	Proceedings - 2012 SC Companion
Subtitle of host publication	High Performance Computing, Networking Storage and Analysis, SCC 2012
Pages	1292-1295
Number of pages	4
DOIs	https://doi.org/10.1109/SC.Companion.2012.156
State	Published - 2012
Event	2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012 - Salt Lake City, UT, United States
Duration	Nov 10 2012 → Nov 16 2012

Publication series

Name	Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012

Conference

Conference	2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012
Country/Territory	United States
City	Salt Lake City, UT
Period	11/10/12 → 11/16/12

Keywords

FASM
big data
data intensive computing
high performance computing
statistical techniques
storage systems

Access to Document

10.1109/SC.Companion.2012.156

Cite this

Liu, J., & Chen, Y. (2012). Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets. In Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012 (pp. 1292-1295). Article 6495938 (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012). https://doi.org/10.1109/SC.Companion.2012.156

Liu, Jialin ; Chen, Yong. / Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets. Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. 2012. pp. 1292-1295 (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012).

@inproceedings{2791a11c2e7d4bdc9bb44c5507efc259,

title = "Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets",

abstract = "Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, have been used widely in many data intensive applications. These libraries have their special file formats and I/O functions to provide efficient access to large datasets. When the data size keeps increasing, these high level I/O libraries face new challenges. Recent studies have started to utilize database techniques such as indexing and subsetting, and data reorganization to manage the increasing datasets. In this work, we present a new approach to boost the data analysis performance, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and integrating a small amount of statistics into the original datasets. The added statistical information illustrates the data shape and provides knowledge of the data distribution; therefore the original I/O libraries can utilize these statistical metadata to perform fast queries and analyses. The proposed FASM approach is currently evaluated with the PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. The FASM can potentially lead to a new dataset design and can have an impact on big data analysis.",

keywords = "FASM, big data, data intensive computing, high performance computing, statistical techniques, storage systems",

author = "Jialin Liu and Yong Chen",

year = "2012",

doi = "10.1109/SC.Companion.2012.156",

language = "English",

isbn = "9780769549569",

series = "Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012",

pages = "1292--1295",

booktitle = "Proceedings - 2012 SC Companion",

note = "2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012 ; Conference date: 10-11-2012 Through 16-11-2012",

}

Liu, J & Chen, Y 2012, Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets. in Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012., 6495938, Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012, pp. 1292-1295, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012, Salt Lake City, UT, United States, 11/10/12. https://doi.org/10.1109/SC.Companion.2012.156

Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets. / Liu, Jialin; Chen, Yong.
Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. 2012. p. 1292-1295 6495938 (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012).

TY - GEN

T1 - Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets

AU - Liu, Jialin

AU - Chen, Yong

PY - 2012

Y1 - 2012

N2 - Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, have been used widely in many data intensive applications. These libraries have their special file formats and I/O functions to provide efficient access to large datasets. When the data size keeps increasing, these high level I/O libraries face new challenges. Recent studies have started to utilize database techniques such as indexing and subsetting, and data reorganization to manage the increasing datasets. In this work, we present a new approach to boost the data analysis performance, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and integrating a small amount of statistics into the original datasets. The added statistical information illustrates the data shape and provides knowledge of the data distribution; therefore the original I/O libraries can utilize these statistical metadata to perform fast queries and analyses. The proposed FASM approach is currently evaluated with the PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. The FASM can potentially lead to a new dataset design and can have an impact on big data analysis.

AB - Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, have been used widely in many data intensive applications. These libraries have their special file formats and I/O functions to provide efficient access to large datasets. When the data size keeps increasing, these high level I/O libraries face new challenges. Recent studies have started to utilize database techniques such as indexing and subsetting, and data reorganization to manage the increasing datasets. In this work, we present a new approach to boost the data analysis performance, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and integrating a small amount of statistics into the original datasets. The added statistical information illustrates the data shape and provides knowledge of the data distribution; therefore the original I/O libraries can utilize these statistical metadata to perform fast queries and analyses. The proposed FASM approach is currently evaluated with the PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. The FASM can potentially lead to a new dataset design and can have an impact on big data analysis.

KW - FASM

KW - big data

KW - data intensive computing

KW - high performance computing

KW - statistical techniques

KW - storage systems

UR - http://www.scopus.com/inward/record.url?scp=84876590301&partnerID=8YFLogxK

U2 - 10.1109/SC.Companion.2012.156

DO - 10.1109/SC.Companion.2012.156

M3 - Conference contribution

AN - SCOPUS:84876590301

SN - 9780769549569

T3 - Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012

SP - 1292

EP - 1295

BT - Proceedings - 2012 SC Companion

T2 - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012

Y2 - 10 November 2012 through 16 November 2012

ER -

Liu J, Chen Y. Improving data analysis performance for high-performance computing with integrating statistical metadata in scientific datasets. In Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. 2012. p. 1292-1295. 6495938. (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012). doi: 10.1109/SC.Companion.2012.156
