TY - GEN
T1 - Exploring Metadata Search Essentials for Scientific Data Management
AU - Zhang, Wei
AU - Byna, Suren
AU - Niu, Chenxu
AU - Chen, Yong
N1 - Publisher Copyright: © 2019 IEEE. Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2019/12
Y1 - 2019/12
N2 - Scientific experiments and observations store massive amounts of data in various scientific file formats. Metadata, which describes the characteristics of the data, is commonly used to sift through massive datasets in order to locate data of interest to scientists. Several indexing data structures (such as hash tables, tries, self-balancing search trees, and sparse arrays) have been developed as part of efforts to provide an efficient method for locating target data. However, how to choose an efficient indexing data structure remains unclear in the context of scientific data management, due to the lack of investigation into metadata, metadata queries, and the corresponding data structures. In this paper, we perform a systematic study of the metadata search essentials in the context of scientific data management. We study a real-world astronomy observation dataset and explore the characteristics of the metadata in the dataset. We also study possible metadata queries based on the discovered metadata characteristics and evaluate different data structures for various types of metadata attributes. Our evaluation on a real-world dataset suggests that a trie is a suitable data structure when prefix/suffix queries are required; otherwise, a hash table should be used. We conclude our study with a summary of our findings. These findings provide a guideline and offer insights for developing metadata indexing methodologies for scientific applications.
KW - Data Management
KW - HDF5
KW - Metadata Indexing
KW - Metadata Search
UR - http://www.scopus.com/inward/record.url?scp=85080129319&partnerID=8YFLogxK
U2 - 10.1109/HiPC.2019.00021
DO - 10.1109/HiPC.2019.00021
M3 - Conference contribution
AN - SCOPUS:85080129319
T3 - Proceedings - 26th IEEE International Conference on High Performance Computing, HiPC 2019
SP - 83
EP - 92
BT - Proceedings - 26th IEEE International Conference on High Performance Computing, HiPC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 December 2019 through 20 December 2019
ER -