Model-driven data layout selection for improving read performance

Jialin Liu, Surendra Byna, Bin Dong, Kesheng Wu, Yong Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations

Abstract

The performance of reading scientific data from a parallel file system depends on how the data is organized on the physical storage devices. Data is often immutable after its producers, such as large-scale simulations, experiments, and observations, write it to the parallel file system. As a result, data analysis tasks often read slowly when the read pattern does not match the original organization of the data. For example, reading small noncontiguous chunks of data from a large array is many times slower than reading the same amount of data in contiguous chunks. To improve read performance during the analysis phase, we are developing the Scientific Data Services (SDS) framework for automatically reorganizing previously written data to conform to known read patterns. In this paper, we introduce a model-driven strategy for selecting the data layouts that benefit the performance of different read patterns. We have developed a parallel I/O model based on the striping parameters of the Lustre file system and the block-level striping on RAID-based disks within an Object Storage Target (OST) of Lustre. We have applied the model to reorganize large 3D array datasets on a Cray XE6 platform and achieved a 9X to 128X improvement in accessing the reorganized data compared to reading the data in its original layout.
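The contiguous-versus-noncontiguous contrast described in the abstract can be illustrated with a minimal sketch. It is not the paper's I/O model: it only assumes a row-major 3D array and round-robin file striping (stripe index modulo stripe count), and the function names `contiguous_runs` and `osts_touched`, the array shape, and the stripe parameters are all hypothetical choices made for illustration.

```python
from itertools import product


def contiguous_runs(shape, start, count, itemsize=8):
    """Return (offset, length) byte extents needed to read a sub-block
    of a row-major 3D array. Adjacent extents are merged."""
    _, ny, nx = shape
    z0, y0, x0 = start
    cz, cy, cx = count
    runs = []
    for z, y in product(range(z0, z0 + cz), range(y0, y0 + cy)):
        offset = ((z * ny + y) * nx + x0) * itemsize
        length = cx * itemsize
        if runs and runs[-1][0] + runs[-1][1] == offset:
            # This row starts exactly where the previous extent ends.
            runs[-1] = (runs[-1][0], runs[-1][1] + length)
        else:
            runs.append((offset, length))
    return runs


def osts_touched(runs, stripe_size, stripe_count):
    """Map byte extents to storage-target indices, assuming simple
    round-robin striping (an assumption of this sketch)."""
    osts = set()
    for offset, length in runs:
        first = offset // stripe_size
        last = (offset + length - 1) // stripe_size
        # Visiting stripe_count consecutive stripes already covers every target.
        for stripe in range(first, min(last, first + stripe_count - 1) + 1):
            osts.add(stripe % stripe_count)
    return osts


shape = (64, 64, 64)                                       # hypothetical array of 8-byte values
slab = contiguous_runs(shape, (0, 0, 0), (1, 8, 64))       # 8 full rows
cube = contiguous_runs(shape, (0, 0, 0), (8, 8, 8))        # 8x8x8 sub-cube
```

Both requests cover the same number of bytes, but the slab collapses into a single extent while the sub-cube needs one small extent per row, which is the kind of access pattern that a reorganized layout aims to avoid.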

Original language: English
Title of host publication: Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
Publisher: IEEE Computer Society
Pages: 1708-1716
Number of pages: 9
ISBN (Electronic): 9780769552088
DOIs
State: Published - Nov 27 2014
Event: 28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014 - Phoenix, United States
Duration: May 19 2014 to May 23 2014

Publication series

Name: Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014

Conference

Conference: 28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
Country/Territory: United States
City: Phoenix
Period: 05/19/14 to 05/23/14

Keywords

  • Big Data
  • High performance computing
  • I/O Performance Model
  • Scientific Data Management
  • Scientific Data Services (SDS)
