TY - JOUR
T1 - Segmented In-Advance Data Analytics for Fast Scientific Discovery
AU - Liu, Jialin
AU - Chen, Yong
N1 - Publisher Copyright:
© 2013 IEEE.
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020/4/1
Y1 - 2020/4/1
N2 - Scientific discovery usually involves data generation, data preprocessing, data storage and data analysis. As the data volume exceeds a few terabytes (TB) in a single simulation run, the data movement, which happens during each cycle of the scientific discovery, continues to be the bottleneck in most scientific big data applications. A lot of research works have been conducted on reducing the data movement. Among the existing efforts and based on our previous research, reusing the analysis results shows a significant potential in optimizing the data movement between analysis operations. In this work, we propose the Segmented In-Advance (SIA) data analytics approach for optimizing the data movement and we also provide a cloud-based elastic distributed in-memory database to manage the intermediate analysis results. The fundamental idea of this Segmented In-Advance approach is to analyze the history operations and to predict the future interesting analytics operations. The predicted analysis operation is in-advance performed on the finer segmented dataset and the segmented results are distributed in an in-memory key-value store for future reuse. The evaluation shows that the segmented in-advance data analytics approach achieves 1.2X-6.1X speedup. The evaluation also shows a good scalability of the in-memory distributed data store. The proposed Segmented In-Advance data analytics approach is a promising data movement reduction solution for scientific big data applications and fast scientific discovery.
AB - Scientific discovery usually involves data generation, data preprocessing, data storage and data analysis. As the data volume exceeds a few terabytes (TB) in a single simulation run, the data movement, which happens during each cycle of the scientific discovery, continues to be the bottleneck in most scientific big data applications. A lot of research works have been conducted on reducing the data movement. Among the existing efforts and based on our previous research, reusing the analysis results shows a significant potential in optimizing the data movement between analysis operations. In this work, we propose the Segmented In-Advance (SIA) data analytics approach for optimizing the data movement and we also provide a cloud-based elastic distributed in-memory database to manage the intermediate analysis results. The fundamental idea of this Segmented In-Advance approach is to analyze the history operations and to predict the future interesting analytics operations. The predicted analysis operation is in-advance performed on the finer segmented dataset and the segmented results are distributed in an in-memory key-value store for future reuse. The evaluation shows that the segmented in-advance data analytics approach achieves 1.2X-6.1X speedup. The evaluation also shows a good scalability of the in-memory distributed data store. The proposed Segmented In-Advance data analytics approach is a promising data movement reduction solution for scientific big data applications and fast scientific discovery.
KW - Big data
KW - Data intensive computing
KW - Scientific computing
KW - Segmented in-advance data analytics
UR - http://www.scopus.com/inward/record.url?scp=85086762902&partnerID=8YFLogxK
U2 - 10.1109/TCC.2016.2541142
DO - 10.1109/TCC.2016.2541142
M3 - Article
AN - SCOPUS:85086762902
VL - 8
SP - 432
EP - 442
JO - IEEE Transactions on Cloud Computing
JF - IEEE Transactions on Cloud Computing
SN - 2168-7161
IS - 2
M1 - 7431946
ER -