TY - JOUR
T1 - Segmented In-Advance Data Analytics for Fast Scientific Discovery
AU - Liu, Jialin
AU - Chen, Yong
N1 - Funding Information:
This research is sponsored in part by the National Science Foundation under grant CNS-1338078 and CCF-1409946. The authors also acknowledge the High Performance Computing Center at Texas Tech University for providing the resources that have contributed to the research results.
Publisher Copyright:
© 2013 IEEE.
PY - 2020/4/1
Y1 - 2020/4/1
N2 - Scientific discovery usually involves data generation, data preprocessing, data storage and data analysis. As the data volume exceeds a few terabytes (TB) in a single simulation run, the data movement, which happens during each cycle of the scientific discovery, continues to be the bottleneck in most scientific big data applications. A lot of research works have been conducted on reducing the data movement. Among the existing efforts and based on our previous research, reusing the analysis results shows a significant potential in optimizing the data movement between analysis operations. In this work, we propose the Segmented In-Advance (SIA) data analytics approach for optimizing the data movement and we also provide a cloud-based elastic distributed in-memory database to manage the intermediate analysis results. The fundamental idea of this Segmented In-Advance approach is to analyze the history operations and to predict the future interesting analytics operations. The predicted analysis operation is in-advance performed on the finer segmented dataset and the segmented results are distributed in an in-memory key-value store for future reuse. The evaluation shows that the segmented in-advance data analytics approach achieves 1.2X-6.1X speedup. The evaluation also shows a good scalability of the in-memory distributed data store. The proposed Segmented In-Advance data analytics approach is a promising data movement reduction solution for scientific big data applications and fast scientific discovery.
AB - Scientific discovery usually involves data generation, data preprocessing, data storage and data analysis. As the data volume exceeds a few terabytes (TB) in a single simulation run, the data movement, which happens during each cycle of the scientific discovery, continues to be the bottleneck in most scientific big data applications. A lot of research works have been conducted on reducing the data movement. Among the existing efforts and based on our previous research, reusing the analysis results shows a significant potential in optimizing the data movement between analysis operations. In this work, we propose the Segmented In-Advance (SIA) data analytics approach for optimizing the data movement and we also provide a cloud-based elastic distributed in-memory database to manage the intermediate analysis results. The fundamental idea of this Segmented In-Advance approach is to analyze the history operations and to predict the future interesting analytics operations. The predicted analysis operation is in-advance performed on the finer segmented dataset and the segmented results are distributed in an in-memory key-value store for future reuse. The evaluation shows that the segmented in-advance data analytics approach achieves 1.2X-6.1X speedup. The evaluation also shows a good scalability of the in-memory distributed data store. The proposed Segmented In-Advance data analytics approach is a promising data movement reduction solution for scientific big data applications and fast scientific discovery.
KW - Big data
KW - Data intensive computing
KW - Scientific computing
KW - Segmented in-advance data analytics
UR - http://www.scopus.com/inward/record.url?scp=85086762902&partnerID=8YFLogxK
U2 - 10.1109/TCC.2016.2541142
DO - 10.1109/TCC.2016.2541142
M3 - Article
AN - SCOPUS:85086762902
SN - 2168-7161
VL - 8
SP - 432
EP - 442
JO - IEEE Transactions on Cloud Computing
JF - IEEE Transactions on Cloud Computing
IS - 2
M1 - 7431946
ER -