TY - GEN
T1 - Adaptive normalization in streaming data
AU - Gupta, Vibhuti
AU - Hewett, Rattikorn
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/11/20
Y1 - 2019/11/20
N2 - In today's digital era, data are everywhere, from the Internet of Things to health-care and financial applications. This leads to potentially unbounded, ever-growing Big data streams that need to be utilized effectively. Data normalization is an important preprocessing technique for data analytics. It helps prevent mismodeling and reduces the complexity inherent in data, especially data integrated from multiple sources and contexts. Normalizing Big data streams is challenging because of evolving inconsistencies, time and memory constraints, and the unavailability of the whole data beforehand. This paper proposes a distributed approach to adaptive normalization for Big data streams. Using sliding windows of fixed size, it provides a simple mechanism to adapt the statistics used to normalize the changing data in each window. Implemented on Apache Storm, a distributed real-time stream-processing framework, our approach exploits distributed data processing for efficient normalization. Unlike existing adaptive approaches, ours does not normalize data for a specific use (e.g., classification). Moreover, our adaptive mechanism allows flexible control, via user-specified thresholds, of the trade-off between normalization time and precision. The paper illustrates the proposed approach alongside several other techniques through experiments on both synthesized and real-world data. On a stream of 160,000 instances, the normalized data obtained from our approach improves over the baseline by 89%, with a root-mean-square error of 0.0041 against the actual data.
AB - In today's digital era, data are everywhere, from the Internet of Things to health-care and financial applications. This leads to potentially unbounded, ever-growing Big data streams that need to be utilized effectively. Data normalization is an important preprocessing technique for data analytics. It helps prevent mismodeling and reduces the complexity inherent in data, especially data integrated from multiple sources and contexts. Normalizing Big data streams is challenging because of evolving inconsistencies, time and memory constraints, and the unavailability of the whole data beforehand. This paper proposes a distributed approach to adaptive normalization for Big data streams. Using sliding windows of fixed size, it provides a simple mechanism to adapt the statistics used to normalize the changing data in each window. Implemented on Apache Storm, a distributed real-time stream-processing framework, our approach exploits distributed data processing for efficient normalization. Unlike existing adaptive approaches, ours does not normalize data for a specific use (e.g., classification). Moreover, our adaptive mechanism allows flexible control, via user-specified thresholds, of the trade-off between normalization time and precision. The paper illustrates the proposed approach alongside several other techniques through experiments on both synthesized and real-world data. On a stream of 160,000 instances, the normalized data obtained from our approach improves over the baseline by 89%, with a root-mean-square error of 0.0041 against the actual data.
KW - Big data stream
KW - Normalization
KW - Preprocessing
UR - http://www.scopus.com/inward/record.url?scp=85079142862&partnerID=8YFLogxK
U2 - 10.1145/3372454.3372466
DO - 10.1145/3372454.3372466
M3 - Conference contribution
AN - SCOPUS:85079142862
T3 - ACM International Conference Proceeding Series
SP - 12
EP - 17
BT - ICBDR 2019 - Proceedings of the 2019 3rd International Conference on Big Data Research
PB - Association for Computing Machinery
T2 - 3rd International Conference on Big Data Research, ICBDR 2019
Y2 - 20 November 2019 through 21 November 2019
ER -