TY - GEN
T1 - Fast Automatic Determination of Cluster Numbers for High Dimensional Big Data
AU - Safari, Zohreh
AU - Mursi, Khalid T.
AU - Zhuang, Yu
N1 - Funding Information:
*Co-first authors, whose work was partially supported by National Science Foundation under Grant No. CNS-1526055.
Publisher Copyright:
© 2020 ACM.
PY - 2020/3/9
Y1 - 2020/3/9
N2 - For large volumes of data, clustering algorithms are important for categorizing and analyzing the data. Choosing the optimal number of clusters (K) is therefore essential, but it is also a difficult problem in big data analysis. More importantly, determining the best K efficiently and automatically is the main issue in clustering algorithms. Balancing the quality and the efficiency of clustering while determining K involves a trade-off, and overcoming it is our primary purpose. K-Means remains one of the popular clustering algorithms, but it has the shortcoming that K must be set in advance. We introduce a new process that requires fewer K-Means runs by selecting the most promising times to run the K-Means algorithm. To achieve this goal, we apply Bisecting K-Means and a different splitting measure, both of which contribute to determining the number of clusters efficiently and automatically while maintaining clustering quality for large sets of high-dimensional data. We carried out experimental studies on different data sets and found that our procedure offers the flexibility of choosing among different criteria for determining K and finds the optimal K under each of them. Experiments indicate higher efficiency, through reduced computation cost, compared with the Ray-Turi method or with the use of the K-Means algorithm alone.
AB - For large volumes of data, clustering algorithms are important for categorizing and analyzing the data. Choosing the optimal number of clusters (K) is therefore essential, but it is also a difficult problem in big data analysis. More importantly, determining the best K efficiently and automatically is the main issue in clustering algorithms. Balancing the quality and the efficiency of clustering while determining K involves a trade-off, and overcoming it is our primary purpose. K-Means remains one of the popular clustering algorithms, but it has the shortcoming that K must be set in advance. We introduce a new process that requires fewer K-Means runs by selecting the most promising times to run the K-Means algorithm. To achieve this goal, we apply Bisecting K-Means and a different splitting measure, both of which contribute to determining the number of clusters efficiently and automatically while maintaining clustering quality for large sets of high-dimensional data. We carried out experimental studies on different data sets and found that our procedure offers the flexibility of choosing among different criteria for determining K and finds the optimal K under each of them. Experiments indicate higher efficiency, through reduced computation cost, compared with the Ray-Turi method or with the use of the K-Means algorithm alone.
KW - Big Data
KW - Bisecting K-Means
KW - Cluster Validity
KW - Clustering
KW - K-Means
UR - http://www.scopus.com/inward/record.url?scp=85098272375&partnerID=8YFLogxK
U2 - 10.1145/3388142.3388164
DO - 10.1145/3388142.3388164
M3 - Conference contribution
AN - SCOPUS:85098272375
T3 - ACM International Conference Proceeding Series
SP - 50
EP - 57
BT - ICCDA 2020 - Proceedings of the 4th International Conference on Compute and Data Analysis
PB - Association for Computing Machinery
T2 - 4th International Conference on Compute and Data Analysis, ICCDA 2020
Y2 - 9 March 2020 through 12 March 2020
ER -