Fast Automatic Determination of Cluster Numbers for High Dimensional Big Data

Zohreh Safari, Khalid T. Mursi, Yu Zhuang

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

4 Scopus citations

Abstract

For a large volume of data, the clustering algorithm is of significant importance to categorize and analyze data. Accordingly, choosing the optimal number of clusters (K) is an essential factor, but it also is a tricky problem in big data analysis. More importantly, it is to efficiently determine the best K automatically, which is the main issue in clustering algorithms. Indeed, considering both the quality and efficiency of the clustering algorithm during defining K can be a trade-off that is our primary purpose to overcome. K-Means is still one of the popular clustering algorithms, which has a shortcoming that K needs to be pre-set. We introduce a new process with fewer K-Means running, which selects the most promising time to run the K-Means algorithm. To achieve this goal, we applied Bisecting K-Means and a different splitting measure, which all are contributed to efficiently determine the number of clusters automatically while maintaining the quality of clustering for a large set of high dimensional data. We carried out our experimental studies on different data sets and found that our procedure has the flexibility of choosing different criteria for determining the optimal K under each of them. Experiments indicate higher efficiency through decreasing of computation cost compared with the Ray Turi method or with the use of only the K-Means algorithm.

Original languageEnglish
Title of host publicationICCDA 2020 - Proceedings of the 4th International Conference on Compute and Data Analysis
PublisherAssociation for Computing Machinery
Pages50-57
Number of pages8
ISBN (Electronic)9781450376440
DOIs
StatePublished - Mar 9 2020
Event4th International Conference on Compute and Data Analysis, ICCDA 2020 - Silicon Valley, San Jose, United States
Duration: Mar 9 2020Mar 12 2020

Publication series

NameACM International Conference Proceeding Series

Conference

Conference4th International Conference on Compute and Data Analysis, ICCDA 2020
Country/TerritoryUnited States
CitySilicon Valley, San Jose
Period03/9/2003/12/20

Keywords

  • Big Data
  • Bisecting K-Means
  • Cluster Validity
  • Clustering
  • K-Means

Fingerprint

Dive into the research topics of 'Fast Automatic Determination of Cluster Numbers for High Dimensional Big Data'. Together they form a unique fingerprint.

Cite this