A Memory Capacity-Aware Algorithm for Fast Clustering of Disk-Resident Big Datasets

Ahmad O. Aseeri, Yu Zhuang, Mohammed Saeed Alkatheiri

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

2 Scopus citations

Abstract

Clustering is one of the most commonly used data mining techniques. The K-means and its variants are popular clustering methods. The simplistic Lloyd K-means algorithm, with randomly chosen initial cluster centers, suffers from poor clustering quality and high iteration numbers, especially unsuitable for clustering large datasets. Successful methods that choose a good set of initial cluster centers include the algorithm of Bradley and Fayyad using sampled data subsets, and the bisecting K-means algorithm of Steinbach, Karypis, and Kumar. Recently, it was discovered that iterations in the two-means algorithm used in bisecting K-means to bisect a subset can be limited to small numbers while still maintaining the final clustering quality for the bisecting K-means algorithm. In this paper, for datasets larger than memory capacity, we develop an iteration limiting strategy for bisecting K-means which adaptively determines the number of iterations for each call of the two-means bisecting subroutine based on the memory capacity of a computer and the size of the data subset to be partitioned. The strategy has been incorporated into the bisecting K-means algorithm, applied to the large challenge-response datasets of Physical Unclonable Functions that the authors are investigating, with comparison with the sampled-subsets algorithm of Bradley and Fayyad. Testing results show high computing efficiency for the bisecting K-means algorithm incorporated with the iteration limiting strategy while exhibiting almost identical clustering quality.

Original languageEnglish
Title of host publicationProceedings - 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages164-201
Number of pages38
ISBN (Electronic)9781538619551
DOIs
StatePublished - Mar 29 2018
Event15th IEEE International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017 - Orlando, United States
Duration: Nov 6 2017Nov 11 2017

Publication series

NameProceedings - 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017
Volume2018-January

Conference

Conference15th IEEE International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017
CountryUnited States
CityOrlando
Period11/6/1711/11/17

Keywords

  • Big data
  • Clustering
  • bisecting K-means
  • limited K-mean iterations

Fingerprint Dive into the research topics of 'A Memory Capacity-Aware Algorithm for Fast Clustering of Disk-Resident Big Datasets'. Together they form a unique fingerprint.

Cite this