A stratified sampling based clustering algorithm for large-scale data
Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review
Author(s)
Zhao, Xingwang; Liang, Jiye; Dang, Chuangyin
Detail(s)
Original language | English |
---|---|
Pages (from-to) | 416-428 |
Journal / Publication | Knowledge-Based Systems |
Volume | 163 |
Online published | 10 Sept 2018 |
Publication status | Published - Jan 2019 |
Abstract
Large-scale data analysis is a challenging and relevant task for present-day research and industry. As a promising data analysis tool, clustering is becoming more important in the era of big data. In large-scale data clustering, sampling is an efficient and widely used approximation technique. Recently, several sampling-based clustering algorithms have attracted considerable attention in large-scale data analysis owing to their efficiency. However, some of these existing algorithms have low clustering accuracy, whereas others have high computational complexity. To overcome these deficiencies, a stratified sampling based clustering algorithm for large-scale data is proposed in this paper. Its basic steps are: (1) forming strata with a locality-sensitive hashing technique and drawing a number of representative samples from the different strata with a stratified sampling scheme; (2) partitioning the chosen samples into different clusters using the fuzzy c-means clustering algorithm; and (3) assigning the out-of-sample objects to their closest clusters via a data labeling technique. The performance of the proposed algorithm is compared with state-of-the-art sampling-based fuzzy c-means clustering algorithms on several large-scale data sets, both synthetic and real. The experimental results show that the proposed algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency on large-scale data sets.
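The three steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the hashing scheme, the sample allocation, or the labeling rule, so the sketch assumes random-hyperplane (sign) hashing for the strata, proportional allocation within each stratum, and nearest-center assignment for the out-of-sample objects. All function names and parameters (`lsh_strata`, `stratified_sample`, `fuzzy_c_means`, `label_nearest`, `cluster_large`, `n_bits`, `frac`) are hypothetical.

```python
import numpy as np

def lsh_strata(X, n_bits=4, seed=0):
    """Step 1a (assumed scheme): bucket points by the signs of random projections."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    bits = ((X - X.mean(axis=0)) @ planes > 0).astype(int)
    return bits @ (1 << np.arange(n_bits))  # one integer stratum id per point

def stratified_sample(strata, frac=0.1, seed=0):
    """Step 1b: draw about `frac` of each stratum, at least one point per stratum."""
    rng = np.random.default_rng(seed)
    chosen = []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        k = max(1, int(round(frac * idx.size)))
        chosen.append(rng.choice(idx, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Step 2: standard fuzzy c-means updates; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        # Distances from every point to every center (small offset avoids 0-division).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)      # memberships sum to 1 per point
        w = u ** m
        new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers

def label_nearest(X, centers):
    """Step 3 (assumed rule): assign each object to its closest center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def cluster_large(X, c, frac=0.1, n_bits=4, seed=0):
    strata = lsh_strata(X, n_bits, seed)
    sample_idx = stratified_sample(strata, frac, seed)
    centers = fuzzy_c_means(X[sample_idx], c, seed=seed)
    return label_nearest(X, centers), centers, sample_idx
```

Only the sampled subset goes through the iterative fuzzy c-means loop; the full data set is touched once for hashing and once for labeling, which is where a sampling-based scheme of this kind would be expected to save time.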
Research Area(s)
- Data labeling, Fuzzy c-means algorithm, Large-scale data, Stratified sampling
Citation Format(s)
A stratified sampling based clustering algorithm for large-scale data. / Zhao, Xingwang; Liang, Jiye; Dang, Chuangyin.
In: Knowledge-Based Systems, Vol. 163, 01.2019, p. 416-428.