TY - GEN
T1 - A linear DBSCAN algorithm based on LSH
AU - Wu, Yi-Pu
AU - Guo, Jin-Jiang
AU - Zhang, Xue-Jie
N1 - Publication details (e.g. title, author(s), publication statuses and dates) are captured on an “AS IS” and “AS AVAILABLE” basis at the time of record harvesting from the data source. Suggestions for further amendments or supplementary information can be sent to [email protected].
PY - 2007
Y1 - 2007
N2 - DBSCAN algorithm is used widely because it can effectively handle noise points and deal with data of any type in clustering. However, it has two inherent limitations: high time complexity O(NlogN) and poor ability in dealing large-scale data. In this paper, a linear DBSCAN based on LSH is proposed. In our algorithm the process of Nearest Neighbor Search is optimized by hashing. Compared with the original DBSCAN algorithm, the time complexity of this improved DBSCAN descends to O(N). Experimentally, this improved DBSCAN makes a significant decrease in the running time while maintaining the Cluster quality of the results. Moreover, the speedup (the running time of original DBSCAN algorithm divided by the running time of improved algorithm) increases with the size and dimension of dataset, and the parameter Eps of our algorithm does not have a strong influence on the clustering result. These improved properties enable DBSCAN to be used in a large scope. © 2007 IEEE.
AB - DBSCAN algorithm is used widely because it can effectively handle noise points and deal with data of any type in clustering. However, it has two inherent limitations: high time complexity O(NlogN) and poor ability in dealing large-scale data. In this paper, a linear DBSCAN based on LSH is proposed. In our algorithm the process of Nearest Neighbor Search is optimized by hashing. Compared with the original DBSCAN algorithm, the time complexity of this improved DBSCAN descends to O(N). Experimentally, this improved DBSCAN makes a significant decrease in the running time while maintaining the Cluster quality of the results. Moreover, the speedup (the running time of original DBSCAN algorithm divided by the running time of improved algorithm) increases with the size and dimension of dataset, and the parameter Eps of our algorithm does not have a strong influence on the clustering result. These improved properties enable DBSCAN to be used in a large scope. © 2007 IEEE.
KW - Clustering
KW - DBSCAN
KW - Large-scale data
KW - LSH
KW - Unsupervised learning
UR - https://www.scopus.com/pages/publications/38049038439
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-38049038439&origin=recordpage
U2 - 10.1109/ICMLC.2007.4370588
DO - 10.1109/ICMLC.2007.4370588
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 142440973
SN - 9781424409730
VL - 5
T3 - Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, ICMLC 2007
SP - 2608
EP - 2614
BT - Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, ICMLC 2007
T2 - 6th International Conference on Machine Learning and Cybernetics, ICMLC 2007
Y2 - 19 August 2007 through 22 August 2007
ER -