Large Scale Multidimensional Scaling and Clustering - RMGS

Project: Research

Description

With improved network accessibility, advances in data warehousing techniques, and falling data storage costs, massive high-dimensional data sets are produced every day. Extracting valuable information from such large-scale data has become an important subject in image processing, finance, network security, bioinformatics, and an expanding list of industries. The ever-increasing data size demands efficient and effective analysis methods that facilitate large-scale data processing.
Among various analysis strategies, dimension reduction and clustering are two fundamental methods in data mining. Although large-scale data come in different modalities, most can be transformed into pairwise similarity measures or high-dimensional records, which gives dimension reduction and clustering important roles in high-dimensional data analysis. Algorithms such as multidimensional scaling (MDS), principal component analysis (PCA), and hierarchical clustering have been used extensively in high-dimensional data analysis over the past decades and have achieved significant success. However, most existing algorithms and implementations rely heavily on a metric distance matrix and have time complexities of at least O(n^2), which hinders efficient analysis of massive data. To address this issue, researchers have developed various distributed solutions, some of which have had a profound impact on both academia and industry. Despite these successes, few are designed for a standalone computer, so small research and business groups cannot benefit from them. This prompts us to design and implement efficient solutions and assemble them into a C/C++ package with a friendly API that makes it convenient to perform dimension reduction and clustering on modern computers.
First, we focus on enhancing three methods in dimension reduction and clustering: multidimensional scaling, clustering, and clique partitioning. In this proposal, we aim to design and implement a generalized package tailored to perform these methods efficiently on large-scale datasets with a modern computer. We have applied some of the proposed methods to the analysis of biological datasets in our projects, where they proved both effective and efficient compared with other methods. For example, our scalable MDS is 400 times faster than its alternatives. We expect the generalized package to be efficient and robust.

Detail(s)

Project number: 9229012
Grant type: DON_RMG
Status: Finished
Effective start/end date: 1/01/20 to 27/02/23