Abstract
Identifying similar data objects according to a scoring function is a core subroutine in many data mining and data analysis problems. To perform similarity search efficiently, a considerable amount of research effort has been dedicated to devising spatial indexing methods. However, as the number of dimensions increases, those methods often degenerate to brute-force search. Locality sensitive hashing (LSH) is a well-known technique for approximate similarity search in high-dimensional spaces. The method uses a family of hash functions that assign similar data items to the same bucket with a much higher probability than items that are far apart. Therefore, a proximity query can be processed using multiple hash table lookups. Many solutions have been proposed to improve the efficiency and accuracy of the LSH lookup process. However, the query cost can still be prohibitive in applications with a large number of query points. For example, in a data processing application, we may wish to identify all near duplicates when merging two image databases; in a genome sequence annotation application, a biologist may wish to annotate a set of sequences by searching for similar sequences in a set of labelled sequences. In such applications, we need to join two large datasets, and there can be millions of query records. Consequently, the problem of how to use LSH efficiently for similarity join queries remains to be fully explored.
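To make the bucketing idea concrete, the sketch below hashes points with random projections in the style of p-stable (Euclidean) LSH; the projection count, bucket width `w`, and class name are illustrative assumptions rather than the exact scheme used in this thesis.

```python
import numpy as np

class L2LSHTable:
    """One hash table of p-stable (Euclidean) LSH: h(x) = floor((a.x + b) / w)."""

    def __init__(self, dim, num_hashes=4, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(num_hashes, dim))     # random projection directions
        self.b = rng.uniform(0.0, w, size=num_hashes)   # random offsets
        self.w = w
        self.buckets = {}                               # bucket key -> list of item ids

    def _key(self, x):
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

    def insert(self, item_id, x):
        self.buckets.setdefault(self._key(x), []).append(item_id)

    def query(self, q):
        # Items colliding with q in this table form the candidate set.
        return self.buckets.get(self._key(q), [])
```

A proximity query typically probes several such tables, takes the union of the returned candidates, and verifies them with exact distance computations.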
Compared to similarity search over datasets in a vector space, the query cost is much higher for complex data objects and similarity measures. For example, in the set similarity search problem, the Hausdorff distance and its variants are used as the similarity measure. Calculating the Hausdorff distance between a set X and a set Y requires |X| · |Y| Euclidean distance computations. Although LSH can be a practical solution, four challenges need to be tackled: (i) a suitable way to apply the LSH scheme; (ii) a strategy for candidate generation; (iii) an efficient method for candidate pruning; and (iv) an approach for controlling the error rate and setting the LSH parameters.
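For reference, the directed and symmetric Hausdorff distances can be written as follows; the nested max-min structure is what forces a naive evaluation to inspect every pair of points, hence the |X| · |Y| distance computations mentioned above.

```latex
h(X, Y) = \max_{x \in X} \min_{y \in Y} \lVert x - y \rVert_2, \qquad
H(X, Y) = \max\bigl( h(X, Y),\, h(Y, X) \bigr)
```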
Mining and analyzing large-scale multivariate time series are crucial in many applications, such as medical, scientific, and business operations. However, searching for similar multivariate time series poses the following challenges. First, assessing the similarity between two time series is computationally expensive because it relies on an alignment-based similarity measure called dynamic time warping (DTW). Second, existing DTW-based solutions mainly focus on univariate time series and fail to scale as the number of dimensions increases. Since a multivariate time series is a collection of vectors rather than a single data point, a completely new strategy for candidate generation is required when applying LSH to accelerate similarity search over multivariate time series. Accordingly, proper LSH parameters should be determined based on the proposed candidate generation strategy.
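To illustrate why exact DTW is expensive for multivariate series, the following minimal dynamic-programming sketch (assuming the series are NumPy arrays of shape n × d and m × d) fills an O(nm) table, with a d-dimensional Euclidean distance computed in every cell.

```python
import numpy as np

def dtw_distance(X, Y):
    """Exact DTW between two multivariate series X (n x d) and Y (m x d)."""
    n, m = len(X), len(Y)
    # cost[i, j] = best alignment cost of X[:i] and Y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])    # vector-to-vector distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]
```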
Specifically, the main contributions of this thesis are outlined as follows:
1. A generic method is proposed to accelerate the process of joining two large datasets using LSH. The crux lies in identifying a representative query set Qs to reduce the number of LSH lookups and then sharing the approximate results among nearby query points. The representative selection problem is formulated as a minimum set multiple-coverage problem, and three heuristic-based solutions with approximately linear cost are proposed (a greedy sketch is given after this list). An extensive analysis of the proposed similarity join method is also performed to help determine proper LSH parameters. To demonstrate its generality, the proposed approach is applied to sequence similarity join and set similarity join problems. Experimental results confirm the findings of the theoretical analyses and show significant improvements over the state-of-the-art LSH method (QALSH).
2. An efficient and scalable LSH-based method is presented to speed up set similarity search. Specifically, LSH is applied to identify candidate sets, and the MaxMin nature of the Hausdorff distance is then exploited to avoid evaluating the distance between the query set and every set in the database (an early-abandoning sketch of this pruning idea follows the list). This pruning concept is also generalized to the modified Hausdorff and partial Hausdorff distance functions. To set proper LSH parameters, an error analysis of the proposed solution is conducted. The experimental evaluation applies the proposed method to authorship attribution query processing over a corpus obtained from an online book archive. The results show that the proposed LSH-based method significantly improves the efficiency of set similarity search with little sacrifice in accuracy.
3. A fast LSH-based similarity search scheme is explored for multivariate time series. To reduce the number of exact DTW calculations, a type of envelope is defined for the query sequence Q to limit the search space (an illustrative envelope sketch follows the list), and LSH is used to identify candidate time series that are confined within the envelope. In addition, two strategies are proposed for candidate pruning. To control the accuracy level, an extensive error analysis is performed to help determine the search range and appropriate LSH parameters. Experimental results on different types of query processing and data analysis show that the proposed method achieves a significant improvement over the best existing practice.
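Regarding contribution 1, a hypothetical greedy sketch of the minimum set multiple-coverage formulation is given below; the neighbourhood input and the coverage requirement k are assumptions for illustration, not the exact heuristics proposed in the thesis.

```python
def greedy_multi_cover(neighbors, k):
    """Greedy sketch of minimum set multiple-coverage for representative selection.

    neighbors: dict mapping each query id to the set of query ids within its
               result-sharing radius (a query is assumed to cover itself).
    k:         how many chosen representatives must cover every query.
    """
    remaining = {q: k for q in neighbors}      # outstanding coverage per query
    candidates = set(neighbors)
    reps = []
    while candidates and any(c > 0 for c in remaining.values()):
        # Pick the candidate whose neighbourhood covers the most still-uncovered demand.
        best = max(candidates, key=lambda q: sum(remaining[v] > 0 for v in neighbors[q]))
        if sum(remaining[v] > 0 for v in neighbors[best]) == 0:
            break                              # no candidate can cover anything further
        reps.append(best)
        candidates.remove(best)
        for v in neighbors[best]:
            if remaining[v] > 0:
                remaining[v] -= 1
    return reps
```

Only the selected representatives then issue LSH lookups; their approximate results are shared with the nearby queries they cover.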
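Regarding contribution 2, the MaxMin pruning idea can be illustrated with an early-abandoning directed Hausdorff computation: because the outer operator is a max over inner mins, the scan of a candidate set can stop as soon as the running value exceeds a threshold (for example, the current k-th best distance). This is a sketch of the general principle, not the exact algorithm in the thesis.

```python
import numpy as np

def directed_hausdorff_bounded(X, Y, threshold):
    """Directed Hausdorff h(X, Y) with early abandoning.

    Returns the exact value if it is <= threshold; otherwise returns a value
    greater than threshold as soon as pruning triggers.
    """
    worst = 0.0                                        # running max over points of X
    for x in X:
        best = min(np.linalg.norm(x - y) for y in Y)   # inner min over Y
        worst = max(worst, best)
        if worst > threshold:
            return worst                               # candidate cannot beat the threshold
    return worst
```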
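Regarding contribution 3, the envelope concept can be pictured as a per-dimension band around the query sequence, in the spirit of the LB_Keogh envelope used for DTW lower bounding; the warping half-width r and the band construction below are assumptions, since the thesis defines its own envelope type.

```python
import numpy as np

def query_envelope(Q, r):
    """Per-dimension upper/lower envelope of a multivariate query Q (n x d).

    Each time step's bounds are the max/min of Q over a window of half-width r,
    taken independently in every dimension (LB_Keogh-style, assumed here).
    """
    n = len(Q)
    upper = np.empty_like(Q, dtype=float)
    lower = np.empty_like(Q, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        upper[i] = Q[lo:hi].max(axis=0)
        lower[i] = Q[lo:hi].min(axis=0)
    return lower, upper

def inside_envelope(C, lower, upper):
    """True if candidate C (same shape as Q) stays within the band everywhere."""
    return bool(np.all((C >= lower) & (C <= upper)))
```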
| Date of Award | 11 Jan 2018 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Sarana NUTANONG (Supervisor) |
Keywords
- Similarity search
- Locality-sensitive hashing
- Query processing
- Complex data objects
- Multivariate time series