C2Net : A Network-Efficient Approach to Collision Counting LSH Similarity Join

Research output: Journal Publications and Reviews (RGC: 21, 22, 62)21_Publication in refereed journalpeer-review

5 Scopus Citations
View graph of relations

Related Research Unit(s)

Detail(s)

Original languageEnglish
Pages (from-to)423-436
Journal / PublicationIEEE Transactions on Knowledge and Data Engineering
Volume31
Issue number3
Online published15 May 2018
Publication statusPublished - Mar 2019

Abstract

Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation involves identifying pairs (p, q), in the Cartesian product of P and Q such that (p, q) satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost estimation frequently turns into a bottleneck in a distributed processing environment, thus resulting in a challenge of achieving faster and more efficient similarity join. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) minimum spanning tree for organizing LSH buckets replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve 20% data reduction and 50% reduction in shuffle time.

Research Area(s)

  • Aggregates, Distributed databases, Distributed processing, Distributed Processing, Join Processing, Locality Sensitive Hashing, Runtime, Schedules, Scheduling, Similarity Search, Task analysis