C2Net : A network-efficient approach to collision counting LSH similarity join(extended abstract)

Research output: Chapters, Conference Papers, Creative and Literary Works (RGC: 12, 32, 41, 45)32_Refereed conference paper (with ISBN/ISSN)peer-review

View graph of relations

Related Research Unit(s)

Detail(s)

Original languageEnglish
Title of host publicationProceedings - 2019 IEEE 35th International Conference on Data Engineering
Subtitle of host publicationICDE 2019
PublisherIEEE
Pages2121-2122
ISBN (Electronic)9781538674741
ISBN (Print)9781538674758
Publication statusPublished - Apr 2019

Publication series

NameProceedings - International Conference on Data Engineering
Volume2019-April
ISSN (Print)1063-6382
ISSN (Electronic)2375-026X

Conference

Title35th IEEE International Conference on Data Engineering (ICDE 2019)
LocationParisian Macao
PlaceMacao
Period8 - 11 April 2019

Abstract

Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation involves identifying pairs (pq), in the Cartesian product of P and Q such that (p, q) satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost frequently turns into a bottleneck in a distributed processing environment, thus resulting in a challenge of achieving faster and more efficient similarity join [2]. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) minimum spanning tree for organizing LSH buckets replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve 20% data reduction and 50% reduction in shuffle time.

Research Area(s)

  • Distributed processing, Join processing, Locality sensitive hashing, Similarity search

Citation Format(s)

C2Net : A network-efficient approach to collision counting LSH similarity join(extended abstract). / Li, Hangyu; Nutanong, Sarana; Xu, Hong; Yu, Chenyun; Ha, Foryu.

Proceedings - 2019 IEEE 35th International Conference on Data Engineering: ICDE 2019. IEEE, 2019. p. 2121-2122 8731338 (Proceedings - International Conference on Data Engineering; Vol. 2019-April).

Research output: Chapters, Conference Papers, Creative and Literary Works (RGC: 12, 32, 41, 45)32_Refereed conference paper (with ISBN/ISSN)peer-review