Lossless Set Compression by Binary Tree Coding

Research output: Chapters, Conference Papers, Creative and Literary WorksRGC 32 - Refereed conference paper (with host publication)peer-review

Abstract

A set of categorical items is a common data structure used in various applications, such as customer orders in online shopping. With the rise of big data and e-commerce, there is an increasing need to store such a large number of set data efficiently. This paper presents a novel lossless compression method that uses binary trees for set compression. The compressed code is generated by performing a depth-first search for items in the set on the binary tree. We prove that finding the optimal tree for minimizing the total code length for a given collection of sets is NP-hard, and the problem even remains NP-hard when we only need to assign items to the leaf nodes of a given tree. A heuristic algorithm is then proposed for building trees. To further improve the compression efficiency, we incorporate the entropy coding method into the coding process. Moreover, we reveal the underlying probabilistic expression of the binary tree model and present the conditional independence structure of the model. To validate the effectiveness of the proposed method, empirical studies are conducted on two real-world online e-commerce datasets. The results show that the proposed coding method significantly reduces the storage space required for set data and the tree structure for compressing sets can help identify conditional independent items. © 2025 ACM.
Original languageEnglish
Title of host publicationKDD '25
Subtitle of host publicationProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2
PublisherAssociation for Computing Machinery
Pages2801-2812
ISBN (Print)979-8-4007-1454-2
DOIs
Publication statusPublished - 3 Aug 2025
Event31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025) - Toronto, Canada
Duration: 3 Aug 20257 Aug 2025
https://kdd2025.kdd.org/
https://dl.acm.org/conference/kdd/proceedings

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Volume2
ISSN (Print)2154-817X

Conference

Conference31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025)
PlaceCanada
CityToronto
Period3/08/257/08/25
Internet address

Funding

The authors research is supported by City University of Hong Kong and the Hong Kong Research Grants Council.

Research Keywords

  • binary trees
  • categorical data
  • e-commerce
  • lossless compression
  • np-hardness
  • probabilistic model
  • set data

RGC Funding Information

  • RGC-funded

Fingerprint

Dive into the research topics of 'Lossless Set Compression by Binary Tree Coding'. Together they form a unique fingerprint.

Cite this