LEC-Codec: Learning-Based Genome Data Compression

Zhenhao Sun, Meng Wang, Shiqi Wang, Sam Kwong*

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

Abstract

In this paper, we propose a Learning-based gEnome Codec (LEC) designed for high efficiency and enhanced flexibility. LEC integrates several techniques, including Group of Bases (GoB) compression, multi-stride coding, and bidirectional prediction, all aimed at balancing coding complexity against performance in lossless compression. The codec's model is data-driven: a deep neural network infers the probability of each symbol, which enables fully parallel encoding and decoding with configurable complexity for diverse applications. Experiments across a set of configurations that trade compression ratio against inference speed show that the proposed method achieves efficient compression and offers improved flexibility in real-world applications. © 2024 IEEE.
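
The link between per-symbol probabilities and lossless compression can be illustrated with a minimal sketch. It is not the LEC model: a Laplace-smoothed frequency counter stands in for the neural network, and the names `uniform_model`, `adaptive_count_model`, and `ideal_code_length_bits` are hypothetical. It does not model GoB compression, multi-stride coding, bidirectional prediction, or parallel decoding. It only shows that an ideal entropy coder spends about -log2 p bits on a symbol predicted with probability p, so sharper predictions yield fewer bits per base.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumed four-symbol nucleotide alphabet


def uniform_model(history):
    """Baseline predictor: every base is equally likely (2 bits per base)."""
    return {b: 1.0 / len(ALPHABET) for b in ALPHABET}


def adaptive_count_model(history):
    """Stand-in for a learned predictor (NOT the paper's network):
    Laplace-smoothed frequencies of the bases seen so far."""
    counts = Counter(history)
    total = len(history) + len(ALPHABET)
    return {b: (counts[b] + 1) / total for b in ALPHABET}


def ideal_code_length_bits(sequence, model):
    """Sum of -log2 p(symbol | history): the length an ideal
    entropy coder (e.g. arithmetic coding) would approach."""
    bits = 0.0
    for i, base in enumerate(sequence):
        probs = model(sequence[:i])  # prediction uses only past symbols
        bits += -math.log2(probs[base])
    return bits


# A skewed toy sequence: mostly 'A', occasionally 'C'.
seq = "AAAAAAAAAAAAAAAC" * 4
uniform_bpb = ideal_code_length_bits(seq, uniform_model) / len(seq)
adaptive_bpb = ideal_code_length_bits(seq, adaptive_count_model) / len(seq)
print(f"uniform:  {uniform_bpb:.3f} bits/base")
print(f"adaptive: {adaptive_bpb:.3f} bits/base")
```

Because the decoder can rebuild the same probabilities from the symbols it has already decoded, the scheme stays lossless; the better the predictor matches the data, the lower the bits-per-base figure falls below the 2-bit uniform baseline.
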
Original language: English
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Publication status: Online published - 3 Oct 2024

Funding

This work is supported in part by the Key Project of Science and Technology Innovation 2030, funded by the Ministry of Science and Technology of China, under Grant 2018AAA0101301, and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).

Research Keywords

  • Data compression
  • learning-based method
  • lossless genome compression
  • non-reference method

RGC Funding Information

  • RGC-funded
