Abstract
Correcting missing or erroneous data values is an essential task in data cleaning. Traditional pre-configuration error correction (EC) methods rely heavily on predefined rules or constraints, demanding significant domain knowledge and manual effort. While configuration-free EC approaches have been explored, they still demand extensive feature engineering or labeled data for intensive model training. In this paper, we propose a zero-training and interpretable EC system, named ZeroEC, that leverages large language models (LLMs) to generate chain-of-thoughts (CoTs) and correction rules for EC, without the need for model training. ZeroEC consists of two modules: contextual-relevant tuple search (CTS) and training-free explainable correction (TEC). CTS constructs a contextual-relevant tuple retriever using a weighted cosine similarity function to efficiently identify the most relevant tuples for each dirty tuple, reducing redundancy in the LLM prompts and lowering computational costs. TEC employs a clustering-based representative tuple sampling strategy to mitigate hallucination risk by exposing the LLM to diverse types of data errors. It then prompts the LLM to generate correction CoTs for user-corrected representative tuples, and to create correction rules and explainable corrections that automatically provide an explanation for each fix, all without model training. Extensive experiments on various real-world datasets demonstrate that ZeroEC achieves a 66.82% increase in accuracy and a 6.87x speedup over state-of-the-art methods. The code and datasets of this paper are available at https://github.com/YangChen32768/ZeroEC. © 2025 IEEE.
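The abstract's CTS module retrieves the most relevant clean tuples for each dirty tuple via a weighted cosine similarity over tuple representations. The paper's exact embedding and weighting scheme is not given in the abstract, so the sketch below is a minimal illustration under assumed inputs: each tuple is already encoded as a numeric vector, and `w` holds hypothetical per-dimension weights.

```python
import numpy as np

def weighted_cosine(a: np.ndarray, b: np.ndarray, w: np.ndarray) -> float:
    """Weighted cosine similarity between two tuple-embedding vectors.

    `w` is a per-dimension weight vector; the actual weighting used by
    ZeroEC is not specified in the abstract, so uniform weights are a
    reasonable placeholder.
    """
    num = float(np.sum(w * a * b))
    den = float(np.sqrt(np.sum(w * a * a)) * np.sqrt(np.sum(w * b * b)))
    return num / den if den > 0 else 0.0

def top_k_context_tuples(dirty_vec: np.ndarray,
                         clean_vecs: np.ndarray,
                         w: np.ndarray,
                         k: int = 3) -> np.ndarray:
    """Indices of the k clean tuples most similar to one dirty tuple.

    These retrieved tuples would then serve as compact context in the LLM
    prompt, which is how CTS reduces prompt redundancy.
    """
    sims = np.array([weighted_cosine(dirty_vec, v, w) for v in clean_vecs])
    return np.argsort(-sims)[:k]
```

Retrieving only a few high-similarity neighbors per dirty tuple is what keeps the prompt short: the LLM sees a handful of contextually relevant rows rather than the whole table.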
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2025 IEEE 41st International Conference on Data Engineering, ICDE 2025 |
| Publisher | IEEE |
| Pages | 2949-2962 |
| ISBN (Electronic) | 9798331536039 |
| ISBN (Print) | 979-8-3315-3604-6 |
| Publication status | Published - 2025 |
| Event | 41st IEEE International Conference on Data Engineering (ICDE 2025), Hong Kong SAR, China. Duration: 19 May 2025 → 23 May 2025. https://ieee-icde.org/2025 · https://ieeexplore.ieee.org/xpl/conhome/11112833/proceeding |
Publication series
| Name | Proceedings - International Conference on Data Engineering |
|---|---|
| ISSN (Print) | 1084-4627 |
| ISSN (Electronic) | 2375-0286 |
Conference
| Conference | 41st IEEE International Conference on Data Engineering (ICDE 2025) |
|---|---|
| Place | China |
| City | Hong Kong SAR |
| Period | 19/05/25 → 23/05/25 |
| Internet address | https://ieee-icde.org/2025 |
Funding
This work is supported by the National Key R&D Program under Grant No. 2023YFC2706404, the National NSFC under Grant No. 62372404, the Leading Goose R&D Program of Zhejiang under Grant No. 2024C01109, and the Fundamental Research Funds for the Central Universities under Grant No. 226-2024-00030.
Research Keywords
- Correction Chain-of-thoughts
- Correction Rule Generation
- Error Correction
- Large Language Models
Title
A Zero-Training Error Correction System with Large Language Models