TY - GEN
T1 - A practical data repository for causal learning with big data
AU - Cheng, Lu
AU - Guo, Ruocheng
AU - Moraffah, Raha
AU - Candan, K. Selçuk
AU - Raglin, Adrienne
AU - Liu, Huan
PY - 2020
Y1 - 2020
N2 - The recent success in machine learning (ML) has led to a massive emergence of AI applications and to increased expectations for AI systems to achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, while human-level AI often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use as they are too “ideal”, i.e., small, clean, homogeneous, and low-dimensional, to describe real-world scenarios where data is often large, noisy, heterogeneous, and high-dimensional. This severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages, and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.
AB - The recent success in machine learning (ML) has led to a massive emergence of AI applications and to increased expectations for AI systems to achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, while human-level AI often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use as they are too “ideal”, i.e., small, clean, homogeneous, and low-dimensional, to describe real-world scenarios where data is often large, noisy, heterogeneous, and high-dimensional. This severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages, and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.
KW - Benchmarking
KW - Big data
KW - Causal discovery
KW - Causal learning
KW - Datasets
KW - Treatment effect estimation
UR - http://www.scopus.com/inward/record.url?scp=85087008761&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85087008761&origin=recordpage
U2 - 10.1007/978-3-030-49556-5_23
DO - 10.1007/978-3-030-49556-5_23
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 978-3-030-49555-8
T3 - Lecture Notes in Computer Science
SP - 234
EP - 248
BT - Benchmarking, Measuring, and Optimizing
PB - Springer, Cham
T2 - 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019
Y2 - 14 November 2019 through 16 November 2019
ER -