TY - GEN
T1 - Privacy-Preserving Machine Learning Algorithms for Big Data Systems
AU - Xu, Kaihe
AU - Yue, Hao
AU - Guo, Linke
AU - Guo, Yuanxiong
AU - Fang, Yuguang
PY - 2015/7/22
Y1 - 2015/7/22
N2 - Machine learning has played an increasingly important role in big data systems due to its capability of efficiently discovering valuable knowledge and hidden information. Big data systems, such as healthcare or financial systems, often involve multiple organizations that may have different privacy policies and may not explicitly share their data publicly, even though joint data processing may be a must. Thus, how to share big data among distributed data processing entities while mitigating privacy concerns becomes a challenging problem. Traditional methods rely on cryptographic tools and/or randomization to preserve privacy. Unfortunately, these methods alone may be inadequate for emerging big data systems because they are mainly designed for traditional small-scale data sets. In this paper, we propose a novel framework to achieve privacy-preserving machine learning where the training data are distributed and each shared data portion is of large volume. Specifically, we utilize the data locality property of the Apache Hadoop architecture and only a limited number of cryptographic operations at the Reduce() procedures to achieve privacy preservation. We show that the proposed scheme is secure in the semi-honest model and use extensive simulations to demonstrate its scalability and correctness.
AB - Machine learning has played an increasingly important role in big data systems due to its capability of efficiently discovering valuable knowledge and hidden information. Big data systems, such as healthcare or financial systems, often involve multiple organizations that may have different privacy policies and may not explicitly share their data publicly, even though joint data processing may be a must. Thus, how to share big data among distributed data processing entities while mitigating privacy concerns becomes a challenging problem. Traditional methods rely on cryptographic tools and/or randomization to preserve privacy. Unfortunately, these methods alone may be inadequate for emerging big data systems because they are mainly designed for traditional small-scale data sets. In this paper, we propose a novel framework to achieve privacy-preserving machine learning where the training data are distributed and each shared data portion is of large volume. Specifically, we utilize the data locality property of the Apache Hadoop architecture and only a limited number of cryptographic operations at the Reduce() procedures to achieve privacy preservation. We show that the proposed scheme is secure in the semi-honest model and use extensive simulations to demonstrate its scalability and correctness.
UR - http://www.scopus.com/inward/record.url?scp=84944328461&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-84944328461&origin=recordpage
U2 - 10.1109/ICDCS.2015.40
DO - 10.1109/ICDCS.2015.40
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9781467372145
VL - 2015-July
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 318
EP - 327
BT - Proceedings - 2015 IEEE 35th International Conference on Distributed Computing Systems, ICDCS 2015
PB - IEEE
T2 - 35th IEEE International Conference on Distributed Computing Systems, ICDCS 2015
Y2 - 29 June 2015 through 2 July 2015
ER -