A Study on Data Privacy and Data Security in the Context of Large-Scale Data Analytics


Student thesis: Doctoral Thesis




Award date: 1 Dec 2017


In the current era, a huge number of people, devices, and sensors are connected through digital networks, and the interactions among these entities produce enormous amounts of valuable information that facilitates innovation and growth in organizations. However, the rising adoption and rapid development of data sharing tools and technologies have also raised serious threats to the privacy and security of individuals. A central problem is how to publish and share data about individuals without revealing sensitive information. Such privacy problems may provoke a strong negative reaction and restrain further organizational innovation. Hence, to address the challenge of information privacy and security, researchers have explored privacy-preserving methodologies extensively in recent years. The basic idea behind all privacy methods is to modify the data so that data mining algorithms can still be applied effectively while both the privacy and the utility of the sensitive information contained in the data are preserved.

In recent years, a data anonymization approach for balancing data utility and privacy has gained popularity. This approach, called k-anonymity, keeps each data record indistinguishable from at least k-1 other records with respect to a set of identifying (quasi-identifier) attributes. The k-anonymity model protects against identity disclosure but fails to protect against attribute disclosure. To address attribute disclosure, k-anonymity has been extended to the l-diversity approach, in which the values of sensitive attributes are at least l-diverse within each k-anonymous group; l-diversity can thus be seen as a refinement of the k-anonymity model. However, the l-diversity approach focuses mainly on the diversity of sensitive attribute values and does not consider the categories of sensitive values, which remains a cause of serious privacy concern.
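To make the two definitions above concrete, the following sketch checks k-anonymity and l-diversity on a toy table; the attribute names and values are invented for illustration only:

```python
from collections import Counter, defaultdict

# Toy released table: two quasi-identifier columns (age band, masked zip)
# followed by one sensitive column (diagnosis). All values are invented.
records = [
    ("30-39", "100**", "flu"),
    ("30-39", "100**", "hiv"),
    ("30-39", "100**", "flu"),
    ("40-49", "148**", "cancer"),
    ("40-49", "148**", "flu"),
    ("40-49", "148**", "hepatitis"),
]

def is_k_anonymous(rows, k):
    """k-anonymity: every quasi-identifier combination occurs in >= k rows."""
    counts = Counter(row[:-1] for row in rows)
    return all(n >= k for n in counts.values())

def is_l_diverse(rows, l):
    """l-diversity: every equivalence class holds >= l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[:-1]].add(row[-1])
    return all(len(vals) >= l for vals in groups.values())

print(is_k_anonymous(records, 3))  # True: each quasi-identifier group has 3 rows
print(is_l_diverse(records, 2))    # True: {flu, hiv} and {cancer, flu, hepatitis}
```

Note that this table is 2-diverse, yet an attacker who knows a target sits in the first group still learns the diagnosis is either flu or HIV; this kind of residual attribute-level leakage is what motivates constraints on the categories of sensitive values.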
Most existing anonymization methods focus on homogeneous privacy, which exerts an equal level of privacy on all records in a dataset without catering to their concrete needs. Therefore, they offer insufficient protection to some individuals while applying excessive privacy control to others.

Motivated by these two problems, I set out to enhance current privacy paradigms so that they preserve both good data quality and individual-level data privacy. In my PhD thesis, I present a novel privacy framework with customized (c,l)-diversity, which considers both the frequency of values within the same category and a customized privacy approach in which individual preferences can easily be solicited from the owners of the data. I conducted a thorough theoretical analysis to support the proposed approach: through mathematical analysis, I identified the circumstances under which existing privacy approaches cannot protect privacy and established the advantages of my proposed solution. I also developed a set of algorithms based on the ideas of top-down specialization, local recoding, and personalized privacy, which are demonstrated clearly in this study. The overall research was guided by the design science research method, which specifically directed the design of the novel privacy-preserving framework underpinned by techniques such as k-anonymity, l-diversity, and personalized data privacy.

The main theoretical contribution of my research is a simple and effective privacy model called the (c,l)-diversity approach, which extends l-diversity by limiting the relative frequency of each category of sensitive values in every group so that it never exceeds a threshold c. To validate the proposed solution, I introduce several theorems and mathematical proofs. The design of the proposed privacy framework was also verified through extensive experimental evaluations, which demonstrate that the proposed privacy model is more effective and efficient than existing models. The practical implication of my PhD research is that government organizations (e.g., election commissions, hospitals, and statistics bureaus) as well as private organizations can adopt the proposed privacy model when releasing their customer data, since it allows an appropriate trade-off between data utility and privacy.
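To show what the frequency bound does in practice, here is a hedged sketch of a (c,l)-diversity check; the category mapping and the exact form of the bound are assumptions for illustration, and the formal definition in the thesis may differ:

```python
from collections import Counter

# Hypothetical mapping from sensitive values to broader categories.
CATEGORY = {
    "flu": "respiratory", "pneumonia": "respiratory",
    "hiv": "infectious", "hepatitis": "infectious",
    "cancer": "chronic",
}

def is_cl_diverse(groups, c, l):
    """Sketch of (c,l)-diversity: each group must contain at least l distinct
    sensitive values, and no category of sensitive values may account for
    more than a fraction c of the group's records."""
    for values in groups:
        if len(set(values)) < l:
            return False  # fails plain l-diversity
        cat_counts = Counter(CATEGORY[v] for v in values)
        if max(cat_counts.values()) / len(values) > c:
            return False  # one category dominates the group
    return True

# Diverse in both values and categories: passes.
print(is_cl_diverse([["flu", "hiv", "cancer"]], c=0.5, l=2))     # True
# 3-diverse in values, but 2/3 of the group is 'respiratory': fails for c = 0.5.
print(is_cl_diverse([["flu", "pneumonia", "hiv"]], c=0.5, l=2))  # False
```

The failing group satisfies plain l-diversity, which illustrates why bounding the category frequency adds protection beyond counting distinct sensitive values.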

Although the proposed privacy artifact has been developed and several interesting findings have emerged from my experimental work, some aspects of this research can be enhanced and improved in the future. For instance, the data deluge of the “big data” era raises serious privacy concerns, but my proposed approach in its current form is not yet ready to handle privacy issues in a big data environment. To address this challenge, I intend to extend the proposed method into a privacy-preserving method for big data analytics.

Research areas

  • k-anonymity, l-diversity, Personalized Data Privacy, Algorithm, Data Publishing