Towards Advanced Top-k Recommendation Services in Emerging Application Scenarios
基於新興應用場景的高級k-最優推薦服務
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors |
|
Award date | 21 Mar 2017 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(c36b6c9d-4ebc-4a99-b722-89785ec6c5f1).html |
---|---|
Other link(s) | Links |
Abstract
Top-k recommendation is widely used in different information and entertainment services which aims to provide a user with k items that are most relevant to the user’s interest. Top-k recommendation is an abstraction of different top-k queries. Besides traditional scenarios such as music and movie recommendation in websites or friend recommendation in social networks, top-k recommendation in emerging scenarios including data intensive computing, geosocial networks and public cloud have recently received a lot of attention.
Many data exploration applications like nearest neighbor search require the ability to identify the top-k results according to a scoring function. With the trend of processing large volumes of data, data exploration on big data requires data-intensive computing. In contrast with compute-intensive computing which devotes most of their execution time to computational requirements by performing multiple operations simultaneously on small volumes of data, data-intensive is used to describe applications that are I/O bound or with a need to process large volumes of data. It becomes a very timely topic to develop new data exploration applications which can search and process data-intensive computing workloads. The fundamental challenge to compute top-k for data-intensive workloads is about how to significantly reduce data analysis cycles to support real-time interaction with respect to massive amounts of data.
With the proliferation of smartphones, geosocial networks are gaining increasing popularity as a type of emerging social networks. Comparing to traditional social networks such as Facebook and Twitter, geosocial networks utilize user-provided location data to match users with a place, event, or person relevant to their interests, and enable further socialization activities based on such information. Among various location-based services mushroomed in geosocial networks, on-demand transportation services such as Uber and DiDi have been increasingly embraced by millions of mobile users. Users can submit their sources and destinations to find other users that match their trips for ridesharing, which brings the well understood benefit to assuage traffic congestion and save energy consumption. Ridesharing recommendation helps users decide whether they can rideshare and the top-k places to wait, which is more challenging than recommending where to take a taxi since the possible destinations of the coming taxis are also considered to predict whether ridesharing can be successful or not.
However, new services in geosocial networks like ridesharing also raise unique security and privacy issues that are not well investigated. The crucial digital assets of geosocial applications are the score functions based on patterns extracted from massive data with the recent advances of data mining or machine learning techniques. Considering more and more emerging geosocial applications are directly hosted at commercial public cloud that are not necessarily within the trust domain of service providers, it is urgent to propose a privacy-preserving framework to prevent geosocial applications like ridesharing recommendation from data breach.
Specifically, the main contributions of this thesis are outlined as follows:
1. A scalable framework with a novel entropy-based scheduling policy is proposed to handle data-intensive top-k computing in cross aggregate ranking workloads. An off-the-shelf solution to speed up computations is to exploit data parallelism of the problem. On the other hand, branch-and-bound approach is widely used for selective exploration in query processing. However, traditionally the exploration decisions are made sequentially by following the hierarchical organization of the solution space. The proposed method combines the selective exploration benefit of the branch-and-bound approach and the scalability and speed benefits of parallel and distributed processing by prioritizing multiple result-yielding partial aggregation tasks so that they can be processed in parallel. The performance of the proposed solution is evaluated using two different workloads and two different real datasets. Experimental results show that the entropy-based method consistently outperforms the benchmark methods by at most one order of magnitude in terms of running time.
2. A new application of ridesharing recommendation based on ridesharing potential of nearby places to provide users with suggestions about whether and where to wait for ridesharing is presented. As ridesharing depends on whether another passenger with similar source and destination can appear in time, kernel density estimation is used to capture destination distributions of taxis departing from certain places. Since taxi appearance is too dynamic for a single road, the underlying road network is grouped into road clusters. Only top-k road clusters with the probabilities greater than a threshold are recommended to users. If no cluster is returned, the user will be suggested not to wait for ridesharing and take a taxi directly. Experimental results based on real GPS data set show that the accuracy about whether ridesharing or not and the ridesharing successful ratio are respectively about 3 times and at most 40% better than the naive “wait-as-where-you-are” strategy, i.e., users wait at their sources for ridesharing.
3. A new privacy-preserving recommendation framework to securely perform ridesharing recommendation in public cloud is investigated. Cluster arrival patterns are modeled with kernel density estimation fused with departure probabilities for expected higher user satisfaction. Based on these cluster arrival patterns, searchable encryption is utilized to carefully protect all the proprietary data so as to allow authorised users to retrieve encrypted patterns with secure requests. These patterns are always encrypted and stored on the cloud server while answering for authorised on-demand encrypted requests from mobile users. Experimental results show that the efficiency of recommendation is not influenced by the privacy-preserving techniques.
4. Comprehensive surveys about top-k recommendation in each emerging scenario are explored. The unique characteristics of each scenario make the top-k recommendation more challenging for existing solutions. An overview of the field, motivations, challenges, and proposed solutions are presented.
Many data exploration applications like nearest neighbor search require the ability to identify the top-k results according to a scoring function. With the trend of processing large volumes of data, data exploration on big data requires data-intensive computing. In contrast with compute-intensive computing which devotes most of their execution time to computational requirements by performing multiple operations simultaneously on small volumes of data, data-intensive is used to describe applications that are I/O bound or with a need to process large volumes of data. It becomes a very timely topic to develop new data exploration applications which can search and process data-intensive computing workloads. The fundamental challenge to compute top-k for data-intensive workloads is about how to significantly reduce data analysis cycles to support real-time interaction with respect to massive amounts of data.
With the proliferation of smartphones, geosocial networks are gaining increasing popularity as a type of emerging social networks. Comparing to traditional social networks such as Facebook and Twitter, geosocial networks utilize user-provided location data to match users with a place, event, or person relevant to their interests, and enable further socialization activities based on such information. Among various location-based services mushroomed in geosocial networks, on-demand transportation services such as Uber and DiDi have been increasingly embraced by millions of mobile users. Users can submit their sources and destinations to find other users that match their trips for ridesharing, which brings the well understood benefit to assuage traffic congestion and save energy consumption. Ridesharing recommendation helps users decide whether they can rideshare and the top-k places to wait, which is more challenging than recommending where to take a taxi since the possible destinations of the coming taxis are also considered to predict whether ridesharing can be successful or not.
However, new services in geosocial networks like ridesharing also raise unique security and privacy issues that are not well investigated. The crucial digital assets of geosocial applications are the score functions based on patterns extracted from massive data with the recent advances of data mining or machine learning techniques. Considering more and more emerging geosocial applications are directly hosted at commercial public cloud that are not necessarily within the trust domain of service providers, it is urgent to propose a privacy-preserving framework to prevent geosocial applications like ridesharing recommendation from data breach.
Specifically, the main contributions of this thesis are outlined as follows:
1. A scalable framework with a novel entropy-based scheduling policy is proposed to handle data-intensive top-k computing in cross aggregate ranking workloads. An off-the-shelf solution to speed up computations is to exploit data parallelism of the problem. On the other hand, branch-and-bound approach is widely used for selective exploration in query processing. However, traditionally the exploration decisions are made sequentially by following the hierarchical organization of the solution space. The proposed method combines the selective exploration benefit of the branch-and-bound approach and the scalability and speed benefits of parallel and distributed processing by prioritizing multiple result-yielding partial aggregation tasks so that they can be processed in parallel. The performance of the proposed solution is evaluated using two different workloads and two different real datasets. Experimental results show that the entropy-based method consistently outperforms the benchmark methods by at most one order of magnitude in terms of running time.
2. A new application of ridesharing recommendation based on ridesharing potential of nearby places to provide users with suggestions about whether and where to wait for ridesharing is presented. As ridesharing depends on whether another passenger with similar source and destination can appear in time, kernel density estimation is used to capture destination distributions of taxis departing from certain places. Since taxi appearance is too dynamic for a single road, the underlying road network is grouped into road clusters. Only top-k road clusters with the probabilities greater than a threshold are recommended to users. If no cluster is returned, the user will be suggested not to wait for ridesharing and take a taxi directly. Experimental results based on real GPS data set show that the accuracy about whether ridesharing or not and the ridesharing successful ratio are respectively about 3 times and at most 40% better than the naive “wait-as-where-you-are” strategy, i.e., users wait at their sources for ridesharing.
3. A new privacy-preserving recommendation framework to securely perform ridesharing recommendation in public cloud is investigated. Cluster arrival patterns are modeled with kernel density estimation fused with departure probabilities for expected higher user satisfaction. Based on these cluster arrival patterns, searchable encryption is utilized to carefully protect all the proprietary data so as to allow authorised users to retrieve encrypted patterns with secure requests. These patterns are always encrypted and stored on the cloud server while answering for authorised on-demand encrypted requests from mobile users. Experimental results show that the efficiency of recommendation is not influenced by the privacy-preserving techniques.
4. Comprehensive surveys about top-k recommendation in each emerging scenario are explored. The unique characteristics of each scenario make the top-k recommendation more challenging for existing solutions. An overview of the field, motivations, challenges, and proposed solutions are presented.