Data Driven Modeling in HIV Interventions: Prediction, Optimization, and Decision


Student thesis: Doctoral Thesis

Award date: 8 Nov 2021


The human immunodeficiency virus (HIV) has been one of the leading causes of death worldwide, including in China, in recent years. HIV prevention interventions are an effective and important way to slow the spread of the global HIV epidemic. The Joint United Nations Programme on HIV/AIDS (UNAIDS) announced its "90-90-90" target in 2014. To achieve this goal, efficient implementation of HIV intervention programs among key populations is crucial. Key populations include men who have sex with men (MSM), persons who inject drugs (PWID), and commercial sex workers (CSW), among others. Intervention programs include HIV self-testing (HIVST), partner notification (PN) by people living with HIV, and pre-exposure prophylaxis (PrEP), among others. This thesis takes MSM as an example population and HIVST secondary distribution as an example intervention strategy to study data-driven modeling in HIV interventions. The modeling methodologies can be applied in a similar way to other key populations and other intervention programs.

First, we predicted the MSM social network in a data-driven way. Traditional surveys provide only local observations of the network topology around each individual in isolation. In this chapter, we developed a novel data-driven model to reconstruct the MSM social network from locally observed topological information obtained through surveys. A large social network consisting of over one thousand users and their public relationships was collected manually from BlueD, the largest MSM social media site in China. We followed the same procedure as in our real survey to sample locally observed topological information for randomly selected individuals in the BlueD social network; based on these data, we adapted an Exponential Random Graph Model (ERGM) to estimate the global structure of the BlueD social network. The parameters learned from the ERGM-based network reconstruction model were then used to predict linking probabilities and to reconstruct the MSM social networks from two real-world survey datasets collected in two cities in the Guangdong-Hong Kong-Macau Greater Bay Area. Our method performed well in reconstructing the BlueD social network, with high accuracy as measured by the Mean Absolute Percentage Error (MAPE) for some key network measures and an acceptable AUC value from a link prediction perspective. In conclusion, this data-driven modeling approach demonstrates the feasibility of using parameter learning methods to reconstruct the social networks of HIV key populations. The method has the potential to inform network-based intervention programs that consider the global social network structure.
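The two evaluation measures named above can be illustrated with a minimal Python sketch. It is not the thesis's evaluation code: the function names, the choice of network measures, and all numbers are hypothetical, and the AUC is computed by the standard pairwise-ranking definition (the probability that a true link receives a higher predicted score than a non-link).

```python
from typing import Sequence


def mape(true_values: Sequence[float], predicted_values: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. True values must be non-zero."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(true_values, predicted_values))
    return 100.0 * total / len(true_values)


def link_prediction_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (link, non-link) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one link and one non-link")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical network measures (e.g., mean degree, clustering, density)
# for the true network and for the reconstructed network.
true_measures = [4.0, 0.25, 0.02]
reconstructed_measures = [3.6, 0.30, 0.02]
mape_value = mape(true_measures, reconstructed_measures)

# Hypothetical predicted linking probabilities for five node pairs,
# with 1 marking pairs that are truly linked.
pair_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
pair_labels = [1, 0, 1, 0, 0]
auc_value = link_prediction_auc(pair_scores, pair_labels)
```

A lower MAPE means the reconstructed network's summary measures are closer to the true ones, while an AUC closer to 1 means truly linked pairs tend to receive higher predicted probabilities than unlinked pairs.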

Second, we predicted the key influencers in a social network-based HIVST secondary distribution program in a data-driven way. HIVST has been rapidly scaled up in several countries, and additional strategies are needed to further expand testing uptake. In secondary distribution, people apply for multiple kits and pass them on to their contacts within their social networks. However, identifying key influencers can be difficult. In this chapter, we developed and validated an innovative ensemble machine learning approach to identify key influencers among MSM for HIVST secondary distribution in China. More specifically, indexes were those who applied for HIVST kits for distribution, and alters were those who received these kits. We defined three types of key influencers among MSM: (1) key distributors, who are likely to distribute more kits (e.g., no fewer than two kits in the past ten months); (2) key promoters, who can help reach alters testing for the first time; and (3) key detectors, who can help find HIV-positive alters. In our identification system, four machine learning models (logistic regression, support vector machine, decision tree, and random forest) were trained to identify key influencers for secondary distribution. An ensemble learning approach was used to combine the predictions of these four models for the final classification. Our ensemble machine learning approach outperformed human identification (i.e., a cut-off method based on self-reported leadership scales) in classification accuracy and F1 score. Simulation experiments based on the results of the ensemble machine learning identification and of human identification were also run to validate our approach and to further compare the two.
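As a rough illustration of combining four classifiers, the sketch below averages their predicted probabilities (soft voting) and scores the result by accuracy and F1. The abstract does not state how the ensemble combines the models, so soft voting with a 0.5 threshold is an assumption; the probabilities, labels, and function names are likewise hypothetical and stand in for the outputs of already-trained models.

```python
from typing import Dict, List, Sequence


def soft_vote(model_probabilities: Dict[str, Sequence[float]],
              threshold: float = 0.5) -> List[int]:
    """Average each model's predicted probability per index, then threshold."""
    columns = list(model_probabilities.values())
    if not columns:
        raise ValueError("at least one model is required")
    n_people = len(columns[0])
    if any(len(col) != n_people for col in columns):
        raise ValueError("every model must score the same people")
    averaged = [sum(col[i] for col in columns) / len(columns) for i in range(n_people)]
    return [1 if p >= threshold else 0 for p in averaged]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of people classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical probabilities that each of five indexes is a key distributor,
# as they might be produced by four separately trained models.
probabilities = {
    "logistic_regression": [0.8, 0.3, 0.6, 0.2, 0.7],
    "support_vector_machine": [0.7, 0.4, 0.4, 0.1, 0.6],
    "decision_tree": [1.0, 0.0, 1.0, 0.0, 0.0],
    "random_forest": [0.9, 0.2, 0.7, 0.3, 0.4],
}
observed_labels = [1, 0, 1, 0, 1]  # hypothetical ground truth

ensemble_labels = soft_vote(probabilities)
ensemble_accuracy = accuracy(observed_labels, ensemble_labels)
ensemble_f1 = f1_score(observed_labels, ensemble_labels)
```

The same two metrics can be computed for labels produced by a leadership-scale cut-off, which gives a like-for-like comparison between the ensemble and human identification.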

Third, based on the key influencer predictions, we optimized resource allocation in the HIVST secondary distribution program in a data-driven way. In the secondary distribution of HIVST described above, individuals (defined as indexes) were given multiple testing kits, not only for their own use (i.e., self-testing) but also for distribution to people in their MSM social network (defined as alters). Because secondary distribution is a relatively new implementation strategy for expanding HIVST, related studies mainly concentrate on developing new intervention approaches to further increase its effectiveness, from the perspective of the traditional public health discipline. Mathematical optimization can play an important role in many aspects of HIVST secondary distribution. In this chapter, we considered a situation in which testing kits are a constrained resource, and proposed two data-driven integer linear programming models to maximize the overall economic benefit of HIVST secondary distribution, based on our real implementation data from Chinese MSM. The objective function took into account the expansion of normal alters and the detection of HIV-positive and newly tested alters. Greedy algorithms were developed to find the optimal solutions of our integer linear programming models. Results showed that our proposed data-driven approach could improve the total health economic benefit of HIVST secondary distribution.
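The abstract does not give the two integer programming models, so the following Python sketch uses a deliberately simplified, hypothetical formulation to show how a greedy rule can solve a kit allocation problem: each index i has an assumed constant expected benefit per kit and an assumed cap on the kits they can distribute, and the total number of kits is limited. Under these assumptions (a linear objective and a single budget constraint with unit-sized items), giving kits to the indexes with the highest benefit per kit first yields an optimal integer solution. All names and numbers are illustrative.

```python
from typing import Dict


def greedy_allocate(benefit_per_kit: Dict[str, float],
                    capacity: Dict[str, int],
                    total_kits: int) -> Dict[str, int]:
    """Allocate kits to indexes in descending order of expected benefit per kit.

    Simplified model (an assumption, not the thesis formulation):
        maximize   sum_i benefit_per_kit[i] * x[i]
        subject to sum_i x[i] <= total_kits, 0 <= x[i] <= capacity[i], x[i] integer
    """
    allocation = {index_id: 0 for index_id in benefit_per_kit}
    remaining = total_kits
    ranked = sorted(benefit_per_kit, key=lambda i: benefit_per_kit[i], reverse=True)
    for index_id in ranked:
        if remaining <= 0 or benefit_per_kit[index_id] <= 0:
            break  # no kits left, or no further index adds benefit
        kits = min(capacity[index_id], remaining)
        allocation[index_id] = kits
        remaining -= kits
    return allocation


def total_benefit(allocation: Dict[str, int], benefit_per_kit: Dict[str, float]) -> float:
    """Objective value of an allocation under the simplified model."""
    return sum(benefit_per_kit[i] * kits for i, kits in allocation.items())


# Hypothetical inputs: three indexes, six kits available in total.
benefits = {"A": 3.0, "B": 1.5, "C": 2.0}   # expected benefit per kit
capacities = {"A": 2, "B": 5, "C": 3}       # maximum kits each index can distribute
plan = greedy_allocate(benefits, capacities, total_kits=6)
plan_value = total_benefit(plan, benefits)
```

In practice the per-kit benefits would come from the prediction step (for example, a higher value for predicted key detectors), and a richer model with diminishing returns per index would need a kit-by-kit greedy rule or an integer programming solver instead.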

Fourth, we conducted decision analytics by comparing different combinations of key influencer identification methods (our ensemble machine learning prediction vs. conventional human identification based on a self-reported leadership scale cut-off) and resource allocation methods (our integer programming optimization vs. the conventional approach in which indexes apply for kits themselves). We conclude that data-driven modeling for prediction and optimization in HIV intervention programs (here, HIVST secondary distribution) can help healthcare stakeholders make better decisions in practice. Both the prediction component and the optimization component contributed to the improved decisions. In summary, our data-driven modeling approaches could improve the efficiency and efficacy of HIV prevention interventions.