Statistical Methods for Variable Selection with False Discovery Rate Control and Applications to Human Genetic Data

Project: Research

Description

Variable selection, also known as feature selection, is at the heart of many scientific problems. One example is identifying the genetic variants that influence disease risk among the millions of variants in the human genome. The proposed project will develop statistical methods that allow simultaneous analysis of all variables, with the guarantee that a large proportion of the selected variables truly influence the outcome. The ability to correctly distinguish important variables from a large number of unimportant ones will enhance knowledge and interpretability in many subject matter areas and allow effective interventions. For example, discovering which genetic variants are associated with a disease will aid the development of personalized genetic treatments with higher success rates for patients carrying these variants, and will help build more accurate disease prediction models. The statistical methods developed in this project will provide the fields of statistics and data science with powerful tools for extracting influential variables from high-dimensional datasets. Variable selection is particularly challenging in genetic studies because of the complex correlation structure among genetic variants, where causal variants are often correlated with adjacent non-causal variants. This makes it difficult to distinguish causal variants from non-causal ones and often leads to false discoveries, where non-causal variants are incorrectly identified as causal. The proposed methods build upon the knockoff framework, a recently developed statistical approach that provides guarantees on the false discovery rate in the presence of complex correlations and facilitates interpretability and reproducibility of results.
The proposed methods will be applied to multiple large-scale human genetic datasets with the goal of obtaining a more comprehensive understanding of how genetic variants influence disease risk.

Detail(s)

Project number: 9048272
Grant type: ECS
Status: Active
Effective start/end date: 1/01/24 → …