Abstract
In gene expression data analysis, the problems of cancer classification and gene selection are closely related. Successfully selecting informative genes significantly improves classification performance. Various methods have been proposed to identify informative genes from a large number of candidate genes. However, gene expression data may contain important correlation structures, and some genes can be divided into groups based on their biological pathways. Many existing methods do not take the exact correlation structure within the data into consideration. Therefore, from both the knowledge discovery and biological perspectives, an ideal gene selection method should take this structural information into account. Moreover, better generalization performance can be obtained by discovering the correlation structure within the data. To discover structural information in the data and improve learning performance, we propose a structured penalized logistic regression model that simultaneously performs feature selection and model learning for gene expression data analysis. An efficient coordinate descent algorithm has been developed to optimize the model. Numerical simulation studies demonstrate that our method is able to select highly correlated features. In addition, results on real gene expression datasets show that the proposed method performs competitively with previous approaches.
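As a rough illustration of the kind of procedure the abstract describes, the sketch below fits a penalized logistic regression by coordinate descent. The abstract does not state the exact form of the structured penalty, so this sketch substitutes an elastic net penalty, which is one well-known penalty that tends to keep correlated features together; it is not the authors' model. The function names, the parameters `lam` and `alpha`, and the simulated data are all hypothetical.

```python
import numpy as np

def soft_threshold(z, gamma):
    # Soft-thresholding operator used by L1-type coordinate updates.
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def fit_enet_logistic(X, y, lam=0.1, alpha=0.5, n_outer=50, n_inner=100, tol=1e-6):
    """Elastic net logistic regression (a stand-in penalty, not the paper's).

    Outer loop: quadratic (IRLS) approximation of the logistic loss.
    Inner loop: cyclic coordinate descent on the weighted least-squares problem.
    """
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    for _ in range(n_outer):
        b0_old, beta_old = b0, beta.copy()
        eta = b0 + X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(prob * (1.0 - prob), 1e-5, None)  # IRLS weights
        z = eta + (y - prob) / w                      # working response
        r = z - eta                                   # current residual
        for _ in range(n_inner):
            # Unpenalized intercept update.
            d0 = np.sum(w * r) / np.sum(w)
            b0 += d0
            r -= d0
            max_delta = abs(d0)
            for j in range(p):
                xj = X[:, j]
                # Weighted correlation of feature j with its partial residual.
                rho = np.mean(w * xj * (r + xj * beta[j]))
                denom = np.mean(w * xj ** 2) + lam * (1.0 - alpha)
                new = soft_threshold(rho, lam * alpha) / denom
                delta = new - beta[j]
                if delta != 0.0:
                    r -= xj * delta
                    beta[j] = new
                max_delta = max(max_delta, abs(delta))
            if max_delta < tol:
                break
        if max(np.max(np.abs(beta - beta_old)), abs(b0 - b0_old)) < tol:
            break
    return b0, beta

# Simulated data (hypothetical): features 0 and 1 are highly correlated
# and informative; the remaining 18 features are pure noise.
rng = np.random.default_rng(0)
n, p = 200, 20
u = rng.standard_normal(n)
X = rng.standard_normal((n, p))
X[:, 0] = u + 0.1 * rng.standard_normal(n)
X[:, 1] = u + 0.1 * rng.standard_normal(n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-3.0 * u))).astype(float)

b0, beta = fit_enet_logistic(X, y, lam=0.1, alpha=0.5)
print("selected features:", np.flatnonzero(beta))
```

In this toy setup the ridge part of the penalty keeps both correlated informative features in the model, while the L1 part sets many noise coefficients exactly to zero. A penalty built from known pathway or correlation structure would replace the elastic net term in the coordinate update.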
Original language | English |
---|---|
Pages (from-to) | 312-321 |
Journal | IEEE/ACM Transactions on Computational Biology and Bioinformatics |
Volume | 16 |
Issue number | 1 |
Online published | 30 Oct 2017 |
DOIs | |
Publication status | Published - Feb 2019 |
Funding
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 11300715].
Research Keywords
- Analytical models
- Correlation
- Data analysis
- Data models
- Gene expression
- Logistics
- Microarray
- Penalized logistic regression model
- Structured penalized regularization