Cost-Based High-Dimensional Feature Selection and Credit Scoring Models


Student thesis: Doctoral Thesis





Award date: 2 Sep 2021


Mobile e-commerce has grown rapidly over the last decade, driven by advances in mobile network services, computing capabilities, and big data applications. It now extends far beyond mobile payment and trading to a much broader spectrum of day-to-day activities, such as mobile financing and social networking. Financial institutions have already undergone fundamental changes in credit risk management, and traditional credit risk policy is now inadequate for evaluating an individual’s credit risk profile accurately and in a timely manner. This thesis constructs a large-scale dataset capturing the in-depth mobile application usage of 450,722 anonymous mobile users on both iOS and Android, together with a 28-month loan history. From it, a unique set of mobile behavior-driven credit risk indicators is derived, comprising 4,689 variables that characterize user preferences, attitudes, geolocation, and temporal patterns. Empirical analysis demonstrates that, when the proposed cost-based feature selection methods are applied, the six newly discovered mobile behavior dimensions add value for credit scoring in terms of accuracy and cost, in the context of both positive and negative credit information.

Many previous credit scoring studies do not address variation in feature acquisition cost. Because companies usually acquire features in costly groups rather than individually, this work presents a cost-based feature selection framework that estimates an expected yield function for features and applies it to credit scoring in practice. Two modified feature selection methods adapt the expected yield function so that multiple features can be selected and interdependence among features is considered in the modeling process, keeping feature acquisition within budget while still predicting adequately. The first is a cost-based quadratic programming feature selection method combined with an active learning-based credit scoring model. It enables adaptive convolutional neural network (CNN) learning: adjusting a scalar parameter in the cost-based quadratic program progressively brings more nonredundant, high-yield features into backpropagation-based CNN training. The expected yield function, embedded in the quadratic matrix and linear vector, selects the most cost-effective relevant features under a budget constraint and simultaneously identifies most of the nonredundant features that contribute to learning through feature interactions and representations in the modeling process. On average, the proposed method selects 20–25% fewer features than baseline methods while maintaining similar or better accuracy at a lower cost.
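The general idea of a cost-aware quadratic programming feature selector can be illustrated with a minimal sketch. It is not the thesis's formulation: the abstract does not define the expected yield function, so the sketch assumes a hypothetical yield of relevance divided by acquisition cost, uses the standard quadratic programming feature selection objective (redundancy traded off against the linear term through a scalar `alpha`), solves it by projected gradient descent on the probability simplex, and then fills the budget greedily. All function names, the toy redundancy matrix, and the numbers are illustrative assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:          # condition holds for a prefix of j; keep the last
            theta = t
    return [max(x - theta, 0.0) for x in v]


def cost_based_qpfs(redundancy, relevance, cost, alpha=0.5, steps=1000):
    """Minimize 0.5*(1-alpha)*x'Qx - alpha*f'x over the simplex.

    Q is a symmetric redundancy matrix (e.g. absolute correlations).
    ASSUMPTION: f_i = relevance_i / cost_i stands in for the expected yield.
    """
    n = len(relevance)
    expected_yield = [r / c for r, c in zip(relevance, cost)]
    # Safe step size: the max absolute row sum bounds Q's largest eigenvalue.
    bound = (1 - alpha) * max(sum(abs(q) for q in row) for row in redundancy)
    step = 1.0 / max(bound, 1e-12)
    x = [1.0 / n] * n
    for _ in range(steps):
        grad = [
            (1 - alpha) * sum(redundancy[i][j] * x[j] for j in range(n))
            - alpha * expected_yield[i]
            for i in range(n)
        ]
        x = project_to_simplex([xi - step * g for xi, g in zip(x, grad)])
    return x


def select_within_budget(weights, cost, budget):
    """Take features in order of decreasing weight while the budget allows."""
    chosen, spent = [], 0.0
    for i in sorted(range(len(weights)), key=lambda k: -weights[k]):
        if weights[i] > 0 and spent + cost[i] <= budget:
            chosen.append(i)
            spent += cost[i]
    return chosen, spent


# Toy data: features 0 and 1 are nearly redundant, but feature 1 costs 4x more.
redundancy = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
relevance = [0.8, 0.8, 0.5, 0.1]
cost = [1.0, 4.0, 1.0, 2.0]
budget = 2.0

weights = cost_based_qpfs(redundancy, relevance, cost, alpha=0.5)
chosen, spent = select_within_budget(weights, cost, budget)
print(weights, chosen, spent)
```

In this toy setting, the expensive duplicate (feature 1) receives less weight than its cheap counterpart (feature 0), which is the qualitative behavior a cost-aware selector is meant to show. Raising `alpha` shifts the emphasis from avoiding redundancy toward yield, loosely mirroring the scalar parameter mentioned above.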

The second proposed feature selection and credit scoring method fuses evolutionary computing with quantum computing. I introduce a cost-based quantum-inspired evolutionary algorithm (QIEA) that converges faster and to better solutions than the conventional genetic algorithm (GA). The QIEA enhances the exploration and exploitation power of the classical GA by using a quantum bit representation of individuals in the population, a quantum rotation gate as a variation operator, and modified mutation and crossover operators. When a total budget must be met, common feature selection algorithms are often unsuitable for identifying cost-effective, well-performing feature subsets for modeling tasks; the proposed cost-based fitness function in the QIEA addresses this problem. Experimental results show that the quantum-inspired methods correctly identify cost-effective features, achieving good predictive performance with only 70–80% of the number of features selected by conventional feature selection methods and hence lower total feature acquisition costs. Additionally, the proposed quantum-inspired methods reduce computational time by 30–60% compared with GAs, depending on the feature set size.
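A minimal sketch of a quantum-inspired evolutionary search for budget-constrained feature selection may help make the mechanics concrete. It is an illustration under stated assumptions, not the thesis's algorithm: each quantum bit is stored as an angle whose squared sine is the probability of selecting a feature, the rotation gate nudges that angle toward the best subset found so far, and the fitness is a hypothetical additive `value` per feature that rejects any subset over budget. The modified mutation and crossover operators mentioned above are omitted, and every name and number is made up for the example.

```python
import math
import random


def qiea_select(value, cost, budget, pop_size=10, generations=60,
                delta=0.05 * math.pi, seed=0):
    """Search for a high-value feature subset whose total cost fits the budget."""
    rng = random.Random(seed)
    n = len(value)

    def fitness(bits):
        # ASSUMPTION: additive per-feature value stands in for model performance.
        spent = sum(c for b, c in zip(bits, cost) if b)
        if spent > budget:
            return float("-inf")      # over-budget subsets are rejected outright
        return sum(v for b, v in zip(bits, value) if b)

    # Each individual is a list of angles; pi/4 gives P(bit = 1) = 0.5.
    population = [[math.pi / 4] * n for _ in range(pop_size)]
    best_bits = [0] * n               # the empty subset is always feasible
    best_fit = fitness(best_bits)

    for _ in range(generations):
        for angles in population:
            # "Observe" the individual: collapse each quantum bit to 0 or 1.
            bits = [1 if rng.random() < math.sin(t) ** 2 else 0 for t in angles]
            fit = fitness(bits)
            if fit > best_fit:
                best_bits, best_fit = bits, fit
            # Rotation gate: move differing bits toward the best solution,
            # clamped so that no bit ever becomes fully deterministic.
            for i in range(n):
                if bits[i] != best_bits[i]:
                    direction = 1 if best_bits[i] else -1
                    rotated = angles[i] + direction * delta
                    angles[i] = min(max(rotated, 0.05), math.pi / 2 - 0.05)
    return best_bits, best_fit


# Toy data (hypothetical): per-feature value and acquisition cost.
value = [0.30, 0.25, 0.20, 0.10, 0.05]
cost = [3.0, 1.0, 2.0, 1.0, 4.0]
budget = 4.0

best_bits, best_fit = qiea_select(value, cost, budget)
total_cost = sum(c for b, c in zip(best_bits, cost) if b)
print(best_bits, best_fit, total_cost)
```

Because the search starts from the empty subset and discards over-budget candidates, the returned subset always respects the budget. In a realistic setting the additive `value` would be replaced by a validated model score, which is where most of the computational cost would lie.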

To summarize, I present a high-dimensional heterogeneous credit scoring dataset covering both online and offline lending scenarios. I verify that mobile behavioral data improve credit scoring performance, and I provide a carefully designed mobile data structure that the credit risk research community can use, and that can also serve other applications. Experimental and simulation studies show that the two proposed cost-based feature selection and credit scoring models achieve promising results on a large-scale dataset: higher predictive performance, shorter computational time, and lower feature cost under budget constraints. I also discuss how to use these two models from the perspective of data characteristics, model assumptions, and applicable business scenarios. I hope this study will stimulate further work on the properties of mobile data and on methods for handling them in credit scoring.

    Research areas

  • Credit Scoring, Quantum Computing, Cost-based Feature Selection, Quadratic Programming, Big Data