Learning Theory of Deep Neural Networks and Distribution Regression
深度神經網絡和分佈回歸的學習理論
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution  

Supervisors/Advisors 

Award date  27 Jun 2022 
Link(s)
Permanent Link  https://scholars.cityu.edu.hk/en/theses/theses(ec52dd208a2d4d3b9f3f89b6e1201590).html 
Abstract
The 21st century has witnessed the great empirical success of deep learning in various fields of science and engineering, including speech recognition, image classification, and many other learning tasks. However, this success is not yet fully understood theoretically with respect to optimization, approximation, and generalization. Two aspects are especially unclear: the success of deep learning on learning tasks other than regression and classification, and the remarkable power of neural network models with special structures on specific learning tasks. This thesis studies these two questions through theoretical analysis of two examples. The first is a generalization analysis of estimators for the distribution regression task. The second concerns the approximation power of deep convolutional neural networks (DCNNs) with convolutional structures on function classes that also have special structures.
First, we consider the regularized distribution regression scheme, which aims to regress from probability distributions to real-valued responses using a two-stage sampling process and a regularization scheme over a reproducing kernel Hilbert space (RKHS). Instead of the classical least squares loss, we study convergence rates of structural risk minimization (SRM) estimators with robust loss functions l_{σ} and Tikhonov regularization, which are less sensitive to outliers and non-Gaussian noise. With an appropriately chosen windowing function W and scaling parameter σ, l_{σ} covers a large variety of robust loss functions, some of which are non-convex. Following the techniques of previous studies on RKHSs, which use integral operators and the effective dimension, we demonstrate that when the second-stage sample size n and the scaling parameter σ are large enough, satisfactory learning rates can be achieved for a wide range of regularity of the regression function.
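As a rough illustration of how a windowing function W and a scaling parameter σ can generate a robust loss, the sketch below assumes the form l_{σ}(u) = σ² W(u² / σ²), which is common in the robust regression literature; the abstract does not state the exact definition, so both this form and the Welsch-type choice W(s) = 1 - exp(-s) are assumptions made here for illustration. The function names are hypothetical.

```python
import numpy as np

def welsch_window(s):
    """Illustrative windowing function W(s) = 1 - exp(-s) (an assumption, not taken from the thesis)."""
    # -expm1(-s) equals 1 - exp(-s) but stays accurate for very small s.
    return -np.expm1(-np.asarray(s, dtype=float))

def robust_loss(u, sigma, window=welsch_window):
    """Assumed form l_sigma(u) = sigma^2 * W(u^2 / sigma^2) for a residual u."""
    u = np.asarray(u, dtype=float)
    return sigma**2 * window(u**2 / sigma**2)

# The last residual plays the role of an outlier.
residuals = np.array([0.1, 0.5, 10.0])
small_sigma = robust_loss(residuals, sigma=1.0)   # bounded by sigma^2, so the outlier is damped
large_sigma = robust_loss(residuals, sigma=1e3)   # close to the least squares loss u^2
```

Under these assumptions the loss is bounded by σ², which limits the influence of outliers, while for large σ it approaches the least squares loss u². This is consistent with the abstract's requirement that σ be large enough, although the precise conditions are given only in the thesis.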
Next, since neural networks have shown great representation power in practice, it is natural to consider using deep neural networks for the distribution regression task, the learning theory of which has not been studied before. We begin the generalization analysis with the simple case of the classical least squares loss, with the hypothesis space consisting of fully connected neural networks (FNNs). We propose a new general FNN structure for the distribution regression task and derive explicit rates for approximating classes of functionals composed of Hölder functions. By working directly with the Wasserstein distance between probability distributions instead of kernel mean embeddings, we also prove that the hypothesis space we construct is a compact subset of the continuous functionals on Borel probability distributions. Finally, we derive a learning rate for the empirical risk minimization (ERM) estimator that is optimal up to logarithmic terms, via a novel two-stage error decomposition method, when the second-stage sample size is large enough. This work fills a gap in the theory of distribution regression with neural networks.
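To make the role of the Wasserstein distance and the second-stage sample more concrete, the following minimal sketch computes the 1-Wasserstein distance between two empirical distributions on the real line. It relies on the standard fact that, for two one-dimensional samples of equal size, this distance is the mean absolute difference of the sorted samples. The Gaussian example, the sample size, and the function name are illustrative assumptions and do not come from the thesis.

```python
import numpy as np

def empirical_w1(x, y):
    """1-Wasserstein distance between two equal-size empirical distributions on the real line."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("expected two one-dimensional samples of equal size")
    # In one dimension the optimal coupling matches order statistics.
    return float(np.mean(np.abs(x - y)))

# Second-stage sampling (illustrative): each distribution is only seen through n draws.
rng = np.random.default_rng(0)
n = 2000
sample_a = rng.normal(loc=0.0, scale=1.0, size=n)
sample_b = rng.normal(loc=2.0, scale=1.0, size=n)
distance = empirical_w1(sample_a, sample_b)  # should be near 2, the shift between the two Gaussians
```

The distance depends only on the samples as sets, not on their order, which is the kind of property a hypothesis space of functionals on distributions needs to respect.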
Finally, we seek a theoretical understanding of the practical success of DCNNs by addressing two questions: how DCNNs extract features, and whether they perform better in approximating or learning classes of functions with special structures. We start with composite functions ƒ(Q(x)), where Q is a polynomial and ƒ is a univariate Hölder function. We show that our generic DCNN structure can extract the features automatically by tuning only two hyperparameters that determine the DCNN structure. We derive explicit approximation rates for this class of functions and show that DCNNs are particularly efficient at approximating radial functions compared with shallow networks. We also conduct a generalization analysis of the ERM algorithm for regression with the least squares loss, and demonstrate that as the depth of the DCNN increases, the convergence rate first decreases to an optimal value and then increases, corresponding to the trade-off phenomenon observed in practice.
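The sketch below illustrates the two objects in this paragraph under stated assumptions. The target is a radial function ƒ(Q(x)) with the polynomial Q(x) = |x|² and the Hölder function ƒ(t) = √t, both chosen here only as examples. The network is a one-dimensional DCNN in which each layer applies a full convolution followed by a bias and ReLU, so the width grows with depth; this layer form, and the use of filter length and depth as the structural parameters, are assumptions and not necessarily the construction or the two hyperparameters used in the thesis.

```python
import numpy as np

def radial_target(x, f=np.sqrt):
    """Composite function f(Q(x)) with Q(x) = |x|^2, a polynomial of degree 2."""
    q = np.sum(np.asarray(x, dtype=float) ** 2, axis=-1)
    return f(q)

def dcnn_forward(x, filters, biases):
    """Forward pass of a 1-D DCNN: h_j = ReLU(w_j * h_{j-1} - b_j), with '*' a full convolution."""
    h = np.asarray(x, dtype=float)
    for w, b in zip(filters, biases):
        # A filter of length s + 1 maps a vector of length d to one of length d + s.
        h = np.maximum(np.convolve(w, h) - b, 0.0)
    return h

# Hypothetical structure: input dimension d, filter length s + 1, depth J.
d, s, J = 4, 2, 3
rng = np.random.default_rng(0)
filters = [rng.normal(size=s + 1) for _ in range(J)]
biases = [0.0] * J
features = dcnn_forward(rng.normal(size=d), filters, biases)  # length d + J * s
```

With this layer form, the output width d + J·s is fixed entirely by the filter length and the depth, which gives a sense of how a small number of structural parameters can determine the whole architecture.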