Mathematical Analysis of Deep Neural Networks

Student thesis: Doctoral Thesis

Abstract

Deep learning has recently attracted tremendous attention from various fields of science and technology. Applications ranging from image processing to speech recognition have achieved great success. Built on deep neural network structures, deep learning has a strong capability of extracting data features, which distinguishes it from classical machine learning methods. Despite its success in practical applications, a theoretical foundation for understanding the modeling, approximation, and generalization abilities of deep learning models is still lacking. This thesis therefore aims to analyze the abilities of deep neural networks mathematically through three examples.

We first demonstrate the approximation ability of a family of deep convolutional neural networks (CNNs) applied to functions on the unit sphere S^{d−1} of R^d. Existing results give rates for approximating functions from Sobolev spaces with high regularity, which is not the usual case in applications. Our analysis relaxes this restriction and presents rates of uniform approximation when the approximated function lies in the Sobolev space W^r(S^{d−1}) with small regularity r > 0, obtained by applying spherical harmonic expansions to construct a near-best approximation for functions f ∈ W^r(S^{d−1}). We also demonstrate the superiority of deep CNNs in approximating structured functions taking an additive ridge form. Our work theoretically verifies the modeling and approximation ability of deep CNNs followed by downsampling and one or two fully connected layers. The key idea of our spherical analysis is to use the inner product form of the reproducing kernels of the spaces of spherical harmonics and then apply convolutional factorizations of filters to realize the generated linear features.
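To make the architecture concrete, here is a minimal sketch in a PyTorch-style formulation; it only mirrors the shape described above (deep 1-D convolutional layers, a downsampling step, then fully connected layers) and is not the construction used in the thesis. All depths, filter sizes, and widths are hypothetical.

```python
# A minimal sketch of the network shape discussed above: a deep 1-D CNN acting
# on a vector x in R^d, followed by downsampling and fully connected layers.
# All depths, filter sizes, and widths are hypothetical.
import torch
import torch.nn as nn

class DeepCNN(nn.Module):
    def __init__(self, d: int, depth: int = 4, filter_size: int = 3):
        super().__init__()
        convs = []
        for _ in range(depth):
            # zero-padded 1-D convolution with a single channel; the padding lets
            # the signal length grow by (filter_size - 1) per layer, echoing the
            # expanding filter supports in convolutional-factorization constructions
            convs.append(nn.Conv1d(1, 1, kernel_size=filter_size,
                                   padding=filter_size - 1))
            convs.append(nn.ReLU())
        self.convs = nn.Sequential(*convs)
        self.pool = nn.MaxPool1d(kernel_size=2)        # downsampling step
        width = (d + depth * (filter_size - 1)) // 2   # signal length after convs + pooling
        self.fc = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                nn.Linear(width, 1))   # two fully connected layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, d); treat it as a length-d signal with one channel
        out = self.convs(x.unsqueeze(1))
        out = self.pool(out)
        return self.fc(out.squeeze(1))
```

For a point x on S^{d−1} ⊂ R^d, passing a batch of such points through DeepCNN(d) yields a scalar network output per point, the quantity whose uniform distance to f is bounded in the approximation rates.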

In the second part of the thesis, we develop a generalization analysis of a deep CNN algorithm for binary classification with data on spheres. Error decomposition is a standard technique for bounding the excess generalization error: it splits the error into two parts, the approximation error and the sample error. An essential feature of the classification problem is the lack of continuity or high smoothness of the target function associated with a convex loss function such as the hinge loss. This motivates us to consider the approximation of functions in the L^p space with 1 ≤ p ≤ ∞, measuring the generalization error with respect to the p-norm loss φ(v) = (1 − v)_+^p. We provide rates of L^p-approximation when the approximated function lies in a Sobolev space, using efficient cubature formulae on spheres and other tools from spherical analysis. Generalization bounds and learning rates for the excess misclassification error of the deep CNN classification algorithm are then obtained from Bernstein-type inequalities and a comparison theorem. In addition, we show that CNN-based classifiers can achieve an almost optimal rate by imposing a Tsybakov noise condition on the data distribution.
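For orientation, the p-norm loss named above can be displayed explicitly; the case p = 1 recovers the hinge loss. This is only a restatement of the definition already given in the text:

```latex
% p-norm loss for binary classification, 1 <= p < infinity
\phi(v) = (1 - v)_+^{\,p} = \bigl(\max\{0,\, 1 - v\}\bigr)^{p},
\qquad \phi(v) = (1 - v)_+ \ \text{(hinge loss) when } p = 1.
```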

Finally, we study the generalization ability of deep neural networks for the pairwise ranking problem, as opposed to classical pointwise learning problems such as classification and regression. We apply symmetric deep fully connected neural networks to pairwise learning for ranking with a hinge loss and carry out the generalization analysis for this algorithm. A key step in our analysis is to characterize a function that minimizes the risk; we prove that it is the composition of an anti-symmetric function and the sign function. This motivates us to design two-part deep neural networks with shared weights, which induces the anti-symmetric property of the networks. To overcome the discontinuity of the sign function when estimating the approximation error, a restriction on the distribution of data pairs, called the noise condition, is imposed. We present convergence rates for the approximation error in terms of the function smoothness and the noise condition. To bound the sample error for pairwise data, inequalities for U-statistics are also established. Error decomposition guides the whole analysis and yields an excess generalization error bound.
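As an illustration, one standard way to realize a two-part, weight-shared network that is anti-symmetric by construction is F(x, x′) = g(x, x′) − g(x′, x). The sketch below uses this device; the subnetwork g and its sizes are hypothetical, and the thesis's exact construction may differ.

```python
# Sketch of an anti-symmetric two-part network with shared weights:
# F(x, x') = g(x, x') - g(x', x), so F(x, x') = -F(x', x) holds exactly.
# The subnetwork g and its sizes are hypothetical, for illustration only.
import torch
import torch.nn as nn

class AntiSymmetricRanker(nn.Module):
    def __init__(self, d: int, width: int = 64, depth: int = 3):
        super().__init__()
        layers, in_dim = [], 2 * d
        for _ in range(depth):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        layers.append(nn.Linear(in_dim, 1))
        self.g = nn.Sequential(*layers)   # a single subnetwork: the "shared weights"

    def forward(self, x: torch.Tensor, x_prime: torch.Tensor) -> torch.Tensor:
        # the same g evaluates the pair in both orders; subtracting the two
        # evaluations makes the output anti-symmetric in (x, x_prime)
        return (self.g(torch.cat([x, x_prime], dim=-1))
                - self.g(torch.cat([x_prime, x], dim=-1)))
```

Training with the pairwise hinge loss would then minimize torch.relu(1 - y * model(x, x_prime)).mean() over pair labels y ∈ {−1, +1}.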
Date of Award: 24 Jul 2023
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisors: Han FENG (Supervisor) & Dingxuan ZHOU (Co-supervisor)
