TY - JOUR
T1 - A Limitation of Gradient Descent Learning
AU - Sum, John
AU - Leung, Chi-Sing
AU - Ho, Kevin
PY - 2020/6
Y1 - 2020/6
N2 - Over decades, gradient descent has been applied to develop
learning algorithms to train a neural network (NN). In this brief,
a limitation of applying such an algorithm to train an NN with persistent
weight noise is revealed. Let V(w) be the performance measure of an
ideal NN. V(w) is applied to develop the gradient descent learning (GDL).
With weight noise, the desired performance measure (denoted as J(w))
is E[V(w̃)|w], where w̃ is the noisy weight vector. Applying GDL to
train an NN with weight noise, the actual learning objective is clearly
not V(w) but another scalar function L(w). For decades, there has been a
misconception that L(w) = J(w), and hence, that the actual model attained
by the GDL is the desired model. However, we show that it might not be:
1) with persistent additive weight noise, the actual model attained is the
desired model, as L(w) = J(w); and 2) with persistent multiplicative
weight noise, the actual model attained is unlikely to be the desired
model, as L(w) ≠ J(w). Accordingly, the properties of the models attained,
as compared with the desired models, are analyzed and the learning curves
are sketched. Simulation results on 1) a simple regression problem and
2) the MNIST handwritten digit recognition are presented to support
our claims.
KW - Additive weight noise
KW - gradient descent algorithms
KW - MNIST
KW - multiplicative weight noise
UR - http://www.scopus.com/inward/record.url?scp=85085905108&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85085905108&origin=recordpage
U2 - 10.1109/TNNLS.2019.2927689
DO - 10.1109/TNNLS.2019.2927689
M3 - RGC 21 - Publication in refereed journal
C2 - 31398136
SN - 2162-237X
VL - 31
SP - 2227
EP - 2232
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 6
M1 - 8789696
ER -