sights from the success of the sexual reproduction model in the evolution of advanced organisms. Another motivation cited by [? ] was in terms of "balancing networks". Despite several theoretical works aimed at explaining Dropout, it remains unclear what kind of regularization Dropout provides, what kinds of networks Dropout prefers, and how that helps with generalization. In this chapter, we work towards that goal by instantiating explicit forms of the regularizers due to Dropout and showing how they provide capacity control in various machine learning problems, including linear regression (Section 9.4), matrix sensing (Section 9.1.1), matrix completion (Section 9.1.2), and deep learning (Section 9.2).
9.1 Matrix Sensing
We begin by studying dropout for matrix sensing, a problem that is arguably an important instance of matrix learning, has many applications, and is well understood from a theoretical perspective. Here is the problem setup.
Let $M^* \in \mathbb{R}^{d_2 \times d_0}$ be a matrix with rank $r^* := \mathrm{Rank}(M^*)$. Let $A^{(1)}, \ldots, A^{(n)}$ be a set of measurement matrices of the same size as $M^*$. The goal of matrix sensing is to recover the matrix $M^*$ from $n$ observations of the form $y_i = \langle M^*, A^{(i)} \rangle$ such that $n \ll d_2 d_0$. The learning algorithm we consider is empirical risk minimization, and we choose to represent the parameter matrix $M \in \mathbb{R}^{d_2 \times d_0}$ in terms of a product of its factors $U \in \mathbb{R}^{d_2 \times d_1}$, $V \in \mathbb{R}^{d_0 \times d_1}$:

$$\min_{U \in \mathbb{R}^{d_2 \times d_1},\, V \in \mathbb{R}^{d_0 \times d_1}} \widehat{L}(U, V) := \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \langle UV^\top, A^{(i)} \rangle \right)^2. \tag{9.1}$$
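To make the setup concrete, here is a minimal NumPy sketch of the objective (9.1) on a synthetic instance. The dimensions, the Gaussian measurement matrices, and the helper name `sensing_loss` are illustrative choices, not from the text.

```python
import numpy as np

def sensing_loss(U, V, A, y):
    """Empirical risk (1/n) sum_i (y_i - <U V^T, A^(i)>)^2 from (9.1)."""
    preds = np.einsum('ab,nab->n', U @ V.T, A)   # <U V^T, A^(i)> for each i
    return np.mean((y - preds) ** 2)

# Synthetic instance: rank-r* ground truth, Gaussian measurements,
# and fewer observations y_i = <M*, A^(i)> than the d2*d0 entries of M*.
rng = np.random.default_rng(0)
d2, d0, d1, r_star, n = 20, 15, 10, 3, 200
M_star = rng.standard_normal((d2, r_star)) @ rng.standard_normal((r_star, d0))
A = rng.standard_normal((n, d2, d0))
y = np.einsum('ab,nab->n', M_star, A)

U = rng.standard_normal((d2, d1))
V = rng.standard_normal((d0, d1))
print(sensing_loss(U, V, A, y))
```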
When $d_1 \gg r^*$, there exist many "bad" empirical minimizers, i.e., those with a large true risk. However, [? ] recently showed that under the restricted isometry property, despite the existence of such poor ERM solutions, gradient descent with proper initialization is implicitly biased towards finding solutions with minimum nuclear norm; this is an important result, first conjectured and empirically verified by [? ].
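As a hedged illustration of this implicit bias, the following toy experiment runs plain gradient descent on (9.1) from a small random initialization and compares the nuclear norm of the learned product $UV^\top$ with that of $M^*$; the step size, initialization scale, and iteration count are arbitrary choices for the sketch, not from the cited works.

```python
import numpy as np

# Toy illustration of the implicit bias of gradient descent on (9.1):
# from a small random initialization, GD tends to reach a solution whose
# nuclear norm is close to that of M*. Hyperparameters are illustrative.
rng = np.random.default_rng(0)
d2, d0, d1, r_star, n = 20, 15, 10, 3, 200
M_star = rng.standard_normal((d2, r_star)) @ rng.standard_normal((r_star, d0))
A = rng.standard_normal((n, d2, d0))
y = np.einsum('ab,nab->n', M_star, A)

U = 1e-3 * rng.standard_normal((d2, d1))   # small initialization
V = 1e-3 * rng.standard_normal((d0, d1))
lr = 2e-3
for _ in range(5000):
    resid = y - np.einsum('ab,nab->n', U @ V.T, A)        # y_i - <UV^T, A^(i)>
    grad_M = (-2.0 / n) * np.einsum('n,nab->ab', resid, A)
    # Chain rule through M = U V^T: dL/dU = (dL/dM) V, dL/dV = (dL/dM)^T U.
    U, V = U - lr * grad_M @ V, V - lr * grad_M.T @ U

# Nuclear norms of the recovered product and the ground truth.
print(np.linalg.norm(U @ V.T, 'nuc'), np.linalg.norm(M_star, 'nuc'))
```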
We propose solving the ERM problem (9.1) with algorithmic regularization due to dropout, where at training time the corresponding columns of $U$ and $V$ are dropped independently and identically according to a Bernoulli random variable, as sketched below. As opposed to the implicit effect of gradient descent, this dropout heuristic explicitly regularizes the empirical objective.
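The following is a minimal sketch of one such dropout step on a single measurement, assuming the standard $1/p$ rescaling so that the masked product is unbiased for $UV^\top$ (the rescaling is an assumption here; the precise dropout objective is given below). The keep-probability `p`, step size `lr`, and function name are illustrative.

```python
import numpy as np

def dropout_sgd_step(U, V, A_i, y_i, p=0.5, lr=0.01, rng=None):
    """One SGD step on (9.1) with shared column dropout on U and V.

    A single Bernoulli(p) mask z drops the j-th columns of U and V
    together; the 1/p rescaling keeps E[U_z V_z^T] = U V^T.
    """
    rng = rng or np.random.default_rng()
    z = rng.binomial(1, p, size=U.shape[1]) / p   # z_j in {0, 1/p}
    Uz, Vz = U * z, V * z                          # mask corresponding columns
    resid = y_i - np.sum((Uz @ Vz.T) * A_i)        # y_i - <U_z V_z^T, A^(i)>
    # Gradient of (y_i - <U_z V_z^T, A^(i)>)^2 w.r.t. U and V;
    # dropped columns (z_j = 0) receive zero gradient.
    grad_U = -2.0 * resid * (A_i @ Vz) * z
    grad_V = -2.0 * resid * (A_i.T @ Uz) * z
    return U - lr * grad_U, V - lr * grad_V
```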
It is then natural to ask whether, in the case of matrix sensing, dropout also biases the ERM towards certain low-norm solutions. To answer this question, we begin with the observation that dropout can be viewed as an instance of SGD on the following