
…insights from the success of the sexual reproduction model in the evolution of advanced organisms. Another motivation cited by [?] was in terms of “balancing networks”. Despite several theoretical works aimed at explaining Dropout, it remains unclear what kind of regularization Dropout provides, what kinds of networks Dropout prefers, and how that helps with generalization. In this chapter, we work towards that goal by instantiating explicit forms of the regularizers induced by Dropout and showing how they provide capacity control in various machine learning problems, including linear regression (Section 9.4), matrix sensing (Section 9.1.1), matrix completion (Section 9.1.2), and deep learning (Section 9.2).

9.1 Matrix Sensing

We begin by studying dropout for matrix sensing, arguably an important instance of a matrix learning problem: it has many applications and is well understood from a theoretical perspective. Here is the problem setup.

Let $M^* \in \mathbb{R}^{d_2 \times d_0}$ be a matrix with rank $r^* := \mathrm{Rank}(M^*)$. Let $A^{(1)}, \ldots, A^{(n)}$ be a set of measurement matrices of the same size as $M^*$. The goal of matrix sensing is to recover the matrix $M^*$ from $n$ observations of the form $y_i = \langle M^*, A^{(i)} \rangle$ such that $n \ll d_2 d_0$. The learning algorithm we consider is empirical risk minimization, and we choose to represent the parameter matrix $M \in \mathbb{R}^{d_2 \times d_0}$ as a product of its factors $U \in \mathbb{R}^{d_2 \times d_1}$, $V \in \mathbb{R}^{d_0 \times d_1}$:

$$\min_{U \in \mathbb{R}^{d_2 \times d_1},\; V \in \mathbb{R}^{d_0 \times d_1}} \widehat{L}(U, V) := \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \langle UV^\top, A^{(i)} \rangle \right)^2. \qquad (9.1)$$
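To make the setup concrete, here is a minimal NumPy sketch of the sensing model and the factored objective (9.1). The dimensions, the choice of i.i.d. Gaussian measurements, and all names (`M_star`, `erm_loss`) are our own illustrative assumptions, not prescribed by the text.

```python
import numpy as np

# Illustrative sizes (our choice): ambient dimensions d2 x d0,
# factorization width d1, true rank r*, and n measurements.
d2, d0, d1 = 20, 15, 10
r_star, n = 3, 200

rng = np.random.default_rng(0)

# Ground-truth rank-r* matrix M* and i.i.d. Gaussian measurements A^(i).
M_star = rng.standard_normal((d2, r_star)) @ rng.standard_normal((r_star, d0))
A = rng.standard_normal((n, d2, d0))

# Observations y_i = <M*, A^(i)> (Frobenius/trace inner product).
y = np.einsum('njk,jk->n', A, M_star)

def erm_loss(U, V):
    """Empirical risk (9.1) for the factored model M = U V^T."""
    preds = np.einsum('njk,jk->n', A, U @ V.T)
    return np.mean((y - preds) ** 2)
```

With $d_1 \geq r^*$ the factored model can represent $M^*$ exactly, and with $n \ll d_2 d_0$ the objective typically has many global minimizers, which is where the next observation comes in.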

When $d_1 \gg r^*$, there exist many “bad” empirical minimizers, i.e., those with a large true risk. However, [?] recently showed that under the restricted isometry property, despite the existence of such poor ERM solutions, gradient descent with proper initialization is implicitly biased towards finding solutions with minimum nuclear norm; this is an important result which was first conjectured and empirically verified by [?].

We propose solving the ERM problem (9.1) with algorithmic regularization due to dropout: at training time, corresponding columns of $U$ and $V$ are dropped jointly, independently and identically across columns, according to a Bernoulli random variable. As opposed to the implicit effect of gradient descent, this dropout heuristic explicitly regularizes the empirical objective.
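As a sketch of this heuristic, the following snippet (reusing `erm_loss` and the data above, and assuming the standard “inverted dropout” convention of rescaling kept units by $1/p$, which the text does not specify) evaluates one stochastic realization of the dropout objective; dropping column $j$ zeroes out the rank-one component $u_j v_j^\top$ of $UV^\top$ as a unit.

```python
def dropout_loss(U, V, p=0.5):
    """One stochastic evaluation of the dropout objective: each of the d1
    columns is kept independently with probability p and rescaled by 1/p,
    so M becomes U diag(B/p) V^T with B_j ~ Bernoulli(p)."""
    mask = rng.binomial(1, p, size=d1) / p   # B_j / p, applied to one factor
    return erm_loss(U * mask, V)             # (U * mask) @ V.T = U diag(B/p) V^T
```

An SGD step would simply descend the gradient of this stochastic loss; since $\mathbb{E}[B_j/p] = 1$, averaging over the mask recovers the ERM loss plus an extra term, which is the explicit regularizer this section sets out to identify.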

It is then natural to ask whether, in the case of matrix sensing, dropout also biases the ERM towards certain low-norm solutions. To answer this question, we begin with the observation that dropout can be viewed as an instance of SGD on the following
