
Proposition 9.2.2. Consider a two-layer neural network $f_w(\cdot)$ with ReLU activation functions in the hidden layer. Furthermore, assume that the marginal input distribution $P_X(x)$ is symmetric and isotropic, i.e., $P_X(x) = P_X(-x)$ and $\mathbb{E}[xx^\top] = I$. Then the expected explicit regularizer due to dropout is given as
$$
R(w) := \mathbb{E}[\widehat{R}(w)] = \frac{\lambda}{2} \sum_{i_0, i_1, i_2 = 1}^{d_0, d_1, d_2} W_2(i_2, i_1)^2 \, W_1(i_1, i_0)^2. \tag{9.10}
$$
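To make the closed form concrete, the following sketch (not from the text) evaluates the right-hand side of equation (9.10) for given weight matrices; the constant $\lambda$, whose dependence on the dropout rate comes from Proposition 9.2.1, is treated here as a given parameter, and the function name is illustrative.

```python
import numpy as np

def dropout_regularizer(W1, W2, lam):
    """Evaluate the closed form in equation (9.10).

    W1 has shape (d1, d0) with W1[i1, i0] = W_1(i_1, i_0);
    W2 has shape (d2, d1) with W2[i2, i1] = W_2(i_2, i_1);
    lam is the constant lambda from the text, treated as given.
    """
    # (W2**2 @ W1**2)[i2, i0] = sum_{i1} W_2(i_2, i_1)^2 W_1(i_1, i_0)^2,
    # so summing all entries gives the triple sum over i_0, i_1, i_2.
    return 0.5 * lam * float(np.sum((W2 ** 2) @ (W1 ** 2)))

# Example usage with arbitrary dimensions d0 = 3, d1 = 5, d2 = 2.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
print(dropout_regularizer(W1, W2, lam=0.5))
```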

Proof of Proposition 9.2.2. Using Proposition 9.2.1, we have that
$$
R(w) = \mathbb{E}[\widehat{R}(w)] = \lambda \sum_{j=1}^{d_1} \|W_2(:, j)\|^2 \, \mathbb{E}\big[\sigma(W_1(j, :)^\top x)^2\big].
$$

It remains to calculate the quantity $\mathbb{E}_x[\sigma(W_1(j, :)^\top x)^2]$. By the symmetry assumption, $P_X(x) = P_X(-x)$. As a result, for any $v \in \mathbb{R}^{d_0}$, the random variables $v^\top x$ and $-v^\top x$ have the same distribution; in particular, the random variable $z_j := W_1(j, :)^\top x$ is symmetric about the origin. It follows that $\mathbb{E}_z[\sigma(z)^2] = \frac{1}{2}\mathbb{E}_z[z^2]$:

$$
\mathbb{E}_z[\sigma(z)^2]
= \int_{-\infty}^{\infty} \sigma(z)^2 \, d\mu(z)
= \int_{0}^{\infty} \sigma(z)^2 \, d\mu(z)
= \int_{0}^{\infty} z^2 \, d\mu(z)
= \frac{1}{2} \int_{-\infty}^{\infty} z^2 \, d\mu(z)
= \frac{1}{2} \mathbb{E}_z[z^2].
$$
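As a quick numerical sanity check of this identity (a sketch, not from the text, assuming for concreteness that $z$ is standard normal, which is symmetric about the origin):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)     # symmetric about the origin

relu_sq = np.maximum(z, 0.0) ** 2      # sigma(z)^2 with sigma = ReLU
print(relu_sq.mean())                  # approx 0.5 = E[sigma(z)^2]
print(0.5 * (z ** 2).mean())           # approx 0.5 = (1/2) E[z^2]
```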

Plugging this identity back into the expression for $R(w)$, we get
$$
R(w) = \frac{\lambda}{2} \sum_{j=1}^{d_1} \|W_2(:, j)\|^2 \, \mathbb{E}\big[(W_1(j, :)^\top x)^2\big] = \frac{\lambda}{2} \sum_{j=1}^{d_1} \|W_2(:, j)\|^2 \, \|W_1(j, :)\|^2,
$$
where the second equality follows from the isotropy assumption $\mathbb{E}[xx^\top] = I$. Expanding the squared norms coordinate-wise yields exactly the right-hand side of equation (9.10).
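As an illustration only (a minimal sketch under the additional assumption $x \sim \mathcal{N}(0, I)$, which is symmetric and isotropic, and with $\lambda$ treated as a given constant), one can check by Monte Carlo that the expectation from Proposition 9.2.1 matches the closed forms above:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2, lam = 4, 6, 3, 0.5                       # arbitrary sizes, illustrative lambda
W1 = rng.standard_normal((d1, d0))                   # rows are W_1(j, :)
W2 = rng.standard_normal((d2, d1))                   # columns are W_2(:, j)

# Monte Carlo estimate of lambda * sum_j ||W_2(:,j)||^2 E[sigma(W_1(j,:)^T x)^2].
x = rng.standard_normal((1_000_000, d0))             # symmetric, isotropic samples
act_sq = np.maximum(x @ W1.T, 0.0) ** 2              # sigma(W_1(j,:)^T x)^2 per sample
mc = lam * np.sum((W2 ** 2).sum(axis=0) * act_sq.mean(axis=0))

# Closed form (lambda/2) sum_j ||W_2(:,j)||^2 ||W_1(j,:)||^2, i.e., equation (9.10).
closed = 0.5 * lam * np.sum((W2 ** 2).sum(axis=0) * (W1 ** 2).sum(axis=1))
print(mc, closed)                                     # agree up to Monte Carlo error
```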

9.3 Landscape of the Optimization Problem

While the focus in Section 9.2 was on understanding the implicit bias of dropout in terms of the global optima of the resulting regularized learning problem, here we turn to the computational aspects of dropout as an optimization procedure. Since dropout is a first-order method and the landscape of the dropout objective (e.g., Problem 9.11) is highly non-convex, the best we can hope for is convergence to a local minimum, and even that only if the problem has no degenerate saddle points [? ? ]. Therefore, in this section, we pose the following questions: What is the implicit bias of dropout in terms of local minima? Do
