Proposition 9.2.2. Consider a two-layer neural network $f_w(\cdot)$ with ReLU activation functions in the hidden layer. Furthermore, assume that the marginal input distribution $P_X(x)$ is symmetric and isotropic, i.e., $P_X(x) = P_X(-x)$ and $\mathbb{E}[xx^\top] = I$. Then the expected explicit regularizer due to dropout is given as
$$R(w) := \mathbb{E}[\widehat{R}(w)] = \frac{\lambda}{2} \sum_{i_0, i_1, i_2 = 1}^{d_0, d_1, d_2} W_2(i_2, i_1)^2 \, W_1(i_1, i_0)^2. \tag{9.10}$$
Proof of Proposition 9.2.2. Using Proposition 9.2.1, we have that
$$R(w) = \mathbb{E}[\widehat{R}(w)] = \lambda \sum_{j=1}^{d_1} \|W_2(:, j)\|^2 \, \mathbb{E}\big[\sigma(W_1(j, :)^\top x)^2\big].$$
It remains to calculate the quantity $\mathbb{E}_x[\sigma(W_1(j, :)^\top x)^2]$. By the symmetry assumption, we have that $P_X(x) = P_X(-x)$. As a result, for any $v \in \mathbb{R}^{d_0}$, the random variables $v^\top x$ and $-v^\top x$ have the same distribution as well. That is, the random variable $z_j := W_1(j, :)^\top x$ is also symmetric about the origin. It is then easy to see that $\mathbb{E}_z[\sigma(z)^2] = \frac{1}{2}\mathbb{E}_z[z^2]$:
$$\mathbb{E}_z[\sigma(z)^2] = \int_{-\infty}^{\infty} \sigma(z)^2 \, d\mu(z) = \int_{0}^{\infty} \sigma(z)^2 \, d\mu(z) = \int_{0}^{\infty} z^2 \, d\mu(z) = \frac{1}{2}\int_{-\infty}^{\infty} z^2 \, d\mu(z) = \frac{1}{2}\,\mathbb{E}_z[z^2].$$
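As a concrete sanity check of this identity (an illustrative choice of distribution, not one made in the text), take $z$ to be standard Gaussian, which is symmetric about the origin:
$$z \sim \mathcal{N}(0, 1) \quad\Longrightarrow\quad \mathbb{E}_z[\sigma(z)^2] = \int_{0}^{\infty} z^2 \, \frac{e^{-z^2/2}}{\sqrt{2\pi}} \, dz = \frac{1}{2} = \frac{1}{2}\,\mathbb{E}_z[z^2].$$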
Plugging the above identity back into the expression for $R(w)$, we get that
$$R(w) = \frac{\lambda}{2} \sum_{j=1}^{d_1} \|W_2(:, j)\|^2 \, \mathbb{E}\big[(W_1(j, :)^\top x)^2\big] = \frac{\lambda}{2} \sum_{j=1}^{d_1} \|W_2(:, j)\|^2 \, \|W_1(j, :)\|^2,$$
where the second equality follows from the assumption that the distribution is isotropic: $\mathbb{E}[(W_1(j, :)^\top x)^2] = W_1(j, :)^\top \, \mathbb{E}[xx^\top] \, W_1(j, :) = \|W_1(j, :)\|^2$. Expanding the squared norms as sums over $i_0$ and $i_2$ recovers (9.10).
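The closed form in (9.10) is straightforward to check numerically. Below is a minimal sketch (not from the text) that draws inputs from a standard Gaussian, which satisfies the symmetry and isotropy assumptions, and compares a Monte Carlo estimate of the expected regularizer from Proposition 9.2.1 against the closed form; the layer widths, weights, and $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

# Numerical sanity check of Proposition 9.2.2 (a sketch, not the authors' code).
# Assumptions: x ~ N(0, I), which is symmetric and isotropic; the widths,
# weights, and the constant lam are arbitrary illustrative choices.
rng = np.random.default_rng(0)
d0, d1, d2 = 5, 7, 3   # layer widths (arbitrary)
lam = 0.5              # regularization constant lambda (arbitrary)
W1 = rng.standard_normal((d1, d0))
W2 = rng.standard_normal((d2, d1))

# Monte Carlo estimate of E[R_hat(w)] using the form from Proposition 9.2.1:
# lam * sum_j ||W2(:, j)||^2 * sigma(W1(j, :)^T x)^2, averaged over x.
n = 1_000_000
X = rng.standard_normal((n, d0))
H = np.maximum(X @ W1.T, 0.0)          # ReLU activations, shape (n, d1)
col_norms_sq = (W2 ** 2).sum(axis=0)   # ||W2(:, j)||^2 for each hidden unit j
mc_estimate = lam * (H ** 2 @ col_norms_sq).mean()

# Closed form (9.10): (lam/2) * sum_{i0, i1, i2} W2(i2, i1)^2 * W1(i1, i0)^2.
closed_form = 0.5 * lam * (col_norms_sq * (W1 ** 2).sum(axis=1)).sum()

print(f"Monte Carlo: {mc_estimate:.4f}   closed form: {closed_form:.4f}")
# The two values should agree up to Monte Carlo error.
```

The closed form exploits the factorization over hidden units derived in the proof: the triple sum in (9.10) equals $\sum_{j}\|W_2(:, j)\|^2\|W_1(j, :)\|^2$, so no explicit triple loop is needed.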
9.3 Landscape of the Optimization Problem
While the focus in Section 9.2 was on understanding the implicit bias of dropout in terms of the global optima of the resulting regularized learning problem, here we focus on computational aspects of dropout as an optimization procedure. Since dropout is a first-order method and the landscape of the dropout objective (e.g., Problem (9.11)) is highly non-convex, we can perhaps only hope to find a local minimum, and even that only if the problem has no degenerate saddle points [? ? ]. Therefore, in this section, we pose the following questions: What is the implicit bias of dropout in terms of local minima? Do