
requirement for bounding the Rademacher complexity of $M_\alpha$, as discussed in [?].

We note that, for a large enough sample size, $\widehat{R}(U, V) \approx R(U, V) \approx \Theta(UV^\top) = \frac{1}{d_1}\big\|\operatorname{diag}(\sqrt{p})\, UV^\top \operatorname{diag}(\sqrt{q})\big\|_*^2$, where the second approximation is due to the fact that the pair $(U, V)$ is a minimizer. That is, compared to the weighted trace-norm, the value of the explicit regularizer at the minimizer roughly scales as $1/d_1$; hence the assumption $\widehat{R}(U, V) \leq \alpha/d_1$ in the statement of the corollary.
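
For concreteness, the following Python snippet computes the quantity $\frac{1}{d_1}\big\|\operatorname{diag}(\sqrt{p})\, UV^\top \operatorname{diag}(\sqrt{q})\big\|_*^2$ for a random pair $(U, V)$. It is a minimal sketch: the dimensions and the uniform choice of the row and column marginals $p$ and $q$ are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

# Sketch of (1/d1) * || diag(sqrt(p)) U V^T diag(sqrt(q)) ||_*^2,
# assuming (as in the weighted trace-norm) that p and q are the row and
# column marginals of the sampling distribution over matrix entries.
d1, d2, r = 50, 40, 5
rng = np.random.default_rng(0)
U = rng.standard_normal((d1, r))
V = rng.standard_normal((d2, r))

p = np.full(d1, 1.0 / d1)   # uniform row marginals (illustrative choice)
q = np.full(d2, 1.0 / d2)   # uniform column marginals (illustrative choice)

M = np.diag(np.sqrt(p)) @ U @ V.T @ np.diag(np.sqrt(q))
nuclear_norm = np.linalg.norm(M, ord='nuc')   # sum of singular values
theta = (1.0 / d1) * nuclear_norm ** 2

print(f"(1/d1) * squared weighted trace-norm of UV^T: {theta:.4f}")
```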

In practice, for models that are trained with dropout, the training error $\widehat{L}(U, V)$ is negligible. Moreover, given that the sample size is large enough, the third term can be made arbitrarily small. Consequently, the second term, which is $\widetilde{O}(\gamma\sqrt{\alpha d_2/n})$, dominates the right-hand side of the generalization error bound in Theorem 9.1.2.

The assumption $\max_i \|U(i,:)\|_2 \leq \gamma$, $\max_i \|V(i,:)\|_2 \leq \gamma$ is motivated by the practice of deep learning; such max-norm constraints are typically used with dropout, where the norm of the vector of incoming weights at each hidden unit is constrained to be bounded by a constant [?]. In this case, if a dropout update violates this constraint, the weights of the hidden unit are projected back onto the corresponding norm ball, as sketched below. In the proofs, we need this assumption to give a concentration bound for the empirical explicit regularizer, as well as to bound the supremum deviation between the predictions and the true values.
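
The projection step can be sketched in a few lines of Python. The row-wise layout of the incoming weights and the noisy update below are illustrative assumptions, not details fixed by the text.

```python
import numpy as np

def project_rows_onto_ball(W: np.ndarray, gamma: float) -> np.ndarray:
    """Project each row of W onto the Euclidean ball of radius gamma."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, gamma / np.maximum(row_norms, 1e-12))
    return W * scale

rng = np.random.default_rng(0)
U = rng.standard_normal((100, 10))   # rows = incoming weights of hidden units
gamma = 1.0

# ... hypothetical dropout/SGD step that may grow some rows ...
U = U + 0.1 * rng.standard_normal(U.shape)

# Enforce the max-norm constraint max_i ||U(i,:)||_2 <= gamma.
U = project_rows_onto_ball(U, gamma)
assert np.all(np.linalg.norm(U, axis=1) <= gamma + 1e-8)
```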

We remark that the value of $\gamma$ also determines the complexity of the function class. On the one hand, the generalization gap explicitly depends on, and increases with, $\gamma$. On the other hand, when $\gamma$ is large, the constraints on $U, V$ are milder, so that $\widehat{L}(U, V)$ can be made smaller.

Finally, the required sample size depends heavily on the value of the explicit regularizer at the optimum ($\alpha/d_1$), and hence on the dropout rate $p$. In particular, increasing the dropout rate increases the regularization parameter $\lambda := \frac{p}{1-p}$, thereby intensifying the penalty due to the explicit regularizer. Intuitively, a larger dropout rate $p$ results in a smaller $\alpha$, so that a tighter generalization gap can be guaranteed. We show through experiments that this is indeed the case in practice.
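
The monotone relationship between the dropout rate and the regularization parameter is easy to verify numerically; the grid of rates below is illustrative.

```python
# lambda = p / (1 - p) grows monotonically with the dropout rate p.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    lam = p / (1.0 - p)
    print(f"dropout rate p = {p:.2f}  ->  lambda = {lam:.2f}")
# Prints 0.11, 0.33, 1.00, 3.00, 9.00: a higher dropout rate
# corresponds to a stronger explicit penalty.
```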

9.2 Deep neural networks

Next, we focus on neural networks with multiple hidden layers. Let $\mathcal{X} \subseteq \mathbb{R}^{d_0}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_k}$ denote the input and output spaces, respectively. Let $\mathcal{D}$ denote the joint probability distribution on $\mathcal{X} \times \mathcal{Y}$. Given $n$ examples $\{(x_i, y_i)\}_{i=1}^{n} \sim \mathcal{D}^n$ drawn i.i.d. from the joint distribution and a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the goal of learning is to find a hypothesis $f_w: \mathcal{X} \to \mathcal{Y}$, parameterized by $w$, that has a small population risk $L(w) := \mathbb{E}_{\mathcal{D}}[\ell(f_w(x), y)]$.
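
As a purely illustrative instance of this setup, the following sketch evaluates the empirical risk of a small fully connected ReLU network under squared loss on $n$ i.i.d. samples, as a finite-sample proxy for $L(w)$; the architecture, activation, loss, and synthetic data distribution are assumptions, not choices made in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, dk, n = 10, 32, 3, 200

# Parameters w = (W1, W2) of a one-hidden-layer network f_w: R^{d0} -> R^{dk}.
W1 = rng.standard_normal((d0, d1)) / np.sqrt(d0)
W2 = rng.standard_normal((d1, dk)) / np.sqrt(d1)

def f_w(X: np.ndarray) -> np.ndarray:
    return np.maximum(X @ W1, 0.0) @ W2   # ReLU hidden layer, linear output

def loss(y_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
    return 0.5 * np.sum((y_hat - y) ** 2, axis=1)   # squared loss per example

# n i.i.d. draws (x_i, y_i) from a synthetic joint distribution D.
X = rng.standard_normal((n, d0))
Y = rng.standard_normal((n, dk))

empirical_risk = np.mean(loss(f_w(X), Y))
print(f"empirical risk (finite-sample estimate of L(w)): {empirical_risk:.4f}")
```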
