
Theory of Deep Learning, 2022


List of Figures

3.1 Why it suffices to compute derivatives with respect to nodes. 22

3.2 Multivariate chain rule: the derivative with respect to node z can be computed as a weighted sum of the derivatives with respect to all nodes that z feeds into (see the formula sketch after this list). 23

3.3 Vector version of the above. 26

6.1 Steepest descent w.r.t. $\|\cdot\|_{4/3}$: the global minimum to which steepest descent converges depends on $\eta$. Here $w_0 = [0, 0, 0]$, $w^*_{\|\cdot\|} = \arg\min_{w \in G} \|w\|_{4/3}$ denotes the minimum-norm global minimum, and $w^\infty_{\eta \to 0}$ denotes the solution of infinitesimal SD with $\eta \to 0$. Note that even as $\eta \to 0$, the expected characterization does not hold, i.e., $w^\infty_{\eta \to 0} \neq w^*_{\|\cdot\|}$. 44

7.1 Obstacles for nonconvex optimization. From left to right: local minimum, saddle point and flat region. 57

8.1 Convergence rate vs. projections onto eigenvectors of the kernel matrix. 73

8.2 Generalization error vs. complexity measure. 74

9.1 Optimization landscape (top) and contour plot (bottom) for a single-hidden-layer linear autoencoder network with one-dimensional input and output and a hidden layer of width r = 2 with dropout, for different values of the regularization parameter λ. Left: for λ = 0 the problem reduces to squared-loss minimization, which is rotation invariant, as suggested by the level sets. Middle: for λ > 0 the global optima shrink toward the origin. All local minima are global and equalized, i.e., the weights are parallel to the vector (±1, ±1). Right: as λ increases, the global optima shrink further. 93

10.1 Visualization of Pearson's Crab Data as a mixture of two Gaussians. (Credit: MIX homepage at McMaster University.) 102
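As a brief illustration of the rule referenced in the Figure 3.2 entry: for a node $z$ in a computation graph that feeds into downstream nodes $u_1, \dots, u_m$, the multivariate chain rule expresses the derivative of the loss with respect to $z$ as a weighted sum of the derivatives with respect to those downstream nodes. This is a minimal sketch; the symbols $\ell$ and $u_j$ are illustrative placeholders, not notation taken from the captions themselves:

$$\frac{\partial \ell}{\partial z} \;=\; \sum_{j=1}^{m} \frac{\partial \ell}{\partial u_j}\,\frac{\partial u_j}{\partial z},$$

where each weight $\partial u_j / \partial z$ is the local derivative along the edge from $z$ to $u_j$.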
