Theory of Deep Learning, 2022
List of Figures
3.1 Why it suffices to compute derivatives with respect to nodes. 22
3.2 Multivariate chain rule: the derivative with respect to a node z can be computed
as a weighted sum of the derivatives with respect to all nodes
that z feeds into. 23
3.3 Vector version of the above. 26
6.1 Steepest descent w.r.t. $\|\cdot\|_{4/3}$: the global minimum to which steepest descent
converges depends on $\eta$. Here $w_0 = [0, 0, 0]$, $w^*_{\|\cdot\|} = \arg\min_{w \in \mathcal{G}} \|w\|_{4/3}$
denotes the minimum norm global minimum, and $w_\infty^{\eta \to 0}$ denotes the solution
of infinitesimal SD with $\eta \to 0$. Note that even as $\eta \to 0$, the expected
characterization does not hold, i.e., $w_\infty^{\eta \to 0} \neq w^*_{\|\cdot\|}$. 44
7.1 Obstacles for nonconvex optimization. From left to right: local minimum,
saddle point and flat region. 57
8.1 Convergence rate vs. projections onto eigenvectors of the kernel matrix. 73
8.2 Generalization error vs. complexity measure. 74
9.1 Optimization landscape (top) and contour plot (bottom) for a single
hidden-layer linear autoencoder network with one-dimensional input
and output and a hidden layer of width r = 2 with dropout,
for different values of the regularization parameter λ. Left: for λ =
0 the problem reduces to squared loss minimization, which is rotation
invariant as suggested by the level sets. Middle: for λ > 0 the
global optima shrink toward the origin. All local minima are global
and equalized, i.e., the weights are parallel to the vector (±1, ±1).
Right: as λ increases, the global optima shrink further. 93
10.1 Visualization of Pearson’s Crab Data as mixture of two Gaussians.
(Credit: MIX homepage at McMaster University.) 102