Computing the Hessian ∇²f(w) can be computationally difficult
because its size scales quadratically in d (which can exceed 1 million
in practice). Therefore, approximations of the Hessian and its inverse are
used:
w ← w − η Q(w)∇f(w)
where Q(w) is supposed to be a good approximation of (∇²f(w))⁻¹,
and is sometimes referred to as a pre-conditioner. In practice, one
often first approximates ∇²f(w) by a diagonal matrix and then takes
its inverse. E.g., one can use diag(∇f(w)∇f(w)⊤) to approximate
the Hessian, and then use the inverse of that diagonal matrix as the
pre-conditioner.
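The update with this diagonal pre-conditioner is short to state in code. Below is a minimal NumPy sketch of one step; the function name preconditioned_gd_step, the step size lr, and the stabilizer eps are illustrative choices (not from the text), and eps is a standard guard against division by zero rather than part of the formula above. Note that diag(∇f(w)∇f(w)⊤) is just the elementwise squared gradient.

import numpy as np

def preconditioned_gd_step(w, grad_fn, lr=0.1, eps=1e-8):
    # One diagonal-preconditioned gradient step (hypothetical sketch).
    g = grad_fn(w)
    # Diagonal Hessian approximation: the i-th entry of diag(g g^T) is g_i^2.
    h_diag = g * g
    # Pre-conditioner Q(w): inverse of the diagonal approximation.
    # eps is an assumed stabilizer to avoid dividing by zero.
    q = 1.0 / (h_diag + eps)
    return w - lr * q * g

# Example usage on a badly scaled quadratic f(w) = (1/2) w^T A w,
# where the gradient is A w:
A = np.diag([1.0, 100.0])
w = np.array([1.0, 1.0])
w_next = preconditioned_gd_step(w, lambda v: A @ v)
print(w_next)

In this sketch the pre-conditioner rescales each coordinate of the gradient by the inverse of its squared magnitude, which is the mechanism the diagonal approximation above describes.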
≪Tengyu notes: more on adagrad?≫