

Computing the Hessian ∇²f(w) can be computationally difficult because it scales quadratically in d (which can be more than 1 million in practice). Therefore, an approximation of the Hessian and its inverse is used:

w ← w − η Q(w)∇f(w)

where Q(w) is supposed to be a good approximation of the inverse Hessian (∇²f(w))⁻¹, and is sometimes referred to as a pre-conditioner. In practice, people often first approximate ∇²f(w) by a diagonal matrix and then take its inverse. E.g., one can use diag(∇f(w)∇f(w)⊤) to approximate the Hessian; its i-th diagonal entry is simply (∂f(w)/∂wᵢ)², so it costs only O(d) to form and invert. The inverse of this diagonal matrix is then used as the pre-conditioner.
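To make the update concrete, the following is a minimal NumPy sketch of a single pre-conditioned step with this diagonal approximation; the function name, the step size eta = 0.1, and the small eps constant added for numerical stability are illustrative choices, not part of the text.

import numpy as np

def preconditioned_step(w, grad, eta=0.1, eps=1e-8):
    # Diagonal approximation of the Hessian: diag(grad grad^T) = grad**2 elementwise.
    diag_hess = grad * grad + eps   # eps (an added assumption) guards against division by zero
    # Pre-conditioner Q(w): the inverse of the diagonal approximation.
    q = 1.0 / diag_hess
    # Update rule from the text: w <- w - eta * Q(w) * grad f(w).
    # Since Q(w) is diagonal, the matrix-vector product reduces to an elementwise product.
    return w - eta * q * grad

# Example usage with placeholder values for w and the gradient at w:
w = np.array([1.0, -2.0])
g = np.array([0.5, 3.0])
w_next = preconditioned_step(w, g)

Because each diagonal entry is handled independently, coordinates with small gradients take larger effective steps and coordinates with large gradients take smaller ones, which is the same intuition behind AdaGrad-style methods.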

≪Tengyu notes: more on adagrad?≫
