
Applying this approximation inductively in Equation (8.16), we get
\[
\left\langle b^{(h)}(x),\, b^{(h)}(x') \right\rangle \to \prod_{h'=h}^{L} \dot{\Sigma}^{(h')}(x, x').
\]
Finally, since
\[
\left\langle \frac{\partial f(w,x)}{\partial w},\, \frac{\partial f(w,x')}{\partial w} \right\rangle = \sum_{h=1}^{L+1} \left\langle \frac{\partial f(w,x)}{\partial W^{(h)}},\, \frac{\partial f(w,x')}{\partial W^{(h)}} \right\rangle,
\]
we obtain the final NTK expression for the fully-connected neural network:
\[
\Theta^{(L)}(x, x') = \sum_{h=1}^{L+1} \left( \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot{\Sigma}^{(h')}(x, x') \right).
\]
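To make the recursion concrete, here is a minimal NumPy sketch that evaluates \(\Theta^{(L)}(x, x')\) for a fully-connected ReLU network, using the closed-form (arc-cosine) expressions for \(\Sigma^{(h)}\) and \(\dot{\Sigma}^{(h)}\). It assumes the conventions \(\Sigma^{(0)}(x, x') = x^\top x'\), \(c_\sigma = 2\), and \(\dot{\Sigma}^{(L+1)} \equiv 1\); the exact normalization may differ from the one used elsewhere in this chapter.

```python
import numpy as np

def relu_ntk(x, xp, L):
    """Theta^(L)(x, x') for a fully-connected ReLU network with L hidden
    layers, via the closed-form (arc-cosine) expressions for Sigma^(h)
    and Sigma-dot^(h).  Assumed conventions: Sigma^(0)(x, x') = x^T x',
    c_sigma = 2, and Sigma-dot^(L+1) = 1."""
    sig_xx, sig_pp, sig_xp = x @ x, xp @ xp, x @ xp
    sigmas = [sig_xp]        # Sigma^(0)(x,x'), ..., Sigma^(L)(x,x')
    sigma_dots = []          # Sigma-dot^(1)(x,x'), ..., Sigma-dot^(L)(x,x')

    for _ in range(L):
        norm = np.sqrt(sig_xx * sig_pp)
        cos_t = np.clip(sig_xp / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        # ReLU closed forms (with c_sigma = 2); the diagonal entries
        # Sigma^(h)(x, x) stay fixed under this normalization.
        sig_xp = norm * (np.sin(theta) + (np.pi - theta) * cos_t) / np.pi
        sigma_dots.append((np.pi - theta) / np.pi)
        sigmas.append(sig_xp)

    sigma_dots.append(1.0)   # Sigma-dot^(L+1) := 1 by convention

    # Theta^(L)(x,x') = sum_{h=1}^{L+1} Sigma^(h-1) * prod_{h'=h}^{L+1} Sigma-dot^(h')
    return sum(sigmas[h - 1] * np.prod(sigma_dots[h - 1:]) for h in range(1, L + 2))

# Example: NTK value between two unit vectors for a 3-hidden-layer network.
x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
print(relu_ntk(x, xp, L=3))
```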

8.5 NTK in Practice


Up to now we have shown that an ultra-wide neural network, under a certain initialization scheme and trained by gradient flow, corresponds to a kernel predictor with a particular kernel function. A natural question is: why don't we use this kernel classifier directly?

A recent line of work showed that NTKs can be empirically useful, especially on small to medium scale datasets. Arora et al. [? ] tested the NTK classifier on 90 small to medium scale datasets from the UCI database.³ They found that the NTK can beat neural networks, other kernels such as the Gaussian kernel, and the best previous classifier, random forests, under various metrics, including average rank and average accuracy. This suggests the NTK classifier should belong in any list of off-the-shelf machine learning methods.
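Concretely, "using the kernel classifier directly" just means plugging the NTK into any standard kernel method. Below is a minimal sketch of kernel ridge regression with an NTK gram matrix, reusing the relu_ntk function from the sketch above; the regularization parameter lam and the depth L=3 are arbitrary illustrative choices, and the experiments cited above use their own architecture-specific kernels and solvers.

```python
import numpy as np

def ntk_gram(X1, X2, kernel_fn, L=3):
    """Pairwise NTK gram matrix between the rows of X1 and X2, where
    kernel_fn(x, x', L) is a pointwise kernel such as relu_ntk above."""
    return np.array([[kernel_fn(a, b, L) for b in X2] for a in X1])

def ntk_ridge_fit_predict(X_train, y_train, X_test, kernel_fn, lam=1e-3, L=3):
    """Kernel ridge regression with the NTK as the kernel: solve
    (K + lam*I) alpha = y, then predict with k(x_test, X_train) @ alpha.
    For binary classification, threshold the output at zero."""
    K = ntk_gram(X_train, X_train, kernel_fn, L)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return ntk_gram(X_test, X_train, kernel_fn, L) @ alpha
```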

For every neural network architecture, one can derive a corresponding kernel function. Du et al. [? ] derived the graph NTK (GNTK) for graph classification tasks. On various social network and bioinformatics datasets, the GNTK can outperform graph neural networks. Similarly, Arora et al. [? ] derived the convolutional NTK (CNTK) formula that corresponds to convolutional neural networks. For image classification tasks in small-scale data and low-shot settings, CNTKs can be quite strong [? ]. However, for large-scale data, Arora et al. [? ] found there is still a performance gap between the CNTK and the CNN. It is an open problem to explain this phenomenon theoretically, and doing so may require going beyond the NTK framework.
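Deriving such architecture-specific kernels by hand is tedious, so in practice they are usually computed with a library. The sketch below uses the neural-tangents package (a JAX-based library that is not discussed in this chapter, so treat the exact API as an assumption) to obtain the exact CNTK of a small convolutional network.

```python
from jax import random
from neural_tangents import stax

# A small convolutional architecture; kernel_fn computes its exact NTK (CNTK).
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Conv(32, (3, 3)), stax.Relu(),
    stax.Conv(32, (3, 3)), stax.Relu(),
    stax.Flatten(), stax.Dense(10),
)

key1, key2 = random.split(random.PRNGKey(0))
x1 = random.normal(key1, (4, 28, 28, 1))   # batches of NHWC images
x2 = random.normal(key2, (3, 28, 28, 1))
cntk = kernel_fn(x1, x2, 'ntk')            # pairwise CNTK values, shape (4, 3)
```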

³ https://archive.ics.uci.edu/ml/datasets.php

8.6 Exercises

1. NTK formula for the ReLU activation function: prove that
\[
\mathbb{E}_{w \sim \mathcal{N}(0, I)} \left[ \dot{\sigma}(w^\top x)\, \dot{\sigma}(w^\top x') \right] = \frac{\pi - \arccos\left( \frac{x^\top x'}{\|x\|_2 \|x'\|_2} \right)}{2\pi}.
\]
(A numerical sanity check of this identity is sketched after the exercises.)

2. Prove Equation (8.10).
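The following is a quick Monte Carlo sanity check of the identity in Exercise 1; it is not a proof, just a numerical illustration, and the dimension and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 5, 1_000_000

x, xp = rng.normal(size=d), rng.normal(size=d)

# Empirical estimate of E_{w ~ N(0, I)}[ 1{w.x > 0} * 1{w.x' > 0} ],
# since the ReLU derivative sigma-dot is the step function.
W = rng.normal(size=(n_samples, d))
empirical = np.mean((W @ x > 0) & (W @ xp > 0))

# Closed form from Exercise 1.
cos_t = x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))
closed_form = (np.pi - np.arccos(cos_t)) / (2 * np.pi)

print(empirical, closed_form)   # the two values should agree to ~3 decimals
```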
