
"… with great depth comes great complexity …"

Peter Parker

…and, along with that, overfitting.

But we also know that dropout works pretty well as a regularizer, so we can throw that in the mix as well.

"How are we adding normalization, residual connections, and dropout

to our model?"

We’ll wrap each and every "sub-layer" with them! Cool, right? But that brings up another question: How to wrap them? It turns out, we can wrap a "sub-layer" in one of two ways: norm-last or norm-first.

Figure 10.7 - "Sub-Layers"—norm-last vs norm-first

The norm-last wrapper follows the "Attention Is All You Need" [149] paper to the letter:

"We employ a residual connection around each of the two sub-layers, followed by

layer normalization. That is, the output of each sub-layer is

LayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented by

the sub-layer itself."

The norm-first wrapper follows the "sub-layer" implementation described in "The Annotated Transformer," [150] which explicitly places norm first as opposed to last.
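A corresponding sketch for the norm-first wrapper (again, names and arguments are illustrative assumptions, not the book's code) normalizes the input before handing it to the sub-layer and keeps the residual path untouched: x + Dropout(Sublayer(LayerNorm(x))).

import torch.nn as nn

class NormFirstSubLayer(nn.Module):
    # Sketch only: layer normalization first, then the sub-layer, dropout,
    # and the residual connection: x + Dropout(Sublayer(LayerNorm(x)))
    def __init__(self, sublayer, d_model, dropout=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.drop(self.sublayer(self.norm(x)))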

Wrapping "Sub-Layers" | 809
