Deep-Learning-with-PyTorch
Semantic segmentation: Per-pixel classification


[Figure: a 6 × 6 input passes through a 3 × 3 convolution to a 4 × 4 output, through a second 3 × 3 convolution to a 2 × 2 output, and through a 2 × 2 max pool to a 1 × 1 output.]

Figure 13.5 The convolutional architecture of a LunaModel block, consisting of two 3 × 3 convolutions followed by a max pool. The final pixel has a 6 × 6 receptive field.
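The shape bookkeeping in the figure can be reproduced with simple arithmetic, assuming unpadded, stride-1 convolutions as drawn (a sketch, not the model's actual padded layers):

```python
def valid_conv_out(size, kernel=3):
    # A stride-1 convolution with no padding shrinks each side by kernel - 1
    return size - (kernel - 1)

def max_pool_out(size, pool=2):
    # A 2x2 max pool with stride 2 halves each side
    return size // pool

size = 6                      # 6 x 6 input, as in the figure
size = valid_conv_out(size)   # after the first 3 x 3 conv: 4 x 4
size = valid_conv_out(size)   # after the second 3 x 3 conv: 2 x 2
size = max_pool_out(size)     # after the 2 x 2 max pool: 1 x 1
print(size)  # -> 1
```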

(bad) due to the limited reach that multiple stacked layers of small convolutions have. The classification model uses each downsampling layer to double the effective reach of the following convolutions; without that increase in effective field size, each segmented pixel would only be able to consider a very local neighborhood.

NOTE Assuming 3 × 3 convolutions, the receptive field size for a simple model of stacked convolutions is 2 * L + 1, with L being the number of convolutional layers.
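The formula falls out of the fact that each stride-1 convolution extends the receptive field by kernel - 1 pixels on top of the single starting pixel. A quick sketch (the function name is ours, not the book's):

```python
def stacked_conv_receptive_field(num_layers, kernel_size=3):
    """Receptive field of num_layers stacked stride-1 convolutions (no pooling).

    Each k x k convolution grows the receptive field by k - 1 pixels,
    starting from 1; for 3 x 3 kernels this reduces to 2 * L + 1.
    """
    return 1 + num_layers * (kernel_size - 1)

print(stacked_conv_receptive_field(4))  # four 3 x 3 convs -> 9, i.e., 9 x 9
```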

Four layers of 3 × 3 convolutions will have a receptive field of 9 × 9 per output pixel. By inserting a 2 × 2 max pool between the second and third convolutions, and another at the end, we increase the receptive field to …

NOTE See if you can figure out the math yourself; when you’re done, check back here.

... 16 × 16. The final series of conv-conv-pool has a receptive field of 6 × 6, but that happens after the first max pool, which makes the final effective receptive field 12 × 12 in the original input resolution. The first two conv layers add a total border of 2 pixels around the 12 × 12, for a total of 16 × 16.
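This bookkeeping can be mechanized by walking the layer stack back to front: crossing a layer toward the input transforms the receptive field as rf * stride + (kernel - stride). A minimal sketch, with (kernel, stride) pairs standing in for the layers described above:

```python
def receptive_field(layers):
    """Effective receptive field of a layer stack, computed output to input.

    layers is a list of (kernel_size, stride) pairs in input-to-output order;
    crossing one layer toward the input: rf = rf * stride + (kernel - stride).
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = rf * stride + (kernel - stride)
    return rf

# conv-conv-pool repeated twice, as in the text:
# two 3 x 3 convs (stride 1), a 2 x 2 max pool (stride 2), then the same again
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
print(receptive_field(layers))            # -> 16, matching the 16 x 16 above
print(receptive_field(layers[3:]))        # final conv-conv-pool alone -> 6
```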

So the question remains: how can we improve the receptive field of an output pixel while maintaining a 1:1 ratio of input pixels to output pixels? One common answer is
