Deep-Learning-with-PyTorch
Convolutions in action


The receptive field of output pixels

When the second 3 × 3 convolution kernel produces 21 in its conv output in figure 8.8, this is based on the top-left 3 × 3 pixels of the first max pool output. They, in turn, correspond to the 6 × 6 pixels in the top-left corner of the first conv output, which in turn are computed by the first convolution from the top-left 7 × 7 pixels. So the pixel in the second convolution output is influenced by a 7 × 7 input square. The first convolution also uses an implicitly “padded” column and row to produce the output in the corner; otherwise, we would have an 8 × 8 square of input pixels informing a given pixel (away from the boundary) in the second convolution’s output. In fancy language, we say that a given output neuron of the 3 × 3-conv, 2 × 2-max-pool, 3 × 3-conv construction has a receptive field of 8 × 8.
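The same 8 × 8 figure can be reached mechanically with the standard receptive-field recurrence, r ← r + (k − 1) · j and j ← j · s, where r is the receptive field, j the cumulative stride ("jump"), k the kernel size, and s the stride of each layer. A minimal sketch (not from the book; the function name is ours):

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers,
    applied in order, via r <- r + (k - 1) * j and j <- j * s."""
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# 3x3 conv (stride 1), 2x2 max pool (stride 2), 3x3 conv (stride 1)
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # -> 8
```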

8.2.4 Putting it all together for our network

With these building blocks in our hands, we can now proceed to build our convolutional neural network for detecting birds and airplanes. Let’s take our previous fully connected model as a starting point and introduce nn.Conv2d and nn.MaxPool2d as described previously:

# In[22]:
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.Tanh(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 8, kernel_size=3, padding=1),
    nn.Tanh(),
    nn.MaxPool2d(2),
    # ...
)

The first convolution takes us from 3 RGB channels to 16, thereby giving the network a chance to generate 16 independent features that operate to (hopefully) discriminate low-level features of birds and airplanes. Then we apply the Tanh activation function. The resulting 16-channel 32 × 32 image is pooled to a 16-channel 16 × 16 image by the first MaxPool2d. At this point, the downsampled image undergoes another convolution that generates an 8-channel 16 × 16 output. With any luck, this output will consist of higher-level features. Again, we apply a Tanh activation and then pool to an 8-channel 8 × 8 output.
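The shape walkthrough above can be checked by pushing a dummy CIFAR-sized batch through each stage and printing what comes out; a minimal sketch, not from the book:

```python
import torch
import torch.nn as nn

# Trace the shapes described in the text, stage by stage.
stages = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # -> (1, 16, 32, 32)
    nn.Tanh(),
    nn.MaxPool2d(2),                             # -> (1, 16, 16, 16)
    nn.Conv2d(16, 8, kernel_size=3, padding=1),  # -> (1, 8, 16, 16)
    nn.Tanh(),
    nn.MaxPool2d(2),                             # -> (1, 8, 8, 8)
)
x = torch.randn(1, 3, 32, 32)
for layer in stages:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))
```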

Where does this end? After the input image has been reduced to a set of 8 × 8 features, we expect to be able to output some probabilities from the network that we can feed to our negative log likelihood. However, the probabilities we need are a pair of numbers in a 1D vector (one for airplane, one for bird), whereas here we’re still dealing with multichannel 2D features.
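One common way to bridge that gap (a hedged sketch under our own assumptions, not necessarily the book’s exact next step) is to flatten the 8-channel 8 × 8 map into a 512-element vector and map it to two class scores with a linear layer:

```python
import torch
import torch.nn as nn

# Flatten (N, 8, 8, 8) -> (N, 512), then project to two class scores.
head = nn.Sequential(
    nn.Flatten(),             # (N, 8, 8, 8) -> (N, 512)
    nn.Linear(8 * 8 * 8, 2),  # two scores: airplane, bird
)
features = torch.randn(1, 8, 8, 8)
print(head(features).shape)  # torch.Size([1, 2])
```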
