
into the determination of that output pixel were padded, fabricated, or otherwise incomplete. Thus the output of the original U-Net will tile perfectly, so it can be used with images of any size (except at the edges of the input image, where some context will be missing by definition).

There are two problems with us taking the same pixel-perfect approach for our problem. The first is related to the interaction between convolution and downsampling, and the second is related to the nature of our data being three-dimensional.

13.5.1 U-Net has very specific input size requirements

The first issue is that the sizes of the input and output patches for U-Net are very specific. In order to have the two-pixel loss per convolution line up evenly before and after downsampling (especially when considering the further convolutional shrinkage at that lower resolution), only certain input sizes will work. The U-Net paper used 572 × 572 image patches, which resulted in 388 × 388 output maps. The input images are bigger than our 512 × 512 CT slices, and the output is quite a bit smaller! That would mean any nodules near the edge of the CT scan slice wouldn't be segmented at all. Although this setup works well when dealing with very large images, it's not ideal for our use case.
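To see why only certain sizes line up, here is a quick back-of-the-envelope check of the original paper's size arithmetic. This is an illustrative sketch, not code from the book's project; it just walks the unpadded 3 × 3 convolutions and 2 × 2 pooling through the four levels of the "U."

```python
# Rough sanity check of the original (unpadded) U-Net size arithmetic.
# Each level on the way down applies two unpadded 3x3 convolutions
# (losing 2 pixels each) followed by a 2x2 max pool; the way up doubles
# the size with an up-convolution and applies two more unpadded convs.
size = 572
for _ in range(4):              # contracting path
    size = (size - 4) // 2
size -= 4                       # bottom of the "U"
for _ in range(4):              # expanding path
    size = size * 2 - 4
print(size)                     # 388, matching the paper's 572 -> 388 figures
```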

We will address this issue by setting the padding flag of the U-Net constructor to True. This will mean we can use input images of any size, and we will get output of the same size. We may lose some fidelity near the edges of the image, since the receptive field of pixels located there will include regions that have been artificially padded, but that's a compromise we decide to live with.
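As a concrete illustration of what that flag changes, the snippet below compares an unpadded and a padded 3 × 3 convolution on a 512 × 512 input. The layers here are stand-ins for the convolutions inside the U-Net class, not the book's actual model code.

```python
import torch
from torch import nn

# Illustrative stand-ins for the 3x3 convolutions inside U-Net: with
# padding=1, each convolution keeps the spatial size, so the output
# segmentation map matches the input slice pixel for pixel.
unpadded_conv = nn.Conv2d(1, 64, kernel_size=3, padding=0)
padded_conv = nn.Conv2d(1, 64, kernel_size=3, padding=1)

ct_slice = torch.zeros(1, 1, 512, 512)   # one single-channel CT slice
print(unpadded_conv(ct_slice).shape)     # torch.Size([1, 64, 510, 510])
print(padded_conv(ct_slice).shape)       # torch.Size([1, 64, 512, 512])
```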

13.5.2 U-Net trade-offs for 3D vs. 2D data

The second issue is that our 3D data doesn't line up exactly with U-Net's 2D expected input. Simply taking our 512 × 512 × 128 image and feeding it into a converted-to-3D U-Net class won't work, because we'll exhaust our GPU memory. Each image is 2⁹ by 2⁹ by 2⁷, with 2² bytes per voxel. The first layer of U-Net is 64 channels, or 2⁶. That's an exponent of 9 + 9 + 7 + 2 + 6 = 33, or 8 GB just for the first convolutional layer. There are two convolutional layers (16 GB); and then each downsampling halves the resolution but doubles the channels, which is another 2 GB for each layer after the first downsample (remember, halving the resolution results in one-eighth the data, since we're working with 3D data). So we've hit 20 GB before we even get to the second downsample, much less anything on the upsample side of the model or anything dealing with autograd.
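The arithmetic above is easy to verify. The following sketch simply redoes the multiplication; it estimates the activation storage for the first level, and is not a measurement of actual GPU usage.

```python
# Rough estimate of the activations for the first level of a naive 3D U-Net.
voxels = 2**9 * 2**9 * 2**7       # 512 x 512 x 128 input volume
bytes_per_voxel = 2**2            # 4 bytes for float32
channels = 2**6                   # 64 channels in the first U-Net level
one_layer = voxels * bytes_per_voxel * channels
print(one_layer / 2**30)          # 8.0 GB for one convolutional layer's output
print(2 * one_layer / 2**30)      # 16.0 GB for the two layers at that level
```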

NOTE There are a number of clever and innovative ways to get around these problems, and we in no way suggest that this is the only approach that will ever work.⁶ We do feel that this approach is one of the simplest that gets the job done to the level we need for our project in this book. We'd rather keep things simple so that we can focus on the fundamental concepts; the clever stuff can come later, once you've mastered the basics.

6. For example, Stanislav Nikolov et al., "Deep Learning to Achieve Clinically Applicable Segmentation of Head and Neck Anatomy for Radiotherapy," https://arxiv.org/pdf/1809.04430.pdf.
