
11.7 Running the training script


If the first epoch seems to be taking a very long time (more than 10 or 20 minutes), it might be related to needing to prepare the cached data required by LunaDataset. See section 10.5.1 for details about the caching. The exercises for chapter 10 included writing a script to pre-stuff the cache in an efficient manner. We also provide the prepcache.py file to do the same thing; it can be invoked with python -m p2ch11.prepcache. Since we repeat our dsets.py files per chapter, the caching will need to be repeated for every chapter. This is somewhat space- and time-inefficient, but it means we can keep the code for each chapter much more self-contained. For your future projects, we recommend reusing your cache more heavily.
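If you want a sense of what such a pre-caching script does, the sketch below simply iterates a LunaDataset once through a DataLoader so that every sample gets pulled through the disk cache. It is not the book's prepcache.py; the import path, the sortby_str argument, and the batch size and worker count are assumptions based on this part's code layout, so adjust them to match your copy.

from torch.utils.data import DataLoader
from p2ch11.dsets import LunaDataset   # assumes the book's repo is on PYTHONPATH

# Sorting by series UID (if your LunaDataset supports it) keeps candidates from
# the same CT together, so each CT only needs to be read from disk once.
prep_dl = DataLoader(
    LunaDataset(sortby_str='series_uid'),
    batch_size=1024,
    num_workers=8,
)

for batch_ndx, batch_tup in enumerate(prep_dl):
    pass   # touching every sample is all it takes to populate the cache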

Once training is underway, we want to make sure we’re using the computing resources at hand the way we expect. An easy way to tell whether the bottleneck is data loading or computation is to wait a few moments after the script starts to train (look for output like E1 Training 16/7750, done at…) and then check both top and nvidia-smi:

• If the eight Python worker processes are consuming >80% CPU, then the cache probably needs to be prepared (we know this here because the authors have made sure there aren’t CPU bottlenecks in this project’s implementation; this won’t generally be true).

• If nvidia-smi reports that GPU-Util is >80%, then you’re saturating your GPU. We’ll discuss some strategies for efficient waiting in section 11.7.2.

The intent is that the GPU is saturated; we want to use as much of that computing power as we can to complete epochs quickly. A single NVIDIA GTX 1080 Ti should complete an epoch in under 15 minutes. Since our model is relatively simple, it doesn’t take a lot of CPU preprocessing for the CPU to be the bottleneck. When working with deeper models (or models needing more calculations in general), processing each batch on the GPU will take longer, which increases the amount of CPU preprocessing we can get done before the GPU runs out of work and sits idle waiting for the next batch of input.
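If top and nvidia-smi leave you unsure which side is the bottleneck, you can also time a handful of batches yourself. The sketch below is not part of the chapter's training loop; it assumes a DataLoader named train_dl, a model already moved to the GPU, and that the first element of each batch tuple is the input tensor.

import time
import torch

load_s = 0.0
compute_s = 0.0
t0 = time.time()
with torch.no_grad():                          # no backward pass needed for a timing probe
    for batch_ndx, batch_tup in enumerate(train_dl):
        t1 = time.time()
        load_s += t1 - t0                      # time spent waiting on the DataLoader workers
        input_t = batch_tup[0].to('cuda', non_blocking=True)
        _ = model(input_t)                     # forward pass only, enough to gauge GPU load
        torch.cuda.synchronize()               # let the GPU finish before stopping the clock
        t0 = time.time()
        compute_s += t0 - t1
        if batch_ndx >= 99:                    # 100 batches is plenty for a rough estimate
            break

print(f"data loading: {load_s:.1f}s, GPU compute: {compute_s:.1f}s")

If the loading total dwarfs the compute total, the cache (or the number of worker processes) is the place to look.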

11.7.1 Needed data for training

If the number of samples is less than 495,958 for training or 55,107 for validation, it might make sense to do some sanity checking to be sure the full data is present and accounted for. For your future projects, make sure your dataset returns the number of samples that you expect.
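One direct way to check is to instantiate both datasets and compare their lengths against those counts. This is a sketch rather than code from the chapter; it assumes the LunaDataset constructor arguments used elsewhere in part 2 (val_stride and isValSet_bool), so adapt it if your copy differs.

from p2ch11.dsets import LunaDataset

train_ds = LunaDataset(val_stride=10, isValSet_bool=False)
val_ds = LunaDataset(val_stride=10, isValSet_bool=True)

# If either assertion fails, some subsets are probably missing or only partially unpacked.
assert len(train_ds) == 495_958, len(train_ds)
assert len(val_ds) == 55_107, len(val_ds)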

First, let’s take a look at the basic directory structure of our data-unversioned/part2/luna directory:

$ ls -1p data-unversioned/part2/luna/
subset0/
subset1/
...
subset9/

Next, let’s make sure we have one .mhd file and one .raw file for each series UID.
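If you’d rather do that check from Python than from the shell, a rough version using only pathlib looks like the following; the absolute counts depend on how many of the ten subsets you have downloaded, so treat the printed numbers as informational.

from pathlib import Path

luna_path = Path('data-unversioned/part2/luna')
mhd_uids = {p.stem for p in luna_path.glob('subset*/*.mhd')}
raw_uids = {p.stem for p in luna_path.glob('subset*/*.raw')}

print(len(mhd_uids), ".mhd files found")
print(len(raw_uids), ".raw files found")
# Any UID present in one set but not the other points at an incomplete download.
print("mismatched series:", mhd_uids ^ raw_uids or "none")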
