
since large and small nodules will not have the same features), series (to locate the correct CT scan), and candidate center (to find the candidate in the larger CT). The function that will build a list of these NoduleInfoTuple instances starts by using an in-memory caching decorator, followed by getting the list of files present on disk.
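As a point of reference, such a tuple could be declared with namedtuple along the lines of the sketch below; the field list is illustrative, based only on the fields just described, and the actual definition in dsets.py may carry additional fields.

from collections import namedtuple

# Illustrative layout only: the nodule diameter, the series UID of the
# owning CT scan, and the candidate's center coordinates.
NoduleInfoTuple = namedtuple(
    'NoduleInfoTuple',
    'diameter_mm, series_uid, center_xyz',
)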

Listing 10.2 dsets.py:32

@functools.lru_cache(1)    # Standard library in-memory caching
def getCandidateInfoList(requireOnDisk_bool=True):
    # requireOnDisk_bool defaults to screening out series from data
    # subsets that aren't in place yet.
    mhd_list = glob.glob('data-unversioned/part2/luna/subset*/*.mhd')
    presentOnDisk_set = {os.path.split(p)[-1][:-4] for p in mhd_list}

Since parsing some of the data files can be slow, we'll cache the results of this function call in memory. This will come in handy later, because we'll be calling this function more often in future chapters. Speeding up our data pipeline by carefully applying in-memory or on-disk caching can result in some pretty impressive gains in training speed. Keep an eye out for these opportunities as you work on your projects.
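As an illustration of the on-disk variant (this is not the caching code used in the book's repository; the decorator, its name, and the cache directory are hypothetical), a minimal pickle-based cache might look like this:

import functools
import hashlib
import os
import pickle

def disk_cache(cache_dir='cache'):    # Hypothetical helper, not part of dsets.py
    os.makedirs(cache_dir, exist_ok=True)
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Derive a stable filename from the function name and arguments.
            key = hashlib.sha1(repr((fn.__name__, args, kwargs)).encode()).hexdigest()
            path = os.path.join(cache_dir, key + '.pkl')
            if os.path.exists(path):    # Reuse the saved result if we have one.
                with open(path, 'rb') as f:
                    return pickle.load(f)
            result = fn(*args, **kwargs)    # Otherwise compute it and persist it.
            with open(path, 'wb') as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator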

Earlier we said that we'll support running our training program with less than the full set of training data, due to the long download times and high disk space requirements. The requireOnDisk_bool parameter is what makes good on that promise; we're detecting which LUNA series UIDs are actually present and ready to be loaded from disk, and we'll use that information to limit which entries we use from the CSV files we're about to parse. Being able to run a subset of our data through the training loop can be useful to verify that the code is working as intended. A model's training results are usually poor to useless when doing so, but exercising our logging, metrics, model checkpointing, and similar functionality is still beneficial.
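Conceptually, the screening amounts to the check sketched below while parsing candidates.csv (the path is assumed to parallel annotations.csv, and everything beyond the names from Listing 10.2 is condensed and illustrative):

with open('data/part2/luna/candidates.csv', "r") as f:
    for row in list(csv.reader(f))[1:]:
        series_uid = row[0]

        # Skip series whose CT data isn't in a downloaded subset yet.
        if series_uid not in presentOnDisk_set and requireOnDisk_bool:
            continue

        # ...otherwise, build the candidate tuple for this row...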

After we get our candidate information, we want to merge in the diameter information from annotations.csv. First we need to group our annotations by series_uid, as that's the first key we'll use to cross-reference each row from the two files.

Listing 10.3 dsets.py:40, def getCandidateInfoList

diameter_dict = {}
with open('data/part2/luna/annotations.csv', "r") as f:
    for row in list(csv.reader(f))[1:]:    # Skips the CSV header row
        series_uid = row[0]
        annotationCenter_xyz = tuple([float(x) for x in row[1:4]])
        annotationDiameter_mm = float(row[4])

        # Group the (center, diameter) pairs by series UID.
        diameter_dict.setdefault(series_uid, []).append(
            (annotationCenter_xyz, annotationDiameter_mm)
        )
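Once the dictionary is built, the annotations for a given candidate can be looked up by its series UID. A condensed, illustrative sketch of how that lookup is typically consumed (the actual comparison of centers is elided here) is:

candidateDiameter_mm = 0.0    # Default when no annotation matches this candidate.
for annotation_tup in diameter_dict.get(series_uid, []):
    annotationCenter_xyz, annotationDiameter_mm = annotation_tup
    # ...compare annotationCenter_xyz to the candidate's center here...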
