
Here, we can see that neg_correct is the same thing as trueNeg_count! That actually makes sense, since non-nodule is our “negative” value (as in “a negative diagnosis”), and if the classifier gets the prediction correct, then that’s a true negative. Similarly, correctly labeled nodule samples are true positives.

We do need to add the variables for our false positive and false negative values. That’s straightforward, since we can take the total number of non-nodule labels and subtract the count of the correct ones. What’s left is the count of non-nodule samples misclassified as positive; hence, they are false positives. The false negative calculation has the same form, but uses the nodule counts.
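Expressed as code, those four counts might look like the following sketch. The names neg_count, neg_correct, pos_count, and pos_correct are assumed to be the per-class tallies computed earlier in logMetrics, so treat this as an illustration rather than the exact listing.

# Confusion-matrix counts, assuming the per-class tallies from earlier in logMetrics
trueNeg_count = neg_correct                 # non-nodules correctly called non-nodule
truePos_count = pos_correct                 # nodules correctly called nodule
falsePos_count = neg_count - neg_correct    # non-nodules incorrectly called nodule
falseNeg_count = pos_count - pos_correct    # nodules the classifier missed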

With those values, we can compute precision and recall and store them in metrics_dict.

Listing 12.2 training.py:333, LunaTrainingApp.logMetrics

precision = metrics_dict['pr/precision'] = \
    truePos_count / np.float32(truePos_count + falsePos_count)
recall = metrics_dict['pr/recall'] = \
    truePos_count / np.float32(truePos_count + falseNeg_count)
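To make the formulas concrete, here is a small worked example with made-up counts (not taken from an actual training run): the classifier flags 25 samples as nodules, 8 of which really are nodules, while missing 2 actual nodules.

import numpy as np

# Hypothetical counts, purely for illustration
truePos_count = 8     # actual nodules flagged as nodules
falsePos_count = 17   # non-nodules flagged as nodules
falseNeg_count = 2    # actual nodules the classifier missed

precision = truePos_count / np.float32(truePos_count + falsePos_count)  # 8 / 25 = 0.32
recall = truePos_count / np.float32(truePos_count + falseNeg_count)     # 8 / 10 = 0.80

Flagging samples eagerly buys high recall at the cost of low precision, which is exactly the kind of trade-off the next section’s metric is designed to expose.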

Note the double assignment: while storing precision and recall in separate variables isn’t strictly necessary, the named variables improve the readability of the next section. We also extend the logging statement in logMetrics to include the new values, but we skip that implementation for now (we’ll revisit logging later in the chapter).

12.3.4 Our ultimate performance metric: The F1 score

While useful, neither precision nor recall on its own captures everything we need to evaluate a model. As we’ve seen with Roxie and Preston, it’s possible to game either one individually by manipulating the classification threshold, resulting in a model that scores well on one metric at the expense of any real-world utility. We need something that combines both values in a way that prevents such gamesmanship. As we can see in figure 12.10, it’s time to introduce our ultimate metric.

The generally accepted way of combining precision and recall is the F1 score (https://en.wikipedia.org/wiki/F1_score). As with other metrics, the F1 score ranges between 0 (a classifier with no real-world predictive power) and 1 (a classifier with perfect predictions). We will update logMetrics to include it as well.

Listing 12.3 training.py:338, LunaTrainingApp.logMetrics

metrics_dict['pr/f1_score'] = \
    2 * (precision * recall) / (precision + recall)
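To see why this harmonic-mean form matters, compare it with a plain average on a deliberately lopsided pair of scores (the numbers are invented for illustration):

precision, recall = 0.95, 0.05   # a model that games one metric at the expense of the other

arithmetic_mean = (precision + recall) / 2                    # 0.50, deceptively respectable
f1_score = 2 * (precision * recall) / (precision + recall)    # ~0.095, close to 0

Because the harmonic mean is dominated by the smaller of the two values, a classifier can’t earn a good F1 score by sacrificing either precision or recall.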

At first glance, the F1 formula might seem more complicated than we need, and it might not be immediately obvious how the F1 score behaves when trading off precision for recall or
