
NOTE As we go through the code examples in this chapter, we're going to rely on you checking the code from GitHub for much of the larger context. We'll be omitting code that's uninteresting or similar to what's come before in earlier chapters, so that we can focus on the crux of the issue at hand.

13.3 Semantic segmentation: Per-pixel classification

Often, segmentation is used to answer questions of the form "Where is a cat in this picture?" Obviously, most pictures of a cat, like figure 13.3, have a lot of non-cat in them; there's the table or wall in the background, the keyboard the cat is sitting on, that kind of thing. Being able to say "This pixel is part of the cat, and this other pixel is part of the wall" requires fundamentally different model output and a different internal structure from the classification models we've worked with thus far. Classification can tell us whether a cat is present, while segmentation will tell us where we can find it.

Figure 13.3 Classification vs. segmentation: classification results in one or more binary flags ("Cat: Yes"), while segmentation produces a mask or heatmap ("Cat: here").

If your project requires differentiating between a near cat and a far cat, or a cat on the left versus a cat on the right, then segmentation is probably the right approach. The image-consuming classification models that we've implemented so far can be thought of as funnels or magnifying glasses that take a large bunch of pixels and focus them down into a single "point" (or, more accurately, a single set of class predictions), as shown in figure 13.4. Classification models provide answers of the form "Yes, this huge pile of pixels has a cat in it, somewhere," or "No, no cats here." This is great when you don't care where the cat is, just that there is (or isn't) one in the image.
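To make that difference concrete in PyTorch terms, here is a minimal, hypothetical sketch (toy layer sizes invented for illustration, not a model from the book): the classifier funnels the image down to one vector of class scores, while a segmentation-style network skips the flattening step, keeps the spatial grid, and emits a score for every pixel.

import torch
from torch import nn

# Toy classifier: convolution plus downsampling funnels the whole image
# down to a single vector of per-class scores.
classifier = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 2),                  # two classes: cat / no cat
)

# Toy segmenter: no flattening, so the spatial dimensions survive and the
# output is a per-pixel score map, one channel per class.
segmenter = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, kernel_size=1),   # 1x1 conv: class scores per pixel
)

img = torch.randn(1, 3, 64, 64)        # a fake 64x64 RGB image
print(classifier(img).shape)           # torch.Size([1, 2])
print(segmenter(img).shape)            # torch.Size([1, 2, 64, 64])

The classifier's output answers "is there a cat?"; the segmenter's output has the same height and width as the input, so each spatial location gets its own answer.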

Repeated layers of convolution and downsampling mean the model starts by consuming raw pixels to produce specific, detailed detectors for things like texture and color, and then builds up higher-level conceptual feature detectors for parts like eyes.
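As a rough illustration of that funnel (again a toy sketch with made-up channel counts, not the book's actual architecture), each convolution-plus-downsampling block halves the spatial resolution while widening the channel dimension, trading "where" for "what":

import torch
from torch import nn

# A stack of conv + downsampling blocks like the classifiers from earlier
# chapters: each block halves height and width and doubles the channels.
blocks = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(1, 3, 64, 64)
for layer in blocks:
    x = layer(x)
    if isinstance(layer, nn.MaxPool2d):
        print(x.shape)   # spatial size shrinks: 64x64 -> 32x32 -> 16x16 -> 8x8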
