
Vision Transformer

The Transformer architecture is fairly flexible, and, although it was devised to handle NLP tasks in the first place, it is already starting to spread to different areas, including computer vision. Let’s take a look at one of the latest developments in the field: the Vision Transformer (ViT). It was introduced by Dosovitskiy, A., et al. in their paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." [152]

"Cool, but I thought the Transformer handled sequences, not images."

That’s a fair point. The answer is deceptively simple: Let’s break an image into a

sequence of patches.
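To make the idea more concrete, here is a minimal sketch (my own illustration, not the book’s code) of how a batch of images could be broken into a sequence of flattened patches using PyTorch’s unfold(); the helper name, image_to_patches, and the 4x4 patch size are assumptions made for the example:

import torch

def image_to_patches(x, patch_size):
    # x is a batch of images with shape (N, C, H, W);
    # H and W are assumed to be divisible by patch_size
    n, c, h, w = x.shape
    # slide non-overlapping windows over the height (dim 2) and width (dim 3)
    patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (N, C, H/p, W/p, p, p) -> (N, number of patches, C * p * p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)
    return patches.reshape(n, -1, c * patch_size * patch_size)

dummy = torch.randn(1, 1, 12, 12)           # one single-channel 12x12 image
print(image_to_patches(dummy, 4).shape)     # torch.Size([1, 9, 16])

Each flattened patch then plays the role of one "word" in the input sequence.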

Data Generation & Preparation

First, let’s bring back our multiclass classification problem from Chapter 5. We’re generating a synthetic dataset of images that are going to have either a diagonal or a parallel line, and labeling them according to the table below:

Line                                 Label/Class Index
Parallel (Horizontal OR Vertical)    0
Diagonal, Tilted to the Right        1
Diagonal, Tilted to the Left         2

Data Generation

images, labels = generate_dataset(img_size=12, n_images=1000,
                                  binary=False, seed=17)

Each image, like the example below, is 12x12 pixels in size and has a single channel:

img = torch.as_tensor(images[2]).unsqueeze(0).float() / 255.  # add a channel dimension and scale pixel values to [0, 1]
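To connect this back to the paper’s title, here is a quick sketch of my own (not the book’s code): splitting this 12x12, single-channel image into non-overlapping 4x4 patches (the patch size is chosen here just for illustration) yields a sequence of nine patches, each flattened into sixteen pixel values:

# img has shape (1, 12, 12): channel, height, width
patches = img.unfold(1, 4, 4).unfold(2, 4, 4)  # (1, 3, 3, 4, 4): a 3x3 grid of 4x4 patches
seq = patches.reshape(1, 9, 16)                # nine flattened patches, sixteen pixels each
print(seq.shape)  # torch.Size([1, 9, 16])

Each of these flattened patches is treated as one "word" of the sequence fed to the Transformer.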

