
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


Even though we're using HuggingFace's Dataset class to build our own dataset, we're only using a fraction of its capabilities. For a more detailed view of what it has to offer, make sure to check its extensive documentation:

• Quick Tour [165]
• What's in the Dataset Object [166]
• Loading a Dataset [167]

And, for a complete list of every dataset available, check the HuggingFace Hub. [168]

Loading a Dataset

We can use HF's load_dataset() (I will abbreviate HuggingFace as HF from now on) to load a dataset from local files:

Data Preparation

from datasets import load_dataset, Split

dataset = load_dataset(path='csv',
                       data_files=new_fnames,
                       quotechar='\\',
                       split=Split.TRAIN)

The name of the first argument (path) may be a bit misleading: it is actually the path to the dataset processing script, not to the actual files. To load CSV files, we simply use HF's csv script, as in the example above. The list of actual files containing the text (sentences, in our case) must be provided in the data_files argument. The split argument designates which split the dataset represents (Split.TRAIN, Split.VALIDATION, or Split.TEST).

Moreover, the csv script offers more options to control the parsing and reading of the CSV files, like quotechar, delimiter, column_names, skip_rows, and quoting. For more details, please check the documentation on loading CSV files.
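To see what quotechar actually controls, here is a minimal, stdlib-only sketch (using Python's built-in csv module rather than HF, and made-up sample rows): fields are wrapped in backslashes instead of double quotes, so commas inside a field are not mistaken for separators.

```python
import csv
import io

# Two illustrative CSV lines using '\' as the quote character;
# the comma inside the first field is part of the data, not a delimiter
raw = '\\Alice, the sister\\,7\n\\White Rabbit\\,1\n'

# Telling the parser that '\' is the quote character recovers the fields
rows = list(csv.reader(io.StringIO(raw), quotechar='\\'))
print(rows)  # [['Alice, the sister', '7'], ['White Rabbit', '1']]
```

HF's csv script forwards these parsing options in the same spirit, which is why the backslash quotechar appears in the load_dataset() call above.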

It is also possible to load data from JSON files, text files, Python dictionaries, and Pandas dataframes.

892 | Chapter 11: Down the Yellow Brick Rabbit Hole
