Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide (Leanpub)

    # If there is a configuration file, builds a dictionary with
    # the corresponding start and end lines of each text file
    config_file = os.path.join(source, 'lines.cfg')
    config = {}
    if os.path.exists(config_file):
        with open(config_file, 'r') as f:
            rows = f.readlines()

        for r in rows[1:]:
            fname, start, end = r.strip().split(',')
            config.update({fname: (int(start), int(end))})

    new_fnames = []
    # For each file of text
    for fname in filenames:
        # If there's a start and end line for that file, use it
        try:
            start, end = config[fname]
        except KeyError:
            start = None
            end = None

        # Opens the file, slices the configured lines (if any),
        # cleans line breaks, and uses the sentence tokenizer
        with open(os.path.join(source, fname), 'r') as f:
            contents = (
                ''.join(f.readlines()[slice(start, end, None)])
                .replace('\n', ' ').replace('\r', '')
            )
        corpus = sent_tokenize(contents, **kwargs)

        # Builds a CSV file containing tokenized sentences
        base = os.path.splitext(fname)[0]
        new_fname = f'{base}.sent.csv'
        new_fname = os.path.join(source, new_fname)
        with open(new_fname, 'w') as f:
            # Header of the file
            if include_header:
                if include_source:
                    f.write('sentence,source\n')
                else:
                    f.write('sentence\n')
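For reference, the lines.cfg file parsed at the top of the listing is just a small CSV: the header row is skipped (rows[1:]), and each remaining row holds a file name plus the start and end values used to slice that file's list of lines (end is exclusive, as in any Python slice). A minimal sketch of such a configuration file, with made-up file names and line numbers (the header text itself is ignored):

    fname,start,end
    alice.txt,104,3704
    wizard_of_oz.txt,310,5100

Files without an entry fall into the KeyError branch: start and end stay None, so the slice keeps the whole file.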

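The actual sentence splitting is delegated to sent_tokenize, presumably NLTK's tokenizer (imported as from nltk.tokenize import sent_tokenize). A self-contained sketch of that call in isolation, with a made-up sample text:

    import nltk
    from nltk.tokenize import sent_tokenize

    # Downloads the Punkt models that sent_tokenize relies on
    nltk.download('punkt')

    text = ("Alice was beginning to get very tired. She peeped into "
            "the book her sister was reading. What is the use of a "
            "book without pictures?")
    print(sent_tokenize(text))
    # ['Alice was beginning to get very tired.',
    #  'She peeped into the book her sister was reading.',
    #  'What is the use of a book without pictures?']

Any keyword arguments forwarded through **kwargs in the listing end up in this call; language, for instance, selects which Punkt model to use.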