
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


Gensim’s dictionary has a couple of methods that we can use for this:

• filter_extremes(): Keeps the first keep_n most frequent words only (it is also possible to keep words that appear in at least no_below documents or to remove words that appear in more than no_above fraction of documents).

• filter_tokens(): Removes tokens from a list of bad_ids (doc2idx() can be used to get a list of the corresponding IDs of the bad words) or keeps only the tokens from a list of good_ids.

"What if I want to remove words that appear fewer than X times across all documents?"

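That’s not directly supported by Gensim’s Dictionary, but we can use its cfs attribute (collection frequencies: a mapping from each token ID to the total number of times that token appears across all documents) to find those tokens with low frequency and then filter them out using filter_tokens():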

Method to Find Rare Tokens

def get_rare_ids(dictionary, min_freq):
    # cfs maps each token ID to its total count across all documents
    rare_ids = [token_id for token_id, count in dictionary.cfs.items()
                if count < min_freq]
    return rare_ids

904 | Chapter 11: Down the Yellow Brick Rabbit Hole
