12.07.2015

Think Python - Denison University


13.3 Word histogram

Here is a program that reads a file and builds a histogram of the words in the file:

    import string

    def process_file(filename):
        h = dict()
        fp = open(filename)
        for line in fp:
            process_line(line, h)
        return h

    def process_line(line, h):
        line = line.replace('-', ' ')

        for word in line.split():
            word = word.strip(string.punctuation + string.whitespace)
            word = word.lower()

            h[word] = h.get(word, 0) + 1

    hist = process_file('emma.txt')

This program reads emma.txt, which contains the text of Emma by Jane Austen.

process_file loops through the lines of the file, passing them one at a time to process_line. The histogram h is being used as an accumulator.

process_line uses the string method replace to replace hyphens with spaces before using split to break the line into a list of strings. It traverses the list of words and uses strip and lower to remove punctuation and convert to lower case. (It is a shorthand to say that strings are “converted;” remember that strings are immutable, so methods like strip and lower return new strings.)

Finally, process_line updates the histogram by creating a new item or incrementing an existing one.

To count the total number of words in the file, we can add up the frequencies in the histogram:

    def total_words(h):
        return sum(h.values())

The number of different words is just the number of items in the dictionary:

    def different_words(h):
        return len(h)

Here is some code to print the results:

    print 'Total number of words:', total_words(hist)
    print 'Number of different words:', different_words(hist)

And the results:

    Total number of words: 161073
    Number of different words: 7212
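If you do not have emma.txt at hand, you can try the same steps on a few lines of text held in memory. The following is only a minimal sketch, written for Python 3 (so print is a function). process_line follows the listing above; process_lines and the sample text are hypothetical stand-ins introduced here for illustration and are not part of the original program.

```python
import string


def process_line(line, hist):
    """Add the words of one line to the histogram dict hist."""
    # Replace hyphens with spaces so "well-known" counts as two words.
    line = line.replace('-', ' ')
    for word in line.split():
        # strip and lower return new strings; the originals are unchanged.
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        # Create a new item or increment an existing one.
        hist[word] = hist.get(word, 0) + 1


def process_lines(lines):
    """Hypothetical stand-in for process_file: accepts any iterable of lines."""
    hist = {}
    for line in lines:
        process_line(line, hist)
    return hist


# Made-up sample text (an assumption for this sketch, not from emma.txt).
sample = [
    "The well-known cat sat.",
    "The cat, the CAT!",
]

hist = process_lines(sample)

print('Total number of words:', sum(hist.values()))   # 9
print('Number of different words:', len(hist))        # 5
```

Because "The", "the" and "CAT!" are stripped and lowercased before counting, the histogram maps 'the' and 'cat' to 3 each, and 'well', 'known' and 'sat' to 1 each.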
