21.07.2015 Views

GAWK: Effective AWK Programming


Chapter 13: Practical awk Programs

Fields], page 39) to pick out the individual words from the line, and the built-in variable NF (see Section 6.5 [Built-in Variables], page 110) to know how many fields are available. For each input word, it increments an element of the array freq to reflect that the word has been seen an additional time.

The second rule, because it has the pattern END, is not executed until the input has been exhausted. It prints out the contents of the freq table that has been built up inside the first action. This program has several problems that would prevent it from being useful by itself on real text files:

• Words are detected using the awk convention that fields are separated just by whitespace. Other characters in the input (except newlines) don't have any special meaning to awk. This means that punctuation characters count as part of words.
• The awk language considers upper- and lowercase characters to be distinct. Therefore, "bartender" and "Bartender" are not treated as the same word. This is undesirable, since in normal text, words are capitalized if they begin sentences, and a frequency analyzer should not be sensitive to capitalization.
• The output does not come out in any useful order. You're more likely to be interested in which words occur most frequently or in having an alphabetized table of how frequently each word occurs.

The way to solve these problems is to use some of awk's more advanced features. First, we use tolower to remove case distinctions. Next, we use gsub to remove punctuation characters. Finally, we use the system sort utility to process the output of the awk script. Here is the new version of the program:

    # wordfreq.awk --- print list of word frequencies

    {
        $0 = tolower($0)    # remove case distinctions
        # remove punctuation
        gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
        for (i = 1; i
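
The listing above breaks off in this copy, so the rest of the author's program is not shown here. As a rough illustration of the three fixes just described (tolower, gsub, and the system sort utility), the following sketch shows one way they might fit together. The wrapper function name wordfreq, the loop body, the END rule, the output format, and the sort options are all assumptions made for this example, not the original listing. It also assumes an awk that understands POSIX character classes such as [:alnum:], which gawk does.

```shell
#!/bin/sh
# Hypothetical wrapper (name chosen for this sketch): count word
# frequencies in the named files, or in standard input if none are given.
wordfreq() {
    awk '
    {
        $0 = tolower($0)    # remove case distinctions
        # remove punctuation; changing $0 makes awk split the fields again
        gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
        # assumed loop body: count each remaining field as one word
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }

    END {
        # assumed output format: word, a tab, then the count
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }' "$@" |
    # highest count first, ties broken alphabetically;
    # LC_ALL=C keeps the ordering independent of the locale
    LC_ALL=C sort -k2,2nr -k1,1
}

# Small demonstration on one line of text
printf 'The bartender poured. "Bartender!" the patron said.\n' | wordfreq
```

In the demonstration, "Bartender!" and "bartender" are counted as the same word, and the two words that occur twice (bartender and the) are listed ahead of the words that occur once. Doing the ordering in sort rather than inside awk keeps the awk program short, at the cost of depending on an external utility.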
