CS224N Problem Set 1 Lukas Biewald April 7 ... - Stanford AI Lab

CS224N Problem Set 1 Lukas Biewald April 7, 2004 

1. (a) cat WS96010* | sed '1,//d;//,//d' > wsjtext

Method 1: 

cat wsjtext | tr ' ' '\n' | sed '/^$/d' > tokenized1

Method 2: 

cat wsjtext | tr -sc 'A-Za-z' '\012' > tokenized2

(b) Method 1: 

i. Punctuation stays attached to the adjacent word; for example, “union,” is a single token.

ii. The dataset contains some odd location indicators in the text that are marked with the — character. Because of this, the token — appears repeatedly, although it is not really a word.

Method 2: 

i. Acronyms like U.K. are split into separate tokens “U” and “K”. 

ii. Hyphenated words such as “double-digit” are split into separate 

tokens. 
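The failure cases listed for both methods can be reproduced on a single line of text. The following is a minimal sketch: the function names tokenize1 and tokenize2 and the sample sentence are illustrative assumptions, not part of the original solution or the WSJ data.

```shell
# Method 1: split on single spaces, then drop empty lines.
tokenize1() { tr ' ' '\n' | sed '/^$/d'; }

# Method 2: turn every run of non-letters into a single newline.
tokenize2() { tr -sc 'A-Za-z' '\n'; }

# Hypothetical sample sentence, not taken from the WSJ corpus.
sample='The union, based in the U.K., saw double-digit growth.'

printf '%s\n' "$sample" | tokenize1   # keeps "union," and "U.K.," as tokens
printf '%s\n' "$sample" | tokenize2   # splits out U, K, double, digit
```

Method 1 leaves punctuation attached to words, while Method 2 discards it at the cost of breaking acronyms and hyphenated words apart.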

2. (a) wc wsj.words reports 728630 word tokens.

sort wsj.words | uniq -c | wc reports 23879 distinct word types.
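The token/type distinction in 2(a) can be checked on a toy file. This is a sketch under the assumption that the input has one token per line, as wsj.words does; the helper names count_tokens and count_types and the toy data are hypothetical.

```shell
# Number of tokens = number of lines (one token per line).
count_tokens() { wc -l < "$1" | tr -d ' '; }

# Number of types = number of distinct lines.
count_types() { sort "$1" | uniq | wc -l | tr -d ' '; }

# Hypothetical toy data: five tokens, four distinct types.
toy=$(mktemp)
printf '%s\n' the cat saw the dog > "$toy"

count_tokens "$toy"   # 5
count_types "$toy"    # 4
rm -f "$toy"
```

The tr -d ' ' step only strips the padding that some wc implementations add around the number.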

(b) sort wsj.words | uniq > words 

(c) i. sort wsj.words | uniq -c > word_counts 

ii. I used

grep 'word' word_counts

Common          Less Common      Rare
plant 134       twins 4          manifold 0
fish 1          electric 103     binomial 2
car 188         tank 14          synecdoche 0

iii. I used 

sort -n word_counts 

The most frequent words were: the, of, to, a, in, and, nmbr, s, for, that, is, on, it, prcnt, as. Note: some of these tokens, such as prcnt, were added to the text and are not real words. “s” probably comes from the questionable way the tokenizer handles apostrophes.
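A frequency ranking like the one in 2(c)iii can be sketched on toy data. Note that this variant sorts in descending order with sort -rn and truncates with head, rather than reading the tail of the ascending sort -n output used above; the helper name top_words and the toy data are hypothetical.

```shell
# Print the n most frequent lines of a one-token-per-line file,
# each prefixed by its count (the uniq -c output format).
top_words() { sort "$1" | uniq -c | sort -rn | head -n "$2"; }

# Hypothetical toy data: "the" x3, "of" x2, "cat" x1.
toy=$(mktemp)
printf '%s\n' the of the cat of the > "$toy"

top_words "$toy" 2   # lists "the" first, then "of"
rm -f "$toy"
```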


iv. tail -n +2 wsj.words > wsj.nextwords

tail -n +3 wsj.words > wsj.thirdwords

paste wsj.words wsj.nextwords > bigrams 

paste wsj.words wsj.nextwords wsj.thirdwords > trigrams 

sort bigrams | uniq -c | sort -n 

sort trigrams | uniq -c | sort -n 

Most common bigrams: 

the company 

to the 

on the 

in nmbr 

u s 

nmbr nmbr 

for the 

dllr million 

in the 

of the 

Most common trigrams: 

of the year 

the dow jones 

in new york 

dllr billion in 

nmbr to nmbr 

the end of 

one of the 

dllr million in 

nmbr nmbr nmbr 

the u s 
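The shift-and-paste construction used in 2(c)iv can be tried on a toy file. The sketch below is an assumption-laden variant: it pipes the shifted copy straight into paste instead of writing an intermediate file, and it drops the final row, whose second column is empty because the last token has no successor. The helper name bigram_counts and the toy data are hypothetical.

```shell
# Pair each token with its successor, drop the incomplete last row,
# then count the pairs and list the most frequent first.
bigram_counts() {
  tail -n +2 "$1" | paste "$1" - | sed '$d' | sort | uniq -c | sort -rn
}

# Hypothetical toy data: "of the" occurs twice, every other bigram once.
toy=$(mktemp)
printf '%s\n' of the of the year > "$toy"

bigram_counts "$toy" | head -n 1   # the "of the" row, with count 2
rm -f "$toy"
```

Trigrams follow the same pattern with a second shifted copy produced by tail -n +3.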

3. (a) cut -f1 word_counts | sort -nr | nl > freq 

(b) It’s hard for me to find a Windows machine in my lab, so I did everything in R. I hope this is OK.


(c) [Plot: logfreq against logrank, both axes running from 0 to 10.]

The linear regression of logfreq on logrank gave an intercept of 14.1 and a slope of -1.4. The data approximately show the kind of curvature suggested by Mandelbrot’s law.
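The original fit was done in R; as an illustration only, the same least-squares line can be computed with awk. This sketch assumes an input whose first two whitespace-separated fields are rank and frequency, and it uses natural logarithms, which may differ from the base used for the plot. The helper name fit_loglog and the toy data are hypothetical.

```shell
# Ordinary least squares of log(frequency) on log(rank).
# Prints "intercept slope", each rounded to two decimals.
fit_loglog() {
  awk '{ x = log($1); y = log($2); n++
         sx += x; sy += y; sxx += x * x; sxy += x * y }
       END { slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
             intercept = (sy - slope * sx) / n
             printf "%.2f %.2f\n", intercept, slope }' "$1"
}

# Hypothetical toy data following an exact Zipf curve, freq = 1000 / rank.
toy=$(mktemp)
awk 'BEGIN { for (r = 1; r <= 5; r++) printf "%d %.6f\n", r, 1000 / r }' > "$toy"

fit_loglog "$toy"   # intercept close to log(1000), slope close to -1
rm -f "$toy"
```

A slope steeper than -1 together with visible bending at the low ranks is what a single straight line cannot capture.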

