
Part of Speech Tagging of spoken Dutch using Neural Networks

Egwin Boschman
e.boschman@student.utwente.nl

ABSTRACT

This paper reports on experimental research into the use of neural networks for Part of Speech tagging of spoken Dutch. The findings show that a neural network can very well be used for tagging. The best network constructed in this paper reaches an overall performance of 97.3% recognition of the CGN corpus, a 10 million word corpus of spoken Dutch. Recognition of small corpora also gives good performance. One minor disadvantage of neural networks is that the best shape of a network partially depends on the size of the corpus.

Keywords

Part of Speech Tagging, neural network, Corpus Gesproken Nederlands

1. INTRODUCTION

In the 16th century Francis Bacon coined his famous phrase: knowledge is power. His words are even truer today. With the enormous amount of text available, knowledge is increasingly difficult to harvest. Many studies have been done on retrieving information automatically. One of the first steps of this automated retrieval is the analysis of the text. This paper covers a small part of that analysis: Part of Speech (PoS) tagging, which means tagging the words with their correct word classes.

Figure 1. A tagged sentence.

There are several ways of automatic tagging, which can be roughly divided into two main categories: rule-based and stochastically based taggers. A rule-based tagger tries to tag sentences according to handmade rules, for example the rule that an article is followed by a noun. This is only a simplified example; the complete rule must of course also allow adjectives in between. Building these rules requires much time and knowledge of the language. Stochastically based taggers do not use such rules; they compute the most likely tags for each word by learning from lots of tagged sentences. These sentences, of course, must be tagged first, so lots of manpower is required. One example of a stochastically based tagger is a tagger based on hidden Markov models (HMM). It uses lots of data to compute and store lists of

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission.

4th Twente Student Conference on IT, Enschede, 30 January 2006
Copyright 2006, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science

occurrences of words and combinations of tags. All these tagged sentences together are called the corpus. In this paper another stochastically based tagger is used: a tagger with a neural network.

Neural networks were a big hype in the nineties and were used in many research areas (such as character recognition and expert systems). In this period several studies into new ways of Part of Speech tagging also used neural networks. In the last few years this interest has diminished.

For some specific areas (like jurisprudence) the existing taggers are not well suited. Building new taggers for these problems is expensive and needs a lot of manpower. But several papers [nak90, sch94] showed that a small training corpus is enough for fairly good results, so reintroducing neural networks could perhaps save a lot of money and manpower. This paper tries to answer the question whether a tagger based on neural networks can achieve high accuracy with small and large corpora built from the CGN corpus.

2. NEURAL NETWORKS

A computer-based neural network finds its origin in biology. It is a simple interpretation of how biological neurons communicate with each other. Just like its biological counterparts, a computer neuron processes its input and passes its reaction on to other neurons. A biological neuron will fire if its inputs reach some threshold (bias) and will fire more when stimulated more. The computer neuron works almost the same; this is implemented with a so-called bias neuron. Describing the precise working is beyond the scope of this paper; more information can be found in [nis03].

Figure 2. A neural network with one hidden layer. The input layer receives the features of word t-1 and word t (W_t-1 feat1, W_t-1 feat2, W_t feat1, W_t feat2); the values pass through a hidden layer to the output layer, whose neurons represent Tag1 and Tag2.

In the last ten years a couple of researchers have studied the use of neural networks for PoS (Part of Speech) taggers. Although the approaches differ quite a lot, one thing is always the same: they all use a feed-forward network with back propagation. A feed-forward network consists of a couple of input neurons, which pass their values on to a hidden layer, and every layer feeds its values into the next one, until it reaches the output layer. The next equation shows the formulas. The input of the n-th neuron of the j-th layer is the sum of the outputs of the layer before (layer j-1) times the weights between these neurons. Afterwards the standard sigmoid function is used to compute the output of this neuron. Neurons of the input layer do not use these formulas; their output is the input of the network.

input_{j,n} = sum_{i=1}^{#neuronsLayer_{j-1}} output_{j-1,i} * weight_{j,n,i}

output_{j,n} = 1 / (1 + e^{-input_{j,n}})

Equation 1. Computation of the output of a neuron

Because of this output function, every neuron in the output layer fires between 0 and 1. The neuron with the highest output value is the winner and its number represents the number of the tag. When the training of the network starts, all the weights in the network get a random value, so the outcome will probably not match the desired output. During training, the network computes the output (feed forward); then, using the desired output, the weights are adjusted backwards, starting with the weights on the output layer and finishing with the weights between the input layer and the first hidden layer (back propagation). The network learns how to react properly to input by training on all the data several times in a row. The power of neural networks is the fact that slightly different inputs will result in the same output for samples that are almost the same [her91].
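The feed-forward pass of Equation 1 can be sketched in a few lines of Python (a minimal illustration under the paper's definitions, not the implementation used in this research; the bias neuron is omitted for brevity):

```python
import math
import random

def sigmoid(x):
    # Standard sigmoid output function (Equation 1).
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(weights, inputs):
    # weights[j][n][i] is the weight between neuron i of layer j-1
    # and neuron n of layer j, as in Equation 1. Input neurons
    # simply pass their values on.
    outputs = list(inputs)
    for layer in weights:
        outputs = [sigmoid(sum(o * w for o, w in zip(outputs, neuron)))
                   for neuron in layer]
    return outputs

# A tiny 4-input, 3-hidden, 2-output net with random initial weights,
# as at the start of training.
random.seed(0)
net = [[[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)],
       [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]]
out = feed_forward(net, [0.2, 0.8, 0.0, 0.5])
winner = max(range(len(out)), key=lambda n: out[n])  # number of the tag
```

Every output lies strictly between 0 and 1 because of the sigmoid, and the highest-firing output neuron is taken as the tag number.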

2.1 Layers of the Network

Some of the studies done by others differ in the use of hidden layers. One of the earliest studies [nak90] contained networks with two hidden layers, which made the networks quite complex. Later studies by Schmidt [sch94] and Marques [mar01] showed that such complex networks do not pay off. Marques compared a network without a hidden layer (only an input and output layer) to a network with a hidden layer and concluded that the first one was easier to train and reached a higher accuracy.

2.2 Input Features

The main difficulty with a neural network is the question of what to use as input. Boggess [bog92] tried several features, like mapping the last letters of the words onto the input neurons, and another method of representing all the words with a number and feeding this number in binary to the input neurons. All these methods seemed useless: the recognition was bad and results differed greatly between test runs. Boggess concluded his approach was not the right one. The only workable way found is feeding the network with a number of tags. This is done with a sliding window: the network computes one tag at a time, and for every word in the sentence the network is fed with zero or more tags back, the current word and zero or more words ahead. Figure 2 shows a network with two tags and a sliding window of 2. The current word t and the previous word t-1 are fed, so the next step will be to feed word t+1 as current word and word t as previous word. This is a very small network; with a sliding window of six tags and over 70 different tags, the input layer will contain over 400 neurons. Both Schmidt and Marques use a lexicon that stores the relative frequencies of the tags per word. These relative frequencies are fed to the input neurons. Feeding the network three words requires three times the number of tags in the tagset as input neurons. The approaches differ in the size of the sliding window: Marques uses a window of three words and Schmidt a window of six words. Some [nak90] use the results found for word n as input for word n+1, so instead of the relative frequencies of the tags of the words before, the network gets the computed values of the tags before. This is more difficult to train, but can improve the recognition: knowledge of the previous tags improves the recognition of the tags of the next words.
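A sliding-window input vector as described above could be assembled as follows (a sketch; the function name and the zero-padding at sentence boundaries are assumptions, not taken from the cited papers):

```python
def window_input(sent_freqs, i, back=3, ahead=2, n_tags=72):
    # sent_freqs[k] is the lexicon's relative-frequency vector
    # (length n_tags) for word k; positions outside the sentence
    # are padded with zero vectors.
    pad = [0.0] * n_tags
    vec = []
    for k in range(i - back, i + ahead + 1):
        vec.extend(sent_freqs[k] if 0 <= k < len(sent_freqs) else pad)
    return vec

# A 3x2 window over a 72-tag tagset needs (3 + 1 + 2) * 72 = 432
# input neurons, matching the "over 400 neurons" mentioned above.
freqs = [[0.0] * 72 for _ in range(5)]
print(len(window_input(freqs, 2)))  # 432
```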

2.3 Output of the Network

Because the network processes one word at a time, the output is also one tag at a time. This can be done in two ways. One method is to give every output neuron its own tag, so there are as many output neurons as there are tags in the tagset [sch94]; the output neuron with the highest output value indicates the most likely tag. Another way is to number the tags and let the output neurons hold the binary representation of this number [bog92]. The second method saves a lot of output neurons: for example, 89 tags can be numbered with 7 bits, giving 7 output neurons, each representing one bit when giving an output above some threshold. But there are some major disadvantages to this coding. With the first method there is always a neuron firing the most, so the network chooses between the likely candidates. With the second method a small error may result in a totally different tag when one neuron is just a bit above threshold. It could very well give some kind of logical AND of the most likely tags, yielding a totally different tag number. This makes it much more difficult to train.
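The two output codings can be contrasted in a short sketch (hypothetical decoding functions; the bit order in the binary coding is an arbitrary choice here):

```python
def decode_one_hot(outputs):
    # One neuron per tag [sch94]: the highest-firing neuron wins,
    # so there is always exactly one answer.
    return max(range(len(outputs)), key=lambda n: outputs[n])

def decode_binary(outputs, threshold=0.5):
    # Binary coding [bog92]: each neuron holds one bit of the tag
    # number. One neuron barely above threshold flips a bit and
    # yields a completely different tag.
    return sum(1 << b for b, o in enumerate(outputs) if o > threshold)

print(decode_one_hot([0.1, 0.7, 0.4]))   # tag 1
print(decode_binary([0.9, 0.1, 0.8]))    # bits 0 and 2 set: tag 5
print(decode_binary([0.9, 0.51, 0.8]))   # bit 1 barely fires: tag 7
```

The last two calls show the brittleness of the binary coding: a neuron moving from 0.1 to 0.51 changes the answer from tag 5 to tag 7.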

2.4 Final Remarks on Neural Networks

The term learning rate occurs in the field of neural networks as well as in the field of Part of Speech tagging. In the field of neural networks it means how much the weights of the network can change in one training step. A large learning rate results in quick learning, but will not result in an overall best network (a local optimum). A small learning rate takes longer to train, but will more easily end in a stable, good network. Therefore a moderate learning rate is used to make the network more stable. In PoS tagging the learning rate is the relation between the size of the corpus and the accuracy the tagger achieves: it shows how much the accuracy increases when the amount of training data grows. Another issue is over-learning. A neural network is powerful because it is able to compute answers to unseen samples: if a sample is more or less the same as another sample it will react the same, so the network will produce good results on slightly different input. But if the network is too large (too many hidden layers, too many neurons in the hidden layer), it will learn the samples instead of the bigger picture. A final remark must be made about the computation time. The larger the net, the more weights it contains, and the longer it takes to compute an answer. Fortunately the computers of today are powerful enough to process lots of data. Some years ago the computing speed would have made it impossible to train as many nets as today: [sch94] reports needing one day of computation on a Sparc 10 for training on a 2 million word corpus, whereas today, with a relatively slow computer (Athlon 2500+), one is able to test 1 million words with a large network in less than 2 minutes.

3. RESEARCH QUESTIONS

Considering the problems mentioned in the introduction, the main question will be:

Will the promising results for other languages (good tagging with small corpora) also hold for spoken Dutch Part of Speech tagging using the CGN corpus?


3.1 Derived Research Questions

The main question cannot be answered without a few other questions.

• Structure. Are the structures of the neural networks found in the literature [nak90, sch94] efficient for spoken Dutch texts?

• Input features. What are the best features for input neurons? What size of the sliding window gives the best result?

• Accuracy. How does a neural network approach perform compared to other taggers? In particular, how does it perform compared to a HMM tagger and a Support Vector Machine tagger [lui05]?

• Learning rate. Many reports, like the report of Marques [mar01], claim that neural networks have a very good learning rate; will these results hold for the CGN corpus?

4. LEXICON AND CORPUS

Like all stochastic taggers, a neural network needs a lexicon that stores, per word, a list of possible tags with their relative frequencies. There are two ways to build this lexicon:

• Make the lexicon as large as possible: use external lexicons with relative word frequencies and use all the tagged texts available. This should give the best list of occurrences of words.

• Use only the words and tags from the training corpus. This gives a less complete lexicon, but the results give a better view of the performance of the tagger instead of the quality of the lexicon [mar01].

Marques [mar01] shows that the first method gives the best results with very small subsets (with only 10.000 words the results are almost at a maximum). This is obvious of course: the second method did not see that many words, so its lexicon will not have a well-built list of relative frequencies.
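The second way of building the lexicon (training-corpus words only) can be sketched as follows (an illustrative helper, not the code used in this research):

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_sentences):
    # Count (word, tag) pairs in the training corpus and convert
    # the counts into relative frequencies per word.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: {tag: c / sum(tags.values()) for tag, c in tags.items()}
            for word, tags in counts.items()}

lexicon = build_lexicon([[("een", "LID"), ("huis", "N1")],
                         [("een", "LID"), ("een", "TW1")]])
print(lexicon["een"])  # relative frequencies: LID 2/3, TW1 1/3
```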

4.1 The Corpus Gesproken Nederlands

The CGN (Corpus Gesproken Nederlands) is a large corpus of spoken Dutch [cgn04]. The whole corpus is around 9.000.000 words, divided into 15 different sections, listed in Table 1.

The choice of using the CGN has some implications:

• The CGN corpus is built out of normal spoken sentences, so not every sentence in the corpus will be correct Dutch, making tagging a bit harder.

• The CGN corpus is filled with Dutch spoken by people from the Netherlands and Belgium. There are some differences between those two dialects; sometimes the words are put in a different order, for example in the Netherlands people say "vast en zeker" and in Belgium people use it the other way around, "zeker en vast" (it means "for sure"). One third of the corpus is Flemish, the rest Dutch spoken by people from the Netherlands. No tests are done with the different dialects.

The data is distributed over eleven files, putting the first line in file one, the second in file two, and so on. This spreads the data equally over the eleven files. All testing is done on file one, called set0. Because of the large size of the corpus (all files contain over 100.000 sentences) most of the experiments are done with just one or two of the files for training. This will be mentioned for each test.
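The round-robin distribution over eleven files can be expressed in a few lines (a sketch of the splitting scheme described above; names are illustrative):

```python
def round_robin_split(lines, n_files=11):
    # Line k goes to file k mod n_files, so the data is spread
    # equally over the files; the first file is the test set, set0.
    files = [[] for _ in range(n_files)]
    for k, line in enumerate(lines):
        files[k % n_files].append(line)
    return files

parts = round_robin_split(list(range(23)))
print(parts[0])  # [0, 11, 22]: every eleventh line lands in set0
```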

Table 1. Overview of the corpus

    Data                                           Size (in words)
A   Face-to-face conversations                     2626172
B   Interviews with teachers of Dutch              565433
C   Phone dialogues (recorded at platform)         1208633
D   Phone dialogues (recorded with minidisc)       853371
E   Business conversations                         136461
F   Interviews & discussions recorded from radio   790269
G   Political debates, discussions & meetings      360328
H   Lectures                                       405409
I   Sport commentaries                             208399
J   Discussions on current events                  186072
K   News                                           368153
L   Commentaries on radio & TV                     145553
M   Masses, ceremonies                             18075
N   Lectures, discourses                           140901
O   Texts, read aloud                              903043

4.2 The Tagset

The CGN corpus is tagged with a large tagset [Eyn04] of 316 tags. This tagset can easily be simplified into a smaller tagset by grouping tags. There are three main tagsets:

• The large tagset. This is a really large tagset compared to the Brown tagset (87 tags) and the Penn Treebank tagset (48 tags). Because of its size many tags are rarely used. Although the CGN corpus is a large corpus, some tags occur only once, making the training and recognition of these tags very difficult.

• The medium tagset. This reduced set of 72 tags is derived from the larger set of 316 tags by grouping tags that look alike. For example, the reduced set makes no difference between the seven articles described in the original set, but it does not group all the verbs into one tag: it still distinguishes, for example, normal verbs from auxiliary verbs.

• The small tagset. With this tagset the number of tags is reduced even further, leaving only the twelve main tags, like verb and noun.

At the start of this project the first intention was to use the tagset with twelve tags, but after a few tests this tagset seemed to be too small: the base tagger (simply returning the most used tag) produced a result of over 95%, which leaves no room for improvement. Using the 72-tag tagset the base tagger recognized around 91%, which leaves enough errors to make a better tagger. This larger tagset splits the main categories into more precise tags; for example, it splits the tag VG (conjunction) into subordinating conjunction and coordinating conjunction. More about the CGN tagset can be found in [Eyn04].
[Eyn04].


Even in this 72-tag tagset there are quite a number of tags that are hardly used. VNW10, N8, VNW18, VNW4, WW5, VNW9 and ADJ8 occur less than 100 times, and half the tagset occurs less than 10.000 times. With over 10.000.000 words in the corpus, their occurrences are very rare. In total these 36 tags cover 1.2% of the whole corpus.

Table 2. The different tags of the 72-tag tagset

Tag numbers   Part of Speech tag   Tags in CGN corpus
1...8         Noun                 N1, N2 ... N8
9...21        Verb                 WW1, WW2 ... WW13
22            Article              LID
23...49       Pronoun              VNW1 ... VNW27
50, 51        Conjunction          VG1, VG2
52            Adverb               BW
53            Interjection         TSW
54...65       Adjective            ADJ1 ... ADJ12
66...68       Preposition          VZ1, VZ2, VZ3
69, 70        Numeral              TW1, TW2
71            Punctuation          LET
72            Special              SPEC

Table 3 shows the most frequently occurring tags. Together these 12 tags cover over 70% of the whole corpus. Notice that 3% of the corpus is tagged with SPEC, meaning 3% of the words in the corpus are special cases. This tag is used for unfinished words, noises like coughing, and words from foreign languages. Most of these cases occur because the CGN corpus is built out of spoken Dutch sentences.

Table 3. Overview of the most frequently used tags

Number   Tag    %      Number   Tag    %
71       LET    11.2   22       LID    5.3
52       BW     10.4   23       VNW1   4.9
53       TSW    7.8    50       VG1    4.0
1        N1     7.4    62       ADJ9   3.4
9        WW1    6.7    72       SPEC   3.0
66       VZ1    6.7    12       WW4    2.6

5. EXPERIMENT SETUP

Reading about neural networks, the task looks very easy, but it involves a lot of choices. In this section all the different choices are presented.

5.1 The Trained Networks

The questions mentioned in section 3 require lots of different tests. This section shows these different kinds of tests, split into a few categories:

• Testing the window size. It is important to know what the influence is of increasing the window size. A window size of one is the normal base tagger. Knowing more words before and after the current word will improve the accuracy, but to what extent?

• Testing the influence of a hidden layer. When adding a hidden layer the network will be able to see more complex relations. Marques [mar01] concluded that adding a hidden layer did not improve the recognition. A few networks will be tested to verify his findings.

• Testing different kinds of input. As mentioned in section 2.2, a neural network works best when it is fed with tags. But which tags give the best performance? When the tagger uses a window of two words back and two words ahead and it is tagging the third word, should it use the relative frequencies of the tags found in the lexicon or should it use the computed tags? And when using the second method, will it make more errors? After all, if it makes a mistake with the first word, it will use a wrong tag to predict the next. A few tests are done to determine if there is a best way.

• Testing the learning rate. One of the main questions is: will a neural network perform well with a small corpus? To test this, different kinds of networks with differently sized corpora are used.

• How to handle unknown words. Unknown words are a problem for all existing taggers; some tests must determine how to handle them.

• Building the best network. Of course one always wants to find the best network possible. During these tests all the results of the other tests are used to construct the network with the highest accuracy possible.

Because of the many parameters, not every network suggested by the questions above can be built: the tests above contain too many parameters to build all possible combinations.

5.2 Comparing the Results with Other Taggers

Before any conclusions about the performance can be drawn, there must be some comparison with other taggers. During these tests two other taggers are used as reference. The first one is the base tagger: it stores all the words with their tags from the training corpus, and afterwards all the words of the test corpus are tagged with the most frequent tag for that word. (Example: the Dutch word "een" can be an article (meaning the word a) or a numeral (meaning the word one); in most of the cases it is an article, so it is tagged as article.) The second tagger is a HMM (Hidden Markov Model) tagger with bigrams. It stores tuples of tags and their frequencies and tries to match these with the test set. Only the bigrams are used; the complexity (memory use, computation time) of larger n-grams grows rapidly, which makes them a research area of their own.
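The base tagger used as reference can be sketched in a few lines (illustrative only; unknown-word handling is left out):

```python
from collections import Counter, defaultdict

def train_base_tagger(tagged_sentences):
    # Store, per word, the most frequent tag seen in the
    # training corpus.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0]
            for word, tags in counts.items()}

tagger = train_base_tagger([[("een", "LID"), ("huis", "N1")],
                            [("een", "LID")],
                            [("een", "TW1")]])
print(tagger["een"])  # LID: the article reading outweighs the numeral
```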

6. RESULTS

According to the literature, the best performance was obtained with a network with a window size of six: three words back, the word itself and two words ahead [sch94]. This will be the base of many of the tests.

6.1 Test 1: The Window Size

The first target is to verify the results found by [mar01] and [sch94], who claimed that the best network has a window size of six: the relative frequencies of the tags of the three words back, of the word itself, and of the two words ahead. In the next tests the notation bxa is used, where b is the number of tags back and a the number of tags ahead; so this network is called 3x2. The figure below shows 7 different networks. These networks are trained with 1 file from the training set and tested with the test file, set0. The X-axis shows the number of times all the samples are trained.


Figure 3. Influence of the window size (accuracy in %, from 95.8 to 96.3, plotted against the number of training passes for the networks 1x1, 1x4, 2x2, 3x2, 3x3, 4x1 and 4x4, with the base tagger as reference)

The difference between the nets seems pretty obvious, but notice that all the nets perform within a range of only 0.4%. The base tagger reaches an accuracy of 91.8%, so all the neural networks perform much better. (The results of the base tagger cannot be seen in the figure.) Because of the small differences between the nets, the next tests use the network preferred by the literature [sch94].

Figure 4. A long run with a 3x2 network (accuracy in %, from 96.46 to 96.6, after 1, 2, 3 and 4 million training samples)
Figure 4. A long run with a 3x2 network<br />

Figure 4 shows the influence of a longer run. This time all the data is used, and it is fed one, two, three and four times to the network. Feeding the network multiple times with the data does improve its recognition, but the graph flattens within three to four cycles.

6.2 Test 2: Hidden Layers

To test the influence of a hidden layer a bit more data is used, because the networks are more complex. This time 2 files of the training data are used, and each network is fed with 400.000 lines from this data.

Table 4. The influence of a hidden layer

Shape                    Result
3x2, hidden layer 30     95.69
3x2, hidden layer 40     95.91
3x2, hidden layer 60     96.17
3x2, hidden layer 73     96.10
3x2, hidden layer 100    95.98

Because of the complexity of the net, which may result in suboptimal networks, this test was done twice. The results of the two runs did not differ much (less than 0.2%); the average results are presented in Table 4. These results show that a small hidden layer has a negative effect on the results. The best network seems to be the one with 60 neurons; nets with larger hidden layers possibly suffer from over-learning.

6.3 Test 3: What Input To Use

While trying to build the best network possible (see section 6.7), one question was what input data to use. The data can be fed in different ways:

• Method 1. Always feed the relative frequencies of the tags possible for each word. A 3x2 network would be fed with the relative tag frequencies of the three words back, of the word itself and of the two words ahead.

A positive effect is that a wrong recognition of one word does not affect the next. A negative effect is that the tags already found are not used: if the tagger is correct, using its answer is better than using the relative tag frequencies of that word.

• Method 2. Feed the network with the relative tag frequencies of the current word and the two words ahead, but use the tags already found for the words back. During training the desired tags of the words back can be fed.

This method has exactly the opposite effects: a wrong recognition affects the next word, and a good prediction strengthens the next prediction. Another consideration is that a neural network never really fires at one output neuron alone; there is always a second best. Using all the outputs for the last word gives the network a chance to benefit from this. But here lies another problem: during training only the desired tag is fed (can one think of a second best when one knows the correct answer?), so during testing the network will see data it has never seen before. Feeding only the most strongly firing tag, on the other hand, increases the impact of an error on the next tag.

• Method 3. A mixture of both: feed the network with the relative frequencies and the outcome of the last recognition. During training it will see the influence of the correct tag of the last word, but also the existence of a second best (the other possible tags), and it can learn from this extra information.
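As a concrete illustration, the three input encodings above could be assembled along these lines. This is a minimal sketch, not the paper's actual implementation: the toy tagset size, the lexicon layout and all function names are assumptions.

```python
import numpy as np

TAGSET = 5  # toy tagset size; the paper uses the 72 CGN tags

def tag_freqs(word, lexicon):
    """Relative frequencies of the tags seen for `word` in the lexicon."""
    v = np.zeros(TAGSET)
    counts = lexicon.get(word)
    if counts is None:            # unknown word: equal spread (refined in section 6.5)
        v[:] = 1.0 / TAGSET
        return v
    total = sum(counts.values())
    for tag, n in counts.items():
        v[tag] = n / total
    return v

def build_input(words, i, outputs, lexicon, method):
    """Input vector for word i of a 3x2 window (3 words back, the word, 2 ahead).

    method 1: tag frequencies for every position
    method 2: the network's earlier outputs for the words back
    method 3: both, concatenated, for the words back (the 'mixture')
    """
    pad = np.zeros(TAGSET)
    parts = []
    for k in range(i - 3, i):                      # the three words back
        freqs = tag_freqs(words[k], lexicon) if k >= 0 else pad
        out = outputs[k] if 0 <= k < len(outputs) else pad
        if method == 1:
            parts.append(freqs)
        elif method == 2:
            parts.append(out)
        else:                                      # method 3: mixture
            parts.append(np.concatenate([freqs, out]))
    for k in range(i, i + 3):                      # the word itself and two ahead
        parts.append(tag_freqs(words[k], lexicon) if k < len(words) else pad)
    return np.concatenate(parts)
```

During training, `outputs` would hold the desired tags of the words back as one-hot vectors (methods 2 and 3); during tagging it holds what the network actually produced.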

Table 5 shows the results of the three methods described above.

Table 5. Testing different kinds of input

Input used                  After 1 run   After 2 runs
Tags only                   96.27         96.55
Computed results only       96.37         96.50
Tags and computed results   96.40         96.55

These results may seem a bit difficult to explain. First of all, notice that the differences are not that large. After the first run the network has clearly not finished learning, and it apparently finds it a bit harder to figure out which tag to use, so tags only produces slightly worse output. Using the computed output only lets the network produce results more easily, so after one run it works better than the first method. After the second run the roles have reversed: the network has learned how to handle multiple tags on the input, and the early good results of using only the computed results are punished with more errors after two runs. The mixture shows the best of both worlds. Although the differences are small, from this point on the mixture input is used, mainly because it seems to get better results with less data.

6.4 Test 4: Different Size of Corpora

In the literature all papers mention the good learning rate of Part of Speech taggers based on neural networks [mar01]. Because the CGN corpus is rather large, it has to be split into a few subparts to test the learning rate. The tests are done with three small parts from one training file: a part containing the first 100 lines, one with 1,000 lines and a third with 10,000 lines. The training data all come from face to face conversations, so the networks are tested with the face to face conversation part of the test files. Below are the results for the 100 line corpus.

[Figure: recognition (65-75%) over training, for the base tagger and the 1x1, 2x2, 3x2, 3x3, 4x4, 3x3x60 and 3x3x100 networks]

Figure 5. Learning a 100 lines corpus

It is clear that this training set is too small: the training data had to be fed 20 times (2,000 lines) to get near the base tagger. From that point on the program produced output every 2,000 lines, to get the best tagger possible. After around 20,000 lines of training (200 passes over the 100 line corpus) the network stopped learning (the curve flattens) and the tests were stopped. The figure also shows clearly that adding a hidden layer results in a lower performance.

[Figure: recognition (81-85%) over training, for the base tagger and the 1x1, 2x2, 3x2, 3x3, 4x4, 3x3x60 and 3x3x100 networks]

Figure 6. Learning a 1000 lines corpus

Next the 1,000 line corpus is trained. Every 1,000 lines the program produces output, and after around 20,000 trained lines the output flattens, showing that the network has reached its maximum. This figure shows clearly that all the networks perform better than the base tagger. It also shows that with such a small corpus the choice of network is irrelevant; only the ones with hidden layers do not perform well, they keep bouncing around.

[Figure: recognition (91-92.4%) over training, for the base tagger and the 1x1, 2x2, 3x2, 3x3, 4x4, 3x3x60 and 3x3x100 networks]

Figure 7. Learning a 10000 lines corpus

The next test used a 10,000 line corpus. Again the program produces output each time the whole corpus has been trained once, so after every 10,000 lines. After 200,000 trained lines (20 passes over this corpus) the output flattens; the best network, a 3x2 network, reaches a recognition of around 92.2%. The base tagger reaches only 88.4% and cannot be seen in the figure.

These three tests make clear that a corpus can be too small for a good tagger. A training set of 100 lines is way too small; above 1,000 lines the networks perform significantly better than a base tagger. With small corpora the use of a hidden layer is a very bad choice: the results do not stabilize quickly, and when they do stabilize they are significantly below the results of the networks without a hidden layer. This may be a result of over-learning; the network learns the specific sentences instead of the big picture.

To complete the tests of the influence of larger corpora another test is done. The search for the best network (section 6.7) showed that a hidden layer can improve the recognition, so this test is done with a 3x2 network with a hidden layer of 60 neurons, because of its good results (see section 6.2). To remove the effect of a better lexicon, the following tests are done with a lexicon built from all the training data. First a network is trained using two files (a 200,000 line corpus); this network is fed 800,000 lines, resulting in a recognition of 96.9%. Afterwards the network is copied, making two identical nets. Network1 is fed another 400,000 lines from the same corpus. Network2 gets a new corpus consisting of the next two files (again 200,000 lines) and is fed 400,000 lines from it. Then Network2 is copied in turn, into Network2a and Network2b. Network1 is fed again with 400,000 lines from the first corpus, Network2a again with the second corpus, and Network2b with a third corpus of another two files (again 200,000 lines). So every network is trained with exactly the same number of lines. Table 6 shows the results.

These results show clearly that increasing the corpus does not improve the recognition much: using a corpus three times larger improves the recognition by only 0.03%. This small improvement is nevertheless used in section 6.7 to try to build the best network.

Table 6. Influence of a larger corpus

Network     Training set                                    Recognition
Network1    1.6 million lines from a 200,000 line corpus    96.97%
Network2a   1.6 million lines from a 400,000 line corpus    96.99%
Network2b   1.6 million lines from a 600,000 line corpus    97.00%

6.5 Unknown Words

Giving unknown words the correct tags is a study on its own, and it is not the objective of this paper. To get some idea of the kinds of unknown words, the training corpus was split into ten parts. Each time a lexicon was built from nine parts, and the tenth part was used to count the occurrences of the tags of the unknown words. This test was done ten times, every time using another subset to count the unknown word tags. The test set itself is not used at all; it remains the testing candidate and should not be used to optimize the results.
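The ten-fold counting procedure described above can be sketched as follows. This is an illustration only; the data layout and the function name are assumptions, not the paper's code.

```python
from collections import Counter

def unknown_tag_distribution(parts):
    """Relative tag frequencies of unknown words, estimated by cross-counting.

    `parts` is the training corpus split into equal lists of (word, tag)
    pairs; each part in turn plays the held-out role, and the tags of its
    words that do not occur in the other parts are counted.
    """
    counts = Counter()
    for i, held_out in enumerate(parts):
        # lexicon of word forms built from the other parts
        known = {w for j, p in enumerate(parts) if j != i for w, _ in p}
        counts.update(tag for word, tag in held_out if word not in known)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}
```

Feeding these relative frequencies for unknown words (instead of an equal spread) is what the rest of this section evaluates.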

The opposite is true for N3 and N5: they do not occur very often in the whole corpus (less than 2%), but they are important tags for the unknown words.

Now the network can be fed with these relative frequencies instead of an equal weight of 1/72 for every input neuron. This should help the network with the recognition of unknown words. A small test showed that this way of handling unknown words is indeed better than giving a 1/72 weight to all the inputs: using a 3x2 network and 10,000 training sentences, the results were 92.4% for the weighted distribution and 91.7% for the equal spread. The next tests will all be done with these relative frequencies instead of an equal weight for every input neuron.

6.6 Comparing Neural Networks with the base tagger and the HMM tagger

According to the results in the previous sections a tagger can be built with the use of a neural network, but how well does it perform compared to other taggers? In this test the results of the neural network trained in the preceding tests are plotted against the results of a base tagger and an HMM tagger. The HMM tagger is trained with Simple Good-Turing context smoothing and Witten-Bell lexical smoothing. The neural network used is a 3x2 network.
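For reference, Witten-Bell lexical smoothing reserves probability mass for unseen word/tag pairs in proportion to the number of distinct word types seen with a tag. The sketch below is the standard textbook form of the estimate, not necessarily the exact implementation used here; all names are assumptions.

```python
def witten_bell(c_wt, n_t, types_t, vocab_size):
    """Witten-Bell smoothed P(word | tag).

    c_wt: count of this word with this tag
    n_t: total token count of the tag
    types_t: number of distinct word types seen with the tag
    vocab_size: total number of word types in the vocabulary
    """
    unseen = vocab_size - types_t          # word types never seen with this tag
    if c_wt > 0:
        return c_wt / (n_t + types_t)
    return types_t / (unseen * (n_t + types_t))
```

The seen probabilities sum to n_t/(n_t + types_t), and the reserved mass types_t/(n_t + types_t) is spread evenly over the unseen types, so the distribution still sums to one.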

[Figure: occurrence (%, 0-40) vs. tag number (0-72), for each of the ten subsets (1-10) and their mean (GEM)]

Figure 8. Spreading of the unknown tags

Every part added roughly 8,900 new unknown words, and the frequencies of the tags over these sets did not differ much. (Notice that in the figure the ten dots lie so close together that they look like a single dot.) All the tags are numbered with the numbers mentioned in section 4.2. Only the tags with an occurrence of at least 2% are used to predict unknown words; these nine tags cover over 90% of the words.

Table 7. The largest frequencies of the tags of unknown words

Tag Nr   Tag    Freq     Tag Nr   Tag    Freq
1        N1     33%      15       WW7    2.5%
3        N3     14%      54       ADJ1   4.0%
5        N5     10%      62       ADJ9   2.4%
9        WW1    2.2%     72       SPEC   19%
12       WW4    2.3%

Notice the differences between this table and Table 3. Although TSW and BW occur quite a lot (see Table 3) and both are open word classes, they do not occur in the table above.

[Figure: performance (60-100%) vs. corpus size (100 to 1,000,000 lines, log scale), for the base tagger, the HMM tagger and the neural network]

Figure 9. Results of a neural network, a base and an HMM tagger

One can clearly see that the neural network gives much better results. Especially with small corpora the differences are over five percent. So the findings of Marques [mar01] and others also hold for the CGN corpus.

6.7 Building the best network

After all these tests, all the information is used to build the best neural network possible: once more five of the most promising networks are trained, and the results of the other tests are used to improve the best network found so far. Surprisingly, the best net found after many tests was a network with a hidden layer. With the use of all the improvements of the previous sections and the whole corpus, the following results were achieved.

With all the data available a larger network seems to perform slightly better, as mentioned in [sch94]. But the results differ more than one would expect from the report of Schmid [sch94]: the results after adding a hidden layer are around 0.4% better, which is a rather large gain.

[Figure: recognition (96.7-97.4%) over training, for the 3x3, 4x4, 3x3x60, 4x4x60 and 3x3x100 networks]

Figure 10. Search for the best network

Figure 10 shows that a 3x3 neural network with a hidden layer of 100 neurons got the best recognition. This network is used to compare the results of a neural network with a Support Vector Machine based tagger.

Table 8 shows the results for the different parts of the corpus. All the test corpora are parts of set0, the test file. Every test is done with the first 10,000 lines of each part, if possible. (The corpus contains for example a part with masses and ceremonies, but there are only 18,075 such lines in the total corpus. These lines are distributed over the 11 files, meaning there are only around 1,500 lines of this part in set0.)

Table 8. Results of the best network trained

Data                                           Results NN   Results SVM
Face to face conversations                     97.3%        97.4%
Interviews with teachers of Dutch              97.1%        97.1%
Phone dialogues (recorded at platform)         98.0%        98.0%
Phone dialogues (recorded with minidisc)       98.0%        98.0%
Business conversations                         97.8%        97.6%
Interviews & discussions recorded from radio   96.9%        96.7%
Political debates & discussions                96.5%        96.2%
Lectures                                       97.3%        97.2%
Sport comments                                 97.3%        97.3%
Discussions on current events                  97.2%        97.2%
News                                           96.7%        96.7%
Comments (radio, TV)                           96.3%        95.7%
Masses, ceremonies                             95.5%        95.6%
Lectures, discourses                           96.3%        96.0%
Texts, read aloud                              96.2%        96.4%
Total result of set0                           97.3%        97.3%

Table 8 presents not only the results of a tagger using neural networks, but also the results of Luite Stegeman, who built a tagger using Support Vector Machines [ste05]. His results are almost identical to the results of this paper: on some parts his program gets better results, other parts are done slightly better by the neural network. The table also shows that parts of the CGN corpus with few lines (see Table 1 for the size of each part) are not recognized very well; the neural network and the SVM tagger both tend to recognize the larger parts much better.

Table 9. Overall performances of different taggers

Tagger                    Results   Tagging speed (words/s)
Support Vector Machines   97.3      1,000
Genetic tagger            90.0      3,000
Neural networks           97.3      7,500

The results of the two stochastic taggers are better than those of the genetic tagger built by Wouter Joosse. But the research done by Wouter Joosse [joo05] had a slightly different objective: he wanted to test whether a Brill tagger can be simplified by using genetic algorithms.

6.8 Error Analysis

In this section the errors made by the tagger of section 6.7 are briefly discussed. The next table shows the largest errors the network made. The network in section 6.7 reaches a recognition of 97.3%, so the error rate is 2.7%.

Table 10. The largest errors

Tag    Occur   Errors   Recall   Precision   F1
WW2    17247   2018     0.88     0.92        0.90
N3     19662   1615     0.92     0.98        0.95
SPEC   27295   1582     0.94     0.93        0.93
WW4    24005   1529     0.94     0.90        0.92
VG2    15652   1525     0.90     0.94        0.92
N1     67745   1511     0.98     0.95        0.96
N5     15432   1384     0.91     0.96        0.93
ADJ9   30749   1350     0.96     0.96        0.96
ADJ1   14710   1039     0.93     0.94        0.94
VZ1    61465   1027     0.98     0.98        0.98

This table shows the top 10 tags with the most errors. Its columns give the number of occurrences in the test corpus and the number of times the network made a mistake on this tag. The next two columns contain the recall and precision ratios: the recall indicates how well the tag is recognized (false negatives), the precision how often the program wrongly assigns this tag to a word (false positives). The last column shows the F1 measure, a weighted mean of recall and precision. A few figures are interesting. Although a word with tag N1 is wrongly tagged 1,511 times in the test corpus, which is below the overall error rate (a recall of 0.98), the precision is only 0.95, meaning the program often (3,542 times) gives this tag to a word with another tag. This could well be an effect of the spreading of the unknown tags: as mentioned in section 6.5, an unknown word gets a relative frequency of 33% for the N1 tag, making the network pick this tag more easily. The tag VZ1 is also mentioned in Table 10, but this is only the result of its large number of occurrences in the test corpus; it has a good recall and precision of 0.98.

The next table shows the five tags with the highest error rate. Tags with fewer than 100 occurrences in the test corpus are left out, although the error frequency of such a tag may be 100% (the tag ADJ8 occurs 8 times and is wrong every time). Because their occurrences are so low, these errors have no real influence on the total result.

Table 11. Largest errors based on recall

Tag     Occur   Errors   Recall   Precision   F1
VNW16   128     71       0.45     0.76        0.56
WW6     1306    541      0.59     0.74        0.65
ADJ4    1108    421      0.62     0.73        0.67
N6      137     50       0.64     0.89        0.74
VNW25   209     68       0.67     0.84        0.75

This table shows clearly that the tags with the lowest recall ratio do not occur often in the corpus. It also shows that the precision of these tags is better, meaning the network does not use these tags very often: the number of times the network wrongly assigns such a tag to another word is low. For example, VNW16 is not recognized 71 times, while the network assigns this tag to another word only 18 times.

Table 12. The largest mix-ups

Wanted   Found   Errors     Wanted   Found   Errors
N3       N1      1005       WW2      WW4     1790
N5       N1      645        WW4      WW2     994
N5       SPEC    625        VZ1      VZ2     683
SPEC     N1      625        VZ2      VZ1     863

Table 12 clearly shows that there are three major errors: the network makes errors with the nouns and the specials, mixes up WW2 and WW4, and mixes up VZ1 and VZ2 a lot. These three errors make up 29% of all errors, but these tags also make up 27% of all the tags computed. The next table shows the relative errors.

Table 13. The largest mix-ups based on frequency

Wanted   Found   %        Wanted    Found    %
WW2      WW4     10.4     VNW16     VNW22    18.8
WW6      WW4     30.3     VNW23     VNW20    17.8
ADJ4     ADJ1    19.7     VNW25     VNW24    12.4
ADJ4     ADJ9    11.1     VNW13     VNW19    13.8
ADJ7     ADJ9    11.0     VNW16     VNW14    36.7

Again the smallest occurrences are not included, like the error of computing WW4 while WW5 was the correct answer. WW5 occurs only twice in the test corpus, and both times the network computes WW4. This is of course a mix-up of 100%, but of no influence on the total error rate.

6.9 The Best Neural Network

So what is the best way to tag texts using neural networks? This depends entirely on one's needs and the available data. With a small amount of data, take a simple network: using a 1x1 or a 4x4 network makes no difference in performance, but the first one needs only 1/3 of the calculations, making the recognition roughly three times faster. With less than 200,000 lines of training data do not use a hidden layer; it will lead to a slight loss of performance. With around 1 million lines of training data a hidden layer seems to give the best results. This makes the training and testing a bit slower, but it will still tag around 7,500 words a second on the computer mentioned before.

7. Conclusion

Looking at the individual results one can conclude that a neural network can be used to build a good tagger. But as the tests reveal, there is no such thing as the best network; this calls for a well thought out strategy when building a program using neural networks. Although the results do not differ a lot, it is never pleasant to realize there might be a network a bit better adapted to the data one uses. The results also show that the literature is a little too positive about small networks: section 6.7 shows a significant improvement with a hidden layer and larger networks. A positive finding is the computation time. A trained network is nothing more than a lot of multiplications and additions of numbers; there are no complicated instructions or program jumps in the computation, which makes it ideal for today's computers.
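That observation can be made concrete: tagging one word with a trained single-hidden-layer net reduces to two matrix products and an activation. The sketch below is illustrative, with assumed weight names, not the FANN code used in the experiments.

```python
import numpy as np

def tag_one_word(x, W1, b1, W2, b2):
    """Forward pass of a trained net: just multiplications, sums and a sigmoid."""
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden layer activations
    output = W2 @ hidden + b2                       # one value per tag
    return int(np.argmax(output))                   # most strongly firing tag
```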

8. Future research

Because of the many dimensions of this research there is certainly room for improvement. To name just one: the unknown words were only investigated with a rather large lexicon. The spreading of unknown words in small corpora, as in section 6.5, could well be quite different, and adjusting these percentages could give the results a boost. Also the number of tests should be increased. During the experiments set0 was always used as the test set; although this test set contains around 1 million words, this could very well influence the results. Every test should be run several times with different training and test sets, to verify the findings of this report and to measure the statistical significance of the results.

ACKNOWLEDGEMENTS

The author would like to thank Rieks op den Akker for his comments and for making the Corpus Gesproken Nederlands available for testing. He would also like to thank Luite Stegeman and Henri Boschman for all their remarks.

REFERENCES

[bog92] Julian Eugene Boggess, Some issues and problems in text tagging using neural networks, Proceedings of the 30th Annual Southeast Regional Conference, 397-400, 1992.

[emg94] Martin Eineborg and Björn Gambäck, Tagging experiment using neural networks, in Eklund (ed.), 71-81, 1994.

[mar01] Nuno C. Marques and G. P. Lopes, Tagging with Small Training Corpora, Lecture Notes in Computer Science, Vol. 2189, Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, 63-72, 2001.

[nak90] Masami Nakamura, Neural network approach to word category prediction for English texts, Proceedings of the 13th Conference on Computational Linguistics, Volume 3, 213-218, 1990.

[sch94] Helmut Schmid, Part-of-speech tagging with neural networks, Proceedings of the 15th Conference on Computational Linguistics, Volume 1, 172-176, 1994.

[her91] John Hertz et al., Introduction to the Theory of Neural Computation, Addison-Wesley, ISBN 0-201-51560-1, pp. 144-145.

[Jar05] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, chapter 4, 2005.

[Eyn04] Frank van Eynde, Part of Speech Tagging en Lemmatisering van het Corpus Gesproken Nederlands, http://lands.let.kun.nl/cgn/doc_Dutch/topics/version_1.0/annot/pos_tagging/tg_prot.pdf, 2004.

[cgn04] Nederlandse Taalunie, Het Corpus Gesproken Nederlands, http://lands.let.kun.nl/cgn/, 2004. Last visited December 4, 2005.

[ste05] Luite Stegeman, Part of speech tagging of spoken Dutch using support vector machines, 4th Twente Student Conference on IT, Enschede, 2005.

[joo05] Wouter Joosse, The application of Genetic Algorithms in Part-of-Speech tagging for Dutch corpora, 4th Twente Student Conference on IT, Enschede, 2005.

[nis03] Steffen Nissen, Implementation of a Fast Artificial Neural Network Library (FANN), http://fann.sourceforge.net/report/report.html, 2003. Last visited December 10, 2005.

[wik05] Wikipedia, Lexical Category, http://en.wikipedia.org/wiki/Part_of_speech, 2005. Last visited December 18, 2005.
