
ISSN 1744-1986

Technical Report No 2009/09

Factive / non-factive predicate recognition
within Question Generation systems

B Wyse

20 September, 2009

Department of Computing
Faculty of Mathematics, Computing and Technology
The Open University
Walton Hall, Milton Keynes, MK7 6AA
United Kingdom

http://computing.open.ac.uk


Factive / non-factive predicate recognition within Question Generation systems

A dissertation submitted in partial fulfilment of the requirements for the Open University's Master of Science Degree in Computing for Software Development

Brendan Wyse
(X5348818)

9 March 2010

Word Count: 14,532


Preface

I am extremely grateful to my supervisor, Dr. Paul Piwek, for his enthusiastic guidance and his willingness to allow me to become involved in the area of Question Generation. He has introduced me to a community and a culture that I had always wanted to be a part of.

My family are unfamiliar with the technical nature of my work but were willing and able to offer the encouragement and support necessary to help me see this research through to completion. To my wife, Amanda, my utmost thanks must go. Without her support I would not have even started this work.

Special thanks go to my sons, Jason and Daniel. One day they will understand why I seemed so busy all the time. Their patience was appreciated.


Table of Contents

Preface
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background to the research
    1.1.1 Defining the question generation task
    1.1.2 Specific areas for research
  1.2 Aims and objectives of the research project
  1.3 Overview of the dissertation
Chapter 2 Literature Review
  2.1 Introduction
  2.2 Overview of existing systems
  2.3 Techniques used and NLP tools employed
    2.3.1 Tagging and parsing
    2.3.2 Term extraction
    2.3.3 WordNet
    2.3.4 Rule-based mapping
    2.3.5 Lemmatisation
  2.4 Factivity
  2.5 Research question
  2.6 Summary
Chapter 3 Research Methods
  3.1 Introduction
  3.2 Research Techniques
    3.2.1 Prototyping
    3.2.2 Quantitative analysis of overall impact
    3.2.3 Quantitative analysis of quality increase
  3.3 Summary
Chapter 4 Software Overview
  4.1 Introduction
  4.2 Preparation of the OpenLearn data set
  4.3 Matching tagged sentences
  4.4 Transformation to questions
  4.5 Matching sentences with structure information
  4.6 Rule representation
  4.7 Factive / non-factive recognition functionality
  4.8 Summary
Chapter 5 Results
  5.1 Introduction
  5.2 Overall impact
  5.3 Quality Increase
Chapter 6 Conclusions
  6.1 Introduction
  6.2 Assessment of factive/non-factive recognition module
  6.3 Further work
References
Index
Appendix A – Extended Abstract
Appendix B – List of factive and non-factive predicates


List of Figures

Figure a.1 Predicate verbs
Figure 1.1 Fill-in-the-blanks test
Figure 1.2 An example of question generation
Figure 1.3 A rationale type question
Figure 1.4 Question Taxonomy by Nielsen et al. (2008)
Figure 1.5 Factive versus non-factive
Figure 2.1 QuALiM mark-up (Kaisser and Becker, 2004)
Figure 2.2 QG system mark-up from Cai et al. (2006)
Figure 2.3 Sub-tasks of question generation (Silveira, 2008)
Figure 2.4 Examples: John is in … plain text
Figure 2.5 Examples: John is in … POS tagged
Figure 2.6 Regular expression targeting NNP is in NNP
Figure 2.7 Examples: Targeting motion related verbs
Figure 2.8 Examples: POS tagged motion verb examples
Figure 2.9 Mitkov and Ha example sentence
Figure 2.10 Mitkov and Ha example question
Figure 2.11 Wang et al. sample template
Figure 2.12 QuALiM mark-up language
Figure 2.13 NLGML mark-up for 'somebody went to somewhere'
Figure 2.14 Factivity sentence examples
Figure 3.1 Precision and Recall: Van Rijsbergen's (1976) formal definition
Figure 3.2 Precision and Recall: Adapted for this research
Figure 4.1 Ceist system architecture
Figure 4.2 OpenLearn Study Unit processing
Figure 4.3 OpenLearn XML format
Figure 4.4 Python extraction script
Figure 4.5 POS tagged sentence
Figure 4.6 Example with proper nouns
Figure 4.7 Example of POS limitation
Figure 4.8 Group matching in Ceist
Figure 4.9 Sentence with grammatical structure
Figure 4.10 Demonstrating use of noun phrase
Figure 4.11 Sentence structure viewed as a tree in Ceist
Figure 4.12 XML representation of a rule in Ceist
Figure 4.13 Rule editing interface in Ceist
Figure 5.1 Chart showing occurrences of factive / non-factive
Figure 5.2 Occurrences of factive / non-factive directly before a new clause
Figure 5.3 Proportion of Answerable Questions for Rule 1
Figure 5.4 Proportion of Answerable Questions for Rule 2
Figure 5.5 Recall values for both rules

List of Tables

Table 2.1 Existing QG Systems reviewed
Table 2.2 Factive and non-factive categorised by Hooper (1974)
Table 3.1 Sample of YES/NO questions rejected
Table 3.2 Sample of YES/NO questions considered answerable by the input
Table 3.3 Sample of YES/NO questions not answerable by the input
Table 5.1 Occurrences of factive / non-factive
Table 5.2 Occurrences of factive / non-factive directly before a new clause
Table 5.3 Quality increase results


Abstract

The research in this paper relates to Question Generation (QG) – an area of computational and linguistic study with the goal of enabling machines to ask questions using human language. QG requires processing a sentence to generate a question or questions relating to that sentence. This research focuses on the sub-problem of generating questions where the answer can be obtained from the input sentence. One issue with generating such questions is the instance where a proposition in a declarative content clause in a sentence is taken to be true, when it might not actually be.

Two sentences are shown in Figure a.1 below with the same declarative content clause (underlined) but with different predicate verbs (bold). The certainty that the proposition in the declarative content clause is true is different for each.

Figure a.1 Predicate verbs

A QG system without the ability to understand the difference between the sentences above might generate the question 'How many people were at the conference?' Whilst this is grammatically a valid question, it cannot be definitively answered given (1) above. From (1) we are not absolutely certain how many people were at the conference, because the speaker in the sentence is not absolutely certain. In a system designed to generate only questions that can be answered by the input sentence, this is a flaw.

The verb 'know' is a factive verb. A factive verb "assigns the status of an established fact to its object" (Soanes and Stevenson, 2005a). The verb 'think' is a non-factive. A non-factive is a verb "that takes a clausal object which may or may not designate a true fact" (Soanes and Stevenson, 2005b). This research asks the question: what is the impact of enabling a QG system to recognise sentences containing these factive or non-factive verbs? Impact was regarded as both the overall impact which such a system might have on QG as a whole and the quality improvements which might be obtainable.

A QG system was written as part of this research and a sub-task was implemented in this system by writing a software algorithm to perform factive / non-factive recognition. This was done by using a list of factive and non-factive verbs produced by Hooper (1974) which was expanded using a thesaurus. The expanded list allowed me to determine the frequency of occurrence of factive/non-factive indicators and thus analyse overall impact. The same list was then used within the QG system to analyse the improvement in question quality.

The analysis of factive / non-factive recognition was carried out using the Open University's online educational resource, OpenLearn. OpenLearn was chosen as it is educational material and is available in a well marked-up XML format, which makes it easy to extract certain content.


It was found that factive and non-factive verbs are common enough in educational discourse to justify further work on factivity recognition. The effect on precision when generating questions that must be answerable from the input sentence was quite good. It was found that whilst the module was successful in removing unwanted questions, it did also remove some perfectly good questions. Previous research has concluded, however, that it is better to generate questions of higher precision, and I agree.


Chapter 1 Introduction

1.1 Background to the research

Written and spoken language is complex and this makes it difficult to process computationally. Complexity aside, the benefits of having machines capable of working with human languages are numerous. Human-machine interaction would improve as machines could understand human language commands and respond using direct communication. One particular area that would benefit greatly from improved human-machine interaction is that of educational technology, where computers are used for tutoring or training purposes.

Some of the older traditional multimedia learning applications use little more than multiple-choice selection or fill-in-the-blanks (Figure 1.1) to interact with the student. They merely take paper-based methods and transfer them to the computer. If a student has difficulty with a topic, there is no method of resolving that difficulty via the application.


Figure 1.1 Fill-in-the-blanks test

Intelligent Tutoring Systems (ITS) seek to educate through better interaction with the students. They allow students to engage with the artificial tutor and enable them to ask for assistance or explain their decisions. The tutor can figure out what the student is trying to achieve and question them about their methods. To do this directly, an artificial tutor must be capable of dialog.

Intelligent Tutoring Systems are already capable of engaging in dialog using processes such as Natural Language Understanding (NLU) and Natural Language Generation (NLG). Both of these are sub-fields of the wider research area of Natural Language Processing (NLP). An ITS can recognise its student's input and converse accordingly, either to determine what the student might be thinking, or to guide the student towards a specific goal. The Intelligent Tutoring agent needs to generate and ask the right questions in order to do this.


This research focuses on the field of Question Generation (QG), which aims to improve the technology that will allow applications such as ITS to ask appropriate and sensible questions. QG is a relatively new area of study and, in order to promote research in it, a number of communities, including those involved with Intelligent Tutoring Systems (ITS), have met with the aim of setting up a Shared Task and Evaluation Campaign (STEC) for Question Generation.

The STEC involves creating clearly defined tasks relating to QG, providing data sets relating to those tasks and asking QG system developers to run the data sets through their systems. The results are evaluated, allowing the QG community to identify promising approaches to QG or areas which may need further study.

This research was carried out with a view to contributing in some way to the Question Generation STEC. It is hoped that it will help to achieve one objective of the campaign, which is to boost "research on computational models of Question Generation, an area less studied by the natural language generation community" [1].

[1] http://www.questiongeneration.org


1.1.1 Defining the question generation task

It is hoped that, by initiating a STEC for Question Generation, the NLP community will participate and consequently advance technologies related to QG. As part of the preparation work for the STEC, researchers have already done some work to clearly define the QG task. Let's examine the QG task a little more closely.

It is clearly a computational task, although this is not explicitly stated by Rus et al. (2007b, p.2), who define QG for an input of one or more sentences as "the task of a QG approach is to generate questions related to this input". Piwek et al. (2008, p.1) describe it as "the task of automatically generating questions", thus recognising the computational aspect of QG.

A more precise definition is offered by Silveira (2008, p.1) as the ability of a system "to receive an input of free text, and to generate questions, in a language- and domain-independent manner, relevant for a target user previously profiled by the system". Silveira is describing an ideal system which would be both language and domain independent and account for a specific type of user. This definition indicates a level of capability which might well be obtainable in the future, but for now system designers generally focus on only very specific languages, and no specific domain or user type.

Figure 1.2 is a very simple example of the process of generating questions which shows one possible output. Piwek et al. (2008, p.2) define the relation between the input in this example and the generated question as being that the "input answers the output question". Piwek et al. also describe other types of QG, such as question reformulation, where an input question is rephrased in some manner. They also define the relation where the input raises the output question, i.e. the generated question should elicit further information about the given input.

Figure 1.2 An example of question generation

This research concentrated on questions which could be answered by the input sentence, such as the example in Figure 1.2. Such questions are applicable to educational technology because questions which test a student's comprehension of a subject area are typically of this type. A comprehension question will ask the student something about what they have learned in order to determine whether or not they have understood it, and the answer will generally be in the content they have studied.


Concept completion questions are quite shallow. A deeper question might be a rationale type question such as that in Figure 1.3. The rationale type question is not answerable by the input sentence and this would be the case for most open questions.

Figure 1.3 A rationale type question

This research concentrates mainly on concept completion type questions (i.e. Who?, What?, When?, Where?, etc.) as many questions which can be generated from and answered by a single input sentence are of this type. There are a variety of other question types as defined by Nielsen et al. (2008) and listed in Figure 1.4, some of which would require processing of a complete paragraph of text.


Figure 1.4 Question Taxonomy by Nielsen et al. (2008)

Focusing on single sentences simplifies the task and makes it more accomplishable for a researcher new to QG and indeed NLP. Working with multiple sentences requires processing to determine links between sentences, such as anaphora. Anaphora is where a sentence refers to something or someone using a personal pronoun (e.g. she, it) and we must determine what that personal pronoun is referring to from the surrounding text. It was decided that this would be beyond the scope of this research. Anaphora resolution is a research topic of its own.

1.1.2 Specific areas for research

Although there is no doubt the STEC will highlight many areas for potential research, it will not begin until early 2010 and will end well after the submission deadline for this dissertation has passed. For the purposes of my research, an area of possible improvement was identified by experimenting with question generation.

During my research I found many open source tools available for NLP in general. Experimenting with open source tools often highlights limitations or areas the original developer did not implement. Access to the code behind the tools allows the tool to be improved or adapted if necessary and, in addition, allows others to learn from the original developer's methods. Because QG is a relatively new area of study there are no open source systems available yet that I am aware of.

The lack of an open source QG system presented an opportunity to me. By examining source code and documentation relating to existing NLP tools I was able to develop my own QG system, which I called Ceist [2]. Through developing Ceist I gained an insight into the types of issues that QG systems must solve. The literature review (Chapter 2) outlines some of these issues and how they were addressed. Ceist then allowed me to experiment with QG in order to find some area for further research.

[2] Ceist is the word for 'question' in the Irish language and is pronounced 'KESHT'.

I began to focus on the generation of questions where the answer to the question is explicitly contained in the input sentence. My experiments with Ceist, and indeed with other QG systems (such as Michael Heilman's online question generator [3]), highlighted a particular problem area. This was the case where a question was generated from a clause in a sentence and the QG system assumed that the statement made in the clause was an established fact when it was not.

The problem is only really relevant to systems intending to generate questions where the input sentence explicitly contains the answer. Grammatically correct questions that are not answered by the input sentence are regarded as invalid for such systems. Figure 1.5 presents two very similar sentences for the purpose of demonstrating this.

[3] http://www.ark.cs.cmu.edu/mheilman/questions/

Figure 1.5 Factive versus non-factive


A QG system without factive/non-factive verb recognition will assume that there were 10 new students yesterday, given both of these sentences as input, because it will only process the declarative content clause 'there were 10 new students yesterday'. This means that although 'How many new students were there yesterday?' is a grammatically correct question, such a system assumes that its answer is '10' for both (1) and (2).

This is a problem. We cannot say that (2) contains the answer to 'How many new students were there yesterday?' because (2) does not establish the number of new students as a fact. The speaker in (2) only states that they 'think' there were 10 students. The speaker is not absolutely certain.
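
To make the idea concrete, the following minimal sketch (in Python, not the actual Ceist module) shows how a simple list-based check over a POS tagged sentence could separate these two cases. The short verb lists are illustrative stand-ins for the expanded Hooper (1974) lists used later in this research.

    # A minimal sketch of list-based factive / non-factive recognition.
    # The verb lists below are illustrative only, not the full research lists.
    FACTIVE = {"know", "realise", "regret", "notice"}       # object clause is an established fact
    NON_FACTIVE = {"think", "believe", "suppose", "guess"}  # object clause may or may not be true

    def clause_is_established(tagged_sentence):
        """tagged_sentence is a list of (word, POS) pairs from a tagger."""
        for word, pos in tagged_sentence:
            if pos.startswith("VB"):
                verb = word.lower()
                if verb in NON_FACTIVE:
                    return False   # e.g. "I think there were 10 new students yesterday"
                if verb in FACTIVE:
                    return True    # e.g. "I know there were 10 new students yesterday"
        return False  # no factive indicator found: be conservative, do not generate

    print(clause_is_established([("I", "PRP"), ("know", "VBP"), ("there", "EX"),
                                 ("were", "VBD"), ("10", "CD"), ("new", "JJ"),
                                 ("students", "NNS"), ("yesterday", "NN")]))   # True
    print(clause_is_established([("I", "PRP"), ("think", "VBP"), ("there", "EX"),
                                 ("were", "VBD"), ("10", "CD"), ("new", "JJ"),
                                 ("students", "NNS"), ("yesterday", "NN")]))   # False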

This is the problem which I have attempted to solve for QG systems. Can we use the factive or non-factive verbs to ensure that we only generate a question from an input sentence when we know the answer is contained in that sentence as an established fact? The aims and objectives of this research are directed towards solving this problem and then evaluating the solution.

1.2 Aims and objectives of the research project

Based on the identified area for research, I aimed to develop an algorithm that uses factive or non-factive verbs and phrases to determine whether declarations in input sentences are established fact. This functionality is added to the baseline QG system and its performance analysed.


The research conducted consisted of two main objectives:

1. Build a working QG system from scratch
2. Develop a software component capable of factive / non-factive recognition

The aim of the QG system development stage was to allow me to attain the knowledge and skills required to understand how QG is implemented. It also provided a QG system which could be freely adapted as necessary. This working QG system then accommodates the testing of algorithms capable of factive / non-factive recognition.

The aims and objectives of the research project will allow me to answer the following research question: "What is the impact of implementing factive/non-factive predicate recognition as a sub-task of a Question Generation system?"

There are two aspects to this research question. Firstly, we would like to know the likely benefit of factive/non-factive recognition in QG, i.e. 'What is the overall impact which such a module would have on QG as a whole?' For example, if I found that only 1 sentence in 1,000 contained a non-factive predicate, further researchers might seek other areas for improvement in QG with a broader scope.

Secondly, we wish to assess such a module and determine its effectiveness with regard to the quality of the system output, i.e. 'What improvement does such a module deliver to generated question quality?' This is done by analysing the question quality both with and without a factive/non-factive recognition module.


1.3 Overview of the dissertation

The following outline briefly describes the contents of each chapter in the dissertation.

Chapter 1 – Introduction
This chapter contains an introduction to Question Generation (QG). It describes recent efforts to drive research in QG and the potential benefits to Intelligent Tutoring Systems. A specific problem with current QG systems relating to factive or non-factive verbs or phrases is described. The aims and objectives of the research project relating to solving this problem are outlined.

Chapter 2 – Literature Review
Answering the research question laid out in the introduction (Chapter 1) required gaining some familiarity with Natural Language Processing, and in particular with techniques relating to question generation. Several resources relating to existing QG systems were reviewed and this chapter details the implementation of techniques used by these systems, many of which were then used in Ceist. Some of the current work relating to factivity in linguistics is also outlined.

Chapter 3 – Research Methods
This chapter gives a general overview of the research methods used in the project to assess current QG systems, assist with building Ceist and identify a means to implement factive / non-factive recognition functionality.

Chapter 4 – Software Overview
The software overview describes the technical detail of how a QG system such as Ceist is created. It relates closely to and follows on from the techniques described in the literature review (Chapter 2), but with more technical detail.

Chapter 5 – Results
This chapter presents the results which were obtained following the assessment of the factive / non-factive recognition module using the research methods outlined in Chapter 3.

Chapter 6 – Conclusions
The conclusions which can be drawn from the results of the assessment are discussed in this chapter. I also suggest some further work which could be undertaken relating to Ceist and this research.


Chapter 2 Literature Review

2.1 Introduction

A prerequisite to developing any system would normally be to understand how such a system works. In order to learn how existing QG systems work, some basic knowledge of NLP techniques is required. NLP is a vast area of study but the quantity and quality of the body of knowledge relating to it is excellent. There are several online resources available and some recommendable books on the topic, which I reviewed to obtain this base knowledge.

I experimented with some software toolkits in order to gain an understanding of NLP, including the Python based Natural Language Toolkit [4] and SharpNLP [6], the C# port of OpenNLP [5]. Question Answering (QA) is another area of NLP and one that has already seen several years of research dedicated to it. It shares many common problems with QG and consequently, common solutions too. For this reason I reviewed existing source code from both QA and QG applications. In this chapter I focus on the work I reviewed and detail some specific NLP techniques that are used for QG and were used in Ceist.

Ceist is a rule-based QG system as it uses expressions and templates to move from input to output. The combination of an expression and a template is called a rule. The techniques described here would be considered core to how rule-based QG systems work.

[4] http://www.nltk.org/
[5] http://opennlp.sourceforge.net/
[6] http://sharpnlp.codeplex.com/

The literature review does not delve into the technical details of the techniques. Instead, this is left to the software overview (Chapter 4), which presents a detailed description of the technical implementation of Ceist. The systems which were reviewed and/or assisted in the development of Ceist are listed in Table 2.1.

System                   Question Types Generated
Mitkov and Ha (2003)     Multiple Choice Question generator
Brown et al. (2005)      Vocabulary Assessment questions consisting of 6 question types
Cai et al. (2006)        Questions to aid Intelligent Tutoring Systems
Rus et al. (2007a)       Factual questions whose answers are specific facts: Who? What? Where? When?
Gates, D. (2008)         Reading comprehension questions for children
Wang et al. (2008)       Questions to evaluate medical article comprehension

Table 2.1 Existing QG Systems reviewed

Following the overview of techniques used in QG systems in general, a description of the current body of knowledge pertaining to factive / non-factive recognition is presented. Factivity is a sub-field of linguistics relating to the established truth of a fact as given in a sentence. Although it has been researched for a number of years now, I was not able to find any practical NLP tools available for my requirements. I briefly describe the creation of my own tool.

2.2 Overview of existing systems

Workshops relating to the Question Generation Shared Task and Evaluation Challenge have seen some recently developed QG systems presented by their creators, but some earlier work had been done too (Rus and Graesser, 2009). Early QG systems used shallow parsing and "employed various NLP techniques including term extraction" (Mitkov and Ha, 2003, p.1), and these techniques were also used in QA systems around the same time (Kaisser and Becker, 2004). Parsing allowed early systems to determine the grammatical structure of free text sentences and term extraction allowed the systems to match sentences which had a specific grammatical structure. Today's systems still employ parsing and term extraction and I expand on these two methods in the next section (2.3).

The expressions used in term extraction are related to either the question being answered, in the case of QA, or the question being generated, for QG. If a QA system seeks to answer the question 'Where was Mary McAleese born?' then it might use term extraction to find a sentence of the form 'Mary McAleese was born in <place>.' to retrieve the answer. A QG system would do the reverse, finding sentences of the form 'Mary McAleese was born in <place>.' to generate the question 'Where was Mary McAleese born?' together with its answer.


Naturally, the system designers chose to use some form of mark-up to represent the term extraction expression and link it to its related question. The QG system by Cai et al. (2006) and the QA system QuALiM by Kaisser and Becker (2004) both use XML based mark-up languages to do this. QuALiM's mark-up is shown in Figure 2.1.

Figure 2.1 QuALiM mark-up (Kaisser and Becker, 2004)

The combination of the term extraction expression and the generated question template is called a rule. Figure 2.2 below is an example of one such rule as defined by Cai et al.'s mark-up language, which they call NLGML. Although this rule is from a QG system, it does bear some similarities to the mark-up from the QA system. Both mark-ups use a sequence to match a specific grammatical structure and then use matched parts of that structure in a template to generate an output.


Figure 2.2 QG system mark-up from Cai et al. (2006)

Three of the original authors of NLGML described an improved version of their QG system just one year later which focused on flexible term extraction and its associated mark-up language (Rus et al., 2007a). Mark-up language for QA/QG systems appears to be an area where different developers are re-inventing the wheel, and a unified effort to standardise the representation would be useful (Wyse and Piwek, 2009).

Systems generally accept various discourse types, although it is not uncommon for developers to focus on very specific domains and domain-specific discourse. Donna Gates (2008) focused on news articles for young children. The language contained within such articles would be relatively simple everyday language. As sentence structure becomes more complex and contains an increasing number of elements, finding patterns within the sentences becomes increasingly difficult.

Gates used "several off the shelf language technologies to generate reading comprehension questions" based on the news articles. Like Gates' system, Ceist also focuses on educational material. This is because the application of QG in educational technology is a useful way to showcase the technology.


Testing language learning skills was also the purpose of a system developed by Brown et al. (2005, p.1). This system is designed to test the user's vocabulary knowledge by "automatically generating questions for vocabulary assessment". A system with a completely different purpose was developed by Wang et al. (2008). This system was designed to test medical students and as such worked with text containing medical terms which would not typically be part of everyday vocabulary.

As has previously been stated, the number of QG systems currently available to study is limited and the QG STEC will without doubt change this. Already, however, a lot can be gained by examining a few of the techniques used in the systems briefly described above.

2.3 Techniques used and NLP tools employed

The task of question generation can be split into several sub-tasks depending on the target question types (Silveira, 2008). Figure 2.3 is Silveira's diagram showing groups of sub-tasks and typical processing flows. This section looks at existing methods used to implement some of these sub-tasks.


Figure 2.3 Sub-tasks of question generation (Silveira, 2008)

The first group of sub-tasks in Silveira's diagram, which are performed on the free natural text input, are tokenisation, stemming, syntactic and semantic parsing, anaphora resolution and ambiguity removal. These tasks perform some initial preparation on text for use in further processing. I will briefly explain the other tasks before focusing on syntactic and semantic parsing.

Tokenisation is the task of splitting text up into words. Although it may sound simple, sometimes the use of punctuation makes this quite difficult (e.g. a period might represent the end of a sentence or a decimal point). The stem of a word is its root form and stemming is the process of finding that root form. For words such as 'running' and 'laughed' the process is quite simple; the suffix is removed, leaving 'run' and 'laugh'. Stemming is not always so straightforward, however; for example, irregular verbs cannot be stemmed using a common algorithm (e.g. 'bought', 'swam').
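
As an illustration, the following minimal sketch uses NLTK, one of the toolkits mentioned above, to tokenise and stem a short piece of text. The Porter stemmer is chosen purely for illustration; as noted, it handles regular suffixes but cannot recover the root of irregular forms such as 'bought'.

    # A minimal sketch of tokenisation and stemming with NLTK.
    # May require nltk.download('punkt') the first time it is run.
    import nltk
    from nltk.stem import PorterStemmer

    text = "John laughed while running, then bought apples."
    tokens = nltk.word_tokenize(text)          # split the text into word tokens
    stemmer = PorterStemmer()
    print([(t, stemmer.stem(t)) for t in tokens])
    # 'running' -> 'run' and 'laughed' -> 'laugh', but 'bought' stays as 'bought'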

I described anaphora in my introduction chapter (1.1) and, to briefly recap, it is the process of determining what a personal pronoun refers to within the surrounding text. Ambiguity removal is quite simply determining the meaning of something which is ambiguous in the text. Syntactic parsing provides grammatical information for text, which allows us to process it at a grammatical level rather than at word level. The next sub-section explains this further.

2.3.1 Tagging and parsing

The most basic syntactic information that can be added to plain text is part of speech (POS) tags. This process marks each word in a sentence with its corresponding part of speech (i.e. noun, adjective, verb, etc.) and is called POS tagging. Stochastic and rule-based are two categories of taggers, with two different approaches to tagging.

Stochastic taggers, such as those using a Hidden Markov Model (HMM), are based on probability. HMM taggers learn the probability that pairs (or longer sequences) of words will be specific POS types by analysing corpora of manually annotated text such as the British National Corpus [7].

[7] http://www.natcorp.ox.ac.uk/

My initial experiments with the NLTK included some work with the rule-based Brill tagger (Brill, 1992). The Brill tagger uses a machine learning technique known as supervised learning. It is trained on annotated development data to learn rules that it can apply to certain words or sequences of words.

Lexical rules in the Brill tagger check for characters within words and change the POS of matching words. An example is the rule 'NN s fhassuf 1 NNS', which changes any word tagged as a noun (NN) to a plural noun (NNS) if the word has the suffix 's'. The tagger also uses contextual rules and will change the POS of a word based on other words in the same sentence.
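
A minimal sketch of POS tagging with NLTK's default tagger (not the Brill tagger itself) is shown below, purely to illustrate the word/POS output that the term extraction step described later relies on.

    # A minimal sketch of POS tagging with NLTK's default tagger.
    # May require nltk.download('punkt') and
    # nltk.download('averaged_perceptron_tagger') on first use.
    import nltk

    sentence = "The green house was in a field"
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # Roughly: [('The', 'DT'), ('green', 'JJ'), ('house', 'NN'), ('was', 'VBD'),
    #           ('in', 'IN'), ('a', 'DT'), ('field', 'NN')]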

POS tagged text is a prerequisite for the process of term extraction, as described briefly in the overview of existing systems in the previous section (2.2). My initial experiments with the Brill tagger highlighted one problem that occurs when using text which has been tagged with POS information only. It became quite apparent that using parts of speech tagging alone was inflexible. Let's examine why this is.

Consider the sentence 'The house was in a field'. Targeting this sentence with the objective of creating the question 'Where was the house?', one might use POS tags to match the determiner 'The' followed by the noun 'house'. The problem with this approach is that it eliminates other potentially valid matches such as 'The green house was in a field.'

To overcome this, one might permit an optional adjective between the determiner and the noun, but what if there are two adjectives, as in 'The big green house'? The possible permutations are endless and accounting for them using POS tag combinations alone would be impossible. One way to solve this issue would be to match the complete noun phrase, rather than just parts of speech. A noun phrase encompasses the determiner, any adjectives and the noun itself. The technical approach to this matching is described in the software overview (Chapter 4) but I will outline the parsing process which makes it possible.

Shallow parsing is an NLP process that adds syntactic structure information to a sentence. Noun phrases, verb phrases and other groupings within a sentence are marked in addition to the POS tags. Shallow parsing was required by Mitkov and Ha (2003), who wanted to identify key terms in a given text. They decided that frequently occurring nouns or noun phrases would make good key terms, and they utilised the FDG shallow parser (Tapanainen and Järvinen, 1997). There are a variety of approaches to parsing now and a range of parsers available.

QG system developers do not provide a rationale for choosing one parser over another. It is highly likely that the choice is made from the best of the state-of-the-art parsers available at the time, based on the following factors: (i) Availability – is the use of the parser restricted in any way? (ii) Performance – has the parser's performance been proven? (iii) Open source – if the source code for the parser is available in the QG system developer's chosen programming language, then this could be important as it can be adapted if needed.

The factor that was deemed to be most important when choosing a parser for Ceist was that the parser source code was available and written using the Java language. Ceist is written using the Java programming language and is modular in design. Modules exist for rule storage and conversion, some NLP functionality such as group matching, and the main QG application itself.

Other factors included the output formats available, and although performance was not measured it was considered. In the next section I look at how parsed text is used in term extraction and then I explain another factor in the choice of parser for Ceist.

2.3.2 Term extraction

Term extraction is the task of extracting targeted terms from free text. This corresponds to the term 'representation matching' in Figure 2.3. A basic requirement for QA and QG systems is the ability to identify sentences with a specific grammatical structure. Note the similarities among the sentences in Figure 2.4. Two of them refer to John being in a city, i.e. (1) and (3), but the other, (2), does not.

Figure 2.4 Examples: John is in … plain text

Term extraction will allow us to match sentences stating that John is in some city, such as (1) and (3). For the purposes of QG the matched terms can then be re-arranged to form a question (e.g. Where is John?) and, if desired, an answer (e.g. Dublin or London). Let's look at simple POS tagged versions of the example sentences as provided by the Stanford Parser online [8]. Figure 2.5 shows the sentences with the POS tag information coloured grey.

Figure 2.5 Examples: John is in … POS tagged

I use the term 'target' to indicate that I wish to match only sentences of a specific structure. From our example we wish to target sentences (1) and (3) exclusively, and so we can simply look for sentences with a proper noun (NNP), followed by the words 'is in', followed by another proper noun. Not all proper nouns are city names of course, but to keep things simple, in this example the target simply looks for a proper noun at the end of the sentence. Because a POS tagged sentence is a string, we can search it using regular expressions.

The use of regular expressions to search a POS tagged sentence is quite simple, as the POS tagged sentence is a list of word/POS pairs separated by a forward slash. The regular expression '(\w*)/NNP' can be used to match proper nouns. The regular expression shown in Figure 2.6 matches both (1) and (3) above exclusively. (2) is not matched because the last word in (2) is not a proper noun.

[8] http://nlp.stanford.edu:8080/parser/index.jsp

Figure 2.6 Regular expression targeting NNP is in NNP

Terms in parentheses can be captured as groups. This allows a system to retrieve the proper noun at the beginning of the sentence and use it to generate a question such as 'Where is John?' These techniques are used in existing QG systems and I use the same in Ceist. The software overview (Chapter 4) provides more detail on this process.
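
The following minimal sketch illustrates this style of term extraction in Python. The regular expression is written in the spirit of the 'NNP is in NNP' target above; it is not necessarily the exact expression used in Ceist.

    # A minimal sketch of regular-expression term extraction over a POS tagged string.
    import re

    tagged = [
        "John/NNP is/VBZ in/IN Dublin/NNP",    # (1) matches
        "John/NNP is/VBZ in/IN trouble/NN",    # (2) does not match
        "John/NNP is/VBZ in/IN London/NNP",    # (3) matches
    ]

    # Capture both proper nouns so they can be mapped into a question template.
    pattern = re.compile(r"^(\w+)/NNP is/VBZ in/IN (\w+)/NNP$")

    for sentence in tagged:
        match = pattern.match(sentence)
        if match:
            subject, place = match.groups()
            print(f"Where is {subject}?  (answer: {place})")
    # Where is John?  (answer: Dublin)
    # Where is John?  (answer: London)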

The previous sub-section (2.3.1) explained the problem with targeting sentences using POS tags alone. A method was needed to allow extraction based on syntactic groupings such as noun phrases or verb phrases. Shallow parsing provides this information, but this additional information is more difficult to extract terms from. This is because the representation of the parsed sentence is no longer linear; it is a tree data structure. Searching within branches of a tree using normal regular expressions is not possible as the structure is a hierarchy of nodes. Gates (2008) used an NLP tool designed for this purpose called T-Surgeon (Levy and Andrew, 2006). Ceist uses a tool from the same package, called Tregex, which is capable of performing regular expression searches within a tree data structure.

Tregex is a sub-project within the larger Stanford NLP tools package. Also within this package is the Stanford NL parser. This parser supplies parsed text in a variety of formats, including a one-line syntax parse tree which can be sent directly as input to Tregex. All of the source code for the Stanford NLP tools is open source. Based on these factors, the parser currently used with Ceist is the Stanford NL parser. Klein and Manning (2003) do report performance figures for the parser, but performance was not a key factor in its choice for Ceist.
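
As a rough illustration of searching a tree rather than a flat tagged string, the following sketch uses NLTK's Tree class on a hand-written bracketed parse. Ceist itself uses Tregex (Java) over Stanford parser output, so this is only an analogy under that assumption.

    # A minimal sketch of matching whole noun phrases in a parse tree.
    from nltk.tree import Tree

    parse = Tree.fromstring(
        "(S (NP (DT The) (JJ big) (JJ green) (NN house))"
        " (VP (VBD was) (PP (IN in) (NP (DT a) (NN field)))))")

    # Matching the complete noun phrase copes with any number of adjectives,
    # which a fixed sequence of POS tags could not.
    for np in parse.subtrees(lambda t: t.label() == "NP"):
        print(" ".join(np.leaves()))
    # The big green house
    # a field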

2.3.3 WordNet

In the previous sub-section I explained term extraction techniques and in particular how QG system developers use syntactic groupings to encompass an entire noun phrase rather than individual parts of a sentence. Term extraction can be improved further by grouping terms semantically. The sentences in Figure 2.7 below demonstrate this.

Figure 2.7 Examples: Targeting motion related verbs

Let's assume we wish to generate the question 'Where did John move towards?' from these sentences. Only (4) and (5) in Figure 2.7 are valid for this question because they concern movement, but can we rely on POS tagging to eliminate sentence (6)? The Stanford NL parser produces the POS tagged versions of these sentences shown below.

Figure 2.8 Examples: POS tagged motion verb examples

The sequence of POS tags is identical for all three sentences. POS tags alone do not allow us to target only sentences (4) and (5). What is required is the ability to distinguish between verbs that mean the subject is moving (i.e. walked and ran) and other verbs (e.g. pointed). Semantics, in linguistics, is the study of the meaning of words, phrases or sentences. NLP tools exist which allow systems to find words with similar meanings.

The vast majority of current QG systems use a lexical database called WordNet (Fellbaum, 1998). One function WordNet provides is the ability to look up words that are semantically similar to another. It can be used to solve the problem with the sentences above by querying the database for all verbs with a similar meaning to 'walk' and then using this group of verbs to target only sentences where movement has taken place.
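
A minimal sketch of such a lookup using NLTK's WordNet interface is shown below. The raw result would need filtering in practice, but it shows how a group of motion-related verbs could be gathered automatically rather than listed by hand.

    # A minimal sketch of a WordNet lookup for verbs close in meaning to 'walk'.
    # May require nltk.download('wordnet') on first use.
    from nltk.corpus import wordnet as wn

    motion_verbs = set()
    for synset in wn.synsets("walk", pos=wn.VERB):
        for lemma in synset.lemmas():
            motion_verbs.add(lemma.name())
        for hyponym in synset.hyponyms():      # more specific kinds of walking
            for lemma in hyponym.lemmas():
                motion_verbs.add(lemma.name())

    print(sorted(motion_verbs))  # e.g. includes 'walk', 'march', 'stroll', ...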

All of the systems I reviewed which used a semantic grouping lookup used WordNet. The technical details are not discussed at length, but this is because the lookup is not by any means difficult to do. Essentially, WordNet is a database and can be consulted to return words close in meaning to a target word.

The modular nature of Ceist removes any ties to specific NLP tools. Ceist employs a grouping function which allows the specification of semantic groups as a tag name (e.g. motionVerbs). The implementation of motionVerbs can be either simply a list of user-created motion verbs or something cleverer, such as a module interfaced to WordNet. I have been careful to avoid becoming tied to one particular NLP tool and as such Ceist presently only uses simple user-created lists.

2.3.4 Rule-based mapping

In the sub-section relating to term extraction (2.3.2) I briefly mentioned how rules could map matched terms from an input sentence and rearrange the terms to form a question. An example was the sentence 'John is in London'. Any time a sentence of the format '<proper noun> is in <proper noun>' was found, the first proper noun could be inserted into the template 'Where is <proper noun>?' to generate a question. This process of transforming the matched terms into a question using a template is called mapping.

The contrived example is relatively simple but this is not always the case. Some<br />

mappings might require a change in verb tense or possibly the conversion of a word, or<br />

words, to lower case letters. We can look at one example of mapping involving verb<br />

tense changes which was considered by Mitkov and Ha (2003, p.2). They state that they<br />



use “a number of simple question generation rules” for transforming sentences of “SVO<br />

or SV type”. So, what do they mean by SVO or SV type?<br />

Sentences commonly have a structure consisting of subjects, verbs and objects.

The relation between the subject and object is defined by the verb. In the sentence ‘John<br />

ate apples’, ‘John’ is the subject and ‘apples’ is the object. Sentences with a structure<br />

containing a subject followed by a verb and then an object are known as SVO (subject-<br />

verb-object). Sentences with a structure containing just a subject and verb would be<br />

categorised as SV (Huddleston and Pullum, 2005).<br />

The simple transformation Mitkov and Ha perform on such sentences is to rearrange<br />

them into the format “What do/does/did the S V?” They take the subject and the verb<br />

and append them to the interrogative phrase “What do/does/did”. They provide an<br />

example using the sentence in Figure 2.9.<br />

Figure 2.9 Mitkov and Ha example sentence<br />

The subject is underlined and the verb marked with bold type. Applying the simple<br />

transformation from input sentence to question, Mitkov and Ha present the output in<br />

Figure 2.10 as the generated question.<br />



Figure 2.10 Mitkov and Ha example question<br />

We can observe two issues with this transformation which must be addressed. The<br />

choice of the verb ‘do’, ’does’ or ‘did’ in the generated question depends on the verb<br />

tense in the original sentence, and the transformed verb could also be subject to<br />

inflection, e.g. (constitutes → constitute). There is also the minor issue that the letter ‘A’

in the subject becomes lower case in the question. Simply converting all first characters<br />

to lower case is not a valid solution to this issue because if the first word were a proper<br />

noun, it should not be converted in this way. We will see that in other systems, the<br />

methods used to define transformational templates and rules have been enhanced to<br />

address these issues.<br />
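A minimal sketch of how these two issues might be handled is shown below; the tag-to-auxiliary mapping and the proper-noun check are my own illustrative assumptions rather than Mitkov and Ha's actual rules, and the example input is hypothetical:

# Choose do/does/did from the verb's POS tag and lower-case the first word
# of the subject only when it is not a proper noun.
AUX_FOR_TAG = {'VBD': 'did', 'VBZ': 'does', 'VBP': 'do', 'VB': 'do'}

def what_question(subject_tokens, verb_tag, verb_lemma):
    aux = AUX_FOR_TAG.get(verb_tag, 'did')
    words = []
    for i, (word, tag) in enumerate(subject_tokens):
        if i == 0 and not tag.startswith('NNP'):
            word = word.lower()
        words.append(word)
    return 'What {} {} {}?'.format(aux, ' '.join(words), verb_lemma)

# Hypothetical input: subject 'A water molecule', verb 'contains' (VBZ, lemma 'contain').
print(what_question([('A', 'DT'), ('water', 'NN'), ('molecule', 'NN')], 'VBZ', 'contain'))
# -> What does a water molecule contain?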

Wang et al. provide some examples of templates used to generate questions. Their<br />

templates consisted of four components: “question, entries, keywords and answer” and<br />

they give the following sample template:<br />

Figure 2.11 Wang et al. sample template<br />



Their system looks for parts of a medical text (which has been tagged to indicate<br />

diseases and symptoms amongst other medical entities) containing the required entries<br />

and at least one of the keywords.<br />

Both Mitkov and Ha and Wang et al. used templates capable of matching a very wide<br />

range of sentences. There are an infinite number of sentences of the form Subject-Verb-<br />

Object or Subject-Verb. The sample given above from a Wang et al. template would<br />

match any sentence containing a symptom, a disease and any of the keywords ‘feel’,<br />

‘experience’ or ‘accompany’. Such sentences are probably common enough in medical<br />

texts. These systems differ from the others because they do not define very exact<br />

sentences which they wish to match. Other rule based systems target very specific<br />

sentences right down to the word level, as does Ceist.<br />

The QA system QuALiM (Kaisser and Becker, 2004 p.2) featured “strict pattern<br />

matching” using “sequences” to “classify the questions according to their linguistic<br />

(mostly syntactic) structure”. An example of the sequence definition given by Kaisser

and Becker is shown in Figure 2.12.<br />

Figure 2.12 QuALiM mark-up language<br />



QuALiM is a QA system and thus the sequence is designed to match a question. Kaisser<br />

and Becker used the sequence to identify candidate sentences which could answer this<br />

question, and searched Google for those sentences. In a QG system designed to generate<br />

questions for which the answer is contained within the text, the target sequences would mainly be declarative sentences, since declarative sentences are statements and will usually state some fact about which a valid question can be asked. Although the method used by QuALiM was employed to represent more than just declarative sentences, it can be used to represent them and so remains applicable.

The sequence given above matches a very specific question, described by Kaisser and<br />

Becker (2004, p.2) as “any question that starts with the word ‘When’, followed by the<br />

word ‘did’, followed by an NP, followed by a verb in its infinitive form, followed by an<br />

NP or a PP, followed by a question mark which has in addition to be the last element in<br />

the question”.<br />

The mark up they employed was flexible in that it permitted the building of patterns<br />

containing a mix of elements which could be matched. The reason this capability is<br />

important can be explained by a simple example. To match the sentence, ‘Mary went to<br />

Dublin’, a QG system could be asked to look for exactly those four words in sequence.<br />

Any sentence matching those exact four words will be used to generate a question.<br />

It is quite easy to determine computationally whether or not a sentence contains the<br />

exact words ‘Mary went to Dublin’ but in any given text the probability that we are<br />

given this individual sentence is negligible. This approach to finding candidate<br />



sentences obviously needs to be improved. The coverage of the pattern must be<br />

increased, and one way in which this might be done would be to replace the word 'Mary' in the pattern with any person's name. By writing a new pattern '<NAME> went to Dublin', we would increase the coverage significantly. This functionality

was seen as a necessity if patterns were ever to become fully defined and as a result it<br />

became a design aim for Ceist. It would be possible to use groups such as ‘NAME’<br />

<strong>within</strong> the patterns and the manner in which the groups were implemented was<br />

irrelevant to Ceist.<br />

The variable '<NAME>' might match a single first name, a surname, maybe both or even

a titled name such as ‘Mrs. Mary Burke’. The important point to note is that our pattern<br />

now contains two elements, i.e. individual words and a variable representing a group of<br />

words (people’s names). The pattern could also be modified to look for any verb in the<br />

past tense. The word ‘went’ would be replaced with the part-of-speech tag ‘VBD’.<br />

Kaisser’s system allowed patterns to contain different element types and his mark up<br />

was designed to suit this. Having realised the potential of such a flexible mark up, I<br />

decided to implement similar functionality in Ceist. I made personal contact with<br />

Kaisser and he was kind enough to provide me with further technical details regarding<br />

the mark up used in the sequence definitions.<br />

Following a similar approach, Cai et al. introduced NLGML, “A Markup Language for<br />

<strong>Question</strong> Generation”. The approach was “based on lexical, syntactic and semantic<br />

patterns described in a mark-up language” (Cai et al., 2006 p.1). An example of<br />



NLGML describing a sentence of the form “somebody went to somewhere” is shown in<br />

Figure 2.13.<br />

Figure 2.13 NLGML mark-up for ‘somebody went to somewhere’<br />

The language allows semantic features to be matched using the attributes specified in<br />

the phrases e.g. (person=”true”, location=”true”). NLGML also introduces<br />

functions to address some of the issues previously described. NLGML uses the function _lowerFirst, which changes the first letter of a term to lower case. The function _getLemma changes the verb 'went' to 'go', for example, and this process is known as

lemmatisation. QuALiM also used lemmatisation and I have implemented similar<br />

functionality in Ceist. The task of lemmatisation is described further in the next sub-<br />

section (2.3.5).<br />

The work by Cai et al. and Rus et al., to begin defining a unified mark up language for<br />

rules, is very important. They consider the two most important parts of a QG system to<br />

be: the transformation rules and an interpreter. If the manner in which rules are defined is standardised, and the rules are sufficiently precise, then rules can be used across systems and there may even be an effort to create a unified set of rules for QG. What would then distinguish one rule-based system from another would be its ability to perform NLP sub-tasks such as Named Entity Recognition or lemmatisation.

2.3.5 Lemmatisation<br />

The function _getLemma and the related NLP task of lemmatisation were mentioned in

the previous sub-section. Possibly because it is deemed to be a well-researched task, QG<br />

system designers do not elaborate on the methods used by their systems to lemmatise<br />

words. Indeed, only from personally communicating with the author of the QA system<br />

QuALiM, was I able to learn more about their method of lemmatisation.<br />

QuALiM uses transformation rules to change verb forms and NLGML/QG-ML uses<br />

functions. Other functions allow matched terms to be transformed in other ways, such as<br />

to be converted to lower case characters. The NLP sub-task of determining inflected word forms has also been well researched (Porter, 1980; Jurafsky and Martin, 2009). Stemming is the process of acting on a verb using simple rules, such as removing the suffix '-ed' from a verb in the past tense to provide the base form of the verb, e.g. (walked → walk). This solution is not perfect, most notably when dealing with irregular verbs (e.g. ran, hid), because irregular verbs do not follow simple rules.

QuALiM uses a morphology database that was part of the XTAG system by Doran et al.<br />

(1994). It contains a vast number of verbs and their inflected forms that can be easily<br />

queried to perform lemmatisation. The database was originally compiled by Karp et al.<br />

(1992) using a set of morphological rules for English by Karttunen and Wittenburg<br />



(1983) and the 1979 edition of the Collins Dictionary of the English Language. A drawback is that new words must be added to the database over time, but the advantage is that the lemmatisation is very accurate.

Ceist currently uses the XTAG system but there is a limitation with lemmatisation.<br />

Lemmatisation alone is not sufficient to generate some questions. Lemmatisation only<br />

provides us with the base form for a verb, but there are cases where we would like to<br />

obtain another form for a verb. An example is the input sentence ‘John has eaten all the<br />

apples.’ Generating the question ‘Who ate all the apples?’ requires the verb ‘eat’ to be<br />

morphed from the past participle ‘eaten’ to the preterite ‘ate’. A lemmatiser such as the<br />

XTAG system will only tell us that the base form of ‘eaten’ is ‘eat’. It does not provide<br />

a means to then tell us the preterite of ‘eat’. The tool which provides this functionality is<br />

a verb conjugator.<br />
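The gap can be seen with a small sketch using NLTK's WordNet-based lemmatiser (the XTAG database used by Ceist is queried differently, so this illustrates the general point rather than Ceist's code):

from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()
print(lemmatiser.lemmatize('eaten', pos='v'))   # -> 'eat', the base form

# Nothing here tells us that the preterite of 'eat' is 'ate'; a conjugation table
# (hypothetical entries shown) or a verb conjugator would be needed for that step.
PRETERITE = {'eat': 'ate', 'go': 'went', 'run': 'ran'}
print(PRETERITE['eat'])                         # -> 'ate'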

Verb conjugation has not yet been implemented in Ceist. This is an area where the<br />

system can be improved in the future. One potential solution is to integrate the verb<br />

conjugator provided by the American Northwestern University’s MorphAdorner 9 tools<br />

with Ceist.<br />

9 http://morphadorner.northwestern.edu/morphadorner/<br />



2.4 Factivity<br />

Factivity is relevant to question generation for single sentences in the particular case<br />

where we expect the generated question to be answered by the sentence. If we ask a<br />

question based on a declarative statement in the sentence then we must be sure that the<br />

statement is an established fact. This may not always be the case as the sentences in<br />

Figure 2.14 show (the statement is underlined).<br />

Figure 2.14 Factivity sentence examples<br />

The speaker in (1) is not sure that there were 10 people at the conference whereas the<br />

speaker in (2) is absolutely certain. The clause <strong>within</strong> the sentence containing the<br />

declarative statement is known as a declarative content clause.<br />

Predicate verbs which establish a true fact, such as 'know', are known as factive verbs. Predicates which do not establish the statement as fact are known as non-factive (e.g. 'think'). Much of the pioneering work on factivity was done by Kiparsky and Kiparsky

(1970) and Hooper (1974). Kiparsky and Kiparsky explain that in <strong>factive</strong> sentences the<br />

speaker “presupposes that the embedded clause expresses a true proposition”. The<br />

speaker in (1) above does not presuppose the truth of the statement about the number of attendees at the conference, but the speaker in (2) does. Kiparsky and Kiparsky begin to

clearly define <strong>factive</strong>s and <strong>non</strong>-<strong>factive</strong>s using rules in their paper from 1970.<br />



Hooper expanded this work by further clarifying “the differences between classes of<br />

predicates that take that clauses as subject or object complements.” Table 2.2 is the list of categorised predicates drawn up by Hooper, and it is a useful starting point for a more comprehensive list. Hooper's list can be expanded using a thesaurus to create a software module for factive/non-factive recognition. This is exactly what I did for this research, and I outline the technical details of the software

module in the software overview (Chapter 4).<br />

These researchers categorise <strong>factive</strong>s and <strong>non</strong>-<strong>factive</strong>s into groups such as strong-<br />

assertive or weak-assertive, as well as mental or verbal. Their intention is that we can estimate the certainty of what a speaker presupposes by looking up the category into which the factive or non-factive verb they have used falls. This is very difficult to do.

Even as I expanded the original list of verbs by Hooper I found that the thesaurus would<br />

present the same word for two different verbs from two different categories. I was not<br />

concerned with the categories so I simply added the new word to my list once only. For<br />

my research the degree of certainty was not important.<br />

I have considered a new approach to determining the certainty with which a speaker<br />

presupposes a fact. The certainty could be estimated by studying sentences containing<br />

<strong>factive</strong> or <strong>non</strong>-<strong>factive</strong> verbs from my comprehensive list and assigning a value for the<br />

chance of certainty for each word. If it were found, for example, that 6 times out of 10 when a speaker 'claims' to have seen something they had in fact seen it, we would assign a certainty value of 0.60 to the verb 'claim'. Using a numeric value would then

allow NLP systems to select the level of factivity they desire.<br />



Factive

Assertive (semi-factive): find out, discover, know, learn, note, notice, observe, perceive, realize, recall, remember, reveal, see

Non-assertive (true factive): regret, resent, forget, amuse, suffice, bother, make sense, care, be odd, be strange, be interesting, be relevant, be sorry, be exciting

Non-factive

Assertive, weak assertive: think, believe, suppose, expect, imagine, guess, seem, appear, figure

Assertive, strong assertive (a): acknowledge, admit, affirm, allege, answer, argue, assert, assure, certify, charge, claim, contend, declare, divulge, emphasize, explain, grant, guarantee, hint, hypothesize, imply, indicate, write

Assertive, strong assertive (b): insist, intimate, maintain, mention, point out, predict, prophesy, postulate, remark, reply, report, say, state, suggest, swear, testify, theorize, verify, vow

Non-assertive: agree, be afraid, be certain, be sure, be clear, be obvious, be evident, calculate, decide, deduce, estimate, hope, presume, surmise, suspect

Non-assertive (non-negative): be likely, be possible, be probable, be conceivable

Non-assertive (negative): be unlikely, be impossible, be improbable, be inconceivable, doubt, deny

Table 2.2 Factive and non-factive predicates categorised by Hooper (1974)



2.5 Research question<br />

To begin to answer the two specific sub-questions of the research question, “What is the overall impact of implementing factive/non-factive sentence recognition as a sub-task of a Question Generation system?” and “What increase in generated question quality does it deliver?”, there are some prerequisite steps that must be taken.

A working QG system is required which uses the techniques described in this literature review (2.3). A module is added to this system which is capable of factive/non-factive sentence recognition, using a comprehensive list of factive and non-factive verbs and phrases as outlined in the previous section (2.4).

Developing this module requires drawing again on the techniques described in section<br />

2.3 and also on the knowledge gained from the entire literature review. Using<br />

techniques which are described in Chapter 3 but which were learned during the<br />

literature review, the new module will then be evaluated in order to determine its overall<br />

impact on a QG system and the increase in generated question quality which it delivers.<br />

2.6 Summary<br />

The literature review provided an insight into the techniques used to implement current<br />

QG systems. With this knowledge it was possible to develop a new QG system from<br />

scratch with specific design aims in mind.<br />



These design aims were that the system would focus on single sentences only, and that<br />

the answer to the generated question would be contained <strong>within</strong> the sentence. In fact, the<br />

system would also allow the generation of the answer in addition to the question.<br />

Another design aim was that the match pattern used to identify candidate sentences<br />

from which questions could be generated would be extremely flexible. It would allow<br />

the pattern to identify exact words, parts-of-speech, syntactic structures such as noun<br />

phrases and in addition groupings (semantic or otherwise) that increase the coverage of<br />

the patterns.<br />

The system would allow, for example, a rule author to match 'personName' within a pattern, while the implementation of how the QG system identifies a person's name is hidden from the user. The rule author is only concerned that the pattern they have written will generate valid questions if the system correctly recognises person names.

The working QG system built with these design aims would then facilitate the testing of<br />

a module capable of <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>predicate</strong> <strong>recognition</strong>. The body of knowledge<br />

described in this chapter relating to factivity was used to create this module.<br />



Chapter 3 Research Methods<br />

3.1 Introduction<br />

The primary research techniques used in this project were prototyping of a QG system<br />

and assessment of a software module providing <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> sentence<br />

recognition. The development of the prototype was based in part on knowledge gained from a review of literature relating to existing systems (Chapter 2).

The overall impact of the software module and the quality increase which it delivered<br />

were both assessed using quantitative analysis. This chapter outlines the methods used<br />

in the analysis.<br />

3.2 Research Techniques<br />

3.2.1 Prototyping<br />

The technique of prototyping was used to develop the QG system Ceist. It was built<br />

iteratively. Each iteration incorporated additional functionality or highlighted a new<br />

problem. The work relating to existing QG systems as described in the literature review<br />

(Chapter 2) was used to address these problems.<br />

The benefits of prototyping as a research method were very clear. The process of<br />

experimenting with an application and attempting to implement various features<br />

highlights potential drawbacks for which new solutions must be found. Problems not easily

solved are identified as candidates for further research.<br />



Examples of redesign during prototyping are the original parser choice, where POS tagging alone was found to be insufficient, and the choice of a lemmatiser rather than a fully fledged verb conjugator. Both limitations only became apparent through use within the prototype. The discovery of a limitation with a particular tool during prototyping was seen not as a setback but as a valuable lesson within the research as a whole.

Prototyping as a research technique was very effective. The practical exercise of<br />

building a system considerably advanced my knowledge of NLP. First hand experience<br />

of NLP problems gave me an opportunity to contemplate solutions to those problems.<br />

3.2.2 Quantitative analysis of overall impact<br />

Overall impact measures the benefit of <strong>factive</strong>/<strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> to question<br />

generation as a whole. Calculating the frequency at which <strong>factive</strong> or <strong>non</strong>-<strong>factive</strong><br />

<strong>predicate</strong>s occur in text can provide some indication of the likely benefit to be gained by<br />

recognising such <strong>predicate</strong>s. This analysis aimed to calculate those frequencies <strong>within</strong><br />

the educational discourse OpenLearn.<br />

The initial list of <strong>factive</strong> and <strong>non</strong>-<strong>factive</strong> <strong>predicate</strong>s which was drawn up by Hooper<br />

(1974) was extended using a thesaurus. For the purposes of analysing the overall<br />

impact, some phrases which indicate factivity were also used (e.g. 'got the message' as an equivalent of 'realised'). More detail about the creation of the collection is given in the software

overview (Chapter 4 – 4.7).<br />



Once the list was finalised, it was incorporated into a Ruby script. The input to this<br />

script was a sample of the entire OpenLearn online resource in text file format. Each<br />

individual sentence was read by the script, one by one. The frequency of occurrences of<br />

the terms in the list was counted.<br />
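The script itself was written in Ruby; purely as an illustration of the counting step, a minimal sketch in Python (with placeholder word lists; the full lists are in Appendix B) might look like this:

import re

FACTIVE = ['know', 'realise', 'discover', 'notice']        # placeholder entries
NON_FACTIVE = ['think', 'believe', 'claim', 'suggest']     # placeholder entries

def build_pattern(words):
    # Word boundaries so that 'know' does not match 'knowledge'; verb inflections
    # are ignored here for brevity.
    return re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b', re.IGNORECASE)

factive_re, non_factive_re = build_pattern(FACTIVE), build_pattern(NON_FACTIVE)

def count_occurrences(sentences):
    counts = {'factive': 0, 'non-factive': 0}
    for sentence in sentences:
        if factive_re.search(sentence):
            counts['factive'] += 1
        if non_factive_re.search(sentence):
            counts['non-factive'] += 1
    return counts

print(count_occurrences(['He claims that there were 10 people at the conference.']))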

The script produced a count of <strong>factive</strong> and <strong>non</strong>-<strong>factive</strong> words or phrases in a sample of<br />

the OpenLearn resource. I analysed the frequencies to determine how often <strong>factive</strong> and<br />

<strong>non</strong>-<strong>factive</strong> words and phrases would play a part in determining the certainty of an<br />

established fact in a declarative content clause.<br />

This type of analysis has been used before by QG system developers. Brown et al.<br />

(2005 p.5) “examined the percentage of words for which we could generate various<br />

question types”. Knowing the frequency of a specific set of words <strong>within</strong> a discourse<br />

allows us to deduce the overall impact which work relating to that set of words will<br />

have.<br />

3.2.3 Quantitative analysis of quality increase<br />

The measurement of question quality is an area of QG which is still in its infancy. It is<br />

generally accepted that generated questions must be complete and grammatically<br />

correct. Gates (2009) scored a generated question by assessing “whether it was<br />

syntactically grammatical and whether it made sense semantically in the context of the<br />

text”. This aspect of question quality is important but these criteria are not useful in<br />

measuring the improvement which <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> delivers.<br />



For my specific focus area, the generation of questions answerable by the input sentence, I must use another criterion: is the question explicitly answerable by the input sentence?

A quantitative analysis of the increase in quality produced by the factivity module<br />

therefore needs to measure the number of questions unanswerable by the input<br />

sentences which are no longer generated when the module is added to Ceist. It should<br />

also measure the number of perfectly good questions which were no longer generated, if<br />

any.<br />

To execute such measurements, the entire parsed OpenLearn data set was input into Ceist with no factive / non-factive recognition (Ceist-Baseline), using two rules. The rules generated yes/no type questions and 'what' type questions.

A yes/no type question has simply yes or no as an answer. For analysis I determined if it<br />

was possible to answer the generated question with ‘yes’ or ‘no’ given the input<br />

sentence. The particular rule I used was aimed at generating questions with ‘yes’ as the<br />

answer. Grammatically incorrect or incomplete questions were removed from the<br />

output. Table 3.1 shows an example of the type of output which was rejected for the<br />

yes/no rule type.<br />



Input sentence: The point is that, whatever the shape of the memorial, there has to be agreement that the form is appropriate in order that the meaning, and therefore the function, is assured.
Rejected question: Is the form appropriate in order?

Input sentence: So, it is important to realize that a water molecule is quite different from the two types of atom from which it is formed.
Rejected question: Is a water molecule quite different from the two types of atom?

Input sentence: `The Mouse's Tale' is fun, but it seems to me that here meaning is rather unrelated to form, apart from the tale/tail pun.
Rejected question: Is here meaning rather unrelated?

Table 3.1 Sample of YES/NO questions rejected

For each of the remaining questions, it was decided whether the generated question<br />

could be explicitly answered from the input sentence, i.e. the answer was an established<br />

fact in the sentence. Table 3.2 lists a sample of generated questions which were considered answerable by the input sentence.

Input sentence: A second useful feature to notice is that the sum of all the deviations is equal to zero.
Generated question: Is the sum of all the deviations equal to zero?

Input sentence: We now know that the world is spherical and you won't fall off because gravity holds you to the planet's surface.
Generated question: Is the world spherical?

Input sentence: Notice that the cell constant is very small and is measured in nanometres.
Generated question: Is the cell constant very small?

Table 3.2 Sample of YES/NO questions considered answerable by the input

On the other hand, some questions which were generated were deemed unanswerable by<br />

the input sentence alone. Note from the samples in Table 3.3 that each question is unanswerable because of the uncertainty relating to the non-factive verbs.



Input sentence: Answers to these questions are beyond the scope of this unit, but they indicate that current understanding of brain functioning in autism is provisional.
Generated question: Is current understanding of brain functioning in autism provisional?

Input sentence: Methodological considerations: critics have argued that Lovaas's selection of participants is ill-defined, and that the design of the interventions can not exclude improvements due to confounding factors.
Generated question: Is Lovaas's selection of participants ill-defined?

Input sentence: However, we would suggest that the relationship between in-school and out-of-school knowledge is different in the case of language.
Generated question: Is the relationship between in-school and out-of-school knowledge different in the case of language?

Table 3.3 Sample of YES/NO questions not answerable by the input

This analysis of the generated questions produced a set of values for each rule; the<br />

number of questions which could be explicitly answered with absolute certainty by the<br />

input sentence and the number that could not. Our ideal system would produce only the<br />

former.<br />

The same rules were then run against the same data set using Ceist with factive / non-factive recognition enabled (Ceist-Factivity). The same set of values was produced, but in this run it was expected that the number of questions which cannot be explicitly answered by the input sentence would be reduced. In addition, we measure the intersection of the good questions from Ceist-Baseline with those from Ceist-Factivity to determine if there were any false positives.

I adapted the formal model for precision and recall for my analysis of quality increase.<br />

Previous QG researchers have focused on precision more so than recall but recall is<br />



applicable in the case where one has a baseline generated output and wishes to compare<br />

with a second set of generated output. Recall allows us to measure any degradation in<br />

system output.<br />

The original definitions for precision and recall were given by Van Rijsbergen (1976).<br />

Van Rijsbergen was writing with respect to information retrieval and so his definitions<br />

relate to the number of relevant documents and the number of retrieved documents. QG<br />

researchers can use similar metrics by redefining A and B in Van Rijsbergen’s original<br />

equations shown in Figure 3.1.<br />

Figure 3.1 Precision and Recall: Van Rijsbergen’s (1976) formal definition<br />

Precision for the purposes of question generation defines A as the number of good<br />

questions and B as the total number of generated questions. This calculates precision as<br />

“the proportion of good questions out of all generated questions” according to Rus et al.<br />

(2007a). Similarly, recall can be adapted to measure the number of good questions ‘lost’<br />

by the new module. Figure 3.2 shows the adapted versions of the original precision and<br />

recall definitions.<br />



Figure 3.2 Precision and Recall: Adapted for this research<br />

The adapted value for precision therefore indicates the overall quality of what was generated, i.e. the proportion of acceptable questions (explicitly answerable by the input sentence) out of all generated questions. The adapted value for recall indicates the

degradation in the system as a result of enabling <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong>. It tells<br />

us the proportion of acceptable questions which were retained.<br />
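In LaTeX notation, and following the descriptions above, the original and adapted definitions can be written as follows (my own rendering of what Figures 3.1 and 3.2 describe, not a reproduction of them):

% Van Rijsbergen's definitions, with A the set of relevant items and B the set retrieved:
\[ \text{precision} = \frac{|A \cap B|}{|B|}, \qquad \text{recall} = \frac{|A \cap B|}{|A|} \]

% Adapted for this research:
\[ \text{precision} = \frac{\text{number of acceptable questions generated}}{\text{total number of questions generated}} \]
\[ \text{recall} = \frac{\text{number of acceptable questions retained by Ceist-Factivity}}{\text{number of acceptable questions generated by Ceist-Baseline}} \]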

3.3 Summary<br />

The primary methods used in this research were prototyping and quantitative analysis.<br />

Prototyping facilitated the development of a QG system using an iterative approach.<br />



The system was built up in stages and based on observations it was modified and<br />

improved over successive builds. The prototype drew on an existing body of knowledge<br />

and resources which were referred to continuously.<br />

The quantitative analysis consisted of two separate assessments. The first used a script

to read sample sentences from the OpenLearn educational resource and count<br />

occurrences of <strong>factive</strong> or <strong>non</strong>-<strong>factive</strong> indicating words and phrases. These counts can be<br />

used to determine the overall impact which a module capable of recognising these terms<br />

might have.<br />

The second quantitative analysis was executed to determine the quality increase that the same module would deliver to the questions generated by the system. This method required comparing the output of a baseline QG system with that of a system capable of factive / non-factive recognition. To execute this comparison, the formal

definitions for precision and recall as used in information retrieval were adapted to suit<br />

the specific needs of this research.<br />



Chapter 4 Software Overview<br />

4.1 Introduction<br />

This chapter describes the implementation of Ceist, the question generation system<br />

which I developed for my research. Ceist was designed and built based on source code<br />

from existing NLP systems, technical documentation and papers and with the direct<br />

assistance of some of the authors of these systems. The data source used in conjunction<br />

with Ceist is the Open University’s OpenLearn online educational resource. Textual<br />

data was extracted and parsed from the entire online OpenLearn resource for use with<br />

Ceist.<br />

Chapter 2 described term extraction (2.3.2) and rule-based mapping (2.3.4). This<br />

provided a general overview of how tagged sentences provide NLP systems with the<br />

additional information needed to identify the structure of a sentence and the syntax of<br />

the words in that sentence. The technical detail of how this is accomplished <strong>within</strong> Ceist<br />

is described in this chapter.<br />

Figure 4.1 shows an overview of Ceist’s architecture and this chapter describes some of<br />

the parts of this architecture. I begin with an explanation of how simple tagged<br />

sentences are matched using match patterns and then show how a valid match is<br />

transformed to generate a question using the question template. The process for<br />

sentences tagged with grammatical structure information differs slightly and so this is<br />

explained in a separate section. The chapter ends with a description of the rules<br />

contained in the rule repository and their representation followed by a sub-section<br />

relating to the <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> software module.<br />



Figure 4.1 Ceist system architecture<br />

4.2 Preparation of the OpenLearn data set<br />

The Open University online educational resource OpenLearn was used as source data<br />

for Ceist. Figure 4.2 shows the processing which

was performed on the entire OpenLearn resource. Relevant text is extracted from the<br />

units and then parsed with syntactic structure information.<br />

Figure 4.2 OpenLearn Study Unit processing<br />



The Open University provides the OpenLearn study units in a variety of formats but the<br />

XML format is particularly useful. This is because the XML tags mark specific content<br />

within the units, such as unit headers, images, tables and the actual course text. Figure 4.3 is an extract from one such XML file.

Figure 4.3 OpenLearn XML format<br />

A Python script is used to extract only the course text for use with question generation.<br />

The script targets specific nodes in the XML file and extracts their content. A portion of<br />

the script is shown in Figure 4.4. The extracted

text is cleaned to remove oddities (e.g. table references, pronunciations) and then parsed<br />

into syntax trees using the Stanford NL Parser.<br />



Figure 4.4 Python extraction script<br />
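As a rough sketch of this extraction step only, and with the element name ('Paragraph') and file name as assumptions rather than the actual OpenLearn schema, such a step could be written as:

import xml.etree.ElementTree as ET

def extract_course_text(xml_path, node_tag='Paragraph'):
    # Collect the text of every element with the given tag from an XML study unit.
    tree = ET.parse(xml_path)
    texts = []
    for node in tree.getroot().iter(node_tag):
        text = ''.join(node.itertext()).strip()
        if text:
            texts.append(text)
    return texts

# for paragraph in extract_course_text('study_unit.xml'):
#     print(paragraph)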

The parsed output from the Stanford Parser provides syntactic structure and POS<br />

information in one line. The resulting data structure is a tree and the Tregex tool which<br />

is part of the Stanford NLP tools is used to match trees with a similar structure.<br />

4.3 Matching tagged sentences<br />

In sub-section 2.3.2 (Term Extraction) the process of parsing a sentence was described.<br />

To summarise: parsers take plain text and identify its grammatical structure. They

typically output a tagged version of the input sentence that incorporates the grammatical<br />

structure in a tree data structure.<br />

Before I explain the process of searching for patterns in a tree data structure, let's examine the simpler case of matching sentences that have only been POS tagged (a plain string data type) with regular expressions. Figure 4.5 is an example of the sentence 'John went to Dublin'

after it has been tagged to include POS information.<br />



Figure 4.5 POS tagged sentence<br />

The words ‘John’ and ‘Dublin’ are both proper nouns and have been tagged as ‘NNP’.<br />

The tag ‘NNP’ represents a singular proper noun. There are different conventions for<br />

POS tags and this example is from the Penn Treebank Tagset 10 . Let’s consider finding<br />

similar sentences to (1) from which we can generate the question ‘Where did John go?’<br />

or alternatively ‘Who went to Dublin?’<br />

Regular expressions allow specific strings to be found using a formal language. The<br />

syntax of regular expressions is beyond the scope of this dissertation but I will give a<br />

brief explanation here. A regular expression contains plain text and formal expressions<br />

designed to match a specific string pattern. In fact, sentence #1 from Figure 4.5 is a<br />

valid regular expression which would only match itself.<br />

The regular expression below matches any sequence of characters non-greedily, consuming as few characters as possible before the next element of the pattern:

10 http://www.cis.upenn.edu/~treebank/<br />

.*?<br />



We can use this pattern to match a sentence with an exact sequence of POS tags,<br />

regardless of the words. For example:<br />

(a) .*?/NNP .*?/VBD .*?/TO .*?/NNP ./.*<br />

This pattern matches (1) above and (2) and (3) in Figure 4.6 below, but not (4). Because<br />

the final word in (4) is tagged as NN (a noun), the pattern above will not match it. It<br />

requires the fourth word to be a proper noun (NNP). This is good; we have ensured the<br />

pattern only matches proper nouns for the fourth word.<br />

Figure 4.6 Example with proper nouns<br />

Consider sentences (5) and (6) in Figure 4.7 and the problem which will occur if we<br />

persist with pattern (a) above. Given the aim of finding sentences which allow us to ask<br />

the question ‘Where did John go?’, (6) is invalid because it refers to a person called<br />

‘Washington’, but our pattern (a) will match it. The pattern must be refined to ensure that only sentences relating to a person's movement are matched. It could be argued that the 'Washington' in (5) refers to a person too, but like many NLP processes we base our example on probability: it is more likely that the 'Washington' following the verb 'went' refers to a city rather than a person.

Figure 4.7 Example of POS limitation<br />

The pattern is re-written to accept only the verb ‘went’ as shown in (b) below. It will<br />

work but we are discounting all other verbs relating to movement (e.g. walked, traveled,<br />

hiked, ran, etc.) We could write a separate pattern for each motion related verb but there<br />

is an easier solution.<br />

(b) .*?/NNP went/VBD .*?/TO .*?/NNP ./.*<br />

Regular expressions allow one to form a disjunction of several sub-patterns which may<br />

be acceptable in a pattern. The pattern (c) below incorporates a sample of three motion<br />

related verbs to expand the coverage of the overall pattern. Any of the three verbs is<br />

acceptable as the second word.<br />

(c) .*?/NNP (ran|went|hiked)/VBD .*?/TO .*?/NNP ./.*<br />



There are several motion related verbs in the English language. Inserting them all into a<br />

regular expression would make the resulting pattern very long and difficult to read.<br />

Furthermore, if we mistakenly omit a verb and wish to add it later, we must find every<br />

pattern using motion verbs and update each separately. A QG system could have many<br />

such patterns.<br />

Ceist allows the creation of groups. Groups are disjunctions with a simple name tag<br />

which can be used in regular expressions. It is possible to create a group called<br />

motionVerbs, for example, with the members ‘ran’, ‘went’ and ‘hiked’. At runtime<br />

Ceist will replace any occurrence of motionVerbs in a regular expression with the<br />

pattern ‘(ran|went|hiked)’. This means that the pattern (d) below is actually<br />

interpreted as (c) by the regular expression engine in Ceist.<br />

(d) .*?/NNP motionVerbs/VBD .*?/TO .*?/NNP ./.*<br />

This ability to match sentences containing specific words or POS tags is quite powerful because, for QG, a specific question may only be applicable to a very precise set of sentences. Our example has demonstrated this for the question 'Where did <NAME> go?' When a sentence matching specific patterns is found, it is transformed

in such a way as to generate a question. The next sub-section explains how Ceist uses<br />

capture groups <strong>within</strong> regular expressions to do this.<br />
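A minimal sketch (in Python, rather than Ceist's own code) of the matching and group-expansion behaviour just described, using pattern (d):

import re

GROUPS = {'motionVerbs': ['ran', 'went', 'hiked']}   # a simple user-created group

def expand_groups(pattern):
    # Replace each group name with a disjunction of its members, e.g. (ran|went|hiked).
    for name, members in GROUPS.items():
        pattern = pattern.replace(name, '(?:' + '|'.join(members) + ')')
    return pattern

# Pattern (d), written with \S+ in place of .*? for each word position.
pattern_d = expand_groups(r'\S+/NNP motionVerbs/VBD \S+/TO \S+/NNP \S/\.')
tagged = 'John/NNP went/VBD to/TO Dublin/NNP ./.'
print(bool(re.fullmatch(pattern_d, tagged)))   # -> True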



4.4 Transformation to questions<br />

One way in which a question is generated from a sentence is by rearranging or transforming the words in the sentence. A question can be generated from the example sentence 'John went to Dublin' by simply repositioning the subject 'John', changing the verb form of 'went' and asking 'Where did John go?'. The question 'Where...?' can only be asked because the sentence is of the form '[subject] went to [place]'. Finding sentences of a specific form was the objective of the pattern creation described in the previous sub-section for this reason.

The task of repositioning the subject requires finding the subject in the matched

sentence and placing it into the generated question at a specific place. Regular<br />

expressions allow certain parts of the match pattern to be marked as groups by using<br />

parentheses. Groups are numbered in sequence after group zero, which represents the<br />

entire match. The engines which provide regular expression capability allow access to<br />

these groups by the sequence number.<br />

Ceist allows access to these groups <strong>within</strong> a pattern by naming them using the syntax<br />

‘=gXX’ where XX specifies the group number. The group number can then be used to<br />

reposition matched terms to form a question as can be seen in the screenshot from Ceist<br />

in Figure 4.8.<br />



Figure 4.8 Group matching in Ceist<br />

Templates, such as the question template in Figure 4.8, use the forward slash as a meta-<br />

character to indicate that the group number matched in the pattern should be inserted at<br />

this point in the template. The screenshot shows the first proper noun is tagged as group<br />

1 (NNP=g1) and then inserted into the question template using ‘/1’ producing the<br />

desired result. Note that Ceist indicates the matched group numbers in its results display<br />

and also colours the matched terms green if they are used in the question template or<br />

black if used in the answer template.<br />
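The effect of the '/N' references can be emulated with a few lines of Python (a sketch of the behaviour described above, not Ceist's implementation):

import re

def fill_template(template, match):
    # Replace each '/N' in the template with capture group N from the matched sentence.
    return re.sub(r'/(\d+)', lambda m: match.group(int(m.group(1))), template)

match = re.match(r'(\S+)/NNP went/VBD to/TO (\S+)/NNP', 'John/NNP went/VBD to/TO Dublin/NNP ./.')
print(fill_template('Where did /1 go ?', match))   # -> Where did John go ?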

The combination of a match pattern and a question generation template is known as a<br />

rule. The rule storage format is described in sub-section 4.6 of this chapter.<br />



4.5 Matching sentences with structure information<br />

One of the main focus points of Ceist was rule flexibility. It was very important that<br />

Ceist had the ability to group vast permutations of words (be they phrases or semantic<br />

groups) and thus reduce the number of rules. The importance of this is explained in Chapter 2, sub-section 2.3.4 (Rule-based mapping). How this is achieved

technically by Ceist is the subject of this sub-section.<br />

Here is the same example sentence I have been using to this point, displayed with its<br />

grammatical structure information embedded.<br />

Figure 4.9 Sentence with grammatical structure<br />

This output was generated using the Stanford Parser (Klein and Manning, 2003), a<br />

working version of which is available online 11 .<br />

11 http://nlp.stanford.edu:8080/parser/<br />



The additional information present in this sentence provides us with the flexibility we<br />

sought. Rather than matching a sentence of the format ‘NNP’, ‘went to’, ‘NNP’, which is<br />

limited to sentences beginning with a single proper noun, we can search for sentences<br />

beginning with a noun phrase. Allowing Ceist to match ‘NP’ 12 , ‘went to’, ‘NNP’<br />

matches all the sentences in Figure 4.10 using one expression.<br />

Figure 4.10 Demonstrating use of noun phrase<br />

In order to efficiently search for a noun phrase in the sentence that includes structure<br />

information, regular expressions alone are not sufficient. This limitation was identified<br />

by Roger Levy and Galen Andrew (2006) and they wrote the Tregex tool to extend<br />

regular expressions to work with tree data structures. If we view the example sentence<br />

above as a tree structure it can be seen how this works. A sentence with only four words<br />

has a relatively complex sentence structure as can be seen in Figure 4.11. This tree<br />

structure viewer was adapted from the Tregex source code and is integrated into Ceist.<br />

12 NP is the tag used to indicate a noun phrase group in the Penn Treebank tagset<br />



Figure 4.11 Sentence structure viewed as a tree in Ceist<br />

Whereas normal regular expression engines are capable of matching only linear strings,<br />

Tregex allows regular expressions to be written to match specific nodes <strong>within</strong> tree data<br />

structures. Only by using Tregex can Ceist search for a noun phrase node followed by

the words ‘went’ and ‘to’ and the POS tag ‘NNP’. It also provides tree manipulation<br />

functions that can, for example, allow Ceist to find all the leaves of a match node. An<br />

example of a Tregex expression is shown in the match pattern in Figure 4.8.<br />

4.6 Rule representation<br />

Rule-based mapping in Chapter 2 (2.3.4) touched on representations using XML to<br />

store rules. Rules in Ceist consist of a match pattern and a template. The match pattern<br />

is used to determine whether the question generation template can be applied to a<br />



sentence. Ceist also uses XML to store rules and an example is shown in Figure 4.12,<br />

which contains one such rule.<br />

Figure 4.12 XML representation of a rule in Ceist<br />

The XML in Figure 4.12 represents one rule containing a match pattern (match-patterns), a question template (question-template) and also an answer template (answer-template). The entire rule is contained within a rule element. The name attribute of rule can be used to name the rule, or to summarise it as is done in the figure. A brief description of each section follows.

The match-patterns section contains each part of the match pattern. This bears<br />

some similarity to the approach employed by QuALiM. The parse elements <strong>within</strong> the<br />

match-pattern each represent a part of the regular expression sent to Tregex by Ceist.<br />



The attribute id is important. This attribute is used by the template to generate the<br />

question using matched parts and will be described in the next paragraph. The level<br />

attribute is simply an indentation level used by Ceist to allow nested expressions. The<br />

value for each parse element contains the actual item to be matched. The first three<br />

elements represent a noun phrase group (NP), the word 'was' and a past participle

(VBN).<br />

The final parse element shown in Figure 4.12 contains some XML formatted characters<br />

but actually represents ‘NP


Figure 4.13 Rule editing interface in Ceist<br />

When Ceist finds a sentence that matches the pattern described previously it then uses<br />

the question-template element to generate the question. The question-<br />

template element contains both word and ref elements. Each element is read in<br />

sequence and added to the output to generate the final question. A word element simply<br />

inserts the text value onto the output. The integer value given in the ref element is<br />

used to find the group of the matched sentence which should be appended to the output<br />

as was described in section 4.3.<br />
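As an illustration of this structure (the element and attribute names below follow the description above, but the exact format of Ceist's rule files may differ), a rule of this shape can be read with a few lines of Python:

import xml.etree.ElementTree as ET

# Hypothetical rule: match an NP, the word 'was' and a past participle,
# and generate a yes/no question 'Was <NP> <VBN>?'.
RULE_XML = """
<rule name="NP was VBN">
  <match-patterns>
    <parse id="1" level="0">NP</parse>
    <parse id="2" level="0">was</parse>
    <parse id="3" level="0">VBN</parse>
  </match-patterns>
  <question-template>
    <word>Was</word><ref>1</ref><ref>3</ref><word>?</word>
  </question-template>
</rule>
"""

def read_rule(xml_text):
    rule = ET.fromstring(xml_text)
    parts = [p.text for p in rule.find('match-patterns')]            # pattern parts, in order
    template = [(el.tag, el.text) for el in rule.find('question-template')]
    return rule.get('name'), parts, template

print(read_rule(RULE_XML))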

The final output consists of a question which has been formed from a template in combination with matched parts of the original input sentence. A rule-based system will consist of several rules, and it would be expected that, given a large block of text, many of these rules will match sentences in the text and consequently generate a number of questions.

4.7 <strong>Factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> functionality<br />

Enhancing Ceist with factive / non-factive recognition capability was done using the list of words first drawn up by Hooper in 1974 (and presented in Table 2.2) as a starting point. The list of words was expanded using the thesaurus provided with Apple's Mac OS X operating system, a digital version of The Oxford American Writer's Thesaurus (Lindberg, 2004), which contains both American and British English phrases. I extended

the list with both words and phrases from the thesaurus.<br />

Where words were derived from the thesaurus for an existing word in Hooper’s list, the<br />

new words were inserted into the same category as Hooper had originally assigned. It<br />

was not uncommon for the same word to appear in the thesaurus for two words from<br />

two different categories in Hooper's list. This indicates that the process of categorising words in the manner which Hooper has done is quite difficult. Categorisation was not

required for this research but I do propose an approach to assigning a factivity value to<br />

words in the conclusions chapter (Chapter 6).<br />

The collection of factives and non-factives was used in two different ways to carry out both of the quantitative analysis methods described in Chapter 3. The Tregex engine had some limitations which prevented the use of phrases within Ceist, so in order to gain an accurate figure for overall impact using both words and phrases, a script was used.



The words and phrases from the list were then converted to regular expressions. A Ruby script was used to create the expressions, and the same script was used to count the frequencies within the parsed OpenLearn trees, as presented in Chapter 5 (Results).

<strong>Factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> was implemented by using Ceist’s grouping feature.<br />

Two groups were created named all<strong>Factive</strong> and allNon<strong>factive</strong> containing all<br />

the words from the comprehensive list. The regular expressions for each word were<br />

written to include all verb forms. Enabling <strong>factive</strong> or <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> was then<br />

done for a rule by adding a check for all<strong>Factive</strong> or allNon<strong>factive</strong> at the<br />

desired position in that rule.<br />
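A minimal sketch of how such a group can be expanded into a regular expression (with placeholder entries; the full lists and their inflected forms are given in Appendix B):

import re

# Hypothetical fragment of the allNonfactive group, with inflected forms listed explicitly.
NON_FACTIVE_FORMS = {
    'claim':   ['claim', 'claims', 'claimed', 'claiming'],
    'believe': ['believe', 'believes', 'believed', 'believing'],
}

def group_regex(forms_by_verb):
    all_forms = [form for forms in forms_by_verb.values() for form in forms]
    return re.compile(r'\b(?:' + '|'.join(map(re.escape, all_forms)) + r')\b', re.IGNORECASE)

allNonfactive = group_regex(NON_FACTIVE_FORMS)
print(bool(allNonfactive.search('He claims that there were 10 people at the conference.')))  # -> True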

4.8 Summary<br />

Ceist uses OpenLearn as a data source. The benefit of the XML format study units<br />

available from OpenLearn online was discussed. This chapter described how an<br />

OpenLearn unit in this format is converted to a format which is searchable for the<br />

purposes of natural language processing.<br />

I also detailed the use of regular expressions to match patterns in sentences. This<br />

permits the targeting of similar sentences from which a specific question can be<br />

generated. A feature of Ceist is the grouping of semantically similar words, which makes pattern definition easier and more maintainable.



The method of generating questions I have focused on is to take some of the matched<br />

terms from input sentences and change their order and/or transform them using<br />

techniques such as lemmatisation. This chapter described how captured groups <strong>within</strong><br />

regular expression patterns facilitate this process.<br />

The chapter details how the combined regular expression pattern and a template for<br />

generating a question from this expression are stored together to form a rule. The<br />

combination of many such rules forms a QG system.<br />

Finally I described the addition of <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>predicate</strong> <strong>recognition</strong> to Ceist. I<br />

used a comprehensive list of verbs and phrases derived from other research and represented this list, both in a script and within Ceist, as regular expressions. This functionality allowed me to analyse its impact on a QG system.



Chapter 5 Results<br />

5.1 Introduction<br />

Two sets of quantitative analysis were carried out for this research. The first focused on

measuring the overall impact which <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> can have <strong>within</strong><br />

QG as a whole. I measured this by counting the frequency at which <strong>factive</strong> or <strong>non</strong>-<br />

<strong>factive</strong> words and phrases occurred <strong>within</strong> a sample of the educational discourse<br />

OpenLearn. The results were obtained both at a general level and for a specific clause type, and are presented in this chapter.

The second analysis measured the benefit of <strong>factive</strong> / <strong>non</strong>-<strong>factive</strong> <strong>recognition</strong> with<br />

regard to the quality of the QG system output. A detailed description of quantitative<br />

analysis of question quality is given in Chapter 3 (3.2.3) and this analysis applied the<br />

methods described therein. This chapter presents the values for precision and recall<br />

obtained for two rules in Ceist.<br />

5.2 Overall impact

For the impact analysis the entire OpenLearn online resource was used. The parsed resource contains over 100,000 sentences [14] in over 475 study units and covers a varied range of topics in an educational format. The factive / non-factive recognition module used a list of words and phrases which indicate factivity. In the software overview (Chapter 4 - 4.7) I explained how Hooper's list of factive and non-factive predicates (Table 2.2) was expanded to produce a larger set of predicates. The complete list is available in Appendix B.

[14] This is the number of sentences within content paragraphs. Other content within questions, captions, activities, tables etc. was not used.

The first set of data shows the number of occurrences of factive / non-factive indicating phrases in a sample of our parsed study units. The sample was a random selection of 17 study units across various categories. The data are presented in tabular format in Table 5.1 and as a pie chart in Figure 5.1.

Item                                        Count    Percentage
Total sentences                              4673       100%
Containing factive phrase                     362       7.75%
Containing non-factive phrase                 655      14.01%
Containing factive or non-factive phrase     1017      21.76%

Table 5.1 Occurrences of factive / non-factive

Figure 5.1 Chart showing occurrences of factive / non-factive


This gives a good indication of the frequency of factive / non-factive usage in educational material: over 20% of sentences contain some factive or non-factive indicating verb or phrase.

For the purposes of QG and this research we would like to focus on studying only the occurrences where there is a declarative content clause immediately succeeding the factive / non-factive phrase. This allows us to ignore such sentences as ‘I know who you are.’ or ‘They saw the movie.’ and analyse instead sentences such as ‘He claims that there were 10 people at the conference.’
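A much simplified sketch of how such counts could be obtained is shown below. It is illustrative only: the toy predicate list and the use of a following ‘that’ as a stand-in for "immediately precedes a subordinate clause" are assumptions made here, whereas the actual analysis ran the full regular-expression lists of Appendix B over the parsed OpenLearn units.

    import re

    # Toy sample of predicates; Appendix B gives the full lists used in practice.
    PREDICATES = ["know", "knows", "knew", "claim", "claims", "claimed", "see", "saw"]
    ANYWHERE = re.compile(r"\b(?:" + "|".join(PREDICATES) + r")\b", re.IGNORECASE)
    # Crude approximation of "immediately followed by a declarative content clause":
    # the predicate is directly followed by the complementiser 'that'.
    BEFORE_CLAUSE = re.compile(r"\b(?:" + "|".join(PREDICATES) + r")\s+that\b", re.IGNORECASE)

    sentences = [
        "I know who you are.",
        "They saw the movie.",
        "He claims that there were 10 people at the conference.",
    ]

    contains_any = sum(1 for s in sentences if ANYWHERE.search(s))
    before_clause = sum(1 for s in sentences if BEFORE_CLAUSE.search(s))
    print(contains_any, before_clause)   # -> 3 1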

The sentences were analysed again to measure only the occurrences where the clause containing a factive / non-factive phrase immediately preceded another clause. We make the assumption that the succeeding clause is a subordinate clause. Again the data are presented in tabular format, in Table 5.2, and as a chart in Figure 5.2.

Item                                        Count    Percentage
Total sentences                              4673       100%
Containing factive phrase                     106       2.27%
Containing non-factive phrase                 206       4.41%
Containing factive or non-factive phrase      312       6.68%

Table 5.2 Occurrences of factive / non-factive directly before a new clause


Figure 5.2 Occurrences of factive / non-factive directly before a new clause

The data show that just under 7% of sentences contain a clause preceded by a factive or non-factive phrase; almost 4.5% of all sentences contain a non-factive phrase in this position. Current QG efforts focus mainly on declarative clause types because typically such clauses make a statement about which a question can be asked.

My analysis measures occurrences in all clause types; further work could evaluate declarative clauses only. In any case, if the statements made in this 4.5% of sentences are taken to be true when they are not, then questions generated from them will possibly not be answerable by the sentence.

5.3 Quality Increase

The performance analysis measures the improvement in the quality of generated questions once factive / non-factive recognition has been included. This involved running Ceist with two rules designed specifically to generate questions from statements in declarative content clauses. The method used was outlined in the research methods chapter (3.2.3). Briefly, the output from both rules was first cleaned by removing grammatically incorrect or incomplete sentences. The remaining questions were categorised as either answerable or not answerable by the input sentence.

The output from a baseline Ceist system without factive / non-factive recognition (Ceist-Baseline) was assessed, and then the output from an improved system with the extra functionality added (Ceist-Factivity) was also assessed. The results are presented in tabular format in Table 5.3.

Rule 1 - YES/NO

                   Answerable by sentence    Count    Precision    Recall
Ceist-Baseline     YES                          20        48%        N/A
                   NO                           22
Ceist-Factivity    YES                          17        71%        85%
                   NO                            7

Rule 2 - WHAT

                   Answerable by sentence    Count    Precision    Recall
Ceist-Baseline     YES                          28        54%        N/A
                   NO                           24
Ceist-Factivity    YES                          21       100%        75%
                   NO                            0

Table 5.3 Quality increase results
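As a quick arithmetic check on these figures (taking precision to be the proportion of generated questions answerable by the input sentence, and recall to be the proportion of the baseline's answerable questions retained by Ceist-Factivity), the reported percentages can be reproduced from the raw counts:

    # Raw counts from Table 5.3: (answerable, unanswerable) question counts.
    rules = {
        "Rule 1": {"baseline": (20, 22), "factivity": (17, 7)},
        "Rule 2": {"baseline": (28, 24), "factivity": (21, 0)},
    }

    for name, counts in rules.items():
        b_yes, b_no = counts["baseline"]
        f_yes, f_no = counts["factivity"]
        print(name,
              f"baseline precision {b_yes / (b_yes + b_no):.0%},",
              f"factivity precision {f_yes / (f_yes + f_no):.0%},",
              f"recall {f_yes / b_yes:.0%}")
    # Rule 1 baseline precision 48%, factivity precision 71%, recall 85%
    # Rule 2 baseline precision 54%, factivity precision 100%, recall 75%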



It can be seen that precision has increased for both rules, at some cost to recall. Notably, for rule 2 all unanswerable questions were removed, giving a precision of 100%. Rule 2 did, however, see a decrease in answerable questions of 25%. The pie charts below show the results for precision for each rule; the lighter blue represents questions deemed unanswerable by the input sentence. Chapter 6 will assess these results in more detail, but it can be seen that precision has improved in both cases.

Figure 5.3 Proportion of Answerable Questions for Rule 1

Figure 5.4 Proportion of Answerable Questions for Rule 2

Recall was added as a metric so that we could determine the degradation in the system as a result of adding factive / non-factive recognition. The results for recall show that some acceptable questions produced by Ceist-Baseline were removed by Ceist-Factivity. The bar chart in Figure 5.5 shows the recall values of 85% for rule 1 and 75% for rule 2; in other words, 15% and 25% respectively of the baseline's answerable questions were lost. This result can be interpreted as a setback, but some researchers believe we must aim for maximum precision at a cost to recall, i.e. quality over quantity. In my conclusions (Chapter 6) I refer to previous work which has taken this view.

Figure 5.5 Recall values for both rules


Chapter 6 Conclusions

6.1 Introduction

It was the intention of this project both to answer the research question and at the same time to provide some benefit to the QG community. I believe the work done has been largely successful in accomplishing both of these general aims. This chapter outlines some of the conclusions which can be drawn from the research.

6.2 Assessment of factive/non-factive recognition module

From the results of the first analysis it was found that over 20% of sentences in the educational discourse OpenLearn contained either a factive or non-factive verb or phrase. This is quite high and would justify further research in the area. I also measured the occurrence of factives or non-factives preceding another clause, and this was also high, accounting for over 6.5% of sentences.

One conclusion which Rus et al. (2007a) drew from evaluating their QG system was that it was better to generate a smaller number of good, precise questions rather than a larger number of questions containing lots of bad ones. They used precision as their metric to report performance.

There was an increase in precision for questions answerable by the input sentence when using factive / non-factive recognition. In the case of one rule, all unanswerable questions were removed. The drawback is the occurrence of false positives, i.e. questions which are actually answerable by the input sentence but were incorrectly removed by factive / non-factive recognition.

I believe that a system should be designed primarily to produce high-precision results and then improved to include valid output which has been incorrectly rejected. Based on this belief I would rate the module's performance as quite good. There is room for improvement, as I outline below, but it did perform adequately.

6.3 Further work

The work done relating to factivity to date is very black and white: researchers have been concerned with determining whether the content clause in a statement is presupposed or not. I believe an approach to measuring factivity would be a progression on the current body of knowledge. The speaker is more confident that there were 10 people at the conference in (3) below than in (4), yet current research only tells us that both ‘be sure’ and ‘be possible’ are non-factive.

(3) I am sure that there were 10 people at the conference.

(4) It was possible that there were 10 people at the conference.

It would be more useful if an NLP system had more choice in the level of confidence acceptable to it. If the phrase ‘be sure’ indicates a level of confidence higher than ‘be possible’ then this should be measurable and thus selectable by an NLP system. A stochastic approach could analyse these phrases to ascertain the probability that, given for example that the speaker is sure about a proposition, the proposition is in fact true.
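As a very rough sketch of what such a stochastic approach might look like (the annotated examples, numbers and threshold below are entirely invented for illustration), one could estimate, for each predicate, the proportion of annotated occurrences in which the embedded proposition turned out to be true, and treat that proportion as a graded factivity score:

    from collections import defaultdict

    # Hypothetical annotations: (predicate phrase, was the embedded proposition true?).
    # In practice these judgements would come from a large annotated discourse.
    annotations = [
        ("be sure", True), ("be sure", True), ("be sure", True), ("be sure", False),
        ("be possible", True), ("be possible", False), ("be possible", False),
        ("know", True), ("know", True), ("know", True),
    ]

    counts = defaultdict(lambda: [0, 0])   # predicate -> [times true, total]
    for predicate, was_true in annotations:
        counts[predicate][0] += int(was_true)
        counts[predicate][1] += 1

    # Graded factivity score: estimated P(proposition is true | predicate used).
    scores = {p: t / n for p, (t, n) in counts.items()}
    print(scores)   # e.g. {'be sure': 0.75, 'be possible': 0.33..., 'know': 1.0}

    # A QG system could then choose its own confidence threshold rather than
    # relying on a binary factive / non-factive decision.
    threshold = 0.7
    accepted = [p for p, s in scores.items() if s >= threshold]
    print(accepted)   # -> ['be sure', 'know']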



References

Brill, E. (1992) ‘A simple rule-based part-of-speech tagger’, in Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, pp. 152-155.

Brown, J.C., Frishkoff, G.A. and Eskenazi, M. (2005) ‘Automatic Question Generation for Vocabulary Assessment’, in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Vancouver, BC, Canada October 6-8 2005, Morristown, NJ, USA, Association for Computational Linguistics, pp. 819-826.

Cai, Z., Rus, V., Kim, H.J., Susarla, S.C., Karnam, P. and Graesser, A.C. (2006) ‘NLGML: A Markup Language for Question Generation’, in Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, Honolulu, Hawaii, USA October 13-17 2006, Chesapeake, VA, USA, Association for the Advancement of Computing in Education, pp. 2747-2752.

Doran, C., Egedi, D., Hockey, B.A., Srinivas, B. and Zaidel, M. (1994) ‘XTAG System - A Wide Coverage Grammar for English’, in Proceedings of the fifteenth International Conference on Computational Linguistics, Kyoto, Japan August 5-9 1994, Morristown, NJ, USA, Association for Computational Linguistics, pp. 922-928.

Fellbaum, C. (1998) WordNet: An Electronic Lexical Database, Cambridge, MA, USA, MIT Press.

Gates, D. (2008) ‘Generating Look-Back Strategy Questions from Expository Texts’, Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, USA September 25-26 2008.

Gates, D. (January 6 2009) ‘Generating Questions from Text’, e-mail to B. Wyse.

Hooper, J. (1974) ‘On assertive predicates’, in Kimball, J. (ed.) Syntax and Semantics, vol. 4, pp. 91-124, New York, Academic Press.

Huddleston, R. and Pullum, G.K. (2005) A Student’s Introduction to English Grammar, New York, Cambridge University Press.

Jurafsky, D. and Martin, J.H. (2009) Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, Second Edition, Upper Saddle River, NJ, USA, Prentice-Hall.

Kaisser, M. and Becker, T. (2004) ‘Question Answering by Searching Large Corpora with Linguistic Methods’, in Proceedings of the thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, MD, USA November 16-19 2004, Gaithersburg, MD, USA, National Institute of Standards and Technology.

Karp, D., Schabes, Y., Zaidel, M. and Egedi, D. (1992) ‘A Freely Available Wide Coverage Morphological Analyzer for English’, in Proceedings of the fourteenth International Conference on Computational Linguistics, Nantes, France August 23-28 1992, Morristown, NJ, USA, Association for Computational Linguistics.

Karttunen, L. and Wittenburg, K. (1983) ‘A two-level morphological analysis of English’, Texas Linguistic Forum, vol. 22, pp. 217-228.

Kiparsky, P. and Kiparsky, C. (1970) ‘Fact’, in Bierwisch, M. and Heidolph, K.E. (eds.) Progress in Linguistics, pp. 143-173, The Hague, Mouton.

Klein, D. and Manning, C. (2003) ‘Fast Exact Inference with a Factored Model for Natural Language Parsing’, Advances in Neural Information Processing Systems, vol. 15, pp. 3-10.

Levy, R. and Andrew, G. (2006) ‘Tregex and Tsurgeon: tools for querying and manipulating tree data structures’, in Proceedings of the fifth International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy May 24-26 2006, pp. 2231-2234.

Lindberg, C. (2004) The Oxford American Writer’s Thesaurus, New York, Oxford University Press.

Mitkov, R. and Ha, L.A. (2003) ‘Computer-Aided Generation of Multiple-Choice Tests’, in Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, Edmonton, Canada May 31 2003, Morristown, NJ, USA, Association for Computational Linguistics, pp. 17-22.

Nielsen, R.D., Buckingham, J., Knoll, G., Marsh, B. and Palen, L. (2008) ‘A Taxonomy of Questions for Question Generation’, Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, USA September 25-26 2008.

Piwek, P., Prendinger, H., Hernault, H. and Ishizuka, M. (2008) ‘Generating Questions: An Inclusive Characterization and a Dialogue-based Application’, Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, USA September 25-26 2008.

Porter, M.F. (1980) ‘An algorithm for suffix stripping’, Program, vol. 14(3), pp. 130-137.

Rus, V., Cai, Z. and Graesser, A.C. (2007a) ‘Experiments on Generating Questions About Facts’, Computational Linguistics and Intelligent Text Processing, vol. 4394, pp. 444-455.

Rus, V., Cai, Z. and Graesser, A.C. (2007b) ‘Evaluation in Natural Language Generation: The Question Generation Task’, in Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation, Arlington, VA, USA April 20-21.

Rus, V. and Graesser, A.C. (2009) The Question Generation Shared Task and Evaluation Challenge.

Silveira, N. (2008) ‘Towards a Framework for Question Generation’, Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, USA September 25-26 2008.

Soanes, C. and Stevenson, A. (2005a) ‘factive adjective’, in The Oxford Dictionary of English, Revised Edition, Oxford Reference Online, viewed 3rd January 2010, http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t140.e26573

Soanes, C. and Stevenson, A. (2005b) ‘non-factive adjective’, in The Oxford Dictionary of English, Revised Edition, Oxford Reference Online, viewed 3rd January 2010, http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t140.e52456

Tapanainen, P. and Järvinen, T. (1997) ‘A non-projective dependency parser’, in Proceedings of the fifth Conference on Applied Natural Language Processing, Washington DC, USA March 31 - April 3, San Francisco, CA, USA, Morgan Kaufmann, pp. 64-71.

Van Rijsbergen, C.J. (1976) Information Retrieval, London, Butterworths.

Wang, W., Hao, T. and Liu, W. (2008) ‘Automatic Question Generation for Learning Evaluation in Medicine’, in Advances in Web Based Learning - ICWL 2007, Edinburgh, UK, August 15-17, 2007, New York, USA, Springer-Verlag.

Wyse, B. and Piwek, P. (2009) ‘Generating Questions from OpenLearn Study Units’, in Rus, V. and Lester, J. (eds.) Proceedings of the 2nd Workshop on Question Generation, AIED 2009 Workshop Proceedings, pp. 66-73.


Index

anaphora, 7
Ceist, 15, 52
factivity, 38, 68, 80
intelligent tutoring systems (ITS), 2
lemmatisation, 36
mapping, 29, 60
mark-up, 17, 34, 64
OpenLearn, 45, 53, 71
parsing, 23
POS tagging, 21
precision, 48, 76
prototyping, 43
QG systems, 16
QG task, 4
recall, 48, 76
regular expression, 25, 56
Shared Task and Evaluation Campaign (STEC), 3
sub-tasks, 19
Term extraction, 24
Tregex, 27, 63
WordNet, 27


Appendix A – Extended Abstract

Factive / non-factive predicate recognition within Question Generation systems

Brendan Wyse

Extended Abstract of Open University MSc Dissertation Submitted 9 March 2010

Introduction

Question Generation (QG) is a relatively new field of study in linguistics and computing. A QG system takes plain text as an input and generates questions relating to that free text, such as in the figure below.

One key area where these systems are used is in Intelligent Tutoring Systems (ITS). These systems go beyond simply delivering educational content; they also interact with the student. One manner in which they interact is by asking questions.


I focused on a particular area of QG: the generation of questions from a single sentence where the answer to the question is contained in the sentence. Although declarative content clauses in single sentences always make statements, these statements may not actually be an established fact. In the figure below, the speaker in (2) is more certain about the number of attendees than the speaker in (1).

This is important because if a QG system is to ask a question about a sentence, where the answer is in that sentence, it must know what has been established as fact in the sentence. The key to solving this problem is recognising the words which decide the factivity of the declarative content clause: factive and non-factive predicates (bold type in the figure above). I developed a QG system, Ceist, for this research and used it to assess a software module capable of recognising factive and non-factive words within input sentences.

Method

To introduce factive / non-factive recognition to Ceist, a list was drawn up of words which may indicate factivity. This list was formed by expanding work from an existing researcher with the aid of a thesaurus.

The list also allowed some impact analysis to be carried out on a new data source, a parsed version of the online educational resource OpenLearn created specifically for this research.

A second analysis sought to measure any improvement in generated question quality produced by factive / non-factive recognition. This was done by comparing a baseline Ceist system with a system incorporating the new functionality.

Results

The data set consisting of factive and non-factive indicating words and phrases is a comprehensive list. It was converted to regular expressions and matched against a subset of the parsed OpenLearn study units to determine the frequency of such words and phrases. The following table shows the frequency counts.

Item                                        Count    Percentage
Total sentences                              4673       100%
Containing factive phrase                     362       7.75%
Containing non-factive phrase                 655      14.01%
Containing factive or non-factive phrase     1017      21.76%

The tabular data above is also represented in the pie chart below. As can be seen from the pie chart, the proportion of sentences in educational discourse containing either a factive or non-factive verb or phrase is high. Over 20% of sentences contained at least one term which could be classed as factive or non-factive.


A second test examined the case where the terms immediately preceded a complement clause. This would be the case for declarative content clauses such as that-clauses (e.g. I know that there were 10 people at the conference.). The results for this test are contained in the following table.

Item                                        Count    Percentage
Total sentences                              4673       100%
Containing factive phrase                     106       2.27%
Containing non-factive phrase                 206       4.41%
Containing factive or non-factive phrase      312       6.68%

As would be expected, given the restriction on accepted matches, the number of matching sentences was reduced to just over 6.5%, with 4.4% containing non-factive terms. This could be taken to indicate that questions generated from these 4.4% of sentences would be potentially flawed if they assume the proposition in the content clause contains a fact and thus an answer.
thus an answer.


The second analysis focused on measuring the benefit in generated question quality which a factive / non-factive recognition module would deliver to a QG system. Output from a baseline QG system was compared to output from a QG system with factivity enabled. The proportion of good questions (precision) and any degradation in system performance (recall) were measured. This was done for two different question generation rules and the data are presented in the table below.

Rule 1 - YES/NO

                   Answerable by sentence    Count    Precision    Recall
Ceist-Baseline     YES                          20        48%        N/A
                   NO                           22
Ceist-Factivity    YES                          17        71%        85%
                   NO                            7

Rule 2 - WHAT

                   Answerable by sentence    Count    Precision    Recall
Ceist-Baseline     YES                          28        54%        N/A
                   NO                           24
Ceist-Factivity    YES                          21       100%        75%
                   NO                            0

Precision is the proportion of answerable questions generated and has increased in both cases. Recall was used to measure degradation in system performance: it indicates the removal by Ceist-Factivity of questions that had been acceptable in Ceist-Baseline. There were some perfectly good questions which were not accepted by the system with factive / non-factive recognition.


Analysis

The frequency at which factive and non-factive verbs and phrases are used in educational discourse is quite high: around 20% of all sentences contain at least one such factivity indicating term. Even focusing only on sentences where a non-factive immediately precedes a new clause accounted for 4.4% of all sentences.

The increase in question quality was high for both test rules. In the test using the first rule, question quality went from less than half of the generated questions being acceptable to over 70%. In the test using the second rule, precision jumped from just over 50% to 100%: no unanswerable questions were generated.

In both cases there was a negative impact where the factive / non-factive recognition module actually removed some answerable questions. This was captured by the recall values of 85% and 75% respectively.

Discussion

I believe that the results of this research show that factive / non-factive recognition is an area of NLP which has a part to play in question generation. This is definitely the case when systems need to engage in dialogue. Systems will need to move beyond simple grammatical structure and semantics and start to extract and work with some of the intricate details in language, such as factivity. This is important if systems are to begin to ask the most appropriate questions.

Over 20% of sentences in OpenLearn contain some factive or non-factive indicating verb. Despite this high usage frequency, I was not able to find any NLP tool capable of telling me that ‘I know’ expresses a lot more certainty about a subject than ‘I think’. I see this as an important part of future work on question generation.

It was expected that the comprehensive list of factives and non-factives would indeed eliminate many of the unanswerable questions generated. It was also expected that there would be some false positives. This is because some work is needed to formally establish boundaries for factive and non-factive verbs. My own list was simply an original list extended by thesaurus.

If I were to extend this work further I would attempt to assign a factivity value to each verb, based on some formal assessment of its usage in a large discourse. A verb would be assigned a numeric value indicative of the certainty that the statement following it is absolutely a fact. A tool capable of distinguishing between ‘I know’ and ‘I think’ at a software level would then be a possibility.


Appendix B – List of factive and non-factive predicates

The following is a list of the words and phrases used to evaluate the impact of factive / non-factive recognition. An asterisk indicates a word taken directly from Hooper’s list. The term following ‘<’ indicates which of Hooper’s words was used to derive the new word or phrase in the list.

Factive

Assertive (semi-factive)

find out found out * come to know came to know < found out
discover discovered * bring to light brought to light < found out
know knew * discern discerned < found out
learn learnt * unearth unearthed < found out
learned cotton on cottoned on < found out
note noted * catch on caught on < found out
notice noticed * twig twigged < found out
observe observed * be aware was aware < know
perceive perceived * be conscious was conscious < know
realize realized * be informed was informed < know
realise realised < localisation sense sensed < know
recall recalled * hear heard < learn
remember remembered * understand understood < learn
reveal revealed * establish established < learn
see saw * suss out sussed out < learn
recollect recollected < remember spot spotted < noticed
ascertain ascertained < discover make out made out < perceive
figure out figured out < discover grasp grasped < perceive
work out worked out < discover take in took in < perceive
fathom fathomed < discover find found < perceive
recognise recognised < discover register registered < realize
recognize recognised < localisation get the message got the message < realize
become aware became aware < found out tell told < reveal
detect detected < found out let slip let slip < reveal
expose exposed < found out let drop let drop < reveal
disclose disclosed < found out give away gave away < reveal
get to know got to know < found out determine determined < see


Non-factive

Strong assertive

acknowledge acknowledged * mention mentioned *
admit admitted * point out pointed out *
affirm affirmed * predict predicted *
allege alleged * prophesy prophesied *
answer answered * postulate postulated *
argue argued * remark remarked *
assert asserted * reply replied *
assure assured * report report *
certify certified * say said *
charge charged * state stated *
claim claimed * suggest suggested *
contend contended * swear swore *
declare declared * testify testified *
divulge divulged * theorize theorized
emphasize emphasized * theorise theorised < localisation
emphasise emphasised < localisation verify verified *
explain explained * vow vowed *
grant granted * write wrote *
guarantee guaranteed * accept accepted < acknowledge
hint hinted * concede conceded < acknowledge
hypothesize hypothesized ? Not in dictionary confess confessed < acknowledge
hypothesise hypothesised < localisation proclaim proclaimed < affirm
imply implied * pledge pledged < affirm
indicate indicated * give an undertaking gave an undertaking < affirm
insist insisted * respond responded < answer
intimate intimated * retort retorted < answer
maintain maintained * announce announced < assert

(Continued)


Non-factive

Strong assertive

retort retorted < answer forecast forecasted < predict
announce announced < assert foresee foresaw < predict
pronounce pronounced < assert anticipate anticipated < predict
avow avowed < assert tell in advance told in advance < predict
ensure ensured < assure envision envisioned < predict
confirm confirmed < assure foretell foretold < prophesy
promise promised < assure forewarn of forewarned of < prophesy
attest attested < certify prognosticate prognosticated < prophesy
provide evidence provided evidence < certify propose proposed < postulate
give proof gave proof < certify assume assumed < postulate
prove proved < certify presuppose presupposed < postulate
demonstrate demonstrated < certify take for granted took for granted < postulate
profess professed < claim utter uttered < say
communicate communicated < divulge recommend recommended < suggest
publish published < divulge advise advised < suggest
stress stressed < emphasized speculate speculated < theorise
highlight highlighted < emphasized justify justified < verify
press home pressed home < emphasized validate validated < verify
make clear made clear < explain authenticate authenticated < verify
describe described < explain record recorded < write
spell out spelt out < explain log logged < write
allow allowed < grant list listed < write
appreciate appreciated < grant scribble scribbled < write
insinuate insinuated < hint scrawl scrawled < write
signal signalled < hint agree agreed *
mean meant < hint be afraid was afraid *
say indirectly said indirectly < imply be certain was certain *
convey the impression conveyed the impression < imply be sure was sure *
make known made known < indicate be clear was clear *
reiterate reiterated < insist be obvious was obvious *
make public made public < intimate be evident was evident *

(Continued)


Non-factive

Strong assertive

identify identified < point out be indisputable was indisputable < be clear
decide decided * be beyond doubt was beyond doubt < be clear
deduce deduced * be beyond question was beyond question < be clear
estimate estimated * be blatant was blatant < be clear
hope hoped * be apparent was apparent < be obvious
presume presumed * be noticeable was noticeable < be evident
surmise surmised * reckon reckoned < calculate
suspect suspected * elect elected < decide
concur concurred < agree choose chose < decide
be fearful was fearful < be afraid opt opted < decide
be frightened was frightened < be afraid reason reasoned < deduce
be scared was scared < be afraid infer inferred < deduce
be alarmed was alarmed < be afraid glean gleaned < deduce
be petrified was petrified < be afraid judge judged < estimate
be terrified was terrified < be afraid gauge gauged < estimate
be confident was confident < be certain approximate approximated < estimate
be positive was positive < be certain desire desired < hope
be convinced was convinced < be certain wish wished < hope
be satisfied was satisfied < be certain fancy fancied < surmise
be in no doubt was in no doubt < be certain calculate calculated *
be crystal clear was crystal clear < be clear specify specified < point out
be unmistakable was unmistakable < be clear


Non-factive

Weak assertive Non-assertive (non-negative) Non-assertive (negative)

think thought * be likely was likely be unlikely was unlikely
believe believed * be possible was possible be impossible was impossible
suppose supposed * be probable was probable be improbable was improbable
expect expected * be conceivable was conceivable be inconceivable was inconceivable
imagine imagined * doubt doubted
guess guessed * deny denied
seem seemed *
appear appeared *
figure figured *
feel felt < think
sense sensed < suppose
trust trusted < suppose
forecast forecasted < expect
visualize visualized < imagine
visualise visualised < localisation
envisage envisaged < imagine
picture pictured < imagine
conceive conceived < imagine
conceptualize conceptualized < imagine
conceptualise conceptualised < localisation
emerge emerged < appear
surface surfaced < appear
become apparent became apparent < appear
become evident became evident < appear
gather gathered < figured
