A generic framework for Arabic to English machine ... - Acsu Buffalo
A generic framework for Arabic to English machine ... - Acsu Buffalo
A generic framework for Arabic to English machine ... - Acsu Buffalo
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
A <strong>generic</strong> <strong>framework</strong> <strong>for</strong> <strong>Arabic</strong><br />
<strong>to</strong> <strong>English</strong> <strong>machine</strong> translation<br />
of simplex sentences using the<br />
Role and Reference Grammar<br />
linguistic model<br />
By<br />
Yasser Salem B.Sc<br />
Supervisors:<br />
Dr. Brian Nolan<br />
Mr. Arnold Hensman<br />
Submitted in Fulfillment of the Requirements <strong>for</strong> the of M.Sc. in<br />
Computing in the School of In<strong>for</strong>matics and Engineering at<br />
the Institute of Technology Blanchards<strong>to</strong>wn<br />
Dublin, Ireland<br />
April, 2009
Abstract<br />
The aim of this research is <strong>to</strong> develop a rule-based lexical <strong>framework</strong> <strong>for</strong> <strong>Arabic</strong> language<br />
processing using the Role and Reference Grammar linguistic model. A system, called<br />
UniArab is introduced <strong>to</strong> support the <strong>framework</strong>. The UniArab system <strong>for</strong> Modern Stan-<br />
dard <strong>Arabic</strong> (MSA), which takes MSA <strong>Arabic</strong> as input in the native orthography, parses<br />
the sentence(s) in<strong>to</strong> a logical meta-representation, and using this, generates a grammati-<br />
cally correct <strong>English</strong> output with full agreement and morphological resolution. UniArab<br />
utilizes an XML-based implementation of elements of the Role and Reference Grammar<br />
theory, and its representations <strong>for</strong> the universal logical structure of <strong>Arabic</strong> sentences.<br />
Role and Reference Grammar (RRG) is a functional theory of grammar that posits a<br />
direct mapping between the semantic representation of a sentence and its syntactic rep-<br />
resentation. The theory allows a sentence in a specific language <strong>to</strong> be described in terms<br />
of its logical structure and grammatical procedures. RRG creates a linking relationship<br />
between syntax and semantics, and can account <strong>for</strong> how semantic representations are<br />
mapped in<strong>to</strong> syntactic representations. We claim that RRG is highly suitable <strong>for</strong> <strong>machine</strong><br />
translation of <strong>Arabic</strong> via an Interlingua bridge implementation model. RRG is a mono<br />
strata–theory, positing only one level of syntactic representation, the actual <strong>for</strong>m of the<br />
i
sentence and its linking algorithm can work in both directions from syntactic representa-<br />
tion <strong>to</strong> semantic representation, or vice versa. In RRG, semantic decomposition of pred-<br />
icates and their semantic argument structures are represented as logical structures. The<br />
lexicon in RRG takes the position that lexical entries <strong>for</strong> verbs should contain unique in-<br />
<strong>for</strong>mation only, with as much in<strong>for</strong>mation as possible derived from general lexical rules.<br />
For this reason and due <strong>to</strong> the functional nature of our linguistic model, we will create<br />
our own lexicon.<br />
We use the RRG theory <strong>to</strong> motivate the architecture of the lexicon and the RRG bidirec-<br />
tional linking system <strong>to</strong> design and implement the parse and generate functions between<br />
the syntax-semantic interfaces. Through an input process with seven phases, including<br />
morphological and syntactic unpacking, UniArab extracts the universal logical structure<br />
of an <strong>Arabic</strong> sentence. Using the XML based metadata representing the RRG logical<br />
structure (XRRG), UniArab accurately generates an equivalent grammatical sentence in<br />
the target language through four output phases. We outline the conceptual structure of<br />
the UniArab System which utilizes the <strong>framework</strong> and translates the <strong>Arabic</strong> language<br />
in<strong>to</strong> another natural language. We follow the Interlingua design approach <strong>for</strong> <strong>machine</strong><br />
translation. We analyse the <strong>Arabic</strong> sentences <strong>to</strong> create a universal, abstract logical repre-<br />
sentation, and from this representation we generate <strong>English</strong> translations.<br />
We also explore how the characteristics of the <strong>Arabic</strong> language will affect the develop-<br />
ment of a Machine Translation (MT) <strong>to</strong>ol. Several characteristics of <strong>Arabic</strong> pertinent<br />
<strong>to</strong> MT will be explored in detail with reference <strong>to</strong> some potential difficulties that they<br />
present. We will conclude with a proposed model incorporating the Role and Reference<br />
Grammar techniques <strong>to</strong> achieve this end. The UniArab system has been tested by gener-<br />
ating equivalent grammatical sentences, in <strong>English</strong>, via the universal logical structure of<br />
<strong>Arabic</strong> sentences, based on MSA <strong>Arabic</strong> input with very significant and accurate results.<br />
ii
It provides more accurate translations when compared with au<strong>to</strong>mated transla<strong>to</strong>rs from<br />
Google and Microsoft though these systems have a much wider coverage than UniArab<br />
at present. The free word order nature of <strong>Arabic</strong> and the challenges of incorporating tran-<br />
sitivity in<strong>to</strong> the logical structure will be outlined in detail. This research demonstrates the<br />
capabilities of the Role and Reference Grammar as a base <strong>for</strong> multilingual translation<br />
systems.<br />
iii
Declaration<br />
I herby certify that this material, which I now submit <strong>for</strong> assessment on the programme<br />
of study leading <strong>to</strong> the award of M.Sc. in Computing in the Institute of Technology Blan-<br />
chards<strong>to</strong>wn, is entirely my own work except where otherwise stated, and has not been<br />
submitted <strong>for</strong> assessment <strong>for</strong> an academic purpose at this or any other academic institu-<br />
tion other than in partial fulfillment of the requirements of that stated above.<br />
. . ..................................<br />
Yasser Salem<br />
Dublin, Ireland<br />
April 2009<br />
iv
Acknowledgements<br />
“In the name of GOD (Allah), Most Gracious, Most Merciful. Praise be <strong>to</strong> GOD, the<br />
Cherisher and Sustainer of the worlds; Most Gracious, Most Merciful; Master of the Day<br />
of Judgement. Thee do we worship, and Thine aid we seek. Show us the straight way,”<br />
[The Quran: Al-Fatiha (The Opening)]. “and my success (in my task) can only come<br />
from GOD. In Him I trust and un<strong>to</strong> Him I turn (repentant).” [The Quran: Hud,88]. I am<br />
not able <strong>to</strong> fulfil the due thanks <strong>to</strong> GOD, but I seek his <strong>for</strong>giveness and that GOD assists<br />
me in thanking His Majesty. “Say. Surely my prayer and my sacrifice and my life and<br />
my death are all <strong>for</strong> God, the Lord of the worlds.” (The Quran: AL-Anaam, 162)<br />
I am deeply indebted <strong>to</strong> my mother <strong>for</strong> accepting that I pursue my M.Sc. away from<br />
her. And May God bless my father’s soul and join all of us in the paradise. I am grateful<br />
<strong>to</strong> my wife, Qamar, and my children, <strong>for</strong> their encouragement and understanding. Thanks<br />
<strong>to</strong> my brothers and sisters.<br />
I am deeply indebted <strong>to</strong> my supervisor, Dr. Brian Nolan, <strong>for</strong> many insightful conver-<br />
sations during the development of the ideas in this thesis, and <strong>for</strong> helpful comments on<br />
the text. Dr. Nolan has supported me not only by providing research guidelines and<br />
v
advices, but also academically and emotionally through the rough road <strong>to</strong> finishing this<br />
thesis. And during the most difficult times when writing this thesis, he gave me the<br />
moral support. I also extend my sincere gratitude <strong>to</strong> Mr. Arnold Hensman who has had<br />
a significantly positive influence on my development through his role as my research co-<br />
supervisor. His patience and willingness <strong>to</strong> discuss the minutiae of the different obstacles<br />
I encountered while working on this project were invaluable.<br />
I also extend my sincere gratitude <strong>to</strong> all the staff of the Department of In<strong>for</strong>matics, In-<br />
stitute of Technology Blanchards<strong>to</strong>wn, Dublin, Ireland, where I did my undergraduate<br />
and masters studies. I specially thank Mark Cummins, Gareth Curran, Dr. Kevin Far-<br />
rell, Geraldine Gray, Dr. Anthony Keane, Laura Keyes, Margaret Kinsella, John Massey,<br />
Hugh McCabe, Dr. Simon McLoughlin, Orla McMahon, Daniel McSweeney, Frances<br />
Murphy, Tom Nolan, Michael O’Donnell, Luke Raeside, Stephen Sheridan and Dr. Matt<br />
Smith.<br />
Thanks <strong>to</strong> all my friends who supported me during my studies, especially, and <strong>to</strong> name<br />
just a few Mansoor Bin Khalifa Al-Thani, Mohammed A. Attia, Khalid Buzakhar, Suhaib<br />
Fahmy, Mohamed Faouzi Gahbiche, James Kennedy, Essam Mansour, Eftaima Najjair<br />
and Aliye Peksen. I would like <strong>to</strong> express my gratitude <strong>to</strong> all those who gave me the<br />
possibility <strong>to</strong> complete this thesis.<br />
. . ..................................<br />
Yasser Salem<br />
Dublin, Ireland<br />
April 2009<br />
vi
Contents<br />
Abstract iii<br />
Declaration iv<br />
Acknowledgements vi<br />
1 Introduction 1<br />
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />
1.3 Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />
1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6<br />
2 The <strong>Arabic</strong> Language 8<br />
2.1 Characteristics of the <strong>Arabic</strong> language . . . . . . . . . . . . . . . . . . . 9<br />
2.2 Characteristics of <strong>Arabic</strong> words . . . . . . . . . . . . . . . . . . . . . . 11<br />
2.2.1 Free word order . . . . . . . . . . . . . . . . . . . . . . . . . . . 12<br />
2.3 Part of speech inven<strong>to</strong>ry of the <strong>Arabic</strong> language . . . . . . . . . . . . . . 14<br />
2.3.1 Noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />
2.3.1.1 Definite nouns . . . . . . . . . . . . . . . . . . . . . . 15<br />
vii
CONTENTS<br />
2.3.1.2 Indefinite nouns . . . . . . . . . . . . . . . . . . . . . 16<br />
2.3.2 Adjectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
2.3.3 Adverbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
2.3.4 Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17<br />
2.3.4.1 Verb tenses . . . . . . . . . . . . . . . . . . . . . . . . 17<br />
2.3.4.2 Aspect . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
2.3.4.3 Mood . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
2.3.4.4 Voice . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
2.3.4.5 Transitivity . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
2.3.5 Demonstratives . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
2.3.6 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
2.4 Sentence types in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
2.4.1 Equational sentences . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
2.4.1.1 Verb and noun . . . . . . . . . . . . . . . . . . . . . . 23<br />
2.4.1.2 Verb and two nouns . . . . . . . . . . . . . . . . . . . 23<br />
2.4.1.3 Verb and three nouns . . . . . . . . . . . . . . . . . . 24<br />
2.4.1.4 Verb and four nouns . . . . . . . . . . . . . . . . . . . 24<br />
2.4.2 The Verbal Sentence . . . . . . . . . . . . . . . . . . . . . . . . 24<br />
2.4.3 Clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
3 Role and Reference Grammar (RRG) 27<br />
3.1 Role and Reference Grammar linguistic model . . . . . . . . . . . . . . . 28<br />
3.2 Formal representation of layered structure of the clause . . . . . . . . . . 31<br />
3.2.1 Representing the universal aspects of the layered structure of the<br />
clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31<br />
3.2.2 Layered structure of the clause (LSC) . . . . . . . . . . . . . . . 32<br />
viii
CONTENTS<br />
3.2.3 Non-universal aspects of the layered structure of the clause . . . . 32<br />
3.3 Noun phrase structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
3.3.1 NP headed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
3.4 Lexical representations <strong>for</strong> verbs . . . . . . . . . . . . . . . . . . . . . . 38<br />
3.4.1 Agents, effec<strong>to</strong>rs, instruments and <strong>for</strong>ces . . . . . . . . . . . . . 39<br />
3.4.2 change of state verb . . . . . . . . . . . . . . . . . . . . . . . . 40<br />
3.5 Why we use RRG as the linguistic model . . . . . . . . . . . . . . . . . 41<br />
3.5.1 RRG representing the universal aspects of the layered structure<br />
of the clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42<br />
3.5.2 The lexical representation of verbs and their arguments . . . . . . 43<br />
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44<br />
4 Machine translation strategies 46<br />
4.1 Advantages of <strong>machine</strong> translation . . . . . . . . . . . . . . . . . . . . . 47<br />
4.2 Computational techniques in MT . . . . . . . . . . . . . . . . . . . . . . 47<br />
4.2.1 System design . . . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
4.2.2 Interactive systems . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
4.2.3 Lexical databases . . . . . . . . . . . . . . . . . . . . . . . . . . 49<br />
4.2.4 Tokens and <strong>to</strong>kenization . . . . . . . . . . . . . . . . . . . . . . 49<br />
4.2.5 Syntactic analysis (Parsing) . . . . . . . . . . . . . . . . . . . . 50<br />
4.3 Basic <strong>machine</strong> translation strategies . . . . . . . . . . . . . . . . . . . . 50<br />
4.3.1 Multilingual versus bilingual systems . . . . . . . . . . . . . . . 50<br />
4.3.2 Direct translation . . . . . . . . . . . . . . . . . . . . . . . . . . 51<br />
4.3.3 Interlingua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52<br />
4.3.4 Transfer systems . . . . . . . . . . . . . . . . . . . . . . . . . . 53<br />
4.3.5 Statistical <strong>machine</strong> translation . . . . . . . . . . . . . . . . . . . 56<br />
4.4 Linguistic aspects of MT . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />
ix
CONTENTS<br />
4.4.1 Non-Roman alphabet scripts . . . . . . . . . . . . . . . . . . . . 57<br />
4.4.2 Lexical ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />
4.4.2.1 Category ambiguity . . . . . . . . . . . . . . . . . . . 57<br />
4.4.2.2 Homograph . . . . . . . . . . . . . . . . . . . . . . . 58<br />
4.4.3 Syntactic ambiguity . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
4.4.4 Structural differences . . . . . . . . . . . . . . . . . . . . . . . . 60<br />
4.5 Challenges of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> MT . . . . . . . . . . . . . . . . . . . . 60<br />
4.6 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62<br />
4.6.1 Generation in direct systems . . . . . . . . . . . . . . . . . . . . 62<br />
4.6.2 Generation in transfer-based systems . . . . . . . . . . . . . . . 63<br />
4.6.3 Generation in interlingua systems . . . . . . . . . . . . . . . . . 64<br />
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />
5 Design of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation system based on RRG 67<br />
5.1 UniArab: Interlingua-based system . . . . . . . . . . . . . . . . . . . . . 68<br />
5.2 Designing an XML lexicon architecture <strong>for</strong> <strong>Arabic</strong> MT based on RRG . . 69<br />
5.2.1 An XML-based lexicon . . . . . . . . . . . . . . . . . . . . . . . 70<br />
5.2.2 Lexical representation in UniArab . . . . . . . . . . . . . . . . . 70<br />
5.2.3 Lexical properties . . . . . . . . . . . . . . . . . . . . . . . . . . 71<br />
5.3 Design of test strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 74<br />
5.4 Design of evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . 77<br />
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
6 UniArab: a proof-of-concept <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation system 79<br />
6.1 Conceptual structure of the UniArab system . . . . . . . . . . . . . . . . 80<br />
6.1.1 Technical architecture of the UniArab system . . . . . . . . . . . 81<br />
6.1.2 UniArab: Lexical representation in interlingua system . . . . . . 84<br />
x
CONTENTS<br />
6.2 UniArab: Lexical representation in interlingua system based on RRG . . 86<br />
6.2.1 Verb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86<br />
6.2.2 Common noun . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
6.2.3 Proper noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
6.2.4 Adjective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89<br />
6.2.5 Demonstrative . . . . . . . . . . . . . . . . . . . . . . . . . . . 90<br />
6.2.6 Adverb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91<br />
6.2.7 Other <strong>Arabic</strong> words . . . . . . . . . . . . . . . . . . . . . . . . . 92<br />
6.3 UniArab: Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92<br />
6.4 UniArab: Screen design . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />
6.4.1 Lexicon interface . . . . . . . . . . . . . . . . . . . . . . . . . . 97<br />
6.5 Technical challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />
7 Testing and evaluation 100<br />
7.1 Evaluation of MT systems . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
7.2 Sentence tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101<br />
7.2.1 Verb-Subject with one argument in different tenses . . . . . . . . 102<br />
7.2.2 Gender-ambiguous proper nouns . . . . . . . . . . . . . . . . . . 106<br />
7.2.3 Verb ‘<strong>to</strong> be’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />
7.2.4 Verb ‘<strong>to</strong> have’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
7.2.5 Free word order . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
7.2.6 Pro-drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
7.2.7 Transitivity of verbs . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
7.2.7.1 Intransitive . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
7.2.7.2 Transitive . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
7.2.7.3 Ditransitive . . . . . . . . . . . . . . . . . . . . . . . 119<br />
xi
CONTENTS<br />
7.2.8 Limitation of UniArab . . . . . . . . . . . . . . . . . . . . . . . 122<br />
7.3 System evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />
8 Conclusion 129<br />
8.1 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131<br />
8.2 Summary of thesis contributions . . . . . . . . . . . . . . . . . . . . . . 132<br />
8.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133<br />
References 134<br />
Appendix 140<br />
A The author’s publications related <strong>to</strong> this research 140<br />
B Buckwalter <strong>Arabic</strong> transliteration 142<br />
C List of translatable sentences 145<br />
D Verbs in lexicon 161<br />
E The UniArab code 170<br />
xii
List of Figures<br />
2.1 A classification <strong>for</strong> the <strong>Arabic</strong> language syntax . . . . . . . . . . . . . . 14<br />
2.2 A classification of clauses in the <strong>Arabic</strong> language . . . . . . . . . . . . . 22<br />
3.1 Layout of Role and Reference Grammar . . . . . . . . . . . . . . . . . . 29<br />
3.2 <strong>Arabic</strong> sentence types; verb subject object or subject verb object (<strong>for</strong><br />
gloss please see example 3.1) . . . . . . . . . . . . . . . . . . . . . . . . 30<br />
3.3 Formal representation of the layered structure of the clause . . . . . . . . 31<br />
3.4 <strong>English</strong> Sentence with precore slot and left-detached position . . . . . . . 33<br />
3.5 Opera<strong>to</strong>r projection in LSC . . . . . . . . . . . . . . . . . . . . . . . . . 34<br />
3.6 LSC with constituent and opera<strong>to</strong>r projections . . . . . . . . . . . . . . . 35<br />
3.7 <strong>Arabic</strong> LSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
3.8 The RRG representing the universal aspects of the layered structure of<br />
the clause (Van Valin and LaPolla 1997) . . . . . . . . . . . . . . . . . . 42<br />
4.1 Direct MT system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51<br />
4.2 Interlingua1 model with eight languages pairs . . . . . . . . . . . . . . . 52<br />
4.3 Multilinguality transfer model with eight languages pairs . . . . . . . . . 54<br />
xiii
LIST OF FIGURES<br />
4.4 Difference between direct, transfer, and interlingua MT models, (Trujillo<br />
1999) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55<br />
4.5 NP rule (NP –> det n pp) . . . . . . . . . . . . . . . . . . . . . . . . . . 59<br />
4.6 PP is attached at a higher level . . . . . . . . . . . . . . . . . . . . . . . 59<br />
4.7 Direct MT system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />
4.8 Semantic generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64<br />
4.9 Structure <strong>to</strong> be generated . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />
4.10 Interlingua model of <strong>Arabic</strong> MT . . . . . . . . . . . . . . . . . . . . . . 66<br />
5.1 The conceptual architecture of the UniArab system . . . . . . . . . . . . 68<br />
5.2 In<strong>for</strong>mation recorded in the UniArab lexicon . . . . . . . . . . . . . . . 72<br />
6.1 Layout of Role and Reference Grammar . . . . . . . . . . . . . . . . . . 79<br />
6.2 The conceptual architecture of the UniArab system . . . . . . . . . . . . 80<br />
6.3 Generation the right tense <strong>for</strong> the verbs . . . . . . . . . . . . . . . . . . . 84<br />
6.4 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> verb . . . . . . . . . . . . . . . . . . 87<br />
6.5 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> noun . . . . . . . . . . . . . . . . . 88<br />
6.6 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> proper noun . . . . . . . . . . . . . . 89<br />
6.7 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adjective . . . . . . . . . . . . . . . 90<br />
6.8 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> demonstrative. . . . . . . . . . . . . 91<br />
6.9 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adverb. . . . . . . . . . . . . . . . . 91<br />
6.10 In<strong>for</strong>mation recorded on the other <strong>Arabic</strong> words . . . . . . . . . . . . . . 92<br />
6.11 UniArab’s GUI 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95<br />
6.12 UniArab’s GUI 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96<br />
6.13 UniArab’s GUI 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97<br />
6.14 UniArab’s lexicon interface . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
7.1 Verb-Subject with one argument . . . . . . . . . . . . . . . . . . . . . . 102<br />
xiv
LIST OF FIGURES<br />
7.2 Verb-Subject with one argument . . . . . . . . . . . . . . . . . . . . . . 103<br />
7.3 Verb-subject agreement 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />
7.4 Verb-subject agreement 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 105<br />
7.5 Gender-ambiguous proper nouns 1 . . . . . . . . . . . . . . . . . . . . . 106<br />
7.6 Gender-ambiguous proper nouns 2 . . . . . . . . . . . . . . . . . . . . . 107<br />
7.7 Verb ‘<strong>to</strong> be’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />
7.8 Verb ‘<strong>to</strong> be’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109<br />
7.9 Verb ‘<strong>to</strong> have’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
7.10 Verb ‘<strong>to</strong> have’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
7.11 Free word order (Verb Noun Noun scenario one) . . . . . . . . . . . . . 112<br />
7.12 Free word order (Verb Noun Noun scenario two) . . . . . . . . . . . . . 113<br />
7.13 Free word order (Verb Noun Noun scenario three) . . . . . . . . . . . . . 114<br />
7.14 Pro-drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
7.15 Intransitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
7.16 Intransitive with an adverb . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
7.17 Transitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
7.18 Ditransitive 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
7.19 Ditransitive with 2 NP . . . . . . . . . . . . . . . . . . . . . . . . . . . 120<br />
7.20 Ditransitive with preposition . . . . . . . . . . . . . . . . . . . . . . . . 121<br />
7.21 Limitation of UniArab 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 122<br />
7.22 Limitation of UniArab 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 123<br />
7.23 Limitation of UniArab 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />
xv
List of Tables<br />
2.1 Dual: merely add two letters <strong>to</strong> achieve dual <strong>for</strong>m in <strong>Arabic</strong> . . . . . . . 10<br />
2.2 Grammatical gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11<br />
2.3 Feminine is different than masculine . . . . . . . . . . . . . . . . . . . . 12<br />
2.4 Feminine and masculine in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . 12<br />
2.5 Definiteness in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . 12<br />
2.6 Definiteness example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . 12<br />
2.7 Free word order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13<br />
2.8 Noun example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
2.9 Definite example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
2.10 Indefinite example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
2.11 <strong>Arabic</strong> adjective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
2.12 <strong>Arabic</strong> adverb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
2.13 Imperfect tense <br />
ālf֒l ālmd. ār֒ . . . . . . . . . . . . . . . . 17<br />
2.14 Perfect tense ālf֒l ālmād. y . . . . . . . . . . . . . . . . . . 17<br />
2.15 Imperfect inflectional <strong>for</strong>ms of word ‘write’ . . . . . . . . . . . . . . . . 18<br />
2.16 Perfect inflectional <strong>for</strong>ms of word ‘wrote’ . . . . . . . . . . . . . . . . . 18<br />
2.17 Future tense in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . 18<br />
xvi
LIST OF TABLES<br />
2.18 Indicative mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
2.19 Subjunctive mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
2.20 Jussive mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
2.21 Imperative mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
2.22 Particle ‘Lan’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
2.23 Nominal sentence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />
2.24 Kan and its sisters kān w ֓ahwthā . . . . . . . . . . . . . . 23<br />
˘<br />
2.25 zanna and its sisters z . n w ֓ahwthā . . . . . . . . . . . . . . 24<br />
˘<br />
2.26 In<strong>for</strong>med and showed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />
2.27 verb(V), subject(S) and object(O) . . . . . . . . . . . . . . . . . . . . . 25<br />
2.28 subject(S), verb(V) and object(O) . . . . . . . . . . . . . . . . . . . . . 25<br />
2.29 verb(V), object(O) and subject(S) . . . . . . . . . . . . . . . . . . . . . 25<br />
2.30 Two simple clauses by subordinating conjunction . . . . . . . . . . . . . 25<br />
3.1 Relationships between the semantic and syntactic units . . . . . . . . . . 32<br />
3.2 Lexical representations <strong>for</strong> the basic Aktionsart classes . . . . . . . . . . 38<br />
4.1 Modules required in an all-pairs multilingual transfer system . . . . . . . 54<br />
4.2 Derived words from a three-letter-root in <strong>Arabic</strong> . . . . . . . . . . . . . 61<br />
5.1 Verb 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73<br />
5.2 Verb 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74<br />
5.3 Test strategy: verb-subject agreement . . . . . . . . . . . . . . . . . . . 75<br />
5.4 Test strategy: demonstrative adjective-noun agreement . . . . . . . . . . 75<br />
5.5 Test strategy: gender-ambiguous proper nouns . . . . . . . . . . . . . . 75<br />
5.6 Test strategy: verb ‘<strong>to</strong> be’ . . . . . . . . . . . . . . . . . . . . . . . . . 76<br />
5.7 Test strategy: verb ‘<strong>to</strong> have’ . . . . . . . . . . . . . . . . . . . . . . . . 76<br />
5.8 Test strategy: free word order (Verb Noun Noun) . . . . . . . . . . . . . 77<br />
xvii
LIST OF TABLES<br />
5.9 Test strategy: pro–drop . . . . . . . . . . . . . . . . . . . . . . . . . . . 77<br />
6.1 Verb 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />
6.2 Verb 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />
6.3 Noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
6.4 Proper Noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89<br />
6.5 Adjective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89<br />
6.6 Demonstrative representative . . . . . . . . . . . . . . . . . . . . . . . . 90<br />
6.7 Adverb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91<br />
6.8 Other <strong>Arabic</strong> words (where ‘NON’ means not applicable) . . . . . . . . 92<br />
7.1 Test : Verb-Subject; one argument . . . . . . . . . . . . . . . . . . . . . 102<br />
7.2 Test : Verb-subject; agreement 1 . . . . . . . . . . . . . . . . . . . . . . 103<br />
7.3 Test : verb-subject; agreement 2 . . . . . . . . . . . . . . . . . . . . . . 104<br />
7.4 Test : Gender-ambiguous proper nouns 1 . . . . . . . . . . . . . . . . . 106<br />
7.5 Test : gender-ambiguous proper nouns 2 . . . . . . . . . . . . . . . . . . 107<br />
7.6 Test : Verb ‘<strong>to</strong> be’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />
7.7 Test : Verb ‘<strong>to</strong> be’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109<br />
7.8 Test : Verb ‘<strong>to</strong> have’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
7.9 Test : Verb ‘<strong>to</strong> have’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
7.10 Test : Free word order (Verb Noun Noun scenario one) . . . . . . . . . . 112<br />
7.11 Test : Free word order (Verb Noun Noun scenario two) . . . . . . . . . . 113<br />
7.12 Test : Free word order (Verb Noun Noun scenario three) . . . . . . . . . 114<br />
7.13 Test: Pro-drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
7.14 Test : Intransitive 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
7.15 Test : Intransitive 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
7.16 Test : Transitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
xviii
LIST OF TABLES<br />
7.17 Test : Ditransitive 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
7.18 Test : Ditransitive with 2 NP . . . . . . . . . . . . . . . . . . . . . . . . 120<br />
7.19 Test : Ditransitive with preposition . . . . . . . . . . . . . . . . . . . . 121<br />
7.20 Test : Limitation of UniArab . . . . . . . . . . . . . . . . . . . . . . . . 122<br />
7.21 Test : Limitation of UniArab 3 using non existing nonsense word . . . . 124<br />
xix
1<br />
Introduction<br />
The following paragraph was translated from <strong>Arabic</strong> in<strong>to</strong> <strong>English</strong> using the Google trans-<br />
la<strong>to</strong>r (Google 2009).<br />
That rely entirely on <strong>machine</strong> translation ignores the fact that communication in<br />
the language of rights is an integral part of the context, and that the human is<br />
capable of understanding the context of the original text in a manner sufficient.<br />
There<strong>for</strong>e can not be trusted after the <strong>machine</strong> translation programs, they could<br />
not analyze the context of the original version is similar <strong>to</strong> the human<br />
understanding of when listening <strong>to</strong> the same text.<br />
It is clear that the paragraph cannot be easily unders<strong>to</strong>od, and a large amount of the<br />
in<strong>for</strong>mation has been confused or mixed up. This shows the problems facing <strong>machine</strong><br />
translation, and motivates our work. We believe that statistical <strong>machine</strong> translation has<br />
not achieved what people expected in terms of quality. Hence we wish <strong>to</strong> look at another<br />
method, building from the ground up <strong>to</strong> achieve higher quality translations.<br />
1
Machine translation has yet <strong>to</strong> reach its potential within the translation market as a whole.<br />
Figures suggest that MT accounts <strong>for</strong> a small us $ 100 million portion of a us $ 10 billion<br />
translation market (Intelligence 2004). Many have suggested that the reason is the poor<br />
quality of results, hence it only makes sense when very large amounts of data need <strong>to</strong><br />
be processed (Oren 2004). For the MT market <strong>to</strong> expand, it is necessary <strong>to</strong> improve the<br />
quality of results, which will then make it a viable alternative within the much bigger<br />
translation market.<br />
<strong>Arabic</strong> is acquiring attention in the natural language processing (NLP) community be-<br />
cause of its political importance and the linguistic differences between it and European<br />
languages. These linguistic characteristics, especially complex morphology, present in-<br />
teresting challenges <strong>for</strong> NLP researchers. According <strong>to</strong> Holes (2004) <strong>Arabic</strong> is the sole<br />
or joint official language in twenty independent Middle Eastern and African states: Alge-<br />
ria, Bahrain, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman,<br />
Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, the United Arab Emirates<br />
and Yemen. Since the end of the nineteenth century, there have been large communities<br />
of <strong>Arabic</strong> speakers outside the Middle East, particularly in the United States and Eu-<br />
rope. <strong>Arabic</strong> is also the language of Islam’s holy book the Qur’an, and as such is the<br />
religious language of all Muslims. <strong>Arabic</strong> has been an official language of the United<br />
Nations alongside <strong>English</strong>, French, Spanish, and Chinese since 1 January 1971’(Holes<br />
2004). There are a number of different <strong>Arabic</strong> words in languages such as Persian, Turk-<br />
ish, Urdu or Malawian. The words derived from <strong>Arabic</strong> that exist in Spanish, Portuguese,<br />
German, Italian, <strong>English</strong> or French are also numerous (Bateson 2003).<br />
The aim of this research is <strong>to</strong> create an Interlingua Machine Translation (MT) system<br />
that will accept <strong>Arabic</strong> source sentences and generate <strong>English</strong> sentences, and <strong>to</strong> build a<br />
2
high-quality translation technology that is adequate <strong>for</strong> text-<strong>to</strong>-text translation. In this<br />
research we build an Interlingua architecture in MT which translates efficiently. We con-<br />
sider semantic analysis and other disambiguation related <strong>to</strong> <strong>Arabic</strong>. This research also<br />
represents a starting point <strong>for</strong> the future implementation of a successful and complete<br />
<strong>Arabic</strong> MT engine. The hypothesis under investigation and main aims are <strong>to</strong> present an<br />
interlingua architecture, which is not only successful in translating simplex <strong>Arabic</strong> (in-<br />
transitive, transitive, ditransitive and copula-like nominative) sentences <strong>to</strong> corresponding<br />
<strong>English</strong> sentences, but also does so in the most optimal way.<br />
This research is the first contribution (not just <strong>for</strong> <strong>Arabic</strong>) that uses the Role and Refer-<br />
ence Grammar (RRG) model as a basis <strong>for</strong> <strong>machine</strong> translation. This contribution shows<br />
how RRG can be used <strong>to</strong> deduce the logical structure of sentences and produce a lexical<br />
representation which can then be used as the interlingua bridge. The lexicon in RRG<br />
takes the position that lexical entries <strong>for</strong> verbs should contain unique in<strong>for</strong>mation only,<br />
with as much in<strong>for</strong>mation as possible derived from general lexical rules. This was the<br />
reason <strong>for</strong> creating our own lexicon since we need an RRG–based lexicon of the unique<br />
in<strong>for</strong>mation of verbs and their logical structure.<br />
UniArab stands <strong>for</strong> Universal <strong>Arabic</strong> <strong>machine</strong> transla<strong>to</strong>r system. The UniArab system<br />
is a natural language processing application based on Role and Reference Grammar <strong>for</strong><br />
translating the <strong>Arabic</strong> language in<strong>to</strong> any other language, using an RRG based interlingua<br />
bridge. The UniArab system can understand the part of speech of a word, agreement<br />
features, number, gender and the word type. The syntactic parse unpacks the agreement<br />
features between elements of the <strong>Arabic</strong> sentence in<strong>to</strong> a semantic representation (the log-<br />
ical structure) with the ‘state of affairs’ of the sentence. In the UniArab system we intend<br />
<strong>to</strong> have a strong analysis system that can unpack all in<strong>for</strong>mation and its attributes. This<br />
3
1.1. MOTIVATION<br />
allows <strong>for</strong> a generalized target language <strong>to</strong> be generated from the logical structures. In<br />
this research we translate from <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> only, with a view <strong>to</strong> translate from<br />
<strong>Arabic</strong> <strong>to</strong> any other target language in the future.<br />
1.1 Motivation<br />
The motivation <strong>for</strong> an <strong>Arabic</strong>-<strong>English</strong> translation <strong>to</strong>ol is obvious when one considers that<br />
<strong>Arabic</strong> is the lingua franca of the Middle-Eastern world. Presently, 20 countries with a<br />
combined population of 450 million consider Standard <strong>Arabic</strong> as their national language.<br />
A simple test case during a study at Abu Dhabi University over three popular <strong>Arabic</strong><br />
translation <strong>to</strong>ols (Google, Sakhr’s Tarjim and Systran) revealed little success in generat-<br />
ing the correct meaning (Izwaini 2006). This research demonstrates the capabilities of<br />
Role and Reference Grammar as a base <strong>for</strong> multi-language translation.<br />
1.2 Goals<br />
The goal of our research <strong>to</strong>wards an <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> <strong>machine</strong> translation system is <strong>to</strong><br />
create a system that translates simplex sentences of Modern Standard <strong>Arabic</strong> as a source<br />
language in<strong>to</strong> <strong>English</strong>. Our goal is <strong>to</strong> build a system which can translate a wide variety of<br />
simple sentence types. We aim <strong>to</strong> make this system as scalable as possible by allowing<br />
users <strong>to</strong> add <strong>to</strong> the lexicon and later, in future research, <strong>to</strong> include complex sentences.<br />
To achieve this goal, it is essential <strong>to</strong> build a robust and accurate lexical system and<br />
<strong>machine</strong> transla<strong>to</strong>r. One of the steps we have <strong>to</strong> achieve is <strong>to</strong> generate the universal<br />
logical structure from a source sentence. The system should be capable of dealing with<br />
free word order which <strong>Arabic</strong> exhibits. This poses a significant challenge <strong>to</strong> MT due<br />
<strong>to</strong> the vast number of ways <strong>to</strong> express the same sentence in <strong>Arabic</strong>. Also, we must<br />
account <strong>for</strong> verbs that do not exist in <strong>Arabic</strong> like the copula verb ‘<strong>to</strong> be’ and the verb<br />
‘<strong>to</strong> have’. The system should deal with the transitivity of verbs (intransitive, transitive,<br />
4
1.3. TECHNOLOGIES<br />
ditransitive). The <strong>Arabic</strong> language is written from right <strong>to</strong> left and has a unique letter<br />
shape. Words are written in horizontal lines from right <strong>to</strong> left. The letter shape depends<br />
on its position in the word; initial (prefix), medial (infix), final (suffix) or (Isolated).<br />
In technical linguistic terms, <strong>Arabic</strong> is a ‘pro–drop’ or ‘pronoun–drop’ language. It can<br />
define who takes the action by using conjugations. The pro–drop parameter is an aspect of<br />
grammar that allows subjects <strong>to</strong> be optional in some languages. That is, every inflection<br />
in a verb paradigm is specified uniquely and does not need <strong>to</strong> use independent pronouns<br />
<strong>to</strong> differentiate the person, number, and gender of the verb. The system should cover and<br />
solve the “pro–drop” challenge in <strong>Arabic</strong>.<br />
1.3 Technologies<br />
We introduce the main technologies used <strong>to</strong> support the development of the research pre-<br />
sented in this thesis. These technologies are mainly the XML language and Java. The<br />
most recent recommendation of the XML language has been presented by Bray et al.<br />
(2008). XML has become the default standard <strong>for</strong> data exchange among heterogeneous<br />
data sources (Arciniegas 2000). The UniArab system allows data <strong>to</strong> be s<strong>to</strong>red in XML<br />
<strong>for</strong>mat. This data can then be queried, exported and serialized in<strong>to</strong> any <strong>for</strong>mat the devel-<br />
oper wishes. The Java programming language is used <strong>to</strong> implement the logical structures.<br />
The primary advantage being that Java is plat<strong>for</strong>m-independent and thus highly suitable<br />
<strong>for</strong> MT.<br />
Advantages of XML<br />
XML is a generalized way <strong>to</strong> s<strong>to</strong>re data, which is not married <strong>to</strong> any particular technology.<br />
This makes it easy <strong>to</strong> s<strong>to</strong>re something, and then come back and grab it later with some<br />
other technology <strong>for</strong> processing. Using XML <strong>to</strong> exchange in<strong>for</strong>mation offers a number<br />
of advantages, including the following:<br />
5
1.4. THESIS ORGANIZATION<br />
Easily built: A well-<strong>for</strong>med data element must be enclosed between tags. The XML<br />
document can be parsed without prior knowledge of the tags. XML allows you <strong>to</strong> define<br />
all sorts of tags with all sorts of rules, such as tags representing data description or data<br />
relationships.<br />
Human readable: Using intelligible tag names will make it possible <strong>to</strong> read, even by<br />
novices.<br />
Machine readable: XML was designed <strong>to</strong> be easy <strong>for</strong> computers <strong>to</strong> process. XML is<br />
completely compatible with Java and portable plat<strong>for</strong>ms. Any application can process<br />
XML on any plat<strong>for</strong>m, as it is a plat<strong>for</strong>m-independent language.<br />
XML fully supports <strong>Arabic</strong>: We chose <strong>to</strong> create our datasource as XML files, <strong>for</strong> op-<br />
timum support of different plat<strong>for</strong>ms. It was also easier as we used <strong>Arabic</strong> letters rather<br />
than Unicode inside the datasource.<br />
XML search engine: It is easy <strong>to</strong> extend the search sample <strong>to</strong> display more in<strong>for</strong>mation<br />
about the search. Search by Java API Document Object Model (DOM) is the ideal <strong>to</strong>ol<br />
<strong>for</strong> searching collections of XML documents.<br />
1.4 Thesis organization<br />
This thesis is organized as follows. Chapter 2 explores how the characteristics of the<br />
(Modern Standard) <strong>Arabic</strong> language will affect the development of an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong><br />
<strong>machine</strong> translation (MT) <strong>to</strong>ol. Several distinguishing features of <strong>Arabic</strong> pertinent <strong>to</strong> MT<br />
are explored in detail (Salem et al. 2008b). Chapter 3 reviews the most important fea-<br />
tures of Role and Reference Grammar (RRG) Theory (Salem and Nolan 2009a). Chapter<br />
4 will discuss some distinguishing features of Machine translation strategies. Chapter<br />
5 presents the design of an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation <strong>framework</strong> based on<br />
RRG. It also presents a high-level view of the system <strong>framework</strong> and defines our eval-<br />
uation criteria <strong>for</strong> measuring system per<strong>for</strong>mance and effectiveness (Salem and Nolan<br />
6
1.4. THESIS ORGANIZATION<br />
2009b). Chapter 6 presents UniArab: a proof-of-concept <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong><br />
transla<strong>to</strong>r system. It covers the technical aspects of UniArab, covering all the phases<br />
involved in the <strong>machine</strong> translation process. We describe the lexical system that under-<br />
lies UniArab, detailing the attribute in<strong>for</strong>mation held <strong>for</strong> each type of word. We discuss<br />
the input and generation phase and how the system maps the logical structure <strong>to</strong> a target<br />
<strong>English</strong> sentence. We then briefly discuss the user interface, and some of the technical<br />
challenges encountered during the implementation (Salem et al. 2008a) and (Nolan and<br />
Salem 2009). Chapter 7 discusses the evaluating and experimental results of the case<br />
study. We present the results of our evaluation of UniArab <strong>for</strong> a wide variety of simple<br />
(Intransitive, Transitive and Ditransitive) sentence types (Salem and Nolan 2009c). The<br />
thesis conclusions and future work are discussed in Chapter 8.<br />
7
2<br />
The <strong>Arabic</strong> Language<br />
<strong>Arabic</strong> is a language with a derivational and inflectional rich morphology (Holes 2004).<br />
The version of <strong>Arabic</strong> we consider in this research is Modern Standard <strong>Arabic</strong> (MSA).<br />
When we mention <strong>Arabic</strong> throughout this research we mean MSA which is distinct from<br />
classical <strong>Arabic</strong>. Modern Standard <strong>Arabic</strong> (MSA) is a modernized <strong>for</strong>m of Classical<br />
<strong>Arabic</strong> (Alosh 2005). MSA is the literary and standard variety of <strong>Arabic</strong> used in writing<br />
and <strong>for</strong>mal speeches <strong>to</strong>day (Schulz 2005). MSA is the universal language of the <strong>Arabic</strong>-<br />
speaking population. MSA is printed in most books, newspapers, magazines, official<br />
documents, and reading primers <strong>for</strong> children. Most of the oral <strong>Arabic</strong> spoken <strong>to</strong>day is<br />
more divergent than the written <strong>Arabic</strong> language. <strong>Arabic</strong> words are often ambiguous<br />
in their morphological analysis (Al-Sughaiyer and Al-Kharashi 2004). As a language,<br />
<strong>Arabic</strong> is rich in morphological and syntactic structures. <strong>Arabic</strong> is also challenging in<br />
that it is a derivational or constructional language rather than a concatenative one. Words<br />
8
2.1. CHARACTERISTICS OF THE ARABIC LANGUAGE<br />
like ‘go’ yd ¯ hb and d ¯ hb ∗ can easily be seen as being part of a hierarchy<br />
of inheritance from a ‘specific root (in this case d ¯ hb). In <strong>English</strong> and in many<br />
other languages this is not usually the case. The <strong>Arabic</strong> language is written from right<br />
<strong>to</strong> left. It has 28 letters, many language specific grammar rules with a relatively free<br />
word order language. Each <strong>Arabic</strong> letter represents a specific sound so the spelling of<br />
words can easily be done phonetically. There is no use of silent letters as in <strong>English</strong>.<br />
Similarly, there is no need <strong>to</strong> combine letters in <strong>Arabic</strong> <strong>to</strong> indicate a certain sound. For<br />
example, the ’th’ sound in <strong>English</strong> as in the word ’Thinking’ is reduced in <strong>Arabic</strong> <strong>to</strong> the<br />
character t ¯ . In addition <strong>to</strong> the standard challenges involved in developing an efficient<br />
translation <strong>to</strong>ol from <strong>Arabic</strong> <strong>to</strong> <strong>English</strong>, the relatively free word order nature of <strong>Arabic</strong><br />
creates an obstacle. There is no copula verb ‘<strong>to</strong> be’ in <strong>Arabic</strong>, <strong>for</strong> example, the mere<br />
juxtaposition of the subject and predicate indicates the predicational relationship. The<br />
absence of the indefinite article, while not unique <strong>to</strong> <strong>Arabic</strong> still poses many difficulties<br />
within the context of the language structure.<br />
2.1 Characteristics of the <strong>Arabic</strong> language<br />
The copula verbs ‘<strong>to</strong> be’ and ‘<strong>to</strong> have’ do not exist in <strong>Arabic</strong>. Instead of saying ‘My<br />
name is Zaid’, the <strong>Arabic</strong> equivalent would read like ‘Name mine Zaid’ - <br />
֓ismy<br />
zyd. Instead of saying ‘She is a student’, the <strong>Arabic</strong> equivalent would be ‘She student’;<br />
in <strong>Arabic</strong> hy t.ālbh. The copula in <strong>Arabic</strong> is only realised in the past and<br />
future tenses and in negation. Regarding the verb ‘<strong>to</strong> have’, which in <strong>English</strong> can also<br />
mean ‘<strong>to</strong> own’. Instead of saying “He has a house”, the <strong>Arabic</strong> equivalent is ‘To him a<br />
house’ - lh byt. Adjectives in <strong>Arabic</strong> have both a masculine and a feminine <strong>for</strong>m.<br />
The singular feminine adjective is just like the masculine adjective but morphologically<br />
marked (Ryding 2007).<br />
∗ <strong>Arabic</strong> examples are written here by using Buckwalter <strong>Arabic</strong> Transliteration which is converted in latex in<strong>to</strong> the DIN 31635<br />
standard of <strong>Arabic</strong> transliteration<br />
9
2.1. CHARACTERISTICS OF THE ARABIC LANGUAGE<br />
Table 2.1: Dual: merely add two letters <strong>to</strong> achieve dual <strong>for</strong>m in <strong>Arabic</strong><br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
bāb door<br />
bābān two doors<br />
The <strong>Arabic</strong> number system includes the dual <strong>for</strong>m, whereas other languages move from<br />
the singular <strong>to</strong> the plural <strong>for</strong>m directly. In <strong>Arabic</strong> we need only <strong>to</strong> add two letters <strong>to</strong> the<br />
singular <strong>for</strong>m <strong>to</strong> express the dual <strong>for</strong>m. An example is given in Table 2.1. The plural<br />
<strong>for</strong>m, however, is obtained using a different mechanism.<br />
Plurals are of two types:<br />
(1) The sound plural. The sound plural is one in which the singular <strong>for</strong>m of the word<br />
remains intact (sound) with some addition at the end. Examples;<br />
Masculine in the nominative case e.g. engineers mhndswn in which wn is<br />
added <strong>to</strong> a singular noun. Masculine in the accusative and genitive cases e.g. engineers<br />
mhndsyn in which yn is added <strong>to</strong> the singular noun.<br />
Feminine in the nominative e.g engineers mhndsātun in which ātun is<br />
added <strong>to</strong> the singular noun.<br />
Feminine in the accusative and genitive cases engineers mhndsātin in which <br />
ātin is added <strong>to</strong> the singular noun.<br />
(2) The broken plural. The broken plural is one in which the <strong>for</strong>m of the singular word is<br />
broken, that is, changed. It has no fixed rule <strong>for</strong> making it. Sometimes letters are added<br />
or deleted and sometimes there is merely a change in the vowels. Examples ktb<br />
books, ktāb a book, rˇgl man, rˇgāl men, snh a year snwāt<br />
years.<br />
10
2.2 Characteristics of <strong>Arabic</strong> words<br />
2.2. CHARACTERISTICS OF ARABIC WORDS<br />
There is no upper and lower case distinction. Words are written horizontally from right<br />
<strong>to</strong> left. Most letters change <strong>for</strong>m depending on whether they appear at the beginning,<br />
middle or end of a word or on their own. <strong>Arabic</strong> letters that may be joined are always<br />
joined in both hand-written and printed <strong>for</strong>m.<br />
An interesting feature of <strong>Arabic</strong> is its treatment of the demonstrative. Whereas in En-<br />
glish one refers <strong>to</strong> an object that is either near or far as simply this (very near the speaker)<br />
or that (away from the speaker up <strong>to</strong> any distance), <strong>Arabic</strong> has a third demonstrative <strong>to</strong><br />
specify objects that are in between these points on the distance spectrum.<br />
Table 2.2: Grammatical gender<br />
<strong>Arabic</strong> Masculine <strong>English</strong> Translation<br />
qmr moon<br />
<br />
syf sword<br />
bāb door<br />
<strong>Arabic</strong> Feminine <strong>English</strong> Translation<br />
ˇsms sun<br />
֒s. ā stick<br />
nāfd ¯ h window<br />
In <strong>Arabic</strong>, all nouns must be either feminine or masculine, and the gender can be either<br />
grammatical or natural. The gender of inanimate objects is grammatical, examples are<br />
in Table 2.2. In this case the gender is a built-in lexical property of the word. Animate<br />
objects have a natural gender, and this gender can be either non-productive or productive.<br />
The non-productive gender is the case of nouns where the feminine and the masculine<br />
have different lexical entries, i.e., the feminine is not derived from the masculine, as<br />
in Table 2.3. By contrast, in the productive gender, the feminine is derived from the<br />
masculine, usually by adding a special suffix ‘ta marbuta’ <strong>to</strong> the end of the masculine<br />
11
2.2. CHARACTERISTICS OF ARABIC WORDS<br />
<strong>for</strong>m, as in Table 2.4. The <strong>Arabic</strong> definite article is concatenated <strong>to</strong> nouns and adjectives.<br />
The shape of the definite article is shown in Table 2.5.<br />
Table 2.3: Feminine is different than masculine<br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
daˇgaāˇgah Chicken<br />
dyk Cock<br />
Table 2.4: Feminine and masculine in <strong>Arabic</strong><br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
mu֒ilmun teacher(M)<br />
mu֒limtun teacher(F)<br />
t.ālb student(M)<br />
t. ālbh student(F)<br />
Table 2.5: Definiteness in <strong>Arabic</strong><br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
āl the<br />
The definite article in <strong>Arabic</strong> is graphically prefixed <strong>to</strong> an <strong>Arabic</strong> noun. An example of<br />
<strong>Arabic</strong> definiteness is shown in Table 2.6.<br />
2.2.1 Free word order<br />
Table 2.6: Definiteness example in <strong>Arabic</strong><br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
rˇgl a man<br />
ālrˇgl the man<br />
<strong>Arabic</strong> has a relatively free word order (Ramsay and Mansour 2006), this poses a signif-<br />
icant challenge <strong>to</strong> MT due <strong>to</strong> the number of possible ways <strong>to</strong> express the same sentence<br />
in <strong>Arabic</strong>. For the elements of subject(S), verb(V) and object(O), <strong>Arabic</strong>’s relatively free<br />
12
2.2. CHARACTERISTICS OF ARABIC WORDS<br />
word order allows the combinations of SVO, VSO, VOS and OVS. For example, consider<br />
the following word orders:<br />
(1) Noun1 Verb Noun2<br />
(2) Noun2 Verb Noun1<br />
(3) Verb Noun1 Noun2<br />
(4) Verb Noun2 Noun1<br />
(a) Noun Verb Noun example.<br />
qys yh. b lylā<br />
Qays loves Laila<br />
lylā yh. b qys<br />
Laila loves Qays<br />
noun verb noun<br />
Table 2.7: Free word order<br />
(c) Verb Noun Noun example.<br />
yh. b lylā qys<br />
Qays loves Laila<br />
qys lylā yh. b<br />
Qays Laila loves<br />
noun noun verb<br />
(b) Verb Noun Noun example.<br />
yh. b qys lylā<br />
Qays loves Laila<br />
lylā qys yh. b<br />
Laila Qays loves<br />
noun noun verb<br />
This means that we have a challenge <strong>to</strong> identify exactly which is the subject and the ob-<br />
ject. Tables 2.7(a), 2.7(b) and 2.7(c) show this challenge. In <strong>Arabic</strong> the subject agrees<br />
with the verb with appropriate morphological marking on the word <strong>to</strong> differentiate sub-<br />
ject from object in these free word order sentences. †<br />
The difference in Tables 2.7(a), 2.7(b) and 2.7(c) is the position of the ac<strong>to</strong>r. The sen-<br />
tences in fact have the same meaning. While in <strong>English</strong> the <strong>for</strong>m of a sentence is subject<br />
verb object.<br />
† Note that <strong>Arabic</strong> sentences should be read from right <strong>to</strong> left.<br />
13
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
2.3 Part of speech inven<strong>to</strong>ry of the <strong>Arabic</strong> language<br />
In the <strong>Arabic</strong> linguistic tradition there is not a clear-cut, well-defined analysis of the<br />
inven<strong>to</strong>ry of parts of speech in <strong>Arabic</strong>. Attia (2008) mentioned that the traditional clas-<br />
sification of <strong>Arabic</strong> parts of speech in<strong>to</strong> nouns, verbs and particles is not sufficient <strong>for</strong> a<br />
complete computational grammar. This categorization, originally proposed by Sibawaih<br />
(Owens 2006), remains the standard accepted scheme <strong>to</strong>day. However, we have found it<br />
lacking when applied <strong>to</strong> <strong>machine</strong> translation, and so, developed our own lexical scheme.<br />
Our classification of the parts of speech in <strong>Arabic</strong> is illustrated in Figure 2.1. We clas-<br />
sified parts of speech in<strong>to</strong> nouns, adjectives, adverbs, verbs, demonstratives, and others.<br />
Each category will be further explained in the following subsections.<br />
2.3.1 Noun<br />
Figure 2.1: A classification <strong>for</strong> the <strong>Arabic</strong> language syntax<br />
A noun denotes either tangible or intangible identities. Nouns are independent of other<br />
words in indicating their meaning. What distinguishes nouns from verbs is that nouns<br />
refer <strong>to</strong> entities or things. Nouns are further classified in<strong>to</strong> pronouns, common nouns and<br />
proper nouns. Pronouns are classified according <strong>to</strong> person (first, second, third), number<br />
14
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
(singular, dual and plural) and gender (masculine and feminine). They can also be nom-<br />
inative, accusative, or genitive. Examples are ֓anā “I”, ֓ant “you”, and hw<br />
“he”. We make a further classification of common nouns in<strong>to</strong> animate and inanimate.<br />
Examples of common noun are in Table 2.8.<br />
Table 2.8: Noun example in <strong>Arabic</strong><br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
<br />
raˇgulun man<br />
ˇsaˇgrh tree<br />
Although this seems more like a semantic classification, the <strong>Arabic</strong> morphology and<br />
syntax needs this classification. For example the choice of demonstrative adjective with<br />
plural nouns depends on whether the noun is human or non-human. For example<br />
hd ¯ h ālklāb<br />
this.sg.f dog.pl<br />
The proper nouns are further classified in<strong>to</strong> names of persons, such as ֒mr “Omar”<br />
and <br />
h ˘ āld “Khalid”; locations, such as ālqāhrh “Cairo” and ֓ayrlndā<br />
“Ireland”; organizations āl֓amm ālmth. dh “United Nations”; and objects,<br />
such as lynwks “Linux” Common nouns can either be definite or indefinite.<br />
2.3.1.1 Definite nouns<br />
A noun normally can be considered as definite (in <strong>Arabic</strong>: m֒rfh) when the speaker<br />
and the reader know about the specific object being referred <strong>to</strong>, <strong>for</strong> example in Table 2.9.<br />
Table 2.9: Definite example in <strong>Arabic</strong><br />
<strong>Arabic</strong> ālktāb āld ¯ y tbh. t ¯ ֒nh fwq ālt.āwlh.<br />
<strong>English</strong> Translation The book you are looking <strong>for</strong> is on the table.<br />
In the example, the word ‘book’ is definite by using the definite article ‘the’, since both<br />
the speaker and the listener know which book they are dealing with. The definite article<br />
15
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
in <strong>Arabic</strong> is used <strong>to</strong> introduce and talk about a known subject. The <strong>Arabic</strong> language uses<br />
the same article <strong>for</strong> all nouns, be they male or female, singular or plural. The article is<br />
written be<strong>for</strong>e the noun it refers <strong>to</strong> and, graphically, it appears attached <strong>to</strong> it.<br />
2.3.1.2 Indefinite nouns<br />
Indefinite nouns (in <strong>Arabic</strong>: nkrh ) are nouns which are not specified. It is translated<br />
as ‘a’ or ‘an’ in <strong>English</strong>, e.g. a man, an apple, water. There is no need <strong>to</strong> translate it<br />
everywhere as in the example of water. The absence of the indefinite article is, as in<br />
Table 2.10, a potential source of problems <strong>for</strong> <strong>Arabic</strong>-<strong>English</strong> <strong>machine</strong> translation.<br />
Table 2.10: Indefinite example in <strong>Arabic</strong><br />
<strong>Arabic</strong> wˇgdt ktābā ֒lā ālt.āwlh hl hw lk?<br />
<strong>English</strong> I found (a) book on the table, is it yours?<br />
2.3.2 Adjectives<br />
Adjectives are used <strong>to</strong> modify nouns. <strong>Arabic</strong> adjectives agree with nouns in number,<br />
gender, definiteness and case. An example is the adjective “useful”, in Table 2.11.<br />
2.3.3 Adverbs<br />
Table 2.11: <strong>Arabic</strong> adjective<br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
qr֓at ktābā nāf֒ā I read a useful book<br />
Adverbs are used <strong>to</strong> modify verbs. They can be adverbs of place, time or manner. An<br />
example in Table 2.12.<br />
Table 2.12: <strong>Arabic</strong> adverb<br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
֒qd ālāˇgtmā֒msā֓ The meeting was held in the evening<br />
16
2.3.4 Verbs<br />
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
A verb describes both an action and tense. There are four ways <strong>to</strong> classify verbs in<br />
<strong>Arabic</strong>: according <strong>to</strong> tense, transitivity, mood and voice:<br />
2.3.4.1 Verb tenses<br />
There are mainly two tenses in <strong>Arabic</strong>: the imperfect and the perfect.<br />
The imperfect tense <br />
ālf֒l ālmd. ār֒, which indicates that an action has not<br />
yet been completed but is being done or will be done; something that is happening at the<br />
moment. An example is shown in Table 2.13.<br />
Table 2.13: Imperfect tense <br />
ālf֒l ālmd. ār֒<br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
yaktubu he is writing.<br />
The perfect tense ‘ mād. y, which indicates that an action has been completed. An<br />
example is shown in Table 2.14.<br />
Table 2.14: Perfect tense ālf֒l ālmād. y<br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
<br />
kataba he wrote.<br />
Both the perfect and imperfect tenses can be modified by thirteen inflectional <strong>for</strong>ms<br />
which depend on person, mood and number. Table 2.15 shows these <strong>for</strong>ms applied <strong>to</strong><br />
the imperfect, and Table 2.16 shows the thirteen person markers <strong>for</strong> the perfect tense.<br />
<br />
The word ‘ swf’ if it is be<strong>for</strong>e the imperfect tense then the verb has a future meaning.<br />
Graphically a word like this will look like [sawfa + imperfect] or [s + imperfect] similar<br />
<strong>to</strong> the example in Table 2.17.<br />
In <strong>Arabic</strong>, it is possible <strong>to</strong> combine the verb kaana kān with the main verb <strong>to</strong> in-<br />
dicate past progressive. This is where an action <strong>to</strong>ok place in the past but happened<br />
17
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
Table 2.15: Imperfect inflectional <strong>for</strong>ms of word ‘write’<br />
Singular Dual Plural<br />
First Person <br />
nktbu <br />
֓aktbu<br />
Second Person (m) tktbwna tktbāni <br />
tktbu<br />
Second Person (f)<br />
tktbna tktbāni tktbna<br />
Third Person (m) yktbwna yktbāni <br />
yktbu<br />
Third Person (f)<br />
yktbna tktbāni <br />
tktbu<br />
Table 2.16: Perfect inflectional <strong>for</strong>ms of word ‘wrote’<br />
Singular Dual Plural<br />
First Person uktbnā ktbt<br />
Second Person (m) ktbtum <br />
ktbtumā<br />
ktbta<br />
<br />
ktbtuna <br />
ktbtumā ktbti<br />
Second Person (f)<br />
Third Person (m) ktbwā ktbā <br />
ktba<br />
Third Person (f)<br />
ktbna ktbatā<br />
ktbat<br />
Table 2.17: Future tense in <strong>Arabic</strong><br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
<br />
<br />
swf yaktubu he will write<br />
<br />
<br />
<br />
syaktubu he will write<br />
over a long period, or represents a state of being. This construct is used when talk-<br />
ing about knowledge of something in the past. In <strong>Arabic</strong>, the past perfect progressive<br />
is actually indicated using the present tense and the particle mundhu mnd. e.g.<br />
¯<br />
֓a֒yˇs hnā mnd hms snwāt I have been living here <strong>for</strong> five<br />
¯ ˘<br />
years. Future perfect in <strong>Arabic</strong> is indicated using the present tense of kaana with a past<br />
tense main verb. e.g. sykwn qd ֓anhā drāsth he will have fin-<br />
ished his studies.<br />
18
2.3.4.2 Aspect<br />
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
Tense deals with when an action occurs, aspect determines whether the action has been<br />
completed, is ongoing or is yet <strong>to</strong> occur. In <strong>Arabic</strong>, tense and aspect are generally blended<br />
<strong>to</strong>gether, that is why past/present are often switched with perfect/imperfect in discussion.<br />
For a larger discussion on the sentax the tense and aspect refer <strong>to</strong> Ryding (2007).<br />
2.3.4.3 Mood<br />
Mood is reflected in <strong>Arabic</strong> in word structure, and so analysis is a part of the morphol-<br />
ogy. The mood can be indicative, subjunctive, imperative or jussive. The indicative are<br />
straight<strong>for</strong>ward statements, the subjunctive includes the attitude <strong>to</strong>wards actions, the im-<br />
perative indicates a command. Mood marking is only done on the present tense. There<br />
are no markings <strong>for</strong> past tense. Examples of the four moods are shown in Tables 2.18,<br />
2.19, 2.20 and 2.21.<br />
Table 2.18: Indicative mood<br />
<strong>Arabic</strong> <br />
nrh. b bzbā֓ynnā<br />
<strong>English</strong> we welcome our cus<strong>to</strong>mers.<br />
<strong>Arabic</strong> y˙gādr dbln ālywm<br />
<strong>English</strong> He leaves Dublin <strong>to</strong>day.<br />
Table 2.19: Subjunctive mood<br />
<strong>Arabic</strong> <br />
yˇgb ֓an nqwm bzyārh<br />
<strong>English</strong> It is necessary that we undertake a visit.<br />
<strong>Arabic</strong><br />
Table 2.20: Jussive mood<br />
lm n֓at<br />
<strong>English</strong> we did not come.<br />
֓is. lāh. āt lm tktml mnd ¯ ֒āmyn<br />
<strong>Arabic</strong><br />
<strong>English</strong> renovations that have not been completed <strong>for</strong> two years<br />
19
2.3.4.4 Voice<br />
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
Table 2.21: Imperative mood<br />
<strong>Arabic</strong> āfth. yā smsm<br />
<strong>English</strong> Open, Semsame.<br />
<strong>Arabic</strong> āsmh. ly<br />
<strong>English</strong> Permit me.<br />
<strong>Arabic</strong> lā tns<br />
<strong>English</strong> Do not <strong>for</strong>get.<br />
The voice in <strong>Arabic</strong> is indicated by inflection on the verbs and differentiates between<br />
active and passive, as shown in the contrast between qāl “[he] said” and yqāl<br />
“[it is] said”.<br />
2.3.4.5 Transitivity<br />
In <strong>Arabic</strong> and <strong>English</strong>, we can classify verbs as either intransitive, transitive or ditransi-<br />
tive.<br />
<br />
(1) Intransitive ( <br />
āllāazim)<br />
An intransitive verb is unable <strong>to</strong> take an object; it exists alone. Intransitive verbs<br />
include ysbh. , swim, y֓akl , ate, māt ,die, nā֓ym sleep.<br />
Some verbs can be both transitive and intransitive:<br />
<br />
֓anā fzt , I won. (Intransitive)<br />
֓anā fzt bālˇgā֓yzh āl֓awlā , I won the first prize. (Transitive)<br />
(2) Transitive ( ālmt֒dy)<br />
A transitive verb takes one or more objects (an object, or undergoer of the verb).<br />
For example; ֓iˇstrā ֒mr ktāb , Omar bought a book. <br />
֓aktb rsālh , I write a letter. ֓abw bkr wd. ֒ ālktāb<br />
fwq mktbh, Abu Bakr put the book on his disk.<br />
20
2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />
A transitive verb is incomplete without a direct object. For example;<br />
Incomplete: hāld yh. ml , Khalid holds.<br />
˘<br />
Complete: <br />
<br />
hāld yh. ml tlāth ktb<br />
˘ ¯ ¯<br />
w h. āswbh ālˇsh ˘ s.y w zhwr , Khalid holds three books, his lap<strong>to</strong>p and flowers.<br />
(3) Ditransitive<br />
A ditransitive verb takes two objects. This can be through an indirect object con-<br />
struction, ֒s. ām ֓a֒t.ā ktāb l֒mr , Essam gave a book <strong>to</strong> Omar<br />
Or double object construction, ֒s. ām ֓a֒t.ā ֒mr ktāb , Essam<br />
gave Omar a book.<br />
2.3.5 Demonstratives<br />
The demonstrative pronouns in <strong>Arabic</strong> include reference <strong>for</strong> the near hd ¯ ā “this”, the<br />
far d ¯ lk “that” and <strong>for</strong> the inbetween d ¯ āk, which has no equivelent in <strong>English</strong>.<br />
2.3.6 Others<br />
This class includes all other types of words not included in the previous categories. It<br />
includes, <strong>for</strong> example, the prepositions, such as mn “from”, ֒lā “on”, fy “in”,<br />
֓ilā “till”. It also includes conjunctions, such as w “and”; determiners such as <br />
āl “the”; relative pronouns, such as āld¯ y “who (masculine)” and ālty “who<br />
(feminine)” and particles, such as ln “Will not” (Khan 2007).<br />
Table 2.22: Particle ‘Lan’<br />
<strong>Arabic</strong> <strong>Arabic</strong> Meaning <strong>English</strong> Translation<br />
<br />
<br />
<br />
<br />
lan yad ¯ haba will not ‘he’ go he will not go<br />
The particle ln is used <strong>to</strong> negate future events. It is used within the imperfect tense<br />
(Versteegh 2001). An example is shown in Table 2.22.<br />
21
2.4 Sentence types in <strong>Arabic</strong><br />
2.4. SENTENCE TYPES IN ARABIC<br />
A sentence is a string of words that expresses a semantically complete message. There are<br />
two main sentence types in <strong>Arabic</strong>: verbal sentences and equational or copula sentences.<br />
The classification of clauses in the <strong>Arabic</strong> language is illustrated in Figure 2.2.<br />
Figure 2.2: A classification of clauses in the <strong>Arabic</strong> language<br />
2.4.1 Equational sentences<br />
Equational sentences contain two parts (subject and predicate). In <strong>Arabic</strong> the copula verb<br />
‘<strong>to</strong> be’ is not used in the present tense. Both the subject and the predicate have <strong>to</strong> be<br />
in the nominative case if they are not preceded by ֓in ”indeed“ or kān ”was“<br />
(Abn-Aqeal 2007). In Table 2.23 the predicate in the first example is realized as a noun<br />
phrase, in the second example as an adjective, and in the third example as a preposition.<br />
The subject and predicate can serve as arguments <strong>for</strong> other verbs as will be shown in the<br />
following subsections.<br />
22
2.4.1.1 Verb and noun<br />
2.4. SENTENCE TYPES IN ARABIC<br />
Table 2.23: Nominal sentence<br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
zaydun t.aālibun Zaid is (a) student.<br />
zaydun krym Zaid is generous.<br />
<br />
<br />
<br />
<br />
zaydun fy ālbyt Zaid is in the house.<br />
Verb and noun. such as: s֓al ֓ah. md Ahmad asked.<br />
2.4.1.2 Verb and two nouns<br />
It only occurs in one construct <br />
kān w ֓ah ˘ wāthā kan and its sisters. The<br />
verb kna kān and its sister verbs mark the time or duration of actions, states, and<br />
events. Sentences that use these verbs are considered <strong>to</strong> be a type of nominal sentence<br />
according <strong>to</strong> <strong>Arabic</strong> grammar, not a type of verbal sentence. The word order resembles<br />
Verb Subject Object when there is no other verb in the sentence,<br />
They are kān was, s. ār <strong>to</strong> become, ֓as. bh. <strong>to</strong> become, ֓ad. h. ā <strong>to</strong><br />
become, ֓amsā <strong>to</strong> become, z . l <strong>to</strong> remain, bāt <strong>to</strong> be, lys it is not.<br />
<strong>English</strong> can not express the punctual and telic aspectual differences encoded within the<br />
<strong>Arabic</strong> examples just mentioned.<br />
Table 2.24: Kan and its sisters <br />
kān w ֓ah ˘ wthā<br />
<strong>Arabic</strong> <strong>English</strong> Translation<br />
<br />
<br />
kān āl֓aklu ld¯ ydāan The food was delicious.<br />
¯<br />
With these verbs the subject is in the nominative case and the predicate is in the accusative<br />
case, an example is shown in Table 2.24.<br />
23
2.4.1.3 Verb and three nouns<br />
2.4. SENTENCE TYPES IN ARABIC<br />
It only occurs in one construct <br />
<br />
z . n w ֓ah ˘ wāthā anna and its sisters. Both the<br />
subject and the predicate of z . n and its sisters are in an equational clause.<br />
They are z . n <strong>to</strong> guess or <strong>to</strong> think, h. sb <strong>to</strong> consider, ֒lm <strong>to</strong> learn (about), <br />
ˇg֒l <strong>to</strong> make, s. yr <strong>to</strong> make. They usually come be<strong>for</strong>e the nominal sentences ‘subject<br />
and a predicate’, an example is in Table 2.25. <strong>English</strong> can not express the semantic and<br />
causative ‘make’ differences encoded within the <strong>Arabic</strong> examples just mentioned.<br />
Table 2.25: zanna and its sisters <br />
z . n w ֓ah ˘ wthā<br />
<strong>Arabic</strong> z . n ֓ah. md ālqyādh shlt.<br />
<strong>English</strong> Translation Ahmad thinks leadership is easy.<br />
2.4.1.4 Verb and four nouns<br />
This clausal type is used in classical <strong>Arabic</strong>, but not in MSA. It is mentioned here only<br />
<strong>for</strong> the sake of completeness. It has one type in ֓a֒lm w ֓ary in<strong>for</strong>med and<br />
showed. They are ֓a֒lm in<strong>for</strong>med, ֓ary showed, ֓anb֓a <strong>to</strong>ld. nb֓a <strong>to</strong>ld,<br />
<br />
֓ahbr <strong>to</strong>ld,<br />
˘ <br />
hbr <strong>to</strong>ld. <br />
˘ h. dt talked. <br />
¯ ֓a֒lm when it has hamza above it<br />
can has four nouns (ibn Abd Allah Ibn Malik 1984), such as in Table 2.26.<br />
<br />
<br />
<br />
֓a֒lmtu ֒mrāan h˘ aālidāan tlmydāan ¯<br />
Table 2.26: In<strong>for</strong>med and showed<br />
<strong>Arabic</strong><br />
<strong>English</strong> Translation I in<strong>for</strong>med Omer that Khalid (is) a student.<br />
2.4.2 The Verbal Sentence<br />
The verbal sentence is the second type of sentence in <strong>Arabic</strong>. It contains a verb and one<br />
or more participants depending on the verb transitivity. The default word order in <strong>Arabic</strong><br />
is <strong>to</strong> begin with a verb: verb(V), subject(S) and object(O), such as in Table 2.27<br />
24
Table 2.27: verb(V), subject(S) and object(O)<br />
<strong>Arabic</strong> <br />
Gloss drank Khalid the-milk<br />
<strong>English</strong> Translation Khalid drank the milk.<br />
ˇsrb h ˘ āld āllbn<br />
2.5. SUMMARY<br />
Another possible word order is <strong>to</strong> start with the subject, i.e. SVO, such as in Table 2.28<br />
Table 2.28: subject(S), verb(V) and object(O)<br />
<strong>Arabic</strong> hāld ˇsrb āllbn<br />
˘<br />
Gloss Khalid drank the-milk<br />
<strong>English</strong> Translation Khalid drank the milk.<br />
Another possible, but more restricted, word order is VOS, such as in Table 2.29<br />
Table 2.29: verb(V), object(O) and subject(S)<br />
<strong>Arabic</strong> <br />
ˇsrbh h ˘ āld<br />
Gloss drank it Khalid<br />
<strong>English</strong> Translation Khalid drank it<br />
The OVS word order is perfectly acceptable in Classical <strong>Arabic</strong> but no longer occurs in<br />
MSA (Attia 2008).<br />
2.4.3 Clause<br />
A clause in <strong>Arabic</strong> may be simple or complex. A complex clause is <strong>for</strong>med by conjoining<br />
two simple clauses by subordinating conjunction, such as in Table 2.30.<br />
Table 2.30: Two simple clauses by subordinating conjunction<br />
<strong>Arabic</strong> <br />
<strong>English</strong> Khalid drank the milk be<strong>for</strong>e he went <strong>to</strong> school.<br />
2.5 Summary<br />
ˇsrb hāld āllbn qbl ֓an ydhb ֓ilā ālmdrsh<br />
˘ ¯<br />
We have shown that <strong>Arabic</strong> is a language of increasing importance in the modern world.<br />
As a language it is fundamentally different from European languages and has many<br />
25
2.5. SUMMARY<br />
unique features. Considerations such as its derivational structure, its distinction of gender<br />
<strong>for</strong>ms, and its numerous sentence orders present a challenge <strong>for</strong> au<strong>to</strong>matic <strong>machine</strong> trans-<br />
lation. We discussed an inven<strong>to</strong>ry of the language including examples. In order <strong>to</strong> deal<br />
with these challenges it is important that a <strong>machine</strong> transla<strong>to</strong>r understands the structure<br />
of the source language. We aim <strong>to</strong> use this knowledge <strong>to</strong> build the UniArab transla<strong>to</strong>r. In<br />
order <strong>to</strong> provide a standards-based, cross-plat<strong>for</strong>m solution, we will make use of XML<br />
<strong>for</strong> data representation and build the system using Java.<br />
26
3<br />
Role and Reference Grammar (RRG)<br />
This chapter is based largely on material taken from (Van Valin and LaPolla 1997), which<br />
explains the theory behind Role and Reference Grammar. Role and Reference Grammar<br />
(RRG) is a model of grammar developed by William Foley and Robert Van Valin, Jr. in<br />
the 1980s, which incorporates many of the points of view of current functional grammar<br />
theories. We have chosen RRG because it has been shown <strong>to</strong> be flexable and universal<br />
in the creation of parsers <strong>for</strong> <strong>English</strong> (Van Valin and LaPolla 1997). We wish <strong>to</strong> apply<br />
this success <strong>to</strong> MT in order <strong>to</strong> discover its importance and demonstate its viability with<br />
accuracy of translation.<br />
In RRG, the description of a sentence in a particular language is <strong>for</strong>mulated in terms<br />
of its logical structure and communicative functions, and the grammatical procedures<br />
that are available in the language <strong>for</strong> the expression of these meanings. The main fea-<br />
tures of RRG are the use of lexical decomposition, based upon predicate semantics, an<br />
analysis of clause structure and the use of a set of thematic roles organized in<strong>to</strong> a hier-<br />
27
3.1. ROLE AND REFERENCE GRAMMAR LINGUISTIC MODEL<br />
archy in which the highest-ranking roles are ‘Ac<strong>to</strong>r’ (<strong>for</strong> the most active participant) and<br />
‘Undergoer’ (Van Valin 1993). RRG takes language <strong>to</strong> be a system of communicative<br />
social action, and accordingly, analysing the communicative functions of grammatical<br />
structures plays a vital role in grammatical description and theory from this perspective.<br />
Role and Reference Grammar posits algorithms <strong>to</strong> go from syntax <strong>to</strong> semantics and se-<br />
mantics <strong>to</strong> syntax. The main contribution is the use of parsing templates and the notion<br />
of the core. A core consists of a predicate (generally a verb) and (normally) a number of<br />
arguments. It must have a predicate. Everything else is built around one or more cores.<br />
Simple sentences contain a single core; complex sentences contain several cores. The<br />
fact that RRG focuses on cores, means that the semantics is relatively easy <strong>to</strong> extract<br />
from a parse tree. You just have <strong>to</strong> look <strong>for</strong> the (PRED), and (ARG) branches of the core<br />
<strong>to</strong> obtain the predicate (PRED) and the arguments (ARG). Who did what <strong>to</strong> whom will<br />
depend either on the ordering of the ARG branches (in the case of <strong>English</strong>), or on their<br />
cases, or both.<br />
3.1 Role and Reference Grammar linguistic model<br />
Role and Reference Grammar is a model which presupposes a direct mapping between<br />
the semantic representation of a sentence and its syntactic representation; there are no<br />
intermediate levels of representation (Van Valin 2007). The general view of RRG is<br />
presented in Figure 3.1.<br />
RRG creates a relationship between syntax and semantics and can account <strong>for</strong> how se-<br />
mantic representations are mapped in<strong>to</strong> syntactic representations. RRG also accounts<br />
<strong>for</strong> the very different process of mapping syntactic representations <strong>to</strong> semantic repre-<br />
sentations. Be<strong>for</strong>e developing the linking algorithms that govern these mappings, it is<br />
necessary <strong>to</strong> first introduce a general principle constraining these algorithms (Van Valin<br />
and LaPolla 1997). Of the two directions, syntactic representation <strong>to</strong> semantic represen-<br />
28
3.1. ROLE AND REFERENCE GRAMMAR LINGUISTIC MODEL<br />
Figure 3.1: Layout of Role and Reference Grammar<br />
tation is the more difficult since it involves interpreting the morphosyntactic <strong>for</strong>m of a<br />
sentence and inferring the semantic functions of the sentence from it. Accordingly, the<br />
linking rules must refer <strong>to</strong> the morphosyntactic features of the sentence. The question<br />
remains why a grammar should deal with linking from syntax <strong>to</strong> semantics at all. Simply<br />
specifying the possible realizations of a particular semantic representation should suffice.<br />
They refute this using the argument that theories of linguistic structure should be directly<br />
relatable <strong>to</strong> testable theories of language production and comprehension (Van Valin and<br />
LaPolla 1997). One of our hypotheses it that RRG is very suitable <strong>for</strong> <strong>machine</strong> translation<br />
of <strong>Arabic</strong> via an interlingua bridge. It is a mono strata-theory, positing only one level of<br />
syntactic representation, the actual <strong>for</strong>m of the sentence. The RRG Linking algorithm can<br />
work in the both directions from syntactic representation <strong>to</strong> semantic representation or<br />
vice versa. UniArab will fulfil this role. In RRG, semantic decomposition of predicates<br />
and their semantic argument structures are represented as logical structures. The lexicon<br />
in RRG takes the position that lexical entries <strong>for</strong> verbs should contain unique in<strong>for</strong>mation<br />
only, with as much in<strong>for</strong>mation as possible derived from general lexical rules. We briefly<br />
illustrate the active voice linking in (3.1) where (3.1a) is a subject, verb, object (SVO)<br />
clause and (3.1b) is the verb, subject, object (VSO) equivalent.<br />
29
(3.1)<br />
3.1. ROLE AND REFERENCE GRAMMAR LINGUISTIC MODEL<br />
a. <br />
zyd r֓ay ֒mr Zaid saw Omar<br />
<br />
zyd MsgNOM see.past ֒mr - MsgNOM<br />
b. <br />
r֓aā zyd ֒mr Saw Zaid Omar<br />
see.past <br />
zyd MsgNOM ֒mr MsgNOM<br />
<strong>Arabic</strong> allows variation in clause word order. The active-voice linkings, those in the<br />
sentence in (3.1a)-(3.1b), are illustrated in figure 3.2.<br />
Figure 3.2: <strong>Arabic</strong> sentence types; verb subject object or subject verb object (<strong>for</strong> gloss<br />
please see example 3.1)<br />
The first (leftmost) argument of ‘see’ in the logical structure is the ac<strong>to</strong>r, the second the<br />
undergoer, following the RRG Ac<strong>to</strong>r-Undergoer Hierarchy. Since <strong>Arabic</strong> is an accusative<br />
language and r֓aā ‘see’ is a regular verb, the ac<strong>to</strong>r will receive nominative case and<br />
the undergoer accusative case. On the other hand, in <strong>Arabic</strong> we can start the sentence<br />
with verb first as shown in the example in (3.1b). The only changes in the clause are the<br />
<strong>for</strong>m of the verb and the <strong>for</strong>m of the ac<strong>to</strong>r NP; the arrangement of the arguments has not<br />
changed in the logical structure.<br />
30
3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />
3.2 Formal representation of layered structure of the clause<br />
Having introduced the fundamental units of clause structure, we need <strong>to</strong> have an explicit<br />
representation of them. We will present the non-universal features of the layered structure<br />
of the clause (LSC).<br />
3.2.1 Representing the universal aspects of the layered structure of the clause<br />
Figure 3.3: Formal representation of the layered structure of the clause<br />
To represent the nucleus, core, periphery and clause, we will use a type of tree diagram<br />
which differs substantially from the constituent-structure trees discussed earlier. The ab-<br />
stract schema of the layered structure of the clause can be represented as in Figure 3.3.<br />
The clause consists of the core with its arguments, and then the nucleus, which subsumes<br />
the predicate. At the very bot<strong>to</strong>m are the actual syntactic categories which realize these<br />
units. Notice that there is no VP in the tree, <strong>for</strong> it is not a concept that plays a direct role<br />
in this conception of clause structure. The periphery is represented on the margin, and<br />
the arrow there indicates that it is an adjunct; that is, it is an optional modifier of the core<br />
(Van Valin and LaPolla 1997).<br />
31
3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />
Constituent structure representations of sentences in free-word-order and head-marking<br />
languages are unrevealing, because they fail <strong>to</strong> capture what is common <strong>to</strong> clauses in the<br />
different language types. The layered approach <strong>to</strong> clause structure does not suffer <strong>for</strong>m<br />
the same shortcomings. For a language like <strong>Arabic</strong>, the line linking the head nouns with<br />
their determiners will be discussed in the section on noun phrase structure 3.3 below.<br />
3.2.2 Layered structure of the clause (LSC)<br />
In the simplex <strong>English</strong> sentences, James ate the sandwich in the class, James ate the sand-<br />
wich is the core (with ate the nucleus and James and the sandwich the core arguments);<br />
and in the class is in the periphery. The first division in the clause is between a core and<br />
a periphery, and within the core a distinction is made between the nucleus (containing<br />
the predicating element, normally a verb) and its core arguments (NPs and PPs which are<br />
arguments of the predicate in the nucleus). Core arguments are those arguments which<br />
are part of the semantic representation of the verb (Van Valin and LaPolla 1997). The<br />
relationships between the semantic and syntacts units are summarized in Table 3.1<br />
Table 3.1: Relationships between the semantic and syntactic units<br />
Semantic element (s) Syntactic unit<br />
Predicate Nucleus<br />
Argument in semantic representation of predicate Core argument<br />
Non arguments Periphery<br />
Predicate + arguments Core<br />
Predicate + arguments + Non- arguments Clause (= core + periphery)<br />
3.2.3 Non-universal aspects of the layered structure of the clause<br />
An initial phrase cannot be in the precore slot, because there is a WH–word (<strong>for</strong> example,<br />
<strong>for</strong> <strong>English</strong> who, where, what etc.) in the precore slot in the sentence; hence the position<br />
of the initial phrase is distinct from the precore slot. This position, which will be termed<br />
32
3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />
the left-detached position, is outside of the clause but within the sentence. An example<br />
from <strong>English</strong> with all of these elements is given in Figure 3.4.<br />
Figure 3.4: <strong>English</strong> Sentence with precore slot and left-detached position<br />
The opera<strong>to</strong>r projection in Figure 3.5 may be combined with what we will call the ‘con-<br />
stituent projection’ in Figure3.8 <strong>to</strong> yield a more complete picture of the clause, as in<br />
Figure 3.6; the periphery is omitted, since it can occur in a number of different positions.<br />
What we have here is two projections of the clause, one of which contains the predicate<br />
and its arguments (the constituent projection), while the other contains the opera<strong>to</strong>rs (the<br />
opera<strong>to</strong>r projection) (Van Valin and LaPolla 1997).<br />
They are both linked through the predicate, which may be a verb, NP, AdjP or PP, be-<br />
cause it is the one crucial element common <strong>to</strong> both. The opera<strong>to</strong>r projection mirrors the<br />
constituent projection in terms of layering; hence ‘nucleus’ in the opera<strong>to</strong>r projection<br />
corresponds <strong>to</strong> ‘nucleus’ in the constituent projection, and so on. The multiple nucleus,<br />
core and clause nodes represent each of the individual opera<strong>to</strong>rs at that level; the num-<br />
ber of multiple nodes corresponds <strong>to</strong> the number of opera<strong>to</strong>rs at that level present in the<br />
sentence. If there are no opera<strong>to</strong>rs at a given level, a bare node will be given. As the<br />
‘bare skele<strong>to</strong>n’ of the layered structure of the clause on the right makes clear, the two<br />
33
3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />
Figure 3.5: Opera<strong>to</strong>r projection in LSC<br />
projections are indeed mirror images of each other, and this will become particularly im-<br />
portant in representing the structure of complex sentences. A more complete picture of<br />
the clause in <strong>Arabic</strong>, is given in Figure 3.7. Please note that the sentences in Figure 3.7<br />
should be read from right <strong>to</strong> left.<br />
One of the major motivations <strong>for</strong> this scheme is that opera<strong>to</strong>rs virtually always occur<br />
in the same linear sequence with respect <strong>to</strong> the predicating element. When an ordering<br />
relationship can be established among opera<strong>to</strong>rs, they are always ordered in the same<br />
way cross-linguistically, such that their linear order reflects their scope. This is a very<br />
significant point. Opera<strong>to</strong>rs are ordered with respect <strong>to</strong> each other in terms of the scope<br />
principle discussed earlier, with the verb or other predicating element in the nucleus<br />
34
3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />
Figure 3.6: LSC with constituent and opera<strong>to</strong>r projections<br />
as the anchorpoint, and thus the ordering restrictions on the morphemes expressing the<br />
opera<strong>to</strong>rs are universal. For a technical discussion of the meaning of the various opera<strong>to</strong>rs<br />
in the LSC (Van Valin and LaPolla 1997).<br />
35
3.3 Noun phrase structure<br />
Figure 3.7: <strong>Arabic</strong> LSC<br />
3.3. NOUN PHRASE STRUCTURE<br />
Noun phrases refer, while clauses predicate, and yet there are striking parallels between<br />
the structure of the two which have long been noted. For example, both can be said <strong>to</strong><br />
have arguments; while this is obvious in the case of verbs in clauses, it is also clear that<br />
relational nouns like father, friend and sister can take what could be analyzed as argu-<br />
ments, e.g. father of James / James’s father, a friend of Khalid / Khalid’s friend and the<br />
other sister of Sarah / Sarah’s other sister. Clauses sometimes have clauses within them<br />
as arguments, as in Zaid believed that pollution isn’t a problem, and the same is true of<br />
NPs, e.g. Zaid’s belief that pollution isn’t a problem. Given these parallels, it would be<br />
36
3.3. NOUN PHRASE STRUCTURE<br />
appropriate <strong>to</strong> say that at least some nouns take arguments analogous <strong>to</strong> verbs taking ar-<br />
guments, and there<strong>for</strong>e it is also appropriate <strong>to</strong> posit a layered structure <strong>for</strong> NPs (LSNP)<br />
similar but not identical <strong>to</strong> that <strong>for</strong> clauses. Relating <strong>to</strong> the fundamental functional dif-<br />
ference between verbs and nouns, is that the nominal nucleus NUCN dominates a REF<br />
(<strong>for</strong> ‘reference’) node, indicating that the unit in question refers, in contrast <strong>to</strong> the PRED<br />
(<strong>for</strong> ‘predicate’) node which appears in the nucleus of a clause. The word ‘of’ is non-<br />
predicative in this construction, because it does not license the argument; moreover, it is<br />
semantically empty, as it can occur with argument NPs having many different semantic<br />
functions (Van Valin and LaPolla 1997). Consider the range of semantic functions which<br />
the of-NPs have in the following examples.<br />
(2.2)<br />
a. the attack of the killer bees Agent<br />
b. the gift of a new car Theme<br />
c. the destruction of the city Patient<br />
d. the leg of the table Possessor<br />
e. the resupplying of the troops (with ammunition) Recipient<br />
(Nunes 1993) shows that NPs have only a single direct core argument, and it is marked<br />
by of. This is consistent with the point made above that of does not mark any particular<br />
semantic relation, in much the same way that the direct grammatical functions, subject<br />
and direct object, are not restricted <strong>to</strong> particular semantic functions. Accordingly, the of-<br />
marked NP counts as the single direct syntactic argument of the nominal nucleus in the<br />
core of the NP. Predicative adpositions, by contrast, have well-defined semantic content,<br />
like other predicates.<br />
37
3.4. LEXICAL REPRESENTATIONS FOR VERBS<br />
An important feature of the layered structure of the clause is the differential treatment<br />
given <strong>to</strong> opera<strong>to</strong>rs like tense, aspect and illocutionary <strong>for</strong>ce, and the same contrast is a<br />
vital part of the layered structure of the noun phrase. NP opera<strong>to</strong>rs include determin-<br />
ers (articles, demonstratives, deictics), quantifiers, negation and adjectival and nominal<br />
modifiers (Van Valin and LaPolla 1997).<br />
3.3.1 NP headed<br />
Pronouns can be classified in<strong>to</strong> a number of subtypes: personal pronouns, including pos-<br />
sessive pronouns (PRO), e.g. I liked her book; relative pronouns (PRO REL), e.g. the<br />
book which I bought; demonstrative pronouns (PRO DEM), e.g. That pleased Mary; WH-<br />
pronouns (PRO wh ), e.g. who did Fred see?; and expletive pronouns (PRO EXP ), e.g. it<br />
rained.<br />
3.4 Lexical representations <strong>for</strong> verbs<br />
These distinctions among the four basic Aktionsart types may be represented <strong>for</strong>mally<br />
as in Table 3.2. The term Aktionsart refers <strong>to</strong> the means of a capturing the distinctions<br />
between basic states of affairs, or events, of individual verbs. These representations<br />
are called logical structures. Following the conventions of <strong>for</strong>mal semantics, constants<br />
(which are normally predicates) are presented in boldface followed by a prime, whereas<br />
variable elements are presented in normal typeface. The elements in boldface and prime<br />
are part of the vocabulary of the semantic metalanguage used in the decomposition; they<br />
are not words from any particular human language.<br />
Table 3.2: Lexical representations <strong>for</strong> the basic Aktionsart classes<br />
Verb class Logical structure<br />
State predicate’ (x) or (x, y)<br />
Activity do’ (x, [predicate’(x) or (x, y)])<br />
Achievement INGR predicate’ (x) or (x, y)<br />
Accomplishment BECOME predicate’ (x) or (x, y)<br />
38
3.4. LEXICAL REPRESENTATIONS FOR VERBS<br />
Hence the same representations are used <strong>for</strong> all languages (where appropriate), e.g. the<br />
logical structure <strong>for</strong> <strong>Arabic</strong> and <strong>English</strong> ‘die’ (intransitive) would be BECOME dead’(x).<br />
The elements in all capitals, INGR and BECOME, are modifiers of the predicate in the<br />
logical structure; their function will be explained below. The variables are filled by lexi-<br />
cal items from the language being analysed; <strong>for</strong> example, the <strong>English</strong> sentence The dog<br />
died would have the logical structure BECOME dead’ (dog), while the correspond-<br />
ing <strong>Arabic</strong> sentence ālklb māt . “The dog died” would have the logical<br />
structure BECOME dead’ (ālklb) should be this sentence start with the verb<br />
māt ālklb. States are represented as simple predicates, e.g. broken’ (x),<br />
be-at’(x,y), and see’(x,y). There is no special <strong>for</strong>mal indica<strong>to</strong>r that a predicate<br />
is stative.<br />
The logical structure, be’(x,[pred’] ) is <strong>for</strong> identificational constructions, e.g.<br />
Omar is a student, and attributive constructions, such as The watch is broken require a<br />
different logical structure. In this logical structure the second argument is the attribute or<br />
identificational NP, e.g. be’ (Ayesha, [tall’]), be’(Omar,[student’]).<br />
The primary criteria <strong>for</strong> distinguishing between attributive constructions and result state<br />
constructions is whether the attribute is inherent, e.g. Coal is black (be’ (coal,<br />
[black’])), or whether it is the result of some kind of process, e.g. The fire black-<br />
ened the wood (... BECOME black’(wood)) (Van Valin and LaPolla 1997).<br />
3.4.1 Agents, effec<strong>to</strong>rs, instruments and <strong>for</strong>ces<br />
In ‘Zaid is cutting the bread with a knife’, an EFFECTOR, typically human, manipulates<br />
a knife and brings it in<strong>to</strong> contact with the bread, whereupon the interaction of the knife<br />
with the bread brings about the result that the bread becomes cut. This may be represented<br />
as in (3.3). (The main CLAUSE in the logical structure is italicized.)<br />
39
(3.3)<br />
[do’(Zaid, [use’(Zaid, knife)])] CAUSE<br />
[[do’(knife, [cut’(knife, bread)])]CAUSE<br />
[BECOME cut’(bread)]]<br />
3.4. LEXICAL REPRESENTATIONS FOR VERBS<br />
The causing event in (3.3) is complex, and the INSTRUMENT argument appears three<br />
times in the logical structure: as the IMPLEMENT of use’ and as the EFFECTOR of<br />
do’(x,[cut’(x,y)]). It is possible, if the first argument of the highest do’ were left unspec-<br />
ified, <strong>to</strong> say The knife cut the bread, with the INSTRUMENT knife as ac<strong>to</strong>r.<br />
3.4.2 change of state verb<br />
A change of state verb may be punctual in one language and non-punctual in another. A<br />
good example of this cross-linguistic variation is <strong>English</strong> ‘die’ and <strong>Arabic</strong>. Both have the<br />
result that the subject is dead. Accordingly, it is possible <strong>to</strong> say in <strong>English</strong><br />
He died quickly , He died slowly and He died suddenly. In <strong>Arabic</strong> we can say as (3.4),<br />
also, it is possible <strong>to</strong> say in <strong>Arabic</strong> Hence the logical structure <strong>for</strong> <strong>English</strong> and <strong>Arabic</strong><br />
‘die’ would be [ BECOME dead’(x)], an accomplishment.<br />
(3.4)<br />
(a) māt sry֒ā<br />
(b) <br />
He died quickly.<br />
māt bbt.֓y<br />
He died slowly.<br />
(c) māt fˇgāh<br />
He died suddenly.<br />
40
3.5. WHY WE USE RRG AS THE LINGUISTIC MODEL<br />
3.5 Why we use RRG as the linguistic model<br />
A reader might ask the question, why use Role and Reference Grammar as the basis of<br />
our <strong>machine</strong> transla<strong>to</strong>r? More than one reason prompts us <strong>to</strong> choose RRG. The most<br />
important one is that RRG is a new linguistic method and there is no research using the<br />
Role and Reference Grammar linguistic model as a basis <strong>for</strong> <strong>machine</strong> translation until<br />
now. We would like <strong>to</strong> discover this area using the RRG rules and techniques.<br />
What distinguishes the RRG conception is the conviction that grammatical structure can<br />
only be unders<strong>to</strong>od with reference <strong>to</strong> its semantic and communicative functions. Syntax<br />
is not au<strong>to</strong>nomous. In terms of the abstract paradigmatic and syntagmatic relations that<br />
define a structural system, RRG is concerned not only with relations of co-occurrence and<br />
combination in strictly <strong>for</strong>mal terms but also with semantic and pragmatic co-occurrence<br />
and combina<strong>to</strong>ry relations. According <strong>to</strong> Van Valin and LaPolla (1997) RRG takes lan-<br />
guage <strong>to</strong> be a system of communicative social action, and accordingly, analysing the<br />
communicative functions of grammatical structures plays a vital role in grammatical de-<br />
scription and theory from this perspective language is a system, and grammar is a system<br />
in the traditional structuralist sense.<br />
We claim that RRG is very suitable <strong>for</strong> <strong>machine</strong> translation of <strong>Arabic</strong> via an Interlingua<br />
bridge implementation model. RRG is a mono strata-theory, positing only one level of<br />
syntactic representation, the actual <strong>for</strong>m of the sentence and its linking algorithm can<br />
work in both directions from syntactic representation <strong>to</strong> semantic representation, or vice<br />
versa. In RRG, semantic decomposition of predicates and their semantic argument struc-<br />
tures are represented as logical structures. The lexicon in RRG takes the position that<br />
lexical entries <strong>for</strong> verbs should contain unique in<strong>for</strong>mation only, with as much in<strong>for</strong>-<br />
mation as possible derived from general lexical rules. The main features of RRG are<br />
the use of lexical decomposition, based upon predicate semantics, an analysis of clause<br />
41
3.5. WHY WE USE RRG AS THE LINGUISTIC MODEL<br />
structure and the use of a set of thematic roles organized in<strong>to</strong> a hierarchy in which the<br />
highest-ranking roles are ‘Ac<strong>to</strong>r’ (<strong>for</strong> the most active participant) and ‘Undergoer’.<br />
3.5.1 RRG representing the universal aspects of the layered structure of the clause<br />
A sentence in <strong>English</strong> is NP VP, but this is not valid in <strong>Arabic</strong> sentences. There is no<br />
copula (verb <strong>to</strong> be) in the <strong>Arabic</strong> language, this means some types of sentence in <strong>Arabic</strong><br />
may not contain any verb (nominal sentence). For example <br />
h ˘ āld t.ālb Khalid<br />
(is) a student; there is no ‘is’ in this sentence in <strong>Arabic</strong>. In RRG there is no VP in<br />
sentence structure. The abstract schema of the RRG layered structure of the clause can<br />
be represented as in figure3.8.<br />
Figure 3.8: The RRG representing the universal aspects of the layered structure of the<br />
clause (Van Valin and LaPolla 1997)<br />
The clause consists of the core with its arguments, and then the nucleus, which subsumes<br />
the predicate. At the very bot<strong>to</strong>m are the actual syntactic categories which realize these<br />
units. Notice that there is no VP in the tree, <strong>for</strong> it is not a concept that plays a direct role<br />
in this conception of clause structure in RRG.<br />
42
3.5. WHY WE USE RRG AS THE LINGUISTIC MODEL<br />
3.5.2 The lexical representation of verbs and their arguments<br />
The approach <strong>to</strong> the depiction of the lexical meaning of verbs which we will adopt is<br />
lexical decomposition, which involves paraphrasing verbs in terms of primitive elements<br />
in a well-defined semantic metalanguage. As a simple example of the mechanism of<br />
lexical decomposition, ‘kill’ can be paraphrased in<strong>to</strong> something like ‘cause <strong>to</strong> die’, and<br />
then ‘die’ can be broken down in<strong>to</strong> ‘become dead’, Thus the lexical representation<br />
of ‘kill’ would be something like ‘x causes [y become dead]’ (Van Valin and<br />
LaPolla 1997). A system of lexical representation should include a way of expressing the<br />
fact that the subject of ‘die’ and the object of ‘kill’ are the same argument semantically.<br />
There are many verbs pairs like this, and in many cases the relationship between them<br />
is overt. Examples include ‘sink’, as in ‘the boat sank’ and ‘the <strong>to</strong>rpedo sank the boat’,<br />
where boat is the subject of intransitive ‘sink’ and the object of transitive ‘sink’ (Van Valin<br />
and LaPolla 1997). Another example is the predicate ‘cool’, which can take three <strong>for</strong>ms,<br />
one adjectival and two verbal: ‘The soup is cool’, ‘the soup is cooling’ and ‘the wind<br />
cooled the soup’. Thus, there seems <strong>to</strong> be a pattern of intransitive verbs whose subjects<br />
are identical <strong>to</strong> the objects of their transitive counterparts. There are cases, however,<br />
when the intransitive-transitive alternates do not have the same lexical <strong>for</strong>m, as in ‘die’<br />
and ‘kill’, or ‘receive’ and ‘give’. An adequate theory of lexical representation should<br />
be able <strong>to</strong> capture these relationships, and lexical decomposition provides a promising<br />
method <strong>for</strong> doing it. There are many theories of lexical decomposition, which differ<br />
in terms of how fine-grained they are. It is necessary <strong>to</strong> find the right level of detail,<br />
one which allows the expression of certain important generalizations but which also has<br />
representations whose differences have morphosyntactic consequences. Thus, arriving<br />
at a decompositional system is a compromise between the demands of semantics (make<br />
all necessary distinctions relevant <strong>to</strong> meaning) and those of syntax (make syntactically<br />
43
3.6. SUMMARY<br />
relevant distinctions that permit the expression of significant generalizations) (Van Valin<br />
and LaPolla 1997).<br />
3.6 Summary<br />
RRG describes mainly a sentence of a specific language in terms of:<br />
1) logical structure;<br />
2) grammatical procedures.<br />
We use RRG <strong>to</strong> model <strong>Arabic</strong>, because there are certain cases where the standard NP VP<br />
categorisation does not apply due <strong>to</strong> the absence of a copula verb in the language. Since<br />
RRG does not structure sentences based around a VP, it is more suited <strong>to</strong> representing<br />
such sentences.<br />
The main features of RRG are the use of lexical decomposition, based upon predicate<br />
semantics. The RRG model creates a relationship between syntax and semantics and<br />
can account <strong>for</strong> how semantic representations are mapped in<strong>to</strong> syntactic representations.<br />
RRG also accounts <strong>for</strong> the very different process of mapping syntactic representations <strong>to</strong><br />
semantic representations.<br />
The division in the clause is between a core and a periphery The clause consists of the<br />
core with its arguments, and then the nucleus, which subsumes the predicate. The core<br />
arguments are those which are part of the semantic representation of the verb. The pe-<br />
riphery is represented on the margin, and the arrow there indicates that it is an adjunct;<br />
that is, it is an optional modifier of the core.<br />
There are languages in which opera<strong>to</strong>rs occur on both sides of the nucleus; <strong>for</strong> example,<br />
in <strong>Arabic</strong>, the imperfect tense <br />
ālf֒l ālmd. ār֒ marker is a prefix, while the<br />
perfect tense ālf֒l ālmād. y marker is a suffix (Ryding 2007). In such cases<br />
44
3.6. SUMMARY<br />
there will be more complex language-specific linear precedence rules <strong>for</strong> opera<strong>to</strong>rs.<br />
45
4<br />
Machine translation strategies<br />
In this chapter, we introduce background in<strong>for</strong>mation about Machine Translation. We<br />
discuss the computational techniques, basic strategies, linguistic aspects and the gener-<br />
ation problem. Much of the background in<strong>for</strong>mation is summarised from Hutchins and<br />
Somers (1992).<br />
Natural language processing (NLP) can be thought of as a subfield of artificial intelli-<br />
gence. It refers <strong>to</strong> understanding and au<strong>to</strong>matic generation of natural human languages.<br />
Machine translation (MT) is a part of computational linguistics and refers <strong>to</strong> comput-<br />
erised systems that can translate from one natural language <strong>to</strong> another. Hence, MT uses<br />
many ideas, methods and techniques from these related fields and has also built up a body<br />
of techniques which can, in turn, be applied in other areas of computer-based language<br />
processing.<br />
46
4.1. ADVANTAGES OF MACHINE TRANSLATION<br />
Modularity has changed as MT systems have developed. In transfer systems, lexical<br />
and structural transfer were sometimes separated. In many direct translation systems,<br />
analysis, transfer and generation are often mixed <strong>to</strong>gether and were not clearly distinct.<br />
As the area has matured, modularity has become an important aspect of MT systems,<br />
allowing different aspects <strong>to</strong> be developed independently.<br />
4.1 Advantages of <strong>machine</strong> translation<br />
Some of the advantages of <strong>machine</strong> translations are as follows:<br />
• Machine translation is quicker than human translation.<br />
• It ensures consistency. There is no concern that a transla<strong>to</strong>r might take <strong>to</strong>o much<br />
creative license with a translation or <strong>for</strong>get how a particular word was translated in<br />
earlier pages. MT will translate a particular word in the same way. However, the<br />
downside is that will exhibit the same errors over and over again.<br />
• It gives a neutral approach <strong>to</strong> translation without introducing bias, which can happen<br />
with human transla<strong>to</strong>rs.<br />
• Machine translation is considerably cheaper. It is a one time cost; the cost of the<br />
<strong>to</strong>ol and its installation.<br />
4.2 Computational techniques in MT<br />
Computational processing allows <strong>for</strong> the analysis and processing of large amounts of<br />
data. Be<strong>for</strong>e looking at the computational aspects of MT, we introduce some basic con-<br />
cepts. Machine translation can take advantage of one of the basic concepts in computing.<br />
Since data and programs are separate, it is possible <strong>to</strong> build a program that functions with<br />
different types of data. In the case of MT, this means that the algorithms <strong>for</strong> translation,<br />
47
4.2. COMPUTATIONAL TECHNIQUES IN MT<br />
and the data used <strong>for</strong> doing the actual translation can be developed separately. In reality<br />
this is a little simplistic, but there are certain examples of MT systems that operate in a<br />
similar manner <strong>for</strong> different sets of data like dictionaries and grammar rules (Hutchins<br />
and Somers 1992).<br />
4.2.1 System design<br />
As in standard software engineering, recent trends are <strong>to</strong>wards modular and incremental<br />
system design. Whereas previously, systems would be built in a monolithic structure,<br />
with some debugging access in<strong>to</strong> the system, now we build systems up in stages, com-<br />
pletely defining and testing each stage, be<strong>for</strong>e incorporating it in<strong>to</strong> the overall system.<br />
This method has revolutionarised software engineering and enabled much more effective<br />
collaborative design, as well as the integration of other people’s work in any design.<br />
4.2.2 Interactive systems<br />
Interactivity is a key aspect of computer systems. MT systems can take advantage of<br />
interactivity <strong>to</strong> achieve higher quality results. It is possible <strong>for</strong> an MT system <strong>to</strong> ask the<br />
user <strong>to</strong> select from a set of possible solutions. It is also possible <strong>to</strong> extend the lexicon<br />
through user input at the time of translation. The system might flag unfamiliar words,<br />
which the user can then categorise <strong>for</strong> inclusion in the lexicon. However, intereactivity<br />
and relying on user input can have disadvantages. For example, should the user be relied<br />
upon <strong>to</strong> be correct in his input? Is he fully aware of the linguistic properties of the words?<br />
Furthermore, as more user input is required, the benefits of MT over human translation<br />
become less significant.<br />
48
4.2.3 Lexical databases<br />
4.2. COMPUTATIONAL TECHNIQUES IN MT<br />
A key component of any rule-based MT system is its lexical resources; the in<strong>for</strong>mation<br />
associated with individual words. The field of computational lexicography is concerned<br />
with creating and maintaining computerised dictionaries. In practice, rule-based MT<br />
systems can often have different dictionaries, some containing the core entries, and others<br />
containing specialised vocabulary. An MT lexicon is different from a standard dictionary,<br />
and so is typically concentrated on some linguistically homogeneous set of words, e.g.<br />
abstract nouns, intransitive verbs, or the terminology of a specialist field. It is a good<br />
investment <strong>to</strong> develop <strong>to</strong>ols which aid lexicographers <strong>to</strong> expand the lexicon.<br />
4.2.4 Tokens and <strong>to</strong>kenization<br />
The term “<strong>to</strong>ken” refers <strong>to</strong> an abstraction <strong>for</strong> the smallest unit in a text that is considered<br />
when describing the syntax of a language. A process of <strong>to</strong>kenization can be used <strong>to</strong> split<br />
the sentence in<strong>to</strong> word <strong>to</strong>kens. Although the following example is given as XML there<br />
are many ways <strong>to</strong> represent <strong>to</strong>kenized input. The sentence He went <strong>to</strong> the school. could<br />
be <strong>to</strong>kenised as follows:<br />
<br />
He<br />
went<br />
<strong>to</strong><br />
the<br />
school<br />
<br />
49
4.2.5 Syntactic analysis (Parsing)<br />
4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />
Syntactic analysis, or parsing, is a major component in a rule-based MT system. It is the<br />
process by which a sentence is dissected or analysed in<strong>to</strong> constituent parts, <strong>to</strong> determine<br />
grammatical structure. One of the key challenges in analysis is dealing with ambiguity.<br />
One approach is what is called depth-first parsing, in which each possible solution is pur-<br />
sued <strong>to</strong> its conclusion. Each time a solution is found <strong>to</strong> be wrong, the system backtracks<br />
and takes another route until it eventually finds the correct categorisation of a word. In<br />
breadth-first parsing, alternatives are evaluated in parallel, until each alternative is found<br />
<strong>to</strong> be wrong except the right one.<br />
4.3 Basic <strong>machine</strong> translation strategies<br />
Traditionally three different approaches <strong>to</strong> MT have been used: direct translation, inter-<br />
lingua translation and transfer based translation. A few new approaches have also been<br />
established. In this section we will discuss basic strategies of MT systems.<br />
4.3.1 Multilingual versus bilingual systems<br />
Bilingual systems translate between a single pair of languages; multilingual systems<br />
translate between more than two languages. Bilingual systems are uni-directional or<br />
bi-directional, they may be designed <strong>to</strong> translate from one language <strong>to</strong> another in one<br />
direction only, or they are able <strong>to</strong> translate from both members of a language pair. As a<br />
further modification we may differentiate between reversible bilingual systems and non-<br />
reversible systems. In a reversible bilingual system the process involved in the analysis<br />
of a language can be inverted without change <strong>for</strong> the generation of output in the same<br />
language.<br />
50
4.3.2 Direct translation<br />
4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />
Figure 4.1: Direct MT system<br />
Direct translation is the oldest approach <strong>to</strong> MT. The direct translation strategy passes each<br />
sentence text <strong>to</strong> be translated through a series of standard stages. If the MT system uses<br />
direct translation, this usually means that there is no syntactic analysis after the morpho-<br />
logical analysis <strong>for</strong> the source language. The translation is based on large dictionaries<br />
and word-by-word translation with some simple grammatical adjustments e.g. on word<br />
order and morphology. A direct translation model is shown in Figure 4.1. This strategy<br />
is no longer in significant use.<br />
51
4.3.3 Interlingua<br />
4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />
The Interlingua approach is <strong>to</strong> develop a universal language-representation <strong>for</strong> text. In<br />
effect, in Interlingua there is no transfer map, and the MT model thus has phases: anal-<br />
ysis and generation. In a standard multilingual system with X source languages and Y<br />
target languages, the transfer approach will involve XY transfer maps; moreover, we<br />
need X analysers and Y genera<strong>to</strong>rs. In the Interlingua approach, only X parsers and Y<br />
genera<strong>to</strong>rs are needed per language. Interlingua based MT is done via an intermediary<br />
(semantic) representation of the source language text. Interlingua is supposed <strong>to</strong> be a lan-<br />
guage independent representation from which translations can be generated <strong>to</strong> different<br />
target languages. Translation needs two phases: analysis from the source language <strong>to</strong> the<br />
Interlingua (universal language) and generation from the universal language <strong>to</strong> the target<br />
language. An Interlingua translation model with eight languages is shown in Figure 4.2.<br />
Figure 4.2: Interlingua1 model with eight languages pairs<br />
52
4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />
To apply our <strong>framework</strong> <strong>to</strong> other generation languages, we only need <strong>to</strong> change the gen-<br />
eration phases. The intermediate representation is independent of the target language,<br />
and this is the benefit of using an Interlingua approach, since analysis and generation are<br />
separate tasks which are implemented independently.<br />
4.3.4 Transfer systems<br />
Transfer systems are a middle course between direct and Interlingua MT strategies.<br />
Transfer systems divide translation in<strong>to</strong> steps which clearly differentiate source lan-<br />
guage and target language parts. In the transfer approach there is there<strong>for</strong>e no language-<br />
independent representation: the source language intermediate representation is specific<br />
<strong>to</strong> a particular language, as is the target language intermediate representation. There is no<br />
necessary equivalence between the source and target intermediate representations <strong>for</strong> the<br />
same language. In the transfer strategy a source language sentence is first parsed in<strong>to</strong> an<br />
internal representation. Thereafter a transfer is made at both lexical and structural levels<br />
in<strong>to</strong> equivalent structures of the target language. In the third stage a translation is gener-<br />
ated. Whereas the Interlingua approach requires complete resolution of all ambiguities<br />
in the source language text so that translation in<strong>to</strong> any other language is possible, in the<br />
transfer approach only those ambiguities inherent in the language in question are tackled.<br />
This approach is a development over direct translation and this was lexically driven. The<br />
level of transfer differs from system <strong>to</strong> system - the representation varies from only syn-<br />
tactic deep structure <strong>to</strong> syntactic-semantic interpret trees. A multilingual transfer model<br />
with eight languages pairs is presented in Figure 4.3.<br />
In comparison with the Interlingua system there are clear disadvantages in the transfer<br />
approach. The addition of a new language involves not only the two modules <strong>for</strong> analy-<br />
sis and generation, but also the addition of new transfer modules, the number of which<br />
53
4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />
may vary according <strong>to</strong> the number of languages in the existing system: in the case of<br />
a two-language system, a third language would require four new transfer modules. The<br />
addition of a fourth language would entail the development of six new transfer modules,<br />
and so on as illustrated in Table 4.1.<br />
Figure 4.3: Multilinguality transfer model with eight languages pairs<br />
Table 4.1: Modules required in an all-pairs multilingual transfer system<br />
Number of languages 2 3 4 5 ... 8 n<br />
Analysis models 2 3 4 5 ... 8 n<br />
Generation models 2 3 4 5 ... 8 n<br />
Transfer models 2 6 12 20 ... 56 n 2 − n<br />
Total models 6 12 20 30 ... 72 n 2 + n<br />
The number of transfer modules in a multilingual transfer system, <strong>for</strong> all combinations<br />
of n languages, is n 2 − n. Also needed are n analysis and n generation modules, which<br />
54
would also be needed <strong>for</strong> an interlingua system.<br />
4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />
As shown in Figure 4.4, the direct method has no modules <strong>for</strong> source language analysis<br />
or target language generation. In the interlingua method the source language is fully<br />
analyzed in<strong>to</strong> a language-independent representation from which the target language is<br />
generated. The transfer strategy can be viewed as falling between interlingua systems<br />
and direct systems.<br />
Figure 4.4: Difference between direct, transfer, and interlingua MT models, (Trujillo<br />
1999)<br />
Figure 4.4 shows language analysis up the left-hand side, and target-language generation<br />
down the right. The peak of the pyramid represents the theoretical interlingua represen-<br />
tation achieved by analysis and suitable <strong>for</strong> direct use by generation. However, the path<br />
<strong>to</strong> that interlingua is long. By cutting off the monolingual analysis at some point and<br />
entering in<strong>to</strong> a bilingual transfer phase, one can avoid the difficulties of a full analysis.<br />
The diagram is also intended <strong>to</strong> suggest that the more the text is analysed, the simpler the<br />
transfer will be, as depicted by the length of the line cutting across the pyramid. At the<br />
very bot<strong>to</strong>m, where there is smallest amount of analysis, and nearly all the work is done<br />
in transfer, as was the case with the early direct method systems.<br />
55
4.3.5 Statistical <strong>machine</strong> translation<br />
4.4. LINGUISTIC ASPECTS OF MT<br />
The ideas behind statistical <strong>machine</strong> translation come out of in<strong>for</strong>mation theory. Essen-<br />
tially, the document is translated on the probability p(e|a) that a string e in the target<br />
language (<strong>for</strong> example <strong>English</strong>) is the translation of a string a in the source language (<strong>for</strong><br />
example <strong>Arabic</strong>). As translation systems are not able <strong>to</strong> s<strong>to</strong>re all native strings and their<br />
translations, a document is typically translated sentence by sentence, but even this is not<br />
enough. We assign <strong>to</strong> every pair of strings (e|a) a number P(e|a), which we interpret as<br />
the probability that a transla<strong>to</strong>r, when presented with e, will produce a as its translation.<br />
You could imagine another program that takes a sentence a as input, and outputs every<br />
conceivable string e along with its P(e|a). This program would take a long time <strong>to</strong> run,<br />
even if you limit <strong>English</strong> translations <strong>to</strong> some arbitrary length. They seek the <strong>English</strong> sen-<br />
tence e that maximizes P(e|a) and minimizes time (Brown et al. 1993). To summarize,<br />
we compute P(e|a) by summing the probabilities of all alignments. For each alignment,<br />
we make two significant simplifying assumptions: Each <strong>English</strong> word is generated by<br />
exactly one <strong>Arabic</strong> word; and the generation of each <strong>English</strong> word is independent of the<br />
generation of all other <strong>English</strong> words in the sentence. This is clearly not true in theory.<br />
4.4 Linguistic aspects of MT<br />
In this section we will look more closely at the kinds of linguistic problems that MT has <strong>to</strong><br />
face and will discuss ways in which MT programs work around these problems. We will<br />
distinguish monolingual problems of morphology, lexical ambiguity, syntactic ambiguity,<br />
pragmatic aspects from bilingual problems of language contrast: lexical mismatches,<br />
structural divergence, typological differences.<br />
56
4.4.1 Non-Roman alphabet scripts<br />
4.4. LINGUISTIC ASPECTS OF MT<br />
Since computer technology developed mostly in <strong>English</strong>, other languages, particularly<br />
those with non-Roman alphabet have his<strong>to</strong>rically been seen as a special case and required<br />
new code sets <strong>to</strong> define character representations. Furthermore, not all languages with<br />
alphabetic scripts are written left-<strong>to</strong>-right, e.g. <strong>Arabic</strong> and Hebrew, so any input/output<br />
devices making this assumption will be useless <strong>for</strong> such languages. Be<strong>for</strong>e Unicode was<br />
standardised, there were different encoding systems <strong>for</strong> assigning this problem. Unicode<br />
provides a unique code <strong>for</strong> every character, no matter what the plat<strong>for</strong>m, the program and<br />
the language are. Appendix A provides the corresponding Unicode <strong>for</strong> each <strong>Arabic</strong> letter<br />
and describes the letters with their corresponding written shapes.<br />
4.4.2 Lexical ambiguity<br />
Category ambiguities or homographs are examples of lexical ambiguities which arise<br />
when there are potentially two or more ways in which a word can be analysed. More<br />
complex are lexical ambiguities, where one word can be interpreted in more than one<br />
way. Lexical ambiguities are of three basic types: category ambiguities, homographs and<br />
transfer (or translational) ambiguities.<br />
4.4.2.1 Category ambiguity<br />
The simplest type of lexical ambiguity is that of category ambiguity: a given word could<br />
be assigned <strong>to</strong> more than one grammatical or syntactic category (e.g. noun, verb or<br />
adjective) according <strong>to</strong> the context. There are several examples of this in <strong>English</strong>: light<br />
can be a noun, verb or adjective, also, control can be a noun or verb. In <strong>Arabic</strong> there<br />
are some words that can be in more than one category, <strong>for</strong> example ֒lā could be a<br />
preposition with meaning of “on”, or a verb with meaning of “raise”.<br />
57
4.4.2.2 Homograph<br />
4.4. LINGUISTIC ASPECTS OF MT<br />
The second type of lexical ambiguity occurs when a word can have two or more differ-<br />
ent meanings. Linguists distinguish between homographs, homophones and polysemes.<br />
Homographs are two (or more) ‘words’ with quite different meanings which have the<br />
same spelling: example, light (not dark or not heavy). Many <strong>Arabic</strong> words can have two<br />
or more overlapping meanings examples; ֓i֒lān could be announcement, advertisement,<br />
declaration or sign. Also, <br />
mrkz could be centre, position, rank or status.<br />
Moreover, mwq֒ could be position, rank, site or status. The direct approach has<br />
particular problems with homographs; the usual method of resolving homograph ambi-<br />
guities is <strong>to</strong> look at the closest words <strong>for</strong> clues.<br />
4.4.3 Syntactic ambiguity<br />
Syntactic ambiguity arises when there is more than one way of analysing the underlying<br />
structure of a sentence according <strong>to</strong> the grammar used in the system. Example, I know a<br />
man with a dog who has fleas, is ambiguous. It could be the man or the dog who has fleas.<br />
It is the syntax not the meaning of the words which is unclear. The classical example is<br />
He saw the girl with the telescope. For the purposes of this discussion, we represent these<br />
examples in the notation of a context-tree grammar rather than in RRG notation.<br />
The two trees in Figure 4.5 and Figure 4.6 represent the two different analyses in the<br />
sense of recording two different ‘parse his<strong>to</strong>ries’. In linguistic terms, they correspond <strong>to</strong><br />
the two readings of the sentence: one in which the PP is part of the NP (i.e. the girl has<br />
the telescope), and the other where the PP is the same level as the subject (i.e. the man<br />
has the telescope). For convenience, a bracketed notation <strong>for</strong> trees is sometimes used:<br />
the equivalents <strong>for</strong> the trees in Figure 4.5 and Figure 4.6 are shown in (4.1a) and (4.1b)<br />
respectively.<br />
58
Figure 4.5: NP rule (NP –> det n pp)<br />
Figure 4.6: PP is attached at a higher level<br />
4.4. LINGUISTIC ASPECTS OF MT<br />
59
(4.1a)<br />
4.5. CHALLENGES OF ARABIC TO ENGLISH MT<br />
S(NP(pron(he)), VP( v(saw),NP(det(the), N(girl),<br />
PP(prep(with),NP(det(the),n(telescope))))))<br />
(4.1b)<br />
S(NP(pron(he)), VP(v(saw),NP(det(the),n(girl)),<br />
PP(prep(with),NP(det(the),n(telescope)))))<br />
The tree structures required may of course be much more complex, not only in the sense<br />
of having more levels, or more branches at any given level, but also in that the labelling<br />
of the nodes (i.e. the ends of the branches) may be more in<strong>for</strong>mative.<br />
4.4.4 Structural differences<br />
Many relatively trivial syntactic differences between languages are well known, e.g. in<br />
<strong>Arabic</strong> most adjectives follow nouns but in <strong>English</strong> adjectives normally precede the nouns<br />
they qualify. Also, <strong>Arabic</strong> sentences have more than one structural type. The sentence<br />
which contains a verb, will have order of the <strong>for</strong>m verb(V), subject(S) and object(O) or<br />
verb(V), object(O) and subject(S). The only combinations that do not occur in <strong>Arabic</strong> are<br />
OSV and SOV (Attia 2004).<br />
4.5 Challenges of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> MT<br />
<strong>Arabic</strong> words can often be ambiguous due <strong>to</strong> the three-letter root system. These conso-<br />
nant roots interlock with patterns of vowels or consonants <strong>to</strong> words or word stems. This<br />
root system allows the language <strong>to</strong> evolve <strong>to</strong> cover a wide range of meanings. In some<br />
derivations one or more of the root letters is dropped, resulting in possible ambiguity.<br />
Examples of derived words from a three-letter-root are shown in Table 4.2.<br />
60
4.5. CHALLENGES OF ARABIC TO ENGLISH MT<br />
Table 4.2: Derived words from a three-letter-root in <strong>Arabic</strong><br />
<strong>Arabic</strong> Example POS<br />
<br />
kataba he wrote verb<br />
<br />
kātaba he corresponded verb<br />
kutiba it was written verb<br />
ktiāb book noun<br />
<br />
kutub books noun<br />
kātib writer; (adj) writing noun<br />
kutāb writers noun<br />
<br />
maktab desk; office noun<br />
makātib desks; offices noun<br />
<br />
maktabah library noun<br />
A root is defined in (Ryding 2007) as “a relatively invariable discontinuous bound mor-<br />
pheme, represented by two <strong>to</strong> five phonemes, typically three consonants in a certain order,<br />
which interlocks with a pattern <strong>to</strong> <strong>for</strong>m a stem and which has lexical meaning.”<br />
There are also two and four letter roots. They are discontinuous because the root letters<br />
can be interspersed with other letters in a pattern. However, the order of the root letters<br />
must be the same.<br />
A pattern is defined in (Ryding 2007) as “a bound and in many cases discontinuous mor-<br />
pheme consisting of one or more vowels and slots <strong>for</strong> root phonemes (radicals), which<br />
either alone or in combination with one <strong>to</strong> three derivational affixes, interlocks with a<br />
root <strong>to</strong> <strong>for</strong>m a stem, and which generally has grammatical meaning.”<br />
Patterns signify grammatical or language-internal in<strong>for</strong>mation, distinguishing word types<br />
and classes. These patterns can differentiate between nouns, verbs and adjectives, but<br />
also give more detailed in<strong>for</strong>mation about sublasses of these categories. There are fewer<br />
patterns than roots.<br />
<strong>Arabic</strong> has a large set of morphological features (Al-Sughaiyer and Al-Kharashi 2004).<br />
61
4.6. GENERATION<br />
These features are in the <strong>for</strong>m of prefixes, suffixes and also infixes that can completely<br />
change the meaning of the word. Also, in <strong>Arabic</strong> there are some words that hold the<br />
meaning of a full sentence <strong>for</strong> example, snsāfr , would translate <strong>to</strong>; We will<br />
travel. in <strong>English</strong>. This means any MT system should apply thorough analysis in order<br />
<strong>to</strong> obtain the root or <strong>to</strong> deduce that in one word there is in fact a full sentence. <strong>Arabic</strong><br />
has a relatively free word order, this poses a significant challenge <strong>to</strong> MT due <strong>to</strong> the vast<br />
possibilities <strong>to</strong> express the same sentence in <strong>Arabic</strong>.<br />
4.6 Generation<br />
In this section we discuss the generation of target language texts.<br />
4.6.1 Generation in direct systems<br />
In direct systems in Figure 4.7, generation is based as much as possible on source lan-<br />
guage structures: nothing is changed more than strictly needed <strong>for</strong> the creation of a suit-<br />
able target language word order.<br />
62
Figure 4.7: Direct MT system<br />
4.6.2 Generation in transfer-based systems<br />
4.6. GENERATION<br />
In a transfer system, the generation phase is generally divided in<strong>to</strong> two parts, syntac-<br />
tic generation and morphological generation. Syntactic generation involves creating a<br />
deep-tree structure from the output of the analysis, which is then re-ordered by trans-<br />
<strong>for</strong>mational rules. The final tree is labelled with the grammatical functions and features<br />
of the target language. This re-ordered surface structure can now be processed by the<br />
morphological genera<strong>to</strong>r, which creates labelled lexical items which can be easily turned<br />
in<strong>to</strong> target sentences.<br />
63
4.6.3 Generation in interlingua systems<br />
4.6. GENERATION<br />
The steps <strong>for</strong> generating texts in interlingua-based systems are similar <strong>to</strong> those described<br />
<strong>for</strong> transfer-based systems. Generation includes phases of syntactic and morphological<br />
generation. The main difference is that the start point is not a deep-structure syntactic<br />
representation, but an interlingua representation, probably based on predicate-argument<br />
structures. The syntactic structure must first be generated from the interlingual represen-<br />
tation by a phase often known as semantic generation. The process may be described<br />
using example in Figure 4.8. The structure <strong>to</strong> be generated is shown in Figure 4.9.<br />
Figure 4.8: Semantic generation<br />
64
4.7 Summary<br />
Figure 4.9: Structure <strong>to</strong> be generated<br />
4.7. SUMMARY<br />
In the stages of analysis and generation, most MT systems contain separated components<br />
dealing with different levels of linguistic description: morphology, syntax, semantics.<br />
Hence, analysis may be divided in<strong>to</strong> morphological analysis, syntactic analysis and se-<br />
mantic analysis (Hutchins and Somers 1992).<br />
For the purposes of this study, our proposed solution <strong>to</strong> an <strong>Arabic</strong>-<strong>English</strong> transla<strong>to</strong>r will<br />
be based upon the interlingua model of <strong>machine</strong> translation. <strong>Arabic</strong> is unique in many<br />
ways but is not immune <strong>to</strong> the standard challenges faced by other languages such as<br />
multiple meanings of words, non-verbalisation and insufficient lexicons.<br />
An Interlingua model that incorporates source language analysis, thereby creating a so<br />
called universal logical structure, will facilitate multiple language generation in a more<br />
flexible way. An Interlingua model is presented in Figure 4.10.<br />
65
Figure 4.10: Interlingua model of <strong>Arabic</strong> MT<br />
4.7. SUMMARY<br />
For the elements of subject(S), verb(V) and object(O), <strong>Arabic</strong>’s relatively free word order<br />
allows the combinations of SVO, VSO and VOS. The only combinations that do not<br />
occur in <strong>Arabic</strong> are OSV and SOV. <strong>Arabic</strong>’s flexible word order is discussed later in this<br />
research. Our research develops a rule-based and lexical <strong>framework</strong> <strong>for</strong> the processing of<br />
<strong>Arabic</strong> using the Role and Reference Grammar (RRG) linguistic model. The <strong>framework</strong><br />
is <strong>to</strong> be evaluated using a <strong>machine</strong> translation system that translates an <strong>Arabic</strong> text as<br />
source language in<strong>to</strong> an <strong>English</strong> text as target language.<br />
66
5<br />
Design of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation<br />
system based on RRG<br />
The UniArab system is a natural language processing application based on Role and<br />
Reference Grammar (RRG) <strong>for</strong> translating the <strong>Arabic</strong> language in<strong>to</strong> any other language,<br />
using an interlingua bridge. An interlingua based MT approach <strong>to</strong> translation is done<br />
via an intermediate semantic representation of the source language (Hutchins 2003). The<br />
conceptual architecture of the UniArab system is shown in Figure 5.1. To apply it <strong>to</strong><br />
any other language, we need only change phases 9, 10, 11 and 12. Figure 5.1 will be<br />
discussed in more detail in Chapter 6.<br />
67
5.1. UNIARAB: INTERLINGUA-BASED SYSTEM<br />
Figure 5.1: The conceptual architecture of the UniArab system<br />
5.1 UniArab: Interlingua-based system<br />
In interlingua MT systems, the result of source language analysis is a language indepen-<br />
dent representation of the text which is the basis <strong>for</strong> the generation of the target language<br />
text. The advantages of using interlingua <strong>for</strong> multilingual systems have already have<br />
been mentioned in Chapter 4. The challenges start with analysis and generation, they<br />
have <strong>to</strong> be strictly separated; it is not desirable <strong>to</strong> learn about analysis <strong>to</strong>wards a par-<br />
ticular target language and it is not possible, during generation, <strong>to</strong> refer <strong>to</strong> the original<br />
source language text. Using an RRG based interlingua bridge creates strong analysis<br />
methods that incorporate all attributes of a sentence and its words including the logical<br />
structure of its verbs. This technique could be very amenable <strong>to</strong> interlingua. The interlin-<br />
gua representation must include all the in<strong>for</strong>mation that can possibly be required during<br />
68
5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />
the generation of any target language text or rather more correctly: any target language<br />
included in the system from the outset or planned <strong>for</strong> the future. In effect, this high<br />
degree of language-independence and objectivity means that interlinguas must strive <strong>to</strong>-<br />
wards universality in lexicon and structure: one might almost say, <strong>to</strong>wards representing<br />
the meaning of the text. Most interlingua-based systems use representations. The Chom-<br />
skyan theory of deep structures was thought <strong>to</strong> be attractive, but it is now agreed they<br />
are not sufficiently abstract, being <strong>to</strong>o oriented <strong>to</strong>wards the surface features of individ-<br />
ual languages. The implications of neutral structural representations can be illustrated<br />
by allowing <strong>for</strong> differences of word order between languages, and their significance. In<br />
<strong>English</strong>, word order is the primary means of distinguishing grammatical functions like<br />
subject and object. The <strong>Arabic</strong> language has a relatively free word order. The implica-<br />
tion <strong>for</strong> an interlingua is that it is not enough <strong>to</strong> designate word order on its own: the<br />
interlingua must represent the significance in terms of grammatical function (syntactic<br />
relations), text function, determination, case role or whatever else the interpretation of<br />
the word-order dictates. Structural differences can be treated in transfer-based systems<br />
by structural transfer rules. But in interlingua-based systems the representation must be<br />
language-neutral.<br />
5.2 Designing an XML lexicon architecture <strong>for</strong> <strong>Arabic</strong> MT based<br />
on RRG<br />
The lexicon in RRG takes the position that lexical entries <strong>for</strong> verbs should contain unique<br />
in<strong>for</strong>mation only, with as much in<strong>for</strong>mation as possible derived from general lexical<br />
rules. The lexicon is designed <strong>to</strong> reflect the word categories in the <strong>Arabic</strong> language with<br />
as much in<strong>for</strong>mation as possible derived from general lexical rules. The lexicon s<strong>to</strong>res<br />
the <strong>Arabic</strong> words in categories, each category is s<strong>to</strong>red in an XML <strong>for</strong>mat datasource<br />
69
5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />
file. In order <strong>to</strong> be able <strong>to</strong> analyses <strong>Arabic</strong> by computer we must first extract the lexical<br />
properties of the <strong>Arabic</strong> words. The UniArab system uses the lexicon <strong>to</strong> construct a logi-<br />
cal structure <strong>for</strong> <strong>Arabic</strong> input sentences, also represented in XML, which is then used <strong>for</strong><br />
generating the target language translation. We show the structure of the UniArab lexicon,<br />
discuss how it is used in the system, and show the user interface used <strong>for</strong> adding <strong>to</strong> the<br />
lexicon. The lexicon is built from individual words at present.<br />
5.2.1 An XML-based lexicon<br />
In order <strong>to</strong> build this system and represent the data sources, we use the XML language<br />
and Java. The most recent recommendation of the XML language has been presented<br />
by Bray et al. (2008). XML has become the default standard <strong>for</strong> data exchange among<br />
heterogeneous data sources (Arciniegas 2000). The UniArab system allows data <strong>to</strong> be<br />
s<strong>to</strong>red in XML <strong>for</strong>mat. This data can then be queried, exported and serialized in<strong>to</strong> any<br />
<strong>for</strong>mat the developer wishes.<br />
We choose <strong>to</strong> create our data source as XML, <strong>for</strong> optimum support or different plat-<br />
<strong>for</strong>ms. It was also easier as we used <strong>Arabic</strong> letters not Unicode inside the data source,<br />
XML fully supported <strong>Arabic</strong>. We created our search engine using Java.<br />
5.2.2 Lexical representation in UniArab<br />
Lexical frames represent the language-dependent lexicon. We use an XML data source<br />
<strong>to</strong> represent the UniArab lexicon. The lexicon creates pointers <strong>to</strong> corresponding con-<br />
ceptual frames or attributes of each word. These frames also have relations which link<br />
them <strong>to</strong> verb class frames, which are organized hierarchically according <strong>to</strong> the particular<br />
language.<br />
70
5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />
In Phase 3 in Figure 5.1, the UniArab system <strong>to</strong>kenizes a sentence in<strong>to</strong> words, then sends<br />
each word <strong>to</strong> the search engine within the Lexicon <strong>to</strong> query the category of each word<br />
and all attributes <strong>for</strong> that word. The Lexicon returns the corresponding category and its<br />
attributes as detailed below. The Morphology Parser, Phase 5, receives the word meta-<br />
data and ensures that the properties of the words are consistent. The verb attributes in<br />
particular, are of great importance in correctly extracting sentence logical structure fur-<br />
ther down the processing chain, helping <strong>to</strong> answer the basic question ‘Who does what?’<br />
In free word order sentences, <strong>for</strong> example, yh. b qys lylā, ‘Qays loves<br />
Laila’ multiple orders are possible including verb-subject-object, verb-object-subject or<br />
subject-verb-object. The attributes of the verb agree with the gender of the subject. Given<br />
the masculine gender of the verb in this case, the Syntactic Parser will look <strong>for</strong> a mascu-<br />
line proper noun <strong>to</strong> make the ac<strong>to</strong>r <strong>for</strong> this sentence. If there is more than one masculine<br />
proper noun in such a case, then Modern Standard <strong>Arabic</strong> defines the first proper noun<br />
as the ac<strong>to</strong>r. The Morphology Parser will be extended so that it can deal with words that<br />
are defined in multiple categories, deciding which should be processed. Meanwhile the<br />
Syntactic Parser, so far, has only been implemented <strong>for</strong> extracting word order, though it<br />
will be extended <strong>to</strong> deal with word ambiguities in future versions.<br />
5.2.3 Lexical properties<br />
Figure 5.2 shows the structure of the Lexicon including the properties s<strong>to</strong>red <strong>for</strong> each<br />
word category. For all categories, an <strong>Arabic</strong> word is s<strong>to</strong>red along with its <strong>English</strong> repre-<br />
sentation. Since word ambiguity has not been dealt with so far, there is a one <strong>to</strong> one map-<br />
ping <strong>for</strong> the simple sentences which UniArab processes up <strong>to</strong> now. However, word am-<br />
biguity is supported in the structure, with each possible case s<strong>to</strong>red as a separate record.<br />
All search results will be passed <strong>to</strong> the Morphology Parser <strong>to</strong> decide which is taken.<br />
71
5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />
Figure 5.2: In<strong>for</strong>mation recorded in the UniArab lexicon<br />
72
5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />
Since the verb is the key component when analysing using RRG, each verb has an asso-<br />
ciated logical structure, which is later used <strong>to</strong> determine the logical structure of the full<br />
sentence. The tense of the verb is also s<strong>to</strong>red within its metadata along with the person.<br />
The verb type also s<strong>to</strong>res the gender, which in <strong>Arabic</strong> must be either masculine or femi-<br />
nine; there is no neutral gender. The number property in arabic can be singular, dual or<br />
plural. These properties help the Syntactic Parser analyse the sentence, since there must<br />
be agreement with the subject and verb, among other rules.<br />
Although we adhere <strong>to</strong> the Interlingua approach, we do not do so with the translation<br />
of lexical items. In an ideal Interlingua system lexical entries should be broken down<br />
in<strong>to</strong> sets of semantic features. For example the word “man” is broken down in<strong>to</strong> +human<br />
+male +adult. While this works in theory, in practice we cannot find enough seman-<br />
tic features <strong>to</strong> describe every entity in the world. For example “cow”, “computer” and<br />
“chair” cannot be described using these sets of semantic features unless we invent a<br />
unique semantic feature <strong>for</strong> every object and this is practically impossible, and of course,<br />
beyond the scope of this thises.<br />
Table 5.1: Verb 1<br />
<strong>Arabic</strong> verb qr֓a<br />
<strong>English</strong> translation read<br />
Logical structure [do’(x,[read’(x,(y)])]<br />
Tense past<br />
Gender m<br />
Person 3rd<br />
Number singular<br />
In Tables 5.1, 5.2, we show two examples of records <strong>for</strong> verbs in the Lexicon. The<br />
absence of t ‘t’ suffix signifies m: gender. The <strong>English</strong> translation of these verbs are<br />
‘read’ and ‘wrote’.<br />
An example of the XML record <strong>for</strong> a verb in the Lexicon is shown here;<br />
73
/><br />
<strong>English</strong>Translate=“read”<br />
Table 5.2: Verb 2<br />
<strong>Arabic</strong> verb <br />
ktbt<br />
<strong>English</strong> translation wrote<br />
Logical structure [do’(x,[write’(x,(y)])]<br />
Tense past<br />
Gender f<br />
Person 3rd<br />
Number singular<br />
LogicalStructures= “”<br />
NumberVerb=“sg”<br />
P.O.S=“Verb”<br />
genderVerb=“M”<br />
personVerb=“3rd”<br />
tenseVerb=“PAST”<br />
5.3 Design of test strategy<br />
5.3. DESIGN OF TEST STRATEGY<br />
We will create variants of <strong>Arabic</strong> sentences that represent all possible structures of sen-<br />
tences that UniArab can translate. We will evaluate the result of the system output by<br />
comparing between human-translated and <strong>machine</strong>-translated versions. In Tables 5.3 <strong>to</strong><br />
5.9 we represent some examples of sentences that are used <strong>to</strong> test the UniArab system.<br />
For actual test examples see Appendix C.<br />
Verb-Subject one argument in deferent tenses:<br />
In Table 5.3, Verb-Subject Agreement with two arguments sentences, are sentences where<br />
UniArab should select the correct <strong>for</strong>m of the verb. In particular the verb must agree with<br />
the subject.<br />
74
5.3. DESIGN OF TEST STRATEGY<br />
Table 5.3: Test strategy: verb-subject agreement<br />
<strong>Arabic</strong> human-translated UniArab other<br />
yˇsrb ֒mr ālh. lyb Omar is drinking the milk. ? ?<br />
ˇsrb ֒mr ālh. lyb Omar drank the milk. ? ?<br />
mārk qrā ālktāb Mark read the book. ? ?<br />
syˇsrb mārk āllbn Mark will drink the milk ? ?<br />
Demonstrative Adjective-Noun:<br />
The system should place the Demonstrative Adjective-Noun Agreement that agrees in<br />
number and gender. The test sentences are shown in Table 5.4.<br />
Table 5.4: Test strategy: demonstrative adjective-noun agreement<br />
<strong>Arabic</strong> human-translated UniArab best of rest<br />
hd ¯ ā ālrˇgl This man ? ?<br />
d ¯ lk ālrˇgl That man ? ?<br />
Gender-Ambiguous proper nouns:<br />
Proper nouns can confuse MT in two different ways. The first, the MT system may<br />
not identify that the word is a proper noun and analyse it as a noun, adjective, or any<br />
other categories. The second is that it may fail <strong>to</strong> identify the gender of the noun and<br />
thus fail <strong>to</strong> provide in<strong>for</strong>mation needed <strong>for</strong> agreement in <strong>Arabic</strong>. The test sentences<br />
are shown in Table 5.5. The UniArab system should follow the rules <strong>for</strong> agreement in<br />
number and gender. This is due <strong>to</strong> the fact that <strong>Arabic</strong> differs greatly from <strong>English</strong> in<br />
the distribution of number and gender in the pronoun system, lexical items as well as<br />
the syntactic structure. This difference results in many agreement problems during the<br />
translation process.<br />
Table 5.5: Test strategy: gender-ambiguous proper nouns<br />
<strong>Arabic</strong> human-translated UniArab best of rest<br />
<br />
qr֓a ˇgāk ālktāb Jack read the book. ? ?<br />
qr֓at māry ālktāb Mary read the book. ? ?<br />
75
Copula verb ‘<strong>to</strong> be’:<br />
5.3. DESIGN OF TEST STRATEGY<br />
There are certain cases where the standard NP VP categorisation does not apply due <strong>to</strong> the<br />
absence of a copula verb in the language. In <strong>Arabic</strong> there is no verb ‘<strong>to</strong> be’ (Salem et al.<br />
2008b). UniArab should understand if the sentences contain verb ‘<strong>to</strong> be’ and generate<br />
them correctly. The test sentences are shown in Table 5.6.<br />
Table 5.6: Test strategy: verb ‘<strong>to</strong> be’<br />
<strong>Arabic</strong> human-translated UniArab best of rest<br />
֓anā ālmhnds I am the engineer ? ?<br />
hw mhnds He is an engineer ? ?<br />
Verb ‘<strong>to</strong> have’:<br />
UniArab should understand if the sentences contain ‘<strong>to</strong> have’ and generate them correctly.<br />
<strong>Arabic</strong>, like Modern Irish, has no verb of ‘<strong>to</strong> have’. The test sentences are shown in Table<br />
5.7.<br />
<strong>Arabic</strong><br />
Table 5.7: Test strategy: verb ‘<strong>to</strong> have’<br />
human-translated UniArab best of rest<br />
<br />
lqd qmt bālh. ˇgz I have made a reservation. ? ?<br />
<br />
lqd fqdt tdkrty I have lost my ticket. ? ?<br />
¯<br />
The free word order in <strong>Arabic</strong>:<br />
<strong>Arabic</strong> has free word order, this poses a significant challenge <strong>to</strong> MT due <strong>to</strong> the vast<br />
possibilities <strong>to</strong> express the same sentence in <strong>Arabic</strong> (Salem et al. 2008a). The ac<strong>to</strong>r in<br />
Table 5.8 could be the first, second or third argument. UniArab should analyse who the<br />
ac<strong>to</strong>r is.<br />
Pro–Drop:<br />
In technical linguistic terms, <strong>Arabic</strong> is a ‘pro–drop’ or ‘pronoun–drop’ language (Ryd-<br />
ing 2007). The pro–drop parameter is an aspect of grammar that allows subjects <strong>to</strong> be<br />
optional but unders<strong>to</strong>od in some languages. That is, every inflection in a verb paradigm<br />
76
5.4. DESIGN OF EVALUATION CRITERIA<br />
Table 5.8: Test strategy: free word order (Verb Noun Noun)<br />
<strong>Arabic</strong> human-translated UniArab best of rest<br />
yh. b qys lylā Qays loves Laila. ? ?<br />
qys yh. b lylā Qays loves Laila. ? ?<br />
yh. b lylā qys Qays loves Laila. ? ?<br />
is specified uniquely and does not need <strong>to</strong> use independent pronouns <strong>to</strong> differentiate the<br />
person, number, and gender of the verb. The test sentences are shown in Table 5.9.<br />
Table 5.9: Test strategy: pro–drop<br />
<strong>Arabic</strong> human-translated UniArab best of rest<br />
<br />
fāttny ālt.ā֓yrh (I) missed the plane. ? ?<br />
֓aryd ˙grfh (I) want a room. ? ?<br />
<br />
nsyt mh. fz . ty (I) <strong>for</strong>got my wallet. ? ?<br />
֓aryd hātm (I) want a ring. ? ?<br />
˘<br />
5.4 Design of evaluation criteria<br />
We will evaluate the result of output by comparing with human-translated and <strong>machine</strong>-<br />
translated versions . Comparisons can be made between two <strong>machine</strong> translation systems,<br />
or between human-translated and <strong>machine</strong>-translated sentences. UniArab system is com-<br />
pared with translations done by human transla<strong>to</strong>rs. Then this result is compared with<br />
the results of other (<strong>Arabic</strong> <strong>to</strong> <strong>English</strong>) Machine translation systems. We are comparing<br />
different levels of human translation with UniArab system output, using human subjects<br />
as judges. The human judges were skilled <strong>for</strong> the purpose of Machine Translation; it is<br />
an efficient evaluation <strong>for</strong> MT research. The evaluation study compared an MT system<br />
translating from <strong>Arabic</strong> in<strong>to</strong> <strong>English</strong> with human transla<strong>to</strong>rs. The human transla<strong>to</strong>rs were<br />
a native <strong>Arabic</strong> speaking L1 adults who had <strong>English</strong> as their L2. The five point scale <strong>for</strong><br />
adequacy indicates how much of the meaning expressed in the reference translation is<br />
77
also expressed in a hypothetical translation:<br />
5 = All<br />
4 = Most<br />
3 = Much<br />
2 = Little<br />
1 = None<br />
5.5. SUMMARY<br />
The second five point scale indicates how fluent the translation is. When translating in<strong>to</strong><br />
<strong>English</strong> the values correspond <strong>to</strong>:<br />
5 = Flawless <strong>English</strong><br />
4 = Good <strong>English</strong><br />
3 = Non-native <strong>English</strong><br />
2 = Bad <strong>English</strong><br />
1 = Incomprehensible<br />
5.5 Summary<br />
UniArab is designed as an Interlingua <strong>machine</strong> transla<strong>to</strong>r, which takes <strong>Arabic</strong> sentences<br />
and analyses their structure producing in interlingua representation which can then be<br />
used in isolation <strong>to</strong> generate the <strong>English</strong> translation. We presented a test strategy in<br />
which a wide range of sentence types will be used <strong>to</strong> test the effectiveness of UniArab.<br />
We then set evaluation criteria which can be used <strong>to</strong> quantify how the system per<strong>for</strong>ms<br />
<strong>for</strong> each of these test types.<br />
78
6<br />
UniArab: a proof-of-concept <strong>Arabic</strong> <strong>to</strong> <strong>English</strong><br />
<strong>machine</strong> translation system<br />
This chapter presents an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> transla<strong>to</strong>r system, called UniArab.<br />
UniArab is a proof-of-concept translation system supporting the fundamental aspects of<br />
<strong>Arabic</strong>, such as the parts of speech, agreement and tenses. UniArab stands <strong>for</strong> Universal<br />
<strong>Arabic</strong> <strong>machine</strong> transla<strong>to</strong>r system. UniArab is based on the linking algorithm of RRG<br />
(syntax <strong>to</strong> semantics and vice versa) as indicated in Figure 6.1.<br />
Figure 6.1: Layout of Role and Reference Grammar<br />
79
6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />
6.1 Conceptual structure of the UniArab system<br />
The conceptual structure of the UniArab system is shown in figure 6.2. The system<br />
accepts <strong>Arabic</strong> as its source language. The morphology parser and word <strong>to</strong>kenizer have<br />
a connection <strong>to</strong> the lexicon which holds all attributes of a word.<br />
Figure 6.2: The conceptual architecture of the UniArab system<br />
UniArab s<strong>to</strong>res data in XML <strong>for</strong>mat. This data can then be queried, exported and se-<br />
rialized in<strong>to</strong> any <strong>for</strong>mat the developer wishes. The system can understand the part of<br />
speech of a word, agreement features, number, gender and the word type. The syntactic<br />
parse unpacks the agreement features between elements of the <strong>Arabic</strong> sentence in<strong>to</strong> a<br />
semantic representation (the logical structure) with the ‘state of affairs’ of the sentence.<br />
80
6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />
In UniArab we intend <strong>to</strong> have a strong analysis system that can extract all attributes from<br />
the words in a sentence.<br />
6.1.1 Technical architecture of the UniArab system<br />
The structure of the UniArab system in Figure 6.2 breaks down in<strong>to</strong> the following phases:<br />
Phase (1) - <strong>Arabic</strong> language sentence. The input <strong>to</strong> the system consists of one or more<br />
sentences in <strong>Arabic</strong>.<br />
Phase (2) - Sentence Tokenizer. Tokenization is the process of demarcating and classi-<br />
fying sections of a string of input characters. In this phase the system splits the text<br />
in<strong>to</strong> sentence <strong>to</strong>kens. The resulting <strong>to</strong>kens are then passed <strong>to</strong> the word <strong>to</strong>kenizer<br />
phase. For example <br />
<br />
qr֓a hāld ālktāb. hāld tlmyd<br />
˘ ˘ ¯<br />
d<br />
¯ ky. will be two <strong>to</strong>kens; qr֓a hāld ālktāb and <br />
˘ <br />
hāld ˘<br />
tlmyd ¯ d ¯ ky the translation of these two sentences is Khalid read the book. Khalid is<br />
a clever student.<br />
Phase (3) Word Tokenizer There, sentences are split in<strong>to</strong> <strong>to</strong>kens <br />
qr֓a<br />
hāld<br />
ālktāb Khalid read the book, the output of phase 3 is as follows;<br />
˘<br />
<br />
qr֓a<br />
h ˘ āld<br />
ālkt āb<br />
<br />
Phase (4) Lexicon Datasource A set of XML documents <strong>for</strong> each component category<br />
of <strong>Arabic</strong>.<br />
Phase (5) Morphology Parser Directly works with both the Lexicon and Tokenizer <strong>to</strong><br />
produce the word order. A connection is made <strong>to</strong> the datasource of phase 4 which<br />
81
6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />
has been implemented as a set of XML documents. The use of XML has the added<br />
advantage of portability. UniArab will effectively work the same regardless of the<br />
operating system. To understand the morphology of each word, we first <strong>to</strong>kenize<br />
each sentence and determine the word relationships. Phase 5 of the system holds all<br />
attributes specific <strong>to</strong> each word of the source sentence.<br />
Phase (6) Syntactic Parser Determines the precise phrasal structure and category of the<br />
<strong>Arabic</strong> sentence. At this point, the types and attributes of all words in the sentence<br />
are known.<br />
Phase (7) Syntactic linking (RRG) We must first develop the link from syntax <strong>to</strong> se-<br />
mantics out of the phrasal structure created in Phase 6, if we are <strong>to</strong> create a logical<br />
structure that will generate a target language and also act as the link in the opposite<br />
direction from semantics <strong>to</strong> syntax. The system should answer the main question in<br />
this phase, who does what <strong>to</strong> whom? We use the gender of the verb <strong>to</strong> determine<br />
the ac<strong>to</strong>r. When the subject and object have different genders, the gender of the verb<br />
must match the subject. If they both agree with the verb, then MSA dictates that<br />
the first noun is the subject. In this case the ac<strong>to</strong>r is Khalid and the undergoer is the<br />
book.<br />
Phase (8) Logical Structure Creation of logical structure is the most crucial phase. An<br />
accurate representation of the logical structure of an <strong>Arabic</strong> sentence is the primary<br />
strength of UniArab. Below is a sample output from the UniArab system. The<br />
<strong>Arabic</strong> equivalent of the past tense sentence ‘Khalid read the book’ <br />
<br />
qr֓a h ˘ āld ālktāb is input as the source.<br />
ālktāb book:N h ˘ āld Khalid:MsgN qr֓a read:V<br />
The results of the parse can be seen in the following logical structure:<br />
Verb read<br />
82
6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />
<br />
sg 3rd M PAST qr֓a<br />
where the Proper Noun is Khalid sg unspec M <br />
h ˘ āld<br />
and the Noun is, the book sg def M ālktāb<br />
Consider the following example; Omar is a student. be’(Omar,[student’]).<br />
in <strong>Arabic</strong> ֒mr tlmyd ¯ . This is a challenge since there is no verb ‘<strong>to</strong> be’ in<br />
<strong>Arabic</strong>, but this must be inferred <strong>for</strong> correct translation. Instead of saying ‘Omar is<br />
a student’, the <strong>Arabic</strong> equivalent would be ‘Omar student’. We also face the chal-<br />
lenge of inferring the indefinite article, which does not exist in <strong>Arabic</strong>. All of the<br />
unique in<strong>for</strong>mation <strong>for</strong> each word can thus be taken from the lexicon <strong>to</strong> aid in the<br />
creation of a logical structure of the target language.<br />
Phase (9) Semantic <strong>to</strong> Syntax Assuming we have an input and have produced a struc-<br />
tured syntactic representation of it, the grammar can map this structure from a se-<br />
mantic representation. In this phase the system uses a linking algorithm provided by<br />
RRG <strong>to</strong> determine ac<strong>to</strong>r and undergoer assignments, assign the core arguments and<br />
assign the predicate in the nucleus. The system uses semantic arguments of logical<br />
structures other than of the main verb.<br />
Phase (10) Syntax Generation This will be unique <strong>for</strong> each target language. In this<br />
phase the system uses the target language rules <strong>to</strong> generate the syntax. In this case<br />
<strong>English</strong> language rules are used.<br />
Phase (11) Generate <strong>English</strong> Morphology The system generates <strong>English</strong> morphology<br />
in an innovative way, generating the tenses not existent in <strong>Arabic</strong> but in <strong>English</strong> as<br />
well as verb ‘<strong>to</strong> be’.<br />
Figure 6.3 shows the technique used <strong>to</strong> generate the correct verb tenses, and generate<br />
verb <strong>to</strong> be. Verbs in <strong>English</strong> have a mood; e.g. indicative, subjunctive, imperative<br />
83
6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />
Figure 6.3: Generation the right tense <strong>for</strong> the verbs<br />
and can be in one of many tenses. We discussed the special situation with refer-<br />
ence <strong>to</strong> the intersection of <strong>Arabic</strong> tense and aspect in Chapter 2. The solution is <strong>to</strong><br />
recognize the difference between morphological features and syntactic functional<br />
categories. The tense features must be expressed analytically.<br />
Phase (12) <strong>English</strong> Sentence Generation The process of generating an <strong>English</strong> sentence<br />
can be as simple as keeping a list of rules. These rules can be extended through the<br />
life of the MT system. The system will use some operations in <strong>English</strong> such as<br />
vowel change: examples; man men. Sometimes this accompanies affixations: break<br />
broke broken (= broke + en).<br />
6.1.2 UniArab: Lexical representation in interlingua system<br />
In transfer-based systems there are no problems if <strong>for</strong> a particular language pair there<br />
are one-<strong>to</strong>-one equivalents; the problems arise when there is more than one target word<br />
<strong>for</strong> a single source word. But <strong>for</strong> an interlingua in a multilingual system there are prob-<br />
lems even if only one of the languages involved has two or more potential <strong>for</strong>ms <strong>for</strong> a<br />
84
6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />
single given word in one of the other languages. If an interlingua is <strong>to</strong> be completely<br />
language-neutral, it must represent not the words of one or another of the languages,<br />
but language-independent lexical units. Any distinction which is (or can be) expressed<br />
lexically in the languages of the system must be represented explicitly in the interlingua<br />
representation (Hutchins and Somers 1992). The UniArab system can generate a target<br />
language by classifying every <strong>Arabic</strong> word in the source text. There are six major parts<br />
of speech in <strong>Arabic</strong>. These are Verbs, Nouns, Adjectives, Proper nouns, Demonstratives,<br />
Adverbs and we create a seventh, so called ‘other’ category <strong>for</strong> <strong>Arabic</strong> words which do<br />
not fit in<strong>to</strong> any of these six categories. The major parts of speech in the <strong>Arabic</strong> language<br />
have their own attributes, and we use these attributes within the UniArab system. For<br />
example, verbs in the <strong>Arabic</strong> language agree with their subjects in gender. <strong>Arabic</strong> words<br />
are masculine and feminine; there is no neutral gender. In the UniArab system we record<br />
the gender associated with a verb in the syntax <strong>for</strong> a particular subject NP. Adjectives<br />
and demonstratives also agree with the subject in gender <strong>to</strong>o. In <strong>Arabic</strong>, words come in<strong>to</strong><br />
three categories with regards <strong>to</strong> number: They are (1) singular, indicating one, e.g. <br />
rˇgl ‘one man’. (2) dual, indicating two, e.g. rˇglān ‘two men’ and (3) plural,<br />
indicating three or more e.g. rˇgāl ‘men’. The UniArab system records these at-<br />
tributes of gender and number. It is important <strong>to</strong> understand that source language specific<br />
features may not be used or may be different in the target language. For example, the<br />
<strong>Arabic</strong> number categ ory of dual is not relevant in <strong>English</strong>. The UniArab system is based<br />
on RRG and uses logical structures <strong>for</strong> each verb in the lexicon.<br />
85
6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />
6.2 UniArab: Lexical representation in interlingua system based<br />
on RRG<br />
Lexical frames represent the language-dependent lexicon. We use an XML data source<br />
<strong>to</strong> represent the UniArab lexicon. The lexicon creates pointers <strong>to</strong> corresponding con-<br />
ceptual frames or attributes of each word. These frames also have relations which link<br />
them <strong>to</strong> verb class frames, which are organized hierarchically according <strong>to</strong> the particular<br />
language.<br />
Although we adhere <strong>to</strong> the Interlingua approach, we do not do so with the translation<br />
of lexical items. In an ideal Interlingua system lexical entries should be broken down<br />
in<strong>to</strong> sets of semantic features. For example the word “man” is broken down in<strong>to</strong> +human<br />
+male +adult. While this works in theory, in practice we cannot find enough seman-<br />
tic features <strong>to</strong> describe every entity in the world. For example “cow”, “computer” and<br />
“chair” cannot be described using these sets of semantic features unless we invent a<br />
unique semantic feature <strong>for</strong> every object and this is practically impossible.<br />
6.2.1 Verb<br />
In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.4 <strong>for</strong> each verb.<br />
The verb in<strong>for</strong>mation captured consists of <strong>Arabic</strong> Verb, <strong>English</strong> Translation, Logical<br />
Structure, Tense, Gender, Person and Number. The <strong>Arabic</strong> Verb represents one of the<br />
<strong>Arabic</strong> verbs in a specific tense, <strong>for</strong> a specific gender, person and number. The <strong>English</strong><br />
translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> verb. The Logical Structure attribute is<br />
the RRG equivalent logical structure or lexical entry representation <strong>for</strong> the <strong>Arabic</strong> Verb.<br />
<strong>Arabic</strong> inflects verbs <strong>for</strong> tense and they agree in person, number and gender with the<br />
subject. In RRG, Tense is a verbal opera<strong>to</strong>r in the layer structure of the clause providing<br />
86
6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />
in<strong>for</strong>mation about the tense of this verb.<br />
Figure 6.4: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> verb<br />
Table 6.1: Verb 1<br />
<strong>Arabic</strong> verb qr֓a<br />
<strong>English</strong> translation read<br />
Logical structure [do’(x,[read’(x,(y)])]<br />
Tense past<br />
Gender m<br />
Person 3rd<br />
Number singular<br />
Table 6.2: Verb 2<br />
<strong>Arabic</strong> verb ktbt<br />
<strong>English</strong> translation wrote<br />
Logical structure [do’(x,[write’(x,(y)])]<br />
Tense past<br />
Gender f<br />
Person 3rd<br />
Number singular<br />
In the <strong>Arabic</strong> language, tense can be past or present as the primary distinction. Gender is<br />
an <strong>Arabic</strong> attribute of the verb. The verb agrees with the subject in gender. The Person<br />
attribute could be first, second or third person. The Number attribute refers <strong>to</strong> number of<br />
the subject. In <strong>Arabic</strong>, the number of a verb can be singular, dual or plural. Table 6.1<br />
and Table 6.2 shows an example of one <strong>Arabic</strong> verb applied <strong>to</strong> different genders. The<br />
absence of t ‘t’ suffix signifies m: gender. The <strong>English</strong> translation of these verbs are<br />
‘read’ and ‘wrote’.<br />
87
6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />
6.2.2 Common noun<br />
In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.5 <strong>for</strong> each noun.<br />
The noun in<strong>for</strong>mation captured consists of <strong>Arabic</strong> Noun, <strong>English</strong> Translation, Definite-<br />
ness, Gender and Number. <strong>Arabic</strong> Noun represents a noun in the <strong>Arabic</strong> language. The<br />
<strong>English</strong> translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> Noun. Definiteness of the<br />
nouns can be definite or indefinite. Gender is an <strong>Arabic</strong> attribute of the noun.<br />
Figure 6.5: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> noun<br />
Table 6.3: Noun<br />
<strong>Arabic</strong> noun ֓aˇsˇgār ālktāb<br />
<strong>English</strong> translation trees the book<br />
Definiteness indefinite definite<br />
Gender f m<br />
Number plural singular<br />
The Number attribute refers <strong>to</strong> number of the noun. In the <strong>Arabic</strong> language number of<br />
nouns can be single, dual or plural. Table 6.3 shows examples of two different <strong>Arabic</strong><br />
noun words, whose <strong>English</strong> translations are ‘trees’ and ‘book’. Please note that ‘book’ is<br />
def+, meaning ‘definite’.<br />
6.2.3 Proper noun<br />
Proper nouns in <strong>Arabic</strong> are not capitalized. In the UniArab system we capture the in<strong>for</strong>-<br />
mation shown in Figure 6.6. For each proper noun the system captures <strong>Arabic</strong> proper<br />
noun, <strong>English</strong> translation, definiteness, gender and number. <strong>Arabic</strong> proper nouns rep-<br />
88
6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />
resents a proper noun in the <strong>Arabic</strong> language. The <strong>English</strong> translation is the <strong>English</strong><br />
equivalent of the <strong>Arabic</strong> proper noun. Gender is an <strong>Arabic</strong> attribute of the proper noun.<br />
The Number attribute refers <strong>to</strong> the number of the proper noun; single, dual or plural.<br />
Figure 6.6: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> proper noun<br />
Table 6.4: Proper Noun<br />
<strong>Arabic</strong> proper noun ֒mr ֓iymān<br />
<strong>English</strong> translation Omar Eman<br />
Gender m f<br />
Number singular singular<br />
Table 6.4 shows examples of two different <strong>Arabic</strong> proper noun words, whose <strong>English</strong><br />
translations are ‘Omar’ and ‘Eman’.<br />
6.2.4 Adjective<br />
In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.7 <strong>for</strong> each adjec-<br />
tive. This consists of <strong>Arabic</strong> Adjective, <strong>English</strong> Translation, Definiteness, Gender and<br />
Number. <strong>Arabic</strong> Adjectives represent adjectives in the <strong>Arabic</strong> language. The <strong>English</strong><br />
translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> Adjective. Definiteness can be definite<br />
or indefinite. Gender is an <strong>Arabic</strong> attribute of the adjective.<br />
Table 6.5: Adjective<br />
<strong>Arabic</strong> adjective qs. yr ālt.wylh<br />
<strong>English</strong> translation short the long<br />
Definiteness indefinite definite<br />
Gender m f<br />
Number singular singular<br />
89
6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />
Figure 6.7: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adjective<br />
The Number attribute refers <strong>to</strong> the number of the adjective. In the <strong>Arabic</strong> language num-<br />
ber agreement <strong>for</strong> adjectives can be singular, dual or plural. Table 6.5 shows examples of<br />
two different <strong>Arabic</strong> adjective words, whose <strong>English</strong> translations are ‘short’ and ‘long’,<br />
please note that ‘long’ is def+.<br />
6.2.5 Demonstrative<br />
In the UniArab system we capture the in<strong>for</strong>mation shown in Figure 6.8 <strong>for</strong> each demon-<br />
strative. this consists of <strong>Arabic</strong> Demonstrative, <strong>English</strong> Translation, Demonstrative type,<br />
Gender and Number. <strong>Arabic</strong> Demonstratives represents a demonstrative in the <strong>Arabic</strong><br />
language. The <strong>English</strong> translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> Demonstra-<br />
tive. Demonstrative type can be, in the <strong>Arabic</strong> language, near <strong>to</strong> the speaker, far from the<br />
speaker or between near and far from the speaker. Gender is an <strong>Arabic</strong> attribute of the<br />
demonstrative. The Number attribute refers <strong>to</strong> number of the demonstrative. Table 6.6<br />
shows examples of two different <strong>Arabic</strong> demonstratives, whose <strong>English</strong> translations are<br />
‘this’ and ‘that’.<br />
Table 6.6: Demonstrative representative<br />
<strong>Arabic</strong> demonstrative hd ¯ ā d ¯ lk ֓awl֓yk<br />
<strong>English</strong> translation this that those<br />
Demonstrative type close far between near and far from the speaker<br />
Gender m m both m and f<br />
Number singular singular plural<br />
90
6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />
6.2.6 Adverb<br />
Figure 6.8: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> demonstrative.<br />
In the UniArab system we capture the in<strong>for</strong>mation shown in Figure 6.9 <strong>for</strong> each adverb.<br />
this consists of <strong>Arabic</strong> Adverb, <strong>English</strong> Translation and Adverb type. <strong>Arabic</strong> Adverbs<br />
represents an adverb in the <strong>Arabic</strong> language. The <strong>English</strong> translation is the <strong>English</strong><br />
equivalent of the <strong>Arabic</strong> Adverb. Adverbs type refers <strong>to</strong> time or place (proposition), time<br />
such as ‘<strong>to</strong>day’ or ‘<strong>to</strong>morrow’ and places like ‘under’, ‘in’, or ‘on’ etc.<br />
Figure 6.9: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adverb.<br />
Table 6.7: Adverb<br />
<strong>Arabic</strong> adverb <br />
bˇgānb ālywm<br />
<strong>English</strong> translation beside <strong>to</strong>day<br />
Adverb type Proposition time<br />
Table 6.7 shows examples of two different <strong>Arabic</strong> adverbs, whose <strong>English</strong> translations<br />
are ‘beside’ and ‘<strong>to</strong>day’.<br />
91
6.2.7 Other <strong>Arabic</strong> words<br />
6.3. UNIARAB: GENERATION<br />
In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.10 <strong>for</strong> each other<br />
<strong>Arabic</strong> word. This consists of <strong>Arabic</strong> Other Word, <strong>English</strong> Translation, Logical Structure,<br />
Part of Speech, Tense, Gender, Person, Number and Definiteness.<br />
Figure 6.10: In<strong>for</strong>mation recorded on the other <strong>Arabic</strong> words<br />
Table 6.8: Other <strong>Arabic</strong> words (where ‘NON’ means not applicable)<br />
<strong>Arabic</strong> other words w hy<br />
<strong>English</strong> translation and she<br />
Logical structure NON NON<br />
Part of speech conjunction pronoun<br />
Tense NON NON<br />
Gender NON f<br />
Person NON 3rd<br />
Number NON singular<br />
Definiteness<br />
We allow a variety of attribute possibilities <strong>for</strong> the category ‘other’ in <strong>Arabic</strong> words <strong>for</strong><br />
the moment. Table 6.8 shows examples of two different <strong>Arabic</strong> Other words, whose<br />
<strong>English</strong> translations are ‘and’ and ‘she’.<br />
6.3 UniArab: Generation<br />
The target language generation phases in the UniArab system follow the syntactic re-<br />
alization model. Generation takes as input, the universal logical structure of the input<br />
sentence(s) and produces as output a morphology-syntactic realisation of the sentence in<br />
the target language. The UniArab system is designed as a universal <strong>machine</strong> transla<strong>to</strong>r,<br />
92
6.3. UNIARAB: GENERATION<br />
which means that it can support translation of the <strong>Arabic</strong> in<strong>to</strong> any other natural language<br />
with the addition of additional language generation bridges. The UniArab system is eval-<br />
uated using <strong>Arabic</strong> as source language in<strong>to</strong> <strong>English</strong> as the target language. In the UniArab<br />
system phases 9, 10, 11 and 12 are <strong>for</strong> generation of the target languages, in our case this<br />
is <strong>English</strong>. For the example given under Phase 8 in Section 6.1.1, <br />
qr֓a<br />
hāld<br />
ālktāb, Khalid read the book, we have the logical structure:<br />
˘<br />
Verb read [do’(x,[read’(x,(y)])] sg 3rd M PAST qr֓a><br />
where the Proper Noun is Khalid sg unspec M hāld ˘<br />
and the Noun is the book sg def M ālktāb<br />
Firstly, the Semantic <strong>to</strong> Syntactic phase determines the ac<strong>to</strong>r and undergoer assignments,<br />
assigns the core arguments and assigns the predicate in the nucleus. In the UniArab sys-<br />
tem we keep all word attributes whether they are used in the target language or not. In<br />
this case, the gender of the noun the book, in <strong>Arabic</strong> is masculine, but in <strong>English</strong> book<br />
has neutral gender. In Phase 10, Syntax Generation, and Phase 11, Generate <strong>English</strong><br />
Morphology, UniArab uses target language rule <strong>to</strong> generate the syntax. The verb log-<br />
ical structure gives UniArab a flag indicating how many arguments this verb takes. In<br />
this case the logical structure will be read[do’(x,[read’(x,(y)])]. Now the<br />
UniArab system replaces x with Khalid, and y with the book. The UniArab system now<br />
holds the following <strong>for</strong> this simple sentence:<br />
read[do’(Khalid,[read’(Khalid,(the book)])].<br />
In the last phase, <strong>English</strong> Sentence Generation, the UniArab system builds the final shape<br />
of a sentence: Khalid read the book. Moreover, there are some special cases, like the<br />
UniArab system adding verb <strong>to</strong> be or changing the verb tense of the source language <strong>to</strong><br />
93
6.4. UNIARAB: SCREEN DESIGN<br />
another tense in the target language. Also, the role of word order in the target language<br />
must be considered.<br />
ālktāb book:N h ˘ āld Khalid:MsgN qr֓a read:V<br />
read [do’(x,[read’(x,(y)])] sg 3rd M PAST qr֓a<br />
The results of the parse can be seen here with LS as :<br />
Verb read [do’(x,[read’(x,(y)])] sg 3rd M PAST qr֓a><br />
where the Proper Noun is Khalid sg unspec M <br />
h ˘ āld<br />
and the Noun is the book sg def M ālktāb<br />
At this point the generation will start; first of all the semantic <strong>to</strong> syntactic phase deter-<br />
mines the ac<strong>to</strong>r and undergoer assignments, assign the core arguments and assign the<br />
predicate in the nucleus. In the last phase, <strong>English</strong> Sentence Generation, the UniArab<br />
system builds the final shape of a sentence: Khalid read the book. Moreover, there are<br />
some special cases, like the UniArab system adding the verb ‘<strong>to</strong> be’ or ensure the verb<br />
tense of the source language is reflected as the appropriate tense in the target language.<br />
Also, the rules of word order in the target languages must be considered.<br />
6.4 UniArab: Screen design<br />
The graphical user interface (GUI) of UniArab is interactive. Designing the visual com-<br />
position and temporal behaviour of the GUI is an important aspect of the design of<br />
UniArab. We use one text area <strong>to</strong> allow a user <strong>to</strong> input source language sentences, two<br />
but<strong>to</strong>ns, Enter <strong>to</strong> submit the text <strong>to</strong> the system, Clear <strong>to</strong> delete all text in the input and<br />
output text areas. There is a separate text area <strong>for</strong> output of the translated text. Also there<br />
is a text area <strong>for</strong> logical structure output of every sentence.<br />
94
Figure 6.11: UniArab’s GUI 1<br />
6.4. UNIARAB: SCREEN DESIGN<br />
If a user needs <strong>to</strong> add a new <strong>Arabic</strong> word in the UniArab system datasource; he/she can<br />
click on the appropriate tab. There are seven different tabs each representing a category of<br />
words in the <strong>Arabic</strong> language; Add <strong>Arabic</strong> Verb, Add <strong>Arabic</strong> Noun, Add <strong>Arabic</strong> Adjective,<br />
Add <strong>Arabic</strong> Proper nouns, Add <strong>Arabic</strong> Demonstratives, Add <strong>Arabic</strong> Adverb and Add<br />
other <strong>Arabic</strong> Word. In every tab there are a number of combo boxes. A combo box is a<br />
combination of a drop-down list or list box, allowing the user <strong>to</strong> choose from the list of<br />
existing options. For example, when a user needs <strong>to</strong> add a new adjective <strong>to</strong> the datasource,<br />
the user will be presented with a text field <strong>to</strong> let him/her enter an <strong>Arabic</strong> adjective. There<br />
is another text field <strong>for</strong> adding the <strong>English</strong> equivalent of the <strong>Arabic</strong> adjective. There are<br />
a number of combo boxes; number, definition and gender, a user chooses from the list an<br />
option. There are two but<strong>to</strong>ns under each tab, Enter <strong>to</strong> submit the in<strong>for</strong>mation in<strong>to</strong> the<br />
95
Figure 6.12: UniArab’s GUI 2<br />
6.4. UNIARAB: SCREEN DESIGN<br />
system, and Clear <strong>to</strong> delete all words in all text fields and return combo boxes <strong>to</strong> their<br />
default state. Figures 6.11,6.12,6.13 shows GUI of the UniArab system.<br />
96
6.4.1 Lexicon interface<br />
Figure 6.13: UniArab’s GUI 3<br />
6.4. UNIARAB: SCREEN DESIGN<br />
In order <strong>to</strong> allow <strong>for</strong> robust user interaction with the lexicon, we use a graphical interface<br />
<strong>to</strong> capture the in<strong>for</strong>mation <strong>for</strong> each part of speech. The user selects the part of speech of<br />
the word he is adding, and is then presented with only the options relevant <strong>to</strong> it. The in-<br />
terface also limits the user’s selections <strong>to</strong> acceptable values and ensures that all attributes<br />
are filled. With this technique, we minimize the risk of human error, and there<strong>for</strong>e the<br />
in<strong>for</strong>mation is more accurate. The graphical interface is quicker and easier when a user<br />
adds a new word in the lexical (XML data source). When the system displays an in<strong>for</strong>-<br />
mation error. Figure 6.14 shows the entry interface that is implemented as part of the<br />
UniArab system.<br />
97
6.5 Technical challenges<br />
Figure 6.14: UniArab’s lexicon interface<br />
6.5. TECHNICAL CHALLENGES<br />
<strong>Arabic</strong> letters in the GUI We can not write <strong>Arabic</strong> letters in UniArab’s GUI. We use<br />
Unicode <strong>to</strong> represent them. Unicode Converter System allows us <strong>to</strong> enter <strong>Arabic</strong><br />
text and click on a but<strong>to</strong>n <strong>to</strong> get the equivalent Unicode of the text.<br />
<strong>Arabic</strong> letters in Eclipse IDE <strong>for</strong> Java We used Eclipse IDE <strong>for</strong> Java development. We<br />
can not write <strong>Arabic</strong> as a string in Eclipse. While Java does support <strong>Arabic</strong>, the<br />
problem lies in the operating system not supporting <strong>Arabic</strong> letter shapes in IDE. We<br />
used Windows XP and Windows 2000 which both have the same problem. To fix<br />
this we changed <strong>to</strong> Ubuntu Linux. Under Linux we can write <strong>Arabic</strong> text as a string<br />
in the Eclipse IDE.<br />
<strong>Arabic</strong> in data source We choose <strong>to</strong> create our data source as XML, <strong>for</strong> optimum sup-<br />
port or different plat<strong>for</strong>ms. It was also easier as we used <strong>Arabic</strong> letters not Unicode<br />
inside the data source. XML fully supports <strong>Arabic</strong>. We created our search engine<br />
using Java. We used a HashMap <strong>to</strong> make the keyword in <strong>Arabic</strong> when we search<br />
inside the datasource. We used verbMap.containsKey(word) in order <strong>to</strong> check<br />
the presence of an <strong>Arabic</strong> word in the data source.<br />
98
6.6 Summary<br />
6.6. SUMMARY<br />
We presented the conceptual structure and architecture of the UniArab system. We dis-<br />
cussed each of the phases from source language analysis, through the logical represen-<br />
tation, then the generation of the target sentences. We detailed the lexical properties of<br />
<strong>Arabic</strong> sentences and the attributes <strong>for</strong> each type of word. We discussed how generation<br />
maps the logic structure <strong>to</strong> the target language. Finally, we discussed the user interface<br />
and some of the problems encountered during development.<br />
99
7<br />
Testing and evaluation<br />
This chapter presents the results of the evaluation. Evaluation of MT software is nec-<br />
essary in order <strong>to</strong> improve system per<strong>for</strong>mance and analyse potential problems and, of<br />
course, its accuracy and effectiveness. In the evaluation session we consider many differ-<br />
ent aspects of the MT system including quality of translation, time <strong>for</strong> translation ability<br />
<strong>to</strong> add a new word in the lexicon of the system and resource utilization.<br />
7.1 Evaluation of MT systems<br />
The evaluation of MT systems is a difficult task. This is not only because many different<br />
metrics are involved, but also because translation is itself difficult (Laoudi et al. 2004).<br />
The first important aspect <strong>for</strong> a potential test is <strong>to</strong> determine the translational capability.<br />
There<strong>for</strong>e, we need <strong>to</strong> draw up a complete overview of the translational process, in all its<br />
different aspects. A good translation has <strong>to</strong> effectively capture the meaning. This involves<br />
establishing the size of the translation task, is it <strong>machine</strong> legible and if so, according<br />
100
7.2. SENTENCE TESTS<br />
<strong>to</strong> which standards? Current general function MT systems can not translate all texts<br />
consistently. Output can have very poor quality. It is important <strong>to</strong> mention that the<br />
‘subsequent editing required’ increases as translation quality gets poorer (Turian et al.<br />
2003).<br />
Given the limited lexicon implemented in this work so far, we evaluate the effectiveness<br />
and accuracy of UniArab by comparison. We create variants of <strong>Arabic</strong> sentences that<br />
represent all possible structures of the sentences that UniArab can translate. We then<br />
compare between human-translated and <strong>machine</strong>-translated versions.<br />
7.2 Sentence tests<br />
We have sentences (<strong>for</strong> actual test examples see Appendix C) in <strong>Arabic</strong> and their equiv-<br />
alent translations in <strong>English</strong>. We have covered a representative broad selection of verbs<br />
across intransitive, transitive and ditransitive constructions in simplex sentences in ac-<br />
tive voice. Complex sentences are beyond the thesis scope. However, we do address<br />
copula-like nominative clauses in <strong>Arabic</strong>. We tested UniArab in more than one way. We<br />
tested single sentences and multiple sentences. UniArab easily deals with more than one<br />
sentence as input and its output matches. We entered random sentences <strong>to</strong>gether in one<br />
input or as individual sentences.<br />
101
7.2.1 Verb-Subject with one argument in different tenses<br />
Table 7.1: Test : Verb-Subject; one argument<br />
<strong>Arabic</strong> yˇsrb ֒mr āllbn<br />
Human Omar is drinking the milk.<br />
Google Omar drink milk<br />
Microsoft drink milk Omar<br />
UniArab Omar is drinking the milk .<br />
7.2. SENTENCE TESTS<br />
In Table 7.1, the output of the Google transla<strong>to</strong>r (Google 2009) is faulty in tense and verb<br />
‘<strong>to</strong> be’. Microsoft’s MT (Microsoft 2009) failed <strong>to</strong> translate most of the sentence in tense,<br />
verb and word order. UniArab successfully translates the sentence in its entirety. Figure<br />
7.1 shows this sentence output in the UniArab system.<br />
Figure 7.1: Verb-Subject with one argument<br />
102
Table 7.2: Test : Verb-subject; agreement 1<br />
<strong>Arabic</strong> ˇsrb ֒mr āllbn<br />
Human Omar drank the milk<br />
Google Omar drinking milk<br />
Microsoft drinking milk Omar<br />
UniArab Omar drank the milk.<br />
7.2. SENTENCE TESTS<br />
In Table 7.2, the output of the Google transla<strong>to</strong>r is faulty in tense and definition. The<br />
Microsoft transla<strong>to</strong>r failed <strong>to</strong> translate most of the sentence in tense, definition and word<br />
order. UniArab successfully translates the sentence in its entirety. Figure 7.2 shows this<br />
sentence output in the UniArab system.<br />
Figure 7.2: Verb-Subject with one argument<br />
103
Table 7.3: Test : verb-subject; agreement 2<br />
<strong>Arabic</strong> <br />
h˘ āld qr֓a ālktāb<br />
human-translated Khalid read the book<br />
Google Khalid read the book<br />
Microsoft Khaled read book<br />
UniArab<br />
<strong>Arabic</strong><br />
Khalid read the book.<br />
<br />
human-translated Khalid will drink the milk<br />
Google Khalid drink milk.<br />
Microsoft Khaled drink milk.<br />
UniArab Khalid will drink the milk.<br />
syˇsrb h ˘ āld āllbn<br />
7.2. SENTENCE TESTS<br />
In Table 7.3, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT failed <strong>to</strong><br />
translate the definition. UniArab successfully translates the sentence in its entirety. In<br />
the output of the second sentence, the Google transla<strong>to</strong>r is faulty in tense and definition.<br />
Microsoft’s MT failed <strong>to</strong> translate the tense and definition. UniArab successfully trans-<br />
lates the sentence in its entirety. This is becouse of the RRG lexicalist approach in the<br />
interlingua. Figures 7.3 and 7.4 show this sentence output in the UniArab system.<br />
Figure 7.3: Verb-subject agreement 1<br />
104
Figure 7.4: Verb-subject agreement 2<br />
7.2. SENTENCE TESTS<br />
105
7.2.2 Gender-ambiguous proper nouns<br />
Table 7.4: Test : Gender-ambiguous proper nouns 1<br />
<strong>Arabic</strong> <br />
qr֓a ˇgāk ālktāb<br />
human-translated Jack read the book<br />
Google Jack read the book<br />
Microsoft read Jack book<br />
UniArab Jack read the book.<br />
7.2. SENTENCE TESTS<br />
In Table 7.4, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT failed<br />
<strong>to</strong> translate the definition. UniArab successfully translates the sentence in its entirety.<br />
Figure 7.5 shows this sentence output in the UniArab system.<br />
Figure 7.5: Gender-ambiguous proper nouns 1<br />
106
Table 7.5: Test : gender-ambiguous proper nouns 2<br />
<strong>Arabic</strong> qr֓at māry ālktāb<br />
human-translated Mary read the book<br />
Google Marie read the book<br />
Microsoft read Marrie book<br />
UniArab Mary read the book.<br />
7.2. SENTENCE TESTS<br />
In Table 7.5, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT failed <strong>to</strong><br />
translate the definition and word order. UniArab successfully translates the sentence in<br />
its entirety. Figure 7.6 shows this sentence output in the UniArab system.<br />
Figure 7.6: Gender-ambiguous proper nouns 2<br />
107
7.2.3 Verb ‘<strong>to</strong> be’<br />
Table 7.6: Test : Verb ‘<strong>to</strong> be’ 1<br />
<strong>Arabic</strong> hw mhnds<br />
human-translated He is an engineer.<br />
Google Is the architect of<br />
Microsoft is the engineer<br />
UniArab He is an engineer.<br />
7.2. SENTENCE TESTS<br />
In Table 7.6, the output of the Google transla<strong>to</strong>r is faulty. Microsoft’s MT successfully<br />
translated the person. UniArab successfully translates the sentence in its entirety. Figure<br />
7.7 shows this sentence output in the UniArab system.<br />
Figure 7.7: Verb ‘<strong>to</strong> be’ 1<br />
108
Table 7.7: Test : Verb ‘<strong>to</strong> be’ 2<br />
<strong>Arabic</strong> ֓anā ālmhnds<br />
human-translated I am the engineer.<br />
Google I Engineer<br />
Microsoft i am engineer<br />
UniArab I am the engineer.<br />
7.2. SENTENCE TESTS<br />
In Table 7.6, the output of the Google transla<strong>to</strong>r is faulty in the verb ‘<strong>to</strong> be’ and defini-<br />
tion. Microsoft’s MT successfully translated the verb ‘<strong>to</strong> be’, it is faulty in the definition<br />
only. UniArab successfully translates the sentence in its entirety. Figure 7.8 shows this<br />
sentence output in the UniArab system.<br />
Figure 7.8: Verb ‘<strong>to</strong> be’ 2<br />
109
7.2.4 Verb ‘<strong>to</strong> have’<br />
<strong>Arabic</strong><br />
Table 7.8: Test : Verb ‘<strong>to</strong> have’ 1<br />
<br />
human-translated I have made a reservation.<br />
Google I have made a reservation<br />
Microsoft You have a booking<br />
UniArab I have made a reservation.<br />
lqd qmt bālh. ˇgz<br />
7.2. SENTENCE TESTS<br />
In Table 7.8, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT is faulty<br />
in person. UniArab successfully translates the sentence in its entirety. Figure 7.9 shows<br />
this sentence output in the UniArab system.<br />
Figure 7.9: Verb ‘<strong>to</strong> have’ 1<br />
110
Table 7.9: Test : Verb ‘<strong>to</strong> have’ 2<br />
<strong>Arabic</strong> lqd fqdt td ¯ krty<br />
human-translated I have lost my ticket.<br />
Google I’ve lost my ticket<br />
Microsoft i have lost td¯ krty<br />
UniArab I have lost my ticket.<br />
7.2. SENTENCE TESTS<br />
In Table 7.9, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT is faulty<br />
in the object word. UniArab successfully translates the sentence in its entirety. Figure<br />
7.10 shows this sentence output in the UniArab system.<br />
Figure 7.10: Verb ‘<strong>to</strong> have’ 2<br />
111
7.2.5 Free word order<br />
7.2. SENTENCE TESTS<br />
Here we show three <strong>Arabic</strong> sentences with different word order which translate <strong>to</strong> the<br />
same <strong>English</strong> output.<br />
Table 7.10: Test : Free word order (Verb Noun Noun scenario one)<br />
<strong>Arabic</strong> yh. b qys lylā<br />
human-translated Qays loves Laila<br />
Google Qais likes of Laila<br />
Microsoft Love Qais laili<br />
UniArab Qays loves Laila<br />
In Table 7.10, the output of the Google transla<strong>to</strong>r is faulty in the verb meaning and the<br />
system added ‘of’ without any meaning in this sentence. Microsoft’s MT translated each<br />
word while ignoring the word order and meaning of the sentence. UniArab success-<br />
fully translates the sentence in its entirety. Figure 7.11 shows this sentence output in the<br />
UniArab system.<br />
Figure 7.11: Free word order (Verb Noun Noun scenario one)<br />
112
7.2. SENTENCE TESTS<br />
Table 7.11: Test : Free word order (Verb Noun Noun scenario two)<br />
<strong>Arabic</strong> yh. b lylā qys<br />
human-translated Qays loves Laila<br />
Google Leila loves measured<br />
Microsoft Love laili Qais<br />
UniArab Qays loves Laila<br />
In Table 7.11, the second ordering possibility is shown. The output of the Google transla-<br />
<strong>to</strong>r is faulty in the ac<strong>to</strong>r, the system can not analyse ‘who does what’, the ac<strong>to</strong>r is Qais but<br />
the output makes the object the subject. Microsoft’s transla<strong>to</strong>r translates each word while<br />
ignores the word order and the meaning of the sentence. It also makes the object the<br />
subject. UniArab successfully translates the sentence in its entirety. Figure 7.12 shows<br />
this sentence output in the UniArab system.<br />
Figure 7.12: Free word order (Verb Noun Noun scenario two)<br />
113
7.2. SENTENCE TESTS<br />
Table 7.12: Test : Free word order (Verb Noun Noun scenario three)<br />
<strong>Arabic</strong> qys yh. b lylā<br />
human-translated Qays loves Laila<br />
Google Qais likes of Laila<br />
Microsoft Qais love laili<br />
UniArab Qays loves Laila<br />
Table 7.12 shows the third possible sentence order. The output of the Google transla<strong>to</strong>r<br />
is faulty in verb meaning and adds an extra ‘of’ which does not carry any meaning.<br />
Microsoft’s MT translates each word while ignoring the word order, tense and meaning<br />
of the sentence. UniArab successfully translates the sentence in its entirety. Figure 7.13<br />
shows this sentence output in the UniArab system.<br />
Figure 7.13: Free word order (Verb Noun Noun scenario three)<br />
114
7.2.6 Pro-drop<br />
<strong>Arabic</strong><br />
Table 7.13: Test: Pro-drop<br />
fāttny ālt.ā֓yrh<br />
human-translated I missed the plane.<br />
Google Missed the plane<br />
Microsoft <br />
fāttny plane<br />
UniArab I missed the plane.<br />
7.2. SENTENCE TESTS<br />
Table 7.13 shows an example of a pro–drop sentence. The output of the Google trans-<br />
la<strong>to</strong>r is faulty in the point of pro-drop; the system did not find the subject. Microsoft’s<br />
MT did not recognize the important word in the sentence and passed it through <strong>to</strong> the<br />
output. UniArab successfully translates the sentence in its entirety. Figure 7.14 shows<br />
this sentence output in the UniArab system.<br />
Figure 7.14: Pro-drop<br />
115
7.2.7 Transitivity of verbs<br />
7.2. SENTENCE TESTS<br />
In <strong>Arabic</strong> and <strong>English</strong>, we can classify verbs as both intransitive, transitive and ditransi-<br />
tive.<br />
7.2.7.1 Intransitive<br />
Table 7.14: Test : Intransitive 1<br />
<strong>Arabic</strong> s. hyb yqr֓a<br />
human-translated Suhaib reads.<br />
Google Suhaib read<br />
Microsoft suhaib reads<br />
UniArab Suhaib reads.<br />
Table 7.14 shows an example of an intransitive sentence. The output of the Google trans-<br />
la<strong>to</strong>r is faulty in tense. Microsoft’s and UniArab transla<strong>to</strong>rs successfully. Both systems<br />
are translate the sentence in its entirety. Figure 7.15 shows this sentence output in the<br />
UniArab system.<br />
Figure 7.15: Intransitive<br />
116
Table 7.15: Test : Intransitive 2<br />
<strong>Arabic</strong> s. hyb yqr֓a kt ¯ yrā<br />
human-translated Suhaib reads a lot.<br />
Google Souhaib read a lot<br />
Microsoft suhaib reads much<br />
UniArab Suhaib reads a lot.<br />
7.2. SENTENCE TESTS<br />
Table 7.15 shows an example of an intransitive with an adverb. The output of the Google<br />
transla<strong>to</strong>r has given the wrong tense. Microsoft’s and UniArab transla<strong>to</strong>rs are both suc-<br />
cessful. Both systems are translate the sentence in its entirety, though the Microsoft<br />
output is more <strong>for</strong>mal. Figure 7.16 shows this sentence output in the UniArab system.<br />
Figure 7.16: Intransitive with an adverb<br />
117
7.2.7.2 Transitive<br />
Table 7.16: Test : Transitive<br />
<strong>Arabic</strong> ys. lh. mārk ālh. āswb<br />
human-translated Mark is fixing the computer.<br />
Google Mark works computer<br />
Microsoft Marc computer works<br />
UniArab Mark is fixing the computer.<br />
7.2. SENTENCE TESTS<br />
In Table 7.16, the output of the Google and Microsoft’s transla<strong>to</strong>rs are faulty in the verb<br />
‘<strong>to</strong> be’ and the meaning of the verb. UniArab successfully translates the sentence in its<br />
entirety. Figure 7.17 shows this sentence output in the UniArab system.<br />
Figure 7.17: Transitive<br />
118
7.2.7.3 Ditransitive<br />
Table 7.17: Test : Ditransitive 1<br />
<strong>Arabic</strong> <br />
hw ֓a֒t.ā h ˘ āld ktāb<br />
human-translated He gave Khalid a book.<br />
Google Khaled was given a book<br />
Microsoft is given Khaled book<br />
UniArab He gave Khalid a book.<br />
7.2. SENTENCE TESTS<br />
Table 7.17 shows an example of a ditransitive. The output of the Google has been given<br />
in the passive tense. Microsoft’s transla<strong>to</strong>r gives an incorrect output. UniArab success-<br />
fully translates the sentence in its entirety. Figure 7.18 shows this sentence output in the<br />
UniArab system.<br />
Figure 7.18: Ditransitive 1<br />
119
Table 7.18: Test : Ditransitive with 2 NP<br />
<strong>Arabic</strong><br />
human-translated Mark is showing Adam the letter.<br />
Google Mark Adam see the letter.<br />
Microsoft Marc finds Adam message.<br />
UniArab Mark is showing Adam the letter.<br />
7.2. SENTENCE TESTS<br />
mārk yry ֓ādm ālrsālh<br />
Table 7.18 shows an example of another ditransitive. The output of the Google transla<strong>to</strong>r<br />
is faulty in determining the ac<strong>to</strong>r, the system can not analyse who does what. Microsoft’s<br />
transla<strong>to</strong>r is faulty in the meaning of the verb and in sentence meaning. UniArab suc-<br />
cessfully translates the sentence in its entirety. Figure 7.19 shows this sentence output in<br />
the UniArab system.<br />
Figure 7.19: Ditransitive with 2 NP<br />
120
Table 7.19: Test : Ditransitive with preposition<br />
<strong>Arabic</strong> <br />
֒mr ֓a֒t.ā lh ˘ āld ktāb<br />
human-translated Omar gave a book <strong>to</strong> Khalid.<br />
Google Omar Khaled gave a book<br />
Microsoft Omar gave Khalid book<br />
UniArab Omar gave a book <strong>to</strong> Khalid.<br />
7.2. SENTENCE TESTS<br />
Table 7.19 shows an example of ditransitive. The output of the Google transla<strong>to</strong>r is<br />
faulty in determining the ac<strong>to</strong>r, the system can not analyse who does what. Microsoft’s<br />
transla<strong>to</strong>r is faulty loosing the definite article and sentence meaning. UniArab success-<br />
fully translates the sentence in its entirety. Figure 7.20 shows this sentence output in the<br />
UniArab system.<br />
Figure 7.20: Ditransitive with preposition<br />
121
7.2.8 Limitation of UniArab<br />
Table 7.20: Test : Limitation of UniArab<br />
<strong>Arabic</strong> snsāfr ˙gdā ֓ilā ms. r<br />
human-translated We will travel <strong>to</strong> Egypt <strong>to</strong>morrow<br />
Google Tomorrow he travels <strong>to</strong> Egypt<br />
Microsoft snsāfr on <strong>to</strong> Egypt<br />
UniArab<br />
7.2. SENTENCE TESTS<br />
Table 7.20 shows another example of a pro–drop sentence. The output of the Google<br />
transla<strong>to</strong>r is faulty on the point of pro-drop; the system did not find the subject. We found<br />
that Microsoft’s MT did not recognize the important word in the sentence and passed it<br />
through <strong>to</strong> the output. UniArab fails <strong>to</strong> give a translation, because this structure does not<br />
exist in the system. Since RRG is built upon the logical structure, when an unknown<br />
structure is encountered, it cannot produce an output, even if some of the words are in<br />
the lexicon. Figure 7.21 shows this sentence output in the UniArab system.<br />
Figure 7.21: Limitation of UniArab 1<br />
122
7.2. SENTENCE TESTS<br />
In a case where a word is not available in the lexicon, but the logical structure is recog-<br />
nised, UniArab will output a correctly structured translation, but with the unknown Ara-<br />
bic word in its position within the <strong>English</strong> sentence. This makes the system resilient<br />
<strong>to</strong> slight misspellings which can be recognised and corrected. Figure 7.22 shows this<br />
sentence output in the UniArab system.<br />
Figure 7.22: Limitation of UniArab 2<br />
123
7.2. SENTENCE TESTS<br />
Table 7.21: Test : Limitation of UniArab 3 using non existing nonsense word<br />
<strong>Arabic</strong> h˘ āld ksr d. s. tqf˙g ¯<br />
human-translated Khaled broke ( d. s. tqf˙g this word is not an <strong>Arabic</strong> word).<br />
¯<br />
Google Khalid break Dsthagafg<br />
Microsoft Khaled d. s. tqf˙g break<br />
¯<br />
UniArab Khaled broke <br />
d. s. t ¯ qf˙g<br />
In Table 7.21, we show how the system responds <strong>to</strong> an unknown word. We have put in a<br />
non-word in the <strong>Arabic</strong> sentence. The output of the Google and Microsoft’s transla<strong>to</strong>rs<br />
are faulty in the verb. Microsoft’s transla<strong>to</strong>r put the unknown word in the wrong position.<br />
Google transliterates the word and puts it in the correct position. UniArab successfully<br />
translates the verb and puts the unknown word in the correct position. Figure 7.23 shows<br />
this sentence output in the UniArab system.<br />
Figure 7.23: Limitation of UniArab 3<br />
124
7.3 System evaluation<br />
7.3. SYSTEM EVALUATION<br />
UniArab supports simple sentences with one or two arguments or compounds. We have<br />
a number of sentences that were created <strong>to</strong> aid the grammar in terms of coverage of basic<br />
<strong>Arabic</strong> sentence structures. Further research should be conducted <strong>to</strong> incorporate more<br />
stages in<strong>to</strong> UniArab.<br />
UniArab is based on RRG, and the logical structure of a sentence is the key piece of<br />
in<strong>for</strong>mation <strong>for</strong> producing a translation. The system is programmed <strong>to</strong> be capable of<br />
dealing with specific structures. Once a structure is enabled within the system, the only<br />
limit on translating sentences of that structure is the coverage of the vocabulary. Hence, if<br />
a specific sentence structure exists with in UniArab, any sentence of that structure can be<br />
translated. This is a strength of being RRG-based, since the structure and vocabulary are<br />
dealt with independently, and vocabulary is more straight<strong>for</strong>ward <strong>to</strong> improve. The struc-<br />
ture is also independent of issues of gender and tense, which are only considered once<br />
a structure has been assigned, <strong>to</strong> determine who does what. As we develop UniArab,<br />
adding further structures increases the coverage by a considerable amount. However, as<br />
the number of structures increases, word ambiguity will become a bigger issue.<br />
UniArab uses an XML-<strong>for</strong>matted data-source as its lexicon. The key strength is that<br />
this data source is open, and can be used under any operating system, and accessed us-<br />
ing different <strong>to</strong>ols and languages. The search engine we use <strong>to</strong> access the data source is<br />
able <strong>to</strong> deal with <strong>Arabic</strong> words which translate in<strong>to</strong> multiple-word <strong>English</strong> phrases. For<br />
example, ֓aryd in <strong>Arabic</strong> translates <strong>to</strong> I want. However, in its current state, we<br />
cannot find single entries that consist of multiple <strong>Arabic</strong> words. For example <br />
<br />
ˇsbāk ālh. ˇgz in <strong>Arabic</strong> translates <strong>to</strong> counter in <strong>English</strong>; the system cannot deal with this<br />
125
7.3. SYSTEM EVALUATION<br />
yet. Another example is ֒mr yl֒b ywm ālsbt which means Omar<br />
plays on Saturday. The structure of this sentence works in the system, however, since the<br />
two words ywm ālsbt translate <strong>to</strong> a single <strong>English</strong> word, Saturday, and the<br />
search engine cannot deal with this now. This also affects idioms and bigrams. We can<br />
overcome this issue by modifying the database and search algorithm. For each n-gram<br />
phrase, we index it by the first word. When we search the database <strong>for</strong> this <strong>to</strong>ken, we get<br />
the single word translation and any n-grams starting with the word. Then we can check<br />
the sentence <strong>to</strong> see if these n-grams are matched.<br />
Another limitation exists because we are not yet dealing with ambiguity. A word like<br />
֒lm can have many different meanings in <strong>Arabic</strong>, <strong>for</strong> example, it can mean flag,<br />
taught, knowledge or discovered. At the moment, the <strong>Arabic</strong> word might exist <strong>for</strong> differ-<br />
ent parts of speech. The search extracts all of them, but we only use the first one returned.<br />
Once we deal with ambiguity, we will have <strong>to</strong> analyse the different results, looking at the<br />
sentence structures <strong>to</strong> decide which translation <strong>to</strong> use.<br />
Since our system is based on RRG, the logical structure of a sentence is the basis of<br />
the translation. This was very useful, since it allows the system <strong>to</strong> deal with issues that<br />
can be complicated, like free-word-order, and determining the ac<strong>to</strong>r and undergoer. The<br />
lexicon used in UniArab can be refined further, and we would like <strong>to</strong> do this in further<br />
research. At the moment, the lexicon contains entries <strong>for</strong> single <strong>Arabic</strong> words, which can<br />
in some cases translate <strong>to</strong> clauses in <strong>English</strong>. For example, qlmy translates <strong>to</strong> my<br />
pen. The y at the end of the <strong>Arabic</strong> word is the possessive my in <strong>English</strong>. Similarly,<br />
bālqlm translates <strong>to</strong> with the pen, the <br />
b translates <strong>to</strong> with (or using), the āl<br />
is the definite article the. Finally, bqlmy translates <strong>to</strong> with my pen. In the future,<br />
it makes sense <strong>to</strong> simplify the lexicon by including only the basic noun, and allowing the<br />
126
7.3. SYSTEM EVALUATION<br />
search engine <strong>to</strong> extract these extra modifiers. Hence, we need only a single entry <strong>for</strong> the<br />
basic noun, rather than an entry <strong>for</strong> each possible occurrence. This reduces the size of<br />
the lexicon, and hence the speed of the search routine.<br />
At the moment, the lexicon is categorised in<strong>to</strong> seven parts of speech. We have designed<br />
the GUI so that when adding a specific word <strong>to</strong> the lexicon, only the related options are<br />
presented <strong>to</strong> the user <strong>for</strong> that part of speech. This minimises errors when entering data.<br />
As our research extends, we may need <strong>to</strong> modify the categorisation of the lexicon <strong>to</strong> allow<br />
<strong>for</strong> more complicated word types.<br />
UniArab does not process ambiguous words or complex sentences, so far, in this research.<br />
This research focussed first on discovering whether the logical structure of a sentence,<br />
based on RRG can be used <strong>for</strong> translation. Hence, we decided <strong>to</strong> limit the scope of the<br />
project, since this is work in a new area, that has not been investigated be<strong>for</strong>e. We fully<br />
expect <strong>to</strong> expand the system <strong>to</strong> allow it <strong>to</strong> cope with ambiguity in the future. The system’s<br />
reliability depends on the data source and fails <strong>to</strong> handle unknown words. UniArab does<br />
not process single words, even if those words are in its lexicon, because UniArab is built<br />
on the logical structure of verbs.<br />
In our comparison with other translation systems we have used simplex sentences. While<br />
UniArab is limited <strong>to</strong> simplex sentences and has limited coverage, we believe it is essen-<br />
tial <strong>to</strong> reach high quality translation of these sentences first, in order <strong>to</strong> be able <strong>to</strong> expand<br />
<strong>to</strong> high quality translations of more complex sentences. We can see that the existing <strong>to</strong>ols<br />
cannot even achieve reasonable translations of simplex sentences, so how can we expect<br />
them <strong>to</strong> give high quality translations of larger text? We have found that small errors in<br />
the initial analysis of a sentence can cause huge errors in the final translation, so high<br />
quality analysis is very important.<br />
127
7.4 Summary<br />
7.4. SUMMARY<br />
In this chapter we subjected the UniArab System <strong>to</strong> a series of tests in a wide range of<br />
sentence categories. For each test we compared the results obtained through UniArab<br />
<strong>to</strong> those obtained when using translation engines from Google and Microsoft. We also<br />
presented a human-translated equivalent <strong>to</strong> each. In contrast, the Google and Microsoft<br />
transla<strong>to</strong>rs gave mixed results. In many cases, sentence meaning was lacking, and even<br />
some basic constructs could not be translated. Perhaps this is due <strong>to</strong> their focus on<br />
translating long sentences and paragraphs. We highlighted this by comparing them <strong>to</strong><br />
UniArab <strong>for</strong> longer compound sentences and found that they did indeed convey more<br />
of the meaning. These results suggest that RRG is a promising candidate <strong>for</strong> <strong>Arabic</strong> <strong>to</strong><br />
<strong>English</strong> Machine translation, and as the grammar is developed, the system should begin<br />
<strong>to</strong> cope with more complicated sentences. For simplex sentences (intransitive, transitive<br />
and ditransitive) it clearly outper<strong>for</strong>ms existing systems.<br />
128
Now is not the end.<br />
It is not the beginning of the end.<br />
It is perhaps, the end of the beginning.<br />
Wins<strong>to</strong>n Churchill<br />
8<br />
Conclusion<br />
In this thesis we have presented an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation system called<br />
UniArab, which is based on the Role and Reference Grammar model. We detailed the<br />
design of the system and how it was built <strong>to</strong> accommodate specifics of the <strong>Arabic</strong> lan-<br />
guage and the generation of <strong>English</strong> translations.<br />
We started with the goal of designing a <strong>machine</strong> translation system that could show<br />
whether we can extract logical structure from <strong>Arabic</strong> sentences using RRG, and use this<br />
<strong>to</strong> produce high quality translations in<strong>to</strong> <strong>English</strong>. We believe the results shown in Chap-<br />
ter 7 show that our system has proved this, and that our method is more robust <strong>for</strong> these<br />
cases, than other MT systems. There are still a number of areas which need <strong>to</strong> be de-<br />
veloped <strong>for</strong> UniArab <strong>to</strong> achieve more coverage, and we believe that we can build on the<br />
work we have done so far.<br />
Since the logical structure is separate from the vocabulary, when we focus on giving<br />
129
the system capability <strong>to</strong> deal with a large number of structure variations, it becomes sig-<br />
nificantly more powerful since each structure represents all possible sentences of that<br />
structure regardless of the specific words included. This is a significant point, since vo-<br />
cabulary is easy <strong>to</strong> develop, but structure requires much more ef<strong>for</strong>t.<br />
The major challenge we faced was <strong>to</strong> use RRG within a <strong>machine</strong> translation system.<br />
UniArab is the first MT system that uses RRG.In the <strong>Arabic</strong> linguistic tradition there is<br />
not a clear-cut, well-defined analysis of the inven<strong>to</strong>ry of parts of speech in <strong>Arabic</strong>. We<br />
found that the existent classifications were not suitable, and so we had <strong>to</strong> create a clas-<br />
sification that made sense <strong>for</strong> RRG-based translation. We were able <strong>to</strong> extract logical<br />
structures that made sense from natural <strong>Arabic</strong>. And we were also able <strong>to</strong> generate En-<br />
glish translations from this logical structure.<br />
Some specific challenges included dealing with the absence of the copula verb, ‘<strong>to</strong> be’,<br />
in <strong>Arabic</strong>. To solve this, we had <strong>to</strong> look at some <strong>Arabic</strong> sentence which do not contain<br />
verbs, and correctly deduce how <strong>to</strong> extract the copula. Free word order was another chal-<br />
lenge due <strong>to</strong> its widespread presence in <strong>Arabic</strong>, and this was solved by detailed analysis<br />
of the source language and incorporating this in the logical structure.<br />
We have discoverd that RRG is a realistic basis <strong>for</strong> <strong>machine</strong> translation systems. The<br />
use of a sentence’s logical structure <strong>to</strong> create translations is robust and gives high quality<br />
translations which can deal with some of the challenges of languages like <strong>Arabic</strong>.<br />
Our work has contributed the first <strong>machine</strong> translation system based on RRG, which<br />
we have used <strong>to</strong> prove its effectiveness <strong>for</strong> MT. This was a major challenge as we had<br />
little work <strong>to</strong> refer <strong>to</strong>. We have also advanced work on <strong>Arabic</strong> language classification,<br />
130
8.1. THESIS SUMMARY<br />
and so we hope our work will be the beginning of more work in this arena. We believe<br />
this serves as an excellent foundation <strong>for</strong> further research in the area.<br />
While statistical <strong>machine</strong> translation has been promoted by many, we believe that lan-<br />
gauges, expecially rich languages like <strong>Arabic</strong>, are very organised and structural, and<br />
such approaches cannot correctly deal with the wide variety of sentence structures. When<br />
these systems cannot deal with simplex sentences, as we have shown in Chapter 7, how<br />
do we expect them <strong>to</strong> correctly translate whole paragraphs? In our approach, we expect<br />
that high quality translations of simplex sentences are the only basis which can build <strong>to</strong><br />
good translations of whole paragraphs. The results we have presented are the first step<br />
in applying RRG <strong>to</strong> sophisticated translation. By focussing in this initial stage on the<br />
basics, we build a more solid foundation <strong>for</strong> the next stage.<br />
8.1 Thesis summary<br />
In Chapter 2, we presented a summary overview of the grammatical structure of the Ara-<br />
bic language. We detailed various sentence structures as well as unique word attributes<br />
like gender rules applied <strong>to</strong> all words and duality in number. We discussed how some of<br />
these properties could be used <strong>to</strong> extract in<strong>for</strong>mation about sentence structure.<br />
In Chapter 3, we presented the Role and Reference Grammar model, and showed how it<br />
could be used <strong>to</strong> deduce the logical structure of sentences and produce a lexical represen-<br />
tation which could be used as the interlingua.<br />
In Chapter 4, we presented various approaches <strong>to</strong> <strong>machine</strong> translation. We compared<br />
direct translation, transfer systems and interlingua systems and showed how interlingua<br />
systems require significantly more ef<strong>for</strong>t in the analysis and generation stages, but have<br />
131
8.2. SUMMARY OF THESIS CONTRIBUTIONS<br />
a distinct advantage in the simplicity of the translation process. Furthermore, they are<br />
more flexible in terms of adding extra languages. We also talked about the challenges of<br />
<strong>machine</strong> translation, with a specific focus on those specific <strong>to</strong> the <strong>Arabic</strong> language.<br />
In Chapter 5, we presented a high-level view of the system <strong>framework</strong> and defined our<br />
evaluation criteria <strong>for</strong> measuring system per<strong>for</strong>mance.<br />
In Chapter 6, we detailed the technical aspects of UniArab, covering all the phases in-<br />
volved in the <strong>machine</strong> translation process. We described the lexical system that underlies<br />
UniArab, detailing the attribute in<strong>for</strong>mation held <strong>for</strong> each type of word. We discussed the<br />
generation phase and how the system maps the logical structure <strong>to</strong> a target <strong>English</strong> sen-<br />
tence. We then briefly discussed the user interface, and some of the technical challenges<br />
encountered during the implementation.<br />
In Chapter 7, we presented the results of our evaluation of UniArab <strong>for</strong> a wide vari-<br />
ety of sentence types. We compared its results <strong>to</strong> those of the Google and Microsoft<br />
transla<strong>to</strong>rs as well as human translation. We found that it significantly outper<strong>for</strong>ms the<br />
other au<strong>to</strong>mated translation systems, matching human translation. We discussed its limits<br />
in regards <strong>to</strong> complex sentence structures.<br />
8.2 Summary of thesis contributions<br />
This thesis contributions are summarised as follows:<br />
• A detailed presentation of the structure of <strong>Arabic</strong> sentences and a discussion of the<br />
language’s unique features.<br />
• A detailed system <strong>framework</strong> <strong>for</strong> implementing RRG <strong>machine</strong> translation <strong>for</strong> <strong>Arabic</strong><br />
132
and proving the suitability of the model.<br />
8.3. FUTURE WORK<br />
• A detailed technical implementation of an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> transla<strong>to</strong>r<br />
based on the RRG model, including user interface and a cus<strong>to</strong>m designed, extensible<br />
data source.<br />
• An evaluation of the translation system and comparison <strong>to</strong> existing commercial sys-<br />
tems.<br />
• Specifying verb ‘<strong>to</strong> be’, free word order, pro-drop and transitivity of verbs.<br />
8.3 Future work<br />
Given the scope of this Masters research project, there are a number of areas where this<br />
work could be extended. Firstly, the question of ambiguity is very interesting. We feel<br />
that RRG is suited <strong>to</strong> overcoming word ambiguity by using sentence structure, and would<br />
like <strong>to</strong> explore this. We would also like <strong>to</strong> incorporate more compound structures allow-<br />
ing UniArab <strong>to</strong> deal with more complex sentences. We would also like <strong>to</strong> explore the au<strong>to</strong><br />
generation of lexicon in<strong>for</strong>mation from <strong>Arabic</strong> source verbs as a way <strong>to</strong> quickly populate<br />
the lexical source.<br />
The main <strong>to</strong>pic of investigation is the development of a <strong>framework</strong> <strong>for</strong> translating <strong>Arabic</strong><br />
<strong>to</strong> <strong>English</strong> based on RRG. The <strong>framework</strong> is designed <strong>to</strong> demonstrate the capabilities of<br />
RRG as a base <strong>for</strong> <strong>machine</strong> translation of <strong>Arabic</strong> in<strong>to</strong> <strong>English</strong> using an interlingua bridge<br />
strategy. This thesis showed that RRG facilitates the translation process from a specific<br />
language <strong>to</strong> other languages. Future research should focus on:<br />
(1) Enhancing and extending the UniArab system <strong>to</strong> support more natural <strong>Arabic</strong> sen-<br />
tences, and word ambiguity, in particular:<br />
133
8.3. FUTURE WORK<br />
• To understand, process, and translate complex predicates and multi-clause sen-<br />
tences in coordinate, subordinate and cosubordination structures.<br />
• To understand, process and translate voice and valence increasing/decreasing<br />
operations in the <strong>machine</strong> translation of <strong>Arabic</strong>.<br />
• To design a lexicon architecture <strong>to</strong> support the morphological templates <strong>for</strong><br />
<strong>Arabic</strong> words in<strong>to</strong> their respective consonantal and vowel components with the<br />
appropriate word <strong>for</strong>mation rules implemented in software.<br />
• To extend the underlying theory of RRG <strong>to</strong> encompass more fully the lexicon,<br />
syntax and morphology of <strong>Arabic</strong>.<br />
(2) Evaluating UniArab with respect <strong>to</strong> other systems based on non-RRG methods.<br />
134
References<br />
Abn-Aqeal, 2007. sharah Abn-Aqeal ala’lfat Abn-Malek. Dar Al’alem llmlaien.<br />
Al-Sughaiyer, I. A. and Al-Kharashi, I. A., 2004. <strong>Arabic</strong> morphological analysis tech-<br />
niques: A comprehensive survey. JASIST, 55(3):189–213.<br />
Alosh, M., 2005. Using <strong>Arabic</strong>: A Guide <strong>to</strong> Contemporary Usage. Cambridge University<br />
Press.<br />
Arciniegas, F., 2000. XML Developer’s Guide. McGraw-Hill.<br />
Attia, M., 2004. Report on the introduction of <strong>Arabic</strong> <strong>to</strong> ParGram. In Proceedings of<br />
ParGram Fall Meeting, Dublin, Ireland.<br />
Attia, M. A., 2008. Handling <strong>Arabic</strong> Morphological and Syntactic Ambiguity within<br />
the LFG Framework with a View <strong>to</strong> Machine Translation. PhD thesis, University of<br />
Manchester.<br />
Bateson, M. C., 2003. <strong>Arabic</strong> Language Handbook. George<strong>to</strong>wn University Press.<br />
Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., and Yergeau, F., 2008. Extensi-<br />
ble Markup Language (XML) 1.0 (Fifth Edition).<br />
Brown, P., Pietra, S. D., Pietra, V. D., and Mer-cer., R., 1993. The mathematics of statisti-<br />
cal <strong>machine</strong> translation: Parameter estimation. In Computational Linguistics, volume<br />
19 i2, pages 263–311.<br />
Google, 2009. Google transla<strong>to</strong>r. http://translate.google.com.<br />
Holes, C., 2004. Modern <strong>Arabic</strong>: Structures, Functions, and Varieties. George<strong>to</strong>wn<br />
University Press.<br />
Hutchins, J., 2003. Machine Translation: General Overview. in : book, Ox<strong>for</strong>d Univer-<br />
sity Press.<br />
135
REFERENCES<br />
Hutchins, J. and Somers, H. L., 1992. An introduction <strong>to</strong> <strong>machine</strong> translation. Academic<br />
Press.<br />
ibn Abd Allah Ibn Malik, M., 1984. Alfiyat Ibn Malik. Maktabat al-Adab.<br />
Intelligence, A. B., 2004. Abi research. http://www.abiresearch.com/abiprdisplay2.jsp?pressid=116.<br />
Izwaini, S., 2006. Problems of <strong>Arabic</strong> <strong>machine</strong> translation: evaluation of three systems.<br />
In The British Computer Society (BSC), London ., pages 118–148.<br />
Khan, M. A. S., 2007. <strong>Arabic</strong> Tu<strong>to</strong>r - Volume One, volume one. Madrasah In’aamiyyah.<br />
Laoudi, J., Tate, C., and Voss, C., 2004. Towards an au<strong>to</strong>mated evaluation of an embed-<br />
ded mt system. In roceedings of the European Association <strong>for</strong> Machine Translation<br />
Workshop, Malta.<br />
Microsoft, 2009. Microsoft transla<strong>to</strong>r. http://www.windowslivetransla<strong>to</strong>r.com/Default.aspx.<br />
Nolan, B. and Salem, Y., 2009. UniArab: An RRG <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> Machine Trans-<br />
lation Software. In Proceedings of The 2009 International Conference on Role and<br />
Reference Grammar, University of Cali<strong>for</strong>nia, Berkeley, USA.<br />
Nunes, M., 1993. Argument Linking in <strong>English</strong> Derived Nominals In Van Valin (ed)<br />
Advances in Role and Reference Grammar. John Benjamins.<br />
Oren, T., 2004. Machine translation and the global blogosphere.<br />
http://www.windsofchange.net/archives/006011.html. website.<br />
Owens, J., 2006. A Linguistic His<strong>to</strong>ry of <strong>Arabic</strong>. Ox<strong>for</strong>d University Press.<br />
Ramsay, A. and Mansour, H., 2006. Local constraints on <strong>Arabic</strong> word order. In 5th<br />
International Conference on NLP, FinTAL 2006, pages 447–457.<br />
Ryding, K. C., 2007. A Reference Grammar of Modern Standard <strong>Arabic</strong>. Cambridge<br />
University Press, 3rd edition.<br />
136
REFERENCES<br />
Salem, Y., Hensman, A., and Nolan, B., 2008a. Implementing <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> ma-<br />
chine translation using the Role and Reference Grammar linguistic model. In Proceed-<br />
ings of the Eighth Annual International Conference on In<strong>for</strong>mation Technology and<br />
Telecommunication (IT&T 2008), Galway, Ireland.<br />
Salem, Y., Hensman, A., and Nolan, B., 2008b. Towards <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong><br />
translation. ITB Journal, 17:20–31.<br />
Salem, Y. and Nolan, B., 2009a. An <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> <strong>machine</strong> translation system using<br />
an XML-based Role and Reference Grammar representation. In Proceedings of the<br />
23rd Annual Symposium on <strong>Arabic</strong> Linguistics, University of Wisconsin-Milwaukee.<br />
Salem, Y. and Nolan, B., 2009b. Designing an XML lexicon architecture <strong>for</strong> <strong>Arabic</strong><br />
<strong>machine</strong> translation based on Role and Reference Grammar. In Proceedings of the 2nd<br />
International Conference on <strong>Arabic</strong> Language Resources and Tools (MEDAR 2009),<br />
Cairo, Egypt.<br />
Salem, Y. and Nolan, B., 2009c. UNIARAB: An universal <strong>machine</strong> transla<strong>to</strong>r system<br />
<strong>for</strong> <strong>Arabic</strong> Based on Role and Reference Grammar. In Proceedings of the 31st Annual<br />
Meeting of the Linguistics Association of Germany (DGfS 2009).<br />
Schulz, E., 2005. A Student Grammar of Modern Standard <strong>Arabic</strong>. Cambridge University<br />
Press.<br />
Trujillo, A., 1999. Translation Engines: Techniques <strong>for</strong> Machine Translation. Springer.<br />
Turian, J. P., Shen, L., and Melamed, I. D., 2003. Evaluation of <strong>machine</strong> translation<br />
and its evaluation. In Proceedings of the MT Summit IX, New Orleans, USA, pages<br />
386–393.<br />
Van Valin, R., 1993. Advances in Role and Reference Grammar. John Benjamins.<br />
137
REFERENCES<br />
Van Valin, R., 2007. The Role and Reference Grammar analysis of three place predicates.<br />
CEEOL Contemporary Linguistics, 63:31–63.<br />
Van Valin, R. and LaPolla, R., 1997. Syntax: Structure, Meaning, and Function. Cam-<br />
bridge University Press.<br />
Versteegh, K., 2001. The <strong>Arabic</strong> Language. Edinburgh University Press; New Ed edition.<br />
138
Appendix<br />
139
A<br />
The author’s publications related <strong>to</strong> this research<br />
• Brian Nolan and Yasser Salem. 2009. “UniArab: An RRG <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> Ma-<br />
chine Translation Software”, in Proceedings of The 2009 International Conference<br />
on Role and Reference Grammar, University of Cali<strong>for</strong>nia, Berkeley, USA, August<br />
2009.<br />
• Yasser Salem and Brian Nolan. 2009. “Designing an XML Lexicon Architecture<br />
<strong>for</strong> <strong>Arabic</strong> Machine Translation Based on Role and Reference Grammar”, in Pro-<br />
ceedings of the 2nd International Conference on <strong>Arabic</strong> Language Resources and<br />
Tools (MEDAR 2009), Cairo, Egypt, April 2009.<br />
• Yasser Salem and Brian Nolan, 2009. “An <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> Machine transla-<br />
tion system using an XMLbased Role and Reference Grammar representation”, in<br />
Proceedings of the 23rd Annual Symposium on <strong>Arabic</strong> Linguistics, University of<br />
Wisconsin-Milwaukee, USA, April 2009.<br />
140
• Yasser Salem and Brian Nolan. 2009. “ UNIARAB: An Universal Machine Trans-<br />
la<strong>to</strong>r System For <strong>Arabic</strong> Based On Role And Reference Grammar”, in Proceedings<br />
of the 31st Annual Meeting of the Linguistics Association of Germany (DGfS 2009),<br />
University of Osnabruck, Germany, March 2009.<br />
• Yasser Salem, Arnold Hensman and Brian Nolan. 2008. Implementing <strong>Arabic</strong>-<br />
<strong>to</strong>-<strong>English</strong> Machine Translation using the Role and Reference Grammar Linguistic<br />
Model, in Proceedings of the Eighth Annual International Conference on In<strong>for</strong>-<br />
mation Technology and Telecommunication (ITT 2008), Galway, Ireland, Oc<strong>to</strong>ber<br />
2008. (Runner-up <strong>for</strong> Best Paper Award)<br />
• Yasser Salem, Arnold Hensman and Brian Nolan. 2008. “Towards <strong>Arabic</strong> <strong>to</strong> En-<br />
glish Machine Translation”, ITB Journal, May 2008, Issue No. 17: 20-31.<br />
141
B<br />
Buckwalter <strong>Arabic</strong> transliteration<br />
1<br />
<strong>Arabic</strong> Letter and Phonetic Value<br />
ā<br />
Letter Name<br />
ALEF<br />
Unicode<br />
u0627<br />
2 b BEH u0628<br />
3 t TEH u062A<br />
4 t THEH u062B<br />
¯<br />
5 ˇg JEEM u062C<br />
6 h. HAH u062D<br />
7 h KHAH u062E<br />
˘<br />
8 d DAL u062F<br />
9<br />
d¯<br />
THAL u0630<br />
10 r REH u0631<br />
11 <br />
z ZAIN u0632<br />
12 s SEEN u0633<br />
13 ˇs SHEEN u0634<br />
142
<strong>Arabic</strong> Letter and Phonetic Value Letter Name Unicode<br />
14 s. SAD u0635<br />
15 d. DAD u0636<br />
16 t. TAH u0637<br />
17 z. ZAH u0638<br />
18 ֒ AIN u0639<br />
<br />
19 ˙g GHAIN u063A<br />
<br />
20 f FEH u0641<br />
21 q QAF u0642<br />
22 k KAF u0643<br />
23 l LAM u0644<br />
24 m MEEM u0645<br />
25 n NOON u0646<br />
26 h HEH ˘0647<br />
27 w WAW u0648<br />
28 y YEH u064A<br />
143
29<br />
<strong>Arabic</strong> Letter and Phonetic Value<br />
֓<br />
Letter Name<br />
HAMZA<br />
Unicode<br />
u0621<br />
30 ֓i ALEF WITH HAMZA UNDER u0625<br />
31<br />
32<br />
<br />
֓a ALEF WITH HAMZA ABOVE u0623<br />
<br />
֓ā ALEF WITH MADDA ABOVE u0622<br />
33 ā YEH u0649<br />
34 ֓y YEH WITH HAMZA ABOVE u0626<br />
35 ֓w WAW WITH HAMZA ABOVE u0624<br />
36 h TEH MARBUTA u0629<br />
<br />
37 a FATHA u064E<br />
<br />
38 u DAMMA u064F<br />
39 i KASRA u0650<br />
40<br />
41<br />
<br />
an TANWIN ALFATH u064B<br />
<br />
un TANWIN ALDAM u064C<br />
42 in TANWIN ALKASER u064D<br />
<br />
43 SKOON u0652<br />
<br />
44 SHADDA u0651<br />
45 ? ARABIC QUESTION MARK u061F<br />
144
C<br />
List of translatable sentences<br />
I want a ring. <br />
֓aryd h ˘ ātm<br />
I <strong>for</strong>got my wallet. nsyt mh. fz . ty<br />
I missed the plane. fāttny ālt.ā֓yrh<br />
֓aryd ˙grfh<br />
I want a room.<br />
I am a <strong>to</strong>urist. ֓anā sā֓yh.<br />
I am alone. ānā wh. dy<br />
I am Irish. ֓anā āyrlndy<br />
<br />
we are students. nh. n tlāmyd<br />
¯<br />
he is an engineer. hw mhnds<br />
145
I am an engineer. ānā mhnds<br />
I am the engineer. ֓anā ālmhnds<br />
Sarah hurts Yousuf. <br />
tˇgrh. sārh ywsf<br />
Sarah hurts Yousuf. <br />
sārh tˇgrh. ywsf<br />
Sarah hurts Yousuf. <br />
tˇgrh. ywsf sārh<br />
Omar will drink the milk. syˇsrb āllbn ֒mr<br />
Omar will drink the milk. syˇsrb ֒mr āllbn<br />
Khalid is drinking the milk. <br />
Khalid drank the milk.<br />
yˇsrb hāld āllbn<br />
˘<br />
ˇsrb hāld āllbn<br />
˘<br />
Omar is visiting Ireland. yzwr ֒mr ֓ayrlndā<br />
Qays loves Laila. qys yh. b lylā<br />
Qays loves Laila. yh. b qys lylā<br />
Qays loves Laila. yh. b lylā qys<br />
Laila loves Qays. lylā th. b qys<br />
Laila loves Qays. th. b lylā qys<br />
Laila loves Qays. th. b qys lylā<br />
Omar read the book. qr֓a ֒mr ālktāb<br />
Brian read the book. <br />
brāyn qr֓a ālktāb<br />
146
she is an engineer.<br />
Zaid loves Fatima.<br />
Zaid loves Fatima.<br />
hy mhndsh<br />
<br />
zyd yh. b fāt.mh<br />
<br />
yh. b zyd fāt.mh<br />
Zaid loves Fatima. <br />
yh. b fāt.mh zyd<br />
Fatima loves Zaid. <br />
fāt.mh th. b zyd<br />
Fatima loves Zaid. th. b fāt.mh zyd<br />
Fatima loves Zaid. th. b zyd fāt.mh<br />
Eman drew her school. <br />
rsmt ֓iymān mdrsthā<br />
Louis hit Mark. d. rb lwys mārk<br />
Louis hit Mark. lwys d. rb mārk<br />
Mark hit Louis. mārk d. rb lwys<br />
Mark hit Louis. d. rb mārk lwys<br />
Brian wrote the book. brāyn ktb ālktāb<br />
Ayesha wrote the book. ֒ā֓yˇsh ktbt ālktāb<br />
Eman wrote the book. <br />
ktbt ֓iymān ālktāb<br />
I have made a reservation. <br />
lqd qmt bālh. ˇgz<br />
I have lost my ticket. lqd fqdt td ¯ krty<br />
I am a doc<strong>to</strong>r. ֓anā t.byb<br />
147
Abbas is playing with the ball.<br />
Abbas is playing with the ball.<br />
yl֒b ֒bās bālkrh<br />
֒bās yl֒b bālkrh<br />
bālkrh yl֒b ֒bās<br />
Abbas is playing with the ball. <br />
Yousuf played with the ball. <br />
l֒b ywsf bālkrh<br />
Yousuf played with the ball. <br />
l֒b bālkrh ywsf<br />
Yousuf played with the ball. <br />
bālkrh l֒b ywsf<br />
Yousuf will play with the ball.<br />
Yousuf will play with the ball.<br />
Yousuf will play with the ball.<br />
Essam played with the spoon.<br />
Essam will play with the spoon.<br />
Essam is playing with the spoon.<br />
Essam played with the spoons.<br />
Essam will play with the spoons.<br />
Essam is playing with the spoons.<br />
Mansour ate with the spoon.<br />
Mansour ate with the spoon.<br />
<br />
<br />
syl֒b ywsf bālkrh<br />
<br />
ywsf syl֒b bālkrh<br />
<br />
bālkrh syl֒b ywsf<br />
l֒b ֒s. ām bālml֒qh<br />
syl֒b ֒s. ām bālml֒qh<br />
yl֒b ֒s. ām bālml֒qh<br />
l֒b ֒s. ām bālmlā֒q<br />
syl֒b ֒s. ām bālmlā֒q<br />
yl֒b ֒s. ām bālmlā֒q<br />
֓akl mns.wr bālml֒qh<br />
mns.wr ֓akl bālml֒qh<br />
Mansour ate with the spoon. bālml֒qh ֓akl mns.wr<br />
148
Jack is eating with the spoon. y֓akl ˇgāk bālml֒qh<br />
Jack will eat with the spoon. sy֓akl ˇgāk bālml֒qh<br />
Jack killed Mary. qtl ˇgāk māry<br />
Jack killed Mary. ˇgāk qtl māry<br />
Mary killed Jack. qtlt māry ˇgāk<br />
Mary killed Jack. <br />
māry qtlt ˇgāk<br />
Jack killed the man. qtl ˇgāk ālrˇgl<br />
Jack killed the man. ˇgāk qtl ālrˇgl<br />
The man killed Jack. ālrˇgl qtl ˇgāk<br />
The man killed Jack. qtl ālrˇgl ˇgāk<br />
Suhaib bellowed the fire. s. hyb nfh ˘ ālnār<br />
Suhaib bellowed the fire. nfh ˘ ālnār s. hyb<br />
Suhaib bellowed the fire. nfh s. hyb ālnār<br />
˘<br />
Sulaiman opened the door. fth. ālbāb slymān<br />
Sulaiman opened the door. slymān fth. ālbāb<br />
Sulaiman opened the door. fth. slymān ālbāb<br />
Zaid <strong>to</strong>ok the book. ֓ahd zyd ālktāb<br />
˘ ¯<br />
149
Zaid <strong>to</strong>ok the book. <br />
zyd ֓ahd ālktāb<br />
˘ ¯<br />
Qays <strong>to</strong>ok the files. <br />
֓ahd qys ālmlfāt<br />
˘ ¯<br />
Qays <strong>to</strong>ok the files. <br />
֓ahd qys ālmlfāt<br />
˘ ¯<br />
brāyn yrkb ālh. āflh<br />
Brian rides the bus.<br />
Brian rides the bus.<br />
Fahmy rides the car.<br />
Fahmy rides the car.<br />
yrkb brāyn ālh. āflh<br />
fhmy yrkb ālsyārh<br />
yrkb fhmy ālsyārh<br />
Fahmy rides the car. yrkb ālsyārh fhmy<br />
Khalid answered the question. <br />
h ˘ āld ֓aˇgāb āls֓wāl<br />
Khalid answered the question. ֓aˇgāb āls֓wāl h ˘ āld<br />
Khalid answered the question. ֓aˇgāb hāld āls֓wāl<br />
˘<br />
<br />
<br />
Rashid broke the window.<br />
Rashid broke the window.<br />
rāˇsd ksr ālnāfd ¯ h<br />
<br />
ksr rāˇsd ālnāfd ¯ h<br />
Rashid broke the window. <br />
ksr ālnāfd ¯ h rāˇsd<br />
Khalid broke his <strong>to</strong>y. <br />
ksr h ˘ āld l֒bth<br />
Omar <strong>to</strong>re the book. mzq ֒mr ālktāb<br />
Omar <strong>to</strong>re the book. <br />
֒mr mzq ālktāb<br />
Omar <strong>to</strong>re the book. mzq ālktāb ֒mr<br />
150
Ayah <strong>to</strong>re the page. <br />
֓āyh mzqt ālkys<br />
Ayah <strong>to</strong>re the page. mzqt ālkys ֓āyh<br />
Ayah <strong>to</strong>re the page. mzqt ֓āyh ālkys<br />
Omar opened his present. fth. ֒mr hdyth<br />
<br />
fth. ֒mr ālnāfdh ¯<br />
Omar opened the window.<br />
Omar opened the door. ֒mr fth. ālbāb<br />
James combed his hair. mˇst. ˇgyms ˇs֒rh<br />
James combed his hair. ˇgyms mˇst. ˇs֒rh<br />
Eric cleaned the window. <br />
Eric cleaned the plane.<br />
nz . f ֓iyryk ālnāfdh ¯<br />
֓iyryk nz Eric cleaned the house.<br />
. f ālt.ā֓yrh<br />
nz. f ālmnzl ֓iyryk<br />
Sarah wiped the table. msh. t sārh ālt.āwlh<br />
Sarah wiped the table. sārh msh. t ālt.āwlh<br />
Roqiah will cook the dinner. stt.bh ˘ rqyh āl֒ˇsā֓<br />
Roqiah will cook the dinner. rqyh stt.bh ˘ āl֒ˇsā֓<br />
Roqiah will cook the dinner. stt.bh ˘ āl֒ˇsā֓rqyh<br />
Harold pinched James. qrs. hārwld ˇgyms<br />
Harold pinched James. hārwld qrs. ˇgyms<br />
151
he is a doc<strong>to</strong>r. hw t.byb<br />
Ayah is spending her money. ֓āyh tnfq nqwdhā<br />
Ayah is spending her money. tnfq ֓āyh nqwdhā<br />
Henry lost his money. hnry fqd nqwdh<br />
Henry lost his money. fqd hnry nqwdh<br />
Adam punched Philip. lkm ֓ādm fylyb<br />
Zakiah killed Sarah. <br />
qtlt zkyh sārh<br />
Zakiah killed Sarah. <br />
zkyh qtlt sārh<br />
Mark slapped Louis. s. f֒mārk lwys<br />
Mark slapped Louis. mārk s. f֒lwys<br />
Sarah hates Zakiah. <br />
tkrh sārh zkyh<br />
Sarah hates Zakiah. <br />
Ayesha phoned Eman.<br />
sārh tkrh zkyh<br />
<br />
hātft ֒ā֓yˇsh ֓iymān<br />
Ayesha phoned Eman. <br />
֒ā֓yˇsh hātft ֓iymān<br />
Ayah thanked Khalid. ˇskrt ֓āyh hāld ˘<br />
Ayah thanked Khalid. ֓āyh ˇskrt hāld ˘<br />
Sarah called Adam. nādt sārh ֓ādm<br />
Sarah called Adam. sārh nādt ֓ādm<br />
152
Eman saw Sarah. <br />
r֓at ֓iymān sārh<br />
Eman saw Carl. ֓iymān r֓at kārl<br />
msk fylyb ālkrh<br />
Philip caught the ball.<br />
Philip caught the ball.<br />
fylyb msk ālkrh<br />
Philip caught the ball. msk ālkrh fylyb<br />
Carl bought pens. ֓iˇstrā kārl ֓aqlām<br />
Carl bought pens. kārl ֓iˇstrā ֓aqlām<br />
qād mārk ālt.ā֓yrh<br />
Mark drove the plane.<br />
Mark drove the bus.<br />
mārk qād ālh. āflh<br />
Mark drove the car. Adam is cleaning his <strong>to</strong>y.<br />
qād ālsyārh mārk<br />
<br />
Adam is cleaning his room.<br />
֓ādm ynz . f l֒bth<br />
Adam is cleaning his car.<br />
<br />
ynz . f ֓ādm ˙grfth<br />
<br />
ynz . f ֓ādm syārth<br />
Mark will clean my kitchen.<br />
Sarah will clean my office.<br />
Sarah will clean the car.<br />
<br />
<br />
mārk synz . f mt.bh y<br />
˘<br />
<br />
<br />
<br />
stnz. f sārh mktby<br />
sārh stnz. f ālsyārh<br />
Eman will clean her room. Eman will clean her room.<br />
stnz . f ֓iymān ˙grfthā<br />
<br />
֓iymān stnz . f ˙grfthā<br />
153
Louis is washing the dishes.<br />
Louis is washing the dishes.<br />
y˙gsl lwys āls.h. wn<br />
lwys y˙gsl āls.h. wn<br />
Louis is washing the dishes. y˙gsl āls.h. wn lwys<br />
Harold is feeding his cat. yt.֒m hārwld qt.th<br />
Harold is feeding his cat. hārwld yt.֒m qt.th<br />
he is a seller. hw bā֓y֒<br />
Eric is fixing his car. ys. lh. ֓iyryk syārth<br />
Eric is fixing his car. ֓iyryk ys. lh. syārth<br />
Fahmy speaks <strong>English</strong>.<br />
Fahmy speaks <strong>English</strong>.<br />
fhmy ytklm ālānklyzyh<br />
ytklm fhmy ālānklyzyh<br />
Ayah is cooking the food. tt.bh ˘ ֓āyh ālt.֒ām<br />
Ayah is cooking the dinner. ֓āyh tt.bh ˘ āl֒ˇsā֓<br />
Rashid is helping Mark. <br />
ysā֒d rāˇsd mārk<br />
Rashid is helping Ayesha.<br />
<br />
rāˇsd ysā֒d ֒ā֓yˇsh<br />
Mansour is eating his food. y֓akl mns. wr t.֒āmh<br />
Mansour is eating his food. mns.wr y֓akl t.֒āmh<br />
154
Carl is brushing his hair. <br />
ymˇst. kārl ˇs֒rh<br />
Carl is brushing his hair. <br />
kārl ymˇst. ˇs֒rh<br />
Abbas is brushing his teeth. <br />
yfrˇs ֒bās ֓asnānh<br />
Abbas is brushing his teeth. <br />
֒bās yfrˇs ֓asnānh<br />
Yousuf is wearing his clothes. <br />
ylbs ywsf t ¯ yābh<br />
Yousuf is wearing his shoes. <br />
Henry is watching the television.<br />
Henry is watching the television.<br />
<br />
ywsf ylbs h. d ¯ ā֓yh<br />
<br />
yˇsāhd hnry āltlfāz<br />
<br />
hnry yˇsāhd āltlfāz<br />
Henry is watching the television. <br />
yˇsāhd āltlfāz hnry<br />
Sulaiman caught the fish. ֓is.t.ād slymān ālsmk<br />
slymān ֓is. t.ād ālsmk<br />
Sulaiman caught the fish. <br />
֓is.t.ād ālsmk slymān<br />
Sulaiman caught the fish.<br />
Omar is planting the trees. <br />
yzr֒āl֓aˇsˇgār ֒mr<br />
Omar is planting the trees. <br />
֒mr yzr֒āl֓aˇsˇgār<br />
James pushed the chairs. ˇgr ˇgyms ālkrāsy<br />
James pushed the chairs. ˇgyms ˇgr ālkrāsy<br />
James pushed the chairs. ˇgr ālkrāsy ˇgyms<br />
155
Roqih drew trees. rsmt rqyh ֓aˇsˇgār<br />
Roqih drew Omar. rqyh rsmt ֒mr<br />
Roqih drew Khalid. <br />
rsmt hāld rqyh<br />
˘<br />
Ayesha picked the flowers. <br />
֒ā֓yˇsh qt.ft ālzhwr<br />
Ayesha picked the flowers. <br />
Ayesha picked the flowers.<br />
qt.ft ֒ā֓yˇsh ālzhwr<br />
qt.ft ālzhwr ֒ā֓yˇsh<br />
Omar is fixing the computer. ys. lh. ֒mr ālh. āswb<br />
Omar is fixing the computer. ֒mr ys. lh. ālh. āswb<br />
Omar is fixing the computer. ys. lh. ālh. āswb ֒mr<br />
Omar bought the <strong>to</strong>ys. <br />
Omar bought the fish. <br />
Eman ironed the clothes. <br />
֓iˇstrā ֒mr āll֒b<br />
֓iˇstrā ālsmk ֒mr<br />
kwt ֓iymān ālmlābs<br />
Eman ironed the clothes. ֓iymān kwt ālmlābs<br />
Eman ironed the clothes. <br />
kwt ālmlābs ֓iymān<br />
lwnt ֓āyh āls. wrh<br />
Ayah painted the picture.<br />
Ayah painted the picture.<br />
֓āyh lwnt āls. wrh<br />
Ayah painted the picture. lwnt āls. wrh ֓āyh<br />
I want the book. ֓aryd ālktāb<br />
156
I want a book. ֓aryd ktāb<br />
I want the food. ֓aryd ālt.֒ām<br />
I ate the dinner. ֓aklt āl֒ˇsā֓<br />
I ate the food. ֓aklt ālt.֒ām<br />
I ate the fish. ֓aklt ālsmk<br />
ˇsrbt āllbn<br />
I drank the milk.<br />
Eman lost her cat. <br />
fqdt ֓iymān qt.thā<br />
dhst ֓iymān qt.thā<br />
Eman ran over her cat. <br />
<br />
Omar won the race. fāz ֒mr bālsbāq<br />
Omar is sleeping. ynām ֒mr<br />
The children are crying. <br />
āl֓at.fāl ybkwn<br />
the wheel squeaks. āldwlāb ys. rs. r<br />
<br />
Omar reads. ֒mr yqr֓a<br />
Omar reads a lot. ֒mr yqr֓a kt ¯ yrā<br />
He hit Khalid. hw d. rb h ˘ āld<br />
He played with the spoon. hw l֒b bālml֒qh<br />
He loves Laila. hw yh. b lylā<br />
He loves Laila. yh. b hw lylā<br />
He loves Laila. yh. b lylā hw<br />
157
Omar gave Khalid the book. <br />
֒mr ֓a֒t.ā h ˘ āld ālktāb<br />
Omar gave Sarah the book. ֒mr ֓a֒t.ā sārh ālktāb<br />
Omar is giving Khalid the book. <br />
֒mr y֒t.y h ˘ āld ālktāb<br />
Omar is giving Sarah the book. ֒mr y֒t.y sārh ālktāb<br />
Sarah is giving Khalid the book. <br />
sārh t֒t.y h ˘ āld ālktāb<br />
Sarah is giving Eman the book. sārh t֒t.y ֓iymān ālktāb<br />
Sarah gave Eman the book. <br />
sārh ֓a֒t.t ֓iymān ālktāb<br />
Sarah gave Khalid the book. sārh ֓a֒t.t hāld ālktāb<br />
˘<br />
He gave Khalid the book. <br />
hw ֓a֒t.ā h ˘ āld ālktāb<br />
He gave Khalid the book. <br />
֓a֒t.ā hw h ˘ āld ālktāb<br />
He gave Sarah the book. hw ֓a֒t.ā sārh ālktāb<br />
He gave Sarah the book. ֓a֒t.ā hw sārh ālktāb<br />
He is giving Khalid the book. <br />
hw y֒t.y h ˘ āld ālktāb<br />
She is giving Sarah the book. hy t֒t.y sārh ālktāb<br />
She is giving Khalid the book. <br />
hy t֒t.y h ˘ āld ālktāb<br />
She gave Sarah the book. hy ֓a֒t.t sārh ālktāb<br />
She gave Sarah the book. ֓a֒t.t hy sārh ālktāb<br />
She gave Khalid the book. <br />
hy ֓a֒t.t h ˘ āld ālktāb<br />
She gave Khalid the book. <br />
֓a֒t.t hy h ˘ āld ālktāb<br />
158
Omar gave the book <strong>to</strong> Khalid. <br />
֒mr ֓a֒t.ā lh ˘ āld ālktāb<br />
Omar gave the book <strong>to</strong> Khalid. <br />
֒mr ֓a֒t.ā ālktāb lh ˘ āld<br />
Omar gave the book <strong>to</strong> Khalid. <br />
֒mr y֒t.y lh ˘ āld ālktāb<br />
Omar is giving the book <strong>to</strong> Khalid. <br />
֒mr y֒t.y ālktāb lh ˘ āld<br />
Omar is giving the book <strong>to</strong> Sarah. ֒mr y֒t.y ālktāb lsārh<br />
Omar is giving the book <strong>to</strong> Sarah. ֒mr y֒t.y lsārh ālktāb<br />
Omar gave the book <strong>to</strong> Sarah. ֒mr ֓a֒t.ā ālktāb lsārh<br />
Omar gave the book <strong>to</strong> Sarah. ֒mr ֓a֒t.ā lsārh ālktāb<br />
Eman is giving the book <strong>to</strong> Khalid. <br />
֓iymān t֒t.y lh ˘ āld ālktāb<br />
Eman is giving the book <strong>to</strong> Khalid. <br />
֓iymān t֒t.y ālktāb lh ˘ āld<br />
Eman is giving the book <strong>to</strong> Sarah. ֓iymān t֒t.y ālktāb lsārh<br />
Eman is giving the book <strong>to</strong> Sarah. ֓iymān t֒t.y lsārh ālktāb<br />
He gave the book <strong>to</strong> Khalid. <br />
hw ֓a֒t.ā lh ˘ āld ālktāb<br />
He gave a book <strong>to</strong> Khalid. <br />
hw ֓a֒t.ā ktāb lh ˘ āld<br />
He is giving a book <strong>to</strong> Khalid. <br />
hw y֒t.y lh ˘ āld ktāb<br />
He is giving a book <strong>to</strong> Khalid. <br />
hw y֒t.y ktāb lh ˘ āld<br />
He is giving a book <strong>to</strong> Sarah. hw y֒t.y ktāb lsārh<br />
159
He is giving a book <strong>to</strong> Sarah. hw y֒t.y lsārh ktāb<br />
He gave a book <strong>to</strong> Sarah. hw ֓a֒t.ā ktāb lsārh<br />
He gave a book <strong>to</strong> Sarah. hw ֓a֒t.ā lsārh ktāb<br />
She is giving a book <strong>to</strong> Khalid. <br />
hy t֒t.y lh ˘ āld ktāb<br />
She is giving a book <strong>to</strong> Khalid. <br />
hy t֒t.y ktāb lh ˘ āld<br />
She is giving a book <strong>to</strong> Sarah. hy t֒t.y ktāb lsārh<br />
She is giving a book <strong>to</strong> Sarah. hy t֒t.y lsārh ktāb<br />
Eman is giving a book <strong>to</strong> Sarah. ֓iymān t֒t.y lsārh ktāb<br />
He gave a book <strong>to</strong> Khalid. <br />
hw ֓a֒t.ā lh ˘ āld ktāb<br />
She is giving Sarah a book. hy t֒t.y sārh ktāb<br />
Omar gave Khalid a book. <br />
֒mr ֓a֒t.ā h ˘ āld ktāb<br />
Khalid drives. yswq hāld ˘<br />
Khalid drives. hāld yswq<br />
˘<br />
Khalid drives a lot. <br />
h ˘ āld yswq kt ¯ yrā<br />
160
D<br />
Verbs in lexicon<br />
Verbs in <strong>Arabic</strong> change depending on gender, number and tense of the subject, so there<br />
are multiple entries in the lexicon that translate <strong>to</strong> the same <strong>English</strong> output. The translit-<br />
eration makes it clear that these are different words in <strong>Arabic</strong>.<br />
<strong>Arabic</strong> Example Logical Structure<br />
֓aryd I want a ring. < TNS : PRES[do ′ (I, [want ′ (I, y)])] ><br />
nsyt I <strong>for</strong>got my wallet. < TNS : PAST[do ′ (I, [<strong>for</strong>get ′ (I, y)])] ><br />
֓aklt I ate an apple. < TNS : PAST[do ′ (I, [eat ′ (I, y)])] ><br />
ˇsrbt I drank the milk. < TNS : PAST[do ′ (I, [drink ′ (I, y)])] ><br />
qr֓a Omar read the book. < TNS : PAST[do ′ (x, [read ′ (x, y)])] ><br />
qr֓at Eman read the book. < TNS : PAST[do ′ (x, [read ′ (x, y)])] ><br />
161
<strong>Arabic</strong> Example Logical Structure<br />
NON I am a <strong>to</strong>urist. be ′ (I,[<strong>to</strong>urist ′ ])<br />
NON He is an engineer. be ′ (he,[engineer ′ ])<br />
NON We are students. be ′ (we,[students ′ ])<br />
ˇsrb Khalid drank the milk. < TNS : PAST[do ′ (x,[drink ′ (x,y)])] ><br />
yˇsrb Khalid is drinking the milk. < TNS : PRES[do ′ (x,[drink ′ (x,y)])] ><br />
syˇsrb Omar will drink the milk. < TNS : FUT[do ′ (x,[drink ′ (x,y)])] ><br />
ylbs Eric is wearing his clothes. < TNS : PRES[do ′ (x,[wear ′ (x,y)])] ><br />
d. rb Louis hit Mark. < TNS : PAST[do ′ (x,[hit ′ (x,y)])] ><br />
yh. b Qays loves Laila. < TNS : PRES[do ′ (x,[love ′ (x,y)])] ><br />
th. b Fatima loves Zaid. < TNS : PRES[do ′ (x,[love ′ (x,y)])] ><br />
qmt I have made a reservation. < TNS : PAST[do ′ (x,[make ′ (x,y)])] ><br />
fqdt Eman lost her cat. < TNS : PAST[do ′ (x,[lose ′ (x,y)])] ><br />
qtl The man killed Jack. < TNS : PAST[do ′ (x,[kill ′ (x,y)])] ><br />
fth. Sulaiman opened the door. < TNS : PAST[do ′ (x,[open ′ (x,y)])] ><br />
<br />
<br />
֓ah ˘ d ¯<br />
Zaid <strong>to</strong>ok the book. < TNS : PAST[do ′ (x,[take ′ (x,y)])] ><br />
yrkb Fahmy rides the car. < TNS : PRES[do ′ (x,[ride ′ (x,y)])] ><br />
stt.bh ˘<br />
Raiqa will cook the dinner. < TNS : FUT[do ′ (x,[cook ′ (x,y)])] ><br />
162
<strong>Arabic</strong> Example Logical Structure<br />
֓aˇgāb Khalid answered the question. < TNS : PAST[do ′ (x,[answer ′ (x,y)])] ><br />
yzwr Omar is visiting Ireland. < TNS : PRES[do ′ (x,[visit ′ (x,y)])] ><br />
yl֒b Abbas is playing with the ball. < TNS : PRES[do ′ (x,[play ′ (x,y)])] ><br />
syl֒b Yousuf will play with the ball. < TNS : FUT[do ′ (x,[play ′ (x,y)])] ><br />
֓akl Mansour ate with the spoon. < TNS : PAST[do ′ (x,[eat ′ (x,y)])] ><br />
y֓akl Mansour is eating his food. < TNS : PRES[do ′ (x,[eat ′ (x,y)])] ><br />
sy֓akl Eric will eat with the spoon. < TNS : FUT[do ′ (x,[eat ′ (x,y)])] ><br />
nfh ˘<br />
Suhaib bellowed the fire. < TNS : PAST[do ′ (x,[bellow ′ (x,y)])] ><br />
dhst Eman runned over her cat. < TNS : PAST[do ′ (x,[runnover ′ (x,y)])] ><br />
ksr Rashid broke the window. < TNS : PAST[do ′ (x,[break ′ (x,y)])] ><br />
tˇgrh. Sarah hurts Yousuf. < TNS : PRES[do ′ (x,[hurt ′ (x,y)])] ><br />
<br />
mzq Almahdi <strong>to</strong>re the book. < TNS : PAST[do ′ (x,[tear ′ (x,y)])] ><br />
<br />
mzqt Ayah <strong>to</strong>re the page. < TNS : PAST[do ′ (x,[tear ′ (x,y)])] ><br />
fth. Almahdi opened the window. < TNS : PAST[do ′ (x,[open ′ (x,y)])] ><br />
mˇst. James combed his hair. < TNS : PAST[do ′ (x,[comb ′ (x,y)])] ><br />
<br />
<br />
nz . f Eric cleaned the plane. < TNS : PAST[do ′ (x,[clean ′ (x,y)])] ><br />
163
<strong>Arabic</strong> Example Logical Structure<br />
qrs. Harold pinched James. < TNS : PAST[do ′ (x,[pinch ′ (x,y)])] ><br />
tnfq Ayah is spending her money. < TNS : PRES[do ′ (x,[spend ′ (x,y)])] ><br />
fqd Henry lost his money. < TNS : PAST[do ′ (x,[lose ′ (x,y)])] ><br />
lkm Adam punched Philip. < TNS : PAST[do ′ (x,[punch ′ (x,y)])] ><br />
tkrh Sarah hates Zakiah. < TNS : PRES[do ′ (x,[hate ′ (x,y)])] ><br />
lkmt Sarah punched Sarah. < TNS : PAST[do ′ (x,[punch ′ (x,y)])] ><br />
qtlt Zakiah killed Sarah. < TNS : PAST[do ′ (x,[kill ′ (x,y)])] ><br />
qtl Jack killed Mary. < TNS : PAST[do ′ (x,[kill ′ (x,y)])] ><br />
s. f֒ Mark slapped Louis. < TNS : PAST[do ′ (x,[slap ′ (x,y)])] ><br />
hātft Ayesha phoned Eman. < TNS : PAST[do ′ (x,[phone ′ (x,y)])] ><br />
<br />
ˇskrt Ayah thanked Khalid. < TNS : PAST[do ′ (x,[thank ′ (x,y)])] ><br />
nādt Sarah called Adam. < TNS : PAST[do ′ (x,[call ′ (x,y)])] ><br />
<br />
fāz Omar won the race. < TNS : PAST[do ′ (x,[win ′ (x,y)])] ><br />
r֓at Eman saw Sarah. < TNS : PAST[do ′ (x,[see ′ (x,y)])] ><br />
msk Philip caught the ball. < TNS : PAST[do ′ (x,[catch ′ (x,y)])] ><br />
<br />
֓iˇstrā Carl bought pens. < TNS : PAST[do′ (x,[buy ′ (x,y)])] ><br />
qād Mark drove the bus. < TNS : PAST[do ′ (x,[drive ′ (x,y)])] ><br />
<br />
<br />
ynz . f Adam is cleaning his room. < TNS : PRES[do ′ (x,[clean ′ (x,y)])] ><br />
164
<strong>Arabic</strong> Example Logical Structure<br />
<br />
<br />
synz . f Mark will clean my kitchen. < TNS : FUT[do ′ (x,[clean ′ (x,y)])] ><br />
<br />
<br />
stnz . f Sarah will clean my office. < TNS : FUT[do ′ (x,[clean ′ (x,y)])] ><br />
y˙gsl Louis is washing the dishes. < TNS : PRES[do ′ (x,[wash ′ (x,y)])] ><br />
yt.֒m Harold is feeding his cat. < TNS : PRES[do ′ (x,[feed ′ (x,y)])] ><br />
ys.lh. Eric is fixing his car. < TNS : PRES[do ′ (x,[fix ′ (x,y)])] ><br />
ytklm Fahmy speaks <strong>English</strong>. < TNS : PRES[do ′ (x,[speak ′ (x,y)])] ><br />
tt.bh ˘<br />
Ayah is cooking the dinner. < TNS : PRES[do ′ (x,[cook ′ (x,y)])] ><br />
ysā֒d Rashid is helping Mark. < TNS : PRES[do ′ (x,[help ′ (x,y)])] ><br />
y֓akl Mansour is eating his food. < TNS : PRES[do ′ (x,[eat ′ (x,y)])] ><br />
ymˇst. Carl is brushing his hair < TNS : PRES[do ′ (x,[brush ′ (x,y)])] ><br />
<br />
yfrˇs Abbas is brushing his teeth. < TNS : PRES[do ′ (x,[brush ′ (x,y)])] ><br />
ylbs Yousuf is wearing his shoes. < TNS : PRES[do ′ (x,[wear ′ (x,y)])] ><br />
ynām Omar sleeps early. < TNS : PRES[do ′ (x,[sleep ′ (x,y)])] ><br />
yˇsāhd Henry is watching the TV. < TNS : PRES[do ′ (x,[watch ′ (x,y)])] ><br />
yzr֒ Almahdi is planting the trees. < TNS : PRES[do ′ (x,[plant ′ (x,y)])] ><br />
֓is. t.ād Sulaiman caught the fishs. < TNS : PAST[do ′ (x,[catch ′ (x,y)])] ><br />
ˇgr James pushed the chairs. < TNS : PAST[do ′ (x,[push ′ (x,y)])] ><br />
165
<strong>Arabic</strong> Example Logical Structure<br />
ktbt Ayesha wrote the book. < TNS : PAST[do ′ (x,[write ′ (x,y)])] ><br />
ktb Brian wrote the book. < TNS : PAST[do ′ (x,[write ′ (x,y)])] ><br />
msh. t Fatima wiped the house. < TNS : PAST[do ′ (x,[wipe ′ (x,y)])] ><br />
rsmt Raiqa drew trees. < TNS : PAST[do ′ (x,[draw ′ (x,y)])] ><br />
qt.ft Ayesha picked the flowers. < TNS : PAST[do ′ (x,[pick ′ (x,y)])] ><br />
kwt Eman ironed the clothes. < TNS : PAST[do ′ (x,[iron ′ (x,y)])] ><br />
<br />
֓iˇstrā Omar bought the <strong>to</strong>ys. < TNS : PAST[do ′ (x,[buy ′ (x,y)])] ><br />
lwnt Ayah painted the picture. < TNS : PAST[do ′ (x,[paint ′ (x,y)])] ><br />
֓a֒t.āk Omar gave you the book. < TNS : PAST[do ′ (x,[give ′ (x,y)])] ><br />
ys.lh. Omar is fixing the computer. < TNS : PAST[do ′ (x,[fix ′ (x,y)])] ><br />
y֒ml Yasser works hard. < TNS : PRES[do ′ (x,[work ′ (x,y)])] ><br />
mrr Philip passed the ball. < TNS : PAST[do ′ (x,[pass ′ (x,y)])] ><br />
yswq Khalid drives. < TNS : PRES >><br />
ybkwn The children are crying. < TNS : PRES >><br />
yqr֓a Omar reads. < TNS : PRES >><br />
166
<strong>Arabic</strong> ys. rs. r<br />
Example the wheel squeaks.<br />
Logical Structure < TNS : PRES >><br />
<strong>Arabic</strong> ynām<br />
Example Omar is sleeping.<br />
Logical Structure < TNS : PRES >><br />
<strong>Arabic</strong> ֓a֒t.ā<br />
Example Omar gave Khalid the book.<br />
Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />
<strong>Arabic</strong> y֒t.y<br />
Example Omar is giving Eman a book.<br />
Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />
<strong>Arabic</strong> t֒t.y<br />
Example Sarah is giving Eman a book.<br />
Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />
<strong>Arabic</strong><br />
֓a֒t.t<br />
Example Sarah gave Khalid a book.<br />
Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />
167
<strong>Arabic</strong><br />
֓art<br />
Example Fatima showed the letter <strong>to</strong> Khalid.<br />
Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />
<strong>Arabic</strong> yry<br />
Example Mark is showing Brian the letter.<br />
Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />
<strong>Arabic</strong> ֓arā<br />
Example Brian showed the letter <strong>to</strong> Sarah.<br />
Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />
<strong>Arabic</strong> try<br />
Example Fatima is showing Adam the letter.<br />
Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />
<strong>Arabic</strong> ydrs<br />
Example Suhaib is teaching Eman the his<strong>to</strong>ry.<br />
Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />
<strong>Arabic</strong> tdrs<br />
Example Eman is teaching mathematics <strong>to</strong> Sarah.<br />
Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />
168
<strong>Arabic</strong> ֓adrs<br />
Example I am teaching mathematics <strong>to</strong> Sarah.<br />
Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />
<strong>Arabic</strong> drs<br />
Example Suhaib taught Mark mathematics.<br />
Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />
<strong>Arabic</strong> Example Logical Structure<br />
<br />
fāttny I missed the plane. < TNS : PAST[do ′ (I, [miss ′ (I, y])] ><br />
֓aryd I want a ring. < TNS : PRES[do ′ (I, [want ′ (I, ring])] ><br />
nsyt I <strong>for</strong>got my wallet. < TNS : PAST[do ′ (I, [<strong>for</strong>get ′ (I, wallet])] ><br />
֓aklt I ate the food. < TNS : PAST[do ′ (I, [eat ′ (I, food])] ><br />
ˇsrbt I drank the milk. < TNS : PAST[do ′ (I, [drink ′ (I, milk])] ><br />
169
E<br />
The UniArab code<br />
Given the large amount of code developed as part of the work presented in this thesis, it<br />
is available in the attached CD rather than included here.<br />
Package: Name of Class Class Summary<br />
pkg1: <strong>Arabic</strong>To<strong>English</strong>MT The main class<br />
pkg1: AdjectiveXMLWriter To write an adjective in the datasource<br />
pkg1: Adjective To hold adjective attributes from the datasource<br />
pkg1: AdverbXMLWriter To write an adverb in the datasource<br />
pkg1: Adverb To hold adverb attributes from the datasource<br />
pkg1: DemonstrativeXMLWriter To write a demonstrative in the datasource<br />
pkg1: Demonstrative To hold demonstrative attributes from the datasource<br />
pkg1: Global This class <strong>to</strong> add a new word in lexicon<br />
pkg1: NounXMLWriter To write a noun in the datasource<br />
pkg1: Noun To hold noun attributes from the datasource<br />
170
Package: Name of Class Class Summary<br />
pkg1: OtherWordXMLWriter To write an OtherWord in the datasource<br />
pkg1: OtherWord To hold OtherWord attributes from the datasource<br />
pkg1: Preparation<br />
To change the value from lexicon interface’s list <strong>to</strong> be saved<br />
in datasource<br />
pkg1: ProperNounXMLWriter To write a proper noun in the datasource<br />
pkg1: ProperNoun To hold proper noun attributes from the datasource<br />
pkg1: SearchEngine2 This class <strong>to</strong> manage search in datasource<br />
pkg1: TempAdjectiveXML<br />
pkg1: TempAdverbXML<br />
pkg1: TempDemonstrativeXML<br />
pkg1: TempNounXML<br />
pkg1: TempOtherWordXML<br />
pkg1: TempProperNounXML<br />
pkg1: TempVerbXML<br />
To manage a written an adjective in the XML datasource<br />
while the UniArab system is running<br />
To manage a written a new adverb in the XML datasource<br />
while the UniArab system is running<br />
To manage a written a new demonstrative in the XML<br />
datasource while the UniArab system is running<br />
To manage a written a new noun in the XML datasource<br />
while the UniArab system is running<br />
To manage a written a new OtherWord in the XML datasource<br />
while the UniArab system is running<br />
To manage a written a new proper noun in the XML datasource<br />
while the UniArab system is running<br />
To manage a written a new verb in the XML datasource<br />
while the UniArab system is running<br />
pkg1: Tokenizer This class <strong>to</strong> split a sentence in<strong>to</strong> word <strong>to</strong>kens<br />
pkg1: VerbXMLWriter To write a verb in the datasource<br />
pkg1: Verb To hold verb attributes from the datasource<br />
171
Package: Name of Class Class Summary<br />
gui: AdjectivePanel<br />
gui: AdverbPanel<br />
gui: DemonstrativesPanel<br />
gui: NounPanel<br />
gui: OtherWordPanel<br />
gui: ProperNounPanel<br />
gui: VerbPanel<br />
This class <strong>to</strong> mange the adjective panel<br />
in the UniArabs lexicon interface<br />
This class <strong>to</strong> mange the adverb panel in the<br />
UniArabs lexicon interface<br />
This class <strong>to</strong> mange the demonstrative panel<br />
in the UniArabs lexicon interface<br />
This class <strong>to</strong> mange the noun panel in the<br />
UniArabs lexicon interface<br />
This class <strong>to</strong> mange the OtherWord panel<br />
in the UniArabs lexicon interface<br />
This class <strong>to</strong> mange the proper noun panel<br />
in the UniArabs lexicon interface<br />
This class <strong>to</strong> mange the verb panel in the<br />
UniArabs lexicon interface<br />
uniArab: PreUniArab This class <strong>to</strong> mange the POS of input words<br />
uniArab: UniArab This class <strong>to</strong> preparation of syntactic parser<br />
uniArab: GenerationLS This class <strong>to</strong> generate the logical structure<br />
syntaxGeneration: syntaxGeneration This class <strong>to</strong> mange the syntax generation<br />
generation<strong>English</strong>Morphology: pressTenseToBe<br />
Package: Name of Class Class Summary<br />
This class <strong>to</strong> mange the generation<br />
of target language morphology<br />
xml: AdjectiveDB.XML This is the adjectives s<strong>to</strong>red in the XML datasource<br />
xml: AdverbDB.XML This is the adverbs s<strong>to</strong>red in the XML datasource<br />
xml: DemonstrativeDB.XML This is the demonstratives s<strong>to</strong>red in the XML datasource<br />
xml: NounDB.XML This is the nouns s<strong>to</strong>red in the XML datasource<br />
xml: OtherWordDB.XML This is the OtherWords s<strong>to</strong>red in the XML datasource<br />
xml: ProperNounDB.XML This is the proper nouns s<strong>to</strong>red in the XML datasource<br />
xml: VerbDB.XML This is the verbs s<strong>to</strong>red in the XML datasource<br />
172