22.08.2013 Views

A generic framework for Arabic to English machine ... - Acsu Buffalo

A generic framework for Arabic to English machine ... - Acsu Buffalo

A generic framework for Arabic to English machine ... - Acsu Buffalo

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

A <strong>generic</strong> <strong>framework</strong> <strong>for</strong> <strong>Arabic</strong><br />

<strong>to</strong> <strong>English</strong> <strong>machine</strong> translation<br />

of simplex sentences using the<br />

Role and Reference Grammar<br />

linguistic model<br />

By<br />

Yasser Salem B.Sc<br />

Supervisors:<br />

Dr. Brian Nolan<br />

Mr. Arnold Hensman<br />

Submitted in Fulfillment of the Requirements <strong>for</strong> the of M.Sc. in<br />

Computing in the School of In<strong>for</strong>matics and Engineering at<br />

the Institute of Technology Blanchards<strong>to</strong>wn<br />

Dublin, Ireland<br />

April, 2009


Abstract<br />

The aim of this research is <strong>to</strong> develop a rule-based lexical <strong>framework</strong> <strong>for</strong> <strong>Arabic</strong> language<br />

processing using the Role and Reference Grammar linguistic model. A system, called<br />

UniArab is introduced <strong>to</strong> support the <strong>framework</strong>. The UniArab system <strong>for</strong> Modern Stan-<br />

dard <strong>Arabic</strong> (MSA), which takes MSA <strong>Arabic</strong> as input in the native orthography, parses<br />

the sentence(s) in<strong>to</strong> a logical meta-representation, and using this, generates a grammati-<br />

cally correct <strong>English</strong> output with full agreement and morphological resolution. UniArab<br />

utilizes an XML-based implementation of elements of the Role and Reference Grammar<br />

theory, and its representations <strong>for</strong> the universal logical structure of <strong>Arabic</strong> sentences.<br />

Role and Reference Grammar (RRG) is a functional theory of grammar that posits a<br />

direct mapping between the semantic representation of a sentence and its syntactic rep-<br />

resentation. The theory allows a sentence in a specific language <strong>to</strong> be described in terms<br />

of its logical structure and grammatical procedures. RRG creates a linking relationship<br />

between syntax and semantics, and can account <strong>for</strong> how semantic representations are<br />

mapped in<strong>to</strong> syntactic representations. We claim that RRG is highly suitable <strong>for</strong> <strong>machine</strong><br />

translation of <strong>Arabic</strong> via an Interlingua bridge implementation model. RRG is a mono<br />

strata–theory, positing only one level of syntactic representation, the actual <strong>for</strong>m of the<br />

i


sentence and its linking algorithm can work in both directions from syntactic representa-<br />

tion <strong>to</strong> semantic representation, or vice versa. In RRG, semantic decomposition of pred-<br />

icates and their semantic argument structures are represented as logical structures. The<br />

lexicon in RRG takes the position that lexical entries <strong>for</strong> verbs should contain unique in-<br />

<strong>for</strong>mation only, with as much in<strong>for</strong>mation as possible derived from general lexical rules.<br />

For this reason and due <strong>to</strong> the functional nature of our linguistic model, we will create<br />

our own lexicon.<br />

We use the RRG theory <strong>to</strong> motivate the architecture of the lexicon and the RRG bidirec-<br />

tional linking system <strong>to</strong> design and implement the parse and generate functions between<br />

the syntax-semantic interfaces. Through an input process with seven phases, including<br />

morphological and syntactic unpacking, UniArab extracts the universal logical structure<br />

of an <strong>Arabic</strong> sentence. Using the XML based metadata representing the RRG logical<br />

structure (XRRG), UniArab accurately generates an equivalent grammatical sentence in<br />

the target language through four output phases. We outline the conceptual structure of<br />

the UniArab System which utilizes the <strong>framework</strong> and translates the <strong>Arabic</strong> language<br />

in<strong>to</strong> another natural language. We follow the Interlingua design approach <strong>for</strong> <strong>machine</strong><br />

translation. We analyse the <strong>Arabic</strong> sentences <strong>to</strong> create a universal, abstract logical repre-<br />

sentation, and from this representation we generate <strong>English</strong> translations.<br />

We also explore how the characteristics of the <strong>Arabic</strong> language will affect the develop-<br />

ment of a Machine Translation (MT) <strong>to</strong>ol. Several characteristics of <strong>Arabic</strong> pertinent<br />

<strong>to</strong> MT will be explored in detail with reference <strong>to</strong> some potential difficulties that they<br />

present. We will conclude with a proposed model incorporating the Role and Reference<br />

Grammar techniques <strong>to</strong> achieve this end. The UniArab system has been tested by gener-<br />

ating equivalent grammatical sentences, in <strong>English</strong>, via the universal logical structure of<br />

<strong>Arabic</strong> sentences, based on MSA <strong>Arabic</strong> input with very significant and accurate results.<br />

ii


It provides more accurate translations when compared with au<strong>to</strong>mated transla<strong>to</strong>rs from<br />

Google and Microsoft though these systems have a much wider coverage than UniArab<br />

at present. The free word order nature of <strong>Arabic</strong> and the challenges of incorporating tran-<br />

sitivity in<strong>to</strong> the logical structure will be outlined in detail. This research demonstrates the<br />

capabilities of the Role and Reference Grammar as a base <strong>for</strong> multilingual translation<br />

systems.<br />

iii


Declaration<br />

I herby certify that this material, which I now submit <strong>for</strong> assessment on the programme<br />

of study leading <strong>to</strong> the award of M.Sc. in Computing in the Institute of Technology Blan-<br />

chards<strong>to</strong>wn, is entirely my own work except where otherwise stated, and has not been<br />

submitted <strong>for</strong> assessment <strong>for</strong> an academic purpose at this or any other academic institu-<br />

tion other than in partial fulfillment of the requirements of that stated above.<br />

. . ..................................<br />

Yasser Salem<br />

Dublin, Ireland<br />

April 2009<br />

iv


Acknowledgements<br />

“In the name of GOD (Allah), Most Gracious, Most Merciful. Praise be <strong>to</strong> GOD, the<br />

Cherisher and Sustainer of the worlds; Most Gracious, Most Merciful; Master of the Day<br />

of Judgement. Thee do we worship, and Thine aid we seek. Show us the straight way,”<br />

[The Quran: Al-Fatiha (The Opening)]. “and my success (in my task) can only come<br />

from GOD. In Him I trust and un<strong>to</strong> Him I turn (repentant).” [The Quran: Hud,88]. I am<br />

not able <strong>to</strong> fulfil the due thanks <strong>to</strong> GOD, but I seek his <strong>for</strong>giveness and that GOD assists<br />

me in thanking His Majesty. “Say. Surely my prayer and my sacrifice and my life and<br />

my death are all <strong>for</strong> God, the Lord of the worlds.” (The Quran: AL-Anaam, 162)<br />

I am deeply indebted <strong>to</strong> my mother <strong>for</strong> accepting that I pursue my M.Sc. away from<br />

her. And May God bless my father’s soul and join all of us in the paradise. I am grateful<br />

<strong>to</strong> my wife, Qamar, and my children, <strong>for</strong> their encouragement and understanding. Thanks<br />

<strong>to</strong> my brothers and sisters.<br />

I am deeply indebted <strong>to</strong> my supervisor, Dr. Brian Nolan, <strong>for</strong> many insightful conver-<br />

sations during the development of the ideas in this thesis, and <strong>for</strong> helpful comments on<br />

the text. Dr. Nolan has supported me not only by providing research guidelines and<br />

v


advices, but also academically and emotionally through the rough road <strong>to</strong> finishing this<br />

thesis. And during the most difficult times when writing this thesis, he gave me the<br />

moral support. I also extend my sincere gratitude <strong>to</strong> Mr. Arnold Hensman who has had<br />

a significantly positive influence on my development through his role as my research co-<br />

supervisor. His patience and willingness <strong>to</strong> discuss the minutiae of the different obstacles<br />

I encountered while working on this project were invaluable.<br />

I also extend my sincere gratitude <strong>to</strong> all the staff of the Department of In<strong>for</strong>matics, In-<br />

stitute of Technology Blanchards<strong>to</strong>wn, Dublin, Ireland, where I did my undergraduate<br />

and masters studies. I specially thank Mark Cummins, Gareth Curran, Dr. Kevin Far-<br />

rell, Geraldine Gray, Dr. Anthony Keane, Laura Keyes, Margaret Kinsella, John Massey,<br />

Hugh McCabe, Dr. Simon McLoughlin, Orla McMahon, Daniel McSweeney, Frances<br />

Murphy, Tom Nolan, Michael O’Donnell, Luke Raeside, Stephen Sheridan and Dr. Matt<br />

Smith.<br />

Thanks <strong>to</strong> all my friends who supported me during my studies, especially, and <strong>to</strong> name<br />

just a few Mansoor Bin Khalifa Al-Thani, Mohammed A. Attia, Khalid Buzakhar, Suhaib<br />

Fahmy, Mohamed Faouzi Gahbiche, James Kennedy, Essam Mansour, Eftaima Najjair<br />

and Aliye Peksen. I would like <strong>to</strong> express my gratitude <strong>to</strong> all those who gave me the<br />

possibility <strong>to</strong> complete this thesis.<br />

. . ..................................<br />

Yasser Salem<br />

Dublin, Ireland<br />

April 2009<br />

vi


Contents<br />

Abstract iii<br />

Declaration iv<br />

Acknowledgements vi<br />

1 Introduction 1<br />

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />

1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />

1.3 Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />

1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6<br />

2 The <strong>Arabic</strong> Language 8<br />

2.1 Characteristics of the <strong>Arabic</strong> language . . . . . . . . . . . . . . . . . . . 9<br />

2.2 Characteristics of <strong>Arabic</strong> words . . . . . . . . . . . . . . . . . . . . . . 11<br />

2.2.1 Free word order . . . . . . . . . . . . . . . . . . . . . . . . . . . 12<br />

2.3 Part of speech inven<strong>to</strong>ry of the <strong>Arabic</strong> language . . . . . . . . . . . . . . 14<br />

2.3.1 Noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />

2.3.1.1 Definite nouns . . . . . . . . . . . . . . . . . . . . . . 15<br />

vii


CONTENTS<br />

2.3.1.2 Indefinite nouns . . . . . . . . . . . . . . . . . . . . . 16<br />

2.3.2 Adjectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />

2.3.3 Adverbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />

2.3.4 Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17<br />

2.3.4.1 Verb tenses . . . . . . . . . . . . . . . . . . . . . . . . 17<br />

2.3.4.2 Aspect . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

2.3.4.3 Mood . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

2.3.4.4 Voice . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />

2.3.4.5 Transitivity . . . . . . . . . . . . . . . . . . . . . . . . 20<br />

2.3.5 Demonstratives . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />

2.3.6 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />

2.4 Sentence types in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />

2.4.1 Equational sentences . . . . . . . . . . . . . . . . . . . . . . . . 22<br />

2.4.1.1 Verb and noun . . . . . . . . . . . . . . . . . . . . . . 23<br />

2.4.1.2 Verb and two nouns . . . . . . . . . . . . . . . . . . . 23<br />

2.4.1.3 Verb and three nouns . . . . . . . . . . . . . . . . . . 24<br />

2.4.1.4 Verb and four nouns . . . . . . . . . . . . . . . . . . . 24<br />

2.4.2 The Verbal Sentence . . . . . . . . . . . . . . . . . . . . . . . . 24<br />

2.4.3 Clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />

3 Role and Reference Grammar (RRG) 27<br />

3.1 Role and Reference Grammar linguistic model . . . . . . . . . . . . . . . 28<br />

3.2 Formal representation of layered structure of the clause . . . . . . . . . . 31<br />

3.2.1 Representing the universal aspects of the layered structure of the<br />

clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31<br />

3.2.2 Layered structure of the clause (LSC) . . . . . . . . . . . . . . . 32<br />

viii


CONTENTS<br />

3.2.3 Non-universal aspects of the layered structure of the clause . . . . 32<br />

3.3 Noun phrase structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />

3.3.1 NP headed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />

3.4 Lexical representations <strong>for</strong> verbs . . . . . . . . . . . . . . . . . . . . . . 38<br />

3.4.1 Agents, effec<strong>to</strong>rs, instruments and <strong>for</strong>ces . . . . . . . . . . . . . 39<br />

3.4.2 change of state verb . . . . . . . . . . . . . . . . . . . . . . . . 40<br />

3.5 Why we use RRG as the linguistic model . . . . . . . . . . . . . . . . . 41<br />

3.5.1 RRG representing the universal aspects of the layered structure<br />

of the clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42<br />

3.5.2 The lexical representation of verbs and their arguments . . . . . . 43<br />

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44<br />

4 Machine translation strategies 46<br />

4.1 Advantages of <strong>machine</strong> translation . . . . . . . . . . . . . . . . . . . . . 47<br />

4.2 Computational techniques in MT . . . . . . . . . . . . . . . . . . . . . . 47<br />

4.2.1 System design . . . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />

4.2.2 Interactive systems . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />

4.2.3 Lexical databases . . . . . . . . . . . . . . . . . . . . . . . . . . 49<br />

4.2.4 Tokens and <strong>to</strong>kenization . . . . . . . . . . . . . . . . . . . . . . 49<br />

4.2.5 Syntactic analysis (Parsing) . . . . . . . . . . . . . . . . . . . . 50<br />

4.3 Basic <strong>machine</strong> translation strategies . . . . . . . . . . . . . . . . . . . . 50<br />

4.3.1 Multilingual versus bilingual systems . . . . . . . . . . . . . . . 50<br />

4.3.2 Direct translation . . . . . . . . . . . . . . . . . . . . . . . . . . 51<br />

4.3.3 Interlingua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52<br />

4.3.4 Transfer systems . . . . . . . . . . . . . . . . . . . . . . . . . . 53<br />

4.3.5 Statistical <strong>machine</strong> translation . . . . . . . . . . . . . . . . . . . 56<br />

4.4 Linguistic aspects of MT . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />

ix


CONTENTS<br />

4.4.1 Non-Roman alphabet scripts . . . . . . . . . . . . . . . . . . . . 57<br />

4.4.2 Lexical ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />

4.4.2.1 Category ambiguity . . . . . . . . . . . . . . . . . . . 57<br />

4.4.2.2 Homograph . . . . . . . . . . . . . . . . . . . . . . . 58<br />

4.4.3 Syntactic ambiguity . . . . . . . . . . . . . . . . . . . . . . . . 58<br />

4.4.4 Structural differences . . . . . . . . . . . . . . . . . . . . . . . . 60<br />

4.5 Challenges of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> MT . . . . . . . . . . . . . . . . . . . . 60<br />

4.6 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62<br />

4.6.1 Generation in direct systems . . . . . . . . . . . . . . . . . . . . 62<br />

4.6.2 Generation in transfer-based systems . . . . . . . . . . . . . . . 63<br />

4.6.3 Generation in interlingua systems . . . . . . . . . . . . . . . . . 64<br />

4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />

5 Design of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation system based on RRG 67<br />

5.1 UniArab: Interlingua-based system . . . . . . . . . . . . . . . . . . . . . 68<br />

5.2 Designing an XML lexicon architecture <strong>for</strong> <strong>Arabic</strong> MT based on RRG . . 69<br />

5.2.1 An XML-based lexicon . . . . . . . . . . . . . . . . . . . . . . . 70<br />

5.2.2 Lexical representation in UniArab . . . . . . . . . . . . . . . . . 70<br />

5.2.3 Lexical properties . . . . . . . . . . . . . . . . . . . . . . . . . . 71<br />

5.3 Design of test strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 74<br />

5.4 Design of evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . 77<br />

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />

6 UniArab: a proof-of-concept <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation system 79<br />

6.1 Conceptual structure of the UniArab system . . . . . . . . . . . . . . . . 80<br />

6.1.1 Technical architecture of the UniArab system . . . . . . . . . . . 81<br />

6.1.2 UniArab: Lexical representation in interlingua system . . . . . . 84<br />

x


CONTENTS<br />

6.2 UniArab: Lexical representation in interlingua system based on RRG . . 86<br />

6.2.1 Verb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86<br />

6.2.2 Common noun . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />

6.2.3 Proper noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />

6.2.4 Adjective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89<br />

6.2.5 Demonstrative . . . . . . . . . . . . . . . . . . . . . . . . . . . 90<br />

6.2.6 Adverb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91<br />

6.2.7 Other <strong>Arabic</strong> words . . . . . . . . . . . . . . . . . . . . . . . . . 92<br />

6.3 UniArab: Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92<br />

6.4 UniArab: Screen design . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />

6.4.1 Lexicon interface . . . . . . . . . . . . . . . . . . . . . . . . . . 97<br />

6.5 Technical challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />

6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />

7 Testing and evaluation 100<br />

7.1 Evaluation of MT systems . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />

7.2 Sentence tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101<br />

7.2.1 Verb-Subject with one argument in different tenses . . . . . . . . 102<br />

7.2.2 Gender-ambiguous proper nouns . . . . . . . . . . . . . . . . . . 106<br />

7.2.3 Verb ‘<strong>to</strong> be’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />

7.2.4 Verb ‘<strong>to</strong> have’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />

7.2.5 Free word order . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />

7.2.6 Pro-drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />

7.2.7 Transitivity of verbs . . . . . . . . . . . . . . . . . . . . . . . . 116<br />

7.2.7.1 Intransitive . . . . . . . . . . . . . . . . . . . . . . . . 116<br />

7.2.7.2 Transitive . . . . . . . . . . . . . . . . . . . . . . . . 118<br />

7.2.7.3 Ditransitive . . . . . . . . . . . . . . . . . . . . . . . 119<br />

xi


CONTENTS<br />

7.2.8 Limitation of UniArab . . . . . . . . . . . . . . . . . . . . . . . 122<br />

7.3 System evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />

7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />

8 Conclusion 129<br />

8.1 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131<br />

8.2 Summary of thesis contributions . . . . . . . . . . . . . . . . . . . . . . 132<br />

8.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133<br />

References 134<br />

Appendix 140<br />

A The author’s publications related <strong>to</strong> this research 140<br />

B Buckwalter <strong>Arabic</strong> transliteration 142<br />

C List of translatable sentences 145<br />

D Verbs in lexicon 161<br />

E The UniArab code 170<br />

xii


List of Figures<br />

2.1 A classification <strong>for</strong> the <strong>Arabic</strong> language syntax . . . . . . . . . . . . . . 14<br />

2.2 A classification of clauses in the <strong>Arabic</strong> language . . . . . . . . . . . . . 22<br />

3.1 Layout of Role and Reference Grammar . . . . . . . . . . . . . . . . . . 29<br />

3.2 <strong>Arabic</strong> sentence types; verb subject object or subject verb object (<strong>for</strong><br />

gloss please see example 3.1) . . . . . . . . . . . . . . . . . . . . . . . . 30<br />

3.3 Formal representation of the layered structure of the clause . . . . . . . . 31<br />

3.4 <strong>English</strong> Sentence with precore slot and left-detached position . . . . . . . 33<br />

3.5 Opera<strong>to</strong>r projection in LSC . . . . . . . . . . . . . . . . . . . . . . . . . 34<br />

3.6 LSC with constituent and opera<strong>to</strong>r projections . . . . . . . . . . . . . . . 35<br />

3.7 <strong>Arabic</strong> LSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />

3.8 The RRG representing the universal aspects of the layered structure of<br />

the clause (Van Valin and LaPolla 1997) . . . . . . . . . . . . . . . . . . 42<br />

4.1 Direct MT system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51<br />

4.2 Interlingua1 model with eight languages pairs . . . . . . . . . . . . . . . 52<br />

4.3 Multilinguality transfer model with eight languages pairs . . . . . . . . . 54<br />

xiii


LIST OF FIGURES<br />

4.4 Difference between direct, transfer, and interlingua MT models, (Trujillo<br />

1999) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55<br />

4.5 NP rule (NP –> det n pp) . . . . . . . . . . . . . . . . . . . . . . . . . . 59<br />

4.6 PP is attached at a higher level . . . . . . . . . . . . . . . . . . . . . . . 59<br />

4.7 Direct MT system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />

4.8 Semantic generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64<br />

4.9 Structure <strong>to</strong> be generated . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />

4.10 Interlingua model of <strong>Arabic</strong> MT . . . . . . . . . . . . . . . . . . . . . . 66<br />

5.1 The conceptual architecture of the UniArab system . . . . . . . . . . . . 68<br />

5.2 In<strong>for</strong>mation recorded in the UniArab lexicon . . . . . . . . . . . . . . . 72<br />

6.1 Layout of Role and Reference Grammar . . . . . . . . . . . . . . . . . . 79<br />

6.2 The conceptual architecture of the UniArab system . . . . . . . . . . . . 80<br />

6.3 Generation the right tense <strong>for</strong> the verbs . . . . . . . . . . . . . . . . . . . 84<br />

6.4 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> verb . . . . . . . . . . . . . . . . . . 87<br />

6.5 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> noun . . . . . . . . . . . . . . . . . 88<br />

6.6 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> proper noun . . . . . . . . . . . . . . 89<br />

6.7 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adjective . . . . . . . . . . . . . . . 90<br />

6.8 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> demonstrative. . . . . . . . . . . . . 91<br />

6.9 In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adverb. . . . . . . . . . . . . . . . . 91<br />

6.10 In<strong>for</strong>mation recorded on the other <strong>Arabic</strong> words . . . . . . . . . . . . . . 92<br />

6.11 UniArab’s GUI 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95<br />

6.12 UniArab’s GUI 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96<br />

6.13 UniArab’s GUI 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97<br />

6.14 UniArab’s lexicon interface . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />

7.1 Verb-Subject with one argument . . . . . . . . . . . . . . . . . . . . . . 102<br />

xiv


LIST OF FIGURES<br />

7.2 Verb-Subject with one argument . . . . . . . . . . . . . . . . . . . . . . 103<br />

7.3 Verb-subject agreement 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />

7.4 Verb-subject agreement 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 105<br />

7.5 Gender-ambiguous proper nouns 1 . . . . . . . . . . . . . . . . . . . . . 106<br />

7.6 Gender-ambiguous proper nouns 2 . . . . . . . . . . . . . . . . . . . . . 107<br />

7.7 Verb ‘<strong>to</strong> be’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />

7.8 Verb ‘<strong>to</strong> be’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109<br />

7.9 Verb ‘<strong>to</strong> have’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />

7.10 Verb ‘<strong>to</strong> have’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />

7.11 Free word order (Verb Noun Noun scenario one) . . . . . . . . . . . . . 112<br />

7.12 Free word order (Verb Noun Noun scenario two) . . . . . . . . . . . . . 113<br />

7.13 Free word order (Verb Noun Noun scenario three) . . . . . . . . . . . . . 114<br />

7.14 Pro-drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />

7.15 Intransitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />

7.16 Intransitive with an adverb . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />

7.17 Transitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />

7.18 Ditransitive 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />

7.19 Ditransitive with 2 NP . . . . . . . . . . . . . . . . . . . . . . . . . . . 120<br />

7.20 Ditransitive with preposition . . . . . . . . . . . . . . . . . . . . . . . . 121<br />

7.21 Limitation of UniArab 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 122<br />

7.22 Limitation of UniArab 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 123<br />

7.23 Limitation of UniArab 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />

xv


List of Tables<br />

2.1 Dual: merely add two letters <strong>to</strong> achieve dual <strong>for</strong>m in <strong>Arabic</strong> . . . . . . . 10<br />

2.2 Grammatical gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11<br />

2.3 Feminine is different than masculine . . . . . . . . . . . . . . . . . . . . 12<br />

2.4 Feminine and masculine in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . 12<br />

2.5 Definiteness in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . 12<br />

2.6 Definiteness example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . 12<br />

2.7 Free word order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13<br />

2.8 Noun example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />

2.9 Definite example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />

2.10 Indefinite example in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . 16<br />

2.11 <strong>Arabic</strong> adjective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />

2.12 <strong>Arabic</strong> adverb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />

2.13 Imperfect tense <br />

ālf֒l ālmd. ār֒ . . . . . . . . . . . . . . . . 17<br />

2.14 Perfect tense ālf֒l ālmād. y . . . . . . . . . . . . . . . . . . 17<br />

2.15 Imperfect inflectional <strong>for</strong>ms of word ‘write’ . . . . . . . . . . . . . . . . 18<br />

2.16 Perfect inflectional <strong>for</strong>ms of word ‘wrote’ . . . . . . . . . . . . . . . . . 18<br />

2.17 Future tense in <strong>Arabic</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . 18<br />

xvi


LIST OF TABLES<br />

2.18 Indicative mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

2.19 Subjunctive mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

2.20 Jussive mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

2.21 Imperative mood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />

2.22 Particle ‘Lan’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />

2.23 Nominal sentence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />

2.24 Kan and its sisters kān w ֓ahwthā . . . . . . . . . . . . . . 23<br />

˘<br />

2.25 zanna and its sisters z . n w ֓ahwthā . . . . . . . . . . . . . . 24<br />

˘<br />

2.26 In<strong>for</strong>med and showed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />

2.27 verb(V), subject(S) and object(O) . . . . . . . . . . . . . . . . . . . . . 25<br />

2.28 subject(S), verb(V) and object(O) . . . . . . . . . . . . . . . . . . . . . 25<br />

2.29 verb(V), object(O) and subject(S) . . . . . . . . . . . . . . . . . . . . . 25<br />

2.30 Two simple clauses by subordinating conjunction . . . . . . . . . . . . . 25<br />

3.1 Relationships between the semantic and syntactic units . . . . . . . . . . 32<br />

3.2 Lexical representations <strong>for</strong> the basic Aktionsart classes . . . . . . . . . . 38<br />

4.1 Modules required in an all-pairs multilingual transfer system . . . . . . . 54<br />

4.2 Derived words from a three-letter-root in <strong>Arabic</strong> . . . . . . . . . . . . . 61<br />

5.1 Verb 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73<br />

5.2 Verb 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74<br />

5.3 Test strategy: verb-subject agreement . . . . . . . . . . . . . . . . . . . 75<br />

5.4 Test strategy: demonstrative adjective-noun agreement . . . . . . . . . . 75<br />

5.5 Test strategy: gender-ambiguous proper nouns . . . . . . . . . . . . . . 75<br />

5.6 Test strategy: verb ‘<strong>to</strong> be’ . . . . . . . . . . . . . . . . . . . . . . . . . 76<br />

5.7 Test strategy: verb ‘<strong>to</strong> have’ . . . . . . . . . . . . . . . . . . . . . . . . 76<br />

5.8 Test strategy: free word order (Verb Noun Noun) . . . . . . . . . . . . . 77<br />

xvii


LIST OF TABLES<br />

5.9 Test strategy: pro–drop . . . . . . . . . . . . . . . . . . . . . . . . . . . 77<br />

6.1 Verb 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />

6.2 Verb 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />

6.3 Noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />

6.4 Proper Noun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89<br />

6.5 Adjective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89<br />

6.6 Demonstrative representative . . . . . . . . . . . . . . . . . . . . . . . . 90<br />

6.7 Adverb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91<br />

6.8 Other <strong>Arabic</strong> words (where ‘NON’ means not applicable) . . . . . . . . 92<br />

7.1 Test : Verb-Subject; one argument . . . . . . . . . . . . . . . . . . . . . 102<br />

7.2 Test : Verb-subject; agreement 1 . . . . . . . . . . . . . . . . . . . . . . 103<br />

7.3 Test : verb-subject; agreement 2 . . . . . . . . . . . . . . . . . . . . . . 104<br />

7.4 Test : Gender-ambiguous proper nouns 1 . . . . . . . . . . . . . . . . . 106<br />

7.5 Test : gender-ambiguous proper nouns 2 . . . . . . . . . . . . . . . . . . 107<br />

7.6 Test : Verb ‘<strong>to</strong> be’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />

7.7 Test : Verb ‘<strong>to</strong> be’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109<br />

7.8 Test : Verb ‘<strong>to</strong> have’ 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />

7.9 Test : Verb ‘<strong>to</strong> have’ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />

7.10 Test : Free word order (Verb Noun Noun scenario one) . . . . . . . . . . 112<br />

7.11 Test : Free word order (Verb Noun Noun scenario two) . . . . . . . . . . 113<br />

7.12 Test : Free word order (Verb Noun Noun scenario three) . . . . . . . . . 114<br />

7.13 Test: Pro-drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />

7.14 Test : Intransitive 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />

7.15 Test : Intransitive 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />

7.16 Test : Transitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />

xviii


LIST OF TABLES<br />

7.17 Test : Ditransitive 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />

7.18 Test : Ditransitive with 2 NP . . . . . . . . . . . . . . . . . . . . . . . . 120<br />

7.19 Test : Ditransitive with preposition . . . . . . . . . . . . . . . . . . . . 121<br />

7.20 Test : Limitation of UniArab . . . . . . . . . . . . . . . . . . . . . . . . 122<br />

7.21 Test : Limitation of UniArab 3 using non existing nonsense word . . . . 124<br />

xix


1<br />

Introduction<br />

The following paragraph was translated from <strong>Arabic</strong> in<strong>to</strong> <strong>English</strong> using the Google trans-<br />

la<strong>to</strong>r (Google 2009).<br />

That rely entirely on <strong>machine</strong> translation ignores the fact that communication in<br />

the language of rights is an integral part of the context, and that the human is<br />

capable of understanding the context of the original text in a manner sufficient.<br />

There<strong>for</strong>e can not be trusted after the <strong>machine</strong> translation programs, they could<br />

not analyze the context of the original version is similar <strong>to</strong> the human<br />

understanding of when listening <strong>to</strong> the same text.<br />

It is clear that the paragraph cannot be easily unders<strong>to</strong>od, and a large amount of the<br />

in<strong>for</strong>mation has been confused or mixed up. This shows the problems facing <strong>machine</strong><br />

translation, and motivates our work. We believe that statistical <strong>machine</strong> translation has<br />

not achieved what people expected in terms of quality. Hence we wish <strong>to</strong> look at another<br />

method, building from the ground up <strong>to</strong> achieve higher quality translations.<br />

1


Machine translation has yet <strong>to</strong> reach its potential within the translation market as a whole.<br />

Figures suggest that MT accounts <strong>for</strong> a small us $ 100 million portion of a us $ 10 billion<br />

translation market (Intelligence 2004). Many have suggested that the reason is the poor<br />

quality of results, hence it only makes sense when very large amounts of data need <strong>to</strong><br />

be processed (Oren 2004). For the MT market <strong>to</strong> expand, it is necessary <strong>to</strong> improve the<br />

quality of results, which will then make it a viable alternative within the much bigger<br />

translation market.<br />

<strong>Arabic</strong> is acquiring attention in the natural language processing (NLP) community be-<br />

cause of its political importance and the linguistic differences between it and European<br />

languages. These linguistic characteristics, especially complex morphology, present in-<br />

teresting challenges <strong>for</strong> NLP researchers. According <strong>to</strong> Holes (2004) <strong>Arabic</strong> is the sole<br />

or joint official language in twenty independent Middle Eastern and African states: Alge-<br />

ria, Bahrain, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman,<br />

Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, the United Arab Emirates<br />

and Yemen. Since the end of the nineteenth century, there have been large communities<br />

of <strong>Arabic</strong> speakers outside the Middle East, particularly in the United States and Eu-<br />

rope. <strong>Arabic</strong> is also the language of Islam’s holy book the Qur’an, and as such is the<br />

religious language of all Muslims. <strong>Arabic</strong> has been an official language of the United<br />

Nations alongside <strong>English</strong>, French, Spanish, and Chinese since 1 January 1971’(Holes<br />

2004). There are a number of different <strong>Arabic</strong> words in languages such as Persian, Turk-<br />

ish, Urdu or Malawian. The words derived from <strong>Arabic</strong> that exist in Spanish, Portuguese,<br />

German, Italian, <strong>English</strong> or French are also numerous (Bateson 2003).<br />

The aim of this research is <strong>to</strong> create an Interlingua Machine Translation (MT) system<br />

that will accept <strong>Arabic</strong> source sentences and generate <strong>English</strong> sentences, and <strong>to</strong> build a<br />

2


high-quality translation technology that is adequate <strong>for</strong> text-<strong>to</strong>-text translation. In this<br />

research we build an Interlingua architecture in MT which translates efficiently. We con-<br />

sider semantic analysis and other disambiguation related <strong>to</strong> <strong>Arabic</strong>. This research also<br />

represents a starting point <strong>for</strong> the future implementation of a successful and complete<br />

<strong>Arabic</strong> MT engine. The hypothesis under investigation and main aims are <strong>to</strong> present an<br />

interlingua architecture, which is not only successful in translating simplex <strong>Arabic</strong> (in-<br />

transitive, transitive, ditransitive and copula-like nominative) sentences <strong>to</strong> corresponding<br />

<strong>English</strong> sentences, but also does so in the most optimal way.<br />

This research is the first contribution (not just <strong>for</strong> <strong>Arabic</strong>) that uses the Role and Refer-<br />

ence Grammar (RRG) model as a basis <strong>for</strong> <strong>machine</strong> translation. This contribution shows<br />

how RRG can be used <strong>to</strong> deduce the logical structure of sentences and produce a lexical<br />

representation which can then be used as the interlingua bridge. The lexicon in RRG<br />

takes the position that lexical entries <strong>for</strong> verbs should contain unique in<strong>for</strong>mation only,<br />

with as much in<strong>for</strong>mation as possible derived from general lexical rules. This was the<br />

reason <strong>for</strong> creating our own lexicon since we need an RRG–based lexicon of the unique<br />

in<strong>for</strong>mation of verbs and their logical structure.<br />

UniArab stands <strong>for</strong> Universal <strong>Arabic</strong> <strong>machine</strong> transla<strong>to</strong>r system. The UniArab system<br />

is a natural language processing application based on Role and Reference Grammar <strong>for</strong><br />

translating the <strong>Arabic</strong> language in<strong>to</strong> any other language, using an RRG based interlingua<br />

bridge. The UniArab system can understand the part of speech of a word, agreement<br />

features, number, gender and the word type. The syntactic parse unpacks the agreement<br />

features between elements of the <strong>Arabic</strong> sentence in<strong>to</strong> a semantic representation (the log-<br />

ical structure) with the ‘state of affairs’ of the sentence. In the UniArab system we intend<br />

<strong>to</strong> have a strong analysis system that can unpack all in<strong>for</strong>mation and its attributes. This<br />

3


1.1. MOTIVATION<br />

allows <strong>for</strong> a generalized target language <strong>to</strong> be generated from the logical structures. In<br />

this research we translate from <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> only, with a view <strong>to</strong> translate from<br />

<strong>Arabic</strong> <strong>to</strong> any other target language in the future.<br />

1.1 Motivation<br />

The motivation <strong>for</strong> an <strong>Arabic</strong>-<strong>English</strong> translation <strong>to</strong>ol is obvious when one considers that<br />

<strong>Arabic</strong> is the lingua franca of the Middle-Eastern world. Presently, 20 countries with a<br />

combined population of 450 million consider Standard <strong>Arabic</strong> as their national language.<br />

A simple test case during a study at Abu Dhabi University over three popular <strong>Arabic</strong><br />

translation <strong>to</strong>ols (Google, Sakhr’s Tarjim and Systran) revealed little success in generat-<br />

ing the correct meaning (Izwaini 2006). This research demonstrates the capabilities of<br />

Role and Reference Grammar as a base <strong>for</strong> multi-language translation.<br />

1.2 Goals<br />

The goal of our research <strong>to</strong>wards an <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> <strong>machine</strong> translation system is <strong>to</strong><br />

create a system that translates simplex sentences of Modern Standard <strong>Arabic</strong> as a source<br />

language in<strong>to</strong> <strong>English</strong>. Our goal is <strong>to</strong> build a system which can translate a wide variety of<br />

simple sentence types. We aim <strong>to</strong> make this system as scalable as possible by allowing<br />

users <strong>to</strong> add <strong>to</strong> the lexicon and later, in future research, <strong>to</strong> include complex sentences.<br />

To achieve this goal, it is essential <strong>to</strong> build a robust and accurate lexical system and<br />

<strong>machine</strong> transla<strong>to</strong>r. One of the steps we have <strong>to</strong> achieve is <strong>to</strong> generate the universal<br />

logical structure from a source sentence. The system should be capable of dealing with<br />

free word order which <strong>Arabic</strong> exhibits. This poses a significant challenge <strong>to</strong> MT due<br />

<strong>to</strong> the vast number of ways <strong>to</strong> express the same sentence in <strong>Arabic</strong>. Also, we must<br />

account <strong>for</strong> verbs that do not exist in <strong>Arabic</strong> like the copula verb ‘<strong>to</strong> be’ and the verb<br />

‘<strong>to</strong> have’. The system should deal with the transitivity of verbs (intransitive, transitive,<br />

4


1.3. TECHNOLOGIES<br />

ditransitive). The <strong>Arabic</strong> language is written from right <strong>to</strong> left and has a unique letter<br />

shape. Words are written in horizontal lines from right <strong>to</strong> left. The letter shape depends<br />

on its position in the word; initial (prefix), medial (infix), final (suffix) or (Isolated).<br />

In technical linguistic terms, <strong>Arabic</strong> is a ‘pro–drop’ or ‘pronoun–drop’ language. It can<br />

define who takes the action by using conjugations. The pro–drop parameter is an aspect of<br />

grammar that allows subjects <strong>to</strong> be optional in some languages. That is, every inflection<br />

in a verb paradigm is specified uniquely and does not need <strong>to</strong> use independent pronouns<br />

<strong>to</strong> differentiate the person, number, and gender of the verb. The system should cover and<br />

solve the “pro–drop” challenge in <strong>Arabic</strong>.<br />

1.3 Technologies<br />

We introduce the main technologies used <strong>to</strong> support the development of the research pre-<br />

sented in this thesis. These technologies are mainly the XML language and Java. The<br />

most recent recommendation of the XML language has been presented by Bray et al.<br />

(2008). XML has become the default standard <strong>for</strong> data exchange among heterogeneous<br />

data sources (Arciniegas 2000). The UniArab system allows data <strong>to</strong> be s<strong>to</strong>red in XML<br />

<strong>for</strong>mat. This data can then be queried, exported and serialized in<strong>to</strong> any <strong>for</strong>mat the devel-<br />

oper wishes. The Java programming language is used <strong>to</strong> implement the logical structures.<br />

The primary advantage being that Java is plat<strong>for</strong>m-independent and thus highly suitable<br />

<strong>for</strong> MT.<br />

Advantages of XML<br />

XML is a generalized way <strong>to</strong> s<strong>to</strong>re data, which is not married <strong>to</strong> any particular technology.<br />

This makes it easy <strong>to</strong> s<strong>to</strong>re something, and then come back and grab it later with some<br />

other technology <strong>for</strong> processing. Using XML <strong>to</strong> exchange in<strong>for</strong>mation offers a number<br />

of advantages, including the following:<br />

5


1.4. THESIS ORGANIZATION<br />

Easily built: A well-<strong>for</strong>med data element must be enclosed between tags. The XML<br />

document can be parsed without prior knowledge of the tags. XML allows you <strong>to</strong> define<br />

all sorts of tags with all sorts of rules, such as tags representing data description or data<br />

relationships.<br />

Human readable: Using intelligible tag names will make it possible <strong>to</strong> read, even by<br />

novices.<br />

Machine readable: XML was designed <strong>to</strong> be easy <strong>for</strong> computers <strong>to</strong> process. XML is<br />

completely compatible with Java and portable plat<strong>for</strong>ms. Any application can process<br />

XML on any plat<strong>for</strong>m, as it is a plat<strong>for</strong>m-independent language.<br />

XML fully supports <strong>Arabic</strong>: We chose <strong>to</strong> create our datasource as XML files, <strong>for</strong> op-<br />

timum support of different plat<strong>for</strong>ms. It was also easier as we used <strong>Arabic</strong> letters rather<br />

than Unicode inside the datasource.<br />

XML search engine: It is easy <strong>to</strong> extend the search sample <strong>to</strong> display more in<strong>for</strong>mation<br />

about the search. Search by Java API Document Object Model (DOM) is the ideal <strong>to</strong>ol<br />

<strong>for</strong> searching collections of XML documents.<br />

1.4 Thesis organization<br />

This thesis is organized as follows. Chapter 2 explores how the characteristics of the<br />

(Modern Standard) <strong>Arabic</strong> language will affect the development of an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong><br />

<strong>machine</strong> translation (MT) <strong>to</strong>ol. Several distinguishing features of <strong>Arabic</strong> pertinent <strong>to</strong> MT<br />

are explored in detail (Salem et al. 2008b). Chapter 3 reviews the most important fea-<br />

tures of Role and Reference Grammar (RRG) Theory (Salem and Nolan 2009a). Chapter<br />

4 will discuss some distinguishing features of Machine translation strategies. Chapter<br />

5 presents the design of an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation <strong>framework</strong> based on<br />

RRG. It also presents a high-level view of the system <strong>framework</strong> and defines our eval-<br />

uation criteria <strong>for</strong> measuring system per<strong>for</strong>mance and effectiveness (Salem and Nolan<br />

6


1.4. THESIS ORGANIZATION<br />

2009b). Chapter 6 presents UniArab: a proof-of-concept <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong><br />

transla<strong>to</strong>r system. It covers the technical aspects of UniArab, covering all the phases<br />

involved in the <strong>machine</strong> translation process. We describe the lexical system that under-<br />

lies UniArab, detailing the attribute in<strong>for</strong>mation held <strong>for</strong> each type of word. We discuss<br />

the input and generation phase and how the system maps the logical structure <strong>to</strong> a target<br />

<strong>English</strong> sentence. We then briefly discuss the user interface, and some of the technical<br />

challenges encountered during the implementation (Salem et al. 2008a) and (Nolan and<br />

Salem 2009). Chapter 7 discusses the evaluating and experimental results of the case<br />

study. We present the results of our evaluation of UniArab <strong>for</strong> a wide variety of simple<br />

(Intransitive, Transitive and Ditransitive) sentence types (Salem and Nolan 2009c). The<br />

thesis conclusions and future work are discussed in Chapter 8.<br />

7


2<br />

The <strong>Arabic</strong> Language<br />

<strong>Arabic</strong> is a language with a derivational and inflectional rich morphology (Holes 2004).<br />

The version of <strong>Arabic</strong> we consider in this research is Modern Standard <strong>Arabic</strong> (MSA).<br />

When we mention <strong>Arabic</strong> throughout this research we mean MSA which is distinct from<br />

classical <strong>Arabic</strong>. Modern Standard <strong>Arabic</strong> (MSA) is a modernized <strong>for</strong>m of Classical<br />

<strong>Arabic</strong> (Alosh 2005). MSA is the literary and standard variety of <strong>Arabic</strong> used in writing<br />

and <strong>for</strong>mal speeches <strong>to</strong>day (Schulz 2005). MSA is the universal language of the <strong>Arabic</strong>-<br />

speaking population. MSA is printed in most books, newspapers, magazines, official<br />

documents, and reading primers <strong>for</strong> children. Most of the oral <strong>Arabic</strong> spoken <strong>to</strong>day is<br />

more divergent than the written <strong>Arabic</strong> language. <strong>Arabic</strong> words are often ambiguous<br />

in their morphological analysis (Al-Sughaiyer and Al-Kharashi 2004). As a language,<br />

<strong>Arabic</strong> is rich in morphological and syntactic structures. <strong>Arabic</strong> is also challenging in<br />

that it is a derivational or constructional language rather than a concatenative one. Words<br />

8


2.1. CHARACTERISTICS OF THE ARABIC LANGUAGE<br />

like ‘go’ yd ¯ hb and d ¯ hb ∗ can easily be seen as being part of a hierarchy<br />

of inheritance from a ‘specific root (in this case d ¯ hb). In <strong>English</strong> and in many<br />

other languages this is not usually the case. The <strong>Arabic</strong> language is written from right<br />

<strong>to</strong> left. It has 28 letters, many language specific grammar rules with a relatively free<br />

word order language. Each <strong>Arabic</strong> letter represents a specific sound so the spelling of<br />

words can easily be done phonetically. There is no use of silent letters as in <strong>English</strong>.<br />

Similarly, there is no need <strong>to</strong> combine letters in <strong>Arabic</strong> <strong>to</strong> indicate a certain sound. For<br />

example, the ’th’ sound in <strong>English</strong> as in the word ’Thinking’ is reduced in <strong>Arabic</strong> <strong>to</strong> the<br />

character t ¯ . In addition <strong>to</strong> the standard challenges involved in developing an efficient<br />

translation <strong>to</strong>ol from <strong>Arabic</strong> <strong>to</strong> <strong>English</strong>, the relatively free word order nature of <strong>Arabic</strong><br />

creates an obstacle. There is no copula verb ‘<strong>to</strong> be’ in <strong>Arabic</strong>, <strong>for</strong> example, the mere<br />

juxtaposition of the subject and predicate indicates the predicational relationship. The<br />

absence of the indefinite article, while not unique <strong>to</strong> <strong>Arabic</strong> still poses many difficulties<br />

within the context of the language structure.<br />

2.1 Characteristics of the <strong>Arabic</strong> language<br />

The copula verbs ‘<strong>to</strong> be’ and ‘<strong>to</strong> have’ do not exist in <strong>Arabic</strong>. Instead of saying ‘My<br />

name is Zaid’, the <strong>Arabic</strong> equivalent would read like ‘Name mine Zaid’ - <br />

֓ismy<br />

zyd. Instead of saying ‘She is a student’, the <strong>Arabic</strong> equivalent would be ‘She student’;<br />

in <strong>Arabic</strong> hy t.ālbh. The copula in <strong>Arabic</strong> is only realised in the past and<br />

future tenses and in negation. Regarding the verb ‘<strong>to</strong> have’, which in <strong>English</strong> can also<br />

mean ‘<strong>to</strong> own’. Instead of saying “He has a house”, the <strong>Arabic</strong> equivalent is ‘To him a<br />

house’ - lh byt. Adjectives in <strong>Arabic</strong> have both a masculine and a feminine <strong>for</strong>m.<br />

The singular feminine adjective is just like the masculine adjective but morphologically<br />

marked (Ryding 2007).<br />

∗ <strong>Arabic</strong> examples are written here by using Buckwalter <strong>Arabic</strong> Transliteration which is converted in latex in<strong>to</strong> the DIN 31635<br />

standard of <strong>Arabic</strong> transliteration<br />

9


2.1. CHARACTERISTICS OF THE ARABIC LANGUAGE<br />

Table 2.1: Dual: merely add two letters <strong>to</strong> achieve dual <strong>for</strong>m in <strong>Arabic</strong><br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

bāb door<br />

bābān two doors<br />

The <strong>Arabic</strong> number system includes the dual <strong>for</strong>m, whereas other languages move from<br />

the singular <strong>to</strong> the plural <strong>for</strong>m directly. In <strong>Arabic</strong> we need only <strong>to</strong> add two letters <strong>to</strong> the<br />

singular <strong>for</strong>m <strong>to</strong> express the dual <strong>for</strong>m. An example is given in Table 2.1. The plural<br />

<strong>for</strong>m, however, is obtained using a different mechanism.<br />

Plurals are of two types:<br />

(1) The sound plural. The sound plural is one in which the singular <strong>for</strong>m of the word<br />

remains intact (sound) with some addition at the end. Examples;<br />

Masculine in the nominative case e.g. engineers mhndswn in which wn is<br />

added <strong>to</strong> a singular noun. Masculine in the accusative and genitive cases e.g. engineers<br />

mhndsyn in which yn is added <strong>to</strong> the singular noun.<br />

Feminine in the nominative e.g engineers mhndsātun in which ātun is<br />

added <strong>to</strong> the singular noun.<br />

Feminine in the accusative and genitive cases engineers mhndsātin in which <br />

ātin is added <strong>to</strong> the singular noun.<br />

(2) The broken plural. The broken plural is one in which the <strong>for</strong>m of the singular word is<br />

broken, that is, changed. It has no fixed rule <strong>for</strong> making it. Sometimes letters are added<br />

or deleted and sometimes there is merely a change in the vowels. Examples ktb<br />

books, ktāb a book, rˇgl man, rˇgāl men, snh a year snwāt<br />

years.<br />

10


2.2 Characteristics of <strong>Arabic</strong> words<br />

2.2. CHARACTERISTICS OF ARABIC WORDS<br />

There is no upper and lower case distinction. Words are written horizontally from right<br />

<strong>to</strong> left. Most letters change <strong>for</strong>m depending on whether they appear at the beginning,<br />

middle or end of a word or on their own. <strong>Arabic</strong> letters that may be joined are always<br />

joined in both hand-written and printed <strong>for</strong>m.<br />

An interesting feature of <strong>Arabic</strong> is its treatment of the demonstrative. Whereas in En-<br />

glish one refers <strong>to</strong> an object that is either near or far as simply this (very near the speaker)<br />

or that (away from the speaker up <strong>to</strong> any distance), <strong>Arabic</strong> has a third demonstrative <strong>to</strong><br />

specify objects that are in between these points on the distance spectrum.<br />

Table 2.2: Grammatical gender<br />

<strong>Arabic</strong> Masculine <strong>English</strong> Translation<br />

qmr moon<br />

<br />

syf sword<br />

bāb door<br />

<strong>Arabic</strong> Feminine <strong>English</strong> Translation<br />

ˇsms sun<br />

֒s. ā stick<br />

nāfd ¯ h window<br />

In <strong>Arabic</strong>, all nouns must be either feminine or masculine, and the gender can be either<br />

grammatical or natural. The gender of inanimate objects is grammatical, examples are<br />

in Table 2.2. In this case the gender is a built-in lexical property of the word. Animate<br />

objects have a natural gender, and this gender can be either non-productive or productive.<br />

The non-productive gender is the case of nouns where the feminine and the masculine<br />

have different lexical entries, i.e., the feminine is not derived from the masculine, as<br />

in Table 2.3. By contrast, in the productive gender, the feminine is derived from the<br />

masculine, usually by adding a special suffix ‘ta marbuta’ <strong>to</strong> the end of the masculine<br />

11


2.2. CHARACTERISTICS OF ARABIC WORDS<br />

<strong>for</strong>m, as in Table 2.4. The <strong>Arabic</strong> definite article is concatenated <strong>to</strong> nouns and adjectives.<br />

The shape of the definite article is shown in Table 2.5.<br />

Table 2.3: Feminine is different than masculine<br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

daˇgaāˇgah Chicken<br />

dyk Cock<br />

Table 2.4: Feminine and masculine in <strong>Arabic</strong><br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

mu֒ilmun teacher(M)<br />

mu֒limtun teacher(F)<br />

t.ālb student(M)<br />

t. ālbh student(F)<br />

Table 2.5: Definiteness in <strong>Arabic</strong><br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

āl the<br />

The definite article in <strong>Arabic</strong> is graphically prefixed <strong>to</strong> an <strong>Arabic</strong> noun. An example of<br />

<strong>Arabic</strong> definiteness is shown in Table 2.6.<br />

2.2.1 Free word order<br />

Table 2.6: Definiteness example in <strong>Arabic</strong><br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

rˇgl a man<br />

ālrˇgl the man<br />

<strong>Arabic</strong> has a relatively free word order (Ramsay and Mansour 2006), this poses a signif-<br />

icant challenge <strong>to</strong> MT due <strong>to</strong> the number of possible ways <strong>to</strong> express the same sentence<br />

in <strong>Arabic</strong>. For the elements of subject(S), verb(V) and object(O), <strong>Arabic</strong>’s relatively free<br />

12


2.2. CHARACTERISTICS OF ARABIC WORDS<br />

word order allows the combinations of SVO, VSO, VOS and OVS. For example, consider<br />

the following word orders:<br />

(1) Noun1 Verb Noun2<br />

(2) Noun2 Verb Noun1<br />

(3) Verb Noun1 Noun2<br />

(4) Verb Noun2 Noun1<br />

(a) Noun Verb Noun example.<br />

qys yh. b lylā<br />

Qays loves Laila<br />

lylā yh. b qys<br />

Laila loves Qays<br />

noun verb noun<br />

Table 2.7: Free word order<br />

(c) Verb Noun Noun example.<br />

yh. b lylā qys<br />

Qays loves Laila<br />

qys lylā yh. b<br />

Qays Laila loves<br />

noun noun verb<br />

(b) Verb Noun Noun example.<br />

yh. b qys lylā<br />

Qays loves Laila<br />

lylā qys yh. b<br />

Laila Qays loves<br />

noun noun verb<br />

This means that we have a challenge <strong>to</strong> identify exactly which is the subject and the ob-<br />

ject. Tables 2.7(a), 2.7(b) and 2.7(c) show this challenge. In <strong>Arabic</strong> the subject agrees<br />

with the verb with appropriate morphological marking on the word <strong>to</strong> differentiate sub-<br />

ject from object in these free word order sentences. †<br />

The difference in Tables 2.7(a), 2.7(b) and 2.7(c) is the position of the ac<strong>to</strong>r. The sen-<br />

tences in fact have the same meaning. While in <strong>English</strong> the <strong>for</strong>m of a sentence is subject<br />

verb object.<br />

† Note that <strong>Arabic</strong> sentences should be read from right <strong>to</strong> left.<br />

13


2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

2.3 Part of speech inven<strong>to</strong>ry of the <strong>Arabic</strong> language<br />

In the <strong>Arabic</strong> linguistic tradition there is not a clear-cut, well-defined analysis of the<br />

inven<strong>to</strong>ry of parts of speech in <strong>Arabic</strong>. Attia (2008) mentioned that the traditional clas-<br />

sification of <strong>Arabic</strong> parts of speech in<strong>to</strong> nouns, verbs and particles is not sufficient <strong>for</strong> a<br />

complete computational grammar. This categorization, originally proposed by Sibawaih<br />

(Owens 2006), remains the standard accepted scheme <strong>to</strong>day. However, we have found it<br />

lacking when applied <strong>to</strong> <strong>machine</strong> translation, and so, developed our own lexical scheme.<br />

Our classification of the parts of speech in <strong>Arabic</strong> is illustrated in Figure 2.1. We clas-<br />

sified parts of speech in<strong>to</strong> nouns, adjectives, adverbs, verbs, demonstratives, and others.<br />

Each category will be further explained in the following subsections.<br />

2.3.1 Noun<br />

Figure 2.1: A classification <strong>for</strong> the <strong>Arabic</strong> language syntax<br />

A noun denotes either tangible or intangible identities. Nouns are independent of other<br />

words in indicating their meaning. What distinguishes nouns from verbs is that nouns<br />

refer <strong>to</strong> entities or things. Nouns are further classified in<strong>to</strong> pronouns, common nouns and<br />

proper nouns. Pronouns are classified according <strong>to</strong> person (first, second, third), number<br />

14


2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

(singular, dual and plural) and gender (masculine and feminine). They can also be nom-<br />

inative, accusative, or genitive. Examples are ֓anā “I”, ֓ant “you”, and hw<br />

“he”. We make a further classification of common nouns in<strong>to</strong> animate and inanimate.<br />

Examples of common noun are in Table 2.8.<br />

Table 2.8: Noun example in <strong>Arabic</strong><br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

<br />

raˇgulun man<br />

ˇsaˇgrh tree<br />

Although this seems more like a semantic classification, the <strong>Arabic</strong> morphology and<br />

syntax needs this classification. For example the choice of demonstrative adjective with<br />

plural nouns depends on whether the noun is human or non-human. For example<br />

hd ¯ h ālklāb<br />

this.sg.f dog.pl<br />

The proper nouns are further classified in<strong>to</strong> names of persons, such as ֒mr “Omar”<br />

and <br />

h ˘ āld “Khalid”; locations, such as ālqāhrh “Cairo” and ֓ayrlndā<br />

“Ireland”; organizations āl֓amm ālmth. dh “United Nations”; and objects,<br />

such as lynwks “Linux” Common nouns can either be definite or indefinite.<br />

2.3.1.1 Definite nouns<br />

A noun normally can be considered as definite (in <strong>Arabic</strong>: m֒rfh) when the speaker<br />

and the reader know about the specific object being referred <strong>to</strong>, <strong>for</strong> example in Table 2.9.<br />

Table 2.9: Definite example in <strong>Arabic</strong><br />

<strong>Arabic</strong> ālktāb āld ¯ y tbh. t ¯ ֒nh fwq ālt.āwlh.<br />

<strong>English</strong> Translation The book you are looking <strong>for</strong> is on the table.<br />

In the example, the word ‘book’ is definite by using the definite article ‘the’, since both<br />

the speaker and the listener know which book they are dealing with. The definite article<br />

15


2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

in <strong>Arabic</strong> is used <strong>to</strong> introduce and talk about a known subject. The <strong>Arabic</strong> language uses<br />

the same article <strong>for</strong> all nouns, be they male or female, singular or plural. The article is<br />

written be<strong>for</strong>e the noun it refers <strong>to</strong> and, graphically, it appears attached <strong>to</strong> it.<br />

2.3.1.2 Indefinite nouns<br />

Indefinite nouns (in <strong>Arabic</strong>: nkrh ) are nouns which are not specified. It is translated<br />

as ‘a’ or ‘an’ in <strong>English</strong>, e.g. a man, an apple, water. There is no need <strong>to</strong> translate it<br />

everywhere as in the example of water. The absence of the indefinite article is, as in<br />

Table 2.10, a potential source of problems <strong>for</strong> <strong>Arabic</strong>-<strong>English</strong> <strong>machine</strong> translation.<br />

Table 2.10: Indefinite example in <strong>Arabic</strong><br />

<strong>Arabic</strong> wˇgdt ktābā ֒lā ālt.āwlh hl hw lk?<br />

<strong>English</strong> I found (a) book on the table, is it yours?<br />

2.3.2 Adjectives<br />

Adjectives are used <strong>to</strong> modify nouns. <strong>Arabic</strong> adjectives agree with nouns in number,<br />

gender, definiteness and case. An example is the adjective “useful”, in Table 2.11.<br />

2.3.3 Adverbs<br />

Table 2.11: <strong>Arabic</strong> adjective<br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

qr֓at ktābā nāf֒ā I read a useful book<br />

Adverbs are used <strong>to</strong> modify verbs. They can be adverbs of place, time or manner. An<br />

example in Table 2.12.<br />

Table 2.12: <strong>Arabic</strong> adverb<br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

֒qd ālāˇgtmā֒msā֓ The meeting was held in the evening<br />

16


2.3.4 Verbs<br />

2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

A verb describes both an action and tense. There are four ways <strong>to</strong> classify verbs in<br />

<strong>Arabic</strong>: according <strong>to</strong> tense, transitivity, mood and voice:<br />

2.3.4.1 Verb tenses<br />

There are mainly two tenses in <strong>Arabic</strong>: the imperfect and the perfect.<br />

The imperfect tense <br />

ālf֒l ālmd. ār֒, which indicates that an action has not<br />

yet been completed but is being done or will be done; something that is happening at the<br />

moment. An example is shown in Table 2.13.<br />

Table 2.13: Imperfect tense <br />

ālf֒l ālmd. ār֒<br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

yaktubu he is writing.<br />

The perfect tense ‘ mād. y, which indicates that an action has been completed. An<br />

example is shown in Table 2.14.<br />

Table 2.14: Perfect tense ālf֒l ālmād. y<br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

<br />

kataba he wrote.<br />

Both the perfect and imperfect tenses can be modified by thirteen inflectional <strong>for</strong>ms<br />

which depend on person, mood and number. Table 2.15 shows these <strong>for</strong>ms applied <strong>to</strong><br />

the imperfect, and Table 2.16 shows the thirteen person markers <strong>for</strong> the perfect tense.<br />

<br />

The word ‘ swf’ if it is be<strong>for</strong>e the imperfect tense then the verb has a future meaning.<br />

Graphically a word like this will look like [sawfa + imperfect] or [s + imperfect] similar<br />

<strong>to</strong> the example in Table 2.17.<br />

In <strong>Arabic</strong>, it is possible <strong>to</strong> combine the verb kaana kān with the main verb <strong>to</strong> in-<br />

dicate past progressive. This is where an action <strong>to</strong>ok place in the past but happened<br />

17


2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

Table 2.15: Imperfect inflectional <strong>for</strong>ms of word ‘write’<br />

Singular Dual Plural<br />

First Person <br />

nktbu <br />

֓aktbu<br />

Second Person (m) tktbwna tktbāni <br />

tktbu<br />

Second Person (f)<br />

tktbna tktbāni tktbna<br />

Third Person (m) yktbwna yktbāni <br />

yktbu<br />

Third Person (f)<br />

yktbna tktbāni <br />

tktbu<br />

Table 2.16: Perfect inflectional <strong>for</strong>ms of word ‘wrote’<br />

Singular Dual Plural<br />

First Person uktbnā ktbt<br />

Second Person (m) ktbtum <br />

ktbtumā<br />

ktbta<br />

<br />

ktbtuna <br />

ktbtumā ktbti<br />

Second Person (f)<br />

Third Person (m) ktbwā ktbā <br />

ktba<br />

Third Person (f)<br />

ktbna ktbatā<br />

ktbat<br />

Table 2.17: Future tense in <strong>Arabic</strong><br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

<br />

<br />

swf yaktubu he will write<br />

<br />

<br />

<br />

syaktubu he will write<br />

over a long period, or represents a state of being. This construct is used when talk-<br />

ing about knowledge of something in the past. In <strong>Arabic</strong>, the past perfect progressive<br />

is actually indicated using the present tense and the particle mundhu mnd. e.g.<br />

¯<br />

֓a֒yˇs hnā mnd hms snwāt I have been living here <strong>for</strong> five<br />

¯ ˘<br />

years. Future perfect in <strong>Arabic</strong> is indicated using the present tense of kaana with a past<br />

tense main verb. e.g. sykwn qd ֓anhā drāsth he will have fin-<br />

ished his studies.<br />

18


2.3.4.2 Aspect<br />

2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

Tense deals with when an action occurs, aspect determines whether the action has been<br />

completed, is ongoing or is yet <strong>to</strong> occur. In <strong>Arabic</strong>, tense and aspect are generally blended<br />

<strong>to</strong>gether, that is why past/present are often switched with perfect/imperfect in discussion.<br />

For a larger discussion on the sentax the tense and aspect refer <strong>to</strong> Ryding (2007).<br />

2.3.4.3 Mood<br />

Mood is reflected in <strong>Arabic</strong> in word structure, and so analysis is a part of the morphol-<br />

ogy. The mood can be indicative, subjunctive, imperative or jussive. The indicative are<br />

straight<strong>for</strong>ward statements, the subjunctive includes the attitude <strong>to</strong>wards actions, the im-<br />

perative indicates a command. Mood marking is only done on the present tense. There<br />

are no markings <strong>for</strong> past tense. Examples of the four moods are shown in Tables 2.18,<br />

2.19, 2.20 and 2.21.<br />

Table 2.18: Indicative mood<br />

<strong>Arabic</strong> <br />

nrh. b bzbā֓ynnā<br />

<strong>English</strong> we welcome our cus<strong>to</strong>mers.<br />

<strong>Arabic</strong> y˙gādr dbln ālywm<br />

<strong>English</strong> He leaves Dublin <strong>to</strong>day.<br />

Table 2.19: Subjunctive mood<br />

<strong>Arabic</strong> <br />

yˇgb ֓an nqwm bzyārh<br />

<strong>English</strong> It is necessary that we undertake a visit.<br />

<strong>Arabic</strong><br />

Table 2.20: Jussive mood<br />

lm n֓at<br />

<strong>English</strong> we did not come.<br />

֓is. lāh. āt lm tktml mnd ¯ ֒āmyn<br />

<strong>Arabic</strong><br />

<strong>English</strong> renovations that have not been completed <strong>for</strong> two years<br />

19


2.3.4.4 Voice<br />

2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

Table 2.21: Imperative mood<br />

<strong>Arabic</strong> āfth. yā smsm<br />

<strong>English</strong> Open, Semsame.<br />

<strong>Arabic</strong> āsmh. ly<br />

<strong>English</strong> Permit me.<br />

<strong>Arabic</strong> lā tns<br />

<strong>English</strong> Do not <strong>for</strong>get.<br />

The voice in <strong>Arabic</strong> is indicated by inflection on the verbs and differentiates between<br />

active and passive, as shown in the contrast between qāl “[he] said” and yqāl<br />

“[it is] said”.<br />

2.3.4.5 Transitivity<br />

In <strong>Arabic</strong> and <strong>English</strong>, we can classify verbs as either intransitive, transitive or ditransi-<br />

tive.<br />

<br />

(1) Intransitive ( <br />

āllāazim)<br />

An intransitive verb is unable <strong>to</strong> take an object; it exists alone. Intransitive verbs<br />

include ysbh. , swim, y֓akl , ate, māt ,die, nā֓ym sleep.<br />

Some verbs can be both transitive and intransitive:<br />

<br />

֓anā fzt , I won. (Intransitive)<br />

֓anā fzt bālˇgā֓yzh āl֓awlā , I won the first prize. (Transitive)<br />

(2) Transitive ( ālmt֒dy)<br />

A transitive verb takes one or more objects (an object, or undergoer of the verb).<br />

For example; ֓iˇstrā ֒mr ktāb , Omar bought a book. <br />

֓aktb rsālh , I write a letter. ֓abw bkr wd. ֒ ālktāb<br />

fwq mktbh, Abu Bakr put the book on his disk.<br />

20


2.3. PART OF SPEECH INVENTORY OF THE ARABIC LANGUAGE<br />

A transitive verb is incomplete without a direct object. For example;<br />

Incomplete: hāld yh. ml , Khalid holds.<br />

˘<br />

Complete: <br />

<br />

hāld yh. ml tlāth ktb<br />

˘ ¯ ¯<br />

w h. āswbh ālˇsh ˘ s.y w zhwr , Khalid holds three books, his lap<strong>to</strong>p and flowers.<br />

(3) Ditransitive<br />

A ditransitive verb takes two objects. This can be through an indirect object con-<br />

struction, ֒s. ām ֓a֒t.ā ktāb l֒mr , Essam gave a book <strong>to</strong> Omar<br />

Or double object construction, ֒s. ām ֓a֒t.ā ֒mr ktāb , Essam<br />

gave Omar a book.<br />

2.3.5 Demonstratives<br />

The demonstrative pronouns in <strong>Arabic</strong> include reference <strong>for</strong> the near hd ¯ ā “this”, the<br />

far d ¯ lk “that” and <strong>for</strong> the inbetween d ¯ āk, which has no equivelent in <strong>English</strong>.<br />

2.3.6 Others<br />

This class includes all other types of words not included in the previous categories. It<br />

includes, <strong>for</strong> example, the prepositions, such as mn “from”, ֒lā “on”, fy “in”,<br />

֓ilā “till”. It also includes conjunctions, such as w “and”; determiners such as <br />

āl “the”; relative pronouns, such as āld¯ y “who (masculine)” and ālty “who<br />

(feminine)” and particles, such as ln “Will not” (Khan 2007).<br />

Table 2.22: Particle ‘Lan’<br />

<strong>Arabic</strong> <strong>Arabic</strong> Meaning <strong>English</strong> Translation<br />

<br />

<br />

<br />

<br />

lan yad ¯ haba will not ‘he’ go he will not go<br />

The particle ln is used <strong>to</strong> negate future events. It is used within the imperfect tense<br />

(Versteegh 2001). An example is shown in Table 2.22.<br />

21


2.4 Sentence types in <strong>Arabic</strong><br />

2.4. SENTENCE TYPES IN ARABIC<br />

A sentence is a string of words that expresses a semantically complete message. There are<br />

two main sentence types in <strong>Arabic</strong>: verbal sentences and equational or copula sentences.<br />

The classification of clauses in the <strong>Arabic</strong> language is illustrated in Figure 2.2.<br />

Figure 2.2: A classification of clauses in the <strong>Arabic</strong> language<br />

2.4.1 Equational sentences<br />

Equational sentences contain two parts (subject and predicate). In <strong>Arabic</strong> the copula verb<br />

‘<strong>to</strong> be’ is not used in the present tense. Both the subject and the predicate have <strong>to</strong> be<br />

in the nominative case if they are not preceded by ֓in ”indeed“ or kān ”was“<br />

(Abn-Aqeal 2007). In Table 2.23 the predicate in the first example is realized as a noun<br />

phrase, in the second example as an adjective, and in the third example as a preposition.<br />

The subject and predicate can serve as arguments <strong>for</strong> other verbs as will be shown in the<br />

following subsections.<br />

22


2.4.1.1 Verb and noun<br />

2.4. SENTENCE TYPES IN ARABIC<br />

Table 2.23: Nominal sentence<br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

zaydun t.aālibun Zaid is (a) student.<br />

zaydun krym Zaid is generous.<br />

<br />

<br />

<br />

<br />

zaydun fy ālbyt Zaid is in the house.<br />

Verb and noun. such as: s֓al ֓ah. md Ahmad asked.<br />

2.4.1.2 Verb and two nouns<br />

It only occurs in one construct <br />

kān w ֓ah ˘ wāthā kan and its sisters. The<br />

verb kna kān and its sister verbs mark the time or duration of actions, states, and<br />

events. Sentences that use these verbs are considered <strong>to</strong> be a type of nominal sentence<br />

according <strong>to</strong> <strong>Arabic</strong> grammar, not a type of verbal sentence. The word order resembles<br />

Verb Subject Object when there is no other verb in the sentence,<br />

They are kān was, s. ār <strong>to</strong> become, ֓as. bh. <strong>to</strong> become, ֓ad. h. ā <strong>to</strong><br />

become, ֓amsā <strong>to</strong> become, z . l <strong>to</strong> remain, bāt <strong>to</strong> be, lys it is not.<br />

<strong>English</strong> can not express the punctual and telic aspectual differences encoded within the<br />

<strong>Arabic</strong> examples just mentioned.<br />

Table 2.24: Kan and its sisters <br />

kān w ֓ah ˘ wthā<br />

<strong>Arabic</strong> <strong>English</strong> Translation<br />

<br />

<br />

kān āl֓aklu ld¯ ydāan The food was delicious.<br />

¯<br />

With these verbs the subject is in the nominative case and the predicate is in the accusative<br />

case, an example is shown in Table 2.24.<br />

23


2.4.1.3 Verb and three nouns<br />

2.4. SENTENCE TYPES IN ARABIC<br />

It only occurs in one construct <br />

<br />

z . n w ֓ah ˘ wāthā anna and its sisters. Both the<br />

subject and the predicate of z . n and its sisters are in an equational clause.<br />

They are z . n <strong>to</strong> guess or <strong>to</strong> think, h. sb <strong>to</strong> consider, ֒lm <strong>to</strong> learn (about), <br />

ˇg֒l <strong>to</strong> make, s. yr <strong>to</strong> make. They usually come be<strong>for</strong>e the nominal sentences ‘subject<br />

and a predicate’, an example is in Table 2.25. <strong>English</strong> can not express the semantic and<br />

causative ‘make’ differences encoded within the <strong>Arabic</strong> examples just mentioned.<br />

Table 2.25: zanna and its sisters <br />

z . n w ֓ah ˘ wthā<br />

<strong>Arabic</strong> z . n ֓ah. md ālqyādh shlt.<br />

<strong>English</strong> Translation Ahmad thinks leadership is easy.<br />

2.4.1.4 Verb and four nouns<br />

This clausal type is used in classical <strong>Arabic</strong>, but not in MSA. It is mentioned here only<br />

<strong>for</strong> the sake of completeness. It has one type in ֓a֒lm w ֓ary in<strong>for</strong>med and<br />

showed. They are ֓a֒lm in<strong>for</strong>med, ֓ary showed, ֓anb֓a <strong>to</strong>ld. nb֓a <strong>to</strong>ld,<br />

<br />

֓ahbr <strong>to</strong>ld,<br />

˘ <br />

hbr <strong>to</strong>ld. <br />

˘ h. dt talked. <br />

¯ ֓a֒lm when it has hamza above it<br />

can has four nouns (ibn Abd Allah Ibn Malik 1984), such as in Table 2.26.<br />

<br />

<br />

<br />

֓a֒lmtu ֒mrāan h˘ aālidāan tlmydāan ¯<br />

Table 2.26: In<strong>for</strong>med and showed<br />

<strong>Arabic</strong><br />

<strong>English</strong> Translation I in<strong>for</strong>med Omer that Khalid (is) a student.<br />

2.4.2 The Verbal Sentence<br />

The verbal sentence is the second type of sentence in <strong>Arabic</strong>. It contains a verb and one<br />

or more participants depending on the verb transitivity. The default word order in <strong>Arabic</strong><br />

is <strong>to</strong> begin with a verb: verb(V), subject(S) and object(O), such as in Table 2.27<br />

24


Table 2.27: verb(V), subject(S) and object(O)<br />

<strong>Arabic</strong> <br />

Gloss drank Khalid the-milk<br />

<strong>English</strong> Translation Khalid drank the milk.<br />

ˇsrb h ˘ āld āllbn<br />

2.5. SUMMARY<br />

Another possible word order is <strong>to</strong> start with the subject, i.e. SVO, such as in Table 2.28<br />

Table 2.28: subject(S), verb(V) and object(O)<br />

<strong>Arabic</strong> hāld ˇsrb āllbn<br />

˘<br />

Gloss Khalid drank the-milk<br />

<strong>English</strong> Translation Khalid drank the milk.<br />

Another possible, but more restricted, word order is VOS, such as in Table 2.29<br />

Table 2.29: verb(V), object(O) and subject(S)<br />

<strong>Arabic</strong> <br />

ˇsrbh h ˘ āld<br />

Gloss drank it Khalid<br />

<strong>English</strong> Translation Khalid drank it<br />

The OVS word order is perfectly acceptable in Classical <strong>Arabic</strong> but no longer occurs in<br />

MSA (Attia 2008).<br />

2.4.3 Clause<br />

A clause in <strong>Arabic</strong> may be simple or complex. A complex clause is <strong>for</strong>med by conjoining<br />

two simple clauses by subordinating conjunction, such as in Table 2.30.<br />

Table 2.30: Two simple clauses by subordinating conjunction<br />

<strong>Arabic</strong> <br />

<strong>English</strong> Khalid drank the milk be<strong>for</strong>e he went <strong>to</strong> school.<br />

2.5 Summary<br />

ˇsrb hāld āllbn qbl ֓an ydhb ֓ilā ālmdrsh<br />

˘ ¯<br />

We have shown that <strong>Arabic</strong> is a language of increasing importance in the modern world.<br />

As a language it is fundamentally different from European languages and has many<br />

25


2.5. SUMMARY<br />

unique features. Considerations such as its derivational structure, its distinction of gender<br />

<strong>for</strong>ms, and its numerous sentence orders present a challenge <strong>for</strong> au<strong>to</strong>matic <strong>machine</strong> trans-<br />

lation. We discussed an inven<strong>to</strong>ry of the language including examples. In order <strong>to</strong> deal<br />

with these challenges it is important that a <strong>machine</strong> transla<strong>to</strong>r understands the structure<br />

of the source language. We aim <strong>to</strong> use this knowledge <strong>to</strong> build the UniArab transla<strong>to</strong>r. In<br />

order <strong>to</strong> provide a standards-based, cross-plat<strong>for</strong>m solution, we will make use of XML<br />

<strong>for</strong> data representation and build the system using Java.<br />

26


3<br />

Role and Reference Grammar (RRG)<br />

This chapter is based largely on material taken from (Van Valin and LaPolla 1997), which<br />

explains the theory behind Role and Reference Grammar. Role and Reference Grammar<br />

(RRG) is a model of grammar developed by William Foley and Robert Van Valin, Jr. in<br />

the 1980s, which incorporates many of the points of view of current functional grammar<br />

theories. We have chosen RRG because it has been shown <strong>to</strong> be flexable and universal<br />

in the creation of parsers <strong>for</strong> <strong>English</strong> (Van Valin and LaPolla 1997). We wish <strong>to</strong> apply<br />

this success <strong>to</strong> MT in order <strong>to</strong> discover its importance and demonstate its viability with<br />

accuracy of translation.<br />

In RRG, the description of a sentence in a particular language is <strong>for</strong>mulated in terms<br />

of its logical structure and communicative functions, and the grammatical procedures<br />

that are available in the language <strong>for</strong> the expression of these meanings. The main fea-<br />

tures of RRG are the use of lexical decomposition, based upon predicate semantics, an<br />

analysis of clause structure and the use of a set of thematic roles organized in<strong>to</strong> a hier-<br />

27


3.1. ROLE AND REFERENCE GRAMMAR LINGUISTIC MODEL<br />

archy in which the highest-ranking roles are ‘Ac<strong>to</strong>r’ (<strong>for</strong> the most active participant) and<br />

‘Undergoer’ (Van Valin 1993). RRG takes language <strong>to</strong> be a system of communicative<br />

social action, and accordingly, analysing the communicative functions of grammatical<br />

structures plays a vital role in grammatical description and theory from this perspective.<br />

Role and Reference Grammar posits algorithms <strong>to</strong> go from syntax <strong>to</strong> semantics and se-<br />

mantics <strong>to</strong> syntax. The main contribution is the use of parsing templates and the notion<br />

of the core. A core consists of a predicate (generally a verb) and (normally) a number of<br />

arguments. It must have a predicate. Everything else is built around one or more cores.<br />

Simple sentences contain a single core; complex sentences contain several cores. The<br />

fact that RRG focuses on cores, means that the semantics is relatively easy <strong>to</strong> extract<br />

from a parse tree. You just have <strong>to</strong> look <strong>for</strong> the (PRED), and (ARG) branches of the core<br />

<strong>to</strong> obtain the predicate (PRED) and the arguments (ARG). Who did what <strong>to</strong> whom will<br />

depend either on the ordering of the ARG branches (in the case of <strong>English</strong>), or on their<br />

cases, or both.<br />

3.1 Role and Reference Grammar linguistic model<br />

Role and Reference Grammar is a model which presupposes a direct mapping between<br />

the semantic representation of a sentence and its syntactic representation; there are no<br />

intermediate levels of representation (Van Valin 2007). The general view of RRG is<br />

presented in Figure 3.1.<br />

RRG creates a relationship between syntax and semantics and can account <strong>for</strong> how se-<br />

mantic representations are mapped in<strong>to</strong> syntactic representations. RRG also accounts<br />

<strong>for</strong> the very different process of mapping syntactic representations <strong>to</strong> semantic repre-<br />

sentations. Be<strong>for</strong>e developing the linking algorithms that govern these mappings, it is<br />

necessary <strong>to</strong> first introduce a general principle constraining these algorithms (Van Valin<br />

and LaPolla 1997). Of the two directions, syntactic representation <strong>to</strong> semantic represen-<br />

28


3.1. ROLE AND REFERENCE GRAMMAR LINGUISTIC MODEL<br />

Figure 3.1: Layout of Role and Reference Grammar<br />

tation is the more difficult since it involves interpreting the morphosyntactic <strong>for</strong>m of a<br />

sentence and inferring the semantic functions of the sentence from it. Accordingly, the<br />

linking rules must refer <strong>to</strong> the morphosyntactic features of the sentence. The question<br />

remains why a grammar should deal with linking from syntax <strong>to</strong> semantics at all. Simply<br />

specifying the possible realizations of a particular semantic representation should suffice.<br />

They refute this using the argument that theories of linguistic structure should be directly<br />

relatable <strong>to</strong> testable theories of language production and comprehension (Van Valin and<br />

LaPolla 1997). One of our hypotheses it that RRG is very suitable <strong>for</strong> <strong>machine</strong> translation<br />

of <strong>Arabic</strong> via an interlingua bridge. It is a mono strata-theory, positing only one level of<br />

syntactic representation, the actual <strong>for</strong>m of the sentence. The RRG Linking algorithm can<br />

work in the both directions from syntactic representation <strong>to</strong> semantic representation or<br />

vice versa. UniArab will fulfil this role. In RRG, semantic decomposition of predicates<br />

and their semantic argument structures are represented as logical structures. The lexicon<br />

in RRG takes the position that lexical entries <strong>for</strong> verbs should contain unique in<strong>for</strong>mation<br />

only, with as much in<strong>for</strong>mation as possible derived from general lexical rules. We briefly<br />

illustrate the active voice linking in (3.1) where (3.1a) is a subject, verb, object (SVO)<br />

clause and (3.1b) is the verb, subject, object (VSO) equivalent.<br />

29


(3.1)<br />

3.1. ROLE AND REFERENCE GRAMMAR LINGUISTIC MODEL<br />

a. <br />

zyd r֓ay ֒mr Zaid saw Omar<br />

<br />

zyd MsgNOM see.past ֒mr - MsgNOM<br />

b. <br />

r֓aā zyd ֒mr Saw Zaid Omar<br />

see.past <br />

zyd MsgNOM ֒mr MsgNOM<br />

<strong>Arabic</strong> allows variation in clause word order. The active-voice linkings, those in the<br />

sentence in (3.1a)-(3.1b), are illustrated in figure 3.2.<br />

Figure 3.2: <strong>Arabic</strong> sentence types; verb subject object or subject verb object (<strong>for</strong> gloss<br />

please see example 3.1)<br />

The first (leftmost) argument of ‘see’ in the logical structure is the ac<strong>to</strong>r, the second the<br />

undergoer, following the RRG Ac<strong>to</strong>r-Undergoer Hierarchy. Since <strong>Arabic</strong> is an accusative<br />

language and r֓aā ‘see’ is a regular verb, the ac<strong>to</strong>r will receive nominative case and<br />

the undergoer accusative case. On the other hand, in <strong>Arabic</strong> we can start the sentence<br />

with verb first as shown in the example in (3.1b). The only changes in the clause are the<br />

<strong>for</strong>m of the verb and the <strong>for</strong>m of the ac<strong>to</strong>r NP; the arrangement of the arguments has not<br />

changed in the logical structure.<br />

30


3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />

3.2 Formal representation of layered structure of the clause<br />

Having introduced the fundamental units of clause structure, we need <strong>to</strong> have an explicit<br />

representation of them. We will present the non-universal features of the layered structure<br />

of the clause (LSC).<br />

3.2.1 Representing the universal aspects of the layered structure of the clause<br />

Figure 3.3: Formal representation of the layered structure of the clause<br />

To represent the nucleus, core, periphery and clause, we will use a type of tree diagram<br />

which differs substantially from the constituent-structure trees discussed earlier. The ab-<br />

stract schema of the layered structure of the clause can be represented as in Figure 3.3.<br />

The clause consists of the core with its arguments, and then the nucleus, which subsumes<br />

the predicate. At the very bot<strong>to</strong>m are the actual syntactic categories which realize these<br />

units. Notice that there is no VP in the tree, <strong>for</strong> it is not a concept that plays a direct role<br />

in this conception of clause structure. The periphery is represented on the margin, and<br />

the arrow there indicates that it is an adjunct; that is, it is an optional modifier of the core<br />

(Van Valin and LaPolla 1997).<br />

31


3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />

Constituent structure representations of sentences in free-word-order and head-marking<br />

languages are unrevealing, because they fail <strong>to</strong> capture what is common <strong>to</strong> clauses in the<br />

different language types. The layered approach <strong>to</strong> clause structure does not suffer <strong>for</strong>m<br />

the same shortcomings. For a language like <strong>Arabic</strong>, the line linking the head nouns with<br />

their determiners will be discussed in the section on noun phrase structure 3.3 below.<br />

3.2.2 Layered structure of the clause (LSC)<br />

In the simplex <strong>English</strong> sentences, James ate the sandwich in the class, James ate the sand-<br />

wich is the core (with ate the nucleus and James and the sandwich the core arguments);<br />

and in the class is in the periphery. The first division in the clause is between a core and<br />

a periphery, and within the core a distinction is made between the nucleus (containing<br />

the predicating element, normally a verb) and its core arguments (NPs and PPs which are<br />

arguments of the predicate in the nucleus). Core arguments are those arguments which<br />

are part of the semantic representation of the verb (Van Valin and LaPolla 1997). The<br />

relationships between the semantic and syntacts units are summarized in Table 3.1<br />

Table 3.1: Relationships between the semantic and syntactic units<br />

Semantic element (s) Syntactic unit<br />

Predicate Nucleus<br />

Argument in semantic representation of predicate Core argument<br />

Non arguments Periphery<br />

Predicate + arguments Core<br />

Predicate + arguments + Non- arguments Clause (= core + periphery)<br />

3.2.3 Non-universal aspects of the layered structure of the clause<br />

An initial phrase cannot be in the precore slot, because there is a WH–word (<strong>for</strong> example,<br />

<strong>for</strong> <strong>English</strong> who, where, what etc.) in the precore slot in the sentence; hence the position<br />

of the initial phrase is distinct from the precore slot. This position, which will be termed<br />

32


3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />

the left-detached position, is outside of the clause but within the sentence. An example<br />

from <strong>English</strong> with all of these elements is given in Figure 3.4.<br />

Figure 3.4: <strong>English</strong> Sentence with precore slot and left-detached position<br />

The opera<strong>to</strong>r projection in Figure 3.5 may be combined with what we will call the ‘con-<br />

stituent projection’ in Figure3.8 <strong>to</strong> yield a more complete picture of the clause, as in<br />

Figure 3.6; the periphery is omitted, since it can occur in a number of different positions.<br />

What we have here is two projections of the clause, one of which contains the predicate<br />

and its arguments (the constituent projection), while the other contains the opera<strong>to</strong>rs (the<br />

opera<strong>to</strong>r projection) (Van Valin and LaPolla 1997).<br />

They are both linked through the predicate, which may be a verb, NP, AdjP or PP, be-<br />

cause it is the one crucial element common <strong>to</strong> both. The opera<strong>to</strong>r projection mirrors the<br />

constituent projection in terms of layering; hence ‘nucleus’ in the opera<strong>to</strong>r projection<br />

corresponds <strong>to</strong> ‘nucleus’ in the constituent projection, and so on. The multiple nucleus,<br />

core and clause nodes represent each of the individual opera<strong>to</strong>rs at that level; the num-<br />

ber of multiple nodes corresponds <strong>to</strong> the number of opera<strong>to</strong>rs at that level present in the<br />

sentence. If there are no opera<strong>to</strong>rs at a given level, a bare node will be given. As the<br />

‘bare skele<strong>to</strong>n’ of the layered structure of the clause on the right makes clear, the two<br />

33


3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />

Figure 3.5: Opera<strong>to</strong>r projection in LSC<br />

projections are indeed mirror images of each other, and this will become particularly im-<br />

portant in representing the structure of complex sentences. A more complete picture of<br />

the clause in <strong>Arabic</strong>, is given in Figure 3.7. Please note that the sentences in Figure 3.7<br />

should be read from right <strong>to</strong> left.<br />

One of the major motivations <strong>for</strong> this scheme is that opera<strong>to</strong>rs virtually always occur<br />

in the same linear sequence with respect <strong>to</strong> the predicating element. When an ordering<br />

relationship can be established among opera<strong>to</strong>rs, they are always ordered in the same<br />

way cross-linguistically, such that their linear order reflects their scope. This is a very<br />

significant point. Opera<strong>to</strong>rs are ordered with respect <strong>to</strong> each other in terms of the scope<br />

principle discussed earlier, with the verb or other predicating element in the nucleus<br />

34


3.2. FORMAL REPRESENTATION OF LAYERED STRUCTURE OF THE CLAUSE<br />

Figure 3.6: LSC with constituent and opera<strong>to</strong>r projections<br />

as the anchorpoint, and thus the ordering restrictions on the morphemes expressing the<br />

opera<strong>to</strong>rs are universal. For a technical discussion of the meaning of the various opera<strong>to</strong>rs<br />

in the LSC (Van Valin and LaPolla 1997).<br />

35


3.3 Noun phrase structure<br />

Figure 3.7: <strong>Arabic</strong> LSC<br />

3.3. NOUN PHRASE STRUCTURE<br />

Noun phrases refer, while clauses predicate, and yet there are striking parallels between<br />

the structure of the two which have long been noted. For example, both can be said <strong>to</strong><br />

have arguments; while this is obvious in the case of verbs in clauses, it is also clear that<br />

relational nouns like father, friend and sister can take what could be analyzed as argu-<br />

ments, e.g. father of James / James’s father, a friend of Khalid / Khalid’s friend and the<br />

other sister of Sarah / Sarah’s other sister. Clauses sometimes have clauses within them<br />

as arguments, as in Zaid believed that pollution isn’t a problem, and the same is true of<br />

NPs, e.g. Zaid’s belief that pollution isn’t a problem. Given these parallels, it would be<br />

36


3.3. NOUN PHRASE STRUCTURE<br />

appropriate <strong>to</strong> say that at least some nouns take arguments analogous <strong>to</strong> verbs taking ar-<br />

guments, and there<strong>for</strong>e it is also appropriate <strong>to</strong> posit a layered structure <strong>for</strong> NPs (LSNP)<br />

similar but not identical <strong>to</strong> that <strong>for</strong> clauses. Relating <strong>to</strong> the fundamental functional dif-<br />

ference between verbs and nouns, is that the nominal nucleus NUCN dominates a REF<br />

(<strong>for</strong> ‘reference’) node, indicating that the unit in question refers, in contrast <strong>to</strong> the PRED<br />

(<strong>for</strong> ‘predicate’) node which appears in the nucleus of a clause. The word ‘of’ is non-<br />

predicative in this construction, because it does not license the argument; moreover, it is<br />

semantically empty, as it can occur with argument NPs having many different semantic<br />

functions (Van Valin and LaPolla 1997). Consider the range of semantic functions which<br />

the of-NPs have in the following examples.<br />

(2.2)<br />

a. the attack of the killer bees Agent<br />

b. the gift of a new car Theme<br />

c. the destruction of the city Patient<br />

d. the leg of the table Possessor<br />

e. the resupplying of the troops (with ammunition) Recipient<br />

(Nunes 1993) shows that NPs have only a single direct core argument, and it is marked<br />

by of. This is consistent with the point made above that of does not mark any particular<br />

semantic relation, in much the same way that the direct grammatical functions, subject<br />

and direct object, are not restricted <strong>to</strong> particular semantic functions. Accordingly, the of-<br />

marked NP counts as the single direct syntactic argument of the nominal nucleus in the<br />

core of the NP. Predicative adpositions, by contrast, have well-defined semantic content,<br />

like other predicates.<br />

37


3.4. LEXICAL REPRESENTATIONS FOR VERBS<br />

An important feature of the layered structure of the clause is the differential treatment<br />

given <strong>to</strong> opera<strong>to</strong>rs like tense, aspect and illocutionary <strong>for</strong>ce, and the same contrast is a<br />

vital part of the layered structure of the noun phrase. NP opera<strong>to</strong>rs include determin-<br />

ers (articles, demonstratives, deictics), quantifiers, negation and adjectival and nominal<br />

modifiers (Van Valin and LaPolla 1997).<br />

3.3.1 NP headed<br />

Pronouns can be classified in<strong>to</strong> a number of subtypes: personal pronouns, including pos-<br />

sessive pronouns (PRO), e.g. I liked her book; relative pronouns (PRO REL), e.g. the<br />

book which I bought; demonstrative pronouns (PRO DEM), e.g. That pleased Mary; WH-<br />

pronouns (PRO wh ), e.g. who did Fred see?; and expletive pronouns (PRO EXP ), e.g. it<br />

rained.<br />

3.4 Lexical representations <strong>for</strong> verbs<br />

These distinctions among the four basic Aktionsart types may be represented <strong>for</strong>mally<br />

as in Table 3.2. The term Aktionsart refers <strong>to</strong> the means of a capturing the distinctions<br />

between basic states of affairs, or events, of individual verbs. These representations<br />

are called logical structures. Following the conventions of <strong>for</strong>mal semantics, constants<br />

(which are normally predicates) are presented in boldface followed by a prime, whereas<br />

variable elements are presented in normal typeface. The elements in boldface and prime<br />

are part of the vocabulary of the semantic metalanguage used in the decomposition; they<br />

are not words from any particular human language.<br />

Table 3.2: Lexical representations <strong>for</strong> the basic Aktionsart classes<br />

Verb class Logical structure<br />

State predicate’ (x) or (x, y)<br />

Activity do’ (x, [predicate’(x) or (x, y)])<br />

Achievement INGR predicate’ (x) or (x, y)<br />

Accomplishment BECOME predicate’ (x) or (x, y)<br />

38


3.4. LEXICAL REPRESENTATIONS FOR VERBS<br />

Hence the same representations are used <strong>for</strong> all languages (where appropriate), e.g. the<br />

logical structure <strong>for</strong> <strong>Arabic</strong> and <strong>English</strong> ‘die’ (intransitive) would be BECOME dead’(x).<br />

The elements in all capitals, INGR and BECOME, are modifiers of the predicate in the<br />

logical structure; their function will be explained below. The variables are filled by lexi-<br />

cal items from the language being analysed; <strong>for</strong> example, the <strong>English</strong> sentence The dog<br />

died would have the logical structure BECOME dead’ (dog), while the correspond-<br />

ing <strong>Arabic</strong> sentence ālklb māt . “The dog died” would have the logical<br />

structure BECOME dead’ (ālklb) should be this sentence start with the verb<br />

māt ālklb. States are represented as simple predicates, e.g. broken’ (x),<br />

be-at’(x,y), and see’(x,y). There is no special <strong>for</strong>mal indica<strong>to</strong>r that a predicate<br />

is stative.<br />

The logical structure, be’(x,[pred’] ) is <strong>for</strong> identificational constructions, e.g.<br />

Omar is a student, and attributive constructions, such as The watch is broken require a<br />

different logical structure. In this logical structure the second argument is the attribute or<br />

identificational NP, e.g. be’ (Ayesha, [tall’]), be’(Omar,[student’]).<br />

The primary criteria <strong>for</strong> distinguishing between attributive constructions and result state<br />

constructions is whether the attribute is inherent, e.g. Coal is black (be’ (coal,<br />

[black’])), or whether it is the result of some kind of process, e.g. The fire black-<br />

ened the wood (... BECOME black’(wood)) (Van Valin and LaPolla 1997).<br />

3.4.1 Agents, effec<strong>to</strong>rs, instruments and <strong>for</strong>ces<br />

In ‘Zaid is cutting the bread with a knife’, an EFFECTOR, typically human, manipulates<br />

a knife and brings it in<strong>to</strong> contact with the bread, whereupon the interaction of the knife<br />

with the bread brings about the result that the bread becomes cut. This may be represented<br />

as in (3.3). (The main CLAUSE in the logical structure is italicized.)<br />

39


(3.3)<br />

[do’(Zaid, [use’(Zaid, knife)])] CAUSE<br />

[[do’(knife, [cut’(knife, bread)])]CAUSE<br />

[BECOME cut’(bread)]]<br />

3.4. LEXICAL REPRESENTATIONS FOR VERBS<br />

The causing event in (3.3) is complex, and the INSTRUMENT argument appears three<br />

times in the logical structure: as the IMPLEMENT of use’ and as the EFFECTOR of<br />

do’(x,[cut’(x,y)]). It is possible, if the first argument of the highest do’ were left unspec-<br />

ified, <strong>to</strong> say The knife cut the bread, with the INSTRUMENT knife as ac<strong>to</strong>r.<br />

3.4.2 change of state verb<br />

A change of state verb may be punctual in one language and non-punctual in another. A<br />

good example of this cross-linguistic variation is <strong>English</strong> ‘die’ and <strong>Arabic</strong>. Both have the<br />

result that the subject is dead. Accordingly, it is possible <strong>to</strong> say in <strong>English</strong><br />

He died quickly , He died slowly and He died suddenly. In <strong>Arabic</strong> we can say as (3.4),<br />

also, it is possible <strong>to</strong> say in <strong>Arabic</strong> Hence the logical structure <strong>for</strong> <strong>English</strong> and <strong>Arabic</strong><br />

‘die’ would be [ BECOME dead’(x)], an accomplishment.<br />

(3.4)<br />

(a) māt sry֒ā<br />

(b) <br />

He died quickly.<br />

māt bbt.֓y<br />

He died slowly.<br />

(c) māt fˇgāh<br />

He died suddenly.<br />

40


3.5. WHY WE USE RRG AS THE LINGUISTIC MODEL<br />

3.5 Why we use RRG as the linguistic model<br />

A reader might ask the question, why use Role and Reference Grammar as the basis of<br />

our <strong>machine</strong> transla<strong>to</strong>r? More than one reason prompts us <strong>to</strong> choose RRG. The most<br />

important one is that RRG is a new linguistic method and there is no research using the<br />

Role and Reference Grammar linguistic model as a basis <strong>for</strong> <strong>machine</strong> translation until<br />

now. We would like <strong>to</strong> discover this area using the RRG rules and techniques.<br />

What distinguishes the RRG conception is the conviction that grammatical structure can<br />

only be unders<strong>to</strong>od with reference <strong>to</strong> its semantic and communicative functions. Syntax<br />

is not au<strong>to</strong>nomous. In terms of the abstract paradigmatic and syntagmatic relations that<br />

define a structural system, RRG is concerned not only with relations of co-occurrence and<br />

combination in strictly <strong>for</strong>mal terms but also with semantic and pragmatic co-occurrence<br />

and combina<strong>to</strong>ry relations. According <strong>to</strong> Van Valin and LaPolla (1997) RRG takes lan-<br />

guage <strong>to</strong> be a system of communicative social action, and accordingly, analysing the<br />

communicative functions of grammatical structures plays a vital role in grammatical de-<br />

scription and theory from this perspective language is a system, and grammar is a system<br />

in the traditional structuralist sense.<br />

We claim that RRG is very suitable <strong>for</strong> <strong>machine</strong> translation of <strong>Arabic</strong> via an Interlingua<br />

bridge implementation model. RRG is a mono strata-theory, positing only one level of<br />

syntactic representation, the actual <strong>for</strong>m of the sentence and its linking algorithm can<br />

work in both directions from syntactic representation <strong>to</strong> semantic representation, or vice<br />

versa. In RRG, semantic decomposition of predicates and their semantic argument struc-<br />

tures are represented as logical structures. The lexicon in RRG takes the position that<br />

lexical entries <strong>for</strong> verbs should contain unique in<strong>for</strong>mation only, with as much in<strong>for</strong>-<br />

mation as possible derived from general lexical rules. The main features of RRG are<br />

the use of lexical decomposition, based upon predicate semantics, an analysis of clause<br />

41


3.5. WHY WE USE RRG AS THE LINGUISTIC MODEL<br />

structure and the use of a set of thematic roles organized in<strong>to</strong> a hierarchy in which the<br />

highest-ranking roles are ‘Ac<strong>to</strong>r’ (<strong>for</strong> the most active participant) and ‘Undergoer’.<br />

3.5.1 RRG representing the universal aspects of the layered structure of the clause<br />

A sentence in <strong>English</strong> is NP VP, but this is not valid in <strong>Arabic</strong> sentences. There is no<br />

copula (verb <strong>to</strong> be) in the <strong>Arabic</strong> language, this means some types of sentence in <strong>Arabic</strong><br />

may not contain any verb (nominal sentence). For example <br />

h ˘ āld t.ālb Khalid<br />

(is) a student; there is no ‘is’ in this sentence in <strong>Arabic</strong>. In RRG there is no VP in<br />

sentence structure. The abstract schema of the RRG layered structure of the clause can<br />

be represented as in figure3.8.<br />

Figure 3.8: The RRG representing the universal aspects of the layered structure of the<br />

clause (Van Valin and LaPolla 1997)<br />

The clause consists of the core with its arguments, and then the nucleus, which subsumes<br />

the predicate. At the very bot<strong>to</strong>m are the actual syntactic categories which realize these<br />

units. Notice that there is no VP in the tree, <strong>for</strong> it is not a concept that plays a direct role<br />

in this conception of clause structure in RRG.<br />

42


3.5. WHY WE USE RRG AS THE LINGUISTIC MODEL<br />

3.5.2 The lexical representation of verbs and their arguments<br />

The approach <strong>to</strong> the depiction of the lexical meaning of verbs which we will adopt is<br />

lexical decomposition, which involves paraphrasing verbs in terms of primitive elements<br />

in a well-defined semantic metalanguage. As a simple example of the mechanism of<br />

lexical decomposition, ‘kill’ can be paraphrased in<strong>to</strong> something like ‘cause <strong>to</strong> die’, and<br />

then ‘die’ can be broken down in<strong>to</strong> ‘become dead’, Thus the lexical representation<br />

of ‘kill’ would be something like ‘x causes [y become dead]’ (Van Valin and<br />

LaPolla 1997). A system of lexical representation should include a way of expressing the<br />

fact that the subject of ‘die’ and the object of ‘kill’ are the same argument semantically.<br />

There are many verbs pairs like this, and in many cases the relationship between them<br />

is overt. Examples include ‘sink’, as in ‘the boat sank’ and ‘the <strong>to</strong>rpedo sank the boat’,<br />

where boat is the subject of intransitive ‘sink’ and the object of transitive ‘sink’ (Van Valin<br />

and LaPolla 1997). Another example is the predicate ‘cool’, which can take three <strong>for</strong>ms,<br />

one adjectival and two verbal: ‘The soup is cool’, ‘the soup is cooling’ and ‘the wind<br />

cooled the soup’. Thus, there seems <strong>to</strong> be a pattern of intransitive verbs whose subjects<br />

are identical <strong>to</strong> the objects of their transitive counterparts. There are cases, however,<br />

when the intransitive-transitive alternates do not have the same lexical <strong>for</strong>m, as in ‘die’<br />

and ‘kill’, or ‘receive’ and ‘give’. An adequate theory of lexical representation should<br />

be able <strong>to</strong> capture these relationships, and lexical decomposition provides a promising<br />

method <strong>for</strong> doing it. There are many theories of lexical decomposition, which differ<br />

in terms of how fine-grained they are. It is necessary <strong>to</strong> find the right level of detail,<br />

one which allows the expression of certain important generalizations but which also has<br />

representations whose differences have morphosyntactic consequences. Thus, arriving<br />

at a decompositional system is a compromise between the demands of semantics (make<br />

all necessary distinctions relevant <strong>to</strong> meaning) and those of syntax (make syntactically<br />

43


3.6. SUMMARY<br />

relevant distinctions that permit the expression of significant generalizations) (Van Valin<br />

and LaPolla 1997).<br />

3.6 Summary<br />

RRG describes mainly a sentence of a specific language in terms of:<br />

1) logical structure;<br />

2) grammatical procedures.<br />

We use RRG <strong>to</strong> model <strong>Arabic</strong>, because there are certain cases where the standard NP VP<br />

categorisation does not apply due <strong>to</strong> the absence of a copula verb in the language. Since<br />

RRG does not structure sentences based around a VP, it is more suited <strong>to</strong> representing<br />

such sentences.<br />

The main features of RRG are the use of lexical decomposition, based upon predicate<br />

semantics. The RRG model creates a relationship between syntax and semantics and<br />

can account <strong>for</strong> how semantic representations are mapped in<strong>to</strong> syntactic representations.<br />

RRG also accounts <strong>for</strong> the very different process of mapping syntactic representations <strong>to</strong><br />

semantic representations.<br />

The division in the clause is between a core and a periphery The clause consists of the<br />

core with its arguments, and then the nucleus, which subsumes the predicate. The core<br />

arguments are those which are part of the semantic representation of the verb. The pe-<br />

riphery is represented on the margin, and the arrow there indicates that it is an adjunct;<br />

that is, it is an optional modifier of the core.<br />

There are languages in which opera<strong>to</strong>rs occur on both sides of the nucleus; <strong>for</strong> example,<br />

in <strong>Arabic</strong>, the imperfect tense <br />

ālf֒l ālmd. ār֒ marker is a prefix, while the<br />

perfect tense ālf֒l ālmād. y marker is a suffix (Ryding 2007). In such cases<br />

44


3.6. SUMMARY<br />

there will be more complex language-specific linear precedence rules <strong>for</strong> opera<strong>to</strong>rs.<br />

45


4<br />

Machine translation strategies<br />

In this chapter, we introduce background in<strong>for</strong>mation about Machine Translation. We<br />

discuss the computational techniques, basic strategies, linguistic aspects and the gener-<br />

ation problem. Much of the background in<strong>for</strong>mation is summarised from Hutchins and<br />

Somers (1992).<br />

Natural language processing (NLP) can be thought of as a subfield of artificial intelli-<br />

gence. It refers <strong>to</strong> understanding and au<strong>to</strong>matic generation of natural human languages.<br />

Machine translation (MT) is a part of computational linguistics and refers <strong>to</strong> comput-<br />

erised systems that can translate from one natural language <strong>to</strong> another. Hence, MT uses<br />

many ideas, methods and techniques from these related fields and has also built up a body<br />

of techniques which can, in turn, be applied in other areas of computer-based language<br />

processing.<br />

46


4.1. ADVANTAGES OF MACHINE TRANSLATION<br />

Modularity has changed as MT systems have developed. In transfer systems, lexical<br />

and structural transfer were sometimes separated. In many direct translation systems,<br />

analysis, transfer and generation are often mixed <strong>to</strong>gether and were not clearly distinct.<br />

As the area has matured, modularity has become an important aspect of MT systems,<br />

allowing different aspects <strong>to</strong> be developed independently.<br />

4.1 Advantages of <strong>machine</strong> translation<br />

Some of the advantages of <strong>machine</strong> translations are as follows:<br />

• Machine translation is quicker than human translation.<br />

• It ensures consistency. There is no concern that a transla<strong>to</strong>r might take <strong>to</strong>o much<br />

creative license with a translation or <strong>for</strong>get how a particular word was translated in<br />

earlier pages. MT will translate a particular word in the same way. However, the<br />

downside is that will exhibit the same errors over and over again.<br />

• It gives a neutral approach <strong>to</strong> translation without introducing bias, which can happen<br />

with human transla<strong>to</strong>rs.<br />

• Machine translation is considerably cheaper. It is a one time cost; the cost of the<br />

<strong>to</strong>ol and its installation.<br />

4.2 Computational techniques in MT<br />

Computational processing allows <strong>for</strong> the analysis and processing of large amounts of<br />

data. Be<strong>for</strong>e looking at the computational aspects of MT, we introduce some basic con-<br />

cepts. Machine translation can take advantage of one of the basic concepts in computing.<br />

Since data and programs are separate, it is possible <strong>to</strong> build a program that functions with<br />

different types of data. In the case of MT, this means that the algorithms <strong>for</strong> translation,<br />

47


4.2. COMPUTATIONAL TECHNIQUES IN MT<br />

and the data used <strong>for</strong> doing the actual translation can be developed separately. In reality<br />

this is a little simplistic, but there are certain examples of MT systems that operate in a<br />

similar manner <strong>for</strong> different sets of data like dictionaries and grammar rules (Hutchins<br />

and Somers 1992).<br />

4.2.1 System design<br />

As in standard software engineering, recent trends are <strong>to</strong>wards modular and incremental<br />

system design. Whereas previously, systems would be built in a monolithic structure,<br />

with some debugging access in<strong>to</strong> the system, now we build systems up in stages, com-<br />

pletely defining and testing each stage, be<strong>for</strong>e incorporating it in<strong>to</strong> the overall system.<br />

This method has revolutionarised software engineering and enabled much more effective<br />

collaborative design, as well as the integration of other people’s work in any design.<br />

4.2.2 Interactive systems<br />

Interactivity is a key aspect of computer systems. MT systems can take advantage of<br />

interactivity <strong>to</strong> achieve higher quality results. It is possible <strong>for</strong> an MT system <strong>to</strong> ask the<br />

user <strong>to</strong> select from a set of possible solutions. It is also possible <strong>to</strong> extend the lexicon<br />

through user input at the time of translation. The system might flag unfamiliar words,<br />

which the user can then categorise <strong>for</strong> inclusion in the lexicon. However, intereactivity<br />

and relying on user input can have disadvantages. For example, should the user be relied<br />

upon <strong>to</strong> be correct in his input? Is he fully aware of the linguistic properties of the words?<br />

Furthermore, as more user input is required, the benefits of MT over human translation<br />

become less significant.<br />

48


4.2.3 Lexical databases<br />

4.2. COMPUTATIONAL TECHNIQUES IN MT<br />

A key component of any rule-based MT system is its lexical resources; the in<strong>for</strong>mation<br />

associated with individual words. The field of computational lexicography is concerned<br />

with creating and maintaining computerised dictionaries. In practice, rule-based MT<br />

systems can often have different dictionaries, some containing the core entries, and others<br />

containing specialised vocabulary. An MT lexicon is different from a standard dictionary,<br />

and so is typically concentrated on some linguistically homogeneous set of words, e.g.<br />

abstract nouns, intransitive verbs, or the terminology of a specialist field. It is a good<br />

investment <strong>to</strong> develop <strong>to</strong>ols which aid lexicographers <strong>to</strong> expand the lexicon.<br />

4.2.4 Tokens and <strong>to</strong>kenization<br />

The term “<strong>to</strong>ken” refers <strong>to</strong> an abstraction <strong>for</strong> the smallest unit in a text that is considered<br />

when describing the syntax of a language. A process of <strong>to</strong>kenization can be used <strong>to</strong> split<br />

the sentence in<strong>to</strong> word <strong>to</strong>kens. Although the following example is given as XML there<br />

are many ways <strong>to</strong> represent <strong>to</strong>kenized input. The sentence He went <strong>to</strong> the school. could<br />

be <strong>to</strong>kenised as follows:<br />

<br />

He<br />

went<br />

<strong>to</strong><br />

the<br />

school<br />

<br />

49


4.2.5 Syntactic analysis (Parsing)<br />

4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />

Syntactic analysis, or parsing, is a major component in a rule-based MT system. It is the<br />

process by which a sentence is dissected or analysed in<strong>to</strong> constituent parts, <strong>to</strong> determine<br />

grammatical structure. One of the key challenges in analysis is dealing with ambiguity.<br />

One approach is what is called depth-first parsing, in which each possible solution is pur-<br />

sued <strong>to</strong> its conclusion. Each time a solution is found <strong>to</strong> be wrong, the system backtracks<br />

and takes another route until it eventually finds the correct categorisation of a word. In<br />

breadth-first parsing, alternatives are evaluated in parallel, until each alternative is found<br />

<strong>to</strong> be wrong except the right one.<br />

4.3 Basic <strong>machine</strong> translation strategies<br />

Traditionally three different approaches <strong>to</strong> MT have been used: direct translation, inter-<br />

lingua translation and transfer based translation. A few new approaches have also been<br />

established. In this section we will discuss basic strategies of MT systems.<br />

4.3.1 Multilingual versus bilingual systems<br />

Bilingual systems translate between a single pair of languages; multilingual systems<br />

translate between more than two languages. Bilingual systems are uni-directional or<br />

bi-directional, they may be designed <strong>to</strong> translate from one language <strong>to</strong> another in one<br />

direction only, or they are able <strong>to</strong> translate from both members of a language pair. As a<br />

further modification we may differentiate between reversible bilingual systems and non-<br />

reversible systems. In a reversible bilingual system the process involved in the analysis<br />

of a language can be inverted without change <strong>for</strong> the generation of output in the same<br />

language.<br />

50


4.3.2 Direct translation<br />

4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />

Figure 4.1: Direct MT system<br />

Direct translation is the oldest approach <strong>to</strong> MT. The direct translation strategy passes each<br />

sentence text <strong>to</strong> be translated through a series of standard stages. If the MT system uses<br />

direct translation, this usually means that there is no syntactic analysis after the morpho-<br />

logical analysis <strong>for</strong> the source language. The translation is based on large dictionaries<br />

and word-by-word translation with some simple grammatical adjustments e.g. on word<br />

order and morphology. A direct translation model is shown in Figure 4.1. This strategy<br />

is no longer in significant use.<br />

51


4.3.3 Interlingua<br />

4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />

The Interlingua approach is <strong>to</strong> develop a universal language-representation <strong>for</strong> text. In<br />

effect, in Interlingua there is no transfer map, and the MT model thus has phases: anal-<br />

ysis and generation. In a standard multilingual system with X source languages and Y<br />

target languages, the transfer approach will involve XY transfer maps; moreover, we<br />

need X analysers and Y genera<strong>to</strong>rs. In the Interlingua approach, only X parsers and Y<br />

genera<strong>to</strong>rs are needed per language. Interlingua based MT is done via an intermediary<br />

(semantic) representation of the source language text. Interlingua is supposed <strong>to</strong> be a lan-<br />

guage independent representation from which translations can be generated <strong>to</strong> different<br />

target languages. Translation needs two phases: analysis from the source language <strong>to</strong> the<br />

Interlingua (universal language) and generation from the universal language <strong>to</strong> the target<br />

language. An Interlingua translation model with eight languages is shown in Figure 4.2.<br />

Figure 4.2: Interlingua1 model with eight languages pairs<br />

52


4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />

To apply our <strong>framework</strong> <strong>to</strong> other generation languages, we only need <strong>to</strong> change the gen-<br />

eration phases. The intermediate representation is independent of the target language,<br />

and this is the benefit of using an Interlingua approach, since analysis and generation are<br />

separate tasks which are implemented independently.<br />

4.3.4 Transfer systems<br />

Transfer systems are a middle course between direct and Interlingua MT strategies.<br />

Transfer systems divide translation in<strong>to</strong> steps which clearly differentiate source lan-<br />

guage and target language parts. In the transfer approach there is there<strong>for</strong>e no language-<br />

independent representation: the source language intermediate representation is specific<br />

<strong>to</strong> a particular language, as is the target language intermediate representation. There is no<br />

necessary equivalence between the source and target intermediate representations <strong>for</strong> the<br />

same language. In the transfer strategy a source language sentence is first parsed in<strong>to</strong> an<br />

internal representation. Thereafter a transfer is made at both lexical and structural levels<br />

in<strong>to</strong> equivalent structures of the target language. In the third stage a translation is gener-<br />

ated. Whereas the Interlingua approach requires complete resolution of all ambiguities<br />

in the source language text so that translation in<strong>to</strong> any other language is possible, in the<br />

transfer approach only those ambiguities inherent in the language in question are tackled.<br />

This approach is a development over direct translation and this was lexically driven. The<br />

level of transfer differs from system <strong>to</strong> system - the representation varies from only syn-<br />

tactic deep structure <strong>to</strong> syntactic-semantic interpret trees. A multilingual transfer model<br />

with eight languages pairs is presented in Figure 4.3.<br />

In comparison with the Interlingua system there are clear disadvantages in the transfer<br />

approach. The addition of a new language involves not only the two modules <strong>for</strong> analy-<br />

sis and generation, but also the addition of new transfer modules, the number of which<br />

53


4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />

may vary according <strong>to</strong> the number of languages in the existing system: in the case of<br />

a two-language system, a third language would require four new transfer modules. The<br />

addition of a fourth language would entail the development of six new transfer modules,<br />

and so on as illustrated in Table 4.1.<br />

Figure 4.3: Multilinguality transfer model with eight languages pairs<br />

Table 4.1: Modules required in an all-pairs multilingual transfer system<br />

Number of languages 2 3 4 5 ... 8 n<br />

Analysis models 2 3 4 5 ... 8 n<br />

Generation models 2 3 4 5 ... 8 n<br />

Transfer models 2 6 12 20 ... 56 n 2 − n<br />

Total models 6 12 20 30 ... 72 n 2 + n<br />

The number of transfer modules in a multilingual transfer system, <strong>for</strong> all combinations<br />

of n languages, is n 2 − n. Also needed are n analysis and n generation modules, which<br />

54


would also be needed <strong>for</strong> an interlingua system.<br />

4.3. BASIC MACHINE TRANSLATION STRATEGIES<br />

As shown in Figure 4.4, the direct method has no modules <strong>for</strong> source language analysis<br />

or target language generation. In the interlingua method the source language is fully<br />

analyzed in<strong>to</strong> a language-independent representation from which the target language is<br />

generated. The transfer strategy can be viewed as falling between interlingua systems<br />

and direct systems.<br />

Figure 4.4: Difference between direct, transfer, and interlingua MT models, (Trujillo<br />

1999)<br />

Figure 4.4 shows language analysis up the left-hand side, and target-language generation<br />

down the right. The peak of the pyramid represents the theoretical interlingua represen-<br />

tation achieved by analysis and suitable <strong>for</strong> direct use by generation. However, the path<br />

<strong>to</strong> that interlingua is long. By cutting off the monolingual analysis at some point and<br />

entering in<strong>to</strong> a bilingual transfer phase, one can avoid the difficulties of a full analysis.<br />

The diagram is also intended <strong>to</strong> suggest that the more the text is analysed, the simpler the<br />

transfer will be, as depicted by the length of the line cutting across the pyramid. At the<br />

very bot<strong>to</strong>m, where there is smallest amount of analysis, and nearly all the work is done<br />

in transfer, as was the case with the early direct method systems.<br />

55


4.3.5 Statistical <strong>machine</strong> translation<br />

4.4. LINGUISTIC ASPECTS OF MT<br />

The ideas behind statistical <strong>machine</strong> translation come out of in<strong>for</strong>mation theory. Essen-<br />

tially, the document is translated on the probability p(e|a) that a string e in the target<br />

language (<strong>for</strong> example <strong>English</strong>) is the translation of a string a in the source language (<strong>for</strong><br />

example <strong>Arabic</strong>). As translation systems are not able <strong>to</strong> s<strong>to</strong>re all native strings and their<br />

translations, a document is typically translated sentence by sentence, but even this is not<br />

enough. We assign <strong>to</strong> every pair of strings (e|a) a number P(e|a), which we interpret as<br />

the probability that a transla<strong>to</strong>r, when presented with e, will produce a as its translation.<br />

You could imagine another program that takes a sentence a as input, and outputs every<br />

conceivable string e along with its P(e|a). This program would take a long time <strong>to</strong> run,<br />

even if you limit <strong>English</strong> translations <strong>to</strong> some arbitrary length. They seek the <strong>English</strong> sen-<br />

tence e that maximizes P(e|a) and minimizes time (Brown et al. 1993). To summarize,<br />

we compute P(e|a) by summing the probabilities of all alignments. For each alignment,<br />

we make two significant simplifying assumptions: Each <strong>English</strong> word is generated by<br />

exactly one <strong>Arabic</strong> word; and the generation of each <strong>English</strong> word is independent of the<br />

generation of all other <strong>English</strong> words in the sentence. This is clearly not true in theory.<br />

4.4 Linguistic aspects of MT<br />

In this section we will look more closely at the kinds of linguistic problems that MT has <strong>to</strong><br />

face and will discuss ways in which MT programs work around these problems. We will<br />

distinguish monolingual problems of morphology, lexical ambiguity, syntactic ambiguity,<br />

pragmatic aspects from bilingual problems of language contrast: lexical mismatches,<br />

structural divergence, typological differences.<br />

56


4.4.1 Non-Roman alphabet scripts<br />

4.4. LINGUISTIC ASPECTS OF MT<br />

Since computer technology developed mostly in <strong>English</strong>, other languages, particularly<br />

those with non-Roman alphabet have his<strong>to</strong>rically been seen as a special case and required<br />

new code sets <strong>to</strong> define character representations. Furthermore, not all languages with<br />

alphabetic scripts are written left-<strong>to</strong>-right, e.g. <strong>Arabic</strong> and Hebrew, so any input/output<br />

devices making this assumption will be useless <strong>for</strong> such languages. Be<strong>for</strong>e Unicode was<br />

standardised, there were different encoding systems <strong>for</strong> assigning this problem. Unicode<br />

provides a unique code <strong>for</strong> every character, no matter what the plat<strong>for</strong>m, the program and<br />

the language are. Appendix A provides the corresponding Unicode <strong>for</strong> each <strong>Arabic</strong> letter<br />

and describes the letters with their corresponding written shapes.<br />

4.4.2 Lexical ambiguity<br />

Category ambiguities or homographs are examples of lexical ambiguities which arise<br />

when there are potentially two or more ways in which a word can be analysed. More<br />

complex are lexical ambiguities, where one word can be interpreted in more than one<br />

way. Lexical ambiguities are of three basic types: category ambiguities, homographs and<br />

transfer (or translational) ambiguities.<br />

4.4.2.1 Category ambiguity<br />

The simplest type of lexical ambiguity is that of category ambiguity: a given word could<br />

be assigned <strong>to</strong> more than one grammatical or syntactic category (e.g. noun, verb or<br />

adjective) according <strong>to</strong> the context. There are several examples of this in <strong>English</strong>: light<br />

can be a noun, verb or adjective, also, control can be a noun or verb. In <strong>Arabic</strong> there<br />

are some words that can be in more than one category, <strong>for</strong> example ֒lā could be a<br />

preposition with meaning of “on”, or a verb with meaning of “raise”.<br />

57


4.4.2.2 Homograph<br />

4.4. LINGUISTIC ASPECTS OF MT<br />

The second type of lexical ambiguity occurs when a word can have two or more differ-<br />

ent meanings. Linguists distinguish between homographs, homophones and polysemes.<br />

Homographs are two (or more) ‘words’ with quite different meanings which have the<br />

same spelling: example, light (not dark or not heavy). Many <strong>Arabic</strong> words can have two<br />

or more overlapping meanings examples; ֓i֒lān could be announcement, advertisement,<br />

declaration or sign. Also, <br />

mrkz could be centre, position, rank or status.<br />

Moreover, mwq֒ could be position, rank, site or status. The direct approach has<br />

particular problems with homographs; the usual method of resolving homograph ambi-<br />

guities is <strong>to</strong> look at the closest words <strong>for</strong> clues.<br />

4.4.3 Syntactic ambiguity<br />

Syntactic ambiguity arises when there is more than one way of analysing the underlying<br />

structure of a sentence according <strong>to</strong> the grammar used in the system. Example, I know a<br />

man with a dog who has fleas, is ambiguous. It could be the man or the dog who has fleas.<br />

It is the syntax not the meaning of the words which is unclear. The classical example is<br />

He saw the girl with the telescope. For the purposes of this discussion, we represent these<br />

examples in the notation of a context-tree grammar rather than in RRG notation.<br />

The two trees in Figure 4.5 and Figure 4.6 represent the two different analyses in the<br />

sense of recording two different ‘parse his<strong>to</strong>ries’. In linguistic terms, they correspond <strong>to</strong><br />

the two readings of the sentence: one in which the PP is part of the NP (i.e. the girl has<br />

the telescope), and the other where the PP is the same level as the subject (i.e. the man<br />

has the telescope). For convenience, a bracketed notation <strong>for</strong> trees is sometimes used:<br />

the equivalents <strong>for</strong> the trees in Figure 4.5 and Figure 4.6 are shown in (4.1a) and (4.1b)<br />

respectively.<br />

58


Figure 4.5: NP rule (NP –> det n pp)<br />

Figure 4.6: PP is attached at a higher level<br />

4.4. LINGUISTIC ASPECTS OF MT<br />

59


(4.1a)<br />

4.5. CHALLENGES OF ARABIC TO ENGLISH MT<br />

S(NP(pron(he)), VP( v(saw),NP(det(the), N(girl),<br />

PP(prep(with),NP(det(the),n(telescope))))))<br />

(4.1b)<br />

S(NP(pron(he)), VP(v(saw),NP(det(the),n(girl)),<br />

PP(prep(with),NP(det(the),n(telescope)))))<br />

The tree structures required may of course be much more complex, not only in the sense<br />

of having more levels, or more branches at any given level, but also in that the labelling<br />

of the nodes (i.e. the ends of the branches) may be more in<strong>for</strong>mative.<br />

4.4.4 Structural differences<br />

Many relatively trivial syntactic differences between languages are well known, e.g. in<br />

<strong>Arabic</strong> most adjectives follow nouns but in <strong>English</strong> adjectives normally precede the nouns<br />

they qualify. Also, <strong>Arabic</strong> sentences have more than one structural type. The sentence<br />

which contains a verb, will have order of the <strong>for</strong>m verb(V), subject(S) and object(O) or<br />

verb(V), object(O) and subject(S). The only combinations that do not occur in <strong>Arabic</strong> are<br />

OSV and SOV (Attia 2004).<br />

4.5 Challenges of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> MT<br />

<strong>Arabic</strong> words can often be ambiguous due <strong>to</strong> the three-letter root system. These conso-<br />

nant roots interlock with patterns of vowels or consonants <strong>to</strong> words or word stems. This<br />

root system allows the language <strong>to</strong> evolve <strong>to</strong> cover a wide range of meanings. In some<br />

derivations one or more of the root letters is dropped, resulting in possible ambiguity.<br />

Examples of derived words from a three-letter-root are shown in Table 4.2.<br />

60


4.5. CHALLENGES OF ARABIC TO ENGLISH MT<br />

Table 4.2: Derived words from a three-letter-root in <strong>Arabic</strong><br />

<strong>Arabic</strong> Example POS<br />

<br />

kataba he wrote verb<br />

<br />

kātaba he corresponded verb<br />

kutiba it was written verb<br />

ktiāb book noun<br />

<br />

kutub books noun<br />

kātib writer; (adj) writing noun<br />

kutāb writers noun<br />

<br />

maktab desk; office noun<br />

makātib desks; offices noun<br />

<br />

maktabah library noun<br />

A root is defined in (Ryding 2007) as “a relatively invariable discontinuous bound mor-<br />

pheme, represented by two <strong>to</strong> five phonemes, typically three consonants in a certain order,<br />

which interlocks with a pattern <strong>to</strong> <strong>for</strong>m a stem and which has lexical meaning.”<br />

There are also two and four letter roots. They are discontinuous because the root letters<br />

can be interspersed with other letters in a pattern. However, the order of the root letters<br />

must be the same.<br />

A pattern is defined in (Ryding 2007) as “a bound and in many cases discontinuous mor-<br />

pheme consisting of one or more vowels and slots <strong>for</strong> root phonemes (radicals), which<br />

either alone or in combination with one <strong>to</strong> three derivational affixes, interlocks with a<br />

root <strong>to</strong> <strong>for</strong>m a stem, and which generally has grammatical meaning.”<br />

Patterns signify grammatical or language-internal in<strong>for</strong>mation, distinguishing word types<br />

and classes. These patterns can differentiate between nouns, verbs and adjectives, but<br />

also give more detailed in<strong>for</strong>mation about sublasses of these categories. There are fewer<br />

patterns than roots.<br />

<strong>Arabic</strong> has a large set of morphological features (Al-Sughaiyer and Al-Kharashi 2004).<br />

61


4.6. GENERATION<br />

These features are in the <strong>for</strong>m of prefixes, suffixes and also infixes that can completely<br />

change the meaning of the word. Also, in <strong>Arabic</strong> there are some words that hold the<br />

meaning of a full sentence <strong>for</strong> example, snsāfr , would translate <strong>to</strong>; We will<br />

travel. in <strong>English</strong>. This means any MT system should apply thorough analysis in order<br />

<strong>to</strong> obtain the root or <strong>to</strong> deduce that in one word there is in fact a full sentence. <strong>Arabic</strong><br />

has a relatively free word order, this poses a significant challenge <strong>to</strong> MT due <strong>to</strong> the vast<br />

possibilities <strong>to</strong> express the same sentence in <strong>Arabic</strong>.<br />

4.6 Generation<br />

In this section we discuss the generation of target language texts.<br />

4.6.1 Generation in direct systems<br />

In direct systems in Figure 4.7, generation is based as much as possible on source lan-<br />

guage structures: nothing is changed more than strictly needed <strong>for</strong> the creation of a suit-<br />

able target language word order.<br />

62


Figure 4.7: Direct MT system<br />

4.6.2 Generation in transfer-based systems<br />

4.6. GENERATION<br />

In a transfer system, the generation phase is generally divided in<strong>to</strong> two parts, syntac-<br />

tic generation and morphological generation. Syntactic generation involves creating a<br />

deep-tree structure from the output of the analysis, which is then re-ordered by trans-<br />

<strong>for</strong>mational rules. The final tree is labelled with the grammatical functions and features<br />

of the target language. This re-ordered surface structure can now be processed by the<br />

morphological genera<strong>to</strong>r, which creates labelled lexical items which can be easily turned<br />

in<strong>to</strong> target sentences.<br />

63


4.6.3 Generation in interlingua systems<br />

4.6. GENERATION<br />

The steps <strong>for</strong> generating texts in interlingua-based systems are similar <strong>to</strong> those described<br />

<strong>for</strong> transfer-based systems. Generation includes phases of syntactic and morphological<br />

generation. The main difference is that the start point is not a deep-structure syntactic<br />

representation, but an interlingua representation, probably based on predicate-argument<br />

structures. The syntactic structure must first be generated from the interlingual represen-<br />

tation by a phase often known as semantic generation. The process may be described<br />

using example in Figure 4.8. The structure <strong>to</strong> be generated is shown in Figure 4.9.<br />

Figure 4.8: Semantic generation<br />

64


4.7 Summary<br />

Figure 4.9: Structure <strong>to</strong> be generated<br />

4.7. SUMMARY<br />

In the stages of analysis and generation, most MT systems contain separated components<br />

dealing with different levels of linguistic description: morphology, syntax, semantics.<br />

Hence, analysis may be divided in<strong>to</strong> morphological analysis, syntactic analysis and se-<br />

mantic analysis (Hutchins and Somers 1992).<br />

For the purposes of this study, our proposed solution <strong>to</strong> an <strong>Arabic</strong>-<strong>English</strong> transla<strong>to</strong>r will<br />

be based upon the interlingua model of <strong>machine</strong> translation. <strong>Arabic</strong> is unique in many<br />

ways but is not immune <strong>to</strong> the standard challenges faced by other languages such as<br />

multiple meanings of words, non-verbalisation and insufficient lexicons.<br />

An Interlingua model that incorporates source language analysis, thereby creating a so<br />

called universal logical structure, will facilitate multiple language generation in a more<br />

flexible way. An Interlingua model is presented in Figure 4.10.<br />

65


Figure 4.10: Interlingua model of <strong>Arabic</strong> MT<br />

4.7. SUMMARY<br />

For the elements of subject(S), verb(V) and object(O), <strong>Arabic</strong>’s relatively free word order<br />

allows the combinations of SVO, VSO and VOS. The only combinations that do not<br />

occur in <strong>Arabic</strong> are OSV and SOV. <strong>Arabic</strong>’s flexible word order is discussed later in this<br />

research. Our research develops a rule-based and lexical <strong>framework</strong> <strong>for</strong> the processing of<br />

<strong>Arabic</strong> using the Role and Reference Grammar (RRG) linguistic model. The <strong>framework</strong><br />

is <strong>to</strong> be evaluated using a <strong>machine</strong> translation system that translates an <strong>Arabic</strong> text as<br />

source language in<strong>to</strong> an <strong>English</strong> text as target language.<br />

66


5<br />

Design of <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation<br />

system based on RRG<br />

The UniArab system is a natural language processing application based on Role and<br />

Reference Grammar (RRG) <strong>for</strong> translating the <strong>Arabic</strong> language in<strong>to</strong> any other language,<br />

using an interlingua bridge. An interlingua based MT approach <strong>to</strong> translation is done<br />

via an intermediate semantic representation of the source language (Hutchins 2003). The<br />

conceptual architecture of the UniArab system is shown in Figure 5.1. To apply it <strong>to</strong><br />

any other language, we need only change phases 9, 10, 11 and 12. Figure 5.1 will be<br />

discussed in more detail in Chapter 6.<br />

67


5.1. UNIARAB: INTERLINGUA-BASED SYSTEM<br />

Figure 5.1: The conceptual architecture of the UniArab system<br />

5.1 UniArab: Interlingua-based system<br />

In interlingua MT systems, the result of source language analysis is a language indepen-<br />

dent representation of the text which is the basis <strong>for</strong> the generation of the target language<br />

text. The advantages of using interlingua <strong>for</strong> multilingual systems have already have<br />

been mentioned in Chapter 4. The challenges start with analysis and generation, they<br />

have <strong>to</strong> be strictly separated; it is not desirable <strong>to</strong> learn about analysis <strong>to</strong>wards a par-<br />

ticular target language and it is not possible, during generation, <strong>to</strong> refer <strong>to</strong> the original<br />

source language text. Using an RRG based interlingua bridge creates strong analysis<br />

methods that incorporate all attributes of a sentence and its words including the logical<br />

structure of its verbs. This technique could be very amenable <strong>to</strong> interlingua. The interlin-<br />

gua representation must include all the in<strong>for</strong>mation that can possibly be required during<br />

68


5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />

the generation of any target language text or rather more correctly: any target language<br />

included in the system from the outset or planned <strong>for</strong> the future. In effect, this high<br />

degree of language-independence and objectivity means that interlinguas must strive <strong>to</strong>-<br />

wards universality in lexicon and structure: one might almost say, <strong>to</strong>wards representing<br />

the meaning of the text. Most interlingua-based systems use representations. The Chom-<br />

skyan theory of deep structures was thought <strong>to</strong> be attractive, but it is now agreed they<br />

are not sufficiently abstract, being <strong>to</strong>o oriented <strong>to</strong>wards the surface features of individ-<br />

ual languages. The implications of neutral structural representations can be illustrated<br />

by allowing <strong>for</strong> differences of word order between languages, and their significance. In<br />

<strong>English</strong>, word order is the primary means of distinguishing grammatical functions like<br />

subject and object. The <strong>Arabic</strong> language has a relatively free word order. The implica-<br />

tion <strong>for</strong> an interlingua is that it is not enough <strong>to</strong> designate word order on its own: the<br />

interlingua must represent the significance in terms of grammatical function (syntactic<br />

relations), text function, determination, case role or whatever else the interpretation of<br />

the word-order dictates. Structural differences can be treated in transfer-based systems<br />

by structural transfer rules. But in interlingua-based systems the representation must be<br />

language-neutral.<br />

5.2 Designing an XML lexicon architecture <strong>for</strong> <strong>Arabic</strong> MT based<br />

on RRG<br />

The lexicon in RRG takes the position that lexical entries <strong>for</strong> verbs should contain unique<br />

in<strong>for</strong>mation only, with as much in<strong>for</strong>mation as possible derived from general lexical<br />

rules. The lexicon is designed <strong>to</strong> reflect the word categories in the <strong>Arabic</strong> language with<br />

as much in<strong>for</strong>mation as possible derived from general lexical rules. The lexicon s<strong>to</strong>res<br />

the <strong>Arabic</strong> words in categories, each category is s<strong>to</strong>red in an XML <strong>for</strong>mat datasource<br />

69


5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />

file. In order <strong>to</strong> be able <strong>to</strong> analyses <strong>Arabic</strong> by computer we must first extract the lexical<br />

properties of the <strong>Arabic</strong> words. The UniArab system uses the lexicon <strong>to</strong> construct a logi-<br />

cal structure <strong>for</strong> <strong>Arabic</strong> input sentences, also represented in XML, which is then used <strong>for</strong><br />

generating the target language translation. We show the structure of the UniArab lexicon,<br />

discuss how it is used in the system, and show the user interface used <strong>for</strong> adding <strong>to</strong> the<br />

lexicon. The lexicon is built from individual words at present.<br />

5.2.1 An XML-based lexicon<br />

In order <strong>to</strong> build this system and represent the data sources, we use the XML language<br />

and Java. The most recent recommendation of the XML language has been presented<br />

by Bray et al. (2008). XML has become the default standard <strong>for</strong> data exchange among<br />

heterogeneous data sources (Arciniegas 2000). The UniArab system allows data <strong>to</strong> be<br />

s<strong>to</strong>red in XML <strong>for</strong>mat. This data can then be queried, exported and serialized in<strong>to</strong> any<br />

<strong>for</strong>mat the developer wishes.<br />

We choose <strong>to</strong> create our data source as XML, <strong>for</strong> optimum support or different plat-<br />

<strong>for</strong>ms. It was also easier as we used <strong>Arabic</strong> letters not Unicode inside the data source,<br />

XML fully supported <strong>Arabic</strong>. We created our search engine using Java.<br />

5.2.2 Lexical representation in UniArab<br />

Lexical frames represent the language-dependent lexicon. We use an XML data source<br />

<strong>to</strong> represent the UniArab lexicon. The lexicon creates pointers <strong>to</strong> corresponding con-<br />

ceptual frames or attributes of each word. These frames also have relations which link<br />

them <strong>to</strong> verb class frames, which are organized hierarchically according <strong>to</strong> the particular<br />

language.<br />

70


5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />

In Phase 3 in Figure 5.1, the UniArab system <strong>to</strong>kenizes a sentence in<strong>to</strong> words, then sends<br />

each word <strong>to</strong> the search engine within the Lexicon <strong>to</strong> query the category of each word<br />

and all attributes <strong>for</strong> that word. The Lexicon returns the corresponding category and its<br />

attributes as detailed below. The Morphology Parser, Phase 5, receives the word meta-<br />

data and ensures that the properties of the words are consistent. The verb attributes in<br />

particular, are of great importance in correctly extracting sentence logical structure fur-<br />

ther down the processing chain, helping <strong>to</strong> answer the basic question ‘Who does what?’<br />

In free word order sentences, <strong>for</strong> example, yh. b qys lylā, ‘Qays loves<br />

Laila’ multiple orders are possible including verb-subject-object, verb-object-subject or<br />

subject-verb-object. The attributes of the verb agree with the gender of the subject. Given<br />

the masculine gender of the verb in this case, the Syntactic Parser will look <strong>for</strong> a mascu-<br />

line proper noun <strong>to</strong> make the ac<strong>to</strong>r <strong>for</strong> this sentence. If there is more than one masculine<br />

proper noun in such a case, then Modern Standard <strong>Arabic</strong> defines the first proper noun<br />

as the ac<strong>to</strong>r. The Morphology Parser will be extended so that it can deal with words that<br />

are defined in multiple categories, deciding which should be processed. Meanwhile the<br />

Syntactic Parser, so far, has only been implemented <strong>for</strong> extracting word order, though it<br />

will be extended <strong>to</strong> deal with word ambiguities in future versions.<br />

5.2.3 Lexical properties<br />

Figure 5.2 shows the structure of the Lexicon including the properties s<strong>to</strong>red <strong>for</strong> each<br />

word category. For all categories, an <strong>Arabic</strong> word is s<strong>to</strong>red along with its <strong>English</strong> repre-<br />

sentation. Since word ambiguity has not been dealt with so far, there is a one <strong>to</strong> one map-<br />

ping <strong>for</strong> the simple sentences which UniArab processes up <strong>to</strong> now. However, word am-<br />

biguity is supported in the structure, with each possible case s<strong>to</strong>red as a separate record.<br />

All search results will be passed <strong>to</strong> the Morphology Parser <strong>to</strong> decide which is taken.<br />

71


5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />

Figure 5.2: In<strong>for</strong>mation recorded in the UniArab lexicon<br />

72


5.2. DESIGNING AN XML LEXICON ARCHITECTURE FOR ARABIC MT BASED ON RRG<br />

Since the verb is the key component when analysing using RRG, each verb has an asso-<br />

ciated logical structure, which is later used <strong>to</strong> determine the logical structure of the full<br />

sentence. The tense of the verb is also s<strong>to</strong>red within its metadata along with the person.<br />

The verb type also s<strong>to</strong>res the gender, which in <strong>Arabic</strong> must be either masculine or femi-<br />

nine; there is no neutral gender. The number property in arabic can be singular, dual or<br />

plural. These properties help the Syntactic Parser analyse the sentence, since there must<br />

be agreement with the subject and verb, among other rules.<br />

Although we adhere <strong>to</strong> the Interlingua approach, we do not do so with the translation<br />

of lexical items. In an ideal Interlingua system lexical entries should be broken down<br />

in<strong>to</strong> sets of semantic features. For example the word “man” is broken down in<strong>to</strong> +human<br />

+male +adult. While this works in theory, in practice we cannot find enough seman-<br />

tic features <strong>to</strong> describe every entity in the world. For example “cow”, “computer” and<br />

“chair” cannot be described using these sets of semantic features unless we invent a<br />

unique semantic feature <strong>for</strong> every object and this is practically impossible, and of course,<br />

beyond the scope of this thises.<br />

Table 5.1: Verb 1<br />

<strong>Arabic</strong> verb qr֓a<br />

<strong>English</strong> translation read<br />

Logical structure [do’(x,[read’(x,(y)])]<br />

Tense past<br />

Gender m<br />

Person 3rd<br />

Number singular<br />

In Tables 5.1, 5.2, we show two examples of records <strong>for</strong> verbs in the Lexicon. The<br />

absence of t ‘t’ suffix signifies m: gender. The <strong>English</strong> translation of these verbs are<br />

‘read’ and ‘wrote’.<br />

An example of the XML record <strong>for</strong> a verb in the Lexicon is shown here;<br />

73


/><br />

<strong>English</strong>Translate=“read”<br />

Table 5.2: Verb 2<br />

<strong>Arabic</strong> verb <br />

ktbt<br />

<strong>English</strong> translation wrote<br />

Logical structure [do’(x,[write’(x,(y)])]<br />

Tense past<br />

Gender f<br />

Person 3rd<br />

Number singular<br />

LogicalStructures= “”<br />

NumberVerb=“sg”<br />

P.O.S=“Verb”<br />

genderVerb=“M”<br />

personVerb=“3rd”<br />

tenseVerb=“PAST”<br />

5.3 Design of test strategy<br />

5.3. DESIGN OF TEST STRATEGY<br />

We will create variants of <strong>Arabic</strong> sentences that represent all possible structures of sen-<br />

tences that UniArab can translate. We will evaluate the result of the system output by<br />

comparing between human-translated and <strong>machine</strong>-translated versions. In Tables 5.3 <strong>to</strong><br />

5.9 we represent some examples of sentences that are used <strong>to</strong> test the UniArab system.<br />

For actual test examples see Appendix C.<br />

Verb-Subject one argument in deferent tenses:<br />

In Table 5.3, Verb-Subject Agreement with two arguments sentences, are sentences where<br />

UniArab should select the correct <strong>for</strong>m of the verb. In particular the verb must agree with<br />

the subject.<br />

74


5.3. DESIGN OF TEST STRATEGY<br />

Table 5.3: Test strategy: verb-subject agreement<br />

<strong>Arabic</strong> human-translated UniArab other<br />

yˇsrb ֒mr ālh. lyb Omar is drinking the milk. ? ?<br />

ˇsrb ֒mr ālh. lyb Omar drank the milk. ? ?<br />

mārk qrā ālktāb Mark read the book. ? ?<br />

syˇsrb mārk āllbn Mark will drink the milk ? ?<br />

Demonstrative Adjective-Noun:<br />

The system should place the Demonstrative Adjective-Noun Agreement that agrees in<br />

number and gender. The test sentences are shown in Table 5.4.<br />

Table 5.4: Test strategy: demonstrative adjective-noun agreement<br />

<strong>Arabic</strong> human-translated UniArab best of rest<br />

hd ¯ ā ālrˇgl This man ? ?<br />

d ¯ lk ālrˇgl That man ? ?<br />

Gender-Ambiguous proper nouns:<br />

Proper nouns can confuse MT in two different ways. The first, the MT system may<br />

not identify that the word is a proper noun and analyse it as a noun, adjective, or any<br />

other categories. The second is that it may fail <strong>to</strong> identify the gender of the noun and<br />

thus fail <strong>to</strong> provide in<strong>for</strong>mation needed <strong>for</strong> agreement in <strong>Arabic</strong>. The test sentences<br />

are shown in Table 5.5. The UniArab system should follow the rules <strong>for</strong> agreement in<br />

number and gender. This is due <strong>to</strong> the fact that <strong>Arabic</strong> differs greatly from <strong>English</strong> in<br />

the distribution of number and gender in the pronoun system, lexical items as well as<br />

the syntactic structure. This difference results in many agreement problems during the<br />

translation process.<br />

Table 5.5: Test strategy: gender-ambiguous proper nouns<br />

<strong>Arabic</strong> human-translated UniArab best of rest<br />

<br />

qr֓a ˇgāk ālktāb Jack read the book. ? ?<br />

qr֓at māry ālktāb Mary read the book. ? ?<br />

75


Copula verb ‘<strong>to</strong> be’:<br />

5.3. DESIGN OF TEST STRATEGY<br />

There are certain cases where the standard NP VP categorisation does not apply due <strong>to</strong> the<br />

absence of a copula verb in the language. In <strong>Arabic</strong> there is no verb ‘<strong>to</strong> be’ (Salem et al.<br />

2008b). UniArab should understand if the sentences contain verb ‘<strong>to</strong> be’ and generate<br />

them correctly. The test sentences are shown in Table 5.6.<br />

Table 5.6: Test strategy: verb ‘<strong>to</strong> be’<br />

<strong>Arabic</strong> human-translated UniArab best of rest<br />

֓anā ālmhnds I am the engineer ? ?<br />

hw mhnds He is an engineer ? ?<br />

Verb ‘<strong>to</strong> have’:<br />

UniArab should understand if the sentences contain ‘<strong>to</strong> have’ and generate them correctly.<br />

<strong>Arabic</strong>, like Modern Irish, has no verb of ‘<strong>to</strong> have’. The test sentences are shown in Table<br />

5.7.<br />

<strong>Arabic</strong><br />

Table 5.7: Test strategy: verb ‘<strong>to</strong> have’<br />

human-translated UniArab best of rest<br />

<br />

lqd qmt bālh. ˇgz I have made a reservation. ? ?<br />

<br />

lqd fqdt tdkrty I have lost my ticket. ? ?<br />

¯<br />

The free word order in <strong>Arabic</strong>:<br />

<strong>Arabic</strong> has free word order, this poses a significant challenge <strong>to</strong> MT due <strong>to</strong> the vast<br />

possibilities <strong>to</strong> express the same sentence in <strong>Arabic</strong> (Salem et al. 2008a). The ac<strong>to</strong>r in<br />

Table 5.8 could be the first, second or third argument. UniArab should analyse who the<br />

ac<strong>to</strong>r is.<br />

Pro–Drop:<br />

In technical linguistic terms, <strong>Arabic</strong> is a ‘pro–drop’ or ‘pronoun–drop’ language (Ryd-<br />

ing 2007). The pro–drop parameter is an aspect of grammar that allows subjects <strong>to</strong> be<br />

optional but unders<strong>to</strong>od in some languages. That is, every inflection in a verb paradigm<br />

76


5.4. DESIGN OF EVALUATION CRITERIA<br />

Table 5.8: Test strategy: free word order (Verb Noun Noun)<br />

<strong>Arabic</strong> human-translated UniArab best of rest<br />

yh. b qys lylā Qays loves Laila. ? ?<br />

qys yh. b lylā Qays loves Laila. ? ?<br />

yh. b lylā qys Qays loves Laila. ? ?<br />

is specified uniquely and does not need <strong>to</strong> use independent pronouns <strong>to</strong> differentiate the<br />

person, number, and gender of the verb. The test sentences are shown in Table 5.9.<br />

Table 5.9: Test strategy: pro–drop<br />

<strong>Arabic</strong> human-translated UniArab best of rest<br />

<br />

fāttny ālt.ā֓yrh (I) missed the plane. ? ?<br />

֓aryd ˙grfh (I) want a room. ? ?<br />

<br />

nsyt mh. fz . ty (I) <strong>for</strong>got my wallet. ? ?<br />

֓aryd hātm (I) want a ring. ? ?<br />

˘<br />

5.4 Design of evaluation criteria<br />

We will evaluate the result of output by comparing with human-translated and <strong>machine</strong>-<br />

translated versions . Comparisons can be made between two <strong>machine</strong> translation systems,<br />

or between human-translated and <strong>machine</strong>-translated sentences. UniArab system is com-<br />

pared with translations done by human transla<strong>to</strong>rs. Then this result is compared with<br />

the results of other (<strong>Arabic</strong> <strong>to</strong> <strong>English</strong>) Machine translation systems. We are comparing<br />

different levels of human translation with UniArab system output, using human subjects<br />

as judges. The human judges were skilled <strong>for</strong> the purpose of Machine Translation; it is<br />

an efficient evaluation <strong>for</strong> MT research. The evaluation study compared an MT system<br />

translating from <strong>Arabic</strong> in<strong>to</strong> <strong>English</strong> with human transla<strong>to</strong>rs. The human transla<strong>to</strong>rs were<br />

a native <strong>Arabic</strong> speaking L1 adults who had <strong>English</strong> as their L2. The five point scale <strong>for</strong><br />

adequacy indicates how much of the meaning expressed in the reference translation is<br />

77


also expressed in a hypothetical translation:<br />

5 = All<br />

4 = Most<br />

3 = Much<br />

2 = Little<br />

1 = None<br />

5.5. SUMMARY<br />

The second five point scale indicates how fluent the translation is. When translating in<strong>to</strong><br />

<strong>English</strong> the values correspond <strong>to</strong>:<br />

5 = Flawless <strong>English</strong><br />

4 = Good <strong>English</strong><br />

3 = Non-native <strong>English</strong><br />

2 = Bad <strong>English</strong><br />

1 = Incomprehensible<br />

5.5 Summary<br />

UniArab is designed as an Interlingua <strong>machine</strong> transla<strong>to</strong>r, which takes <strong>Arabic</strong> sentences<br />

and analyses their structure producing in interlingua representation which can then be<br />

used in isolation <strong>to</strong> generate the <strong>English</strong> translation. We presented a test strategy in<br />

which a wide range of sentence types will be used <strong>to</strong> test the effectiveness of UniArab.<br />

We then set evaluation criteria which can be used <strong>to</strong> quantify how the system per<strong>for</strong>ms<br />

<strong>for</strong> each of these test types.<br />

78


6<br />

UniArab: a proof-of-concept <strong>Arabic</strong> <strong>to</strong> <strong>English</strong><br />

<strong>machine</strong> translation system<br />

This chapter presents an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> transla<strong>to</strong>r system, called UniArab.<br />

UniArab is a proof-of-concept translation system supporting the fundamental aspects of<br />

<strong>Arabic</strong>, such as the parts of speech, agreement and tenses. UniArab stands <strong>for</strong> Universal<br />

<strong>Arabic</strong> <strong>machine</strong> transla<strong>to</strong>r system. UniArab is based on the linking algorithm of RRG<br />

(syntax <strong>to</strong> semantics and vice versa) as indicated in Figure 6.1.<br />

Figure 6.1: Layout of Role and Reference Grammar<br />

79


6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />

6.1 Conceptual structure of the UniArab system<br />

The conceptual structure of the UniArab system is shown in figure 6.2. The system<br />

accepts <strong>Arabic</strong> as its source language. The morphology parser and word <strong>to</strong>kenizer have<br />

a connection <strong>to</strong> the lexicon which holds all attributes of a word.<br />

Figure 6.2: The conceptual architecture of the UniArab system<br />

UniArab s<strong>to</strong>res data in XML <strong>for</strong>mat. This data can then be queried, exported and se-<br />

rialized in<strong>to</strong> any <strong>for</strong>mat the developer wishes. The system can understand the part of<br />

speech of a word, agreement features, number, gender and the word type. The syntactic<br />

parse unpacks the agreement features between elements of the <strong>Arabic</strong> sentence in<strong>to</strong> a<br />

semantic representation (the logical structure) with the ‘state of affairs’ of the sentence.<br />

80


6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />

In UniArab we intend <strong>to</strong> have a strong analysis system that can extract all attributes from<br />

the words in a sentence.<br />

6.1.1 Technical architecture of the UniArab system<br />

The structure of the UniArab system in Figure 6.2 breaks down in<strong>to</strong> the following phases:<br />

Phase (1) - <strong>Arabic</strong> language sentence. The input <strong>to</strong> the system consists of one or more<br />

sentences in <strong>Arabic</strong>.<br />

Phase (2) - Sentence Tokenizer. Tokenization is the process of demarcating and classi-<br />

fying sections of a string of input characters. In this phase the system splits the text<br />

in<strong>to</strong> sentence <strong>to</strong>kens. The resulting <strong>to</strong>kens are then passed <strong>to</strong> the word <strong>to</strong>kenizer<br />

phase. For example <br />

<br />

qr֓a hāld ālktāb. hāld tlmyd<br />

˘ ˘ ¯<br />

d<br />

¯ ky. will be two <strong>to</strong>kens; qr֓a hāld ālktāb and <br />

˘ <br />

hāld ˘<br />

tlmyd ¯ d ¯ ky the translation of these two sentences is Khalid read the book. Khalid is<br />

a clever student.<br />

Phase (3) Word Tokenizer There, sentences are split in<strong>to</strong> <strong>to</strong>kens <br />

qr֓a<br />

hāld<br />

ālktāb Khalid read the book, the output of phase 3 is as follows;<br />

˘<br />

<br />

qr֓a<br />

h ˘ āld<br />

ālkt āb<br />

<br />

Phase (4) Lexicon Datasource A set of XML documents <strong>for</strong> each component category<br />

of <strong>Arabic</strong>.<br />

Phase (5) Morphology Parser Directly works with both the Lexicon and Tokenizer <strong>to</strong><br />

produce the word order. A connection is made <strong>to</strong> the datasource of phase 4 which<br />

81


6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />

has been implemented as a set of XML documents. The use of XML has the added<br />

advantage of portability. UniArab will effectively work the same regardless of the<br />

operating system. To understand the morphology of each word, we first <strong>to</strong>kenize<br />

each sentence and determine the word relationships. Phase 5 of the system holds all<br />

attributes specific <strong>to</strong> each word of the source sentence.<br />

Phase (6) Syntactic Parser Determines the precise phrasal structure and category of the<br />

<strong>Arabic</strong> sentence. At this point, the types and attributes of all words in the sentence<br />

are known.<br />

Phase (7) Syntactic linking (RRG) We must first develop the link from syntax <strong>to</strong> se-<br />

mantics out of the phrasal structure created in Phase 6, if we are <strong>to</strong> create a logical<br />

structure that will generate a target language and also act as the link in the opposite<br />

direction from semantics <strong>to</strong> syntax. The system should answer the main question in<br />

this phase, who does what <strong>to</strong> whom? We use the gender of the verb <strong>to</strong> determine<br />

the ac<strong>to</strong>r. When the subject and object have different genders, the gender of the verb<br />

must match the subject. If they both agree with the verb, then MSA dictates that<br />

the first noun is the subject. In this case the ac<strong>to</strong>r is Khalid and the undergoer is the<br />

book.<br />

Phase (8) Logical Structure Creation of logical structure is the most crucial phase. An<br />

accurate representation of the logical structure of an <strong>Arabic</strong> sentence is the primary<br />

strength of UniArab. Below is a sample output from the UniArab system. The<br />

<strong>Arabic</strong> equivalent of the past tense sentence ‘Khalid read the book’ <br />

<br />

qr֓a h ˘ āld ālktāb is input as the source.<br />

ālktāb book:N h ˘ āld Khalid:MsgN qr֓a read:V<br />

The results of the parse can be seen in the following logical structure:<br />

Verb read<br />

82


6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />

<br />

sg 3rd M PAST qr֓a<br />

where the Proper Noun is Khalid sg unspec M <br />

h ˘ āld<br />

and the Noun is, the book sg def M ālktāb<br />

Consider the following example; Omar is a student. be’(Omar,[student’]).<br />

in <strong>Arabic</strong> ֒mr tlmyd ¯ . This is a challenge since there is no verb ‘<strong>to</strong> be’ in<br />

<strong>Arabic</strong>, but this must be inferred <strong>for</strong> correct translation. Instead of saying ‘Omar is<br />

a student’, the <strong>Arabic</strong> equivalent would be ‘Omar student’. We also face the chal-<br />

lenge of inferring the indefinite article, which does not exist in <strong>Arabic</strong>. All of the<br />

unique in<strong>for</strong>mation <strong>for</strong> each word can thus be taken from the lexicon <strong>to</strong> aid in the<br />

creation of a logical structure of the target language.<br />

Phase (9) Semantic <strong>to</strong> Syntax Assuming we have an input and have produced a struc-<br />

tured syntactic representation of it, the grammar can map this structure from a se-<br />

mantic representation. In this phase the system uses a linking algorithm provided by<br />

RRG <strong>to</strong> determine ac<strong>to</strong>r and undergoer assignments, assign the core arguments and<br />

assign the predicate in the nucleus. The system uses semantic arguments of logical<br />

structures other than of the main verb.<br />

Phase (10) Syntax Generation This will be unique <strong>for</strong> each target language. In this<br />

phase the system uses the target language rules <strong>to</strong> generate the syntax. In this case<br />

<strong>English</strong> language rules are used.<br />

Phase (11) Generate <strong>English</strong> Morphology The system generates <strong>English</strong> morphology<br />

in an innovative way, generating the tenses not existent in <strong>Arabic</strong> but in <strong>English</strong> as<br />

well as verb ‘<strong>to</strong> be’.<br />

Figure 6.3 shows the technique used <strong>to</strong> generate the correct verb tenses, and generate<br />

verb <strong>to</strong> be. Verbs in <strong>English</strong> have a mood; e.g. indicative, subjunctive, imperative<br />

83


6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />

Figure 6.3: Generation the right tense <strong>for</strong> the verbs<br />

and can be in one of many tenses. We discussed the special situation with refer-<br />

ence <strong>to</strong> the intersection of <strong>Arabic</strong> tense and aspect in Chapter 2. The solution is <strong>to</strong><br />

recognize the difference between morphological features and syntactic functional<br />

categories. The tense features must be expressed analytically.<br />

Phase (12) <strong>English</strong> Sentence Generation The process of generating an <strong>English</strong> sentence<br />

can be as simple as keeping a list of rules. These rules can be extended through the<br />

life of the MT system. The system will use some operations in <strong>English</strong> such as<br />

vowel change: examples; man men. Sometimes this accompanies affixations: break<br />

broke broken (= broke + en).<br />

6.1.2 UniArab: Lexical representation in interlingua system<br />

In transfer-based systems there are no problems if <strong>for</strong> a particular language pair there<br />

are one-<strong>to</strong>-one equivalents; the problems arise when there is more than one target word<br />

<strong>for</strong> a single source word. But <strong>for</strong> an interlingua in a multilingual system there are prob-<br />

lems even if only one of the languages involved has two or more potential <strong>for</strong>ms <strong>for</strong> a<br />

84


6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM<br />

single given word in one of the other languages. If an interlingua is <strong>to</strong> be completely<br />

language-neutral, it must represent not the words of one or another of the languages,<br />

but language-independent lexical units. Any distinction which is (or can be) expressed<br />

lexically in the languages of the system must be represented explicitly in the interlingua<br />

representation (Hutchins and Somers 1992). The UniArab system can generate a target<br />

language by classifying every <strong>Arabic</strong> word in the source text. There are six major parts<br />

of speech in <strong>Arabic</strong>. These are Verbs, Nouns, Adjectives, Proper nouns, Demonstratives,<br />

Adverbs and we create a seventh, so called ‘other’ category <strong>for</strong> <strong>Arabic</strong> words which do<br />

not fit in<strong>to</strong> any of these six categories. The major parts of speech in the <strong>Arabic</strong> language<br />

have their own attributes, and we use these attributes within the UniArab system. For<br />

example, verbs in the <strong>Arabic</strong> language agree with their subjects in gender. <strong>Arabic</strong> words<br />

are masculine and feminine; there is no neutral gender. In the UniArab system we record<br />

the gender associated with a verb in the syntax <strong>for</strong> a particular subject NP. Adjectives<br />

and demonstratives also agree with the subject in gender <strong>to</strong>o. In <strong>Arabic</strong>, words come in<strong>to</strong><br />

three categories with regards <strong>to</strong> number: They are (1) singular, indicating one, e.g. <br />

rˇgl ‘one man’. (2) dual, indicating two, e.g. rˇglān ‘two men’ and (3) plural,<br />

indicating three or more e.g. rˇgāl ‘men’. The UniArab system records these at-<br />

tributes of gender and number. It is important <strong>to</strong> understand that source language specific<br />

features may not be used or may be different in the target language. For example, the<br />

<strong>Arabic</strong> number categ ory of dual is not relevant in <strong>English</strong>. The UniArab system is based<br />

on RRG and uses logical structures <strong>for</strong> each verb in the lexicon.<br />

85


6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />

6.2 UniArab: Lexical representation in interlingua system based<br />

on RRG<br />

Lexical frames represent the language-dependent lexicon. We use an XML data source<br />

<strong>to</strong> represent the UniArab lexicon. The lexicon creates pointers <strong>to</strong> corresponding con-<br />

ceptual frames or attributes of each word. These frames also have relations which link<br />

them <strong>to</strong> verb class frames, which are organized hierarchically according <strong>to</strong> the particular<br />

language.<br />

Although we adhere <strong>to</strong> the Interlingua approach, we do not do so with the translation<br />

of lexical items. In an ideal Interlingua system lexical entries should be broken down<br />

in<strong>to</strong> sets of semantic features. For example the word “man” is broken down in<strong>to</strong> +human<br />

+male +adult. While this works in theory, in practice we cannot find enough seman-<br />

tic features <strong>to</strong> describe every entity in the world. For example “cow”, “computer” and<br />

“chair” cannot be described using these sets of semantic features unless we invent a<br />

unique semantic feature <strong>for</strong> every object and this is practically impossible.<br />

6.2.1 Verb<br />

In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.4 <strong>for</strong> each verb.<br />

The verb in<strong>for</strong>mation captured consists of <strong>Arabic</strong> Verb, <strong>English</strong> Translation, Logical<br />

Structure, Tense, Gender, Person and Number. The <strong>Arabic</strong> Verb represents one of the<br />

<strong>Arabic</strong> verbs in a specific tense, <strong>for</strong> a specific gender, person and number. The <strong>English</strong><br />

translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> verb. The Logical Structure attribute is<br />

the RRG equivalent logical structure or lexical entry representation <strong>for</strong> the <strong>Arabic</strong> Verb.<br />

<strong>Arabic</strong> inflects verbs <strong>for</strong> tense and they agree in person, number and gender with the<br />

subject. In RRG, Tense is a verbal opera<strong>to</strong>r in the layer structure of the clause providing<br />

86


6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />

in<strong>for</strong>mation about the tense of this verb.<br />

Figure 6.4: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> verb<br />

Table 6.1: Verb 1<br />

<strong>Arabic</strong> verb qr֓a<br />

<strong>English</strong> translation read<br />

Logical structure [do’(x,[read’(x,(y)])]<br />

Tense past<br />

Gender m<br />

Person 3rd<br />

Number singular<br />

Table 6.2: Verb 2<br />

<strong>Arabic</strong> verb ktbt<br />

<strong>English</strong> translation wrote<br />

Logical structure [do’(x,[write’(x,(y)])]<br />

Tense past<br />

Gender f<br />

Person 3rd<br />

Number singular<br />

In the <strong>Arabic</strong> language, tense can be past or present as the primary distinction. Gender is<br />

an <strong>Arabic</strong> attribute of the verb. The verb agrees with the subject in gender. The Person<br />

attribute could be first, second or third person. The Number attribute refers <strong>to</strong> number of<br />

the subject. In <strong>Arabic</strong>, the number of a verb can be singular, dual or plural. Table 6.1<br />

and Table 6.2 shows an example of one <strong>Arabic</strong> verb applied <strong>to</strong> different genders. The<br />

absence of t ‘t’ suffix signifies m: gender. The <strong>English</strong> translation of these verbs are<br />

‘read’ and ‘wrote’.<br />

87


6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />

6.2.2 Common noun<br />

In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.5 <strong>for</strong> each noun.<br />

The noun in<strong>for</strong>mation captured consists of <strong>Arabic</strong> Noun, <strong>English</strong> Translation, Definite-<br />

ness, Gender and Number. <strong>Arabic</strong> Noun represents a noun in the <strong>Arabic</strong> language. The<br />

<strong>English</strong> translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> Noun. Definiteness of the<br />

nouns can be definite or indefinite. Gender is an <strong>Arabic</strong> attribute of the noun.<br />

Figure 6.5: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> noun<br />

Table 6.3: Noun<br />

<strong>Arabic</strong> noun ֓aˇsˇgār ālktāb<br />

<strong>English</strong> translation trees the book<br />

Definiteness indefinite definite<br />

Gender f m<br />

Number plural singular<br />

The Number attribute refers <strong>to</strong> number of the noun. In the <strong>Arabic</strong> language number of<br />

nouns can be single, dual or plural. Table 6.3 shows examples of two different <strong>Arabic</strong><br />

noun words, whose <strong>English</strong> translations are ‘trees’ and ‘book’. Please note that ‘book’ is<br />

def+, meaning ‘definite’.<br />

6.2.3 Proper noun<br />

Proper nouns in <strong>Arabic</strong> are not capitalized. In the UniArab system we capture the in<strong>for</strong>-<br />

mation shown in Figure 6.6. For each proper noun the system captures <strong>Arabic</strong> proper<br />

noun, <strong>English</strong> translation, definiteness, gender and number. <strong>Arabic</strong> proper nouns rep-<br />

88


6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />

resents a proper noun in the <strong>Arabic</strong> language. The <strong>English</strong> translation is the <strong>English</strong><br />

equivalent of the <strong>Arabic</strong> proper noun. Gender is an <strong>Arabic</strong> attribute of the proper noun.<br />

The Number attribute refers <strong>to</strong> the number of the proper noun; single, dual or plural.<br />

Figure 6.6: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> proper noun<br />

Table 6.4: Proper Noun<br />

<strong>Arabic</strong> proper noun ֒mr ֓iymān<br />

<strong>English</strong> translation Omar Eman<br />

Gender m f<br />

Number singular singular<br />

Table 6.4 shows examples of two different <strong>Arabic</strong> proper noun words, whose <strong>English</strong><br />

translations are ‘Omar’ and ‘Eman’.<br />

6.2.4 Adjective<br />

In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.7 <strong>for</strong> each adjec-<br />

tive. This consists of <strong>Arabic</strong> Adjective, <strong>English</strong> Translation, Definiteness, Gender and<br />

Number. <strong>Arabic</strong> Adjectives represent adjectives in the <strong>Arabic</strong> language. The <strong>English</strong><br />

translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> Adjective. Definiteness can be definite<br />

or indefinite. Gender is an <strong>Arabic</strong> attribute of the adjective.<br />

Table 6.5: Adjective<br />

<strong>Arabic</strong> adjective qs. yr ālt.wylh<br />

<strong>English</strong> translation short the long<br />

Definiteness indefinite definite<br />

Gender m f<br />

Number singular singular<br />

89


6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />

Figure 6.7: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adjective<br />

The Number attribute refers <strong>to</strong> the number of the adjective. In the <strong>Arabic</strong> language num-<br />

ber agreement <strong>for</strong> adjectives can be singular, dual or plural. Table 6.5 shows examples of<br />

two different <strong>Arabic</strong> adjective words, whose <strong>English</strong> translations are ‘short’ and ‘long’,<br />

please note that ‘long’ is def+.<br />

6.2.5 Demonstrative<br />

In the UniArab system we capture the in<strong>for</strong>mation shown in Figure 6.8 <strong>for</strong> each demon-<br />

strative. this consists of <strong>Arabic</strong> Demonstrative, <strong>English</strong> Translation, Demonstrative type,<br />

Gender and Number. <strong>Arabic</strong> Demonstratives represents a demonstrative in the <strong>Arabic</strong><br />

language. The <strong>English</strong> translation is the <strong>English</strong> equivalent of the <strong>Arabic</strong> Demonstra-<br />

tive. Demonstrative type can be, in the <strong>Arabic</strong> language, near <strong>to</strong> the speaker, far from the<br />

speaker or between near and far from the speaker. Gender is an <strong>Arabic</strong> attribute of the<br />

demonstrative. The Number attribute refers <strong>to</strong> number of the demonstrative. Table 6.6<br />

shows examples of two different <strong>Arabic</strong> demonstratives, whose <strong>English</strong> translations are<br />

‘this’ and ‘that’.<br />

Table 6.6: Demonstrative representative<br />

<strong>Arabic</strong> demonstrative hd ¯ ā d ¯ lk ֓awl֓yk<br />

<strong>English</strong> translation this that those<br />

Demonstrative type close far between near and far from the speaker<br />

Gender m m both m and f<br />

Number singular singular plural<br />

90


6.2. UNIARAB: LEXICAL REPRESENTATION IN INTERLINGUA SYSTEM BASED ON RRG<br />

6.2.6 Adverb<br />

Figure 6.8: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> demonstrative.<br />

In the UniArab system we capture the in<strong>for</strong>mation shown in Figure 6.9 <strong>for</strong> each adverb.<br />

this consists of <strong>Arabic</strong> Adverb, <strong>English</strong> Translation and Adverb type. <strong>Arabic</strong> Adverbs<br />

represents an adverb in the <strong>Arabic</strong> language. The <strong>English</strong> translation is the <strong>English</strong><br />

equivalent of the <strong>Arabic</strong> Adverb. Adverbs type refers <strong>to</strong> time or place (proposition), time<br />

such as ‘<strong>to</strong>day’ or ‘<strong>to</strong>morrow’ and places like ‘under’, ‘in’, or ‘on’ etc.<br />

Figure 6.9: In<strong>for</strong>mation recorded on the <strong>Arabic</strong> adverb.<br />

Table 6.7: Adverb<br />

<strong>Arabic</strong> adverb <br />

bˇgānb ālywm<br />

<strong>English</strong> translation beside <strong>to</strong>day<br />

Adverb type Proposition time<br />

Table 6.7 shows examples of two different <strong>Arabic</strong> adverbs, whose <strong>English</strong> translations<br />

are ‘beside’ and ‘<strong>to</strong>day’.<br />

91


6.2.7 Other <strong>Arabic</strong> words<br />

6.3. UNIARAB: GENERATION<br />

In the UniArab system, we capture the in<strong>for</strong>mation shown in Figure 6.10 <strong>for</strong> each other<br />

<strong>Arabic</strong> word. This consists of <strong>Arabic</strong> Other Word, <strong>English</strong> Translation, Logical Structure,<br />

Part of Speech, Tense, Gender, Person, Number and Definiteness.<br />

Figure 6.10: In<strong>for</strong>mation recorded on the other <strong>Arabic</strong> words<br />

Table 6.8: Other <strong>Arabic</strong> words (where ‘NON’ means not applicable)<br />

<strong>Arabic</strong> other words w hy<br />

<strong>English</strong> translation and she<br />

Logical structure NON NON<br />

Part of speech conjunction pronoun<br />

Tense NON NON<br />

Gender NON f<br />

Person NON 3rd<br />

Number NON singular<br />

Definiteness<br />

We allow a variety of attribute possibilities <strong>for</strong> the category ‘other’ in <strong>Arabic</strong> words <strong>for</strong><br />

the moment. Table 6.8 shows examples of two different <strong>Arabic</strong> Other words, whose<br />

<strong>English</strong> translations are ‘and’ and ‘she’.<br />

6.3 UniArab: Generation<br />

The target language generation phases in the UniArab system follow the syntactic re-<br />

alization model. Generation takes as input, the universal logical structure of the input<br />

sentence(s) and produces as output a morphology-syntactic realisation of the sentence in<br />

the target language. The UniArab system is designed as a universal <strong>machine</strong> transla<strong>to</strong>r,<br />

92


6.3. UNIARAB: GENERATION<br />

which means that it can support translation of the <strong>Arabic</strong> in<strong>to</strong> any other natural language<br />

with the addition of additional language generation bridges. The UniArab system is eval-<br />

uated using <strong>Arabic</strong> as source language in<strong>to</strong> <strong>English</strong> as the target language. In the UniArab<br />

system phases 9, 10, 11 and 12 are <strong>for</strong> generation of the target languages, in our case this<br />

is <strong>English</strong>. For the example given under Phase 8 in Section 6.1.1, <br />

qr֓a<br />

hāld<br />

ālktāb, Khalid read the book, we have the logical structure:<br />

˘<br />

Verb read [do’(x,[read’(x,(y)])] sg 3rd M PAST qr֓a><br />

where the Proper Noun is Khalid sg unspec M hāld ˘<br />

and the Noun is the book sg def M ālktāb<br />

Firstly, the Semantic <strong>to</strong> Syntactic phase determines the ac<strong>to</strong>r and undergoer assignments,<br />

assigns the core arguments and assigns the predicate in the nucleus. In the UniArab sys-<br />

tem we keep all word attributes whether they are used in the target language or not. In<br />

this case, the gender of the noun the book, in <strong>Arabic</strong> is masculine, but in <strong>English</strong> book<br />

has neutral gender. In Phase 10, Syntax Generation, and Phase 11, Generate <strong>English</strong><br />

Morphology, UniArab uses target language rule <strong>to</strong> generate the syntax. The verb log-<br />

ical structure gives UniArab a flag indicating how many arguments this verb takes. In<br />

this case the logical structure will be read[do’(x,[read’(x,(y)])]. Now the<br />

UniArab system replaces x with Khalid, and y with the book. The UniArab system now<br />

holds the following <strong>for</strong> this simple sentence:<br />

read[do’(Khalid,[read’(Khalid,(the book)])].<br />

In the last phase, <strong>English</strong> Sentence Generation, the UniArab system builds the final shape<br />

of a sentence: Khalid read the book. Moreover, there are some special cases, like the<br />

UniArab system adding verb <strong>to</strong> be or changing the verb tense of the source language <strong>to</strong><br />

93


6.4. UNIARAB: SCREEN DESIGN<br />

another tense in the target language. Also, the role of word order in the target language<br />

must be considered.<br />

ālktāb book:N h ˘ āld Khalid:MsgN qr֓a read:V<br />

read [do’(x,[read’(x,(y)])] sg 3rd M PAST qr֓a<br />

The results of the parse can be seen here with LS as :<br />

Verb read [do’(x,[read’(x,(y)])] sg 3rd M PAST qr֓a><br />

where the Proper Noun is Khalid sg unspec M <br />

h ˘ āld<br />

and the Noun is the book sg def M ālktāb<br />

At this point the generation will start; first of all the semantic <strong>to</strong> syntactic phase deter-<br />

mines the ac<strong>to</strong>r and undergoer assignments, assign the core arguments and assign the<br />

predicate in the nucleus. In the last phase, <strong>English</strong> Sentence Generation, the UniArab<br />

system builds the final shape of a sentence: Khalid read the book. Moreover, there are<br />

some special cases, like the UniArab system adding the verb ‘<strong>to</strong> be’ or ensure the verb<br />

tense of the source language is reflected as the appropriate tense in the target language.<br />

Also, the rules of word order in the target languages must be considered.<br />

6.4 UniArab: Screen design<br />

The graphical user interface (GUI) of UniArab is interactive. Designing the visual com-<br />

position and temporal behaviour of the GUI is an important aspect of the design of<br />

UniArab. We use one text area <strong>to</strong> allow a user <strong>to</strong> input source language sentences, two<br />

but<strong>to</strong>ns, Enter <strong>to</strong> submit the text <strong>to</strong> the system, Clear <strong>to</strong> delete all text in the input and<br />

output text areas. There is a separate text area <strong>for</strong> output of the translated text. Also there<br />

is a text area <strong>for</strong> logical structure output of every sentence.<br />

94


Figure 6.11: UniArab’s GUI 1<br />

6.4. UNIARAB: SCREEN DESIGN<br />

If a user needs <strong>to</strong> add a new <strong>Arabic</strong> word in the UniArab system datasource; he/she can<br />

click on the appropriate tab. There are seven different tabs each representing a category of<br />

words in the <strong>Arabic</strong> language; Add <strong>Arabic</strong> Verb, Add <strong>Arabic</strong> Noun, Add <strong>Arabic</strong> Adjective,<br />

Add <strong>Arabic</strong> Proper nouns, Add <strong>Arabic</strong> Demonstratives, Add <strong>Arabic</strong> Adverb and Add<br />

other <strong>Arabic</strong> Word. In every tab there are a number of combo boxes. A combo box is a<br />

combination of a drop-down list or list box, allowing the user <strong>to</strong> choose from the list of<br />

existing options. For example, when a user needs <strong>to</strong> add a new adjective <strong>to</strong> the datasource,<br />

the user will be presented with a text field <strong>to</strong> let him/her enter an <strong>Arabic</strong> adjective. There<br />

is another text field <strong>for</strong> adding the <strong>English</strong> equivalent of the <strong>Arabic</strong> adjective. There are<br />

a number of combo boxes; number, definition and gender, a user chooses from the list an<br />

option. There are two but<strong>to</strong>ns under each tab, Enter <strong>to</strong> submit the in<strong>for</strong>mation in<strong>to</strong> the<br />

95


Figure 6.12: UniArab’s GUI 2<br />

6.4. UNIARAB: SCREEN DESIGN<br />

system, and Clear <strong>to</strong> delete all words in all text fields and return combo boxes <strong>to</strong> their<br />

default state. Figures 6.11,6.12,6.13 shows GUI of the UniArab system.<br />

96


6.4.1 Lexicon interface<br />

Figure 6.13: UniArab’s GUI 3<br />

6.4. UNIARAB: SCREEN DESIGN<br />

In order <strong>to</strong> allow <strong>for</strong> robust user interaction with the lexicon, we use a graphical interface<br />

<strong>to</strong> capture the in<strong>for</strong>mation <strong>for</strong> each part of speech. The user selects the part of speech of<br />

the word he is adding, and is then presented with only the options relevant <strong>to</strong> it. The in-<br />

terface also limits the user’s selections <strong>to</strong> acceptable values and ensures that all attributes<br />

are filled. With this technique, we minimize the risk of human error, and there<strong>for</strong>e the<br />

in<strong>for</strong>mation is more accurate. The graphical interface is quicker and easier when a user<br />

adds a new word in the lexical (XML data source). When the system displays an in<strong>for</strong>-<br />

mation error. Figure 6.14 shows the entry interface that is implemented as part of the<br />

UniArab system.<br />

97


6.5 Technical challenges<br />

Figure 6.14: UniArab’s lexicon interface<br />

6.5. TECHNICAL CHALLENGES<br />

<strong>Arabic</strong> letters in the GUI We can not write <strong>Arabic</strong> letters in UniArab’s GUI. We use<br />

Unicode <strong>to</strong> represent them. Unicode Converter System allows us <strong>to</strong> enter <strong>Arabic</strong><br />

text and click on a but<strong>to</strong>n <strong>to</strong> get the equivalent Unicode of the text.<br />

<strong>Arabic</strong> letters in Eclipse IDE <strong>for</strong> Java We used Eclipse IDE <strong>for</strong> Java development. We<br />

can not write <strong>Arabic</strong> as a string in Eclipse. While Java does support <strong>Arabic</strong>, the<br />

problem lies in the operating system not supporting <strong>Arabic</strong> letter shapes in IDE. We<br />

used Windows XP and Windows 2000 which both have the same problem. To fix<br />

this we changed <strong>to</strong> Ubuntu Linux. Under Linux we can write <strong>Arabic</strong> text as a string<br />

in the Eclipse IDE.<br />

<strong>Arabic</strong> in data source We choose <strong>to</strong> create our data source as XML, <strong>for</strong> optimum sup-<br />

port or different plat<strong>for</strong>ms. It was also easier as we used <strong>Arabic</strong> letters not Unicode<br />

inside the data source. XML fully supports <strong>Arabic</strong>. We created our search engine<br />

using Java. We used a HashMap <strong>to</strong> make the keyword in <strong>Arabic</strong> when we search<br />

inside the datasource. We used verbMap.containsKey(word) in order <strong>to</strong> check<br />

the presence of an <strong>Arabic</strong> word in the data source.<br />

98


6.6 Summary<br />

6.6. SUMMARY<br />

We presented the conceptual structure and architecture of the UniArab system. We dis-<br />

cussed each of the phases from source language analysis, through the logical represen-<br />

tation, then the generation of the target sentences. We detailed the lexical properties of<br />

<strong>Arabic</strong> sentences and the attributes <strong>for</strong> each type of word. We discussed how generation<br />

maps the logic structure <strong>to</strong> the target language. Finally, we discussed the user interface<br />

and some of the problems encountered during development.<br />

99


7<br />

Testing and evaluation<br />

This chapter presents the results of the evaluation. Evaluation of MT software is nec-<br />

essary in order <strong>to</strong> improve system per<strong>for</strong>mance and analyse potential problems and, of<br />

course, its accuracy and effectiveness. In the evaluation session we consider many differ-<br />

ent aspects of the MT system including quality of translation, time <strong>for</strong> translation ability<br />

<strong>to</strong> add a new word in the lexicon of the system and resource utilization.<br />

7.1 Evaluation of MT systems<br />

The evaluation of MT systems is a difficult task. This is not only because many different<br />

metrics are involved, but also because translation is itself difficult (Laoudi et al. 2004).<br />

The first important aspect <strong>for</strong> a potential test is <strong>to</strong> determine the translational capability.<br />

There<strong>for</strong>e, we need <strong>to</strong> draw up a complete overview of the translational process, in all its<br />

different aspects. A good translation has <strong>to</strong> effectively capture the meaning. This involves<br />

establishing the size of the translation task, is it <strong>machine</strong> legible and if so, according<br />

100


7.2. SENTENCE TESTS<br />

<strong>to</strong> which standards? Current general function MT systems can not translate all texts<br />

consistently. Output can have very poor quality. It is important <strong>to</strong> mention that the<br />

‘subsequent editing required’ increases as translation quality gets poorer (Turian et al.<br />

2003).<br />

Given the limited lexicon implemented in this work so far, we evaluate the effectiveness<br />

and accuracy of UniArab by comparison. We create variants of <strong>Arabic</strong> sentences that<br />

represent all possible structures of the sentences that UniArab can translate. We then<br />

compare between human-translated and <strong>machine</strong>-translated versions.<br />

7.2 Sentence tests<br />

We have sentences (<strong>for</strong> actual test examples see Appendix C) in <strong>Arabic</strong> and their equiv-<br />

alent translations in <strong>English</strong>. We have covered a representative broad selection of verbs<br />

across intransitive, transitive and ditransitive constructions in simplex sentences in ac-<br />

tive voice. Complex sentences are beyond the thesis scope. However, we do address<br />

copula-like nominative clauses in <strong>Arabic</strong>. We tested UniArab in more than one way. We<br />

tested single sentences and multiple sentences. UniArab easily deals with more than one<br />

sentence as input and its output matches. We entered random sentences <strong>to</strong>gether in one<br />

input or as individual sentences.<br />

101


7.2.1 Verb-Subject with one argument in different tenses<br />

Table 7.1: Test : Verb-Subject; one argument<br />

<strong>Arabic</strong> yˇsrb ֒mr āllbn<br />

Human Omar is drinking the milk.<br />

Google Omar drink milk<br />

Microsoft drink milk Omar<br />

UniArab Omar is drinking the milk .<br />

7.2. SENTENCE TESTS<br />

In Table 7.1, the output of the Google transla<strong>to</strong>r (Google 2009) is faulty in tense and verb<br />

‘<strong>to</strong> be’. Microsoft’s MT (Microsoft 2009) failed <strong>to</strong> translate most of the sentence in tense,<br />

verb and word order. UniArab successfully translates the sentence in its entirety. Figure<br />

7.1 shows this sentence output in the UniArab system.<br />

Figure 7.1: Verb-Subject with one argument<br />

102


Table 7.2: Test : Verb-subject; agreement 1<br />

<strong>Arabic</strong> ˇsrb ֒mr āllbn<br />

Human Omar drank the milk<br />

Google Omar drinking milk<br />

Microsoft drinking milk Omar<br />

UniArab Omar drank the milk.<br />

7.2. SENTENCE TESTS<br />

In Table 7.2, the output of the Google transla<strong>to</strong>r is faulty in tense and definition. The<br />

Microsoft transla<strong>to</strong>r failed <strong>to</strong> translate most of the sentence in tense, definition and word<br />

order. UniArab successfully translates the sentence in its entirety. Figure 7.2 shows this<br />

sentence output in the UniArab system.<br />

Figure 7.2: Verb-Subject with one argument<br />

103


Table 7.3: Test : verb-subject; agreement 2<br />

<strong>Arabic</strong> <br />

h˘ āld qr֓a ālktāb<br />

human-translated Khalid read the book<br />

Google Khalid read the book<br />

Microsoft Khaled read book<br />

UniArab<br />

<strong>Arabic</strong><br />

Khalid read the book.<br />

<br />

human-translated Khalid will drink the milk<br />

Google Khalid drink milk.<br />

Microsoft Khaled drink milk.<br />

UniArab Khalid will drink the milk.<br />

syˇsrb h ˘ āld āllbn<br />

7.2. SENTENCE TESTS<br />

In Table 7.3, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT failed <strong>to</strong><br />

translate the definition. UniArab successfully translates the sentence in its entirety. In<br />

the output of the second sentence, the Google transla<strong>to</strong>r is faulty in tense and definition.<br />

Microsoft’s MT failed <strong>to</strong> translate the tense and definition. UniArab successfully trans-<br />

lates the sentence in its entirety. This is becouse of the RRG lexicalist approach in the<br />

interlingua. Figures 7.3 and 7.4 show this sentence output in the UniArab system.<br />

Figure 7.3: Verb-subject agreement 1<br />

104


Figure 7.4: Verb-subject agreement 2<br />

7.2. SENTENCE TESTS<br />

105


7.2.2 Gender-ambiguous proper nouns<br />

Table 7.4: Test : Gender-ambiguous proper nouns 1<br />

<strong>Arabic</strong> <br />

qr֓a ˇgāk ālktāb<br />

human-translated Jack read the book<br />

Google Jack read the book<br />

Microsoft read Jack book<br />

UniArab Jack read the book.<br />

7.2. SENTENCE TESTS<br />

In Table 7.4, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT failed<br />

<strong>to</strong> translate the definition. UniArab successfully translates the sentence in its entirety.<br />

Figure 7.5 shows this sentence output in the UniArab system.<br />

Figure 7.5: Gender-ambiguous proper nouns 1<br />

106


Table 7.5: Test : gender-ambiguous proper nouns 2<br />

<strong>Arabic</strong> qr֓at māry ālktāb<br />

human-translated Mary read the book<br />

Google Marie read the book<br />

Microsoft read Marrie book<br />

UniArab Mary read the book.<br />

7.2. SENTENCE TESTS<br />

In Table 7.5, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT failed <strong>to</strong><br />

translate the definition and word order. UniArab successfully translates the sentence in<br />

its entirety. Figure 7.6 shows this sentence output in the UniArab system.<br />

Figure 7.6: Gender-ambiguous proper nouns 2<br />

107


7.2.3 Verb ‘<strong>to</strong> be’<br />

Table 7.6: Test : Verb ‘<strong>to</strong> be’ 1<br />

<strong>Arabic</strong> hw mhnds<br />

human-translated He is an engineer.<br />

Google Is the architect of<br />

Microsoft is the engineer<br />

UniArab He is an engineer.<br />

7.2. SENTENCE TESTS<br />

In Table 7.6, the output of the Google transla<strong>to</strong>r is faulty. Microsoft’s MT successfully<br />

translated the person. UniArab successfully translates the sentence in its entirety. Figure<br />

7.7 shows this sentence output in the UniArab system.<br />

Figure 7.7: Verb ‘<strong>to</strong> be’ 1<br />

108


Table 7.7: Test : Verb ‘<strong>to</strong> be’ 2<br />

<strong>Arabic</strong> ֓anā ālmhnds<br />

human-translated I am the engineer.<br />

Google I Engineer<br />

Microsoft i am engineer<br />

UniArab I am the engineer.<br />

7.2. SENTENCE TESTS<br />

In Table 7.6, the output of the Google transla<strong>to</strong>r is faulty in the verb ‘<strong>to</strong> be’ and defini-<br />

tion. Microsoft’s MT successfully translated the verb ‘<strong>to</strong> be’, it is faulty in the definition<br />

only. UniArab successfully translates the sentence in its entirety. Figure 7.8 shows this<br />

sentence output in the UniArab system.<br />

Figure 7.8: Verb ‘<strong>to</strong> be’ 2<br />

109


7.2.4 Verb ‘<strong>to</strong> have’<br />

<strong>Arabic</strong><br />

Table 7.8: Test : Verb ‘<strong>to</strong> have’ 1<br />

<br />

human-translated I have made a reservation.<br />

Google I have made a reservation<br />

Microsoft You have a booking<br />

UniArab I have made a reservation.<br />

lqd qmt bālh. ˇgz<br />

7.2. SENTENCE TESTS<br />

In Table 7.8, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT is faulty<br />

in person. UniArab successfully translates the sentence in its entirety. Figure 7.9 shows<br />

this sentence output in the UniArab system.<br />

Figure 7.9: Verb ‘<strong>to</strong> have’ 1<br />

110


Table 7.9: Test : Verb ‘<strong>to</strong> have’ 2<br />

<strong>Arabic</strong> lqd fqdt td ¯ krty<br />

human-translated I have lost my ticket.<br />

Google I’ve lost my ticket<br />

Microsoft i have lost td¯ krty<br />

UniArab I have lost my ticket.<br />

7.2. SENTENCE TESTS<br />

In Table 7.9, the output of the Google transla<strong>to</strong>r is successful. Microsoft’s MT is faulty<br />

in the object word. UniArab successfully translates the sentence in its entirety. Figure<br />

7.10 shows this sentence output in the UniArab system.<br />

Figure 7.10: Verb ‘<strong>to</strong> have’ 2<br />

111


7.2.5 Free word order<br />

7.2. SENTENCE TESTS<br />

Here we show three <strong>Arabic</strong> sentences with different word order which translate <strong>to</strong> the<br />

same <strong>English</strong> output.<br />

Table 7.10: Test : Free word order (Verb Noun Noun scenario one)<br />

<strong>Arabic</strong> yh. b qys lylā<br />

human-translated Qays loves Laila<br />

Google Qais likes of Laila<br />

Microsoft Love Qais laili<br />

UniArab Qays loves Laila<br />

In Table 7.10, the output of the Google transla<strong>to</strong>r is faulty in the verb meaning and the<br />

system added ‘of’ without any meaning in this sentence. Microsoft’s MT translated each<br />

word while ignoring the word order and meaning of the sentence. UniArab success-<br />

fully translates the sentence in its entirety. Figure 7.11 shows this sentence output in the<br />

UniArab system.<br />

Figure 7.11: Free word order (Verb Noun Noun scenario one)<br />

112


7.2. SENTENCE TESTS<br />

Table 7.11: Test : Free word order (Verb Noun Noun scenario two)<br />

<strong>Arabic</strong> yh. b lylā qys<br />

human-translated Qays loves Laila<br />

Google Leila loves measured<br />

Microsoft Love laili Qais<br />

UniArab Qays loves Laila<br />

In Table 7.11, the second ordering possibility is shown. The output of the Google transla-<br />

<strong>to</strong>r is faulty in the ac<strong>to</strong>r, the system can not analyse ‘who does what’, the ac<strong>to</strong>r is Qais but<br />

the output makes the object the subject. Microsoft’s transla<strong>to</strong>r translates each word while<br />

ignores the word order and the meaning of the sentence. It also makes the object the<br />

subject. UniArab successfully translates the sentence in its entirety. Figure 7.12 shows<br />

this sentence output in the UniArab system.<br />

Figure 7.12: Free word order (Verb Noun Noun scenario two)<br />

113


7.2. SENTENCE TESTS<br />

Table 7.12: Test : Free word order (Verb Noun Noun scenario three)<br />

<strong>Arabic</strong> qys yh. b lylā<br />

human-translated Qays loves Laila<br />

Google Qais likes of Laila<br />

Microsoft Qais love laili<br />

UniArab Qays loves Laila<br />

Table 7.12 shows the third possible sentence order. The output of the Google transla<strong>to</strong>r<br />

is faulty in verb meaning and adds an extra ‘of’ which does not carry any meaning.<br />

Microsoft’s MT translates each word while ignoring the word order, tense and meaning<br />

of the sentence. UniArab successfully translates the sentence in its entirety. Figure 7.13<br />

shows this sentence output in the UniArab system.<br />

Figure 7.13: Free word order (Verb Noun Noun scenario three)<br />

114


7.2.6 Pro-drop<br />

<strong>Arabic</strong><br />

Table 7.13: Test: Pro-drop<br />

fāttny ālt.ā֓yrh<br />

human-translated I missed the plane.<br />

Google Missed the plane<br />

Microsoft <br />

fāttny plane<br />

UniArab I missed the plane.<br />

7.2. SENTENCE TESTS<br />

Table 7.13 shows an example of a pro–drop sentence. The output of the Google trans-<br />

la<strong>to</strong>r is faulty in the point of pro-drop; the system did not find the subject. Microsoft’s<br />

MT did not recognize the important word in the sentence and passed it through <strong>to</strong> the<br />

output. UniArab successfully translates the sentence in its entirety. Figure 7.14 shows<br />

this sentence output in the UniArab system.<br />

Figure 7.14: Pro-drop<br />

115


7.2.7 Transitivity of verbs<br />

7.2. SENTENCE TESTS<br />

In <strong>Arabic</strong> and <strong>English</strong>, we can classify verbs as both intransitive, transitive and ditransi-<br />

tive.<br />

7.2.7.1 Intransitive<br />

Table 7.14: Test : Intransitive 1<br />

<strong>Arabic</strong> s. hyb yqr֓a<br />

human-translated Suhaib reads.<br />

Google Suhaib read<br />

Microsoft suhaib reads<br />

UniArab Suhaib reads.<br />

Table 7.14 shows an example of an intransitive sentence. The output of the Google trans-<br />

la<strong>to</strong>r is faulty in tense. Microsoft’s and UniArab transla<strong>to</strong>rs successfully. Both systems<br />

are translate the sentence in its entirety. Figure 7.15 shows this sentence output in the<br />

UniArab system.<br />

Figure 7.15: Intransitive<br />

116


Table 7.15: Test : Intransitive 2<br />

<strong>Arabic</strong> s. hyb yqr֓a kt ¯ yrā<br />

human-translated Suhaib reads a lot.<br />

Google Souhaib read a lot<br />

Microsoft suhaib reads much<br />

UniArab Suhaib reads a lot.<br />

7.2. SENTENCE TESTS<br />

Table 7.15 shows an example of an intransitive with an adverb. The output of the Google<br />

transla<strong>to</strong>r has given the wrong tense. Microsoft’s and UniArab transla<strong>to</strong>rs are both suc-<br />

cessful. Both systems are translate the sentence in its entirety, though the Microsoft<br />

output is more <strong>for</strong>mal. Figure 7.16 shows this sentence output in the UniArab system.<br />

Figure 7.16: Intransitive with an adverb<br />

117


7.2.7.2 Transitive<br />

Table 7.16: Test : Transitive<br />

<strong>Arabic</strong> ys. lh. mārk ālh. āswb<br />

human-translated Mark is fixing the computer.<br />

Google Mark works computer<br />

Microsoft Marc computer works<br />

UniArab Mark is fixing the computer.<br />

7.2. SENTENCE TESTS<br />

In Table 7.16, the output of the Google and Microsoft’s transla<strong>to</strong>rs are faulty in the verb<br />

‘<strong>to</strong> be’ and the meaning of the verb. UniArab successfully translates the sentence in its<br />

entirety. Figure 7.17 shows this sentence output in the UniArab system.<br />

Figure 7.17: Transitive<br />

118


7.2.7.3 Ditransitive<br />

Table 7.17: Test : Ditransitive 1<br />

<strong>Arabic</strong> <br />

hw ֓a֒t.ā h ˘ āld ktāb<br />

human-translated He gave Khalid a book.<br />

Google Khaled was given a book<br />

Microsoft is given Khaled book<br />

UniArab He gave Khalid a book.<br />

7.2. SENTENCE TESTS<br />

Table 7.17 shows an example of a ditransitive. The output of the Google has been given<br />

in the passive tense. Microsoft’s transla<strong>to</strong>r gives an incorrect output. UniArab success-<br />

fully translates the sentence in its entirety. Figure 7.18 shows this sentence output in the<br />

UniArab system.<br />

Figure 7.18: Ditransitive 1<br />

119


Table 7.18: Test : Ditransitive with 2 NP<br />

<strong>Arabic</strong><br />

human-translated Mark is showing Adam the letter.<br />

Google Mark Adam see the letter.<br />

Microsoft Marc finds Adam message.<br />

UniArab Mark is showing Adam the letter.<br />

7.2. SENTENCE TESTS<br />

mārk yry ֓ādm ālrsālh<br />

Table 7.18 shows an example of another ditransitive. The output of the Google transla<strong>to</strong>r<br />

is faulty in determining the ac<strong>to</strong>r, the system can not analyse who does what. Microsoft’s<br />

transla<strong>to</strong>r is faulty in the meaning of the verb and in sentence meaning. UniArab suc-<br />

cessfully translates the sentence in its entirety. Figure 7.19 shows this sentence output in<br />

the UniArab system.<br />

Figure 7.19: Ditransitive with 2 NP<br />

120


Table 7.19: Test : Ditransitive with preposition<br />

<strong>Arabic</strong> <br />

֒mr ֓a֒t.ā lh ˘ āld ktāb<br />

human-translated Omar gave a book <strong>to</strong> Khalid.<br />

Google Omar Khaled gave a book<br />

Microsoft Omar gave Khalid book<br />

UniArab Omar gave a book <strong>to</strong> Khalid.<br />

7.2. SENTENCE TESTS<br />

Table 7.19 shows an example of ditransitive. The output of the Google transla<strong>to</strong>r is<br />

faulty in determining the ac<strong>to</strong>r, the system can not analyse who does what. Microsoft’s<br />

transla<strong>to</strong>r is faulty loosing the definite article and sentence meaning. UniArab success-<br />

fully translates the sentence in its entirety. Figure 7.20 shows this sentence output in the<br />

UniArab system.<br />

Figure 7.20: Ditransitive with preposition<br />

121


7.2.8 Limitation of UniArab<br />

Table 7.20: Test : Limitation of UniArab<br />

<strong>Arabic</strong> snsāfr ˙gdā ֓ilā ms. r<br />

human-translated We will travel <strong>to</strong> Egypt <strong>to</strong>morrow<br />

Google Tomorrow he travels <strong>to</strong> Egypt<br />

Microsoft snsāfr on <strong>to</strong> Egypt<br />

UniArab<br />

7.2. SENTENCE TESTS<br />

Table 7.20 shows another example of a pro–drop sentence. The output of the Google<br />

transla<strong>to</strong>r is faulty on the point of pro-drop; the system did not find the subject. We found<br />

that Microsoft’s MT did not recognize the important word in the sentence and passed it<br />

through <strong>to</strong> the output. UniArab fails <strong>to</strong> give a translation, because this structure does not<br />

exist in the system. Since RRG is built upon the logical structure, when an unknown<br />

structure is encountered, it cannot produce an output, even if some of the words are in<br />

the lexicon. Figure 7.21 shows this sentence output in the UniArab system.<br />

Figure 7.21: Limitation of UniArab 1<br />

122


7.2. SENTENCE TESTS<br />

In a case where a word is not available in the lexicon, but the logical structure is recog-<br />

nised, UniArab will output a correctly structured translation, but with the unknown Ara-<br />

bic word in its position within the <strong>English</strong> sentence. This makes the system resilient<br />

<strong>to</strong> slight misspellings which can be recognised and corrected. Figure 7.22 shows this<br />

sentence output in the UniArab system.<br />

Figure 7.22: Limitation of UniArab 2<br />

123


7.2. SENTENCE TESTS<br />

Table 7.21: Test : Limitation of UniArab 3 using non existing nonsense word<br />

<strong>Arabic</strong> h˘ āld ksr d. s. tqf˙g ¯<br />

human-translated Khaled broke ( d. s. tqf˙g this word is not an <strong>Arabic</strong> word).<br />

¯<br />

Google Khalid break Dsthagafg<br />

Microsoft Khaled d. s. tqf˙g break<br />

¯<br />

UniArab Khaled broke <br />

d. s. t ¯ qf˙g<br />

In Table 7.21, we show how the system responds <strong>to</strong> an unknown word. We have put in a<br />

non-word in the <strong>Arabic</strong> sentence. The output of the Google and Microsoft’s transla<strong>to</strong>rs<br />

are faulty in the verb. Microsoft’s transla<strong>to</strong>r put the unknown word in the wrong position.<br />

Google transliterates the word and puts it in the correct position. UniArab successfully<br />

translates the verb and puts the unknown word in the correct position. Figure 7.23 shows<br />

this sentence output in the UniArab system.<br />

Figure 7.23: Limitation of UniArab 3<br />

124


7.3 System evaluation<br />

7.3. SYSTEM EVALUATION<br />

UniArab supports simple sentences with one or two arguments or compounds. We have<br />

a number of sentences that were created <strong>to</strong> aid the grammar in terms of coverage of basic<br />

<strong>Arabic</strong> sentence structures. Further research should be conducted <strong>to</strong> incorporate more<br />

stages in<strong>to</strong> UniArab.<br />

UniArab is based on RRG, and the logical structure of a sentence is the key piece of<br />

in<strong>for</strong>mation <strong>for</strong> producing a translation. The system is programmed <strong>to</strong> be capable of<br />

dealing with specific structures. Once a structure is enabled within the system, the only<br />

limit on translating sentences of that structure is the coverage of the vocabulary. Hence, if<br />

a specific sentence structure exists with in UniArab, any sentence of that structure can be<br />

translated. This is a strength of being RRG-based, since the structure and vocabulary are<br />

dealt with independently, and vocabulary is more straight<strong>for</strong>ward <strong>to</strong> improve. The struc-<br />

ture is also independent of issues of gender and tense, which are only considered once<br />

a structure has been assigned, <strong>to</strong> determine who does what. As we develop UniArab,<br />

adding further structures increases the coverage by a considerable amount. However, as<br />

the number of structures increases, word ambiguity will become a bigger issue.<br />

UniArab uses an XML-<strong>for</strong>matted data-source as its lexicon. The key strength is that<br />

this data source is open, and can be used under any operating system, and accessed us-<br />

ing different <strong>to</strong>ols and languages. The search engine we use <strong>to</strong> access the data source is<br />

able <strong>to</strong> deal with <strong>Arabic</strong> words which translate in<strong>to</strong> multiple-word <strong>English</strong> phrases. For<br />

example, ֓aryd in <strong>Arabic</strong> translates <strong>to</strong> I want. However, in its current state, we<br />

cannot find single entries that consist of multiple <strong>Arabic</strong> words. For example <br />

<br />

ˇsbāk ālh. ˇgz in <strong>Arabic</strong> translates <strong>to</strong> counter in <strong>English</strong>; the system cannot deal with this<br />

125


7.3. SYSTEM EVALUATION<br />

yet. Another example is ֒mr yl֒b ywm ālsbt which means Omar<br />

plays on Saturday. The structure of this sentence works in the system, however, since the<br />

two words ywm ālsbt translate <strong>to</strong> a single <strong>English</strong> word, Saturday, and the<br />

search engine cannot deal with this now. This also affects idioms and bigrams. We can<br />

overcome this issue by modifying the database and search algorithm. For each n-gram<br />

phrase, we index it by the first word. When we search the database <strong>for</strong> this <strong>to</strong>ken, we get<br />

the single word translation and any n-grams starting with the word. Then we can check<br />

the sentence <strong>to</strong> see if these n-grams are matched.<br />

Another limitation exists because we are not yet dealing with ambiguity. A word like<br />

֒lm can have many different meanings in <strong>Arabic</strong>, <strong>for</strong> example, it can mean flag,<br />

taught, knowledge or discovered. At the moment, the <strong>Arabic</strong> word might exist <strong>for</strong> differ-<br />

ent parts of speech. The search extracts all of them, but we only use the first one returned.<br />

Once we deal with ambiguity, we will have <strong>to</strong> analyse the different results, looking at the<br />

sentence structures <strong>to</strong> decide which translation <strong>to</strong> use.<br />

Since our system is based on RRG, the logical structure of a sentence is the basis of<br />

the translation. This was very useful, since it allows the system <strong>to</strong> deal with issues that<br />

can be complicated, like free-word-order, and determining the ac<strong>to</strong>r and undergoer. The<br />

lexicon used in UniArab can be refined further, and we would like <strong>to</strong> do this in further<br />

research. At the moment, the lexicon contains entries <strong>for</strong> single <strong>Arabic</strong> words, which can<br />

in some cases translate <strong>to</strong> clauses in <strong>English</strong>. For example, qlmy translates <strong>to</strong> my<br />

pen. The y at the end of the <strong>Arabic</strong> word is the possessive my in <strong>English</strong>. Similarly,<br />

bālqlm translates <strong>to</strong> with the pen, the <br />

b translates <strong>to</strong> with (or using), the āl<br />

is the definite article the. Finally, bqlmy translates <strong>to</strong> with my pen. In the future,<br />

it makes sense <strong>to</strong> simplify the lexicon by including only the basic noun, and allowing the<br />

126


7.3. SYSTEM EVALUATION<br />

search engine <strong>to</strong> extract these extra modifiers. Hence, we need only a single entry <strong>for</strong> the<br />

basic noun, rather than an entry <strong>for</strong> each possible occurrence. This reduces the size of<br />

the lexicon, and hence the speed of the search routine.<br />

At the moment, the lexicon is categorised in<strong>to</strong> seven parts of speech. We have designed<br />

the GUI so that when adding a specific word <strong>to</strong> the lexicon, only the related options are<br />

presented <strong>to</strong> the user <strong>for</strong> that part of speech. This minimises errors when entering data.<br />

As our research extends, we may need <strong>to</strong> modify the categorisation of the lexicon <strong>to</strong> allow<br />

<strong>for</strong> more complicated word types.<br />

UniArab does not process ambiguous words or complex sentences, so far, in this research.<br />

This research focussed first on discovering whether the logical structure of a sentence,<br />

based on RRG can be used <strong>for</strong> translation. Hence, we decided <strong>to</strong> limit the scope of the<br />

project, since this is work in a new area, that has not been investigated be<strong>for</strong>e. We fully<br />

expect <strong>to</strong> expand the system <strong>to</strong> allow it <strong>to</strong> cope with ambiguity in the future. The system’s<br />

reliability depends on the data source and fails <strong>to</strong> handle unknown words. UniArab does<br />

not process single words, even if those words are in its lexicon, because UniArab is built<br />

on the logical structure of verbs.<br />

In our comparison with other translation systems we have used simplex sentences. While<br />

UniArab is limited <strong>to</strong> simplex sentences and has limited coverage, we believe it is essen-<br />

tial <strong>to</strong> reach high quality translation of these sentences first, in order <strong>to</strong> be able <strong>to</strong> expand<br />

<strong>to</strong> high quality translations of more complex sentences. We can see that the existing <strong>to</strong>ols<br />

cannot even achieve reasonable translations of simplex sentences, so how can we expect<br />

them <strong>to</strong> give high quality translations of larger text? We have found that small errors in<br />

the initial analysis of a sentence can cause huge errors in the final translation, so high<br />

quality analysis is very important.<br />

127


7.4 Summary<br />

7.4. SUMMARY<br />

In this chapter we subjected the UniArab System <strong>to</strong> a series of tests in a wide range of<br />

sentence categories. For each test we compared the results obtained through UniArab<br />

<strong>to</strong> those obtained when using translation engines from Google and Microsoft. We also<br />

presented a human-translated equivalent <strong>to</strong> each. In contrast, the Google and Microsoft<br />

transla<strong>to</strong>rs gave mixed results. In many cases, sentence meaning was lacking, and even<br />

some basic constructs could not be translated. Perhaps this is due <strong>to</strong> their focus on<br />

translating long sentences and paragraphs. We highlighted this by comparing them <strong>to</strong><br />

UniArab <strong>for</strong> longer compound sentences and found that they did indeed convey more<br />

of the meaning. These results suggest that RRG is a promising candidate <strong>for</strong> <strong>Arabic</strong> <strong>to</strong><br />

<strong>English</strong> Machine translation, and as the grammar is developed, the system should begin<br />

<strong>to</strong> cope with more complicated sentences. For simplex sentences (intransitive, transitive<br />

and ditransitive) it clearly outper<strong>for</strong>ms existing systems.<br />

128


Now is not the end.<br />

It is not the beginning of the end.<br />

It is perhaps, the end of the beginning.<br />

Wins<strong>to</strong>n Churchill<br />

8<br />

Conclusion<br />

In this thesis we have presented an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> translation system called<br />

UniArab, which is based on the Role and Reference Grammar model. We detailed the<br />

design of the system and how it was built <strong>to</strong> accommodate specifics of the <strong>Arabic</strong> lan-<br />

guage and the generation of <strong>English</strong> translations.<br />

We started with the goal of designing a <strong>machine</strong> translation system that could show<br />

whether we can extract logical structure from <strong>Arabic</strong> sentences using RRG, and use this<br />

<strong>to</strong> produce high quality translations in<strong>to</strong> <strong>English</strong>. We believe the results shown in Chap-<br />

ter 7 show that our system has proved this, and that our method is more robust <strong>for</strong> these<br />

cases, than other MT systems. There are still a number of areas which need <strong>to</strong> be de-<br />

veloped <strong>for</strong> UniArab <strong>to</strong> achieve more coverage, and we believe that we can build on the<br />

work we have done so far.<br />

Since the logical structure is separate from the vocabulary, when we focus on giving<br />

129


the system capability <strong>to</strong> deal with a large number of structure variations, it becomes sig-<br />

nificantly more powerful since each structure represents all possible sentences of that<br />

structure regardless of the specific words included. This is a significant point, since vo-<br />

cabulary is easy <strong>to</strong> develop, but structure requires much more ef<strong>for</strong>t.<br />

The major challenge we faced was <strong>to</strong> use RRG within a <strong>machine</strong> translation system.<br />

UniArab is the first MT system that uses RRG.In the <strong>Arabic</strong> linguistic tradition there is<br />

not a clear-cut, well-defined analysis of the inven<strong>to</strong>ry of parts of speech in <strong>Arabic</strong>. We<br />

found that the existent classifications were not suitable, and so we had <strong>to</strong> create a clas-<br />

sification that made sense <strong>for</strong> RRG-based translation. We were able <strong>to</strong> extract logical<br />

structures that made sense from natural <strong>Arabic</strong>. And we were also able <strong>to</strong> generate En-<br />

glish translations from this logical structure.<br />

Some specific challenges included dealing with the absence of the copula verb, ‘<strong>to</strong> be’,<br />

in <strong>Arabic</strong>. To solve this, we had <strong>to</strong> look at some <strong>Arabic</strong> sentence which do not contain<br />

verbs, and correctly deduce how <strong>to</strong> extract the copula. Free word order was another chal-<br />

lenge due <strong>to</strong> its widespread presence in <strong>Arabic</strong>, and this was solved by detailed analysis<br />

of the source language and incorporating this in the logical structure.<br />

We have discoverd that RRG is a realistic basis <strong>for</strong> <strong>machine</strong> translation systems. The<br />

use of a sentence’s logical structure <strong>to</strong> create translations is robust and gives high quality<br />

translations which can deal with some of the challenges of languages like <strong>Arabic</strong>.<br />

Our work has contributed the first <strong>machine</strong> translation system based on RRG, which<br />

we have used <strong>to</strong> prove its effectiveness <strong>for</strong> MT. This was a major challenge as we had<br />

little work <strong>to</strong> refer <strong>to</strong>. We have also advanced work on <strong>Arabic</strong> language classification,<br />

130


8.1. THESIS SUMMARY<br />

and so we hope our work will be the beginning of more work in this arena. We believe<br />

this serves as an excellent foundation <strong>for</strong> further research in the area.<br />

While statistical <strong>machine</strong> translation has been promoted by many, we believe that lan-<br />

gauges, expecially rich languages like <strong>Arabic</strong>, are very organised and structural, and<br />

such approaches cannot correctly deal with the wide variety of sentence structures. When<br />

these systems cannot deal with simplex sentences, as we have shown in Chapter 7, how<br />

do we expect them <strong>to</strong> correctly translate whole paragraphs? In our approach, we expect<br />

that high quality translations of simplex sentences are the only basis which can build <strong>to</strong><br />

good translations of whole paragraphs. The results we have presented are the first step<br />

in applying RRG <strong>to</strong> sophisticated translation. By focussing in this initial stage on the<br />

basics, we build a more solid foundation <strong>for</strong> the next stage.<br />

8.1 Thesis summary<br />

In Chapter 2, we presented a summary overview of the grammatical structure of the Ara-<br />

bic language. We detailed various sentence structures as well as unique word attributes<br />

like gender rules applied <strong>to</strong> all words and duality in number. We discussed how some of<br />

these properties could be used <strong>to</strong> extract in<strong>for</strong>mation about sentence structure.<br />

In Chapter 3, we presented the Role and Reference Grammar model, and showed how it<br />

could be used <strong>to</strong> deduce the logical structure of sentences and produce a lexical represen-<br />

tation which could be used as the interlingua.<br />

In Chapter 4, we presented various approaches <strong>to</strong> <strong>machine</strong> translation. We compared<br />

direct translation, transfer systems and interlingua systems and showed how interlingua<br />

systems require significantly more ef<strong>for</strong>t in the analysis and generation stages, but have<br />

131


8.2. SUMMARY OF THESIS CONTRIBUTIONS<br />

a distinct advantage in the simplicity of the translation process. Furthermore, they are<br />

more flexible in terms of adding extra languages. We also talked about the challenges of<br />

<strong>machine</strong> translation, with a specific focus on those specific <strong>to</strong> the <strong>Arabic</strong> language.<br />

In Chapter 5, we presented a high-level view of the system <strong>framework</strong> and defined our<br />

evaluation criteria <strong>for</strong> measuring system per<strong>for</strong>mance.<br />

In Chapter 6, we detailed the technical aspects of UniArab, covering all the phases in-<br />

volved in the <strong>machine</strong> translation process. We described the lexical system that underlies<br />

UniArab, detailing the attribute in<strong>for</strong>mation held <strong>for</strong> each type of word. We discussed the<br />

generation phase and how the system maps the logical structure <strong>to</strong> a target <strong>English</strong> sen-<br />

tence. We then briefly discussed the user interface, and some of the technical challenges<br />

encountered during the implementation.<br />

In Chapter 7, we presented the results of our evaluation of UniArab <strong>for</strong> a wide vari-<br />

ety of sentence types. We compared its results <strong>to</strong> those of the Google and Microsoft<br />

transla<strong>to</strong>rs as well as human translation. We found that it significantly outper<strong>for</strong>ms the<br />

other au<strong>to</strong>mated translation systems, matching human translation. We discussed its limits<br />

in regards <strong>to</strong> complex sentence structures.<br />

8.2 Summary of thesis contributions<br />

This thesis contributions are summarised as follows:<br />

• A detailed presentation of the structure of <strong>Arabic</strong> sentences and a discussion of the<br />

language’s unique features.<br />

• A detailed system <strong>framework</strong> <strong>for</strong> implementing RRG <strong>machine</strong> translation <strong>for</strong> <strong>Arabic</strong><br />

132


and proving the suitability of the model.<br />

8.3. FUTURE WORK<br />

• A detailed technical implementation of an <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong> transla<strong>to</strong>r<br />

based on the RRG model, including user interface and a cus<strong>to</strong>m designed, extensible<br />

data source.<br />

• An evaluation of the translation system and comparison <strong>to</strong> existing commercial sys-<br />

tems.<br />

• Specifying verb ‘<strong>to</strong> be’, free word order, pro-drop and transitivity of verbs.<br />

8.3 Future work<br />

Given the scope of this Masters research project, there are a number of areas where this<br />

work could be extended. Firstly, the question of ambiguity is very interesting. We feel<br />

that RRG is suited <strong>to</strong> overcoming word ambiguity by using sentence structure, and would<br />

like <strong>to</strong> explore this. We would also like <strong>to</strong> incorporate more compound structures allow-<br />

ing UniArab <strong>to</strong> deal with more complex sentences. We would also like <strong>to</strong> explore the au<strong>to</strong><br />

generation of lexicon in<strong>for</strong>mation from <strong>Arabic</strong> source verbs as a way <strong>to</strong> quickly populate<br />

the lexical source.<br />

The main <strong>to</strong>pic of investigation is the development of a <strong>framework</strong> <strong>for</strong> translating <strong>Arabic</strong><br />

<strong>to</strong> <strong>English</strong> based on RRG. The <strong>framework</strong> is designed <strong>to</strong> demonstrate the capabilities of<br />

RRG as a base <strong>for</strong> <strong>machine</strong> translation of <strong>Arabic</strong> in<strong>to</strong> <strong>English</strong> using an interlingua bridge<br />

strategy. This thesis showed that RRG facilitates the translation process from a specific<br />

language <strong>to</strong> other languages. Future research should focus on:<br />

(1) Enhancing and extending the UniArab system <strong>to</strong> support more natural <strong>Arabic</strong> sen-<br />

tences, and word ambiguity, in particular:<br />

133


8.3. FUTURE WORK<br />

• To understand, process, and translate complex predicates and multi-clause sen-<br />

tences in coordinate, subordinate and cosubordination structures.<br />

• To understand, process and translate voice and valence increasing/decreasing<br />

operations in the <strong>machine</strong> translation of <strong>Arabic</strong>.<br />

• To design a lexicon architecture <strong>to</strong> support the morphological templates <strong>for</strong><br />

<strong>Arabic</strong> words in<strong>to</strong> their respective consonantal and vowel components with the<br />

appropriate word <strong>for</strong>mation rules implemented in software.<br />

• To extend the underlying theory of RRG <strong>to</strong> encompass more fully the lexicon,<br />

syntax and morphology of <strong>Arabic</strong>.<br />

(2) Evaluating UniArab with respect <strong>to</strong> other systems based on non-RRG methods.<br />

134


References<br />

Abn-Aqeal, 2007. sharah Abn-Aqeal ala’lfat Abn-Malek. Dar Al’alem llmlaien.<br />

Al-Sughaiyer, I. A. and Al-Kharashi, I. A., 2004. <strong>Arabic</strong> morphological analysis tech-<br />

niques: A comprehensive survey. JASIST, 55(3):189–213.<br />

Alosh, M., 2005. Using <strong>Arabic</strong>: A Guide <strong>to</strong> Contemporary Usage. Cambridge University<br />

Press.<br />

Arciniegas, F., 2000. XML Developer’s Guide. McGraw-Hill.<br />

Attia, M., 2004. Report on the introduction of <strong>Arabic</strong> <strong>to</strong> ParGram. In Proceedings of<br />

ParGram Fall Meeting, Dublin, Ireland.<br />

Attia, M. A., 2008. Handling <strong>Arabic</strong> Morphological and Syntactic Ambiguity within<br />

the LFG Framework with a View <strong>to</strong> Machine Translation. PhD thesis, University of<br />

Manchester.<br />

Bateson, M. C., 2003. <strong>Arabic</strong> Language Handbook. George<strong>to</strong>wn University Press.<br />

Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., and Yergeau, F., 2008. Extensi-<br />

ble Markup Language (XML) 1.0 (Fifth Edition).<br />

Brown, P., Pietra, S. D., Pietra, V. D., and Mer-cer., R., 1993. The mathematics of statisti-<br />

cal <strong>machine</strong> translation: Parameter estimation. In Computational Linguistics, volume<br />

19 i2, pages 263–311.<br />

Google, 2009. Google transla<strong>to</strong>r. http://translate.google.com.<br />

Holes, C., 2004. Modern <strong>Arabic</strong>: Structures, Functions, and Varieties. George<strong>to</strong>wn<br />

University Press.<br />

Hutchins, J., 2003. Machine Translation: General Overview. in : book, Ox<strong>for</strong>d Univer-<br />

sity Press.<br />

135


REFERENCES<br />

Hutchins, J. and Somers, H. L., 1992. An introduction <strong>to</strong> <strong>machine</strong> translation. Academic<br />

Press.<br />

ibn Abd Allah Ibn Malik, M., 1984. Alfiyat Ibn Malik. Maktabat al-Adab.<br />

Intelligence, A. B., 2004. Abi research. http://www.abiresearch.com/abiprdisplay2.jsp?pressid=116.<br />

Izwaini, S., 2006. Problems of <strong>Arabic</strong> <strong>machine</strong> translation: evaluation of three systems.<br />

In The British Computer Society (BSC), London ., pages 118–148.<br />

Khan, M. A. S., 2007. <strong>Arabic</strong> Tu<strong>to</strong>r - Volume One, volume one. Madrasah In’aamiyyah.<br />

Laoudi, J., Tate, C., and Voss, C., 2004. Towards an au<strong>to</strong>mated evaluation of an embed-<br />

ded mt system. In roceedings of the European Association <strong>for</strong> Machine Translation<br />

Workshop, Malta.<br />

Microsoft, 2009. Microsoft transla<strong>to</strong>r. http://www.windowslivetransla<strong>to</strong>r.com/Default.aspx.<br />

Nolan, B. and Salem, Y., 2009. UniArab: An RRG <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> Machine Trans-<br />

lation Software. In Proceedings of The 2009 International Conference on Role and<br />

Reference Grammar, University of Cali<strong>for</strong>nia, Berkeley, USA.<br />

Nunes, M., 1993. Argument Linking in <strong>English</strong> Derived Nominals In Van Valin (ed)<br />

Advances in Role and Reference Grammar. John Benjamins.<br />

Oren, T., 2004. Machine translation and the global blogosphere.<br />

http://www.windsofchange.net/archives/006011.html. website.<br />

Owens, J., 2006. A Linguistic His<strong>to</strong>ry of <strong>Arabic</strong>. Ox<strong>for</strong>d University Press.<br />

Ramsay, A. and Mansour, H., 2006. Local constraints on <strong>Arabic</strong> word order. In 5th<br />

International Conference on NLP, FinTAL 2006, pages 447–457.<br />

Ryding, K. C., 2007. A Reference Grammar of Modern Standard <strong>Arabic</strong>. Cambridge<br />

University Press, 3rd edition.<br />

136


REFERENCES<br />

Salem, Y., Hensman, A., and Nolan, B., 2008a. Implementing <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> ma-<br />

chine translation using the Role and Reference Grammar linguistic model. In Proceed-<br />

ings of the Eighth Annual International Conference on In<strong>for</strong>mation Technology and<br />

Telecommunication (IT&T 2008), Galway, Ireland.<br />

Salem, Y., Hensman, A., and Nolan, B., 2008b. Towards <strong>Arabic</strong> <strong>to</strong> <strong>English</strong> <strong>machine</strong><br />

translation. ITB Journal, 17:20–31.<br />

Salem, Y. and Nolan, B., 2009a. An <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> <strong>machine</strong> translation system using<br />

an XML-based Role and Reference Grammar representation. In Proceedings of the<br />

23rd Annual Symposium on <strong>Arabic</strong> Linguistics, University of Wisconsin-Milwaukee.<br />

Salem, Y. and Nolan, B., 2009b. Designing an XML lexicon architecture <strong>for</strong> <strong>Arabic</strong><br />

<strong>machine</strong> translation based on Role and Reference Grammar. In Proceedings of the 2nd<br />

International Conference on <strong>Arabic</strong> Language Resources and Tools (MEDAR 2009),<br />

Cairo, Egypt.<br />

Salem, Y. and Nolan, B., 2009c. UNIARAB: An universal <strong>machine</strong> transla<strong>to</strong>r system<br />

<strong>for</strong> <strong>Arabic</strong> Based on Role and Reference Grammar. In Proceedings of the 31st Annual<br />

Meeting of the Linguistics Association of Germany (DGfS 2009).<br />

Schulz, E., 2005. A Student Grammar of Modern Standard <strong>Arabic</strong>. Cambridge University<br />

Press.<br />

Trujillo, A., 1999. Translation Engines: Techniques <strong>for</strong> Machine Translation. Springer.<br />

Turian, J. P., Shen, L., and Melamed, I. D., 2003. Evaluation of <strong>machine</strong> translation<br />

and its evaluation. In Proceedings of the MT Summit IX, New Orleans, USA, pages<br />

386–393.<br />

Van Valin, R., 1993. Advances in Role and Reference Grammar. John Benjamins.<br />

137


REFERENCES<br />

Van Valin, R., 2007. The Role and Reference Grammar analysis of three place predicates.<br />

CEEOL Contemporary Linguistics, 63:31–63.<br />

Van Valin, R. and LaPolla, R., 1997. Syntax: Structure, Meaning, and Function. Cam-<br />

bridge University Press.<br />

Versteegh, K., 2001. The <strong>Arabic</strong> Language. Edinburgh University Press; New Ed edition.<br />

138


Appendix<br />

139


A<br />

The author’s publications related <strong>to</strong> this research<br />

• Brian Nolan and Yasser Salem. 2009. “UniArab: An RRG <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> Ma-<br />

chine Translation Software”, in Proceedings of The 2009 International Conference<br />

on Role and Reference Grammar, University of Cali<strong>for</strong>nia, Berkeley, USA, August<br />

2009.<br />

• Yasser Salem and Brian Nolan. 2009. “Designing an XML Lexicon Architecture<br />

<strong>for</strong> <strong>Arabic</strong> Machine Translation Based on Role and Reference Grammar”, in Pro-<br />

ceedings of the 2nd International Conference on <strong>Arabic</strong> Language Resources and<br />

Tools (MEDAR 2009), Cairo, Egypt, April 2009.<br />

• Yasser Salem and Brian Nolan, 2009. “An <strong>Arabic</strong>-<strong>to</strong>-<strong>English</strong> Machine transla-<br />

tion system using an XMLbased Role and Reference Grammar representation”, in<br />

Proceedings of the 23rd Annual Symposium on <strong>Arabic</strong> Linguistics, University of<br />

Wisconsin-Milwaukee, USA, April 2009.<br />

140


• Yasser Salem and Brian Nolan. 2009. “ UNIARAB: An Universal Machine Trans-<br />

la<strong>to</strong>r System For <strong>Arabic</strong> Based On Role And Reference Grammar”, in Proceedings<br />

of the 31st Annual Meeting of the Linguistics Association of Germany (DGfS 2009),<br />

University of Osnabruck, Germany, March 2009.<br />

• Yasser Salem, Arnold Hensman and Brian Nolan. 2008. Implementing <strong>Arabic</strong>-<br />

<strong>to</strong>-<strong>English</strong> Machine Translation using the Role and Reference Grammar Linguistic<br />

Model, in Proceedings of the Eighth Annual International Conference on In<strong>for</strong>-<br />

mation Technology and Telecommunication (ITT 2008), Galway, Ireland, Oc<strong>to</strong>ber<br />

2008. (Runner-up <strong>for</strong> Best Paper Award)<br />

• Yasser Salem, Arnold Hensman and Brian Nolan. 2008. “Towards <strong>Arabic</strong> <strong>to</strong> En-<br />

glish Machine Translation”, ITB Journal, May 2008, Issue No. 17: 20-31.<br />

141


B<br />

Buckwalter <strong>Arabic</strong> transliteration<br />

1<br />

<strong>Arabic</strong> Letter and Phonetic Value<br />

ā<br />

Letter Name<br />

ALEF<br />

Unicode<br />

u0627<br />

2 b BEH u0628<br />

3 t TEH u062A<br />

4 t THEH u062B<br />

¯<br />

5 ˇg JEEM u062C<br />

6 h. HAH u062D<br />

7 h KHAH u062E<br />

˘<br />

8 d DAL u062F<br />

9<br />

d¯<br />

THAL u0630<br />

10 r REH u0631<br />

11 <br />

z ZAIN u0632<br />

12 s SEEN u0633<br />

13 ˇs SHEEN u0634<br />

142


<strong>Arabic</strong> Letter and Phonetic Value Letter Name Unicode<br />

14 s. SAD u0635<br />

15 d. DAD u0636<br />

16 t. TAH u0637<br />

17 z. ZAH u0638<br />

18 ֒ AIN u0639<br />

<br />

19 ˙g GHAIN u063A<br />

<br />

20 f FEH u0641<br />

21 q QAF u0642<br />

22 k KAF u0643<br />

23 l LAM u0644<br />

24 m MEEM u0645<br />

25 n NOON u0646<br />

26 h HEH ˘0647<br />

27 w WAW u0648<br />

28 y YEH u064A<br />

143


29<br />

<strong>Arabic</strong> Letter and Phonetic Value<br />

֓<br />

Letter Name<br />

HAMZA<br />

Unicode<br />

u0621<br />

30 ֓i ALEF WITH HAMZA UNDER u0625<br />

31<br />

32<br />

<br />

֓a ALEF WITH HAMZA ABOVE u0623<br />

<br />

֓ā ALEF WITH MADDA ABOVE u0622<br />

33 ā YEH u0649<br />

34 ֓y YEH WITH HAMZA ABOVE u0626<br />

35 ֓w WAW WITH HAMZA ABOVE u0624<br />

36 h TEH MARBUTA u0629<br />

<br />

37 a FATHA u064E<br />

<br />

38 u DAMMA u064F<br />

39 i KASRA u0650<br />

40<br />

41<br />

<br />

an TANWIN ALFATH u064B<br />

<br />

un TANWIN ALDAM u064C<br />

42 in TANWIN ALKASER u064D<br />

<br />

43 SKOON u0652<br />

<br />

44 SHADDA u0651<br />

45 ? ARABIC QUESTION MARK u061F<br />

144


C<br />

List of translatable sentences<br />

I want a ring. <br />

֓aryd h ˘ ātm<br />

I <strong>for</strong>got my wallet. nsyt mh. fz . ty<br />

I missed the plane. fāttny ālt.ā֓yrh<br />

֓aryd ˙grfh<br />

I want a room.<br />

I am a <strong>to</strong>urist. ֓anā sā֓yh.<br />

I am alone. ānā wh. dy<br />

I am Irish. ֓anā āyrlndy<br />

<br />

we are students. nh. n tlāmyd<br />

¯<br />

he is an engineer. hw mhnds<br />

145


I am an engineer. ānā mhnds<br />

I am the engineer. ֓anā ālmhnds<br />

Sarah hurts Yousuf. <br />

tˇgrh. sārh ywsf<br />

Sarah hurts Yousuf. <br />

sārh tˇgrh. ywsf<br />

Sarah hurts Yousuf. <br />

tˇgrh. ywsf sārh<br />

Omar will drink the milk. syˇsrb āllbn ֒mr<br />

Omar will drink the milk. syˇsrb ֒mr āllbn<br />

Khalid is drinking the milk. <br />

Khalid drank the milk.<br />

yˇsrb hāld āllbn<br />

˘<br />

ˇsrb hāld āllbn<br />

˘<br />

Omar is visiting Ireland. yzwr ֒mr ֓ayrlndā<br />

Qays loves Laila. qys yh. b lylā<br />

Qays loves Laila. yh. b qys lylā<br />

Qays loves Laila. yh. b lylā qys<br />

Laila loves Qays. lylā th. b qys<br />

Laila loves Qays. th. b lylā qys<br />

Laila loves Qays. th. b qys lylā<br />

Omar read the book. qr֓a ֒mr ālktāb<br />

Brian read the book. <br />

brāyn qr֓a ālktāb<br />

146


she is an engineer.<br />

Zaid loves Fatima.<br />

Zaid loves Fatima.<br />

hy mhndsh<br />

<br />

zyd yh. b fāt.mh<br />

<br />

yh. b zyd fāt.mh<br />

Zaid loves Fatima. <br />

yh. b fāt.mh zyd<br />

Fatima loves Zaid. <br />

fāt.mh th. b zyd<br />

Fatima loves Zaid. th. b fāt.mh zyd<br />

Fatima loves Zaid. th. b zyd fāt.mh<br />

Eman drew her school. <br />

rsmt ֓iymān mdrsthā<br />

Louis hit Mark. d. rb lwys mārk<br />

Louis hit Mark. lwys d. rb mārk<br />

Mark hit Louis. mārk d. rb lwys<br />

Mark hit Louis. d. rb mārk lwys<br />

Brian wrote the book. brāyn ktb ālktāb<br />

Ayesha wrote the book. ֒ā֓yˇsh ktbt ālktāb<br />

Eman wrote the book. <br />

ktbt ֓iymān ālktāb<br />

I have made a reservation. <br />

lqd qmt bālh. ˇgz<br />

I have lost my ticket. lqd fqdt td ¯ krty<br />

I am a doc<strong>to</strong>r. ֓anā t.byb<br />

147


Abbas is playing with the ball.<br />

Abbas is playing with the ball.<br />

yl֒b ֒bās bālkrh<br />

֒bās yl֒b bālkrh<br />

bālkrh yl֒b ֒bās<br />

Abbas is playing with the ball. <br />

Yousuf played with the ball. <br />

l֒b ywsf bālkrh<br />

Yousuf played with the ball. <br />

l֒b bālkrh ywsf<br />

Yousuf played with the ball. <br />

bālkrh l֒b ywsf<br />

Yousuf will play with the ball.<br />

Yousuf will play with the ball.<br />

Yousuf will play with the ball.<br />

Essam played with the spoon.<br />

Essam will play with the spoon.<br />

Essam is playing with the spoon.<br />

Essam played with the spoons.<br />

Essam will play with the spoons.<br />

Essam is playing with the spoons.<br />

Mansour ate with the spoon.<br />

Mansour ate with the spoon.<br />

<br />

<br />

syl֒b ywsf bālkrh<br />

<br />

ywsf syl֒b bālkrh<br />

<br />

bālkrh syl֒b ywsf<br />

l֒b ֒s. ām bālml֒qh<br />

syl֒b ֒s. ām bālml֒qh<br />

yl֒b ֒s. ām bālml֒qh<br />

l֒b ֒s. ām bālmlā֒q<br />

syl֒b ֒s. ām bālmlā֒q<br />

yl֒b ֒s. ām bālmlā֒q<br />

֓akl mns.wr bālml֒qh<br />

mns.wr ֓akl bālml֒qh<br />

Mansour ate with the spoon. bālml֒qh ֓akl mns.wr<br />

148


Jack is eating with the spoon. y֓akl ˇgāk bālml֒qh<br />

Jack will eat with the spoon. sy֓akl ˇgāk bālml֒qh<br />

Jack killed Mary. qtl ˇgāk māry<br />

Jack killed Mary. ˇgāk qtl māry<br />

Mary killed Jack. qtlt māry ˇgāk<br />

Mary killed Jack. <br />

māry qtlt ˇgāk<br />

Jack killed the man. qtl ˇgāk ālrˇgl<br />

Jack killed the man. ˇgāk qtl ālrˇgl<br />

The man killed Jack. ālrˇgl qtl ˇgāk<br />

The man killed Jack. qtl ālrˇgl ˇgāk<br />

Suhaib bellowed the fire. s. hyb nfh ˘ ālnār<br />

Suhaib bellowed the fire. nfh ˘ ālnār s. hyb<br />

Suhaib bellowed the fire. nfh s. hyb ālnār<br />

˘<br />

Sulaiman opened the door. fth. ālbāb slymān<br />

Sulaiman opened the door. slymān fth. ālbāb<br />

Sulaiman opened the door. fth. slymān ālbāb<br />

Zaid <strong>to</strong>ok the book. ֓ahd zyd ālktāb<br />

˘ ¯<br />

149


Zaid <strong>to</strong>ok the book. <br />

zyd ֓ahd ālktāb<br />

˘ ¯<br />

Qays <strong>to</strong>ok the files. <br />

֓ahd qys ālmlfāt<br />

˘ ¯<br />

Qays <strong>to</strong>ok the files. <br />

֓ahd qys ālmlfāt<br />

˘ ¯<br />

brāyn yrkb ālh. āflh<br />

Brian rides the bus.<br />

Brian rides the bus.<br />

Fahmy rides the car.<br />

Fahmy rides the car.<br />

yrkb brāyn ālh. āflh<br />

fhmy yrkb ālsyārh<br />

yrkb fhmy ālsyārh<br />

Fahmy rides the car. yrkb ālsyārh fhmy<br />

Khalid answered the question. <br />

h ˘ āld ֓aˇgāb āls֓wāl<br />

Khalid answered the question. ֓aˇgāb āls֓wāl h ˘ āld<br />

Khalid answered the question. ֓aˇgāb hāld āls֓wāl<br />

˘<br />

<br />

<br />

Rashid broke the window.<br />

Rashid broke the window.<br />

rāˇsd ksr ālnāfd ¯ h<br />

<br />

ksr rāˇsd ālnāfd ¯ h<br />

Rashid broke the window. <br />

ksr ālnāfd ¯ h rāˇsd<br />

Khalid broke his <strong>to</strong>y. <br />

ksr h ˘ āld l֒bth<br />

Omar <strong>to</strong>re the book. mzq ֒mr ālktāb<br />

Omar <strong>to</strong>re the book. <br />

֒mr mzq ālktāb<br />

Omar <strong>to</strong>re the book. mzq ālktāb ֒mr<br />

150


Ayah <strong>to</strong>re the page. <br />

֓āyh mzqt ālkys<br />

Ayah <strong>to</strong>re the page. mzqt ālkys ֓āyh<br />

Ayah <strong>to</strong>re the page. mzqt ֓āyh ālkys<br />

Omar opened his present. fth. ֒mr hdyth<br />

<br />

fth. ֒mr ālnāfdh ¯<br />

Omar opened the window.<br />

Omar opened the door. ֒mr fth. ālbāb<br />

James combed his hair. mˇst. ˇgyms ˇs֒rh<br />

James combed his hair. ˇgyms mˇst. ˇs֒rh<br />

Eric cleaned the window. <br />

Eric cleaned the plane.<br />

nz . f ֓iyryk ālnāfdh ¯<br />

֓iyryk nz Eric cleaned the house.<br />

. f ālt.ā֓yrh<br />

nz. f ālmnzl ֓iyryk<br />

Sarah wiped the table. msh. t sārh ālt.āwlh<br />

Sarah wiped the table. sārh msh. t ālt.āwlh<br />

Roqiah will cook the dinner. stt.bh ˘ rqyh āl֒ˇsā֓<br />

Roqiah will cook the dinner. rqyh stt.bh ˘ āl֒ˇsā֓<br />

Roqiah will cook the dinner. stt.bh ˘ āl֒ˇsā֓rqyh<br />

Harold pinched James. qrs. hārwld ˇgyms<br />

Harold pinched James. hārwld qrs. ˇgyms<br />

151


he is a doc<strong>to</strong>r. hw t.byb<br />

Ayah is spending her money. ֓āyh tnfq nqwdhā<br />

Ayah is spending her money. tnfq ֓āyh nqwdhā<br />

Henry lost his money. hnry fqd nqwdh<br />

Henry lost his money. fqd hnry nqwdh<br />

Adam punched Philip. lkm ֓ādm fylyb<br />

Zakiah killed Sarah. <br />

qtlt zkyh sārh<br />

Zakiah killed Sarah. <br />

zkyh qtlt sārh<br />

Mark slapped Louis. s. f֒mārk lwys<br />

Mark slapped Louis. mārk s. f֒lwys<br />

Sarah hates Zakiah. <br />

tkrh sārh zkyh<br />

Sarah hates Zakiah. <br />

Ayesha phoned Eman.<br />

sārh tkrh zkyh<br />

<br />

hātft ֒ā֓yˇsh ֓iymān<br />

Ayesha phoned Eman. <br />

֒ā֓yˇsh hātft ֓iymān<br />

Ayah thanked Khalid. ˇskrt ֓āyh hāld ˘<br />

Ayah thanked Khalid. ֓āyh ˇskrt hāld ˘<br />

Sarah called Adam. nādt sārh ֓ādm<br />

Sarah called Adam. sārh nādt ֓ādm<br />

152


Eman saw Sarah. <br />

r֓at ֓iymān sārh<br />

Eman saw Carl. ֓iymān r֓at kārl<br />

msk fylyb ālkrh<br />

Philip caught the ball.<br />

Philip caught the ball.<br />

fylyb msk ālkrh<br />

Philip caught the ball. msk ālkrh fylyb<br />

Carl bought pens. ֓iˇstrā kārl ֓aqlām<br />

Carl bought pens. kārl ֓iˇstrā ֓aqlām<br />

qād mārk ālt.ā֓yrh<br />

Mark drove the plane.<br />

Mark drove the bus.<br />

mārk qād ālh. āflh<br />

Mark drove the car. Adam is cleaning his <strong>to</strong>y.<br />

qād ālsyārh mārk<br />

<br />

Adam is cleaning his room.<br />

֓ādm ynz . f l֒bth<br />

Adam is cleaning his car.<br />

<br />

ynz . f ֓ādm ˙grfth<br />

<br />

ynz . f ֓ādm syārth<br />

Mark will clean my kitchen.<br />

Sarah will clean my office.<br />

Sarah will clean the car.<br />

<br />

<br />

mārk synz . f mt.bh y<br />

˘<br />

<br />

<br />

<br />

stnz. f sārh mktby<br />

sārh stnz. f ālsyārh<br />

Eman will clean her room. Eman will clean her room.<br />

stnz . f ֓iymān ˙grfthā<br />

<br />

֓iymān stnz . f ˙grfthā<br />

153


Louis is washing the dishes.<br />

Louis is washing the dishes.<br />

y˙gsl lwys āls.h. wn<br />

lwys y˙gsl āls.h. wn<br />

Louis is washing the dishes. y˙gsl āls.h. wn lwys<br />

Harold is feeding his cat. yt.֒m hārwld qt.th<br />

Harold is feeding his cat. hārwld yt.֒m qt.th<br />

he is a seller. hw bā֓y֒<br />

Eric is fixing his car. ys. lh. ֓iyryk syārth<br />

Eric is fixing his car. ֓iyryk ys. lh. syārth<br />

Fahmy speaks <strong>English</strong>.<br />

Fahmy speaks <strong>English</strong>.<br />

fhmy ytklm ālānklyzyh<br />

ytklm fhmy ālānklyzyh<br />

Ayah is cooking the food. tt.bh ˘ ֓āyh ālt.֒ām<br />

Ayah is cooking the dinner. ֓āyh tt.bh ˘ āl֒ˇsā֓<br />

Rashid is helping Mark. <br />

ysā֒d rāˇsd mārk<br />

Rashid is helping Ayesha.<br />

<br />

rāˇsd ysā֒d ֒ā֓yˇsh<br />

Mansour is eating his food. y֓akl mns. wr t.֒āmh<br />

Mansour is eating his food. mns.wr y֓akl t.֒āmh<br />

154


Carl is brushing his hair. <br />

ymˇst. kārl ˇs֒rh<br />

Carl is brushing his hair. <br />

kārl ymˇst. ˇs֒rh<br />

Abbas is brushing his teeth. <br />

yfrˇs ֒bās ֓asnānh<br />

Abbas is brushing his teeth. <br />

֒bās yfrˇs ֓asnānh<br />

Yousuf is wearing his clothes. <br />

ylbs ywsf t ¯ yābh<br />

Yousuf is wearing his shoes. <br />

Henry is watching the television.<br />

Henry is watching the television.<br />

<br />

ywsf ylbs h. d ¯ ā֓yh<br />

<br />

yˇsāhd hnry āltlfāz<br />

<br />

hnry yˇsāhd āltlfāz<br />

Henry is watching the television. <br />

yˇsāhd āltlfāz hnry<br />

Sulaiman caught the fish. ֓is.t.ād slymān ālsmk<br />

slymān ֓is. t.ād ālsmk<br />

Sulaiman caught the fish. <br />

֓is.t.ād ālsmk slymān<br />

Sulaiman caught the fish.<br />

Omar is planting the trees. <br />

yzr֒āl֓aˇsˇgār ֒mr<br />

Omar is planting the trees. <br />

֒mr yzr֒āl֓aˇsˇgār<br />

James pushed the chairs. ˇgr ˇgyms ālkrāsy<br />

James pushed the chairs. ˇgyms ˇgr ālkrāsy<br />

James pushed the chairs. ˇgr ālkrāsy ˇgyms<br />

155


Roqih drew trees. rsmt rqyh ֓aˇsˇgār<br />

Roqih drew Omar. rqyh rsmt ֒mr<br />

Roqih drew Khalid. <br />

rsmt hāld rqyh<br />

˘<br />

Ayesha picked the flowers. <br />

֒ā֓yˇsh qt.ft ālzhwr<br />

Ayesha picked the flowers. <br />

Ayesha picked the flowers.<br />

qt.ft ֒ā֓yˇsh ālzhwr<br />

qt.ft ālzhwr ֒ā֓yˇsh<br />

Omar is fixing the computer. ys. lh. ֒mr ālh. āswb<br />

Omar is fixing the computer. ֒mr ys. lh. ālh. āswb<br />

Omar is fixing the computer. ys. lh. ālh. āswb ֒mr<br />

Omar bought the <strong>to</strong>ys. <br />

Omar bought the fish. <br />

Eman ironed the clothes. <br />

֓iˇstrā ֒mr āll֒b<br />

֓iˇstrā ālsmk ֒mr<br />

kwt ֓iymān ālmlābs<br />

Eman ironed the clothes. ֓iymān kwt ālmlābs<br />

Eman ironed the clothes. <br />

kwt ālmlābs ֓iymān<br />

lwnt ֓āyh āls. wrh<br />

Ayah painted the picture.<br />

Ayah painted the picture.<br />

֓āyh lwnt āls. wrh<br />

Ayah painted the picture. lwnt āls. wrh ֓āyh<br />

I want the book. ֓aryd ālktāb<br />

156


I want a book. ֓aryd ktāb<br />

I want the food. ֓aryd ālt.֒ām<br />

I ate the dinner. ֓aklt āl֒ˇsā֓<br />

I ate the food. ֓aklt ālt.֒ām<br />

I ate the fish. ֓aklt ālsmk<br />

ˇsrbt āllbn<br />

I drank the milk.<br />

Eman lost her cat. <br />

fqdt ֓iymān qt.thā<br />

dhst ֓iymān qt.thā<br />

Eman ran over her cat. <br />

<br />

Omar won the race. fāz ֒mr bālsbāq<br />

Omar is sleeping. ynām ֒mr<br />

The children are crying. <br />

āl֓at.fāl ybkwn<br />

the wheel squeaks. āldwlāb ys. rs. r<br />

<br />

Omar reads. ֒mr yqr֓a<br />

Omar reads a lot. ֒mr yqr֓a kt ¯ yrā<br />

He hit Khalid. hw d. rb h ˘ āld<br />

He played with the spoon. hw l֒b bālml֒qh<br />

He loves Laila. hw yh. b lylā<br />

He loves Laila. yh. b hw lylā<br />

He loves Laila. yh. b lylā hw<br />

157


Omar gave Khalid the book. <br />

֒mr ֓a֒t.ā h ˘ āld ālktāb<br />

Omar gave Sarah the book. ֒mr ֓a֒t.ā sārh ālktāb<br />

Omar is giving Khalid the book. <br />

֒mr y֒t.y h ˘ āld ālktāb<br />

Omar is giving Sarah the book. ֒mr y֒t.y sārh ālktāb<br />

Sarah is giving Khalid the book. <br />

sārh t֒t.y h ˘ āld ālktāb<br />

Sarah is giving Eman the book. sārh t֒t.y ֓iymān ālktāb<br />

Sarah gave Eman the book. <br />

sārh ֓a֒t.t ֓iymān ālktāb<br />

Sarah gave Khalid the book. sārh ֓a֒t.t hāld ālktāb<br />

˘<br />

He gave Khalid the book. <br />

hw ֓a֒t.ā h ˘ āld ālktāb<br />

He gave Khalid the book. <br />

֓a֒t.ā hw h ˘ āld ālktāb<br />

He gave Sarah the book. hw ֓a֒t.ā sārh ālktāb<br />

He gave Sarah the book. ֓a֒t.ā hw sārh ālktāb<br />

He is giving Khalid the book. <br />

hw y֒t.y h ˘ āld ālktāb<br />

She is giving Sarah the book. hy t֒t.y sārh ālktāb<br />

She is giving Khalid the book. <br />

hy t֒t.y h ˘ āld ālktāb<br />

She gave Sarah the book. hy ֓a֒t.t sārh ālktāb<br />

She gave Sarah the book. ֓a֒t.t hy sārh ālktāb<br />

She gave Khalid the book. <br />

hy ֓a֒t.t h ˘ āld ālktāb<br />

She gave Khalid the book. <br />

֓a֒t.t hy h ˘ āld ālktāb<br />

158


Omar gave the book <strong>to</strong> Khalid. <br />

֒mr ֓a֒t.ā lh ˘ āld ālktāb<br />

Omar gave the book <strong>to</strong> Khalid. <br />

֒mr ֓a֒t.ā ālktāb lh ˘ āld<br />

Omar gave the book <strong>to</strong> Khalid. <br />

֒mr y֒t.y lh ˘ āld ālktāb<br />

Omar is giving the book <strong>to</strong> Khalid. <br />

֒mr y֒t.y ālktāb lh ˘ āld<br />

Omar is giving the book <strong>to</strong> Sarah. ֒mr y֒t.y ālktāb lsārh<br />

Omar is giving the book <strong>to</strong> Sarah. ֒mr y֒t.y lsārh ālktāb<br />

Omar gave the book <strong>to</strong> Sarah. ֒mr ֓a֒t.ā ālktāb lsārh<br />

Omar gave the book <strong>to</strong> Sarah. ֒mr ֓a֒t.ā lsārh ālktāb<br />

Eman is giving the book <strong>to</strong> Khalid. <br />

֓iymān t֒t.y lh ˘ āld ālktāb<br />

Eman is giving the book <strong>to</strong> Khalid. <br />

֓iymān t֒t.y ālktāb lh ˘ āld<br />

Eman is giving the book <strong>to</strong> Sarah. ֓iymān t֒t.y ālktāb lsārh<br />

Eman is giving the book <strong>to</strong> Sarah. ֓iymān t֒t.y lsārh ālktāb<br />

He gave the book <strong>to</strong> Khalid. <br />

hw ֓a֒t.ā lh ˘ āld ālktāb<br />

He gave a book <strong>to</strong> Khalid. <br />

hw ֓a֒t.ā ktāb lh ˘ āld<br />

He is giving a book <strong>to</strong> Khalid. <br />

hw y֒t.y lh ˘ āld ktāb<br />

He is giving a book <strong>to</strong> Khalid. <br />

hw y֒t.y ktāb lh ˘ āld<br />

He is giving a book <strong>to</strong> Sarah. hw y֒t.y ktāb lsārh<br />

159


He is giving a book <strong>to</strong> Sarah. hw y֒t.y lsārh ktāb<br />

He gave a book <strong>to</strong> Sarah. hw ֓a֒t.ā ktāb lsārh<br />

He gave a book <strong>to</strong> Sarah. hw ֓a֒t.ā lsārh ktāb<br />

She is giving a book <strong>to</strong> Khalid. <br />

hy t֒t.y lh ˘ āld ktāb<br />

She is giving a book <strong>to</strong> Khalid. <br />

hy t֒t.y ktāb lh ˘ āld<br />

She is giving a book <strong>to</strong> Sarah. hy t֒t.y ktāb lsārh<br />

She is giving a book <strong>to</strong> Sarah. hy t֒t.y lsārh ktāb<br />

Eman is giving a book <strong>to</strong> Sarah. ֓iymān t֒t.y lsārh ktāb<br />

He gave a book <strong>to</strong> Khalid. <br />

hw ֓a֒t.ā lh ˘ āld ktāb<br />

She is giving Sarah a book. hy t֒t.y sārh ktāb<br />

Omar gave Khalid a book. <br />

֒mr ֓a֒t.ā h ˘ āld ktāb<br />

Khalid drives. yswq hāld ˘<br />

Khalid drives. hāld yswq<br />

˘<br />

Khalid drives a lot. <br />

h ˘ āld yswq kt ¯ yrā<br />

160


D<br />

Verbs in lexicon<br />

Verbs in <strong>Arabic</strong> change depending on gender, number and tense of the subject, so there<br />

are multiple entries in the lexicon that translate <strong>to</strong> the same <strong>English</strong> output. The translit-<br />

eration makes it clear that these are different words in <strong>Arabic</strong>.<br />

<strong>Arabic</strong> Example Logical Structure<br />

֓aryd I want a ring. < TNS : PRES[do ′ (I, [want ′ (I, y)])] ><br />

nsyt I <strong>for</strong>got my wallet. < TNS : PAST[do ′ (I, [<strong>for</strong>get ′ (I, y)])] ><br />

֓aklt I ate an apple. < TNS : PAST[do ′ (I, [eat ′ (I, y)])] ><br />

ˇsrbt I drank the milk. < TNS : PAST[do ′ (I, [drink ′ (I, y)])] ><br />

qr֓a Omar read the book. < TNS : PAST[do ′ (x, [read ′ (x, y)])] ><br />

qr֓at Eman read the book. < TNS : PAST[do ′ (x, [read ′ (x, y)])] ><br />

161


<strong>Arabic</strong> Example Logical Structure<br />

NON I am a <strong>to</strong>urist. be ′ (I,[<strong>to</strong>urist ′ ])<br />

NON He is an engineer. be ′ (he,[engineer ′ ])<br />

NON We are students. be ′ (we,[students ′ ])<br />

ˇsrb Khalid drank the milk. < TNS : PAST[do ′ (x,[drink ′ (x,y)])] ><br />

yˇsrb Khalid is drinking the milk. < TNS : PRES[do ′ (x,[drink ′ (x,y)])] ><br />

syˇsrb Omar will drink the milk. < TNS : FUT[do ′ (x,[drink ′ (x,y)])] ><br />

ylbs Eric is wearing his clothes. < TNS : PRES[do ′ (x,[wear ′ (x,y)])] ><br />

d. rb Louis hit Mark. < TNS : PAST[do ′ (x,[hit ′ (x,y)])] ><br />

yh. b Qays loves Laila. < TNS : PRES[do ′ (x,[love ′ (x,y)])] ><br />

th. b Fatima loves Zaid. < TNS : PRES[do ′ (x,[love ′ (x,y)])] ><br />

qmt I have made a reservation. < TNS : PAST[do ′ (x,[make ′ (x,y)])] ><br />

fqdt Eman lost her cat. < TNS : PAST[do ′ (x,[lose ′ (x,y)])] ><br />

qtl The man killed Jack. < TNS : PAST[do ′ (x,[kill ′ (x,y)])] ><br />

fth. Sulaiman opened the door. < TNS : PAST[do ′ (x,[open ′ (x,y)])] ><br />

<br />

<br />

֓ah ˘ d ¯<br />

Zaid <strong>to</strong>ok the book. < TNS : PAST[do ′ (x,[take ′ (x,y)])] ><br />

yrkb Fahmy rides the car. < TNS : PRES[do ′ (x,[ride ′ (x,y)])] ><br />

stt.bh ˘<br />

Raiqa will cook the dinner. < TNS : FUT[do ′ (x,[cook ′ (x,y)])] ><br />

162


<strong>Arabic</strong> Example Logical Structure<br />

֓aˇgāb Khalid answered the question. < TNS : PAST[do ′ (x,[answer ′ (x,y)])] ><br />

yzwr Omar is visiting Ireland. < TNS : PRES[do ′ (x,[visit ′ (x,y)])] ><br />

yl֒b Abbas is playing with the ball. < TNS : PRES[do ′ (x,[play ′ (x,y)])] ><br />

syl֒b Yousuf will play with the ball. < TNS : FUT[do ′ (x,[play ′ (x,y)])] ><br />

֓akl Mansour ate with the spoon. < TNS : PAST[do ′ (x,[eat ′ (x,y)])] ><br />

y֓akl Mansour is eating his food. < TNS : PRES[do ′ (x,[eat ′ (x,y)])] ><br />

sy֓akl Eric will eat with the spoon. < TNS : FUT[do ′ (x,[eat ′ (x,y)])] ><br />

nfh ˘<br />

Suhaib bellowed the fire. < TNS : PAST[do ′ (x,[bellow ′ (x,y)])] ><br />

dhst Eman runned over her cat. < TNS : PAST[do ′ (x,[runnover ′ (x,y)])] ><br />

ksr Rashid broke the window. < TNS : PAST[do ′ (x,[break ′ (x,y)])] ><br />

tˇgrh. Sarah hurts Yousuf. < TNS : PRES[do ′ (x,[hurt ′ (x,y)])] ><br />

<br />

mzq Almahdi <strong>to</strong>re the book. < TNS : PAST[do ′ (x,[tear ′ (x,y)])] ><br />

<br />

mzqt Ayah <strong>to</strong>re the page. < TNS : PAST[do ′ (x,[tear ′ (x,y)])] ><br />

fth. Almahdi opened the window. < TNS : PAST[do ′ (x,[open ′ (x,y)])] ><br />

mˇst. James combed his hair. < TNS : PAST[do ′ (x,[comb ′ (x,y)])] ><br />

<br />

<br />

nz . f Eric cleaned the plane. < TNS : PAST[do ′ (x,[clean ′ (x,y)])] ><br />

163


<strong>Arabic</strong> Example Logical Structure<br />

qrs. Harold pinched James. < TNS : PAST[do ′ (x,[pinch ′ (x,y)])] ><br />

tnfq Ayah is spending her money. < TNS : PRES[do ′ (x,[spend ′ (x,y)])] ><br />

fqd Henry lost his money. < TNS : PAST[do ′ (x,[lose ′ (x,y)])] ><br />

lkm Adam punched Philip. < TNS : PAST[do ′ (x,[punch ′ (x,y)])] ><br />

tkrh Sarah hates Zakiah. < TNS : PRES[do ′ (x,[hate ′ (x,y)])] ><br />

lkmt Sarah punched Sarah. < TNS : PAST[do ′ (x,[punch ′ (x,y)])] ><br />

qtlt Zakiah killed Sarah. < TNS : PAST[do ′ (x,[kill ′ (x,y)])] ><br />

qtl Jack killed Mary. < TNS : PAST[do ′ (x,[kill ′ (x,y)])] ><br />

s. f֒ Mark slapped Louis. < TNS : PAST[do ′ (x,[slap ′ (x,y)])] ><br />

hātft Ayesha phoned Eman. < TNS : PAST[do ′ (x,[phone ′ (x,y)])] ><br />

<br />

ˇskrt Ayah thanked Khalid. < TNS : PAST[do ′ (x,[thank ′ (x,y)])] ><br />

nādt Sarah called Adam. < TNS : PAST[do ′ (x,[call ′ (x,y)])] ><br />

<br />

fāz Omar won the race. < TNS : PAST[do ′ (x,[win ′ (x,y)])] ><br />

r֓at Eman saw Sarah. < TNS : PAST[do ′ (x,[see ′ (x,y)])] ><br />

msk Philip caught the ball. < TNS : PAST[do ′ (x,[catch ′ (x,y)])] ><br />

<br />

֓iˇstrā Carl bought pens. < TNS : PAST[do′ (x,[buy ′ (x,y)])] ><br />

qād Mark drove the bus. < TNS : PAST[do ′ (x,[drive ′ (x,y)])] ><br />

<br />

<br />

ynz . f Adam is cleaning his room. < TNS : PRES[do ′ (x,[clean ′ (x,y)])] ><br />

164


<strong>Arabic</strong> Example Logical Structure<br />

<br />

<br />

synz . f Mark will clean my kitchen. < TNS : FUT[do ′ (x,[clean ′ (x,y)])] ><br />

<br />

<br />

stnz . f Sarah will clean my office. < TNS : FUT[do ′ (x,[clean ′ (x,y)])] ><br />

y˙gsl Louis is washing the dishes. < TNS : PRES[do ′ (x,[wash ′ (x,y)])] ><br />

yt.֒m Harold is feeding his cat. < TNS : PRES[do ′ (x,[feed ′ (x,y)])] ><br />

ys.lh. Eric is fixing his car. < TNS : PRES[do ′ (x,[fix ′ (x,y)])] ><br />

ytklm Fahmy speaks <strong>English</strong>. < TNS : PRES[do ′ (x,[speak ′ (x,y)])] ><br />

tt.bh ˘<br />

Ayah is cooking the dinner. < TNS : PRES[do ′ (x,[cook ′ (x,y)])] ><br />

ysā֒d Rashid is helping Mark. < TNS : PRES[do ′ (x,[help ′ (x,y)])] ><br />

y֓akl Mansour is eating his food. < TNS : PRES[do ′ (x,[eat ′ (x,y)])] ><br />

ymˇst. Carl is brushing his hair < TNS : PRES[do ′ (x,[brush ′ (x,y)])] ><br />

<br />

yfrˇs Abbas is brushing his teeth. < TNS : PRES[do ′ (x,[brush ′ (x,y)])] ><br />

ylbs Yousuf is wearing his shoes. < TNS : PRES[do ′ (x,[wear ′ (x,y)])] ><br />

ynām Omar sleeps early. < TNS : PRES[do ′ (x,[sleep ′ (x,y)])] ><br />

yˇsāhd Henry is watching the TV. < TNS : PRES[do ′ (x,[watch ′ (x,y)])] ><br />

yzr֒ Almahdi is planting the trees. < TNS : PRES[do ′ (x,[plant ′ (x,y)])] ><br />

֓is. t.ād Sulaiman caught the fishs. < TNS : PAST[do ′ (x,[catch ′ (x,y)])] ><br />

ˇgr James pushed the chairs. < TNS : PAST[do ′ (x,[push ′ (x,y)])] ><br />

165


<strong>Arabic</strong> Example Logical Structure<br />

ktbt Ayesha wrote the book. < TNS : PAST[do ′ (x,[write ′ (x,y)])] ><br />

ktb Brian wrote the book. < TNS : PAST[do ′ (x,[write ′ (x,y)])] ><br />

msh. t Fatima wiped the house. < TNS : PAST[do ′ (x,[wipe ′ (x,y)])] ><br />

rsmt Raiqa drew trees. < TNS : PAST[do ′ (x,[draw ′ (x,y)])] ><br />

qt.ft Ayesha picked the flowers. < TNS : PAST[do ′ (x,[pick ′ (x,y)])] ><br />

kwt Eman ironed the clothes. < TNS : PAST[do ′ (x,[iron ′ (x,y)])] ><br />

<br />

֓iˇstrā Omar bought the <strong>to</strong>ys. < TNS : PAST[do ′ (x,[buy ′ (x,y)])] ><br />

lwnt Ayah painted the picture. < TNS : PAST[do ′ (x,[paint ′ (x,y)])] ><br />

֓a֒t.āk Omar gave you the book. < TNS : PAST[do ′ (x,[give ′ (x,y)])] ><br />

ys.lh. Omar is fixing the computer. < TNS : PAST[do ′ (x,[fix ′ (x,y)])] ><br />

y֒ml Yasser works hard. < TNS : PRES[do ′ (x,[work ′ (x,y)])] ><br />

mrr Philip passed the ball. < TNS : PAST[do ′ (x,[pass ′ (x,y)])] ><br />

yswq Khalid drives. < TNS : PRES >><br />

ybkwn The children are crying. < TNS : PRES >><br />

yqr֓a Omar reads. < TNS : PRES >><br />

166


<strong>Arabic</strong> ys. rs. r<br />

Example the wheel squeaks.<br />

Logical Structure < TNS : PRES >><br />

<strong>Arabic</strong> ynām<br />

Example Omar is sleeping.<br />

Logical Structure < TNS : PRES >><br />

<strong>Arabic</strong> ֓a֒t.ā<br />

Example Omar gave Khalid the book.<br />

Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />

<strong>Arabic</strong> y֒t.y<br />

Example Omar is giving Eman a book.<br />

Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />

<strong>Arabic</strong> t֒t.y<br />

Example Sarah is giving Eman a book.<br />

Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />

<strong>Arabic</strong><br />

֓a֒t.t<br />

Example Sarah gave Khalid a book.<br />

Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEhave ′ (y, z)]] ><br />

167


<strong>Arabic</strong><br />

֓art<br />

Example Fatima showed the letter <strong>to</strong> Khalid.<br />

Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />

<strong>Arabic</strong> yry<br />

Example Mark is showing Brian the letter.<br />

Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />

<strong>Arabic</strong> ֓arā<br />

Example Brian showed the letter <strong>to</strong> Sarah.<br />

Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />

<strong>Arabic</strong> try<br />

Example Fatima is showing Adam the letter.<br />

Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEsee ′ (y, z)]] ><br />

<strong>Arabic</strong> ydrs<br />

Example Suhaib is teaching Eman the his<strong>to</strong>ry.<br />

Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />

<strong>Arabic</strong> tdrs<br />

Example Eman is teaching mathematics <strong>to</strong> Sarah.<br />

Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />

168


<strong>Arabic</strong> ֓adrs<br />

Example I am teaching mathematics <strong>to</strong> Sarah.<br />

Logical Structure < TNS : PRES[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />

<strong>Arabic</strong> drs<br />

Example Suhaib taught Mark mathematics.<br />

Logical Structure < TNS : PAST[do ′ (x, 0)CAUSE[BECOMEknow ′ (y, z)]] ><br />

<strong>Arabic</strong> Example Logical Structure<br />

<br />

fāttny I missed the plane. < TNS : PAST[do ′ (I, [miss ′ (I, y])] ><br />

֓aryd I want a ring. < TNS : PRES[do ′ (I, [want ′ (I, ring])] ><br />

nsyt I <strong>for</strong>got my wallet. < TNS : PAST[do ′ (I, [<strong>for</strong>get ′ (I, wallet])] ><br />

֓aklt I ate the food. < TNS : PAST[do ′ (I, [eat ′ (I, food])] ><br />

ˇsrbt I drank the milk. < TNS : PAST[do ′ (I, [drink ′ (I, milk])] ><br />

169


E<br />

The UniArab code<br />

Given the large amount of code developed as part of the work presented in this thesis, it<br />

is available in the attached CD rather than included here.<br />

Package: Name of Class Class Summary<br />

pkg1: <strong>Arabic</strong>To<strong>English</strong>MT The main class<br />

pkg1: AdjectiveXMLWriter To write an adjective in the datasource<br />

pkg1: Adjective To hold adjective attributes from the datasource<br />

pkg1: AdverbXMLWriter To write an adverb in the datasource<br />

pkg1: Adverb To hold adverb attributes from the datasource<br />

pkg1: DemonstrativeXMLWriter To write a demonstrative in the datasource<br />

pkg1: Demonstrative To hold demonstrative attributes from the datasource<br />

pkg1: Global This class <strong>to</strong> add a new word in lexicon<br />

pkg1: NounXMLWriter To write a noun in the datasource<br />

pkg1: Noun To hold noun attributes from the datasource<br />

170


Package: Name of Class Class Summary<br />

pkg1: OtherWordXMLWriter To write an OtherWord in the datasource<br />

pkg1: OtherWord To hold OtherWord attributes from the datasource<br />

pkg1: Preparation<br />

To change the value from lexicon interface’s list <strong>to</strong> be saved<br />

in datasource<br />

pkg1: ProperNounXMLWriter To write a proper noun in the datasource<br />

pkg1: ProperNoun To hold proper noun attributes from the datasource<br />

pkg1: SearchEngine2 This class <strong>to</strong> manage search in datasource<br />

pkg1: TempAdjectiveXML<br />

pkg1: TempAdverbXML<br />

pkg1: TempDemonstrativeXML<br />

pkg1: TempNounXML<br />

pkg1: TempOtherWordXML<br />

pkg1: TempProperNounXML<br />

pkg1: TempVerbXML<br />

To manage a written an adjective in the XML datasource<br />

while the UniArab system is running<br />

To manage a written a new adverb in the XML datasource<br />

while the UniArab system is running<br />

To manage a written a new demonstrative in the XML<br />

datasource while the UniArab system is running<br />

To manage a written a new noun in the XML datasource<br />

while the UniArab system is running<br />

To manage a written a new OtherWord in the XML datasource<br />

while the UniArab system is running<br />

To manage a written a new proper noun in the XML datasource<br />

while the UniArab system is running<br />

To manage a written a new verb in the XML datasource<br />

while the UniArab system is running<br />

pkg1: Tokenizer This class <strong>to</strong> split a sentence in<strong>to</strong> word <strong>to</strong>kens<br />

pkg1: VerbXMLWriter To write a verb in the datasource<br />

pkg1: Verb To hold verb attributes from the datasource<br />

171


Package: Name of Class Class Summary<br />

gui: AdjectivePanel<br />

gui: AdverbPanel<br />

gui: DemonstrativesPanel<br />

gui: NounPanel<br />

gui: OtherWordPanel<br />

gui: ProperNounPanel<br />

gui: VerbPanel<br />

This class <strong>to</strong> mange the adjective panel<br />

in the UniArabs lexicon interface<br />

This class <strong>to</strong> mange the adverb panel in the<br />

UniArabs lexicon interface<br />

This class <strong>to</strong> mange the demonstrative panel<br />

in the UniArabs lexicon interface<br />

This class <strong>to</strong> mange the noun panel in the<br />

UniArabs lexicon interface<br />

This class <strong>to</strong> mange the OtherWord panel<br />

in the UniArabs lexicon interface<br />

This class <strong>to</strong> mange the proper noun panel<br />

in the UniArabs lexicon interface<br />

This class <strong>to</strong> mange the verb panel in the<br />

UniArabs lexicon interface<br />

uniArab: PreUniArab This class <strong>to</strong> mange the POS of input words<br />

uniArab: UniArab This class <strong>to</strong> preparation of syntactic parser<br />

uniArab: GenerationLS This class <strong>to</strong> generate the logical structure<br />

syntaxGeneration: syntaxGeneration This class <strong>to</strong> mange the syntax generation<br />

generation<strong>English</strong>Morphology: pressTenseToBe<br />

Package: Name of Class Class Summary<br />

This class <strong>to</strong> mange the generation<br />

of target language morphology<br />

xml: AdjectiveDB.XML This is the adjectives s<strong>to</strong>red in the XML datasource<br />

xml: AdverbDB.XML This is the adverbs s<strong>to</strong>red in the XML datasource<br />

xml: DemonstrativeDB.XML This is the demonstratives s<strong>to</strong>red in the XML datasource<br />

xml: NounDB.XML This is the nouns s<strong>to</strong>red in the XML datasource<br />

xml: OtherWordDB.XML This is the OtherWords s<strong>to</strong>red in the XML datasource<br />

xml: ProperNounDB.XML This is the proper nouns s<strong>to</strong>red in the XML datasource<br />

xml: VerbDB.XML This is the verbs s<strong>to</strong>red in the XML datasource<br />

172

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!