A generic framework for Arabic to English machine ... - Acsu Buffalo
6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM
In UniArab we intend to have a strong analysis system that can extract all attributes from
the words in a sentence.
6.1.1 Technical architecture of the UniArab system<br />
The structure of the UniArab system in Figure 6.2 breaks down into the following phases:
Phase (1) - Arabic language sentence. The input to the system consists of one or more
sentences in Arabic.
Phase (2) - Sentence Tokenizer. Tokenization is the process of demarcating and classifying
sections of a string of input characters. In this phase the system splits the text
into sentence tokens. The resulting tokens are then passed to the word tokenizer
phase. For example, the input qrʾa ḫāld ālktāb. ḫāld tlmyḏ ḏky. will be split into two
tokens: qrʾa ḫāld ālktāb and ḫāld tlmyḏ ḏky. The translation of these two sentences is
"Khalid read the book. Khalid is a clever student."
Phase (3) - Word Tokenizer. Here, sentences are split into word tokens. For the sentence
qrʾa ḫāld ālktāb (Khalid read the book), the output of phase 3 is as follows:

qrʾa
ḫāld
ālktāb
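
The two tokenization phases described above can be sketched roughly as follows. This is a minimal illustration only, not the UniArab implementation: the function names `split_sentences` and `split_words` are hypothetical, and it assumes that sentences end at terminal punctuation and that words are separated by whitespace. The input is the transliterated example from the text.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Phase 2 sketch: split raw input into sentence tokens.

    Assumption: a sentence ends at '.', '!', '?' or the Arabic
    question mark (U+061F). Empty fragments are discarded.
    """
    parts = re.split(r"[.!?\u061F]+", text)
    return [part.strip() for part in parts if part.strip()]


def split_words(sentence: str) -> List[str]:
    """Phase 3 sketch: split one sentence token into word tokens.

    Assumption: words are separated by whitespace only.
    """
    return sentence.split()


if __name__ == "__main__":
    text = "qrʾa ḫāld ālktāb. ḫāld tlmyḏ ḏky."
    for sentence in split_sentences(text):
        # Each sentence token is handed on to the word tokenizer.
        print(sentence, "->", split_words(sentence))
```

Keeping the two steps as separate functions mirrors the phase structure: the sentence tokenizer's output is the word tokenizer's input, so each stage can be tested on its own.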
<br />
Phase (4) - Lexicon Datasource. A set of XML documents for each component category
of Arabic.
Phase (5) - Morphology Parser. Works directly with both the Lexicon and the Tokenizer to
produce the word order. A connection is made to the datasource of phase 4 which