22.08.2013 Views

A generic framework for Arabic to English machine ... - Acsu Buffalo


6.1. CONCEPTUAL STRUCTURE OF THE UNIARAB SYSTEM

In UniArab we intend to have a strong analysis system that can extract all attributes from the words in a sentence.

6.1.1 Technical architecture of the UniArab system

The structure of the UniArab system in Figure 6.2 breaks down into the following phases:

Phase (1) - Arabic language sentence. The input to the system consists of one or more sentences in Arabic.

Phase (2) - Sentence Tokenizer. Tokenization is the process of demarcating and classifying sections of a string of input characters. In this phase the system splits the text into sentence tokens. The resulting tokens are then passed to the word tokenizer phase. For example, the input qrʾa ḫāld ālktāb. ḫāld tlmyḏ ḏky. is split into two tokens: qrʾa ḫāld ālktāb and ḫāld tlmyḏ ḏky. The translation of these two sentences is "Khalid read the book. Khalid is a clever student."
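The sentence-splitting step can be illustrated with a minimal Python sketch. It is not the UniArab implementation. The function name `split_sentences` and the choice of terminators (Latin `.`, `!`, `?`, plus the Arabic question mark U+061F and Arabic full stop U+06D4) are assumptions made for illustration only.

```python
import re

# Hypothetical set of sentence terminators; the source does not specify
# which punctuation marks UniArab treats as sentence boundaries.
SENTENCE_END = re.compile(r"[.!?\u061F\u06D4]+")


def split_sentences(text: str) -> list[str]:
    """Split raw text into sentence tokens, dropping empty fragments."""
    parts = SENTENCE_END.split(text)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    # Transliterated example from the text above.
    for sentence in split_sentences("qrʾa ḫāld ālktāb. ḫāld tlmyḏ ḏky."):
        print(sentence)
```

Each returned string corresponds to one sentence token that would be handed on to the word tokenizer phase.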

Phase (3) - Word Tokenizer. In this phase, sentences are split into word tokens. For the sentence qrʾa ḫāld ālktāb (Khalid read the book), the output of phase 3 is as follows:

qrʾa

ḫāld

ālktāb
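As a rough illustration of word tokenization, the sketch below splits one sentence token on whitespace and strips surrounding punctuation. It is an assumption-laden example rather than the UniArab code: the function name `split_words` and the punctuation set (which includes the Arabic comma U+060C, semicolon U+061B and question mark U+061F) are hypothetical.

```python
# Hypothetical punctuation to strip from the edges of each word token.
EDGE_PUNCTUATION = ".,;:!?\u060C\u061B\u061F"


def split_words(sentence: str) -> list[str]:
    """Split one sentence token into word tokens on whitespace."""
    words = (word.strip(EDGE_PUNCTUATION) for word in sentence.split())
    return [word for word in words if word]


if __name__ == "__main__":
    # Transliterated example sentence: "Khalid read the book".
    for word in split_words("qrʾa ḫāld ālktāb"):
        print(word)
```

Whitespace splitting is enough for this example; clitics written attached to a word, such as the definite article in ālktāb, stay inside the token in this sketch.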

Phase (4) - Lexicon Datasource. A set of XML documents for each component category of Arabic.

Phase (5) - Morphology Parser. Works directly with both the Lexicon and the Tokenizer to produce the word order. A connection is made to the datasource of phase 4 which
