26.12.2013 Views

A computational grammar and lexicon for Maltese


e argued that this separation is unnecessary, and all the root information can be stored in the lexemes (de-normalised).

In addition to the above, all entries optionally have a source field which indicates the citation reference of the source of the entry. This exists so that the information stored in the database does not become disconnected from its original authors (the sources entity is not shown in the schema diagram).

[Schema diagram: three entities and their fields]

roots: radicals, type, variant (Int)

lexemes: lemma, pos, gloss, form (Int), transitivity, hypothetical (Bool), frequency (Bool), ...

wordforms: surface_form, gender, number, aspect, subject, do, io, polarity, ...

Figure 3.1: Basic data schema used in the lexicon. Each box shows (in order) the entity name, the fields that are always present, and optional fields that only occur in some documents. All fields are of type string unless marked. Primary and foreign key fields are omitted.

Realisation<br />

The database engine chosen for implementing this lexicon is MongoDB [1], a document-oriented system which stores data as JSON-style documents [2] with dynamic schemas. The great advantage of this system is that documents in the same collection can have different structures, which is exactly what is needed in this project. This makes it very easy to add new features to any entry without changing the table schema and ending up with lots of null fields.
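As a rough illustration of documents with different structures sharing one collection, the sketch below models a collection as a plain Python list of dictionaries rather than a live MongoDB instance. The field names follow Figure 3.1, but the sample values (including the `"V"`/`"N"` tags and `"trans"`) and the `find` helper are assumptions made for this example only, not the lexicon's actual contents or MongoDB's API.

```python
# A "collection" modelled as a plain list of dicts (illustrative stand-in
# for a MongoDB collection; all values below are hypothetical samples).
lexemes = [
    # Verb entry: carries verb-specific optional fields.
    {"lemma": "kiteb", "pos": "V", "gloss": "write", "form": 1,
     "transitivity": "trans"},
    # Noun entry: same collection, different set of fields.
    {"lemma": "ktieb", "pos": "N", "gloss": "book"},
]


def find(collection, **criteria):
    """Return documents whose fields equal all given criteria.

    A document that lacks a queried field simply does not match,
    so no null placeholders are needed for inapplicable features.
    """
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value
               for key, value in criteria.items())
    ]


verbs_form_1 = find(lexemes, form=1)   # matches only the verb entry
nouns = find(lexemes, pos="N")         # matches only the noun entry

# The set of fields in use across the collection is just the union of
# whatever each document happens to carry.
fields_used = sorted({key for doc in lexemes for key in doc})
```

Adding a new feature to a single entry here amounts to adding a key to one dictionary; the other documents are unaffected.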

A side effect of this, however, is that the database engine does not itself handle joins, as a traditional RDBMS would. Thus when modelling data using MongoDB, data de-normalisation is often recommended [3]. This means that rather than splitting data between multiple tables and depending on foreign keys and joins, the same data may be stored in multiple locations. This adds some maintenance overhead and introduces a risk of data inconsistency, but makes data retrieval and searching easier and faster.
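The trade-off described above can be sketched in plain Python. The example contrasts a normalised layout, where lexemes point at roots through a key, with a de-normalised one, where each lexeme carries its own copy of the root data. The `root_id` key, the embedded `root` sub-document, the sample values, and all function names are hypothetical and chosen only for illustration; the text does not specify how the actual lexicon lays these out.

```python
# Normalised layout: lexemes refer to roots by key, so a search by
# radicals needs a lookup step (an application-side "join").
roots = {"r1": {"radicals": "k-t-b"}}
lexemes_normalised = [
    {"lemma": "kiteb", "pos": "V", "root_id": "r1"},
    {"lemma": "ktieb", "pos": "N", "root_id": "r1"},
]


def lemmas_by_radicals_joined(radicals):
    """Two steps: find matching root keys, then lexemes pointing at them."""
    ids = {rid for rid, root in roots.items() if root["radicals"] == radicals}
    return [lex["lemma"] for lex in lexemes_normalised
            if lex["root_id"] in ids]


# De-normalised layout: each lexeme embeds its own copy of the root data.
lexemes_denormalised = [
    {"lemma": "kiteb", "pos": "V", "root": {"radicals": "k-t-b"}},
    {"lemma": "ktieb", "pos": "N", "root": {"radicals": "k-t-b"}},
]


def lemmas_by_radicals_embedded(radicals):
    """One step: filter lexemes directly on the embedded copy."""
    return [lex["lemma"] for lex in lexemes_denormalised
            if lex["root"]["radicals"] == radicals]


def update_root(old, new):
    """Maintenance cost: every embedded copy has to be rewritten.

    Missing one copy would leave the data inconsistent.
    Returns the number of documents changed.
    """
    changed = 0
    for lex in lexemes_denormalised:
        if lex["root"]["radicals"] == old:
            lex["root"]["radicals"] = new
            changed += 1
    return changed
```

Both lookup functions return the same lemmas for the same radicals; the embedded version avoids the intermediate lookup, while `update_root` shows where the extra maintenance work comes from.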

[1] http://www.mongodb.org/, accessed 2013-09-05

[2] Note about MongoDB nomenclature: a “collection” can be thought of as a table, a “document” is an entry in a collection, and a “field” is analogous to a table column. The concept of a schema is not enforced at all, and the term is used loosely.

[3] See http://docs.mongodb.org/manual/core/data-modeling/, accessed 2013-09-03
