14.06.2013 Views

Databases and Systems


In SRS, databank structures are represented by Backus-Naur Form (BNF) abstract syntax trees or grammars, as described by Wirth [6]. BNF grammatical rules consist of terminal and non-terminal definitions. A non-terminal represents a rule and can in turn be subdivided into terminals and non-terminals. Terminals and non-terminals group together to form a production. We call a piece of the input stream corresponding to a single terminal a token, and a set of tokens a token-list. Biological databank structures generally use the following scheme: a databank consists of a sequence of entries, an entry is made of data-fields, and a data-field consists of tokens. Tokens are the parts of the input that are parsed by terminals or non-terminals. For instance, a parser for a databank in an EMBL-like format might look like:

databank = {entry}
entry = id_line
        {ra_line | de_line | oc_line}
        end_line
id_line = 'ID' id
de_line = {'DE' word_list}
...
end_line = '//'
word_list = {word}
word = /[a-z]+/

This might operate on a text entry of the (EMBL-like) form:

ID HS40428 1 standard; DNA; HUM; 186 BP.
DE Human tumor suppressor (p53) gene, exon 3
OC Eukaryota; Metazoa; Chordata; Vertebrata
OC Mammalia; Eutheria; Primates
RA Herrmann M., El-Maghrabi R. E., Abumrad N.N.
//

This grammar defines the databank as a list of entries. The first line of each entry should be the ID-line, followed by zero or more data lines, and the last line is the string ‘//’. The non-terminals id_line, de_line and oc_line can be recursively parsed and finally indexed.
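The structure described above can be sketched as a small line-oriented parser. This is a minimal illustration in Python, not Icarus syntax and not the SRS implementation. The function names (`parse_databank`, `parse_entry`, `split_line`) and the result layout are hypothetical. It assumes that data lines may appear in any order and that words may contain upper-case letters, which is wider than the `/[a-z]+/` pattern in the grammar.

```python
import re

# Line codes of the data lines named in the grammar (assumed set).
DATA_CODES = ("RA", "DE", "OC")


def split_line(line):
    """Split a line into its leading code and the remaining text."""
    code, _, rest = line.partition(" ")
    return code, rest.strip()


def parse_entry(lines, pos):
    """entry = id_line, zero or more data lines, end_line.

    Returns the parsed entry and the position of the next unread line.
    """
    code, rest = split_line(lines[pos])
    if code != "ID" or not rest:
        raise ValueError(f"expected an ID line, got {lines[pos]!r}")
    # id_line = 'ID' id: take the first word after the code as the id.
    entry = {"id": rest.split()[0], "fields": []}
    pos += 1
    while pos < len(lines) and lines[pos] != "//":
        code, rest = split_line(lines[pos])
        if code not in DATA_CODES:
            raise ValueError(f"unknown line code in {lines[pos]!r}")
        # word_list = {word}: every alphabetic run becomes one token.
        entry["fields"].append((code, re.findall(r"[A-Za-z]+", rest)))
        pos += 1
    if pos == len(lines):
        raise ValueError("missing end_line '//'")
    return entry, pos + 1  # skip the end_line


def parse_databank(text):
    """databank = {entry}: parse every entry in the text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries, pos = [], 0
    while pos < len(lines):
        entry, pos = parse_entry(lines, pos)
        entries.append(entry)
    return entries
```

Calling `parse_databank` on the sample entry above would yield one entry whose `id` is `HS40428`, with one token-list per data line. A real indexer would keep punctuation-aware tokens rather than bare alphabetic runs.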

Icarus parsers not only recognize entries, but can also perform semantic actions during the parsing process. To produce output, any terminal or non-terminal can be associated with one or more action commands: creating a token, extending (adding text to) existing tokens, setting global states of the parsing process, input/output directives, print commands, variable assignments and function calls.
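The idea of attaching actions to rules can be illustrated with a small sketch. It is written in Python rather than Icarus, and every name in it (`ActionParser`, `on`, `match`, `create_token`, `extend_token`, `set_state`) is hypothetical. It models only three of the action kinds listed above and assumes that some recognizer calls `match` whenever a rule has consumed a piece of input.

```python
from collections import defaultdict


class ActionParser:
    """Holds the actions registered per rule and the output they produce."""

    def __init__(self):
        self.actions = defaultdict(list)  # rule name -> list of callables
        self.tokens = {}                  # token name -> token text
        self.state = {}                   # global state of the parsing process

    def on(self, rule, action):
        """Associate one more action with a terminal or non-terminal."""
        self.actions[rule].append(action)

    def match(self, rule, text):
        """Run the rule's actions, in registration order, on the matched text."""
        for action in self.actions[rule]:
            action(self, text)


def create_token(name):
    """Action: create (or overwrite) a token holding the matched text."""
    def action(parser, text):
        parser.tokens[name] = text
    return action


def extend_token(name, sep=" "):
    """Action: add the matched text to an existing token, creating it if absent."""
    def action(parser, text):
        old = parser.tokens.get(name)
        parser.tokens[name] = text if old is None else old + sep + text
    return action


def set_state(key, value):
    """Action: set a global state of the parsing process."""
    def action(parser, text):
        parser.state[key] = value
    return action
```

For example, registering `extend_token("taxonomy")` on a rule named `oc_line` and calling `match` once per OC line would accumulate the two OC lines of the sample entry into a single token.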

Parsers generally parse all they can, i.e. they decompose the input starting with the root production and continue recursively until only terminals remain. In SRS, this scheme is referred to as forced parsing. This is in opposition to lazy parsing, in
