13.07.2015 Views

gene pathway text mining and visualization - Artificial Intelligence ...

gene pathway text mining and visualization - Artificial Intelligence ...

gene pathway text mining and visualization - Artificial Intelligence ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

528 MEDICAL INFORMATICSsyntactic phenomena. Using over 150 new tags with a regular grammareliminates the problem of over <strong>gene</strong>rating parses. Two different parsingrules are not allowed to act on the same input token sequence. Only oneparse tree is <strong>gene</strong>rated for each sentence, with only two levels beinganalyzed for relation extraction. The particulars of the parsing process nowfollow. Output of the parsing steps is shown in Figure 18-1 below. Theboxed numbers on the left in the figure correspond to the subsectionnumbers listed beside the <strong>text</strong> below.Table 18-1. A sample of tags used by the Arizona Relation ParserSelection MethodTagsUnique tags in our lexicon obtained by observingambiguous tags from the PENN TREE BANKDomain relevant noun classesPENN TREE BANK syntax tags3.1.1 Sentence <strong>and</strong> Word BoundariesBE, GET, DO, KEEP, MAKE, INCD (include),COV (cover), HAVE, INF (infinitive), ABT(about), ABOV, ACROS, AFT (after), AGNST,AL (although), AMG (among), ARD (around),AS, AT, BEC, BEF (before) , BEL (below), BTN(between), DUR (during), TO, OF, ON, OPP(neg/opposite ), OVR (over), UNT (until), UPN(upon), VI (via), WAS (whereas), WHL (while),WI (with), WOTDATE, PRCT, TIME, GENE, LOCATION,PERSON, ORGANIZATIONIN, NP, VBD, VBN, VBG, VP, NN, NNP, NNS,NNPS, PRP, PRP$, RB, RBRThe parsing process begins with tokenization, where word <strong>and</strong> sentenceboundaries are recognized. The sentence splitting relies on a lexicon of 210common abbreviations <strong>and</strong> rules to recognize new abbreviations. Documentsare tokenized <strong>gene</strong>rally according to the PENN TREE BANK tokenizingrules. In addition, words are also split on hyphens, a practice commonlyperformed in the bioinformatics domain (Gaizauskas et al., 2003). A phrasetagger based on a finite state automaton (FSA) is also run during this step torecognize words that are best tokenized as phrases. Common idiomatic <strong>and</strong>discourse phrases (i.e. “for example” <strong>and</strong> “on the other h<strong>and</strong>”) are groupedtogether in this step along with other compound lexemes, such as compound<strong>gene</strong> names. Such phrases receive a single initial tag, despite being made upof multiple words.3.1.2 Arizona Part-of-Speech (POS)/Semantic TaggerWe developed a Brill-style tagger (Brill, 1993) written in Java <strong>and</strong>trained on the Brown <strong>and</strong> Wall Street Journal corpora. The tagger was alsotrained using 100 PubMed abstracts <strong>and</strong> its lexicon was augmented by the

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!