gene pathway text mining and visualization - Artificial Intelligence ...

More documents

Recommendations

Info

528 MEDICAL INFORMATICSsyntactic phenomena. Using over 150 new tags with a regular grammareliminates the problem of over generating parses. Two different parsingrules are not allowed to act on the same input token sequence. Only oneparse tree is generated for each sentence, with only two levels beinganalyzed for relation extraction. The particulars of the parsing process nowfollow. Output of the parsing steps is shown in Figure 18-1 below. Theboxed numbers on the left in the figure correspond to the subsectionnumbers listed beside the text below.Table 18-1. A sample of tags used by the Arizona Relation ParserSelection MethodTagsUnique tags in our lexicon obtained by observingambiguous tags from the PENN TREE BANKDomain relevant noun classesPENN TREE BANK syntax tags3.1.1 Sentence and Word BoundariesBE, GET, DO, KEEP, MAKE, INCD (include),COV (cover), HAVE, INF (infinitive), ABT(about), ABOV, ACROS, AFT (after), AGNST,AL (although), AMG (among), ARD (around),AS, AT, BEC, BEF (before) , BEL (below), BTN(between), DUR (during), TO, OF, ON, OPP(neg/opposite ), OVR (over), UNT (until), UPN(upon), VI (via), WAS (whereas), WHL (while),WI (with), WOTDATE, PRCT, TIME, GENE, LOCATION,PERSON, ORGANIZATIONIN, NP, VBD, VBN, VBG, VP, NN, NNP, NNS,NNPS, PRP, PRP$, RB, RBRThe parsing process begins with tokenization, where word and sentenceboundaries are recognized. The sentence splitting relies on a lexicon of 210common abbreviations and rules to recognize new abbreviations. Documentsare tokenized generally according to the PENN TREE BANK tokenizingrules. In addition, words are also split on hyphens, a practice commonlyperformed in the bioinformatics domain (Gaizauskas et al., 2003). A phrasetagger based on a finite state automaton (FSA) is also run during this step torecognize words that are best tokenized as phrases. Common idiomatic anddiscourse phrases (i.e. “for example” and “on the other hand”) are groupedtogether in this step along with other compound lexemes, such as compoundgene names. Such phrases receive a single initial tag, despite being made upof multiple words.3.1.2 Arizona Part-of-Speech (POS)/Semantic TaggerWe developed a Brill-style tagger (Brill, 1993) written in Java andtrained on the Brown and Wall Street Journal corpora. The tagger was alsotrained using 100 PubMed abstracts and its lexicon was augmented by the
Gene Pathway Text Mining and Visualization 529words and tags from the GENIA corpus (Ohta et al., 2002). The tags used tomark tokens include the 150 new tags (generated by methods describedearlier) along with the original tags from the PENN TREE BANK tag set.3.1.3 Hybrid ParsingFigure 18-1. An example of a hybrid parse tree.In this step word and phrases are combined into larger phrase classesconsistent with the role the phrases play in the sentence. The parser outputappears in boxed-number 3 shown in Figure 18-1. We attempt to address thepoor grammar coverage problem by relaxing some of the parsingassumptions made by full-sentence parsers. For example, full sentenceparsing up to a root node of a binary branching tree is not required becausewe extract semantic triples. Thus, ARP uses a shallow parse structure with n-ary branching, such that any number of tokens up to 24 can be combinedinto a new node on the tree. Internally, a cascade of five finite state automata
Page 1 and 2: Chapter 18GENE PATHWAY TEXT MINING
Page 3 and 4: Gene Pathway Text Mining and Visual
Page 9: Gene Pathway Text Mining and Visual
Page 14 and 15: 532 MEDICAL INFORMATICSaugment, eli
Page 16 and 17: 534 MEDICAL INFORMATICS3.2 GeneScen
Page 18 and 19: 536 MEDICAL INFORMATICSinformation
Page 20 and 21: 538 MEDICAL INFORMATICS3.3 GeneScen
Page 22 and 23: 540 MEDICAL INFORMATICSproliferativ
Page 24 and 25: 542 MEDICAL INFORMATICSanalysis and
Page 26 and 27: 544 MEDICAL INFORMATICSPurchase, H.
Page 28: 546 MEDICAL INFORMATICS3. How can b

gene pathway text mining and visualization - Artificial Intelligence ...

Create successful ePaper yourself

Delete template?

Save as template?