gene pathway text mining and visualization - Artificial Intelligence ...

536 MEDICAL INFORMATICS

information correctly. On average, there were 13 relations extracted per abstract and 90 percent were correct.

2. FSA-specific Results

We also report whether the relations were extracted by the appropriate FSA and calculate precision and recall (see Table 18-5) and coverage. All details of this study can be found in (Leroy et al., 2003).

Table 18-5. Genescene: Precision and Recall

FSA         Total     Total       Total     Precision   Recall
            Correct   Extracted   in Text   (%)         (%)
Relationships:
  BS-FSA        8        15          23        53         35
  OF-FSA      145       157         203        92         71
  BY-FSA       15        17          24        88         63
  IN-FSA       11        13          37        85         30
  All         179       202         287        89         62
Conjunctions:
  BS-FSA        1         1           1       100        100
  OF-FSA       10        10          22       100         45
  BY-FSA        0         0           1         -          0
  IN-FSA        1         1           6       100         16
  All          12        12          30       100         40

Precision was calculated by dividing the number of correct relations by the total number of extracted relations. The correct relations are those relations considered correct by the researchers, as described above, but with the additional restriction that they need to be extracted by the appropriate FSA. This is a more strict evaluation. Recall was calculated as the number of correct relations divided by the total number of relations available in the text. Only those relations that could have been captured with the described FSA were considered.

There were 267 relations (excludes the conjunctional copies) extracted from the abstracts. Overall, we achieved 62 percent recall of the described patterns and 89 percent precision. The numbers varied by FSA. The highest recall was found for the OF-FSA and the lowest for the IN-FSA where a relation was considered missing when any noun phrase introduced by “in” was missing. These relations were often extracted by another FSA but considered incorrect here. Many of the errors were due to incomplete noun phrases, e.g., a missing adjective.

To evaluate conjunctions, we counted all relations where a conjunction was part of the FSA.
Conjunctions where the elements needed recombination, e.g., “breast and ovarian cancer,” were not counted since we explicitly avoid them. A conjunction was considered correct if each constituent was correctly placed in the FSA. The conjunctions were either correctly extracted (100 percent precision) or ignored. This adds a few selective relations without introducing any errors.

To learn the coverage of the FSA, we counted all occurrences of “by,” “of,” and “in,” with a few exceptions such as “in addition,” which are explicitly disregarded by the parser because they result in irrelevant relations. Seventy-seven percent of all “of” prepositions, 29 percent of all “by” prepositions, and 14 percent of all “in” prepositions were correctly captured. This indicates that the OF-FSA is relatively complete for biomedical text. The BY-FSA and IN-FSA cover a smaller portion of the available structures.

3.2.5 Ontology and Concept Space Integration

1. Additional Genescene Components

We parsed more than 100,000 PUBMED abstracts related to p53, ap1, and yeast. The parser processes 15 abstracts per second on a regular desktop computer. We stored all relations and combined them with Concept Space (Chen and Lynch 1992), a co-occurrence based semantic network, in Genescene. The two techniques extract complementary biomedical relations: the parser extracts precise, semantically rich relations and Concept Space extracts co-occurrence relations. The Gene Ontology (Ashburner et al., 2000), the Human Genome Nomenclature (Wain et al., 2002), and the UMLS were used to tag terms. More than half of the terms received a tag. The UMLS provided most tags (57 percent); GO (1 percent) and HUGO (0.5 percent) provided fewer.

2. Results of Ontology Integration

In an additional user study, two researchers evaluated terms and relations from abstracts of interest to them. The results showed very high precision of the terms (93 percent) and parser relations (95 percent). Concept Space relations with terms found in the ontologies were more precise (78 percent) than those without (60 percent). Terms with more specific tags, e.g., from GO versus the UMLS, were evaluated as more relevant. Parser relations were more relevant than Concept Space relations.
Details of this system and study can be found in (Leroy and Chen, in press).

3.2.6 Conclusion

This study described a parser based on closed-class English words that efficiently captures relations between noun phrases in biomedical text. Relations are specified with syntactic constraints and described as FSAs but may contain any verb, noun, or noun phrase. On average, the extracted relations are more than 90 percent correct. The parser is very efficient, and larger collections have been parsed and combined with the UMLS, GO, HUGO, and a semantic network called Concept Space. This facilitates integration.
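
As a minimal sketch of how the precision and recall percentages reported in Table 18-5 follow from the raw counts, the Python snippet below recomputes the Relationships rows. The function names, the dictionary layout, and the half-up rounding rule are assumptions made for illustration only; they are not part of Genescene or of the original study.

```python
# Counts copied from the "Relationships" rows of Table 18-5:
# (total correct, total extracted, total in text).
RELATION_COUNTS = {
    "BS-FSA": (8, 15, 23),
    "OF-FSA": (145, 157, 203),
    "BY-FSA": (15, 17, 24),
    "IN-FSA": (11, 13, 37),
}


def percent(numerator, denominator):
    """Whole-number percentage, rounded half up (an assumed rounding rule)."""
    return (200 * numerator + denominator) // (2 * denominator)


def precision_recall(correct, extracted, in_text):
    """Precision = correct / extracted; recall = correct / available in text."""
    return percent(correct, extracted), percent(correct, in_text)


def summarize(counts):
    """Per-FSA precision and recall, plus an 'All' row from the column sums."""
    rows = {name: precision_recall(*c) for name, c in counts.items()}
    totals = tuple(sum(column) for column in zip(*counts.values()))
    rows["All"] = precision_recall(*totals)
    return rows


if __name__ == "__main__":
    for name, (p, r) in summarize(RELATION_COUNTS).items():
        print(f"{name}: precision {p}%, recall {r}%")
```

The "All" row is computed from the summed counts rather than by averaging the per-FSA percentages, so the OF-FSA, which contributes most of the relations, dominates the overall figures.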
