12.07.2015

NooJ, 1/4 Max Silberztein, Université de Franche-Comté, France ...



NooJ

Max Silberztein, Université de Franche-Comté, France
www.nooj4nlp.net

NooJ is linguistic software that processes texts and corpora in real time.

NooJ reads more than 100 file formats, including all variants of ASCII, ISO, Unicode and HTML. NooJ reads XML documents and can import XML tags into its Text Annotation System.

[Figure: NooJ processes corpora in 100+ file formats]

NooJ is mainly used as a corpus processor (to build concordances) and as an information extractor for search engines, text mining and competitive intelligence applications.

Each NooJ language module contains large-coverage dictionaries and grammars. Dictionaries and grammars are applied to corpora in order to locate morphological, lexical and syntactic patterns, as well as to build complex concordances. Thanks to its large-coverage linguistic resources, NooJ's query system can retrieve complex patterns, such as:

(will + shall) <ADV>* <V+INF> + <be> <ADV>* (about + going) to <V+INF>

which retrieves any sentence in the future (e.g. John will come) or in the near future (e.g. John is going to come). The symbol <be> matches any conjugated form of “to be” (e.g. “was”); <ADV> matches any adverb (e.g. “never”, “often”), including compound adverbs such as “as far as possible”, e.g. in:

Where relevant the two administrations will as far as possible abide by the Concordats ...

<V+INF> matches all verbs in the infinitive, etc.

Another typical query matches all nouns in the singular (including compound nouns such as customer service) followed by a verb conjugated in the preterit, followed by an expression of date, followed by “in”, followed by a word form in uppercase.

The Text Annotation System

NooJ's linguistic engine manages annotations. An annotation is any piece of information (lexical, morphological, syntactic or semantic) that is associated with a sequence of text. All annotations are stored in the Text's Annotation Structure (TAS), which is kept synchronized with the text.

NooJ's dictionaries and grammars are applied to corpora in order to add annotations to (or filter out annotations from) the text's annotation structure. NooJ can also import information from (or export it to) XML documents.

NooJ includes tools to create, edit, test, debug and manage dictionaries and grammars.

Dictionaries

NooJ dictionaries contain both simple words (e.g. “a meeting”) and compounds (e.g. “a round table”), without distinction. NooJ's inflectional (e.g. verb conjugations) and derivational (e.g. nominalization) engine automatically links all word forms to their lemmas.

NooJ can process languages with heavy morphology, such as Hungarian, as well as Semitic languages (with automatic vowelization) and Asian languages (with automatic tokenization).

Large NooJ dictionaries are already available for Arabic, Armenian, Bulgarian, Chinese, English, French, Hebrew, Hungarian, Italian and Spanish, with more to come soon.

Grammars

NooJ grammars are organized sets of graphs, used to recognize syntactic constructs (e.g. noun phrases, verb groups, etc.), entities (names of persons, companies, products, expressions of locations, addresses, dates, durations, etc.), as well as complex technical expressions (e.g. Protein X inhibits Gene Y), etc.

Grammars can be applied to corpora, just like other queries, and the result is usually displayed in a concordance:

[Figure: Look for Date in a text]

Simple extractors, such as one that recognizes all references to Canada (Canada, Canadians, Ontario, Montréal, etc.), are typically built in a few minutes. Hundreds of graphs have already been developed by NooJ users.

Our project is to develop thousands of such reusable graphs for each language:
-- morphological grammars to recognize derivations and synonymous expressions;
-- syntactic grammars to recognize noun phrases and verb groups;
-- semantic grammars to recognize entities, such as names of persons, of companies or products, locations, addresses, technical expressions, etc.
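A query for the future and near future mixes literal words ("will", "going", "to") with symbols that match on a lemma or a grammatical category rather than on a fixed word form. The sketch below illustrates that idea only. It assumes a tiny hand-made lexicon, and the names `LEXICON`, `labels`, `match_at` and `find_matches` are hypothetical; none of this is NooJ's engine or API, and compound adverbs such as "as far as possible" are not handled.

```python
# Hypothetical mini-lexicon: word form -> symbols it satisfies.
# Entries are illustrative, not taken from a NooJ dictionary.
LEXICON = {
    "is": {"<be>"}, "was": {"<be>"}, "are": {"<be>"},
    "never": {"<ADV>"}, "often": {"<ADV>"},
    "come": {"<V+INF>"}, "leave": {"<V+INF>"},
}


def labels(token):
    """Return the literal word form plus any symbols the lexicon assigns to it."""
    word = token.lower()
    return {word} | LEXICON.get(word, set())


# Each step is (accepted labels, repeat zero or more times).
FUTURE = [({"will", "shall"}, False), ({"<ADV>"}, True), ({"<V+INF>"}, False)]
NEAR_FUTURE = [
    ({"<be>"}, False),
    ({"<ADV>"}, True),
    ({"about", "going"}, False),
    ({"to"}, False),
    ({"<V+INF>"}, False),
]


def match_at(tokens, start, pattern):
    """Return the end index of a match starting at `start`, or None.

    Starred steps are consumed greedily without backtracking, which is
    enough here because adverbs never satisfy the step that follows them.
    """
    pos = start
    for accepted, star in pattern:
        if star:
            while pos < len(tokens) and labels(tokens[pos]) & accepted:
                pos += 1
        else:
            if pos >= len(tokens) or not (labels(tokens[pos]) & accepted):
                return None
            pos += 1
    return pos


def find_matches(text):
    """Return every token sequence matched by either pattern."""
    tokens = text.split()
    hits = []
    for i in range(len(tokens)):
        for pattern in (FUTURE, NEAR_FUTURE):
            end = match_at(tokens, i, pattern)
            if end is not None:
                hits.append(" ".join(tokens[i:end]))
    return hits
```

Calling `find_matches("John is never going to leave")` with this toy lexicon returns the single sequence "is never going to leave", because "is" satisfies the lemma symbol and "never" is absorbed by the starred adverb step.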


[Figure: Concordance for a query pattern]

[Figure: The DATE grammar (structure, a graph, contract)]
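A concordance lists each match of a query together with its left and right context. As a minimal sketch of that layout, assuming whitespace tokenization and a single-word query (the function name `concordance` and the `width` parameter are hypothetical and do not reflect how NooJ builds its concordances):

```python
def concordance(text, keyword, width=3):
    """Return (left context, match, right context) rows for a single keyword.

    Tokens are split on whitespace; trailing punctuation is ignored when
    comparing a token to the keyword, but kept in the displayed context.
    """
    tokens = text.split()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    sample = "The meeting was held on Monday in Paris. Another meeting followed."
    for left, match, right in concordance(sample, "meeting", width=2):
        # Right-align the left context so the matches line up in one column.
        print(f"{left:>20} | {match} | {right}")
```

A real corpus processor would match annotated sequences (for example, every date expression recognized by a grammar) rather than one literal word, but the three-column presentation is the same.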


[Figure: Concordance for the graph “Canada” in the ProMED medical database]

[Figure: Canada in ProMED (Frequency, Standard Deviation; x-axis: News Bytes)]
