Figure 2 shows the schematic diagram of a typical text mining system, as described above.

[Figure 2 - Architecture of Text Mining Systems. Labels in the figure: Corporate Databases; File Systems; Web Sites/HTML; News Feeds; Internal Documents; Other "Raw" Data; Unstructured Content; ClearTags Intelligent Suite (Intelligent Auto-Tagging); Semantic Tagging; Statistical Tagging; Structural Tagging; Rich XML/API; External Systems Integration; Workflow Systems; Business Intelligence Suites.]

Figure 3 gives a more detailed view of the intelligent tagging component. Each tagger uses a separate training module that is based on annotated examples. The training modules are discussed in more detail in the following sections. The training module for structural tagging produces document signatures, which are saved and then matched against new documents. The training module for statistical tagging produces a classifier for each category, and the training module for semantic tagging produces information extraction rules from annotated documents.

[Figure 3 - Detailed Description of the Intelligent Tagging Component. Labels in the figure: ClearTags Intelligent Suite (Intelligent Auto-Tagging); Tagging Controller; Rich XML/API; Semantic Tagger; Statistical Tagger; Structural Tagger; Fetcher; Unstructured Content; Rulebooks; Classifiers; Classifiers & Term Extraction; Structural Templates; Semantic Trainer; Clear Lab; Statistical Trainer; Structural Trainer.]
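To make the division of labor among the three taggers concrete, the following is a minimal Python sketch, not the ClearTags implementation. All class names, the data layouts, and the sample signatures, weights, and rule are hypothetical. It assumes that a document signature can be approximated by a set of marker strings, a per-category classifier by a table of term weights with a threshold, and an information extraction rule by a regular expression with named groups. A real system would learn all three from annotated examples.

```python
import re


class StructuralTagger:
    """Matches a document against saved signatures (assumed: sets of marker strings)."""

    def __init__(self, signatures):
        self.signatures = signatures  # template name -> set of lowercase markers

    def tag(self, text):
        lowered = text.lower()
        best_name, best_score = None, 0.0
        for name, markers in self.signatures.items():
            # Fraction of the signature's markers found in the document.
            score = sum(marker in lowered for marker in markers) / len(markers)
            if score > best_score:
                best_name, best_score = name, score
        return best_name


class StatisticalTagger:
    """Holds one toy classifier per category (assumed: summed term weights)."""

    def __init__(self, classifiers, threshold=1.0):
        self.classifiers = classifiers  # category -> {term: weight}
        self.threshold = threshold

    def tag(self, text):
        tokens = re.findall(r"[a-z]+", text.lower())
        categories = []
        for category, weights in self.classifiers.items():
            score = sum(weights.get(token, 0.0) for token in tokens)
            if score >= self.threshold:
                categories.append(category)
        return sorted(categories)


class SemanticTagger:
    """Applies extraction rules (assumed: regular expressions with named groups)."""

    def __init__(self, rules):
        self.rules = rules  # relation name -> compiled pattern

    def tag(self, text):
        facts = []
        for relation, pattern in self.rules.items():
            for match in pattern.finditer(text):
                facts.append((relation, match.groupdict()))
        return facts


class TaggingController:
    """Runs all three taggers on a document and merges their output."""

    def __init__(self, structural, statistical, semantic):
        self.structural = structural
        self.statistical = statistical
        self.semantic = semantic

    def tag(self, text):
        return {
            "structure": self.structural.tag(text),
            "categories": self.statistical.tag(text),
            "facts": self.semantic.tag(text),
        }


# Hypothetical trained artifacts; in practice these come from the training modules.
controller = TaggingController(
    StructuralTagger({
        "press_release": {"for immediate release", "contact:"},
        "memo": {"to:", "from:", "subject:"},
    }),
    StatisticalTagger({
        "mergers": {"acquire": 1.0, "merger": 1.0},
        "earnings": {"revenue": 1.0, "profit": 1.0},
    }),
    SemanticTagger({
        "acquisition": re.compile(
            r"(?P<buyer>[A-Z]\w+) (?:will acquire|acquired) (?P<target>[A-Z]\w+)"
        ),
    }),
)

result = controller.tag(
    "FOR IMMEDIATE RELEASE\nAcme will acquire Initech.\nContact: press office"
)
print(result)
```

The controller keeps the taggers independent of one another, so in this sketch each could be retrained or replaced without touching the other two, which mirrors the separate training modules described above.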
3. Visualizations and Analytics for Text Mining

When developing a text mining system, one of the crucial needs is the ability to browse through the document collection and to "visualize" the various elements within it. This kind of interactive exploration helps the user identify new types of entities and relationships that can be extracted, and explore the results of the information extraction phase more effectively. As an example, we use ClearResearch, a visualization tool by ClearForest Corporation. This tool lets the user visualize relationships between entities that were extracted from the documents. The system can display collocations between entities, or a semantic map showing entities that are related by any of a user-definable set of relationships. Demonstrations of this product can be viewed at http://www.clearforest.com/downloads/white_papers.asp.

4. Summary

Due to the abundance of available textual data, there is a growing need for efficient tools for text mining. Unlike structured data, where data mining algorithms can be run directly on the underlying data, textual data requires some preprocessing before a data mining algorithm can be applied successfully. Information Extraction has proved to be an efficient method for this first preprocessing phase. Text mining based on Information Extraction attempts to hit a midpoint, reaping some benefits from each of the extremes while avoiding many of their pitfalls. On the one hand, there is no need for human effort in labeling documents, and we are not constrained to a small set of labels that lose much of the information present in the documents. Thus the system can work on new collections without any preparation, and can merge several distinct collections into one (even though they might have been tagged according to different guidelines, which would prohibit their merger in a tag-based system).
On the other hand, the number of meaningless results is greatly reduced, and the execution time of the mining algorithms is also reduced relative to pure word-based approaches. Text mining using Information Extraction thus hits a useful middle ground in the quest for tools for understanding the information present in the large amount of data that is available only in textual form. The powerful combination of precise document analysis and a set of visualization tools enables the user to easily navigate and utilize very large document collections.

For more information about the ClearResearch product, please contact: Mark McCarthy, (212) 432-1515, ext. 211, www.clearforest.com