Dr. Vasant Honavar Department of Computer Science honavar@cs ...
(c) The development of a theoretically sound approach to formulation and execution of statistical queries across
semantically heterogeneous data sources [Reinoso-Castillo et al., 2003; Caragea et al., 2004b]. This work has
shown how to use semantic correspondences and mappings, specified by users, from a set of terms and
relationships among terms (user ontology) to terms and relations in data source specific ontologies to construct a
sound procedure for answering queries for sufficient statistics needed for learning classifiers from semantically
heterogeneous data. An important component of this work is the development of statistically sound
approaches to handling data specified at different levels of abstraction across different data sources.
(d) The design and implementation of INDUS – a modular, extensible, open source software toolkit for data-driven
knowledge acquisition from large, distributed, autonomous, semantically heterogeneous data sources (in
collaboration with two other members of my laboratory). INDUS enables users (e.g., biologists with little expertise in
programming or computing, but with an understanding of specific data repositories, e.g., the SWISSPROT protein
databank) to rapidly and flexibly construct classifiers (e.g., for functional annotation of proteins) using data from
multiple physically distributed, semantically heterogeneous data repositories.
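As a rough illustration of the statistical-query idea in item (c), the following minimal Python sketch answers a count query (counts being sufficient statistics for, e.g., a Naive Bayes classifier) across two data sources that use different vocabularies. Everything here is hypothetical: the source names, the toy labels, the mappings, and the functions `local_counts` and `count_query` are invented for illustration and are not the INDUS API. The sketch also assumes that each mapping is a simple one-to-one renaming of source terms into user-ontology terms.

```python
from collections import Counter

# Hypothetical mappings from each source's local terms to user-ontology terms.
MAPPINGS = {
    "source_a": {"enzyme": "Enzyme", "non-enzyme": "NonEnzyme"},
    "source_b": {"EC": "Enzyme", "other": "NonEnzyme"},
}

# Toy data: each source stores class labels in its own vocabulary.
SOURCES = {
    "source_a": ["enzyme", "enzyme", "non-enzyme"],
    "source_b": ["EC", "other", "other", "EC"],
}

def local_counts(source_name):
    """Answer a count query at one source, expressed in user-ontology terms."""
    mapping = MAPPINGS[source_name]
    return Counter(mapping[label] for label in SOURCES[source_name])

def count_query():
    """Combine per-source counts; only counts, not raw records, are pooled."""
    total = Counter()
    for name in SOURCES:
        total += local_counts(name)
    return total

counts = count_query()
print(counts["Enzyme"], counts["NonEnzyme"])
```

In this toy setting the learner never sees the raw records; it only receives counts that have already been translated into the user's vocabulary.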
Research in progress is aimed at:
(a) Extending algorithms that are being developed by our group as well as others for learning from multiple
relational databases to work with semantically heterogeneous data sources, taking advantage of the capability
of INDUS to view heterogeneous information sources as though they were a collection of relational databases.
(b) Extending the ontology-based approach to information integration to develop ontology-based frameworks for
composition of autonomously developed components and services using emerging frameworks for data source
(or more generally resource or service) description and registry services that are being developed as part of the
Semantic Web efforts.
(c) Development of tools for collaborative and modular ontology development, specification of semantic mappings
between information sources, ontology merging, and learning specific types of ontologies (e.g., attribute value
taxonomies) from data.
(d) Extension of approaches used in INDUS to support user and context-specific information integration and
knowledge acquisition in peer-to-peer environments and distributed sensor networks.
(e) Further development of the INDUS prototype into a platform to support exploratory data analysis and knowledge
acquisition in representative problems in bioinformatics and computational biology, e.g., data-driven construction
of classifiers of protein function and predictors of protein-protein interaction.
(f) Applications of INDUS to data integration problems that arise in monitoring and control of complex engineered
systems, e.g., distributed power grids and distributed information and communication networks.
(g) Dissemination of INDUS and associated software to the broader scientific community.
Algorithms and Software for Learning Classifiers from Attribute Value Taxonomies and Data
In many applications of machine learning, data consist of partially specified instances, that is, instances whose attribute
values, class labels, or both, are specified at different levels of abstraction. Such partially specified data are
unavoidable when information from multiple, semantically heterogeneous data sources is combined. The main scientific objective of
the proposed research is to design, implement, analyze, and apply machine learning algorithms for construction of pattern
classifiers from such partially specified data sets using user-supplied attribute value taxonomies (AVT), class taxonomies
(CT), or both as needed. Anticipated results of the proposed research include:
a. A collection of well-documented data sets and associated AVT and CT, drawn from diverse application domains
including computational molecular biology, security informatics, and census data, and a suite of programs for
generating synthetic data sets, AVT, and CT.
b. A suite of theoretically well-founded algorithms for learning classifiers from partially specified data – including AVT
and CT-guided variants of existing algorithms for construction of Decision trees, Bag-of-Words classifiers, Bayesian
networks (e.g., Naïve Bayes and Tree-Augmented Naïve Bayes Classifiers), Hyperplane Classifiers (Perceptrons,
Winnow Perceptron, and Support Vector Machines)
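To make the notion of partially specified data concrete, here is a minimal Python sketch of an attribute value taxonomy and of counting observed values at every level of abstraction. The taxonomy (a "student status" attribute), the observed values, and the helper names `ancestors_or_self` and `counts_at_all_levels` are hypothetical illustrations, not the proposed algorithms. The sketch only shows the bookkeeping an AVT-guided learner might build on: a value recorded at an internal node is counted at that node and at its ancestors, but is not distributed among its descendants.

```python
from collections import Counter

# Hypothetical AVT stored as child -> parent links; "Status" is the root.
PARENT = {
    "Undergraduate": "Status",
    "Graduate": "Status",
    "Freshman": "Undergraduate",
    "Senior": "Undergraduate",
    "Masters": "Graduate",
    "PhD": "Graduate",
}

def ancestors_or_self(value):
    """Return the value followed by its ancestors, ending at the root."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def counts_at_all_levels(values):
    """Count each observed value at its own node and at every ancestor."""
    counts = Counter()
    for v in values:
        counts.update(ancestors_or_self(v))
    return counts

# Partially specified data: some values are leaves, others are internal nodes.
observed = ["Freshman", "Senior", "Undergraduate", "PhD", "Graduate"]
counts = counts_at_all_levels(observed)
print(counts["Undergraduate"], counts["Graduate"], counts["Status"])
```

How to apportion a coarse value such as "Undergraduate" among its more specific descendants is a modeling decision that this sketch deliberately leaves open.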
Dec 2005