22.04.2013 Views

Dr. Vasant Honavar Department of Computer Science honavar@cs ...

Dr. Vasant Honavar Department of Computer Science honavar@cs ...

Dr. Vasant Honavar Department of Computer Science honavar@cs ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

(c) The development <strong>of</strong> a theoretically sound approach to formulation and execution <strong>of</strong> statistical queries across<br />

semantically heterogeneous data sources [Reinoso-Castillo et al., 2003; Caragea et al., 2004b]. This work has<br />

shown how to use semantic correspondences and mappings specified by users from a set <strong>of</strong> terms and<br />

relationships among terms (user ontology) to terms and relations in data source specific ontologies to construct a<br />

sound procedure for answering queries for sufficient statistics needed for learning classifiers from semantically<br />

heterogeneous data. An important component <strong>of</strong> this work has to do with the development <strong>of</strong> statistically sound<br />

approaches to handling data specified at different levels <strong>of</strong> abstraction across different data sources.<br />

(d) The design and implementation <strong>of</strong> INDUS – a modular, extensible, open source s<strong>of</strong>tware toolkit for data-driven<br />

knowledge acquisition from large, distributed, autonomous, semantically heterogeneous data sources (in<br />

collaboration with two other members <strong>of</strong> my laboratory). INDUS enables users (e,g. biologists with little expertise in<br />

programming or computing, but with an understanding <strong>of</strong> specific data repositories e.g., the SWISSPROT protein<br />

databank) to rapidly and flexibly construct classifiers (e.g., for functional annotation <strong>of</strong> proteins) using data from<br />

multiple physically distributed, semantically heterogeneous data repositories.<br />

Research in progress is aimed at:<br />

(a) Extending algorithms that are being developed by our group as well as others for learning from multiple<br />

relational databases to work with semantically heterogeneous data sources, taking advantage <strong>of</strong> the capability<br />

<strong>of</strong> INDUS to view heterogeneous information sources as though they were a collection <strong>of</strong> relational databases.<br />

(b) Extending the ontology-based approach to information integration to develop ontology-based frameworks for<br />

composition <strong>of</strong> autonomously developed components and services using emerging frameworks for data source<br />

(or more generally resource or service) description, registry services, that are being developed as part <strong>of</strong> the<br />

Semantic Web efforts.<br />

(c) Development <strong>of</strong> tools for collaborative and modular ontology development, specification <strong>of</strong> semantic mappings<br />

between information sources, ontology merging, learning specific types <strong>of</strong> ontologies (e.g., attribute value<br />

taxonomies) from data.<br />

(d) Extension <strong>of</strong> approaches used in INDUS to support user and context-specific information integration and<br />

knowledge acquisition in peer-to-peer environments and distributed sensor networks.<br />

(e) Further development <strong>of</strong> the INDUS prototype into a platform to support exploratory data analysis and knowledge<br />

acquisition in representative problems in bioinformatics and computational biology e.g., data-driven construction<br />

<strong>of</strong> classifiers <strong>of</strong> protein function; and predictors <strong>of</strong> protein-protein interaction.<br />

(f) Applications <strong>of</strong> INDUS to data integration problems that arise in monitoring and control <strong>of</strong> complex engineered<br />

systems e.g., distributed power grid, distributed information and communication networks.<br />

(g) Dissemination <strong>of</strong> INDUS and associated s<strong>of</strong>tware to the broader scientific community.<br />

Algorithms and S<strong>of</strong>tware for Learning Classifiers from Attribute Value Taxonomies and Data<br />

In many applications <strong>of</strong> machine learning, data consist <strong>of</strong> partially specified instances, that is, instances whose attribute<br />

values, class labels, or both, are specified at different levels <strong>of</strong> abstraction. Such partially specified data are simply<br />

unavoidable when information from multiple, semantically heterogeneous data sources. The main scientific objective <strong>of</strong><br />

the proposed research is to design, implement, analyze, and apply machine learning algorithms for construction <strong>of</strong> pattern<br />

classifiers from such partially specified data sets using user-supplied attribute value taxonomies (AVT), class taxonomies<br />

(CT), or both as needed. Anticipated results <strong>of</strong> the proposed research include:<br />

a. A collection <strong>of</strong> well-documented data sets and associated AVT and CT, drawn from diverse application domains<br />

including computational molecular biology, security informatics, and census data, and a suite <strong>of</strong> programs for<br />

generating synthetic data sets, AVT, CT<br />

b. A suite <strong>of</strong> theoretically well-founded algorithms for learning classifiers from partially specified data – including AVT<br />

and CT-guided variants <strong>of</strong> existing algorithms for construction <strong>of</strong> Decision trees, Bag-<strong>of</strong>-Words classifiers, Bayesian<br />

networks (e.g., Naïve Bayes and Tree-Augmented Naïve Bayes Classifiers), Hyperplane Classifiers (Perceptrons,<br />

Winnow Perceptron, and Support Vector Machines)<br />

Dec 2005<br />

8

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!