01.08.2013 Views

Predicting post-synaptic activity in proteins - ResearchGate

Predicting post-synaptic activity in proteins - ResearchGate

Predicting post-synaptic activity in proteins - ResearchGate

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

BIOINFORMATICS<br />

Databases<br />

Vol. 21 Suppl. 2 2005, pages ii19–ii25<br />

doi:10.1093/bio<strong>in</strong>formatics/bti1102<br />

<strong>Predict<strong>in</strong>g</strong> <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong> <strong>in</strong> prote<strong>in</strong>s with data m<strong>in</strong><strong>in</strong>g<br />

Gisele L. Pappa 1 , Anthony J. Ba<strong>in</strong>es 2 and Alex A. Freitas 1,∗<br />

1 Comput<strong>in</strong>g Laboratory, University of Kent, Canterbury CT2 7NF, UK and 2 Department of Biosciences,<br />

University of Kent, Canterbury CT2 7NJ, UK<br />

ABSTRACT<br />

Summary: The bio<strong>in</strong>formatics problem be<strong>in</strong>g addressed <strong>in</strong> this paper<br />

is to predict whether or not a prote<strong>in</strong> has <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>. This<br />

problem is of great <strong>in</strong>tr<strong>in</strong>sic <strong>in</strong>terest because prote<strong>in</strong>s with <strong>post</strong><strong>synaptic</strong><br />

activities are connected with function<strong>in</strong>g of the nervous<br />

system. Indeed, many prote<strong>in</strong>s hav<strong>in</strong>g <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong> have<br />

been functionally characterized by biochemical, immunological and<br />

proteomic exercises. They represent a wide variety of prote<strong>in</strong>s with<br />

functions <strong>in</strong> extracellular signal reception and propagation through<br />

<strong>in</strong>tracellular apparatuses, cell adhesion molecules and scaffold<strong>in</strong>g prote<strong>in</strong>s<br />

that l<strong>in</strong>k them <strong>in</strong> a web. The challenge is to automatically discover<br />

features of the primary sequences of prote<strong>in</strong>s that typically occur <strong>in</strong><br />

prote<strong>in</strong>s with <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong> but rarely (or never) occur <strong>in</strong> prote<strong>in</strong>s<br />

without <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>, and vice-versa. In this context, we<br />

used data m<strong>in</strong><strong>in</strong>g to automatically discover classification rules that predict<br />

whether or not a prote<strong>in</strong> has <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>. The discovered<br />

rules were analysed with respect to their predictive accuracy (generalization<br />

ability) and with respect to their <strong>in</strong>terest<strong>in</strong>gness to biologists<br />

(<strong>in</strong> the sense of represent<strong>in</strong>g novel, unexpected knowledge).<br />

Contact: A.A.Freitas@kent.ac.uk<br />

1 INTRODUCTION<br />

One of the great challenges of our era is to predict the functions of<br />

prote<strong>in</strong>s based on their primary sequence. This is a very difficult<br />

problem, s<strong>in</strong>ce the relationship between prote<strong>in</strong> sequence and function<br />

is very complex (Gerlt and Babbitt, 2000; Devos and Valencia,<br />

2000; Nagl, 2003). Indeed, although there is a vast amount of data<br />

stored <strong>in</strong> prote<strong>in</strong> databases, there is still a large gap between the huge<br />

amount of data about prote<strong>in</strong> sequences and the knowledge necessary<br />

for understand<strong>in</strong>g the process of prote<strong>in</strong> fold<strong>in</strong>g and associated<br />

prote<strong>in</strong> functions.<br />

Intuitively, however, prote<strong>in</strong> databases conta<strong>in</strong> important ‘hidden<br />

relationships’ (knowledge) between prote<strong>in</strong> sequence and prote<strong>in</strong><br />

function. There is a clear and urgent motivation for discover<strong>in</strong>g this<br />

hidden knowledge from prote<strong>in</strong> databases for a number of reasons,<br />

such as a better understand<strong>in</strong>g of diseases, design<strong>in</strong>g more effective<br />

medical drugs, etc. This creates both a need and an opportunity<br />

to apply data m<strong>in</strong><strong>in</strong>g techniques to the problem of automatically<br />

discover<strong>in</strong>g knowledge from prote<strong>in</strong> databases.<br />

Data m<strong>in</strong><strong>in</strong>g is a multi-discipl<strong>in</strong>ary field, which consists of us<strong>in</strong>g<br />

methods of several research areas (arguably, ma<strong>in</strong>ly mach<strong>in</strong>e learn<strong>in</strong>g<br />

and statistical pattern recognition) to extract <strong>in</strong>terest<strong>in</strong>g knowledge<br />

from real-world datasets (Witten and Frank, 2000; Fayyad et al.,<br />

1996).<br />

∗ To whom correspondence should be addressed.<br />

This paper proposes a data m<strong>in</strong><strong>in</strong>g approach to the problem of<br />

predict<strong>in</strong>g whether or not a prote<strong>in</strong> has <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>, based<br />

on features of the prote<strong>in</strong>’s primary sequence. The proposed approach<br />

will be described later, <strong>in</strong> Sections 3 and 4. In this Introduction<br />

we only emphasize a major difference between the proposed data<br />

m<strong>in</strong><strong>in</strong>g approach and a more traditional bio<strong>in</strong>formatics approach for<br />

predict<strong>in</strong>g prote<strong>in</strong> function, as follows.<br />

In general, the approach most used to predict the function of a new<br />

prote<strong>in</strong>—for which we know only its sequence—consists of perform<strong>in</strong>g<br />

a similarity search <strong>in</strong> a prote<strong>in</strong> database. In essence, the program<br />

f<strong>in</strong>ds the most similar prote<strong>in</strong>(s) to the new prote<strong>in</strong>, and if that similarity<br />

is higher than a threshold, the function of the most similar<br />

prote<strong>in</strong> is transferred to the new prote<strong>in</strong>. Although this approach is<br />

very useful <strong>in</strong> many cases, it also has some limitations, as follows.<br />

First, it is well-known that two prote<strong>in</strong>s might have very similar<br />

sequences and perform different functions, or have very different<br />

sequences and perform the same or similar function (Syed and Yona,<br />

2003; Gerlt and Babbitt, 2000). Second, the prote<strong>in</strong>s be<strong>in</strong>g compared<br />

may be similar <strong>in</strong> regions of the sequence that are not determ<strong>in</strong>ants<br />

of function (Schug et al., 2002). Third, the prediction of function is<br />

based only on sequence similarity, ignor<strong>in</strong>g many relevant biochemical<br />

properties of prote<strong>in</strong>s (Karwath and K<strong>in</strong>g, 2002; Syed and Yona,<br />

2003). Fourth, it does not produce a model for predict<strong>in</strong>g function,<br />

and so it does not give <strong>in</strong>sights <strong>in</strong>to the relationship between the<br />

sequence, biochemical properties and function of prote<strong>in</strong>s.<br />

Another approach consists of <strong>in</strong>duc<strong>in</strong>g, from prote<strong>in</strong> data, a<br />

model describ<strong>in</strong>g (<strong>in</strong> a very summarized form) the data, so that<br />

new prote<strong>in</strong>s can be classified by the model. This is the data<br />

m<strong>in</strong><strong>in</strong>g approach followed <strong>in</strong> this project. We emphasize that<br />

this model-<strong>in</strong>duction approach aims at complement<strong>in</strong>g—rather than<br />

replac<strong>in</strong>g—the conventional similarity-based approach. In any case,<br />

the model-<strong>in</strong>duction approach followed <strong>in</strong> this project has two<br />

important advantages (K<strong>in</strong>g et al., 2001). First, it can predict the<br />

function of a new prote<strong>in</strong> even <strong>in</strong> the absence of sequence similarity<br />

between that prote<strong>in</strong> and other prote<strong>in</strong>s with known function. Second,<br />

if the discovered model is expressed <strong>in</strong> a comprehensible form (which<br />

is the case <strong>in</strong> this research, where knowledge is expressed by <strong>in</strong>tuitively<br />

comprehensible IF-THEN rules), it can be used by biologists,<br />

to give new <strong>in</strong>sights and possibly suggest new biological experiments.<br />

Indeed, <strong>in</strong> this research the rules discovered by a data m<strong>in</strong><strong>in</strong>g<br />

algorithm were not only automatically evaluated with respect to their<br />

predictive accuracy—as is usual <strong>in</strong> data m<strong>in</strong><strong>in</strong>g—but also manually<br />

<strong>in</strong>terpreted <strong>in</strong> the context of relevant biochemical knowledge, <strong>in</strong> order<br />

to determ<strong>in</strong>e how <strong>in</strong>terest<strong>in</strong>g they were with respect to provid<strong>in</strong>g<br />

novel <strong>in</strong>sights—unknown <strong>in</strong> the current literature—about the relationship<br />

between some prote<strong>in</strong> sequence patterns and <strong>post</strong>-<strong>synaptic</strong><br />

<strong>activity</strong>. This two-criteria evaluation re<strong>in</strong>forces the <strong>in</strong>ter-discipl<strong>in</strong>ary<br />

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org ii19


G.L.Pappa et al.<br />

Fig. 1. Ma<strong>in</strong> elements <strong>in</strong>volved <strong>in</strong> pre-<strong>synaptic</strong> and <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>.<br />

A synapse is the po<strong>in</strong>t where two nerve cells communicate with each other by<br />

transmission of a chemical known as a neurotransmitter. The ma<strong>in</strong> elements<br />

found <strong>in</strong> synapses are shown <strong>in</strong> Figure 1. The cells are held together by<br />

cell adhesion molecules (1). In the cell where the signal is com<strong>in</strong>g from<br />

(the pre-<strong>synaptic</strong> cell) neurotransmitters are stored <strong>in</strong> bags called <strong>synaptic</strong><br />

vesicles. When signals are to be transmitted from the pre-<strong>synaptic</strong> cell to the<br />

<strong>post</strong>-<strong>synaptic</strong> cell, <strong>synaptic</strong> vesicles fuse with the pre-<strong>synaptic</strong> membrane and<br />

release their contents <strong>in</strong>to the <strong>synaptic</strong> cleft between the cells. The transmitters<br />

then diffuse with<strong>in</strong> the cleft, and some of them meet a <strong>post</strong>-<strong>synaptic</strong> receptor<br />

(2), which recognises them as a signal. This activates the receptor, which<br />

then transmits the signal on to other signall<strong>in</strong>g components such as voltagegated<br />

ion channels (3), prote<strong>in</strong> k<strong>in</strong>ases (4) and phosphatases (5). To ensure<br />

that the signal is term<strong>in</strong>ated and to clear up residual neurotransmitters after<br />

the signal has term<strong>in</strong>ated, transporters (6) remove neurotransmitters from the<br />

cleft. With<strong>in</strong> the <strong>post</strong>-<strong>synaptic</strong> cell, the signall<strong>in</strong>g apparatus is organised by<br />

various scaffold<strong>in</strong>g prote<strong>in</strong>s (7).<br />

nature of this project, which is, of course, a desirable feature <strong>in</strong> a<br />

bio<strong>in</strong>formatics project.<br />

The rema<strong>in</strong>der of this paper is organised as follows: Section 2<br />

describes the target biological problem; Section 3 describes the<br />

preparation of the dataset for m<strong>in</strong><strong>in</strong>g purposes, by us<strong>in</strong>g prote<strong>in</strong><br />

data available <strong>in</strong> UniProt/SwissProt and Prosite; Section 4 discusses<br />

the proposed data m<strong>in</strong><strong>in</strong>g approach and the correspond<strong>in</strong>g analysis<br />

of results; f<strong>in</strong>ally, Section 5 reports the conclusions and future<br />

research directions.<br />

2 THE TARGET BIOLOGICAL PROBLEM<br />

In essence, <strong>post</strong>-<strong>synaptic</strong> sites represent po<strong>in</strong>ts where one nerve cell<br />

receives signals from another. As <strong>in</strong>dicated <strong>in</strong> Figure 1, multiple<br />

types of prote<strong>in</strong>s are expected to be found at these sites for reception<br />

and propagation of signals, and for jo<strong>in</strong><strong>in</strong>g the two nerve cells to each<br />

other. Note that Figure 1 is a very m<strong>in</strong>imal summary of the types of<br />

prote<strong>in</strong>s found <strong>in</strong> <strong>post</strong><strong>synaptic</strong> sites.<br />

The bio<strong>in</strong>formatics problem be<strong>in</strong>g addressed <strong>in</strong> this paper is to predict<br />

whether or not a prote<strong>in</strong> has <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>. This problem<br />

is of great <strong>in</strong>tr<strong>in</strong>sic <strong>in</strong>terest because prote<strong>in</strong>s with <strong>post</strong>-<strong>synaptic</strong> activities<br />

are connected with function<strong>in</strong>g of the nervous system. Indeed,<br />

many prote<strong>in</strong>s hav<strong>in</strong>g <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong> have been functionally<br />

characterized by biochemical, immunological and proteomic exercises<br />

[see e.g. Husi et al. (2000) and Walikonis et al. (2000)], and are<br />

ii20<br />

now extensively catalogued and annotated <strong>in</strong> the Uniprot/SwissProt<br />

database. They represent a wide variety of prote<strong>in</strong>s with functions <strong>in</strong><br />

extracellular signal reception and propagation through <strong>in</strong>tracellular<br />

apparatuses, cell adhesion molecules and scaffold<strong>in</strong>g prote<strong>in</strong>s that<br />

l<strong>in</strong>k them <strong>in</strong> a web.<br />

The challenge is to automatically discover features of prote<strong>in</strong>s’<br />

primary sequences that typically occur <strong>in</strong> prote<strong>in</strong>s with <strong>post</strong>-<strong>synaptic</strong><br />

<strong>activity</strong> but rarely (or never) occur <strong>in</strong> prote<strong>in</strong>s without <strong>post</strong>-<strong>synaptic</strong><br />

<strong>activity</strong>, and vice-versa. These discovered features constitute the<br />

essence of the knowledge discovered by a data m<strong>in</strong><strong>in</strong>g algorithm.<br />

If the algorithm is successful <strong>in</strong> discover<strong>in</strong>g knowledge with a high<br />

predictive power, that knowledge can be used to accurately discrim<strong>in</strong>ate<br />

between the two classes of prote<strong>in</strong>s. In addition, and<br />

most important, the fact that <strong>in</strong> this project discovered knowledge<br />

will be expressed <strong>in</strong> a comprehensible form—as mentioned <strong>in</strong> the<br />

Introduction—represents potentially valuable knowledge by itself,<br />

because such knowledge can potentially give new <strong>in</strong>sights to biologists<br />

about which sequence features are predictive of <strong>post</strong>-<strong>synaptic</strong><br />

<strong>activity</strong>.<br />

3 METHODS<br />

The goal of this project is to predict—with a data m<strong>in</strong><strong>in</strong>g algorithm—whether<br />

or not a prote<strong>in</strong> has <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>. The algorithm is used to discover<br />

<strong>in</strong>terest<strong>in</strong>g relationships between sequence features that are often present <strong>in</strong><br />

<strong>post</strong>-<strong>synaptic</strong> prote<strong>in</strong>s but usually absent <strong>in</strong> prote<strong>in</strong>s without <strong>post</strong>-<strong>synaptic</strong><br />

prote<strong>in</strong>s, and vice-versa. Therefore, <strong>in</strong> order to discover this k<strong>in</strong>d of knowledge,<br />

we need not only a set of <strong>post</strong>-<strong>synaptic</strong> prote<strong>in</strong>s but also a control set<br />

of prote<strong>in</strong>s which do not have <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>.<br />

In data m<strong>in</strong><strong>in</strong>g term<strong>in</strong>ology, the set of prote<strong>in</strong>s with <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong><br />

is called the set of positive examples, whereas the control set of prote<strong>in</strong>s<br />

(without <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>) is called the set of negative examples. The<br />

problem of f<strong>in</strong>d<strong>in</strong>g sequence features that discrim<strong>in</strong>ate between these two<br />

k<strong>in</strong>ds of prote<strong>in</strong>s is then cast as a classification problem, where the goal is<br />

to predict the value of a class attribute for each example (prote<strong>in</strong>) based on<br />

a set of predictor attributes for that example. The classes are whether or not<br />

a prote<strong>in</strong> has <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>, and the predictor attributes are ma<strong>in</strong>ly<br />

Prosite patterns, as described below.<br />

More precisely, the dataset m<strong>in</strong>ed <strong>in</strong> this project was constructed <strong>in</strong><br />

two phases. In the first phase we carefully selected relevant prote<strong>in</strong>s<br />

from the UniProt database. UniProt (Universal Prote<strong>in</strong> Resource) was<br />

created by the union of Swiss-Prot, TrEMBL and PIR databases, and<br />

is a repository for prote<strong>in</strong> sequence and functional data (Uniprot, 2004,<br />

http://www.ebi.uniprot.org/<strong>in</strong>dex.shtml). UniProt was chosen as the source<br />

of the data to be m<strong>in</strong>ed because it is the world standard non-redundant,<br />

comprehensive prote<strong>in</strong> sequence database. UniProt is divided <strong>in</strong> three<br />

database layers: UniParc (UniProt Archive), UniProt Knowledgebase (Uni-<br />

Prot/SwissProt) and UniProt NREF (UniRef). In this research we have used<br />

UniProt/SwissProt, which is the richest annotated layer. One of the great<br />

advantages of this layer is the comprehensive annotation of SwissProt. This<br />

is curated to ensure m<strong>in</strong>imal redundancy, accurate and comprehensive annotation<br />

of function, expression, sequence features (e.g. doma<strong>in</strong> structures),<br />

literature references, l<strong>in</strong>ks to other databases, etc.<br />

The second phase of data preparation used the l<strong>in</strong>ks from Unit-<br />

Prot/SwissProt to the Prosite database, <strong>in</strong> order to retrieve <strong>in</strong>formation from<br />

Prosite that was used to create the predictor attributes. These two phases are<br />

described <strong>in</strong> detail next.<br />

3.1 Phase 1: select<strong>in</strong>g positive and negative examples<br />

Table 1 shows the queries submitted to UniProt/SwissProt <strong>in</strong> order to select the<br />

set of positive examples (prote<strong>in</strong>s with <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>) and the set of<br />

negative examples (prote<strong>in</strong>s without <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>). For each query,<br />

the table shows the specification of the query and the number of examples


Table 1. Queries submitted to UniProt/SwissProt to create the dataset<br />

Query No. Query specification # Examples<br />

Selection of positive examples<br />

1 Post-<strong>synaptic</strong> !tox<strong>in</strong> 356<br />

Total number of positive examples 356<br />

Query No. Query/Keywords # Examples<br />

Selection of negative examples<br />

2 Heart !(result of query 1) 3106<br />

3 Cardiac !(results of queries 1, 2) 331<br />

4 Liver !(results of queries 1, 2, 3) 2794<br />

5 Hepatic !(results of queries 1, 2, 3, 4) 256<br />

6 Kidney !(results of query 1, 2, 3, 4, 5) 988<br />

Total number of negative examples 7475<br />

(prote<strong>in</strong>s) returned by that query. The specification of each query consists of<br />

keywords and the logical operator ‘NOT’ (!).<br />

Note that the queries did not specify any specific species, i.e. prote<strong>in</strong>s from<br />

all species were considered. This was done <strong>in</strong> order to maximize the number<br />

of examples <strong>in</strong> the data to be m<strong>in</strong>ed. Positive examples were selected not<br />

only by the presence of the keyword ‘<strong>post</strong>-<strong>synaptic</strong>’ but also by the absence<br />

of the keyword ‘tox<strong>in</strong>’. The reason for this latter criterion is that several<br />

entries <strong>in</strong> UniProt/SwissProt refer to the tox<strong>in</strong> α-latrotox<strong>in</strong>. This prote<strong>in</strong> acts<br />

<strong>post</strong>-<strong>synaptic</strong>ally, but it is not, of course, a <strong>post</strong>-<strong>synaptic</strong> prote<strong>in</strong>.<br />

The selection of the negative examples was more difficult, s<strong>in</strong>ce of course<br />

the UniProt/SwissProt entries do not have an explicit ‘not <strong>post</strong>-<strong>synaptic</strong>’<br />

keyword. The trivial ‘solution’ of retriev<strong>in</strong>g all entries that do not explicitly<br />

have the keyword ‘<strong>post</strong>-<strong>synaptic</strong>’ is not satisfactory, for two reasons.<br />

First, this would produce a very large number of negative examples, which<br />

would be much larger than the number of positive examples. This would create<br />

a dataset with an extremely unbalanced class distribution, and it would be<br />

very difficult for the classification algorithm to discover rules that correctly<br />

classify the positive (<strong>post</strong>-<strong>synaptic</strong>) class. Second, and most important, many<br />

of the negative examples would be ‘trivial’, lead<strong>in</strong>g to un<strong>in</strong>terest<strong>in</strong>g, trivial<br />

rules for the discrim<strong>in</strong>ation between positive and negative examples. For<br />

<strong>in</strong>stance, there is no need to discover rules discrim<strong>in</strong>at<strong>in</strong>g plant prote<strong>in</strong>s from<br />

<strong>post</strong>-<strong>synaptic</strong> prote<strong>in</strong>s, s<strong>in</strong>ce it is obvious that plants do not have a nervous<br />

system. Hence, <strong>in</strong>clud<strong>in</strong>g plant prote<strong>in</strong>s <strong>in</strong> the set of negative examples would<br />

only contribute to the discovery of un<strong>in</strong>terest<strong>in</strong>g, trivial classification rules.<br />

The goal is to select a set of negative examples where, although the prote<strong>in</strong>s<br />

do not have <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>, they have some characteristics that<br />

could be confused with some of the characteristics of <strong>post</strong>-<strong>synaptic</strong> prote<strong>in</strong>s<br />

(the positive examples), mak<strong>in</strong>g it difficult to discrim<strong>in</strong>ate between these two<br />

k<strong>in</strong>ds of prote<strong>in</strong>s. Intuitively, the higher the similarity between positive and<br />

negative examples, the harder the classification problem, and so the stronger<br />

the motivation to use a data m<strong>in</strong><strong>in</strong>g algorithm to discover <strong>in</strong>terest<strong>in</strong>g classification<br />

rules represent<strong>in</strong>g novel knowledge to a biologist. (Recall that the<br />

ultimate goal is to automatically discover <strong>in</strong>terest<strong>in</strong>g, novel rules that provide<br />

new <strong>in</strong>sight to biologists about which sequence features are most correlated<br />

with the presence or absence of <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>.)<br />

Note that, although the positive and negative examples must have enough<br />

similarity to lead to the discovery of <strong>in</strong>terest<strong>in</strong>g, non-trivial rules, they also<br />

must be different enough to allow the discovery of reliable classification rules<br />

for each class. Hence, the challenge is to f<strong>in</strong>d a good trade-off between these<br />

two goals.<br />

In this spirit, the negative examples were selected by us<strong>in</strong>g the keywords<br />

‘heart’, ‘cardiac’, ‘liver’, ‘hepatic’ and ‘kidney’. Consider, for <strong>in</strong>stance, the<br />

query 4 <strong>in</strong> Table 1: ‘Liver !(results of queries 1, 2, 3)’. This query means<br />

<strong>Predict<strong>in</strong>g</strong> <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong> <strong>in</strong> prote<strong>in</strong>s with data m<strong>in</strong><strong>in</strong>g<br />

that all UniProt/ SwissProt entries that had the keyword ‘liver’ and were not<br />

already <strong>in</strong>cluded <strong>in</strong> the results of the previous queries (1, 2, 3) were selected<br />

as negative examples. The latter selection criterion was necessary to avoid the<br />

possibility that two or more copies of the same data entry—i.e. a data entry<br />

hav<strong>in</strong>g two or more of the above-mentioned keywords—were duplicated <strong>in</strong><br />

the set of negative examples.<br />

The previously-mentioned five keywords were chosen as the basis for<br />

obta<strong>in</strong><strong>in</strong>g negative examples for two reasons. First, it is known that, <strong>in</strong> general,<br />

prote<strong>in</strong>s found <strong>in</strong> these sites do not present <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong>. Second,<br />

prote<strong>in</strong>s found <strong>in</strong> these sites often have some characteristics similar to <strong>post</strong><strong>synaptic</strong><br />

prote<strong>in</strong>s. Indeed, <strong>in</strong> the context of this project, many of the same types<br />

of prote<strong>in</strong>s represented at <strong>post</strong>-<strong>synaptic</strong> sites (k<strong>in</strong>ases/phosphatases/channels,<br />

etc.) are present <strong>in</strong> abundance <strong>in</strong> heart, liver and kidney tissues.<br />

It should be noted that the queries listed <strong>in</strong> Table 1 searched for the correspond<strong>in</strong>g<br />

keywords <strong>in</strong> all the fields of the UniProt/ SwissProt entries. This<br />

means that the selection criteria are not perfect, s<strong>in</strong>ce those keywords could<br />

be present <strong>in</strong> fields where the presence of the keyword does not mean that the<br />

prote<strong>in</strong> has the correspond<strong>in</strong>g function/characteristic. This potentially <strong>in</strong>troduces<br />

some ‘noise’ <strong>in</strong> the dataset be<strong>in</strong>g m<strong>in</strong>ed. However, this was necessary,<br />

because queries search<strong>in</strong>g for keywords only <strong>in</strong> the field ‘KEYWORD’ of<br />

UniProt/ SwissProt returned too few prote<strong>in</strong>s. In any case, the amount of<br />

noise <strong>in</strong>troduced by the imperfect selection criteria seems to be relatively<br />

small, s<strong>in</strong>ce the data m<strong>in</strong><strong>in</strong>g algorithm was able to discovery quite accurate<br />

classification rules, as will be shown later.<br />

3.2 Phase 2: generat<strong>in</strong>g the predictor attributes<br />

Once the set of examples to be m<strong>in</strong>ed has been selected from Uniprot/Swissprot,<br />

the next step was to generate a set of predictor attributes<br />

represent<strong>in</strong>g relevant properties of the sequences those prote<strong>in</strong>s. The predictor<br />

attributes must have a good predictive power and facilitate easy <strong>in</strong>terpretation<br />

by biologists. In this project we have focused ma<strong>in</strong>ly on generat<strong>in</strong>g attributes<br />

based on Prosite patterns associated with the prote<strong>in</strong>s—a type of attribute<br />

satisfy<strong>in</strong>g both the previously-mentioned properties. The Prosite database<br />

stores significant patterns and profiles that help to identify the family of a<br />

new prote<strong>in</strong> (Hulo et al., 2004). We decided to use attributes based only on<br />

Prosite patterns, and not Prosite profiles, for two reasons. First, the match<strong>in</strong>g<br />

between a Prosite pattern and a prote<strong>in</strong> can be exactly computed, produc<strong>in</strong>g<br />

a simple b<strong>in</strong>ary attribute—i.e. the pattern either occurs or does not occur <strong>in</strong><br />

a given prote<strong>in</strong>. This reduces the size of the search space for the data m<strong>in</strong><strong>in</strong>g<br />

algorithm and simplifies the <strong>in</strong>terpretation of the rule by biologists. By<br />

contrast, the match<strong>in</strong>g between a Prosite profile and a prote<strong>in</strong> is an approximate<br />

match<strong>in</strong>g, and the data m<strong>in</strong><strong>in</strong>g algorithm would have to search <strong>in</strong> a<br />

correspond<strong>in</strong>gly much larger search space. Second, the use of both patterns<br />

and profiles would lead to a very large number of attributes, which would<br />

aga<strong>in</strong> expand the size of the search space, and so would significantly <strong>in</strong>crease<br />

the risk of overfitt<strong>in</strong>g the <strong>in</strong>duced model to the data. It should be noted that,<br />

even consider<strong>in</strong>g only the b<strong>in</strong>ary attributes derived from Prosite patterns, this<br />

led to 443 attributes (as expla<strong>in</strong>ed next), which corresponds to a huge search<br />

space of size 2443 .<br />

For each prote<strong>in</strong> selected <strong>in</strong> phase 1 (regardless of the prote<strong>in</strong> be<strong>in</strong>g a<br />

positive or a negative example), we retrieved all Prosite entry id’s that occurred<br />

<strong>in</strong> the field database cross-references (DR) of UniProt/SwissProt. For each<br />

Prosite entry id, we ‘followed the l<strong>in</strong>k’ from UniProt/SwissProt to Prosite, <strong>in</strong><br />

order to access <strong>in</strong>formation about that Prosite entry. Once this was done for<br />

all prote<strong>in</strong>s selected <strong>in</strong> phase 1, we had a large set of Prosite entries. We then<br />

selected, for use as predictor attributes, the entries that:<br />

(i) were marked as a ‘pattern’ <strong>in</strong> the ID l<strong>in</strong>e of that entry <strong>in</strong> the Prosite<br />

database;<br />

(ii) were not commented (<strong>in</strong> the CC l<strong>in</strong>e of the Prosite database) as<br />

very general patterns (/SKIP_FLAG = TRUE)—it was necessary to<br />

exclude those patterns because they appear <strong>in</strong> almost all prote<strong>in</strong>s, and<br />

so are not useful to discrim<strong>in</strong>ate between the two classes of prote<strong>in</strong>s;<br />

(iii) occurred <strong>in</strong> at least two prote<strong>in</strong>s of the dataset be<strong>in</strong>g m<strong>in</strong>ed—this<br />

was necessary to remove extremely specific patterns, occurr<strong>in</strong>g <strong>in</strong><br />

ii21


G.L.Pappa et al.<br />

just one prote<strong>in</strong> of the dataset, which do not have any generalisation<br />

power.<br />

F<strong>in</strong>ally, each of the selected Prosite patterns was encoded as one b<strong>in</strong>ary<br />

attribute of the dataset be<strong>in</strong>g m<strong>in</strong>ed, tak<strong>in</strong>g on the value ‘yes’ or<br />

‘no’ for each prote<strong>in</strong>—<strong>in</strong>dicat<strong>in</strong>g whether or not the pattern occurs <strong>in</strong> that<br />

prote<strong>in</strong>, respectively.<br />

Note that, a few prote<strong>in</strong>s <strong>in</strong> the dataset to be m<strong>in</strong>ed did not have any Prosite<br />

pattern, i.e. they had the value ‘no’ for all predictor attributes based on Prosite<br />

patterns. These prote<strong>in</strong>s were removed from the data to be m<strong>in</strong>ed.<br />

In addition to the previously-described attributes based on Prosite patterns,<br />

we added to the dataset two simple predictor attributes derived directly from<br />

the prote<strong>in</strong>s’ sequences, namely the sequence length and the molecular weight<br />

of the prote<strong>in</strong> (both attributes are available from the correspond<strong>in</strong>g fields <strong>in</strong><br />

UniProt/SwissProt entries). Other k<strong>in</strong>ds of attributes will be considered <strong>in</strong><br />

future research, but for now it is <strong>in</strong>terest<strong>in</strong>g to note that even the current set of<br />

predictor attributes is enough to discover quite accurate classification rules,<br />

as will be shown later.<br />

After all this data preparation process, we ended up with a dataset composed<br />

by 4303 examples (260 belong<strong>in</strong>g to the positive class and 4043<br />

belong<strong>in</strong>g to the negative class) and 445 predictor attributes. These 445<br />

attributes <strong>in</strong>clude 443 Prosite patterns, the sequence length and the molecular<br />

weight of each prote<strong>in</strong>.<br />

4 RESULTS<br />

In order to discover knowledge from the dataset described <strong>in</strong> the<br />

previous section, we have used the well-known C4.5Rules rule <strong>in</strong>duction<br />

algorithm (Qu<strong>in</strong>lan, 1993). This algorithm was chosen as the<br />

data m<strong>in</strong><strong>in</strong>g algorithm <strong>in</strong> our experiments ma<strong>in</strong>ly because it produces<br />

comprehensible knowledge, represented by a set of high-level,<br />

easily-<strong>in</strong>terpretable classification rules of the form: IF (conditions)<br />

THEN (class). This k<strong>in</strong>d of rule has the <strong>in</strong>tuitive mean<strong>in</strong>g that, if an<br />

example (prote<strong>in</strong>) satisfies the conditions <strong>in</strong> the rule antecedent, the<br />

example is assigned to the class predicted by the rule consequent.<br />

It should be noted that the comprehensibility of discovered knowledge<br />

is a very important issue <strong>in</strong> bio<strong>in</strong>formatics [see e.g. Mirk<strong>in</strong><br />

and Ritter (2000), Clare and K<strong>in</strong>g (2002) and Sebban et al.<br />

(2002)], because the discovered knowledge should be <strong>in</strong>terpreted<br />

and validated by biologists, rather than be<strong>in</strong>g bl<strong>in</strong>dly trusted as a<br />

‘black box’.<br />

We used the default parameters of C4.5Rules. The classification<br />

rules discovered by C4.5Rules were evaluated accord<strong>in</strong>g to two criteria,<br />

namely predictive accuracy and <strong>in</strong>terest<strong>in</strong>gness to biologists, as<br />

follows. Predictive accuracy was estimated by a well-known 10-fold<br />

cross-validation procedure (Witten and Frank, 2000), as usual <strong>in</strong> data<br />

m<strong>in</strong><strong>in</strong>g. In essence, the dataset was divided <strong>in</strong>to 10 partitions, with<br />

approximately the same number of examples (prote<strong>in</strong>s) <strong>in</strong> each partition.<br />

In the i-th iteration, i = 1, 2, ..., 10, the i-th partition was used<br />

as the test set and the other 9 partitions were temporarily merged and<br />

used as the tra<strong>in</strong><strong>in</strong>g set. In each iteration C4.5Rules discovered a rule<br />

set from the tra<strong>in</strong><strong>in</strong>g set and used that rule set to classify examples <strong>in</strong><br />

the test set (unseen dur<strong>in</strong>g tra<strong>in</strong><strong>in</strong>g), <strong>in</strong> order to evaluate the generalisation<br />

ability of discovered knowledge. The classification accuracy<br />

rate of the discovered rules can then be computed as the average<br />

accuracy rate over the 10 test sets, and this is the measure of predictive<br />

accuracy most popular <strong>in</strong> the literature. In our experiments, this<br />

produced an accuracy rate of 97.85%.<br />

It should be noted, however, that <strong>in</strong> the context of this project this<br />

traditional measure of accurate rate is not a very effective one. The<br />

reason is that the class distribution is very unbalanced: only 6.4% of<br />

the examples have the positive class. Hence, as a basel<strong>in</strong>e solution for<br />

ii22<br />

this classification problem, the ‘majority classifier’—which predicts<br />

the majority (negative) class for all examples—would trivially obta<strong>in</strong><br />

an accuracy rate of 93.9%, without provid<strong>in</strong>g any <strong>in</strong>sight about the<br />

relationship between the predictor attributes and the classes.<br />

Therefore, we use a more ‘demand<strong>in</strong>g’ measure of predictive<br />

accuracy, for which a high value can be obta<strong>in</strong>ed only by accurately<br />

classify<strong>in</strong>g examples of both classes. The measure <strong>in</strong> question is the<br />

product: true positive rate (TPR) × true negative rate (TNR) (Hand,<br />

1997). These terms (which are sometimes referred to as Sensitivity<br />

and Specificity, respectively) are def<strong>in</strong>ed as follows.<br />

TPR = TP/(TP + FN) TNR = TN/(TN + FP),<br />

where<br />

TP = number of true positives—i.e. the number of examples that<br />

were predicted as positive class by the discovered rule set,<br />

and <strong>in</strong>deed have the positive class;<br />

FN = number of false negatives—i.e. the number of examples that<br />

were predicted as negative class, but actually have the positive<br />

class;<br />

TN = number of true negatives—i.e. the number of examples that<br />

were predicted as negative class, and <strong>in</strong>deed have the negative<br />

class;<br />

FP = number of false positives—i.e. the number of examples that<br />

were predicted as positive class, but actually have the negative<br />

class.<br />

In our experiments the average values (over the 10 iterations of the<br />

cross-validation procedure) of the TPR and TNR were 0.85 and 0.98<br />

respectively, result<strong>in</strong>g <strong>in</strong> the f<strong>in</strong>al measure of predictive accuracy as<br />

TPR × TNR = 0.84 (with a standard deviation of 0.09). Note that,<br />

the basel<strong>in</strong>e majority classifier obta<strong>in</strong>s TPR × TNR = 0 × 93.9 = 0,<br />

i.e. it is very strongly penalized (as it should be) for never predict<strong>in</strong>g<br />

the positive class.<br />

It should also be noted that, although the vast majority of the data<br />

m<strong>in</strong><strong>in</strong>g literature focuses on measur<strong>in</strong>g only the predictive accuracy<br />

of the discovered rules, the ultimate goal of data m<strong>in</strong><strong>in</strong>g is to discover<br />

knowledge that is comprehensible and <strong>in</strong>terest<strong>in</strong>g (novel, unexpected)<br />

to the user (Fayyad et al., 1996; Han and Kamber, 2001). We<br />

emphasize that a very accurate rule will not be useful to the user if<br />

it represents a previously known pattern. Consider, for <strong>in</strong>stance, the<br />

follow<strong>in</strong>g hypothetical example. In a hospital’s medical database a<br />

data m<strong>in</strong><strong>in</strong>g algorithm could discover the rule: IF (patient is pregnant)<br />

THEN (patient’s gender is female). This rule is extremely accurate,<br />

but it is also completely useless, s<strong>in</strong>ce it represents an obvious pattern.<br />

As a real-world example of the difficult of discover<strong>in</strong>g novel,<br />

unexpected rules, (Tsumoto, 2000) reports that, <strong>in</strong> experiments with<br />

two medical datasets,


Table 2. Rules discovered by C4.5Rules<br />

Id Classification rule<br />

32 IF (NEUROTR_ION_CHANNEL = yes) THEN (class = yes)<br />

19 IF (CADHERIN_1 = yes) AND (920 < seq_length 699)<br />

THEN (class = yes)<br />

10 IF (G_PROTEIN_RECEP_F1_1 = yes)<br />

AND (11287 < mol_weigth 318)<br />

THEN (class= yes)<br />

21 IF (G_PROTEIN_RECEP_F3_1 = yes)<br />

AND (mol_weight


G.L.Pappa et al.<br />

31.4%. The low accuracy of these two rules comes from the fact that<br />

ser/thr phosphatases and G-prote<strong>in</strong> coupled receptors are expressed<br />

<strong>in</strong> every human cell, and not just <strong>post</strong>-<strong>synaptic</strong>ally.<br />

The unexpected rules are much more complicated, but they<br />

are very surpris<strong>in</strong>gly accurate. Therefore, <strong>in</strong> general they represent<br />

<strong>in</strong>terest<strong>in</strong>g knowledge to biologists who are experts <strong>in</strong><br />

<strong>post</strong>-<strong>synaptic</strong> prote<strong>in</strong>s.<br />

For example, Rule 7 states that if 8 specific Prosite signatures are<br />

absent, then the prote<strong>in</strong> is not <strong>post</strong>-<strong>synaptic</strong> with 99.8% accuracy.<br />

Similarly, Rule 2 states the same th<strong>in</strong>g with 12 Prosite signatures.<br />

These rules could not have been predicted a priori with just biological<br />

knowledge. What Rules 2 and 7 do is to take a number of<br />

Prosite signatures that appear <strong>in</strong> <strong>in</strong>dividual ‘expected’ rules and comb<strong>in</strong>e<br />

them <strong>in</strong> a way that says that, when none of those signatures is<br />

present, then the prote<strong>in</strong>s are not <strong>post</strong>-<strong>synaptic</strong>. This has real utility<br />

<strong>in</strong> classify<strong>in</strong>g novel prote<strong>in</strong>s, exclud<strong>in</strong>g them from the <strong>post</strong>-<strong>synaptic</strong><br />

class.<br />

In order to better understand those rules, we also retrieved, from<br />

the dataset be<strong>in</strong>g m<strong>in</strong>ed, the prote<strong>in</strong>s which are exceptions to each<br />

of those rules—i.e. prote<strong>in</strong>s that satisfy the conditions <strong>in</strong> the rule<br />

antecedent but have a class different from the one predicted by the<br />

rule consequent. These exceptions are quite reveal<strong>in</strong>g.<br />

An exception to Rule 7 is the SPOCK prote<strong>in</strong> (Uniprot:<br />

TIC1_MOUSE). This is an extracellular matrix (ECM) prote<strong>in</strong> that<br />

is associated with the <strong>post</strong>-<strong>synaptic</strong> area of pyramidal neurons. The<br />

signatures <strong>in</strong> Rule 7 are membrane-associated or cytoplasmic; thus<br />

they do not cover this ECM prote<strong>in</strong>. The <strong>synaptic</strong> cleft is rather poor<br />

<strong>in</strong> ECM prote<strong>in</strong>s, so most other ECM prote<strong>in</strong>s (collagen etc.) are<br />

accurately <strong>in</strong>cluded <strong>in</strong> this rule.<br />

Rule 2 has some <strong>in</strong>terest<strong>in</strong>g exceptions, and aga<strong>in</strong> po<strong>in</strong>ts to limitations<br />

<strong>in</strong> the method used to generate the predictor attributes. The<br />

prote<strong>in</strong> b-Raf is a prote<strong>in</strong> k<strong>in</strong>ase that is ubiquitous <strong>in</strong> animal tissues,<br />

and has a role <strong>in</strong> mitogenic signall<strong>in</strong>g. It is also found <strong>in</strong> <strong>synaptic</strong><br />

structures. In neurons, it is thought to be part of the system that<br />

responds to growth factors such as NGF. In this sense it is not classically<br />

part of the system that responds to neurotransmitters, but rather<br />

has a role <strong>in</strong> development and ma<strong>in</strong>tenance of the nervous system.<br />

B-raf from human and mouse (Uniprot entries: BRAF_HUMAN and<br />

BRAF_MOUSE) are exceptions to Rule 2. This exception reflects<br />

both the ubiquity of b-Raf and the fact that it represents the nature<br />

of the signall<strong>in</strong>g pathways it is <strong>in</strong>volved <strong>in</strong>.<br />

5 CONCLUSIONS<br />

This paper proposed a data m<strong>in</strong><strong>in</strong>g approach to generate comprehensible<br />

rules that predict whether or not a prote<strong>in</strong> has <strong>post</strong>-<strong>synaptic</strong><br />

<strong>activity</strong>, based on Prosite patterns occurr<strong>in</strong>g (or not) <strong>in</strong> the prote<strong>in</strong>,<br />

as well as on a couple of simple prote<strong>in</strong> properties computed directly<br />

from the prote<strong>in</strong>’s primary sequence (namely the sequence length<br />

and the molecular weight of the prote<strong>in</strong>).<br />

The discovered rules were evaluated with respect to both their predictive<br />

accuracy and their degree of surpris<strong>in</strong>gness (unexpectedness)<br />

to the user. The discovered rules were very successful with respect to<br />

predictive accuracy. The ma<strong>in</strong> contribution of this paper, however, is<br />

the analysis of the rules with respect to their surpris<strong>in</strong>gess. Although<br />

this is a very important issue <strong>in</strong> data m<strong>in</strong><strong>in</strong>g (as discussed earlier),<br />

and particularly crucial <strong>in</strong> the context of scientific discovery, this<br />

issue is largely ignored <strong>in</strong> virtually all the literature about prediction<br />

of prote<strong>in</strong> function from sequence.<br />

ii24<br />

From a biological perspective, the discovered rules overall reveal<br />

<strong>in</strong>terest<strong>in</strong>g features of this approach to m<strong>in</strong><strong>in</strong>g functional data from<br />

Uniprot/SwissProt. A number of expected rules accurately predict<br />

some aspects of <strong>post</strong>-<strong>synaptic</strong> function. Other rules (unexpectedly)<br />

can exclude <strong>post</strong>-<strong>synaptic</strong> function with astonish<strong>in</strong>g accuracy. Still<br />

other rules <strong>in</strong>dicate the limitation of this approach. The lack of<br />

voltage-gated ion channel Prosite patterns (related to type 3 prote<strong>in</strong>s<br />

<strong>in</strong> the notation of Fig. 1) reflects limitations <strong>in</strong> Prosite:<br />

future approaches to this problem will need to consider this. In<br />

the future we also plan to generate a more diverse set of predictor<br />

attributes, captur<strong>in</strong>g <strong>in</strong>formation about other relevant properties of<br />

prote<strong>in</strong> sequences.<br />

A direction for future research would be to estimate the ‘<strong>in</strong>terest<strong>in</strong>gness’<br />

of the discovered rules by us<strong>in</strong>g some data-driven rule<br />

<strong>in</strong>terest<strong>in</strong>gness measures proposed <strong>in</strong> the literature. Then we would<br />

be able to automatically rank the discovered rules accord<strong>in</strong>g to<br />

those <strong>in</strong>terest<strong>in</strong>gness measures, and present the rules to the user<br />

<strong>in</strong> decreas<strong>in</strong>g order of estimated <strong>in</strong>terest<strong>in</strong>gness. We could also<br />

measure the correlation between the value of those data-driven <strong>in</strong>terest<strong>in</strong>gness<br />

measures and the subjective, real <strong>in</strong>terest of the rules to<br />

a biologist. This would allow us to evaluate how effective those<br />

data-driven <strong>in</strong>terest<strong>in</strong>gness measures are <strong>in</strong> the sense of be<strong>in</strong>g good<br />

estimators of the real human <strong>in</strong>terest <strong>in</strong> the rules. It would also<br />

be <strong>in</strong>terest<strong>in</strong>g to analyse the rules discovered by other data m<strong>in</strong><strong>in</strong>g<br />

algorithms.<br />

ACKNOWLEDGEMENTS<br />

G.L.P. is f<strong>in</strong>ancially supported by CAPES (a Brazilian researchsupport<br />

agency), process number 1650-02-5.<br />

Conflict of Interest: none declared.<br />

REFERENCES<br />

Clare,A. and K<strong>in</strong>g,R.D. (2002) Mach<strong>in</strong>e learn<strong>in</strong>g of functional class from phenotype<br />

data. Bio<strong>in</strong>formatics, 18, 160–166.<br />

Devos,D. and Valencia,A. (2000) Practical limits of functional prediction. Prote<strong>in</strong>s, 41,<br />

98–107.<br />

Fayyad,U.M., Piatetsky-Shapiro,G. and Smyth,P. (1996) From data m<strong>in</strong><strong>in</strong>g to<br />

knowledge discovery: an overview. In: Fayyad,U.M., Piatetsky-Shapiro,G. and<br />

Smyth,P. (eds), Advances <strong>in</strong> Knowledge Discovery and Data M<strong>in</strong><strong>in</strong>g. AAAI/MIT,<br />

pp. 1–34.<br />

Gerlt,J.A. and Babbitt,P.C. (2000) Can sequence determ<strong>in</strong>e function? Genome Biol.,<br />

1(5), reviews 0005.1–reviews 0005.10.<br />

Han,J. and Kamber,M. (2001) Data M<strong>in</strong><strong>in</strong>g: Concepts and Techniques. Morgan<br />

Kaufmann.<br />

Hand,D.J. (1997) Construction and Assessment of Classification Rules. Wiley.<br />

Hulo,N. et al. (2004) Recent improvements to the PROSITE database. Nucleic Acids<br />

Res., 32, D134–D137.<br />

Husi,H. et al. (2000) Proteomic analysis of NMDA receptor–adhesion prote<strong>in</strong> signal<strong>in</strong>g<br />

complexes. Nat. Neurosci., 3, 661–669.<br />

Karwath,A. and K<strong>in</strong>g,R.D. (2002) Homology <strong>in</strong>duction: the use of mach<strong>in</strong>e learn<strong>in</strong>g to<br />

improve sequence similarity searches. BMC Bio<strong>in</strong>formatics, 3, 11.<br />

K<strong>in</strong>g,R.D. et al. (2001) The utility of different representations of prote<strong>in</strong> sequence for<br />

predict<strong>in</strong>g functional class. Bio<strong>in</strong>formatics, 17, 445–454.<br />

Mirk<strong>in</strong>,B. and Ritter,O. (2000) A feature-based approach to discrim<strong>in</strong>ation and prediction<br />

of prote<strong>in</strong> fold<strong>in</strong>g groups. In: Suhai,S. (ed.), Genomics and Proteomics:<br />

Functional and Computational Aspects. Kluwer/Plenum, pp. 157–177.<br />

Nagl,S. (2003) Function prediction from prote<strong>in</strong> sequence. In Orengo,C.A., Jones,D.T.<br />

and Thornton,J.M. (eds), Bio<strong>in</strong>formatics: Genes, Prote<strong>in</strong>s & Computers, Bios, pp.<br />

65–79.<br />

Qu<strong>in</strong>lan,J.R. (1993) C4.5: Mach<strong>in</strong>e Learn<strong>in</strong>g Programs. Morgan Kaufmann.<br />

Schug,J. et al. (2002) <strong>Predict<strong>in</strong>g</strong> gene ontology functions from ProDom and CDD prote<strong>in</strong><br />

doma<strong>in</strong>s. Genome Res., 12, 648–655.


Sebban,M. et al. (2002) A data m<strong>in</strong><strong>in</strong>g approach to spacer oligonucleotide typ<strong>in</strong>g of<br />

Mycobacterium tuberculosis. Bio<strong>in</strong>formatics, 18, 235–243.<br />

Syed,U. and Yona,G. (2003) Us<strong>in</strong>g a mixture of probabilistic decision trees for direct<br />

prediction of prote<strong>in</strong> function. In Proceed<strong>in</strong>gs of the 2003 Conference on Research <strong>in</strong><br />

Computational Molecular Biology (RECOMB-2003), Berl<strong>in</strong>, Germany, pp. 289–300.<br />

Tsumoto,S. (2000) Cl<strong>in</strong>ical knowledge discovery <strong>in</strong> hospital <strong>in</strong>formation systems: two<br />

case studies. In Proceed<strong>in</strong>gs of the Fourth European Conference on Pr<strong>in</strong>ciples<br />

<strong>Predict<strong>in</strong>g</strong> <strong>post</strong>-<strong>synaptic</strong> <strong>activity</strong> <strong>in</strong> prote<strong>in</strong>s with data m<strong>in</strong><strong>in</strong>g<br />

and Practice of Knowledge Discovery <strong>in</strong> Databases (PKDD-2000), LNAI 1910,<br />

652–656. Spr<strong>in</strong>ger-Verlag.<br />

Uniprot: The Universal Prote<strong>in</strong> Resource. (2004)<br />

Walikonis,I.R.S. et al. (2000) Identification of prote<strong>in</strong>s <strong>in</strong> the <strong>post</strong><strong>synaptic</strong> density<br />

fraction by mass spectrometry. J. Neurosci., 20, 4069–4080.<br />

Witten,H. and Frank,E. (2000) Data M<strong>in</strong><strong>in</strong>g: Practical Mach<strong>in</strong>e Learn<strong>in</strong>g Tools and<br />

Techniques. Morgan Kaufmann.<br />

ii25

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!