BIOINFORMATICS
Databases
Vol. 21 Suppl. 2 2005, pages ii19–ii25
doi:10.1093/bioinformatics/bti1102
Predicting post-synaptic activity in proteins with data mining
Gisele L. Pappa 1, Anthony J. Baines 2 and Alex A. Freitas 1,∗
1 Computing Laboratory, University of Kent, Canterbury CT2 7NF, UK and 2 Department of Biosciences, University of Kent, Canterbury CT2 7NJ, UK
ABSTRACT
Summary: The bioinformatics problem being addressed in this paper is to predict whether or not a protein has post-synaptic activity. This problem is of great intrinsic interest because proteins with post-synaptic activities are connected with functioning of the nervous system. Indeed, many proteins having post-synaptic activity have been functionally characterized by biochemical, immunological and proteomic exercises. They represent a wide variety of proteins with functions in extracellular signal reception and propagation through intracellular apparatuses, cell adhesion molecules and scaffolding proteins that link them in a web. The challenge is to automatically discover features of the primary sequences of proteins that typically occur in proteins with post-synaptic activity but rarely (or never) occur in proteins without post-synaptic activity, and vice-versa. In this context, we used data mining to automatically discover classification rules that predict whether or not a protein has post-synaptic activity. The discovered rules were analysed with respect to their predictive accuracy (generalization ability) and with respect to their interestingness to biologists (in the sense of representing novel, unexpected knowledge).
Contact: A.A.Freitas@kent.ac.uk
1 INTRODUCTION
One of the great challenges of our era is to predict the functions of proteins based on their primary sequence. This is a very difficult problem, since the relationship between protein sequence and function is very complex (Gerlt and Babbitt, 2000; Devos and Valencia, 2000; Nagl, 2003). Indeed, despite the vast amount of sequence data stored in protein databases, there is still a large gap between these data and the knowledge necessary for understanding the process of protein folding and associated protein functions.
Intuitively, however, protein databases contain important ‘hidden relationships’ (knowledge) between protein sequence and protein function. There is a clear and urgent motivation for discovering this hidden knowledge from protein databases for a number of reasons, such as a better understanding of diseases, designing more effective medical drugs, etc. This creates both a need and an opportunity to apply data mining techniques to the problem of automatically discovering knowledge from protein databases.
Data mining is a multi-disciplinary field, which consists of using methods of several research areas (arguably, mainly machine learning and statistical pattern recognition) to extract interesting knowledge from real-world datasets (Witten and Frank, 2000; Fayyad et al., 1996).
∗ To whom correspondence should be addressed.
This paper proposes a data mining approach to the problem of predicting whether or not a protein has post-synaptic activity, based on features of the protein’s primary sequence. The proposed approach is described in Sections 3 and 4. In this Introduction we only emphasize a major difference between the proposed data mining approach and a more traditional bioinformatics approach for predicting protein function, as follows.
In general, the approach most used to predict the function of a new protein, for which we know only its sequence, consists of performing a similarity search in a protein database. In essence, the program finds the protein(s) most similar to the new protein, and if that similarity is higher than a threshold, the function of the most similar protein is transferred to the new protein. Although this approach is very useful in many cases, it also has some limitations, as follows. First, it is well-known that two proteins might have very similar sequences and perform different functions, or have very different sequences and perform the same or similar function (Syed and Yona, 2003; Gerlt and Babbitt, 2000). Second, the proteins being compared may be similar in regions of the sequence that are not determinants of function (Schug et al., 2002). Third, the prediction of function is based only on sequence similarity, ignoring many relevant biochemical properties of proteins (Karwath and King, 2002; Syed and Yona, 2003). Fourth, it does not produce a model for predicting function, and so it does not give insights into the relationship between the sequence, biochemical properties and function of proteins.
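The similarity-based transfer described above can be illustrated with a minimal sketch. It is not the procedure of any particular search tool: the `similarity` function uses Python's `difflib.SequenceMatcher` as a toy stand-in for a real alignment score, and the sequences, function labels and the 0.8 threshold are hypothetical values chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(seq_a: str, seq_b: str) -> float:
    """Toy similarity score in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def transfer_function(query: str, annotated: dict, threshold: float = 0.8):
    """Return the function of the most similar annotated protein, or None.

    `annotated` maps sequence -> function label (hypothetical toy data).
    """
    best_seq = max(annotated, key=lambda seq: similarity(query, seq))
    if similarity(query, best_seq) >= threshold:
        return annotated[best_seq]
    return None  # no sufficiently similar protein, so no prediction is made


# Hypothetical annotated "database" and two query sequences.
annotated = {"MKTAYIAKQR": "kinase", "GGGGGGGGGG": "structural"}
print(transfer_function("MKTAYIAKQK", annotated))  # differs in one residue
print(transfer_function("WWWWWWWWWW", annotated))  # shares no residues
```

The sketch makes the fourth limitation visible: the only output is a copied label, with no model relating sequence features to function, and a query without a close neighbour receives no prediction.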
Another approach consists of inducing, from protein data, a model describing (in a very summarized form) the data, so that new proteins can be classified by the model. This is the data mining approach followed in this project. We emphasize that this model-induction approach aims at complementing, rather than replacing, the conventional similarity-based approach. In any case, the model-induction approach followed in this project has two important advantages (King et al., 2001). First, it can predict the function of a new protein even in the absence of sequence similarity between that protein and other proteins with known function. Second, if the discovered model is expressed in a comprehensible form (which is the case in this research, where knowledge is expressed by intuitively comprehensible IF-THEN rules), it can be used by biologists to give new insights and possibly suggest new biological experiments.
Indeed, in this research the rules discovered by a data mining algorithm were not only automatically evaluated with respect to their predictive accuracy, as is usual in data mining, but also manually interpreted in the context of relevant biochemical knowledge, in order to determine how interesting they were with respect to providing novel insights, unknown in the current literature, about the relationship between some protein sequence patterns and post-synaptic activity. This two-criteria evaluation reinforces the inter-disciplinary nature of this project, which is, of course, a desirable feature in a bioinformatics project.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Fig. 1. Main elements involved in pre-synaptic and post-synaptic activity. A synapse is the point where two nerve cells communicate with each other by transmission of a chemical known as a neurotransmitter. The main elements found in synapses are shown in Figure 1. The cells are held together by cell adhesion molecules (1). In the cell where the signal is coming from (the pre-synaptic cell) neurotransmitters are stored in bags called synaptic vesicles. When signals are to be transmitted from the pre-synaptic cell to the post-synaptic cell, synaptic vesicles fuse with the pre-synaptic membrane and release their contents into the synaptic cleft between the cells. The transmitters then diffuse within the cleft, and some of them meet a post-synaptic receptor (2), which recognises them as a signal. This activates the receptor, which then transmits the signal on to other signalling components such as voltage-gated ion channels (3), protein kinases (4) and phosphatases (5). To ensure that the signal is terminated and to clear up residual neurotransmitters after the signal has terminated, transporters (6) remove neurotransmitters from the cleft. Within the post-synaptic cell, the signalling apparatus is organised by various scaffolding proteins (7).
The remainder of this paper is organised as follows: Section 2 describes the target biological problem; Section 3 describes the preparation of the dataset for mining purposes, by using protein data available in UniProt/SwissProt and Prosite; Section 4 discusses the proposed data mining approach and the corresponding analysis of results; finally, Section 5 reports the conclusions and future research directions.
2 THE TARGET BIOLOGICAL PROBLEM
In essence, post-synaptic sites represent points where one nerve cell receives signals from another. As indicated in Figure 1, multiple types of proteins are expected to be found at these sites for reception and propagation of signals, and for joining the two nerve cells to each other. Note that Figure 1 is a very minimal summary of the types of proteins found in post-synaptic sites.
The bioinformatics problem being addressed in this paper is to predict whether or not a protein has post-synaptic activity. This problem is of great intrinsic interest because proteins with post-synaptic activities are connected with the functioning of the nervous system. Indeed, many proteins having post-synaptic activity have been functionally characterized by biochemical, immunological and proteomic exercises [see e.g. Husi et al. (2000) and Walikonis et al. (2000)], and are now extensively catalogued and annotated in the UniProt/SwissProt database. They represent a wide variety of proteins with functions in extracellular signal reception and propagation through intracellular apparatuses, cell adhesion molecules and scaffolding proteins that link them in a web.
The challenge is to automatically discover features of proteins’ primary sequences that typically occur in proteins with post-synaptic activity but rarely (or never) occur in proteins without post-synaptic activity, and vice-versa. These discovered features constitute the essence of the knowledge discovered by a data mining algorithm. If the algorithm is successful in discovering knowledge with a high predictive power, that knowledge can be used to accurately discriminate between the two classes of proteins. In addition, and most important, the knowledge discovered in this project is expressed in a comprehensible form, as mentioned in the Introduction, and is therefore potentially valuable by itself, because it can give biologists new insights about which sequence features are predictive of post-synaptic activity.
3 METHODS
The goal of this project is to predict, with a data mining algorithm, whether or not a protein has post-synaptic activity. The algorithm is used to discover interesting relationships between sequence features that are often present in post-synaptic proteins but usually absent in proteins without post-synaptic activity, and vice-versa. Therefore, in order to discover this kind of knowledge, we need not only a set of post-synaptic proteins but also a control set of proteins which do not have post-synaptic activity.
In data mining terminology, the set of proteins with post-synaptic activity is called the set of positive examples, whereas the control set of proteins (without post-synaptic activity) is called the set of negative examples. The problem of finding sequence features that discriminate between these two kinds of proteins is then cast as a classification problem, where the goal is to predict the value of a class attribute for each example (protein) based on a set of predictor attributes for that example. The classes are whether or not a protein has post-synaptic activity, and the predictor attributes are mainly Prosite patterns, as described below.
More precisely, the dataset mined in this project was constructed in two phases. In the first phase we carefully selected relevant proteins from the UniProt database. UniProt (Universal Protein Resource) was created by the union of Swiss-Prot, TrEMBL and PIR databases, and is a repository for protein sequence and functional data (Uniprot, 2004, http://www.ebi.uniprot.org/index.shtml). UniProt was chosen as the source of the data to be mined because it is the world standard non-redundant, comprehensive protein sequence database. UniProt is divided into three database layers: UniParc (UniProt Archive), UniProt Knowledgebase (UniProt/SwissProt) and UniProt NREF (UniRef). In this research we have used UniProt/SwissProt, which is the most richly annotated layer. One of the great advantages of this layer is the comprehensive annotation of SwissProt, which is curated to ensure minimal redundancy and accurate, comprehensive annotation of function, expression, sequence features (e.g. domain structures), literature references, links to other databases, etc.
The second phase of data preparation used the links from UniProt/SwissProt to the Prosite database, in order to retrieve information from Prosite that was used to create the predictor attributes. These two phases are described in detail next.
3.1 Phase 1: selecting positive and negative examples
Table 1 shows the queries submitted to UniProt/SwissProt in order to select the set of positive examples (proteins with post-synaptic activity) and the set of negative examples (proteins without post-synaptic activity). For each query, the table shows the specification of the query and the number of examples (proteins) returned by that query. The specification of each query consists of keywords and the logical operator ‘NOT’ (!).

Table 1. Queries submitted to UniProt/SwissProt to create the dataset

Query No.  Query specification                          No. of examples
Selection of positive examples
1          Post-synaptic !toxin                         356
           Total number of positive examples            356
Selection of negative examples
2          Heart !(result of query 1)                   3106
3          Cardiac !(results of queries 1, 2)           331
4          Liver !(results of queries 1, 2, 3)          2794
5          Hepatic !(results of queries 1, 2, 3, 4)     256
6          Kidney !(results of queries 1, 2, 3, 4, 5)   988
           Total number of negative examples            7475

Note that the queries did not restrict the search to any particular species, i.e. proteins from all species were considered. This was done in order to maximize the number of examples in the data to be mined. Positive examples were selected not only by the presence of the keyword ‘post-synaptic’ but also by the absence of the keyword ‘toxin’. The reason for this latter criterion is that several entries in UniProt/SwissProt refer to the toxin α-latrotoxin. This protein acts post-synaptically, but it is not, of course, a post-synaptic protein.
The selection of the negative examples was more difficult, since of course the UniProt/SwissProt entries do not have an explicit ‘not post-synaptic’ keyword. The trivial ‘solution’ of retrieving all entries that do not explicitly have the keyword ‘post-synaptic’ is not satisfactory, for two reasons. First, this would produce a number of negative examples much larger than the number of positive examples. This would create a dataset with an extremely unbalanced class distribution, and it would be very difficult for the classification algorithm to discover rules that correctly classify the positive (post-synaptic) class. Second, and most important, many of the negative examples would be ‘trivial’, leading to uninteresting, trivial rules for the discrimination between positive and negative examples. For instance, there is no need to discover rules discriminating plant proteins from post-synaptic proteins, since it is obvious that plants do not have a nervous system. Hence, including plant proteins in the set of negative examples would only contribute to the discovery of uninteresting, trivial classification rules.
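To make the class-balance concern concrete, the short calculation below uses only the counts reported in Table 1. It is an illustrative sketch, not part of the original method: the counts are those returned by the queries, before any later filtering of the dataset, and the "always predict negative" baseline is an assumption introduced here for comparison.

```python
# Counts taken from Table 1 (query results, before any later filtering).
positives = 356
negatives_by_keyword = {
    "heart": 3106,
    "cardiac": 331,
    "liver": 2794,
    "hepatic": 256,
    "kidney": 988,
}

negatives = sum(negatives_by_keyword.values())
ratio = negatives / positives

# Accuracy of a trivial classifier that labels every protein as negative.
majority_baseline = negatives / (positives + negatives)

print(f"negative examples: {negatives}")
print(f"negatives per positive: {ratio:.1f}")
print(f"always-negative baseline accuracy: {majority_baseline:.3f}")
```

Even the selected negative set is roughly 21 times larger than the positive set, so overall accuracy alone says little about how well the positive class is recognised; retrieving every entry without the ‘post-synaptic’ keyword would make the ratio far more extreme.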
The goal is to select a set of negative examples where, although the proteins do not have post-synaptic activity, they have some characteristics that could be confused with some of the characteristics of post-synaptic proteins (the positive examples), making it difficult to discriminate between these two kinds of proteins. Intuitively, the higher the similarity between positive and negative examples, the harder the classification problem, and so the stronger the motivation to use a data mining algorithm to discover interesting classification rules representing novel knowledge to a biologist. (Recall that the ultimate goal is to automatically discover interesting, novel rules that provide new insight to biologists about which sequence features are most correlated with the presence or absence of post-synaptic activity.)
Note that, although the positive and negative examples must have enough similarity to lead to the discovery of interesting, non-trivial rules, they also must be different enough to allow the discovery of reliable classification rules for each class. Hence, the challenge is to find a good trade-off between these two goals.
In this spirit, the negative examples were selected by using the keywords ‘heart’, ‘cardiac’, ‘liver’, ‘hepatic’ and ‘kidney’. Consider, for instance, query 4 in Table 1: ‘Liver !(results of queries 1, 2, 3)’. This query means that all UniProt/SwissProt entries that had the keyword ‘liver’ and were not already included in the results of the previous queries (1, 2, 3) were selected as negative examples. The latter selection criterion was necessary to avoid the possibility that a data entry having two or more of the above-mentioned keywords was duplicated in the set of negative examples.
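The successive-exclusion logic of queries 2 to 6 can be sketched with plain set operations. This is an assumption-laden illustration rather than the actual query mechanism: the function name `select_without_duplicates` and the entry identifiers (`E1`, `E2`, ...) are hypothetical, and the keyword hits are assumed to be already available as lists.

```python
def select_without_duplicates(keyword_hits: dict, excluded: set) -> dict:
    """Apply keyword queries in order, skipping entries selected earlier.

    keyword_hits: keyword -> list of entry ids matching that keyword.
    excluded: entry ids already taken (here, the positive examples).
    """
    seen = set(excluded)
    selected = {}
    for keyword, hits in keyword_hits.items():
        new_entries = set(hits) - seen  # the '!(results of queries ...)' part
        selected[keyword] = new_entries
        seen |= new_entries
    return selected


# Hypothetical entry ids; E1 and E2 play the role of positive examples.
positive_ids = {"E1", "E2"}
keyword_hits = {
    "heart": ["E2", "E3", "E4"],
    "cardiac": ["E3", "E5"],
    "liver": ["E1", "E5", "E6"],
}
negatives = select_without_duplicates(keyword_hits, positive_ids)
for keyword, entries in negatives.items():
    print(keyword, sorted(entries))
```

Because each entry is assigned to the first query that returns it, the per-query counts can be summed directly to obtain the total number of negative examples, as in Table 1.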
The previously-mentioned five keywords were chosen as the basis for obtaining negative examples for two reasons. First, it is known that, in general, proteins found in these sites do not present post-synaptic activity. Second, proteins found in these sites often have some characteristics similar to post-synaptic proteins. Indeed, in the context of this project, many of the same types of proteins represented at post-synaptic sites (kinases/phosphatases/channels, etc.) are present in abundance in heart, liver and kidney tissues.
It should be noted that the queries listed in Table 1 searched for the corresponding keywords in all the fields of the UniProt/SwissProt entries. This means that the selection criteria are not perfect, since those keywords could be present in fields where the presence of the keyword does not mean that the protein has the corresponding function/characteristic. This potentially introduces some ‘noise’ in the dataset being mined. However, this was necessary, because queries searching for keywords only in the field ‘KEYWORD’ of UniProt/SwissProt returned too few proteins. In any case, the amount of noise introduced by the imperfect selection criteria seems to be relatively small, since the data mining algorithm was able to discover quite accurate classification rules, as will be shown later.
3.2 Phase 2: generating the predictor attributes
Once the set of examples to be mined had been selected from UniProt/SwissProt, the next step was to generate a set of predictor attributes representing relevant properties of the sequences of those proteins. The predictor attributes must have a good predictive power and facilitate easy interpretation by biologists. In this project we have focused mainly on generating attributes based on Prosite patterns associated with the proteins, a type of attribute satisfying both the previously-mentioned properties. The Prosite database stores significant patterns and profiles that help to identify the family of a new protein (Hulo et al., 2004). We decided to use attributes based only on Prosite patterns, and not Prosite profiles, for two reasons. First, the matching between a Prosite pattern and a protein can be exactly computed, producing a simple binary attribute, i.e. the pattern either occurs or does not occur in a given protein. This reduces the size of the search space for the data mining algorithm and simplifies the interpretation of the rule by biologists. By contrast, the matching between a Prosite profile and a protein is an approximate matching, and the data mining algorithm would have to search in a correspondingly much larger search space. Second, the use of both patterns and profiles would lead to a very large number of attributes, which would again expand the size of the search space, and so would significantly increase the risk of overfitting the induced model to the data. It should be noted that, even considering only the binary attributes derived from Prosite patterns, this led to 443 attributes (as explained next), which corresponds to a huge search space of size 2^443 (on the order of 10^133).
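The exact, yes-or-no nature of pattern matching can be illustrated by translating a pattern written in Prosite syntax into a regular expression. This is only a sketch under stated assumptions: the attributes in this paper were obtained from database cross-references rather than by re-matching sequences, the helper names `prosite_to_regex` and `has_pattern` are hypothetical, and the translation covers only the basic syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition), not terminal anchors or other rarer constructs. The pattern `N-{P}-[ST]-{P}` and the two short sequences are used purely to show the syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate basic Prosite pattern syntax into a Python regex string."""
    regex = pattern.rstrip(".")        # patterns may end with a period
    regex = regex.replace("-", "")     # '-' only separates pattern elements
    regex = regex.replace("x", ".")    # 'x' stands for any residue
    # Forbidden residues: {P} becomes [^P]. Must run before the next step.
    regex = regex.replace("{", "[^").replace("}", "]")
    # Repetition counts: x(2,4) becomes .{2,4}
    regex = regex.replace("(", "{").replace(")", "}")
    return regex


def has_pattern(sequence: str, pattern: str) -> str:
    """Binary attribute value: 'yes' if the pattern occurs, else 'no'."""
    return "yes" if re.search(prosite_to_regex(pattern), sequence) else "no"


pattern = "N-{P}-[ST]-{P}"
print(prosite_to_regex(pattern))
print(has_pattern("MNGSA", pattern))  # N, then non-P, then S, then non-P
print(has_pattern("MNPSA", pattern))  # the residue after N is P
```

A profile, by contrast, yields a score that must be compared with a cut-off, which is why the authors regard patterns as the simpler source of binary attributes.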
For each protein selected in phase 1 (regardless of the protein being a<br />
positive or a negative example), we retrieved all Prosite entry IDs that occurred<br />
in the field database cross-references (DR) of UniProt/SwissProt. For each<br />
Prosite entry ID, we ‘followed the link’ from UniProt/SwissProt to Prosite, in<br />
order to access information about that Prosite entry. Once this was done for<br />
all proteins selected in phase 1, we had a large set of Prosite entries. We then<br />
selected, for use as predictor attributes, the entries that:<br />
(i) were marked as a ‘pattern’ in the ID line of that entry in the Prosite<br />
database;<br />
(ii) were not commented (in the CC line of the Prosite database) as<br />
very general patterns (/SKIP_FLAG = TRUE), which had to be<br />
excluded because they appear in almost all proteins, and<br />
so are not useful to discriminate between the two classes of proteins;<br />
(iii) occurred in at least two proteins of the dataset being mined, a requirement<br />
that removes extremely specific patterns, occurring in<br />
just one protein of the dataset, which do not have any generalisation<br />
power.<br />
Finally, each of the selected Prosite patterns was encoded as one binary<br />
attribute of the dataset being mined, taking on the value ‘yes’ or<br />
‘no’ for each protein, indicating whether or not the pattern occurs in that<br />
protein, respectively.<br />
Note that a few proteins in the dataset to be mined did not have any Prosite<br />
pattern, i.e. they had the value ‘no’ for all predictor attributes based on Prosite<br />
patterns. These proteins were removed from the data to be mined.<br />
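The selection criteria (i) to (iii) and the yes/no encoding described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `protein_entries` mapping, the `entry_info` dictionary with its `kind` and `skip_flag` keys, and the function name `build_binary_attributes` are hypothetical stand-ins for data that would be parsed from UniProt/SwissProt DR lines and Prosite ID/CC lines.

```python
from collections import Counter


def build_binary_attributes(protein_entries, entry_info, min_proteins=2):
    """Encode Prosite patterns as yes/no attributes (illustrative sketch).

    protein_entries: dict mapping protein id -> set of Prosite entry ids
                     (hypothetical; stands in for UniProt/SwissProt DR lines).
    entry_info:      dict mapping entry id -> {"kind": ..., "skip_flag": ...}
                     (hypothetical; stands in for Prosite ID and CC lines).
    """
    # Criteria (i) and (ii): keep patterns that are not flagged as very general.
    candidates = {
        e for e, info in entry_info.items()
        if info["kind"] == "pattern" and not info["skip_flag"]
    }
    # Criterion (iii): keep patterns occurring in at least `min_proteins` proteins.
    counts = Counter(
        e for entries in protein_entries.values() for e in entries if e in candidates
    )
    selected = sorted(e for e, n in counts.items() if n >= min_proteins)

    table = {}
    for protein, entries in protein_entries.items():
        row = {e: ("yes" if e in entries else "no") for e in selected}
        # Proteins with 'no' for every selected pattern are dropped.
        if "yes" in row.values():
            table[protein] = row
    return selected, table


# Toy data, invented purely for illustration.
entry_info = {
    "PAT_A": {"kind": "pattern", "skip_flag": False},
    "PAT_B": {"kind": "pattern", "skip_flag": True},    # very general pattern
    "PROF_C": {"kind": "profile", "skip_flag": False},  # profile, not a pattern
    "PAT_D": {"kind": "pattern", "skip_flag": False},   # occurs in one protein only
}
protein_entries = {
    "P1": {"PAT_A", "PAT_B"},
    "P2": {"PAT_A", "PROF_C"},
    "P3": {"PAT_D", "PAT_B"},
}
selected, table = build_binary_attributes(protein_entries, entry_info)
print(selected)
print(table)
```

In this toy run only `PAT_A` survives all three criteria, and `P3` is dropped because it carries none of the selected patterns.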
In addition to the previously described attributes based on Prosite patterns,<br />
we added to the dataset two simple predictor attributes derived directly from<br />
the proteins’ sequences, namely the sequence length and the molecular weight<br />
of the protein (both attributes are available from the corresponding fields in<br />
UniProt/SwissProt entries). Other kinds of attributes will be considered in<br />
future research, but for now it is interesting to note that even the current set of<br />
predictor attributes is enough to discover quite accurate classification rules,<br />
as will be shown later.<br />
After this data preparation process, we ended up with a dataset composed<br />
of 4303 examples (260 belonging to the positive class and 4043<br />
belonging to the negative class) and 445 predictor attributes. These 445<br />
attributes include 443 Prosite patterns, the sequence length and the molecular<br />
weight of each protein.<br />
4 RESULTS<br />
In order to discover knowledge from the dataset described in the<br />
previous section, we have used the well-known C4.5Rules rule induction<br />
algorithm (Quinlan, 1993). This algorithm was chosen as the<br />
data mining algorithm in our experiments mainly because it produces<br />
comprehensible knowledge, represented by a set of high-level,<br />
easily interpretable classification rules of the form: IF (conditions)<br />
THEN (class). This kind of rule has the intuitive meaning that, if an<br />
example (protein) satisfies the conditions in the rule antecedent, the<br />
example is assigned to the class predicted by the rule consequent.<br />
It should be noted that the comprehensibility of discovered knowledge<br />
is a very important issue in bioinformatics [see e.g. Mirkin<br />
and Ritter (2000), Clare and King (2002) and Sebban et al.<br />
(2002)], because the discovered knowledge should be interpreted<br />
and validated by biologists, rather than being blindly trusted as a<br />
‘black box’.<br />
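To make the IF (conditions) THEN (class) form concrete, the following sketch shows one way such a rule could be represented and applied to a protein. The `Rule` class, its `covers` method and the toy attribute values are assumptions made for illustration; they are not part of C4.5Rules itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """One IF (conditions) THEN (class) rule (hypothetical representation)."""
    conditions: List[Callable[[Dict], bool]]  # all must hold (logical AND)
    predicted_class: str

    def covers(self, example: Dict) -> bool:
        # The rule fires only if every condition in the antecedent is satisfied.
        return all(cond(example) for cond in self.conditions)


# Rule 32 of Table 2: IF (NEUROTR_ION_CHANNEL = yes) THEN (class = yes)
rule_32 = Rule(
    conditions=[lambda ex: ex["NEUROTR_ION_CHANNEL"] == "yes"],
    predicted_class="yes",
)

# Toy examples with invented attribute values.
protein_a = {"NEUROTR_ION_CHANNEL": "yes", "seq_length": 480}
protein_b = {"NEUROTR_ION_CHANNEL": "no", "seq_length": 1200}

for name, protein in [("protein_a", protein_a), ("protein_b", protein_b)]:
    if rule_32.covers(protein):
        print(name, "-> class =", rule_32.predicted_class)
    else:
        print(name, "-> rule does not apply")
```

Keeping the antecedent as a list of small predicates mirrors the way a biologist reads the rule: one readable condition at a time, all of which must hold.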
We used the default parameters of C4.5Rules. The classification<br />
rules discovered by C4.5Rules were evaluated according to two criteria,<br />
namely predictive accuracy and interestingness to biologists, as<br />
follows. Predictive accuracy was estimated by a well-known 10-fold<br />
cross-validation procedure (Witten and Frank, 2000), as usual in data<br />
mining. In essence, the dataset was divided into 10 partitions, with<br />
approximately the same number of examples (proteins) in each partition.<br />
In the i-th iteration, i = 1, 2, ..., 10, the i-th partition was used<br />
as the test set and the other 9 partitions were temporarily merged and<br />
used as the training set. In each iteration C4.5Rules discovered a rule<br />
set from the training set and used that rule set to classify examples in<br />
the test set (unseen during training), in order to evaluate the generalisation<br />
ability of discovered knowledge. The classification accuracy<br />
rate of the discovered rules can then be computed as the average<br />
accuracy rate over the 10 test sets, and this is the measure of predictive<br />
accuracy most popular in the literature. In our experiments, this<br />
produced an accuracy rate of 97.85%.<br />
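The cross-validation loop described above can be sketched as follows. This is only an illustration: the paper does not say how examples were assigned to partitions, so the seeded shuffle is an assumption, and `k_fold_partitions`, `cross_validate` and `dummy_train_and_score` are hypothetical names. The dummy scorer stands in for inducing a rule set on the training partitions and measuring its accuracy on the test partition.

```python
import random


def k_fold_partitions(n_examples, k=10, seed=0):
    """Split example indices into k partitions of approximately equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption of this sketch
    return [indices[i::k] for i in range(k)]


def cross_validate(n_examples, train_and_score, k=10):
    """Average the test-set score over the k iterations."""
    folds = k_fold_partitions(n_examples, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        # Merge the other k - 1 partitions into the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


folds = k_fold_partitions(4303)
print([len(f) for f in folds])  # partition sizes for the 4303 examples


# Placeholder learner (hypothetical): a real run would induce a rule set on
# train_idx and return its accuracy on test_idx. Here it only checks that
# the training and test sets do not overlap.
def dummy_train_and_score(train_idx, test_idx):
    return 1.0 if not set(train_idx) & set(test_idx) else 0.0


print(cross_validate(4303, dummy_train_and_score))
```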
It should be noted, however, that in the context of this project this<br />
traditional measure of accuracy rate is not a very effective one. The<br />
reason is that the class distribution is very unbalanced: only 6.0% of<br />
the examples (260 of 4303) have the positive class. Hence, as a baseline solution for<br />
this classification problem, the ‘majority classifier’, which predicts<br />
the majority (negative) class for all examples, would trivially obtain<br />
an accuracy rate of 93.9%, without providing any insight about the<br />
relationship between the predictor attributes and the classes.<br />
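As a quick check, the baseline figure follows directly from the class counts of the dataset (260 positive and 4043 negative examples). The short sketch below recomputes the positive-class share and the majority classifier's accuracy, which comes out at roughly 94%; the variable names are illustrative only.

```python
n_positive = 260   # positive-class examples in the prepared dataset
n_negative = 4043  # negative-class examples in the prepared dataset
n_total = n_positive + n_negative

positive_share = n_positive / n_total
majority_accuracy = n_negative / n_total  # predict 'negative' for every example

print(f"positive share:    {positive_share:.4f}")
print(f"majority accuracy: {majority_accuracy:.4f}")
```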
Therefore, we use a more ‘demanding’ measure of predictive<br />
accuracy, for which a high value can be obtained only by accurately<br />
classifying examples of both classes. The measure in question is the<br />
product: true positive rate (TPR) × true negative rate (TNR) (Hand,<br />
1997). These terms (which are sometimes referred to as Sensitivity<br />
and Specificity, respectively) are defined as follows.<br />
TPR = TP/(TP + FN), TNR = TN/(TN + FP),<br />
where<br />
TP = number of true positives, i.e. the number of examples that<br />
were predicted as positive class by the discovered rule set,<br />
and indeed have the positive class;<br />
FN = number of false negatives, i.e. the number of examples that<br />
were predicted as negative class, but actually have the positive<br />
class;<br />
TN = number of true negatives, i.e. the number of examples that<br />
were predicted as negative class, and indeed have the negative<br />
class;<br />
FP = number of false positives, i.e. the number of examples that<br />
were predicted as positive class, but actually have the negative<br />
class.<br />
In our experiments the average values (over the 10 iterations of the<br />
cross-validation procedure) of the TPR and TNR were 0.85 and 0.98<br />
respectively, resulting in the final measure of predictive accuracy as<br />
TPR × TNR = 0.84 (with a standard deviation of 0.09). Note that<br />
the baseline majority classifier, which has TPR = 0 and TNR = 1, obtains TPR × TNR = 0 × 1 = 0,<br />
i.e. it is very strongly penalized (as it should be) for never predicting<br />
the positive class.<br />
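The TPR × TNR measure can be computed from the four confusion-matrix counts as in the sketch below. The function name `tpr_tnr_product` is hypothetical. The first call uses the majority classifier on the class counts of the dataset; the second uses invented counts for an imaginary classifier and is not a result from the experiments.

```python
def tpr_tnr_product(tp, fn, tn, fp):
    """Return (TPR, TNR, TPR * TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr, tpr * tnr


# Majority classifier on the dataset's class counts: every example is
# predicted negative, so TP = 0, FN = 260, TN = 4043, FP = 0.
majority = tpr_tnr_product(tp=0, fn=260, tn=4043, fp=0)
print(majority)

# Invented counts for a hypothetical classifier (not results from the paper).
toy = tpr_tnr_product(tp=22, fn=4, tn=396, fp=8)
print(toy)
```

Because the two rates are multiplied, a classifier that ignores either class is driven to zero, however high its plain accuracy is.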
It should also be noted that, although the vast majority of the data<br />
mining literature focuses on measuring only the predictive accuracy<br />
of the discovered rules, the ultimate goal of data mining is to discover<br />
knowledge that is comprehensible and interesting (novel, unexpected)<br />
to the user (Fayyad et al., 1996; Han and Kamber, 2001). We<br />
emphasize that a very accurate rule will not be useful to the user if<br />
it represents a previously known pattern. Consider, for instance, the<br />
following hypothetical example. In a hospital’s medical database a<br />
data mining algorithm could discover the rule: IF (patient is pregnant)<br />
THEN (patient’s gender is female). This rule is extremely accurate,<br />
but it is also completely useless, since it represents an obvious pattern.<br />
As a real-world example of the difficulty of discovering novel,<br />
unexpected rules, Tsumoto (2000) reports that, in experiments with<br />
two medical datasets,
Table 2. Rules discovered by C4.5Rules
Id Classification rule
32 IF (NEUROTR_ION_CHANNEL = yes) THEN (class = yes)
19 IF (CADHERIN_1 = yes) AND (920 < seq_length 699) THEN (class = yes)
10 IF (G_PROTEIN_RECEP_F1_1 = yes) AND (11287 < mol_weight 318) THEN (class = yes)
21 IF (G_PROTEIN_RECEP_F3_1 = yes) AND (mol_weight
31.4%. The low accuracy of these two rules comes from the fact that<br />
ser/thr phosphatases and G-protein coupled receptors are expressed<br />
in every human cell, and not just post-synaptically.<br />
The unexpected rules are much more complicated, but they<br />
are surprisingly accurate. Therefore, in general they represent<br />
interesting knowledge to biologists who are experts in<br />
post-synaptic proteins.<br />
For example, Rule 7 states that if 8 specific Prosite signatures are<br />
absent, then the protein is not post-synaptic with 99.8% accuracy.<br />
Similarly, Rule 2 states the same thing with 12 Prosite signatures.<br />
These rules could not have been predicted a priori with just biological<br />
knowledge. What Rules 2 and 7 do is to take a number of<br />
Prosite signatures that appear in individual ‘expected’ rules and combine<br />
them in a way that says that, when none of those signatures is<br />
present, then the proteins are not post-synaptic. This has real utility<br />
in classifying novel proteins, excluding them from the post-synaptic<br />
class.<br />
In order to better understand those rules, we also retrieved, from<br />
the dataset being mined, the proteins which are exceptions to each<br />
of those rules, i.e. proteins that satisfy the conditions in the rule<br />
antecedent but have a class different from the one predicted by the<br />
rule consequent. These exceptions are quite revealing.<br />
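Retrieving the exceptions to a rule amounts to a simple filter: keep the examples that satisfy the antecedent but whose actual class differs from the predicted one. The sketch below illustrates this on toy data. The function `find_exceptions`, the signature names `SIG_A` and `SIG_B` and the three example proteins are all invented for illustration.

```python
def find_exceptions(examples, antecedent, predicted_class):
    """Return ids of examples covered by the rule but with a different class."""
    return [
        ex["id"]
        for ex in examples
        if antecedent(ex) and ex["class"] != predicted_class
    ]


# Toy rule in the spirit of the exclusion rules: IF none of the listed
# signatures is present THEN class = no. Signature names are invented.
signatures = ["SIG_A", "SIG_B"]


def none_present(ex):
    return all(ex[s] == "no" for s in signatures)


examples = [
    {"id": "prot1", "SIG_A": "no", "SIG_B": "no", "class": "no"},
    {"id": "prot2", "SIG_A": "no", "SIG_B": "no", "class": "yes"},   # exception
    {"id": "prot3", "SIG_A": "yes", "SIG_B": "no", "class": "yes"},  # not covered
]
exceptions = find_exceptions(examples, none_present, predicted_class="no")
print(exceptions)
```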
An exception to Rule 7 is the SPOCK protein (UniProt:<br />
TIC1_MOUSE). This is an extracellular matrix (ECM) protein that<br />
is associated with the post-synaptic area of pyramidal neurons. The<br />
signatures in Rule 7 are membrane-associated or cytoplasmic; thus<br />
they do not cover this ECM protein. The synaptic cleft is rather poor<br />
in ECM proteins, so most other ECM proteins (collagen etc.) are<br />
accurately included in this rule.<br />
Rule 2 has some interesting exceptions, which again point to limitations<br />
in the method used to generate the predictor attributes. The<br />
protein b-Raf is a protein kinase that is ubiquitous in animal tissues,<br />
and has a role in mitogenic signalling. It is also found in synaptic<br />
structures. In neurons, it is thought to be part of the system that<br />
responds to growth factors such as NGF. In this sense it is not classically<br />
part of the system that responds to neurotransmitters, but rather<br />
has a role in development and maintenance of the nervous system.<br />
b-Raf from human and mouse (UniProt entries: BRAF_HUMAN and<br />
BRAF_MOUSE) are exceptions to Rule 2. This exception reflects<br />
both the ubiquity of b-Raf and the nature<br />
of the signalling pathways it is involved in.<br />
5 CONCLUSIONS<br />
This paper proposed a data mining approach to generate comprehensible<br />
rules that predict whether or not a protein has post-synaptic<br />
activity, based on Prosite patterns occurring (or not) in the protein,<br />
as well as on a couple of simple protein properties computed directly<br />
from the protein’s primary sequence (namely the sequence length<br />
and the molecular weight of the protein).<br />
The discovered rules were evaluated with respect to both their predictive<br />
accuracy and their degree of surprisingness (unexpectedness)<br />
to the user. The discovered rules were very successful with respect to<br />
predictive accuracy. The main contribution of this paper, however, is<br />
the analysis of the rules with respect to their surprisingness. Although<br />
this is a very important issue in data mining (as discussed earlier),<br />
and particularly crucial in the context of scientific discovery, this<br />
issue is largely ignored in the literature about prediction<br />
of protein function from sequence.<br />
From a biological perspective, the discovered rules overall reveal<br />
interesting features of this approach to mining functional data from<br />
UniProt/SwissProt. A number of expected rules accurately predict<br />
some aspects of post-synaptic function. Other rules (unexpectedly)<br />
can exclude post-synaptic function with astonishing accuracy. Still<br />
other rules indicate the limitation of this approach. The lack of<br />
voltage-gated ion channel Prosite patterns (related to type 3 proteins<br />
in the notation of Fig. 1) reflects limitations in Prosite:<br />
future approaches to this problem will need to consider this. In<br />
the future we also plan to generate a more diverse set of predictor<br />
attributes, capturing information about other relevant properties of<br />
protein sequences.<br />
A direction for future research would be to estimate the ‘interestingness’<br />
of the discovered rules by using some data-driven rule<br />
interestingness measures proposed in the literature. Then we would<br />
be able to automatically rank the discovered rules according to<br />
those interestingness measures, and present the rules to the user<br />
in decreasing order of estimated interestingness. We could also<br />
measure the correlation between the value of those data-driven interestingness<br />
measures and the subjective, real interest of the rules to<br />
a biologist. This would allow us to evaluate how effective those<br />
data-driven interestingness measures are in the sense of being good<br />
estimators of the real human interest in the rules. It would also<br />
be interesting to analyse the rules discovered by other data mining<br />
algorithms.<br />
ACKNOWLEDGEMENTS<br />
G.L.P. is financially supported by CAPES (a Brazilian research-support<br />
agency), process number 1650-02-5.<br />
Conflict of Interest: none declared.<br />
REFERENCES<br />
Clare,A. and King,R.D. (2002) Machine learning of functional class from phenotype<br />
data. Bioinformatics, 18, 160–166.<br />
Devos,D. and Valencia,A. (2000) Practical limits of functional prediction. Proteins, 41,<br />
98–107.<br />
Fayyad,U.M., Piatetsky-Shapiro,G. and Smyth,P. (1996) From data mining to<br />
knowledge discovery: an overview. In: Fayyad,U.M., Piatetsky-Shapiro,G. and<br />
Smyth,P. (eds), Advances in Knowledge Discovery and Data Mining. AAAI/MIT,<br />
pp. 1–34.<br />
Gerlt,J.A. and Babbitt,P.C. (2000) Can sequence determine function? Genome Biol.,<br />
1(5), reviews 0005.1–reviews 0005.10.<br />
Han,J. and Kamber,M. (2001) Data Mining: Concepts and Techniques. Morgan<br />
Kaufmann.<br />
Hand,D.J. (1997) Construction and Assessment of Classification Rules. Wiley.<br />
Hulo,N. et al. (2004) Recent improvements to the PROSITE database. Nucleic Acids<br />
Res., 32, D134–D137.<br />
Husi,H. et al. (2000) Proteomic analysis of NMDA receptor–adhesion protein signaling<br />
complexes. Nat. Neurosci., 3, 661–669.<br />
Karwath,A. and King,R.D. (2002) Homology induction: the use of machine learning to<br />
improve sequence similarity searches. BMC Bioinformatics, 3, 11.<br />
King,R.D. et al. (2001) The utility of different representations of protein sequence for<br />
predicting functional class. Bioinformatics, 17, 445–454.<br />
Mirkin,B. and Ritter,O. (2000) A feature-based approach to discrimination and prediction<br />
of protein folding groups. In: Suhai,S. (ed.), Genomics and Proteomics:<br />
Functional and Computational Aspects. Kluwer/Plenum, pp. 157–177.<br />
Nagl,S. (2003) Function prediction from protein sequence. In: Orengo,C.A., Jones,D.T.<br />
and Thornton,J.M. (eds), Bioinformatics: Genes, Proteins & Computers, Bios, pp.<br />
65–79.<br />
Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann.<br />
Schug,J. et al. (2002) Predicting gene ontology functions from ProDom and CDD protein<br />
domains. Genome Res., 12, 648–655.<br />
Sebban,M. et al. (2002) A data mining approach to spacer oligonucleotide typing of<br />
Mycobacterium tuberculosis. Bioinformatics, 18, 235–243.<br />
Syed,U. and Yona,G. (2003) Using a mixture of probabilistic decision trees for direct<br />
prediction of protein function. In: Proceedings of the 2003 Conference on Research in<br />
Computational Molecular Biology (RECOMB-2003), Berlin, Germany, pp. 289–300.<br />
Tsumoto,S. (2000) Clinical knowledge discovery in hospital information systems: two<br />
case studies. In: Proceedings of the Fourth European Conference on Principles<br />
and Practice of Knowledge Discovery in Databases (PKDD-2000), LNAI 1910,<br />
652–656. Springer-Verlag.<br />
Uniprot: The Universal Prote<strong>in</strong> Resource. (2004)<br />
Walikonis,R.S. et al. (2000) Identification of proteins in the postsynaptic density<br />
fraction by mass spectrometry. J. Neurosci., 20, 4069–4080.<br />
Witten,I.H. and Frank,E. (2000) Data Mining: Practical Machine Learning Tools and<br />
Techniques. Morgan Kaufmann.<br />