Unsupervised Recursive Sequence Processing

Marc Strickert, Barbara Hammer
Research group LNM, Department of Mathematics/Computer Science,
University of Osnabrück, Germany

Sebastian Blohm
Institute for Cognitive Science,
University of Osnabrück, Germany

Abstract
The self-organizing map (SOM) is a valuable tool for data visualization and data mining for
potentially high-dimensional data of an a priori fixed dimensionality. We investigate SOMs
for sequences and propose the SOM-S architecture for sequential data. Sequences of potentially
infinite length are recursively processed by integrating the currently presented item
and the recent map activation, as proposed in [11]. We combine that approach with the
hyperbolic neighborhood of Ritter [29] in order to account for the possibly
exponentially increasing sequence diversification over time. Discrete and real-valued
sequences can be processed efficiently with this method, as we show in experiments.
Temporal dependencies can be reliably extracted from a trained SOM. U-Matrix methods,
adapted to sequence-processing SOMs, allow the detection of clusters even for real-valued
sequence elements.

Key words: Self-organizing map, sequence processing, recurrent models, hyperbolic
SOM, U-Matrix, Markov models
1 Introduction
Unsupervised clustering by means of the self-organizing map (SOM) was first proposed
by Kohonen [21]. The SOM makes the exploration of high-dimensional data possible
and reveals the topological structure of the data. By SOM training, the data space is
mapped to a typically two-dimensional Euclidean grid of neurons, preferably in a
topology-preserving manner. Prominent applications of the SOM are WEBSOM for the
retrieval of text documents and PicSOM for the recovery and ordering of pictures [18,25].
Various alternatives and extensions to the standard SOM exist, such as statistical models,
growing networks, alternative lattice structures, or adaptive metrics [3,4,19,27,28,30,33].

Email address: {marc,hammer}@informatik.uni-osnabrueck.de
(Marc Strickert, Barbara Hammer).
Preprint submitted to Elsevier Science, 23 January 2004.
If temporal or spatial data are dealt with – like time series, language data, or DNA
strings – sequences of potentially unrestricted length constitute a natural domain for
data analysis and classification. Unfortunately, the temporal scope is unknown in
most cases, and therefore fixed vector dimensions, as used for the standard SOM, cannot
be applied. Several extensions of the SOM to sequences have been proposed; for
instance, time-window techniques or data representation by statistical features
make processing with standard methods possible [21,28]. Due to data selection or
preprocessing, information might get lost; for this reason, a data-driven adaptation
of the metric or the grid is strongly advisable [29,33,36]. The first widely used application
of the SOM in sequence processing employed the temporal trajectory of the
best matching units of a standard SOM in order to visualize speech signals and their
variations [20]. This approach, however, does not operate on sequences
as such; rather, the SOM is used for reducing the dimensionality of single sequence
entries and thus acts as a preprocessing mechanism. Proposed alternatives replace
the standard Euclidean metric with similarity operators on sequences, incorporating
autoregressive processes or time warping strategies [16,26,34]. These
methods are very powerful, but their computational cost is a major problem.
A fundamental approach to sequence processing is recursion. Supervised
recurrent networks constitute a well-established generalization of standard feedforward
networks to time series; many successful applications to different sequence
classification and regression tasks are known [12,24]. Recurrent unsupervised models
have also been proposed: the temporal Kohonen map (TKM) and the recurrent
SOM (RSOM) use the biologically plausible dynamics of leaky integrators [8,39],
as they occur in organisms, and explain phenomena such as direction selectivity
in the visual cortex [9]. Furthermore, these models have been applied with moderate
success to learning tasks [22]. Better results have been achieved by integrating them
into more complex systems [7,17]. More recent and more powerful approaches are
the recursive SOM (RecSOM) and the SOM for structured data (SOMSD) [10,41].
These are based on a richer and explicit representation of the temporal context: they
use the activation profile of the entire map or the index of the most recent winner.
As a result, their representational ability is superior to that of RSOM and TKM.

A proposal to put existing unsupervised recursive models into a taxonomy can be
found in [1,2]. The latter article identifies the entity 'time context' used by the models
as one of the main branches of the given taxonomy [2]. Although more general,
the models are still quite diverse, and the recent developments of [10,11,35] are
not included in the taxonomy. An earlier, simple, and elegant general description of
recurrent models with an explicit notion of context was introduced in [13,14].
This framework directly generalizes the dynamics of supervised recurrent networks
to unsupervised models, and it contains TKM, RecSOM, and SOMSD as special
cases. As pointed out in [15], the individual approaches differ with respect to the
notion of context and therefore yield different accuracies and computational
complexities, but their basic dynamic is the same. TKM is restricted by the
locality of its context representation, whereas RecSOM and SOMSD also include
global information. In that regard, SOMSD can be interpreted as a modification
of RecSOM, based on a compression of the RecSOM context model, that is
computationally less demanding. Alternative efficient compression schemes such
as the Merging SOM (MSOM) have recently been developed [37].
Here, we focus on the compact and flexible representation of temporal context
by linking the current winner neuron to the most recently presented sequence element:
a neuron's temporal context is given by an explicit back-reference to the best
matching unit of the past time step, representing the previously processed input as
the location of the last winning neuron in the map, as proposed in [10]. In comparison
to RecSOM, this yields a greatly reduced computation time: the context of
SOMSD is a low-dimensional (usually two-dimensional) vector, compared to an
N-dimensional vector for RecSOM, N being the number of neurons (usually, N is at
least 100). In addition, the explicit reference to the past winning unit allows elegant
extraction of temporal dependencies. We will show how Markov models can
easily be obtained from a trained map. This is possible not only for discrete input
symbols but also for real-valued sequence entries, by applying an adaptation
of standard U-Matrix methods [38] to recursive SOMs. We demonstrate the
faithful representation of several time series and Markov processes within SOMSD
in this article. However, SOMSD relies heavily on an adequate grid topology, because
the distance of context representations is measured within the grid structure.
It can be expected that low-dimensional regular lattices do not capture typical
characteristics of the space of time series. For this reason, we extend the SOMSD
approach to more general topologies, that is, to possibly non-Euclidean triangular
grid structures. In particular, we combine a hyperbolic grid with the last-winner-in-grid
temporal back-reference. Hyperbolic grid structures have been proposed and
successfully applied to document organization and retrieval [29,30]. Unlike rectangular
lattices, with their inherent power-law neighborhood growth, the HSOM implements
an exponential neighborhood growth. For discrete and real-valued time series we
evaluate the combination of hyperbolic lattices with the recurrent dynamics,
focusing on neuron specialization, activations, and weight distributions.
First, we present some recursive self-organizing map models introduced in the literature,
which use different notions of context. Then, we explain the SOM for structured
data (SOMSD) adapted to sequences in detail, and we extend the model to
arbitrary triangular grid structures. After that, we propose an algorithm to extract
Markov models from a trained map, and we show how this algorithm can be combined
with U-Matrix methods. Finally, we demonstrate the sequence representation
capabilities in experiments, using several discrete and real-valued benchmark series.
2 Unsupervised processing of sequences

Let input sequences be denoted by s = (s_1, ..., s_t) with entries s_i in an alphabet
Σ which is embedded in a real vector space R^n. The element s_1 denotes the most
recent entry of the sequence, and t is the sequence length. The set of sequences of
arbitrary length over symbols Σ is Σ*, and Σ^l is the space of sequences of length l.
Popular recursive sequence processing models are the temporal Kohonen map,
the recurrent SOM, the recursive SOM, and the SOM for structured data [8,11,39,41]. The
SOMSD was originally proposed for the more general case of tree structure
processing; here, only sequences, i.e. trees with a node fan-out of 1, are considered.
As for the standard SOM, a recursive neural map is given by a set of neurons n_1, ..., n_N.
The neurons are arranged on a grid, often a two-dimensional regular lattice.
All neurons are equipped with weights w_i ∈ R^n.
Two important ingredients have to be defined to specify self-organizing network
models: the data metric and the network update. The metric addresses the question of how
an appropriate distance can be defined to measure the similarity of possibly
sequential input signals to map units. For this purpose, the sequence entries
are compared with the weight parameters stored at the neurons. The set of input signals
for which a given neuron i is closest is called the receptive field of neuron i;
neuron i is the winner and representative for all signals within its receptive
field. In the following, we recall the distance computation for the standard
SOM and review several ways found in the literature to compute the distance
of a neuron from a sequential input. Apart from the metric, the update procedure or
learning rule by which neurons adapt to the input is essential. Commonly, Hebbian or
competitive (1) learning takes place, referring to the following scheme: the parameters
of the winner and its neighborhood within a given lattice structure are adapted
such that their response to the current signal is increased. Thereby, neighborhood
cooperation ensures a topologically faithful mapping.
The standard SOM relies on a simple winner-takes-all scheme and does not account
for the temporal structure of inputs. For a stimulus s_i ∈ R^n, the winner is the neuron n_j
for which the squared distance

  d_SOM(s_i, w_j) = ‖s_i − w_j‖²

is minimum, where ‖·‖ is the standard Euclidean metric. Training starts with
randomly initialized weights w_i and adapts the parameters iteratively as follows:
denote by n_{j0} the winning neuron for the input signal s_i, and assume a fixed
function nhd(n_j, n_k) which indicates the degree of neighborhood of the neurons n_j
and n_k within the chosen lattice structure. Adaptation of all weights w_j takes
place by the update rule

  Δw_j = ε · h_σ(nhd(n_{j0}, n_j)) · (s_i − w_j)

whereby ε ∈ (0, 1) is the learning rate. The function h_σ describes the amount of
neuron adaptation in the neighborhood of the winner: often the Gaussian bell function
h_σ(x) = exp(−x²/σ²) is chosen, whose shape is narrowed during training
by decreasing σ to ensure neuron specialization. The function nhd(n_j, n_k),
which measures the degree of neighborhood of the neurons n_j and n_k within the
lattice, might be induced by the simple Euclidean distance between the neuron
coordinates in a rectangular grid or by the shortest path in a graph connecting the
two neurons.

(1) We will use these two terms interchangeably in the following.
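The winner search and neighborhood update described above can be sketched as follows (a minimal illustration, not the authors' implementation; array shapes and parameter values are assumptions):

```python
import numpy as np

def som_step(weights, coords, s, eps=0.1, sigma=1.0):
    """One competitive update of a standard SOM.
    weights: (N, n) neuron weights; coords: (N, l) grid positions of the
    neurons; s: input signal in R^n. Returns the index of the winner."""
    # winner n_{j0}: neuron with minimum squared Euclidean distance d_SOM
    j0 = int(np.argmin(np.sum((weights - s) ** 2, axis=1)))
    # nhd(n_{j0}, n_j): Euclidean grid distance to the winner
    nhd = np.linalg.norm(coords - coords[j0], axis=1)
    # Gaussian bell h_sigma scales the correction of each neuron
    h = np.exp(-(nhd ** 2) / sigma ** 2)
    weights += eps * h[:, None] * (s - weights)
    return j0
```

Decreasing `sigma` over the training epochs narrows the bell and lets the neurons specialize, as described above.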
Recursive models replace the one-shot distance computation for a single entry
s_i with a recursive formula over all entries of a given sequence s. For all models,
sequences are presented recursively, and the current sequence entry s_i is processed
in the context set by its predecessors s_{i+1}, s_{i+2}, ... (2) The models differ
with respect to the representation of the context and in the way the context
influences further computation.
The Temporal Kohonen Map (TKM) computes the distance of s = (s_1, ..., s_t)
from neuron n_j labeled with w_j ∈ R^n by the leaky integration

  d_TKM(s, n_j) = Σ_{i=1..t} η(1 − η)^{i−1} ‖s_i − w_j‖²

where η ∈ (0, 1) is a memory parameter [8]. A neuron becomes the winner if the
current entry s_1 is close to its weight w_j, as in the standard SOM, and, in addition,
the remaining sum (1 − η)‖s_2 − w_j‖² + (1 − η)²‖s_3 − w_j‖² + ... is also small.
This additional term integrates the distances of the neuron's weight from previous
sequence entries, weighted by an exponentially decreasing decay factor (1 − η)^{i−1}.
The context resulting from previous sequence entries points towards neurons
whose weights have been close to previous entries. Thus, the winner is a
neuron whose weight is close to the average presented signal over the recent time
steps.
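The leaky-integrated distance can be written down directly from the sum above (a sketch under the paper's reverse indexing, with `seq[0]` playing the role of the most recent entry s_1):

```python
import numpy as np

def d_tkm(seq, w, eta=0.5):
    """Leaky integration d_TKM = sum_i eta*(1-eta)^(i-1)*||s_i - w||^2.
    seq[0] is the most recent entry s_1; the 0-based exponent i here
    equals the 1-based (i-1) of the formula in the text."""
    return sum(eta * (1 - eta) ** i * float(np.sum((np.asarray(s) - w) ** 2))
               for i, s in enumerate(seq))
```

For a one-entry sequence this reduces to η‖s_1 − w‖², matching the standard SOM term up to the factor η.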
Training of the TKM takes place by Hebbian learning in the same way as for
the standard SOM, making well-matching neurons more similar to the current input
than badly matching neurons. At the beginning, weights w_j are initialized randomly
and then iteratively adapted as data are presented. For adaptation, assume that a
sequence s is given, with s_i denoting the current entry and n_{j0} the best
matching neuron for this time step. Then the weight correction term is

  Δw_j = ε · h_σ(nhd(n_{j0}, n_j)) · (s_i − w_j)

(2) We use reverse indexing of the sequence entries, s_1 denoting the most recent
entry and s_2, s_3, ... its predecessors.
As discussed in [23], the learning rule of the TKM is unstable and leads to only
suboptimal results. More advanced, the Recurrent SOM (RSOM) leaky integration first
sums up the weighted directions and afterwards computes the distance [39]:

  d_RSOM(s, n_j) = ‖ Σ_{i=1..t} η(1 − η)^{i−1} (s_i − w_j) ‖²

It represents the context in a larger space than the TKM, since the vectors of directions
are stored instead of the scalar Euclidean distance. More importantly, the training
rule is changed. RSOM derives its learning rule directly from the objective of minimizing
the distortion error on sequences and thus adapts the weights towards the
vector of integrated directions:

  Δw_j = ε · h_σ(nhd(n_{j0}, n_j)) · y_j(i)

whereby

  y_j(i) = Σ_{i=1..t} η(1 − η)^{i−1} (s_i − w_j) .

Again, the already processed part of the sequence produces a notion of context, and
the winner for the current entry is the neuron whose weight is most
similar to the average entry over the past time steps. The training rule of RSOM takes
this fact into account by adapting the weights towards this averaged activation.
We will not refer to this learning rule in the following. Instead, the way in which
sequences are represented within these two models, and the ways to improve the
representational capabilities of such maps, will be of interest.
Assuming vanishing neighborhood influence σ for both TKM and RSOM,
one can analytically compute the internal representation of sequences for these two
models, i.e. the weights with response optimum to a given sequence s = (s_1, ..., s_t):
the optimum weight w is

  w = Σ_{i=1..t} (1 − η)^{i−1} s_i / Σ_{i=1..t} (1 − η)^{i−1}

as shown in [40]. This explains the encoding scheme of the winner-takes-all dynamics
of TKM and RSOM. Sequences are encoded in the weight space by providing a
recursive partitioning very much like the one generating fractal Cantor sets. As an
example illustrating this encoding scheme, assume that binary sequences {0, 1}^l
are dealt with. For η = 0.5, the representation of sequences of fixed length l corresponds
to an encoding in a Cantor set: the interval [0, 0.5) represents sequences
with most recent entry s_1 = 0, while the interval [0.5, 1) contains only codes of sequences
with most recent entry 1. Recursive decomposition of the intervals allows the recovery of
further entries of the sequence: [0, 0.25) stands for the beginning 00... of a
sequence, [0.25, 0.5) stands for 01, [0.5, 0.75) for 10, and [0.75, 1) represents 11.
By further subdivision, [0, 0.125) stands for the beginning 000..., [0.125, 0.25) for
001, and so on. Similar encodings can be found for alternative choices of η. Sequences
over discrete sets Σ = {0, ..., d} ⊂ R can be uniquely encoded using
this fractal partitioning if η < 1/d. For larger η, the subsets start to overlap, i.e.
codes are no longer sorted according to their last symbols, and a code might stand
for two or more different sequences. A very small η ≪ 1/d, in turn, results in an
only sparsely used space; for example, the interval (d · η, 1] does not contain a valid
code. Note that the explicit computation of this encoding stresses the superiority
of the RSOM learning rule over the TKM update, as pointed out in [40]: the
fractal code is a fixed point of the dynamics of RSOM training, whereas TKM
converges towards the borders of the intervals, preventing the optimum fractal
encoding scheme from developing on its own.
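The fixed-point weight formula makes these Cantor-style intervals easy to check numerically (a sketch of the formula above; `seq[0]` is the most recent entry s_1):

```python
def fractal_code(seq, eta=0.5):
    """Optimal-response weight w = sum (1-eta)^(i-1) s_i / sum (1-eta)^(i-1)
    for a fixed-length sequence, seq[0] being the most recent entry s_1."""
    num = sum((1 - eta) ** i * s for i, s in enumerate(seq))
    den = sum((1 - eta) ** i for i in range(len(seq)))
    return num / den
```

For η = 0.5 the resulting codes fall into the intervals listed in the text: e.g. the code of (0, 1) lands in [0.25, 0.5), that of (1, 0) in [0.5, 0.75), and that of (0, 0, 1) in [0.125, 0.25).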
Fractal encoding is reasonable but limited: it is obviously restricted to discrete
sequence entries, and real values or noise might destroy the encoded information.
Fractal codes do not differentiate between sequences of different length; e.g. the
code 0 gives optimum response to 0, 00, 000, and so forth. Sequences with this
kind of encoding cannot be distinguished. In addition, the number of neurons has
no influence on the expressiveness of the context space. The range in which
sequences are encoded is the same as the weight space. Thus, both the size of the
weight space and the computational accuracy limit the number of different
contexts, independently of the number of neurons in the network.
Based on these considerations, richer and in particular explicit representations of
context have been proposed. The models that we introduce in the following extend
the parameter space of each neuron j by an additional vector c_j, which is
used to explicitly store the sequential context within which a sequence entry is expected.
Depending on the model, the context c_j is contained in a representation
space of different dimensionality. However, in all cases this space is independent
of the weight space and extends the expressiveness of the models in comparison
to TKM and RSOM. For each model, we will define the basic ingredients: What is
the space of context representations? How is the distance between a sequence entry
and neuron j computed, taking into account its temporal context c_j? How are the
weights and contexts adapted?
The Recursive SOM (RecSOM) [41] equips each neuron n_j with a weight w_j ∈ R^n
that represents the given sequence entry, as usual. In addition, a vector c_j ∈ R^N
is provided, N denoting the number of neurons, which explicitly represents
the contextual map activation of all neurons in the previous time step. Thus, the
temporal context is represented in this model in an N-dimensional vector space.
One can think of the context as an explicit storage
of the activity profile of the whole map in the previous time step. More precisely,
the distance is recursively computed by

  d_RecSOM((s_1, ..., s_t), n_j) = η_1 ‖s_1 − w_j‖² + η_2 ‖C_RecSOM(s_2, ..., s_t) − c_j‖²

where η_1, η_2 > 0 and

  C_RecSOM(s) = (exp(−d_RecSOM(s, n_1)), ..., exp(−d_RecSOM(s, n_N)))

constitutes the context. Note that this vector is essentially the vector of distances of all
neurons computed in the previous time step, exponentially transformed
to avoid an explosion of the values. As before, the above distance can be decomposed
into two parts: the winner computation as in the standard SOM and, as in
the case of RSOM and TKM, a term which assesses the context match. For RecSOM
the context match is a comparison of the current context when processing
the sequence, i.e. the vector of distances of the previous time step, with the expected
context c_j which is stored at neuron j. That is to say, RecSOM explicitly stores a context
vector for each neuron and compares the recursively computed current contexts to these
expected contexts. Since the entire map activation is taken
into account, sequences of any given fixed length can be stored, provided enough neurons
are available. Thus, the representation space for context is no longer restricted by
the weight space, and its capacity now scales with the number of neurons.
For RecSOM, training is done in Hebbian style for both weights and contexts. Denote
by n_{j0} the winner for sequence entry i; then the weight change is

  Δw_j = ε · h_σ(nhd(n_{j0}, n_j)) · (s_i − w_j)

and the context adaptation is

  Δc_j = ε′ · h_σ(nhd(n_{j0}, n_j)) · (C_RecSOM(s_{i+1}, ..., s_t) − c_j)

The latter update rule ensures that the context vectors of the winner neuron
and its neighborhood become more similar to the current context vector C_RecSOM,
which is computed while the sequence is processed. The learning rates are ε, ε′ ∈
(0, 1). As demonstrated in [41], this richer representation of context allows a better
quantization of time series data. In [41], various quantitative measures to evaluate
trained recursive maps are proposed, such as the temporal quantization error and
the specialization of neurons. RecSOM turns out to be clearly superior to TKM and
RSOM with respect to these measures in the experiments reported in [41].
However, the dimensionality of the context for RecSOM equals the number of neurons
N, making this approach computationally quite costly. The training of very
large maps with several thousands of neurons is no longer feasible for RecSOM.
Another drawback is the exponential activity transfer function in
C_RecSOM ∈ R^N: specialized neurons are characterized by having
only one or a few well-matching predecessors, contributing values of about 1 to
C_RecSOM; however, for a large number N of neurons, the noise contributed to C_RecSOM
by the other neurons destroys the valid context information, because even poorly
matching neurons – contributing values slightly above 0 – are summed up in the
distance computation.
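The recursive RecSOM distance can be evaluated iteratively over the sequence (a sketch; the initial context for the empty suffix is assumed to be the zero vector, which the text leaves open):

```python
import numpy as np

def recsom_distances(seq, W, C, eta1=1.0, eta2=1.0):
    """Distance vector d_RecSOM for all N neurons after processing seq,
    given oldest entry first. W: (N, n) weights; C: (N, N) stored contexts.
    The initial context is assumed to be the zero vector."""
    N = W.shape[0]
    ctx = np.zeros(N)                      # C_RecSOM of the empty suffix
    d = np.zeros(N)
    for s in seq:
        d = (eta1 * np.sum((W - s) ** 2, axis=1)
             + eta2 * np.sum((C - ctx) ** 2, axis=1))
        ctx = np.exp(-d)                   # exponentially transformed distances
    return d
```

Note that each step costs O(N²) because of the N-dimensional context comparison, which is exactly the overhead criticized above.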
The SOM for structured data (SOMSD), as proposed in [10,11], is an efficient and still
powerful alternative. SOMSD represents the temporal context by the
winner location in the previous time step. Assume that a regular l-dimensional lattice
of neurons is given. Each neuron n_j is equipped with a weight w_j ∈ R^n and a
vector c_j ∈ R^l which represents a compressed version of the context, namely the location
of the previous winner within the map [10]. The space in which context vectors
are represented is thus the vector space R^l. The distance of a sequence
s = (s_1, ..., s_t) from neuron n_j is recursively computed by

  d_SOMSD((s_1, ..., s_t), n_j) = η_1 ‖s_1 − w_j‖² + η_2 ‖C_SOMSD(s_2, ..., s_t) − c_j‖²

where C_SOMSD(s) equals the location in the grid topology of the neuron n_j with
smallest d_SOMSD(s, n_j). Note that the context C_SOMSD is an element of a low-dimensional
vector space, usually only R². The distance between contexts is given by the Euclidean
metric within this vector space. The learning dynamic of SOMSD is very similar
to that of RecSOM: the current distance is defined as a mixture of two
terms, the match of the neuron's weight with the current sequence entry, and the
match of the neuron's context vector with the context currently computed in the
model. Thereby, the current context is represented by the location of the winning
neuron of the map in the previous time step. This dynamic imposes a temporal bias
towards those neurons whose context vector matches the winner location of the previous
time step. It relies on the fact that a lattice structure of neurons and a distance
measure of locations within the map are defined.
Due to the compressed context information, this approach is very efficient in comparison
to RecSOM, and very large maps can also be trained. In addition, noise
is suppressed in this compact representation. Still, more complex context information
is used than for TKM and RSOM, namely the location of the previous
winner in the map. As for RecSOM, Hebbian learning takes place for SOMSD:
weight vectors and contexts are adapted in the familiar correction manner,
here by the formulas

  Δw_j = ε · h_σ(nhd(n_{j0}, n_j)) · (s_i − w_j)

and

  Δc_j = ε′ · h_σ(nhd(n_{j0}, n_j)) · (C_SOMSD(s_{i+1}, ..., s_t) − c_j)

with learning rates ε, ε′ ∈ (0, 1); n_{j0} denotes the winner for sequence entry i.
As demonstrated in [11], a generalization of this approach to tree structures can
reliably model structured objects and their respective topological ordering.
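The SOMSD recursion can be sketched in the same iterative style (illustrative only; the start context for the empty suffix is assumed to be the grid center, which the text does not fix):

```python
import numpy as np

def somsd_winner(seq, W, C, coords, eta1=1.0, eta2=1.0):
    """Process seq (oldest entry first) and return the final winner index.
    W: (N, n) weights; C: (N, l) context vectors; coords: (N, l) grid
    locations. The start context is assumed to be the grid center."""
    ctx = coords.mean(axis=0)
    j0 = 0
    for s in seq:
        d = (eta1 * np.sum((W - s) ** 2, axis=1)
             + eta2 * np.sum((C - ctx) ** 2, axis=1))
        j0 = int(np.argmin(d))
        ctx = coords[j0]        # C_SOMSD: location of the current winner
    return j0
```

In contrast to RecSOM, the per-step context comparison costs only O(N·l) with l usually 2, which is the efficiency gain discussed above.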
We would like to point out that, although these approaches seem different, they
constitute instances of the same recursive computation scheme. As proved in [14],
the underlying recursive update dynamics comply with

  d((s_1, ..., s_t), n_j) = η_1 ‖s_1 − w_j‖² + η_2 ‖C(s_2, ..., s_t) − c_j‖²

in all cases. The model-specific similarity measures for weights and contexts are denoted
by the generic ‖·‖ expression. The approaches differ with respect to the
concrete choice of the context C: TKM and RSOM refer only to the neuron itself
and are therefore restricted to local fractal codes within the weight space; RecSOM
uses the whole map activation, which is powerful but also expensive and subject
to random neuron activations; SOMSD relies on compressed information, the location
of the winner. Note that standard supervised recurrent networks can also be
put into this generic dynamic framework by choosing the context as the output of
the sigmoidal transfer function [14]. In addition, alternative compression schemes,
such as a representation of the context by the winner content, are possible [37].
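The shared dynamic can be made explicit by leaving the context computation pluggable (a sketch of the generic formula, not code from [14]; the callable `context_fn` is a hypothetical stand-in for the model-specific C):

```python
import numpy as np

def unified_distance(seq, w_j, c_j, context_fn, eta1=1.0, eta2=1.0):
    """Generic dynamic d((s_1..s_t), n_j) = eta1*||s_1 - w_j||^2
       + eta2*||C(s_2..s_t) - c_j||^2, with seq[0] = s_1 (most recent).
    context_fn maps the suffix to its model-specific representation;
    different choices recover TKM-, RecSOM-, or SOMSD-style contexts."""
    s1, suffix = np.asarray(seq[0]), seq[1:]
    ctx = np.asarray(context_fn(suffix))
    return (eta1 * float(np.sum((s1 - w_j) ** 2))
            + eta2 * float(np.sum((ctx - c_j) ** 2)))
```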
To summarize this section, essentially four different models have been proposed<br />
for processing temporal information. The models are characterized by the way in<br />
which context is taken into account within the map. The models are:<br />
Standard SOM: no context representation; standard distance computation; standard<br />
competitive learning.<br />
TKM and RSOM: no explicit context representation; the distance computation<br />
recursively refers to the distance <strong>of</strong> the previous time step; competitive learning<br />
for the weight whereby (for RSOM) the averaged signal is used.<br />
RecSOM: explicit context representation as the N-dimensional activity profile of the
previous time step; the distance is computed as a mixture of the current match and
the match between the context stored at the neuron and the (recursively computed)
current context of the processed time series; competitive learning adapts the weight
and context vectors.
SOMSD: explicit context representation as low-dimensional vector, the location<br />
<strong>of</strong> the previously winning neuron in the map; the distance is computed recursively<br />
the same way as for RecSOM, whereby a distance measure for locations<br />
in the map has to be provided; so far, the model is only available for standard<br />
rectangular Euclidean lattices; competitive learning adapts the weight and context<br />
vectors, whereby the context vectors are embedded in the Euclidean space.<br />
In the following, we focus on the context representation by the winner index, as<br />
proposed in SOMSD. This scheme <strong>of</strong>fers a compact and efficient context representation.<br />
However, it relies heavily on the neighborhood structure <strong>of</strong> the neurons,<br />
and faithful topological ordering is essential for appropriate processing. Since for
sequential data, like words in Σ*, the number of possible strings is an exponential
function of their length, a Euclidean target grid with inherent power-law
neighborhood growth is not suited for a topology-preserving representation. The
reason for this is that the storage <strong>of</strong> temporal data is related to the representation<br />
<strong>of</strong> trajectories on the neural grid. String processing means beginning at a node that<br />
represents the start symbol; then, how many nodes n s can in the ideal case uniquely<br />
be reached in a fixed number s of steps? In grids with 6 neighbors per neuron, the
triangular tessellation of the Euclidean plane leads to a hexagonal superstructure,
inducing the surprising answer of n_s = 6 for any choice of s > 0. Providing 7
neighbors per neuron yields the exponential branching n_s = 7 · 2^(s−1) of paths.
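The two counts can be transcribed directly; the closed forms are exactly those stated above, only the function name is ours:

```python
def unique_paths(neighbors, s):
    """Number of nodes uniquely reachable in s > 0 steps, per the growth
    argument above: a 6-neighbor (Euclidean triangular) grid gives the
    constant 6; k >= 7 neighbors (hyperbolic case) give k * 2**(s-1)."""
    if neighbors == 6:
        return 6
    return neighbors * 2 ** (s - 1)
```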
In this respect, it is interesting to note that RecSOM can also be combined with<br />
alternative lattice structures; in [41] a comparison is presented <strong>of</strong> RecSOM with a<br />
standard rectangular topology and a data optimum topology provided by neural gas<br />
(NG) [27,28]. The latter clearly leads to superior results. Unfortunately, it is not<br />
possible to combine the optimum topology <strong>of</strong> NG with SOMSD: for NG, no grid<br />
with straightforward neuron indexing exists. Therefore, context cannot be defined<br />
easily by referring back to the previous winner, because no similarity measure is<br />
available for indices <strong>of</strong> neurons within a grid topology.<br />
Here, we extend SOMSD to grid structures with triangular grid connectivity in<br />
order to obtain a larger flexibility for the lattice design. Apart from the standard<br />
Euclidean plane, the sphere and the hyperbolic plane are alternative popular twodimensional<br />
manifolds. They differ from the Euclidean plane with respect to their<br />
curvature: the Euclidean plane is flat, whereas the hyperbolic space has negative<br />
curvature, and the sphere is curved positively. By computing the Euler characteristics<br />
<strong>of</strong> all compact connected surfaces, it can be shown that only seven have nonnegative<br />
curvature, implying that all but seven are locally isometric to the hyperbolic<br />
plane, which makes the study <strong>of</strong> hyperbolic spaces particularly interesting. 3<br />
The curvature has consequences on regular tessellations <strong>of</strong> the referred manifolds as<br />
pointed out in [30]: the number <strong>of</strong> neighbors <strong>of</strong> a grid point in a regular tessellation<br />
<strong>of</strong> the Euclidean plane follows a power law, whereas the hyperbolic plane allows<br />
an exponential increase <strong>of</strong> the number <strong>of</strong> neighbors. The sphere yields compact<br />
lattices with vanishing neighborhoods, whereby a regular tessellation for which all<br />
vertices have the same number <strong>of</strong> neighbors is impossible (with the uninteresting<br />
exception <strong>of</strong> an approximation by one <strong>of</strong> the 5 Platonic solids). Since all these<br />
surfaces constitute two-dimensional manifolds, they can be approximated locally<br />
within a cell <strong>of</strong> the tessellation by a subset <strong>of</strong> the standard Euclidean plane without<br />
3 For an excellent tool box and introduction to hyperbolic geometry see e.g.<br />
http://www.geom.uiuc.edu/docs/forum/hype/hype.html<br />
too much contortion. A global isometric embedding, however, is not possible in<br />
general. Interestingly, for all such tessellations a data similarity measure is defined<br />
and possibly non-isometric visualization in the 2D plane can be achieved. While 6<br />
neighbors per neuron lead to standard Euclidean triangular meshes, for a grid with<br />
7 neighbors or more, the graph becomes part <strong>of</strong> the 2-dimensional hyperbolic plane.<br />
As already mentioned, exponential neighborhood growth is possible and hence an<br />
adequate data representation can be expected for the visualization <strong>of</strong> domains with<br />
a high connectivity <strong>of</strong> the involved objects. SOM with hyperbolic neighborhood<br />
(HSOM) has already proved well-suited for text representation as demonstrated for<br />
a non-recursive model in [29].<br />
3 SOM for sequences (SOM-S)<br />
In the following, we introduce the adaptation <strong>of</strong> SOMSD for sequences and the<br />
general triangular grid structure, SOM for sequences (SOM-S). Standard SOMs<br />
operate on a rectangular neuron grid embedded in a real-valued vector space. More<br />
flexibility for the topological setup can be obtained by describing the grid in terms<br />
<strong>of</strong> a graph: neural connections are realized by assigning each neuron a set <strong>of</strong> direct<br />
neighbors. The distance <strong>of</strong> two neurons is given by the length <strong>of</strong> a shortest path<br />
within the lattice <strong>of</strong> neurons. Each edge is assigned the unit length 1. The number <strong>of</strong><br />
neighbors might vary (also within a single map). Fewer than 6 neighbors per neuron
lead to a subsiding neighborhood, resulting in graphs with small numbers of nodes.
Choosing more than 6 neighbors per neuron yields, as argued above, an exponential<br />
increase <strong>of</strong> the neighborhood size, which is convenient for representing sequences<br />
with potentially exponential context diversification.<br />
Unlike standard SOM or HSOM, we do not assume that a distance preserving embedding<br />
<strong>of</strong> the lattice into the two dimensional plane or another globally parameterized<br />
two-dimensional manifold with global metric structure, such as the hyperbolic<br />
plane, exists. Rather, we assume that the distance <strong>of</strong> neurons within the grid<br />
is computed directly on the neighborhood graph, which might be obtained by any<br />
non-overlapping triangulation <strong>of</strong> the topological two-dimensional plane. 4 For our<br />
experiments, we have implemented a grid generator for a circular triangle meshing<br />
around a center neuron, which requires the desired number <strong>of</strong> neurons and the<br />
neighborhood degree n as parameters. Neurons at the lattice edge possess less than<br />
n neighbors, and if the chosen total number <strong>of</strong> neurons does not lead to filling up<br />
the outer neuron circle, neurons there are connected to others in a maximum symmetric<br />
way. Figure 1 shows a small map with 7 neighbors for the inner neurons,<br />
and a total <strong>of</strong> 29 neurons perfectly filling up the outer edge. For ≥ 7 neighbors, the<br />
exponential neighborhood increase can be observed, for which an embedding into<br />
4 Since the lattice is fixed during training, these values have to be computed only once.<br />
Fig. 1. Hyperbolic self organizing map with context. Neuron n refers to the context given
by the winner location in the map, indicated by the triangle of neurons N_1, N_2, and N_3,
and the precise coordinates β_12, β_13. If the previous winner has been D_2, adaptation of the
context along the dotted line takes place.
the Euclidean plane is not possible without contortions; however, local projections<br />
in terms <strong>of</strong> a fish eye magnification focus can be obtained (cf. [29]).<br />
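Since grid distances are shortest-path lengths on the neighborhood graph with unit edge lengths, they can be precomputed once by breadth-first search. A sketch assuming a generic adjacency list with integer neuron indices; the circular mesh generator itself is not reproduced here:

```python
from collections import deque

def lattice_distances(adjacency):
    """All-pairs shortest-path lengths on the neuron graph (unit edges),
    computed once, since the lattice is fixed during training.
    adjacency: dict mapping neuron index 0..N-1 -> iterable of neighbors."""
    n = len(adjacency)
    dist = [[float('inf')] * n for _ in range(n)]
    for start in adjacency:
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start][v] == float('inf'):
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist
```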
SOMSD adapts the location <strong>of</strong> the expected previous winner during training. For<br />
this purpose, we have to embed the triangular mesh structure into a continuous<br />
space. We achieve this by computing lattice distances beforehand, and then we approximate<br />
the distance <strong>of</strong> points within a triangle shaped map patch by the standard<br />
Euclidean distance. Thus, positions in the lattice are represented by three neuron<br />
indices which represent the selected triangle <strong>of</strong> adjacent neurons, and two real numbers<br />
which represent the position within the triangle. The recursive nature <strong>of</strong> the<br />
shown map is illustrated exemplarily in figure 1 for neuron n. This neuron n is<br />
equipped with a weight w ∈ R^n and a context c that is given by a location within
the triangle of neurons N_1, N_2, and N_3, expressing corner affinities by means of
the linear combination parameters β_12 and β_13. The distance of a sequence s from
neuron n is recursively computed by<br />
d_SOM-S((s_1, …, s_t), n) = η ‖s_1 − w‖² + (1 − η) · g(C_SOM-S(s_2, …, s_t), c).
C SOM-S (s) is the index <strong>of</strong> the neuron n j in the grid with smallest distance d SOM-S (s, n j ).<br />
g measures the grid distance of the triangular position c_j = (N_1, N_2, N_3, β_12, β_13)
to the winner as the shortest possible path in the mesh structure. Grid distances<br />
between neighboring neurons possess unit length, and the metric structure within<br />
the triangle N 1 , N 2 , N 3 is approximated by the Euclidean metric. The range <strong>of</strong> g<br />
is normalized by scaling with the inverse maximum grid distance. This mixture <strong>of</strong><br />
hyperbolic grid distance and Euclidean distance is valid, because the hyperbolic<br />
space can locally be approximated by Euclidean space, which is applied for computational<br />
convenience to both distance calculation and update.<br />
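The recursive winner computation can be sketched as follows. To keep the sketch short, the context stored at each neuron is collapsed from a triangle position (N_1, N_2, N_3, β_12, β_13) to a single neuron index, and the normalization of g is assumed to be folded into the supplied distance matrix, so this is an illustrative simplification rather than the full SOM-S:

```python
import numpy as np

def som_s_winners(seq, W, ctx_pos, grid_dist, eta):
    """Winner sequence under the recursive SOM-S distance (sketch).
    seq: (T, dim) sequence, oldest entry first; W: (N, dim) weights;
    ctx_pos: (N,) expected previous-winner index stored at each neuron
    (simplified from the triangle representation); grid_dist: (N, N)
    lattice distances, assumed already normalized; eta in (0, 1]."""
    winners = []
    prev = None
    for s in seq:
        d = eta * np.sum((s - W) ** 2, axis=1)        # η ‖s − w‖² term
        if prev is not None:
            # g(...) compares each neuron's stored context to the
            # location of the actual previous winner
            d = d + (1 - eta) * grid_dist[ctx_pos, prev]
        prev = int(np.argmin(d))
        winners.append(prev)
    return winners
```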
Training is carried out by presenting a pattern s = (s 1 , . . . , s t ), determining the<br />
winner n j0 , and updating the weight and the context. Adaptation affects all neurons<br />
on the breadth first search graph around the winning neuron according to their<br />
grid distances in a Hebbian style. Hence, for the sequence entry s_i, weight w_j is
updated by Δw_j = ε · h_σ(nhd(n_j0, n_j)) · (s_i − w_j). The learning rate ε is typically
exponentially decreased during training; as above, h_σ(nhd(n_j0, n_j)) describes the
influence of the winner n_j0 on the current neuron n_j as a decreasing function of
grid distance. The context update is analogous: the current context, expressed in<br />
terms <strong>of</strong> neuron triangle corners and coordinates, is moved towards the previous<br />
winner along a shortest path. This adaptation yields positions on the grid only.<br />
Intermediate positions can be achieved by interpolation: if two neurons N i and N j<br />
exist in the triangle with the same distance, the midway is taken for the flat grids<br />
obtained by our grid generator. This explains why the update path, depicted as the<br />
dotted line in figure 1, for the current context towards D 2 is via D 1 . Since the grid<br />
distances are stored in a static matrix, a fast calculation <strong>of</strong> shortest path lengths is<br />
possible. The parameter η in the recursive distance calculations controls the balance<br />
between pattern and context influence; since initially nothing is known about the<br />
temporal structure, this parameter starts at 1, thus indicating the absence <strong>of</strong> context,<br />
and resulting in standard SOM. During training it is decreased to an application<br />
dependent value that mediates the balance between the externally presented pattern<br />
and the internally gained model about historic contexts.<br />
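The parameter schedules described above might be sketched like this. The text only says ε is "exponentially decreased" and η "decreased", so the linear course of η and the concrete end values (borrowed from the experiments in Section 5) are illustrative assumptions:

```python
import numpy as np

def schedules(steps, eps0=0.1, eps1=0.005, eta0=1.0, eta1=0.97):
    """Annealing schedules (sketch): the learning rate decays exponentially
    from eps0 to eps1; eta starts at 1 (no context: standard SOM) and is
    lowered toward an application-dependent value. The linear eta course
    and the defaults are assumptions, not prescribed by the paper."""
    t = np.arange(steps) / max(steps - 1, 1)
    eps = eps0 * (eps1 / eps0) ** t      # exponential decay from eps0 to eps1
    eta = eta0 + (eta1 - eta0) * t       # linear decrease of eta (assumed)
    return eps, eta
```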
Thus, we can combine the flexibility <strong>of</strong> general triangular and possibly hyperbolic<br />
lattice structures with the efficient context representation as proposed in [11].<br />
4 Evaluation measures <strong>of</strong> SOM<br />
Popular methods to evaluate the standard SOM are the visual inspection, the identification<br />
<strong>of</strong> meaningful clusters, the quantization error, and measures for topological<br />
ordering <strong>of</strong> the map. For recursive self organizing maps, an additional dimension<br />
arises: the temporal dynamic stored in the context representations <strong>of</strong> the map.<br />
4.1 Temporal quantization error<br />
Using ideas <strong>of</strong> Voegtlin [41] we introduce a method to assess the implicit representation<br />
<strong>of</strong> temporal dependencies in the map, and to evaluate to which amount<br />
faithful representation <strong>of</strong> the temporal data takes place. The general quantization<br />
error refers to the distortion <strong>of</strong> each map unit with respect to its receptive field,<br />
which measures the extent <strong>of</strong> data space coverage by the units. If temporal data are<br />
considered, the distortion needs to be assessed back in time. For a formal definition,<br />
assume that a time series (s 1 , s 2 , . . . , s t , . . .) is presented to the network, again<br />
with reverse indexing notation, i.e. s 1 is the most recent entry <strong>of</strong> the time series. Let<br />
win i denote all time steps for which neuron i becomes the winner in the considered<br />
recursive map model. The mean activation <strong>of</strong> neuron i for time step t in the past is<br />
the value

A_i(t) = Σ_{j ∈ win_i} s_{j+t} / |win_i| .
Assume that neuron i becomes winner for a sequence entry s_j. It can then be expected
that, as for the standard SOM, s_j is close to the average A_i(0), because the map
is trained with Hebbian learning. Temporal specialization takes place if, in addition,
s j+t is close to the average A i (t) for t > 0. The temporal quantization error <strong>of</strong><br />
neuron i at time step t back in the past is defined by<br />
E_i(t) = ( Σ_{j ∈ win_i} ‖s_{j+t} − A_i(t)‖² )^{1/2} .
This measures the extent up to which the values observed t time steps back in the<br />
past coincide with a winning neuron. Temporal specialization <strong>of</strong> neuron i takes<br />
place if E i (t) is small for t > 0. Since no temporal context is learned for the<br />
standard SOM, the temporal quantization will be large for t > 0, just reflecting<br />
specifics of the underlying time series such as smoothness or periodicity. For recursive
models, this quantity allows one to assess the amount of temporal specialization.
The temporal quantization error <strong>of</strong> the entire map for t time steps back into the past<br />
is defined as the average<br />
E(t) = Σ_{i=1}^{N} E_i(t) / N .
This method allows one to evaluate whether the temporal dynamic in the recent past is
faithfully represented.
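A direct transcription of A_i(t), E_i(t), and E(t) could look as follows; neurons that never win contribute zero to the map average, a boundary choice the text leaves open:

```python
import numpy as np

def temporal_quantization_error(seq, winners, n_neurons, max_t):
    """A_i(t): mean of the entries t steps back over all steps where neuron i
    won; E_i(t): root of the summed squared deviations from that mean;
    E(t): average over all N neurons. seq follows the paper's reverse
    indexing, so seq[j + t] lies t steps in the past relative to seq[j];
    steps whose history would leave the series are skipped."""
    T = len(seq)
    E = np.zeros(max_t + 1)
    for t in range(max_t + 1):
        errs = []
        for i in range(n_neurons):
            js = [j for j in range(T - t) if winners[j] == i]
            if not js:
                errs.append(0.0)   # never-winning neuron: zero contribution (assumed)
                continue
            past = np.array([seq[j + t] for j in js], dtype=float)
            A_it = past.mean(axis=0)                    # A_i(t)
            errs.append(np.sqrt(np.sum((past - A_it) ** 2)))  # E_i(t)
        E[t] = np.mean(errs)                            # E(t) = Σ_i E_i(t) / N
    return E
```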
4.2 Temporal models<br />
After the training of a recursive map, it can be used to obtain an explicit, possibly
approximate description of the underlying global temporal dynamics. This offers
another possibility to evaluate the dynamics <strong>of</strong> SOM because we can compare the<br />
extracted temporal model to the original one, if available, or a temporal model<br />
extracted directly from the data. In addition, a compressed description <strong>of</strong> the global<br />
dynamics extracted from a trained SOM is interesting for data mining tasks. In<br />
particular, it can be tested whether clustering properties <strong>of</strong> SOM, referred to by<br />
U-matrix methods, transfer to the temporal domain.<br />
Markov models constitute simple, though powerful techniques for sequence processing<br />
and analysis [6,32]. Assume that Σ = {a 1 , . . . , a d } is a finite alphabet. The<br />
prediction <strong>of</strong> the next symbol refers to the task to anticipate the probability <strong>of</strong> a i<br />
having observed a sequence s = (s 1 , . . . , s t ) ∈ Σ ∗ before. This is just the conditional<br />
probability P (a i |s). For finite Markov models, a finite memory length l is<br />
sufficient to determine this probability, i.e. the probability<br />
P(a_i | (s_1, …, s_l, …, s_t)) = P(a_i | (s_1, …, s_l)) ,   (t ≥ l)
depends only on the past l symbols instead <strong>of</strong> the whole context (s 1 , . . . , s t ). Markov<br />
models can be estimated from given data if the order l is fixed. It holds that<br />
P(a_i | (s_1, …, s_l)) = P((a_i, s_1, …, s_l)) / Σ_j P((a_j, s_1, …, s_l))    (1)
which means that the next symbol probability can be estimated from the frequencies<br />
<strong>of</strong> (l + 1)-grams.<br />
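Estimating the transition probabilities from (l + 1)-gram frequencies, as in equation (1), takes only a few lines. The sequence is given here in ordinary forward order, so the context is the l symbols preceding the predicted one:

```python
from collections import Counter

def markov_transitions(seq, l):
    """Estimate next-symbol probabilities P(a | context) from (l+1)-gram
    frequencies, per equation (1). seq: string or list of symbols in
    forward order; returns a dict {(context_tuple, symbol): probability}."""
    counts = Counter((tuple(seq[i:i + l]), seq[i + l])
                     for i in range(len(seq) - l))
    totals = Counter()
    for (ctx, sym), c in counts.items():
        totals[ctx] += c               # denominator: all (l+1)-grams sharing ctx
    return {(ctx, sym): c / totals[ctx] for (ctx, sym), c in counts.items()}
```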
We are interested in the question whether a trained SOM-S can capture the essential<br />
probabilities for predicting the next symbol, generated by simple Markov<br />
models. For this purpose, we train maps on Markov models and afterwards extract<br />
the transition probabilities entirely from the obtained maps. This extraction can be<br />
done because <strong>of</strong> the specific form <strong>of</strong> context for SOM-S. Given a finite alphabet<br />
Σ = {a 1 , . . . , a d } for training, most neurons specialize during training and become<br />
winners for one or more stimuli. Winner neurons represent the input sequence
entries by their trained weight vectors. Usually, the weight w_i of neuron
n i is very close to a symbol a j <strong>of</strong> Σ and can thus be identified with the symbol.<br />
In addition, the neurons represent their context by an explicit reference to the location<br />
<strong>of</strong> the winner in the previous time step. The context vectors stored in the<br />
neurons define an intermediate winning position in the map encoded by the parameters<br />
(N_1, N_2, N_3, β_12, β_13) for the closest three neurons and the exact position
within the triangle. We take this into account for extracting sequences corresponding<br />
to the averaged weights <strong>of</strong> all three potential winners <strong>of</strong> the previous time step.<br />
For the averaging, the contribution <strong>of</strong> each neuron to the interpolated position is<br />
considered. Repeating this back-referencing procedure recursively for each winner,
weighted by its influence, yields an exponentially spreading number of potentially
infinite time series for each neuron. This way, we obtain a probability distribution
over time series that is representative for the history <strong>of</strong> each map neuron. 5<br />
5 Interestingly, one can formally prove that every finite-length Markov model can in principle
be approximated by some map in this way, i.e. for every Markov model of length l
a map exists such that the above extraction procedure yields the original model up to small
deviations. Assume a fixed length l and rational probabilities P(a_i | (s_1, …, s_l)), and denote
by q the smallest common denominator of the transition probabilities.
The number <strong>of</strong> specialized neurons for each time series is correlated to the probability<br />
<strong>of</strong> these stimuli in the original data source. Therefore, we can simply take the<br />
mean <strong>of</strong> the probabilities for all neurons and obtain a global distribution over all<br />
histories which are represented in the map. Since the standard SOM has a magnification
factor different from 1, the number of neurons which represent a symbol a_i deviates
from the probability of a_i in the given data [31]. This leads to a slightly biased
estimation <strong>of</strong> the sequence probabilities represented by the map. Nevertheless, we<br />
will use the above extraction procedure as a sufficiently close approximation to the<br />
true underlying distribution. This compromise is taken, because the magnification<br />
factor for recurrent SOMs is not known and techniques from [31] for its computation<br />
cannot be transferred to recurrent models. Our experiments confirm that the<br />
global trend is still correct. We have extracted for every finite memory length l the<br />
probability distribution for words in Σ l+1 as they are represented in the map and<br />
determined the transition probabilities <strong>of</strong> equation 1.<br />
The method as described above is a valuable tool to evaluate the representation<br />
capacity <strong>of</strong> SOM for temporal structures. Obviously, fixed order Markov models<br />
can be better extracted directly from the given data, avoiding problems such as the<br />
magnification factor <strong>of</strong> SOM. Hence, this method just serves as an alternative for<br />
the evaluation <strong>of</strong> temporal self-organizing maps and their capability <strong>of</strong> representing<br />
temporal dynamics. The situation is different if real-valued elements are processed,<br />
like in the case <strong>of</strong> obtaining symbolic structure from noisy sequences. Then, a reasonable<br />
quantization <strong>of</strong> the sequence entries must be found before a Markov model<br />
can be extracted from the data. The standard SOM together with U-matrix methods<br />
provides a valuable tool to find meaningful clusters in a given set <strong>of</strong> continuous<br />
data. It is an interesting question whether this property transfers to the temporal<br />
domain, i.e. whether meaningful clusters <strong>of</strong> real-valued sequence entries can also<br />
be extracted from a trained recursive model. SOM-S allows one to combine reliable
quantization of the sequence entries with the extraction mechanism for Markov
models, taking into account the temporal structure of the data.
For the extraction we extend U-Matrix methods to recursive models as follows [38]:<br />
the standard U-Matrix assigns to each neuron the averaged distance <strong>of</strong> its weight<br />
vector compared to its direct lattice neighbors:<br />
U(n_i) = Σ_{nhd(n_i, n_j) = 1} ‖w_i − w_j‖
Consider a map in which, for each symbol a_i, a cluster of neurons with weights w_j = a_i
exists. These main clusters are divided into subclusters enumerated by s = (s_1, …, s_l) ∈ Σ^l,
with q · P(a_i | s) neurons for each possible s. The context of each such neuron refers to
another neuron within the cluster belonging to s_1 and to a subcluster belonging to
(s_2, …, s_l, s_{l+1}) for some arbitrary s_{l+1}. Note that the clusters can thereby be chosen
contiguous on the map, respecting the topological ordering of the neurons. The extraction
mechanism then recovers the original Markov model (with rational probabilities) from this map.
In a trained map, neurons spread in regions <strong>of</strong> the data space where a high sample<br />
density can be observed, resulting in large U-values at borders between clusters.<br />
Consequently, the U-Matrix forms a 3D landscape on the lattice <strong>of</strong> neurons with<br />
valleys corresponding to meaningful clusters and hills at the cluster borders. The<br />
U-Matrix <strong>of</strong> weight vectors can be constructed also for SOM-S. Based on this matrix,<br />
the sequence entries can be clustered into meaningful categories, based on<br />
which the extraction <strong>of</strong> Markov models as described above is possible. Note that<br />
the U-Matrix is built by using the weights assigned to the neurons only, while the<br />
context information <strong>of</strong> SOM-S is yet ignored. 6 However, since context information<br />
is used for training, clusters emerge which are meaningful with respect to the<br />
temporal structure, and this way they contribute implicitly to the topological ordering<br />
<strong>of</strong> the map and to the U-Matrix. Partially overlapping, noisy, and ambiguous<br />
input elements are separated during the training, because the different temporal<br />
contexts contain enough information to activate and produce characteristic clusters<br />
on the map. Thus, the temporal structure captured by the training allows a reliable<br />
reconstruction <strong>of</strong> the input sequences, which could not have been achieved by the<br />
standard SOM architecture.<br />
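The U-Matrix computation on an arbitrary neighborhood graph is straightforward. Following the prose, this sketch averages over the direct lattice neighbors; the displayed formula is the plain sum, which differs only by the neighbor count:

```python
import numpy as np

def u_matrix(W, adjacency):
    """U-value per neuron: averaged distance of its weight vector to the
    weights of its direct lattice neighbors; context vectors are ignored,
    as in the text. adjacency: dict index -> list of neighbor indices."""
    return {i: float(np.mean([np.linalg.norm(W[i] - W[j]) for j in nbrs]))
            for i, nbrs in adjacency.items()}
```

Large U-values then mark cluster borders (hills), small values the cluster interiors (valleys).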
5 Experiments<br />
5.1 Mackey-Glass time series<br />
The first task is to learn the dynamic of the real-valued chaotic Mackey-Glass time
series dx/dτ = b · x(τ) + a · x(τ−d) / (1 + x(τ−d)^10), using a = 0.2, b = −0.1, d = 17.
This is the same setup as given in [41], making a comparison of the results possible. 7 Three types
<strong>of</strong> maps with 100 neurons have been trained: a 6-neighbor map without context<br />
giving standard SOM, a map with 6 neighbors and with context (SOM-S), and<br />
a 7-neighbor map providing a hyperbolic grid with context utilization (H-SOM-<br />
S). Each run has been computed with 1.5 · 10^5 presentations starting at random
positions within the Mackey-Glass series, using a sample period of ∆t = 3; the
neuron weights have been initialized randomly (white noise) within [0.6, 1.4]. The context has been
considered by decreasing the parameter from η = 1 to η = 0.97. The learning rate<br />
is exponentially decreased from 0.1 to 0.005 for weight and context update. Initial<br />
neighborhood cooperativity is 10 which is annealed to 1 during training.<br />
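For reproduction purposes, the series can be generated by simple Euler integration of the equation above; the step size and the constant initial history are illustrative choices, not taken from the paper:

```python
import numpy as np

def mackey_glass(n, a=0.2, b=-0.1, d=17, dt=0.1, x0=1.2):
    """Euler integration of dx/dτ = b·x(τ) + a·x(τ−d) / (1 + x(τ−d)^10)
    (sketch; dt and the constant initial history x0 are assumptions)."""
    hist = int(d / dt)                 # number of steps covering the delay d
    x = np.full(n + hist, x0)          # constant initial history
    for i in range(hist, n + hist - 1):
        xd = x[i - hist]               # delayed value x(τ − d)
        x[i + 1] = x[i] + dt * (b * x[i] + a * xd / (1 + xd ** 10))
    return x[hist:]
```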
Figure 2 shows the temporal quantization error for the above setups: the temporal<br />
quantization error is expressed by the average standard deviation <strong>of</strong> the given sequence<br />
and the mean unit receptive field for 29 time steps into the past. Similar<br />
6 Preliminary experiments indicate that the context also orders topologically and yields<br />
meaningful clusters. The number <strong>of</strong> neurons in context clusters is thereby small compared<br />
to the number <strong>of</strong> neurons and statistically significant results could not be obtained.<br />
7 We would like to thank T.Voegtlin for providing data for comparison.<br />
to Voegtlin’s results, we observe large cyclic oscillations driven by the periodicity<br />
<strong>of</strong> the training series for standard SOM. Since SOM does not take contextual information<br />
into account, this quantization result can be seen as an upper bound for<br />
temporal models, at least for the indices > 0 reaching into the past (trivially, SOM<br />
is a very good quantizer <strong>of</strong> scalar elements without history); the oscillating shape<br />
<strong>of</strong> the curve is explained by the continuity <strong>of</strong> the series and its quasi-periodic dynamic,<br />
and extrema exist rather by the nature <strong>of</strong> the series than by special model<br />
properties. Obviously, the very restricted context <strong>of</strong> RSOM does not yield a long<br />
term improvement <strong>of</strong> the temporal quantization error. However, the displayed error<br />
periodicity is anti-cyclic compared to the original series. Interestingly, the data<br />
optimum topology <strong>of</strong> neural gas (NG), which also does not take contextual information<br />
into account, allows a reduction <strong>of</strong> the overall quantization error; however,<br />
the main characteristics, such as the periodicity, remain the same as for standard<br />
SOM. RecSOM leads to a much better quantization error than RSOM and also NG.<br />
Thereby, the error is minimal for the immediate past (left side of the diagram)
and increases going back in time, which is reasonable because of the weighting
of the context influence by (1 − η). The increase of the quantization error is smooth,
and the final value after 29 time steps is better than the default given by standard
SOM. In addition, almost no periodicity can be observed for RecSOM. SOM-S
and H-SOM-S further improve the results: only some periodicity can be observed,<br />
and the overall quantization error increases smoothly for the past values. Note that<br />
these models are superior to RecSOM in this task while requiring less computational<br />
power. H-SOM-S allows a slightly better representation <strong>of</strong> the immediate<br />
past compared to SOM-S due to the hyperbolic topology <strong>of</strong> the lattice structure<br />
that matches better the characteristics <strong>of</strong> the input data.<br />
[Plot: quantization error (0 to 0.2) versus the index of past inputs (0 to 30; index 0: present) for SOM, RSOM, NG, RecSOM, SOM-S, and H-SOM-S.]
Fig. 2. Temporal quantization errors <strong>of</strong> different model setups for the Mackey-Glass series.<br />
Results indicated by ∗ are taken from [41].<br />
5.2 Binary automata<br />
The second experiment is also inspired by Voegtlin. A discrete 0/1-sequence generated
by a binary automaton with P(0|1) = 0.4 and P(1|0) = 0.3 shall be learned.
For discrete data, the specialization of a neuron can be defined as the longest sequence
that still leads to unambiguous winner selection. A high percentage of specialized
neurons indicates that temporal context has been learned by the map. In
addition, one can compare the distribution of specializations with the original distribution
of strings as generated by the underlying probability. Figure 3 shows the
specialization of a trained H-SOM-S. Training has been carried out with 3·10^6 presentations,
increasing the context influence (1 − η) exponentially from 0 to 0.06.
The remaining parameters have been chosen as in the first experiment. Finally, the
receptive fields have been computed by providing an additional 10^6 test
iterations. Putting more emphasis on the context results in a smaller number of active
neurons representing rather long strings that cover only a small part of the total
input space. If a Euclidean lattice is used instead of a hyperbolic neighborhood,
the resulting quantizers differ only slightly, which indicates that the representation
of binary symbols and their contexts in the 2-dimensional output space
barely benefits from exponential branching. In the depicted run, 64 of
the neurons express a clear profile, whereas the other neurons are located at sparse
locations of the input data topology, between cluster boundaries, and thus do not
win for the presented stimuli. The distribution corresponds nicely to the 100 most
characteristic sequences of the probabilistic automaton as indicated by the graph.
Unlike for RecSOM (presented in [41]), neurons at interior nodes of the tree are also
expressed for H-SOM-S. These nodes refer to transient states, which are represented
by corresponding winners in the network. RecSOM, in contrast to SOM-S,
does not rely on the winner index only, but uses a more complex representation:
since the transient states are spared, longer sequences can be expressed by
RecSOM. In addition to the examination of neuron specialization, the whole map
[Figure: binary tree of depth up to 11 comparing the 100 most likely sequences of the automaton with the receptive fields of a H-SOM-S with 100 neurons, 64 of them specialized.]
Fig. 3. Receptive fields of a H-SOM-S compared to the most probable sub-sequences of the binary automaton. Left hand branches denote 0, right is 1.
Type           P(0)           P(1)           P(0|0)   P(1|0)   P(0|1)   P(1|1)
Automaton 1    4/7 ≈ 0.571    3/7 ≈ 0.429    0.7      0.3      0.4      0.6
Map (98/100)   0.571          0.429          0.732    0.268    0.366    0.634
Automaton 2    2/7 ≈ 0.286    5/7 ≈ 0.714    0.8      0.2      0.08     0.92
Map (138/141)  0.297          0.703          0.75     0.25     0.12     0.88
Automaton 3    0.5            0.5            0.5      0.5      0.5      0.5
Map (138/141)  0.507          0.493          0.508    0.492    0.529    0.471
Table 1
Results for binary automata extraction with different transition probabilities. The extracted
probabilities clearly follow the original ones.
representation can be characterized by comparing the input symbol transition statistics
with the learned context-neuron relations. While the current symbol is coded
by the winning neuron's weight, the previous symbol is represented by the average
of the weights of the winner's context triangle neurons. The two obtained values, the
neuron's state and the average state of the neuron's context, are clearly expressed
in the trained map: only few neurons contain values in the indeterminate interval
[1/3, 2/3]; most neurons specialize on values very close to 0 or 1. Results for the reconstruction
of three automata can be found in table 1. For the reconstruction we have
used the algorithm described in section 4.2 with memory length 1. The left column
indicates the number of expressed neurons and the total number of neurons in the
map. Note that the automata can be well reobtained from the trained maps. Again,
the temporal dependencies are clearly captured by the maps.
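The statistics of Automaton 1 in table 1 can be checked directly from a generated sequence. The sketch below is illustrative counting code, not the map-based extraction of section 4.2: it generates the 0/1-sequence with P(1|0) = 0.3 and P(0|1) = 0.4, whose stationary distribution solves pi0 · P(1|0) = pi1 · P(0|1) and hence equals (4/7, 3/7), and then estimates the memory-1 conditionals by counting:

```python
import numpy as np

rng = np.random.default_rng(1)

# Automaton 1: P(1|0) = 0.3, P(0|1) = 0.4, so P(0|0) = 0.7, P(1|1) = 0.6.
p_switch = {0: 0.3, 1: 0.4}

def generate(n, state=0):
    seq = []
    for _ in range(n):
        seq.append(state)
        if rng.random() < p_switch[state]:
            state = 1 - state
    return np.array(seq)

seq = generate(200_000)

# Unigram statistics: stationary distribution (4/7, 3/7) ≈ (0.571, 0.429).
p0 = np.mean(seq == 0)

# Memory-1 conditionals estimated by counting symbol transitions.
pairs = np.stack([seq[:-1], seq[1:]], axis=1)
p_00 = np.mean(pairs[pairs[:, 0] == 0][:, 1] == 0)
p_11 = np.mean(pairs[pairs[:, 0] == 1][:, 1] == 1)
print(round(p0, 2), round(p_00, 2), round(p_11, 2))
```

The counted values agree with the "Automaton 1" row of table 1 up to sampling noise.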
5.3 Reber grammar<br />
In a third experiment we have used more structured symbolic sequences as generated
by the Reber grammar illustrated in figure 4. The 7 symbols have been coded
in a 6-dimensional Euclidean space by points that are placed like the four corners
of a tetrahedron in three dimensions: all points have the same distance
Fig. 4. State graph of the Reber grammar.
from each other. For training and testing we have taken the concatenation of randomly
generated words, thus preparing sequences of 3·10^6 and 10^6 input vectors,
respectively. The map has a radius of 5 and contains 617 neurons on a
hyperbolic grid. For the initialization and the training, the same parameters as in the
previous experiment were used, except for an initially larger neighborhood range of
14, corresponding to the larger map. Context influence was taken into account by
decreasing η from 1 to 0.8 during training. A total of 338 neurons developed a
specialization for Reber strings with an average length of 7.23 characters. Figure 5
shows that the neuron specializations produce strict clusters on the circular grid,
ordered in a topological way by the last character. In agreement with the grammar,
the letter T takes the largest sector on the map. The underlying hyperbolic lattice
gives rise to sectors, because these clearly minimize the boundaries between the 7
classes. The symbol separation is further emphasized by the existence of idle neurons
between the boundaries, which can be interpreted analogously to large values in a
U-Matrix. Since neuron specialization takes place from the most common states
(the 7 root symbols) to the increasingly special cases, the central nodes
have fallen idle after having served as signposts during training; finally, the most
specialized nodes with their associated strings are found at the lattice edge on the
outer ring. Much in contrast to this ordered hyperbolic target lattice, the result
for the Euclidean grid in figure 7 shows a neuron arrangement in the form of
polymorphic coherent patches.
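The symbol coding described above (7 pairwise equidistant points in 6 dimensions, the higher-dimensional analogue of a tetrahedron) admits a simple construction: center the canonical basis of R^7, which yields 7 points of equal mutual distance spanning only a 6-dimensional subspace. This is a plausible sketch of such a coding, not necessarily the exact one used here:

```python
import numpy as np

# Seven pairwise equidistant points living in a 6-dimensional subspace:
# take the canonical basis of R^7 and subtract the centroid.
E = np.eye(7)
P = E - E.mean(axis=0)           # centering preserves mutual distances

# All pairwise distances equal sqrt(2), as for the unit basis vectors.
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
off = D[~np.eye(7, dtype=bool)]
assert np.allclose(off, np.sqrt(2))

# The centered points have rank 6 (the all-ones vector is in the null
# space), so explicit 6-D coordinates can be read off via SVD.
U, s, Vt = np.linalg.svd(P)
assert np.sum(s > 1e-10) == 6    # a regular 6-simplex
P6 = P @ Vt[:6].T                # explicit 6-dimensional coordinates
```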
Similar to the binary automata learning tasks, we have analyzed the map representation
by reconstructing the trained data, backtracking all possible context
sequences of each neuron up to length 3. Only 118 of all 343 combinatorially possible
trigrams are realized. In a ranked table, the 33 most likely strings cover all
attainable Reber trigrams. In the log-probability plot of figure 6 there is a leap between entry
number 33 (TSS, valid) and 34 (XSX, invalid), emphasizing the presence of the Reber
characteristic. The correlation of the probabilities of Reber trigrams and their
relative frequencies found in the map is 0.75. An explicit comparison of the probabilities
of valid Reber strings can be found in figure 8. The values deviate from the
true probabilities, in particular for cycles of the Reber graph, such as consecutive
letters T and S, or the VPX-circle. This effect is due to the magnification factor
of the SOM being different from 1, an effect that is further amplified when sequences
are processed in the proposed recursive manner.
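A generator for such training material is easy to sketch. The transition table below encodes the standard Reber grammar topology, which is assumed to match figure 4; the counting at the end illustrates why only a fraction of the 7^3 = 343 combinatorial trigrams (such as the invalid XSX) can ever occur:

```python
import random
from collections import Counter

# Standard Reber grammar (assumed topology): state -> possible
# (symbol, next state) choices; 'E' leads to the final state.
REBER = {
    0: [('B', 1)],
    1: [('T', 2), ('P', 3)],
    2: [('S', 2), ('X', 4)],
    3: [('T', 3), ('V', 5)],
    4: [('X', 3), ('S', 6)],
    5: [('P', 4), ('V', 6)],
    6: [('E', None)],
}

def reber_word(rng):
    state, word = 0, []
    while state is not None:
        sym, state = rng.choice(REBER[state])
        word.append(sym)
    return ''.join(word)

rng = random.Random(2)
# Concatenate random words, as done for the training sequence, and
# count which trigrams actually occur in the stream.
stream = ''.join(reber_word(rng) for _ in range(20_000))
trigrams = Counter(stream[i:i + 3] for i in range(len(stream) - 2))
print(len(trigrams), 'distinct trigrams out of 343')
```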
5.4 Finite memory models<br />
In a final series of experiments, we examine a SOM-S trained on Markov models
with noisy input sequence entries. We investigate the possibility of extracting temporal
dependencies on real-valued sequences from a trained map. The Markov model
possesses a memory length of 2, as depicted in figure 9. The basic symbols are denoted
by a, b, and c. These are embedded in two dimensions, disrupted by noise, as
[Figure: hyperbolic lattice populated with neuron specializations such as TVVEBTSSX, SEBTSSX, XXTVPS, EBPVVEB, and TTTTT; unspecialized neurons are marked by dots.]
Fig. 5. Arrangement of Reber words on a hyperbolic lattice structure. The words are arranged
according to their most recent symbols (shown on the right of the sequences). The
hyperbolic lattice yields a sector partitioning.
[Figure: log-probability (y-axis, −6 to −1) versus the index of the 3-letter word (x-axis, 0 to 110).]
Fig. 6. Likelihood of extracted trigrams. The most probable combinations are given by valid
trigrams, and a gap of the likelihood can be observed for the first invalid combination.
[Figure: Euclidean lattice populated with neuron specializations such as TTVPS, EBTS, TVVEBP, and TTTTT; unspecialized neurons are marked by asterisks.]
Fig. 7. Arrangement of Reber words on a Euclidean lattice structure. The words are arranged
according to their most recent symbols (shown on the right of the sequences).
Patches emerge according to the most recent symbol. Within the patches, an ordering according
to the preceding symbols can be observed.
Fig. 8. Frequency reconstruction of trigrams from the Reber grammar.
follows: a stands for (0, 0) + µ, b for (1, 0) + µ, and c for (0, 1) + µ, where µ is independent
Gaussian noise with standard deviation σ_g, a variable to be tested
in the experiments. The symbols are denoted right to left, i.e. ab indicates that the
currently emitted symbol is a, after having observed symbol b in the previous step.
Thus, b and c are always succeeded by a, whereas a is succeeded by b with probability
x and by c with probability (1 − x) if the past symbol was b, and vice versa if the
last symbol was c. The transition probability x is varied between the experiments.
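The data generation just described can be sketched as follows; the function name and the seeding are illustrative, not from the paper:

```python
import numpy as np

def markov_sequence(n, x, sigma, seed=0):
    """Order-2 Markov series over a, b, c embedded in the plane:
    a = (0,0)+mu, b = (1,0)+mu, c = (0,1)+mu with Gaussian noise mu.
    b and c are always followed by a; a is followed by b with
    probability x if preceded by b (and by c otherwise), and vice
    versa if preceded by c."""
    rng = np.random.default_rng(seed)
    coords = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (0.0, 1.0)}
    prev, cur = 'b', 'a'
    symbols = []
    for _ in range(n):
        symbols.append(cur)
        if cur in 'bc':
            nxt = 'a'
        elif prev == 'b':                  # state ab
            nxt = 'b' if rng.random() < x else 'c'
        else:                              # state ac
            nxt = 'c' if rng.random() < x else 'b'
        prev, cur = cur, nxt
    clean = np.array([coords[s] for s in symbols])
    return symbols, clean + rng.normal(0.0, sigma, clean.shape)

syms, data = markov_sequence(1000, x=0.4, sigma=0.1)
```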
We train a SOM-S with a regular rectangular two-dimensional lattice structure and
100 neurons on a generated Markov series. The context parameter was decreased
from η = 0.97 to η = 0.93, the neighborhood radius was decreased from σ = 5
to σ = 0.5, and the learning rate was annealed from 0.02 to 0.005. A set of 1000
patterns was presented over 15000 cycles. U-Matrix clustering has been calculated
at a level of the landscape such that half the neurons are contained in valleys.
Neurons in the same valley are assigned to the same cluster, and the
number of different clusters is determined. Afterwards, all remaining neurons
are assigned to their closest cluster.
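The valley-based clustering step can be sketched as follows for a plain rectangular grid; the helper names and the median threshold (which puts roughly half the neurons into valleys) are an illustrative reading of the scheme above:

```python
import numpy as np

def _label(mask):
    """4-connected components of a boolean grid (simple BFS labeling)."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for r, c in zip(*np.nonzero(mask)):
        if labels[r, c]:
            continue
        current += 1
        stack = [(r, c)]
        labels[r, c] = current
        while stack:
            i, j = stack.pop()
            for i2, j2 in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if (0 <= i2 < mask.shape[0] and 0 <= j2 < mask.shape[1]
                        and mask[i2, j2] and not labels[i2, j2]):
                    labels[i2, j2] = current
                    stack.append((i2, j2))
    return labels, current

def umatrix_clusters(weights):
    """weights: (rows, cols, dim) neuron weights on a rectangular grid.
    Threshold the U-height at the median so roughly half the neurons lie
    in valleys, label connected valleys, then attach every remaining
    neuron to the closest labeled grid cell."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            nb = [np.linalg.norm(weights[r, c] - weights[r2, c2])
                  for r2, c2 in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                  if 0 <= r2 < rows and 0 <= c2 < cols]
            u[r, c] = np.mean(nb)              # U-height of this neuron
    labels, n_clusters = _label(u <= np.median(u))
    rr, cc = np.nonzero(labels)                # already-labeled cells
    for r, c in zip(*np.nonzero(labels == 0)):
        nearest = np.argmin((rr - r) ** 2 + (cc - c) ** 2)
        labels[r, c] = labels[rr[nearest], cc[nearest]]
    return labels, n_clusters

# Two well-separated weight plateaus should yield two clusters.
W = np.zeros((6, 6, 2))
W[:, 3:] = 10.0
labels, n = umatrix_clusters(W)
```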
First, we choose a noise level of σ_g = 0.1, such that almost no overlap can be
observed, and we investigate this setup with different x between 0 and 0.8. In all
the results, three distinct clusters, corresponding to the three symbols, are found
with the U-Matrix method. The extraction of the order 2 Markov models indicates
that the global transition probabilities are correctly represented in the maps. Table 2
shows the corresponding extracted probabilities. Thereby, the exact probabilities
cannot be recovered, because the magnification factor of the SOM differs from 1.
However, the global trend is clearly found, and the extracted probabilities are in
good agreement with the values chosen beforehand.
In a second experiment, the transition probability is fixed to x = 0.4, but the noise
level is modified, choosing σ_g between 0.1 and 0.5. All the training parameters are
chosen as in the previous experiment. Note that a noise level of σ_g = 0.3 already yields
much overlap of the classes, as depicted in figure 10. Nevertheless, three clusters
can be detected in all of the cases and the transition probabilities can be recovered,
except for a noise level of 0.5, for which the training scenario degenerates to an
almost deterministic case, making a the most dominant state. Table 3 summarizes
the extracted probabilities.
[Figure: state graph over the four states ab, ac, ba, ca with transition probabilities x, 1 − x, and 1.]
Fig. 9. Markov automaton with 3 basic states and a finite order of 2 used to train the map.
Fig. 10. Symbols a, b, c embedded in R^2 as a = (0, 0) + µ, b = (1, 0) + µ, and
c = (0, 1) + µ, subject to noise µ with different variances: the noise levels are 0.1, 0.3, and 0.4.
The latter two noise levels show considerable overlap of the classes representing the
symbols.
x         0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
P(a|ab)   0     0.01  0     0.01  0     0.04  0     0.04  0.01
P(b|ab)   0     0.08  0.3   0.31  0.38  0.55  0.68  0.66  0.78
P(c|ab)   1     0.91  0.7   0.68  0.62  0.41  0.32  0.3   0.21
P(a|ac)   0     0     0     0     0     0.01  0.01  0     0.01
P(b|ac)   1     0.81  0.8   0.66  0.52  0.55  0.32  0.31  0.24
P(c|ac)   0     0.19  0.2   0.34  0.48  0.44  0.67  0.69  0.75
Table 2
Transition probabilities extracted from the trained map. The noise level was fixed to 0.1
and different generating transition probabilities x were used.
noise     0.1   0.2   0.3   0.4   0.5   true
P(a|ab)   0.01  0     0     0.1   0.98  0
P(b|ab)   0.42  0.49  0.4   0.24  0.02  0.4
P(c|ab)   0.57  0.51  0.6   0.66  0.02  0.6
P(a|ac)   0.01  0     0     0.09  0     0
P(b|ac)   0.59  0.6   0.44  0.39  0     0.6
P(c|ac)   0.4   0.4   0.56  0.52  0     0.4
Table 3
Probabilities extracted from the trained map with fixed input transition probabilities and
different noise levels. For a noise level of 0.5, the extraction mechanism breaks down and
the symbol a becomes most dominant. For smaller noise levels, the symbols can still be
extracted even for overlapping clusters, because of the temporal differentiation of the
clusters in recursive models.
6 Conclusions<br />
We have presented a self-organizing map with a neural back-reference to the previously
active sites and with a flexible topological structure of the neuron grid. For
context representation, the compact and powerful SOMSD model as proposed in
[11] has been used. Compared to TKM and RSOM, much more flexibility and expressiveness
is obtained, because the context is represented in the space spanned by
the neurons, and not only in the domain of the weight space. Compared to RecSOM,
which is based on very extensive contexts, the SOMSD model is much more
efficient. However, SOMSD requires an appropriate topological representation of
the symbols, measuring distances of contexts in the grid space. We have therefore
extended the map configuration to more general triangular lattices, thus also making
hyperbolic models possible, as introduced in [30]. Our SOM-S approach has
been evaluated on several data series including discrete and real-valued entries.
Two experimental setups have been taken from [41] to allow a direct comparison
with different models. As pointed out, the compact model introduced here improves on
the capacity of simple leaky integrator networks like TKM and RSOM and shows
results competitive with the more complex RecSOM.
Since the context of SOM-S directly refers to the previous winner, temporal contexts
can be extracted from a trained map. An extraction scheme to obtain Markov
models of fixed order has been presented, and its reliability has been confirmed in
three experiments. As demonstrated, this mechanism can be applied to real-valued
sequences, expanding U-Matrix methods to the recursive case.
So far, the topological structure of context formation has not been taken into account
during the extraction. Context clusters, in addition to weight clusters, provide
more information, which might be used for determining appropriate orders
of the models, or for the extraction of more complex settings like hidden Markov
models. We are currently investigating these issues experimentally. However, preliminary
results indicate that Hebbian training, as introduced in this article, allows
the reliable extraction of finite memory models only. More sophisticated training
algorithms should be developed for more complex temporal dependencies.
Interestingly, the proposed context model can be interpreted as the development
of long range synaptic connections, leading to more specialized map regions. Statistical
counterparts to unsupervised sequence processing, like the Generative Topographic
Mapping Through Time (GTMTT) [5], incorporate similar ideas by describing
temporal data dependencies with hidden Markov latent space models. Such a
context affects the prior distribution on the space of neurons. Due to computational
restrictions, the transition probabilities of GTMTT are usually limited to local
connections only. Thus, long range connections as in the presented context model
do not emerge; rather, visualizations similar to (though more powerful than) those of
TKM and RSOM arise. It could be interesting to develop more efficient statistical
counterparts which also allow the emergence of interpretable long range connections
such as those of the deterministic SOM-S.
References
[1] G. Barreto and A. Araújo. Time in self-organizing maps: An overview of models.
Int. Journ. of Computer Research, 10(2):139–179, 2001.
[2] G. de A. Barreto, A. F. R. Araújo, and S. C. Kremer. A taxonomy for spatiotemporal
connectionist networks revisited: the unsupervised case. Neural Computation,
15(6):1255–1320, 2003.
[3] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing
feature map. IEEE Transactions on Neural Networks, 8(2):218–226, 1997.
[4] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: the generative topographic
mapping. Neural Computation, 10(1):215–235, 1998.
[5] C. M. Bishop, G. E. Hinton, and C. K. I. Williams. GTM through time. Proceedings
IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K.,
pages 111–116, 1997.
[6] P. Bühlmann and A. J. Wyner. Variable length Markov chains. Annals of Statistics,
27:480–513, 1999.
[7] O. A. Carpinteiro. A hierarchical self-organizing map for sequence recognition.
Neural Processing Letters, 9(3):209–220, 1999.
[8] G. Chappell and J. Taylor. The temporal Kohonen map. Neural Networks, 6:441–445,
1993.
[9] I. Farkas and R. Miikkulainen. Modeling the self-organization of directional
selectivity in the primary visual cortex. Proceedings of ICANN'99, Edinburgh,
Scotland, pages 251–256, 1999.
[10] M. Hagenbuchner, A. C. Tsoi, and A. Sperduti. A supervised self-organising map for
structured data. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in
Self-Organising Maps, pages 21–28. Springer, 2001.
[11] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A self-organizing map for adaptive
processing of structured data. IEEE Transactions on Neural Networks, 14(3):491–505,
2003.
[12] B. Hammer. On the learnability of recursive data. Mathematics of Control, Signals,
and Systems, 12:62–79, 1999.
[13] B. Hammer, A. Micheli, and A. Sperduti. A general framework for unsupervised
processing of structured data. In M. Verleysen, editor, European Symposium on
Artificial Neural Networks'2002, pages 389–394. D Facto, 2002.
[14] B. Hammer, A. Micheli, M. Strickert, and A. Sperduti. A general framework for
unsupervised processing of structured data. To appear in: Neurocomputing.
[15] B. Hammer, A. Micheli, and A. Sperduti. A general framework for self-organizing
structure processing neural networks. Technical report TR-03-04, Università di Pisa,
2003.
[16] J. Joutsensalo and A. Miettinen. Self-organizing operator map for nonlinear dimension
reduction. Proceedings ICNN'95, 1:111–114, IEEE, 1995.
[17] J. Kangas. On the analysis of pattern sequences by self-organizing maps. PhD thesis,
Helsinki University of Technology, Espoo, Finland, 1994.
[18] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM – self-organizing maps of document collections. Neurocomputing, 21(1):101–117, 1998.
[19] S. Kaski and J. Sinkkonen. A topography-preserving latent variable model with learning metrics. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 224–229. Springer, 2001.
[20] T. Kohonen. The ‘neural’ phonetic typewriter. Computer, 21(3):11–22, 1988.
[21] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 2001.
[22] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Recurrent SOM with local linear models in time series prediction. In M. Verleysen, editor, 6th European Symposium on Artificial Neural Networks, pages 167–172. D-Facto, 1998.
[23] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Time series prediction using recurrent SOM with local linear models. International Journal of Knowledge-based Intelligent Engineering Systems, 2(1):60–68, 1998.
[24] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review. Neural Computation, 13(2):249–306, 2001.
[25] J. Laaksonen, M. Koskela, S. Laakso, and E. Oja. PicSOM – content-based image retrieval with self-organizing maps. Pattern Recognition Letters, 21(13-14):1199–1207, 2000.
[26] J. Lampinen and E. Oja. Self-organizing maps for spatial and temporal AR models. In M. Pietikäinen and J. Röning, editors, Proceedings of the 6th SCIA, pages 120–127, Helsinki, Finland, 1989.
[27] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507–522, 1994.
[28] T. Martinetz, S. G. Berkovich, and K. J. Schulten. ‘Neural-gas’ network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558–569, 1993.
[29] J. Ontrup and H. Ritter. Text categorization and semantic browsing with self-organizing maps on non-Euclidean spaces. In L. De Raedt and A. Siebes, editors, Proceedings of PKDD-01, pages 338–349. Springer, 2001.
[30] H. Ritter. Self-organizing maps on non-Euclidean spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97–110. Elsevier, 1999.
[31] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, 1992.
[32] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25:117–150, 1996.
[33] J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217–239, 2002.
[34] P. Sommervuo. Self-organizing maps for signal and symbol sequences. PhD thesis, Helsinki University of Technology, 2000.
[35] A. Sperduti. Neural networks for adaptive processing of structured data. In Proceedings of ICANN 2001, pages 5–12. Springer, 2001.
[36] M. Strickert, T. Bojer, and B. Hammer. Generalized relevance LVQ for time series. In Proceedings of ICANN 2001, pages 677–683. Springer, 2001.
[37] M. Strickert and B. Hammer. Neural gas for sequences. In Proceedings of WSOM'03, pages 53–57, 2003.
[38] A. Ultsch and C. Vetter. Self-organizing feature maps versus statistical clustering: A benchmark. Research Report No. 9, Department of Mathematics, University of Marburg, 1994.
[39] M. Varsta, J. del R. Millán, and J. Heikkonen. A recurrent self-organizing map for temporal sequence processing. In Proceedings of ICANN'97, pages 421–426. Springer, 1997.
[40] M. Varsta, J. Heikkonen, and J. Lampinen. Analytical comparison of the temporal Kohonen map and the recurrent self-organizing map. In M. Verleysen, editor, ESANN'2000, pages 273–280. D-Facto, 2000.
[41] T. Voegtlin. Recursive self-organizing maps. Neural Networks, 15(8-9):979–991, 2002.