
Unsupervised Recursive Sequence Processing

Marc Strickert, Barbara Hammer
Research group LNM, Department of Mathematics/Computer Science,
University of Osnabrück, Germany

Sebastian Blohm
Institute for Cognitive Science,
University of Osnabrück, Germany

Email address: {marc,hammer}@informatik.uni-osnabrueck.de (Marc Strickert, Barbara Hammer)

Preprint submitted to Elsevier Science, 23 January 2004

Abstract

The self-organizing map (SOM) is a valuable tool for data visualization and data mining for potentially high-dimensional data of an a priori fixed dimensionality. We investigate SOMs for sequences and propose the SOM-S architecture for sequential data. Sequences of potentially infinite length are recursively processed by integrating the currently presented item and the recent map activation, as proposed in [11]. We combine that approach with the hyperbolic neighborhood of Ritter [29] in order to account for the representation of possibly exponentially increasing sequence diversification over time. Discrete and real-valued sequences can be processed efficiently with this method, as we will show in experiments. Temporal dependencies can be reliably extracted from a trained SOM. U-Matrix methods, adapted to sequence-processing SOMs, allow the detection of clusters also for real-valued sequence elements.

Key words: Self-organizing map, sequence processing, recurrent models, hyperbolic SOM, U-Matrix, Markov models

1 Introduction

Unsupervised clustering by means of the self-organizing map (SOM) was first proposed by Kohonen [21]. The SOM makes the exploration of high-dimensional data possible and allows the exploration of the topological data structure. By SOM training, the data space is mapped to a typically two-dimensional Euclidean grid of neurons, preferably in a topology-preserving manner. Prominent applications of the SOM are WEBSOM for the retrieval of text documents and PicSOM for the recovery and ordering of pictures [18,25]. Various alternatives and extensions to the standard SOM exist, such as statistical models, growing networks, alternative lattice structures, or adaptive metrics [3,4,19,27,28,30,33].

If temporal or spatial data are dealt with – like time series, language data, or DNA strings – sequences of potentially unrestricted length constitute a natural domain for data analysis and classification. Unfortunately, the temporal scope is unknown in most cases, and therefore fixed vector dimensions, as used for the standard SOM, cannot be applied. Several extensions of the SOM to sequences have been proposed; for instance, time-window techniques or data representation by statistical features make processing with standard methods possible [21,28]. Due to data selection or preprocessing, information might get lost; for this reason, a data-driven adaptation of the metric or the grid is strongly advisable [29,33,36]. The first widely used application of the SOM in sequence processing employed the temporal trajectory of the best matching units of a standard SOM in order to visualize speech signals and their variations [20]. This approach, however, does not operate on sequences as such; rather, the SOM is used to reduce the dimensionality of single sequence entries and thus acts as a preprocessing mechanism. Proposed alternatives substitute the standard Euclidean metric by similarity operators on sequences, incorporating autoregressive processes or time warping strategies [16,26,34]. These methods are very powerful, but their computational cost is a major problem.

A fundamental approach to sequence processing is recursion. Supervised recurrent networks constitute a well-established generalization of standard feedforward networks to time series; many successful applications for different sequence classification and regression tasks are known [12,24]. Recurrent unsupervised models have also been proposed: the temporal Kohonen map (TKM) and the recurrent SOM (RSOM) use the biologically plausible dynamics of leaky integrators [8,39], as they occur in organisms, and explain phenomena such as direction selectivity in the visual cortex [9]. Furthermore, these models have been applied with moderate success to learning tasks [22]. Better results have been achieved by integrating them into more complex systems [7,17]. More recent and more powerful approaches are the recursive SOM (RecSOM) and the SOM for structured data (SOMSD) [10,41]. These are based on a richer and explicit representation of the temporal context: they use the activation profile of the entire map or the index of the most recent winner. As a result, their representation ability is superior to that of RSOM and TKM.

A proposal to put existing unsupervised recursive models into a taxonomy can be found in [1,2]. The latter article identifies the entity 'time context' used by the models as one of the main branches of the given taxonomy [2]. Although more general, the models are still quite diverse, and the recent developments of [10,11,35] are not included in the taxonomy. An earlier, simple, and elegant general description of recurrent models with an explicit notion of context has been introduced in [13,14].



This framework directly generalizes the dynamics of supervised recurrent networks to unsupervised models, and it contains TKM, RecSOM, and SOMSD as special cases. As pointed out in [15], the precise approaches differ with respect to the notion of context and therefore yield different accuracies and computational complexities, but their basic dynamic is the same. TKM is restricted due to the locality of its context representation, whereas RecSOM and SOMSD also include global information. In that regard, SOMSD can be interpreted as a modification of RecSOM, based on a compression of the RecSOM context model, and is computationally less demanding. Alternative efficient compression schemes such as the Merging SOM (MSOM) have recently been developed [37].

Here, we focus on the compact and flexible representation of temporal context obtained by linking the current winner neuron to the most recently presented sequence element: a neuron's temporal context is given by an explicit back-reference to the best matching unit of the past time step, representing the previously processed input as the location of the last winning neuron in the map, as proposed in [10]. In comparison to RecSOM, this yields a greatly reduced computation time: the context of SOMSD is a low-dimensional (usually two-dimensional) vector, compared to an N-dimensional vector for RecSOM, N being the number of neurons (usually, N is at least 100). In addition, the explicit reference to the past winning unit allows elegant ways of extracting temporal dependencies. We will show how Markov models can easily be obtained from a trained map. This is possible not only for discrete input symbols, but also for real-valued sequence entries, by applying an adaptation of standard U-Matrix methods [38] to recursive SOMs. In this article, we demonstrate the faithful representation of several time series and Markov processes within SOMSD. However, SOMSD relies heavily on an adequate grid topology, because the distance of context representations is measured within the grid structure. It can be expected that low-dimensional regular lattices do not capture typical characteristics of the space of time series. For this reason, we extend the SOMSD approach to more general topologies, that is, to possibly non-Euclidean triangular grid structures. In particular, we combine a hyperbolic grid with the last-winner-in-grid temporal back-reference. Hyperbolic grid structures have been proposed and successfully applied to document organization and retrieval [29,30]. Unlike rectangular lattices with their inherent power-law neighborhood growth, the HSOM implements exponential neighborhood growth. For discrete and real-valued time series, we evaluate the combination of hyperbolic lattices with the recurrent dynamics, putting the focus on neuron specialization, activations, and weight distributions.

First, we present some recursive self-organizing map models introduced in the literature, which use different notions of context. Then, we explain the SOM for structured data (SOMSD) adapted to sequences in detail, and we extend the model to arbitrary triangular grid structures. After that, we propose an algorithm to extract Markov models from a trained map, and we show how this algorithm can be combined with U-Matrix methods. Finally, we demonstrate the sequence representation capabilities in experiments, using several discrete and real-valued benchmark series.



2 Unsupervised processing of sequences

Let input sequences be denoted by s = (s_1, ..., s_t) with entries s_i in an alphabet Σ which is embedded in a real vector space R^n. The element s_1 denotes the most recent entry of the sequence, and t is the sequence length. The set of sequences of arbitrary length over symbols Σ is Σ*, and Σ^l is the space of sequences of length l.

Popular recursive sequence processing models are the temporal Kohonen map, the recurrent SOM, the recursive SOM, and the SOM for structured data [8,11,39,41]. The SOMSD has originally been proposed for the more general case of tree structure processing; here, only sequences, i.e. trees with a node fan-out of 1, are considered. As for the standard SOM, a recursive neural map is given by a set of neurons n_1, ..., n_N. The neurons are arranged on a grid, often a two-dimensional regular lattice. All neurons are equipped with weights w_i ∈ R^n.

Two important ingredients have to be defined to specify self-organizing network models: the data metric and the network update. The metric addresses the question of how an appropriate distance can be defined to measure the similarity of possibly sequential input signals to map units. For this purpose, the sequence entries are compared with the weight parameters stored at the neurons. The set of input signals for which a given neuron i is closest is called the receptive field of neuron i, and neuron i is the winner and representative for all signals within its receptive field. In the following, we will recall the distance computation for the standard SOM and also review several ways found in the literature to compute the distance of a neuron from a sequential input. Apart from the metric, the update procedure or learning rule by which neurons adapt to the input is essential. Commonly, Hebbian or competitive learning takes place (we will use these two terms interchangeably in the following), referring to the following scheme: the parameters of the winner and its neighborhood within a given lattice structure are adapted such that their response to the current signal is increased. Thereby, neighborhood cooperation ensures a topologically faithful mapping.

The standard SOM relies on a simple winner-takes-all scheme and does not account for the temporal structure of inputs. For a stimulus s_i ∈ R^n, the neuron n_j responds for which the squared distance

$$d_{\text{SOM}}(s_i, w_j) = \|s_i - w_j\|^2$$

is minimum, where ‖·‖ is the standard Euclidean metric. Training starts with randomly initialized weights w_i and adapts the parameters iteratively as follows: denote by n_{j_0} the winning neuron for the input signal s_i, and assume a fixed function nhd(n_j, n_k) which indicates the degree of neighborhood of neurons j and k within the chosen lattice structure. Adaptation of all weights w_j takes place by the update rule

$$\Delta w_j = \epsilon \cdot h_\sigma(\text{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j),$$

where ε ∈ (0,1) is the learning rate. The function h_σ describes the amount of neuron adaptation in the neighborhood of the winner: often the Gaussian bell function h_σ(x) = exp(−x²/σ²) is chosen, whose shape is narrowed during training by decreasing σ to ensure neuron specialization. The function nhd(n_j, n_k), which measures the degree of neighborhood of the neurons n_j and n_k within the lattice, might be induced by the simple Euclidean distance between the neuron coordinates in a rectangular grid or by the shortest distance in a graph connecting the two neurons.
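As a concrete reference point for the models that follow, here is a minimal sketch of one standard SOM training step in Python/NumPy (our illustration, not the paper's code); nhd is taken as the Euclidean distance between grid coordinates, one of the options named above:

```python
import numpy as np

def som_step(weights, coords, s, eps, sigma):
    """One competitive-learning step of the standard SOM.

    weights: (N, n) array of neuron weights w_j
    coords:  (N, 2) array of grid coordinates used by nhd
    s:       (n,) current input signal
    eps:     learning rate in (0, 1)
    sigma:   neighborhood width, decreased during training
    """
    # winner: neuron with minimal squared distance ||s - w_j||^2
    j0 = np.argmin(np.sum((weights - s) ** 2, axis=1))
    # nhd(n_j0, n_j): Euclidean distance between grid coordinates
    nhd = np.linalg.norm(coords - coords[j0], axis=1)
    # Gaussian neighborhood h_sigma(x) = exp(-x^2 / sigma^2)
    h = np.exp(-(nhd ** 2) / sigma ** 2)
    # Hebbian update: move weights towards the current signal
    weights += eps * h[:, None] * (s - weights)
    return j0

# usage: a 10x10 grid mapping 3-dimensional data
rng = np.random.default_rng(0)
coords = np.array([(i, j) for i in range(10) for j in range(10)], float)
weights = rng.random((100, 3))
for t in range(1000):
    som_step(weights, coords, rng.random(3), eps=0.1, sigma=3.0 * 0.995 ** t)
```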

Recursive models substitute the one-shot distance computation for a single entry s_i by a recursive formula over all entries of a given sequence s. For all models, sequences are presented recursively, and the current sequence entry s_i is processed in the context set by its predecessors s_{i+1}, s_{i+2}, ... (we use reverse indexing of the sequence entries, s_1 denoting the most recent entry and s_2, s_3, ... its predecessors). The models differ with respect to the representation of the context and in the way the context influences further computation.

The Temporal Kohonen Map (TKM) computes the distance of s = (s_1, ..., s_t) from neuron n_j labeled with w_j ∈ R^n by the leaky integration

$$d_{\text{TKM}}(s, n_j) = \sum_{i=1}^{t} \eta\,(1-\eta)^{i-1}\,\|s_i - w_j\|^2,$$

where η ∈ (0,1) is a memory parameter [8]. A neuron becomes the winner if the current entry s_1 is close to its weight w_j, as in the standard SOM, and if, in addition, the remaining sum (1−η)‖s_2 − w_j‖² + (1−η)²‖s_3 − w_j‖² + ... is also small. This additional term integrates the distances of the neuron's weight from previous sequence entries, weighted by the exponentially decreasing decay factor (1−η)^{i−1}. The context resulting from previous sequence entries points towards neurons whose weights have been close to those entries. Thus, the winner is a neuron whose weight is close to the average presented signal over the recent time steps.
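Because of this recursive structure, the leaky integration can be evaluated incrementally while a sequence is presented entry by entry (oldest first). The following is a minimal NumPy sketch of this computation, our illustration rather than the original implementation:

```python
import numpy as np

def tkm_distances(weights, sequence, eta):
    """Leaky-integrated TKM distances for all neurons.

    Incremental form: d_j <- eta * ||x - w_j||^2 + (1 - eta) * d_j,
    which unrolls to sum_i eta * (1-eta)^(i-1) * ||s_i - w_j||^2
    with s_1 the most recent entry (reverse indexing).
    """
    d = np.zeros(len(weights))
    for x in sequence:  # oldest entry first, most recent last
        d = eta * np.sum((weights - x) ** 2, axis=1) + (1 - eta) * d
    return d  # the winner is np.argmin(d)
```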

The training for the TKM takes place by Hebbian learning in the same way as for the standard SOM, making well-matching neurons more similar to the current input than badly matching neurons. At the beginning, the weights w_j are initialized randomly and then iteratively adapted as data is presented. For adaptation, assume that a sequence s is given, with s_i denoting the current entry and n_{j_0} denoting the best matching neuron for this time step. Then the weight correction term is

$$\Delta w_j = \epsilon \cdot h_\sigma(\text{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j).$$

As discussed in [23], the learning rule of the TKM is unstable and leads to only suboptimal results. The more advanced Recurrent SOM (RSOM) first leakily integrates the weighted directions and only afterwards computes the distance [39]:

$$d_{\text{RSOM}}(s, n_j) = \Bigl\|\, \sum_{i=1}^{t} \eta\,(1-\eta)^{i-1}\,(s_i - w_j) \Bigr\|^2 .$$

It represents the context in a larger space than the TKM, since the vector of directions is stored instead of the scalar Euclidean distance. More importantly, the training rule is changed. RSOM derives its learning rule directly from the objective of minimizing the distortion error on sequences and thus adapts the weights towards the vector of integrated directions:

$$\Delta w_j = \epsilon \cdot h_\sigma(\text{nhd}(n_{j_0}, n_j)) \cdot y_j(i), \qquad y_j(i) = \sum_{k=1}^{t} \eta\,(1-\eta)^{k-1}\,(s_k - w_j).$$

Again, the already processed part of the sequence produces a notion of context, and the winner for the current entry is the neuron whose weight is most similar to the average entry over the past time steps. The training rule of RSOM takes this fact into account by adapting the weights towards this averaged activation. We will not refer to this learning rule in the following. Instead, we will be interested in the way sequences are represented within these two models, and in ways to improve the representational capabilities of such maps.

Assuming vanishing neighborhood influence σ for both TKM and RSOM, one can analytically compute the internal representation of sequences for these two models, i.e. the weights with response optimum to a given sequence s = (s_1, ..., s_t): the weight w is optimum for which

$$w = \sum_{i=1}^{t} (1-\eta)^{i-1} s_i \Big/ \sum_{i=1}^{t} (1-\eta)^{i-1}$$

holds [40]. This explains the encoding scheme of the winner-takes-all dynamics of TKM and RSOM. Sequences are encoded in the weight space by a recursive partitioning very much like the one generating fractal Cantor sets. As an example for explaining this encoding scheme, assume that binary sequences {0,1}^l are dealt with. For η = 0.5, the representation of sequences of fixed length l corresponds to an encoding in a Cantor set: the interval [0, 0.5) represents sequences with most recent entry s_1 = 0, while the interval [0.5, 1) contains only codes of sequences with most recent entry 1. Recursive decomposition of the intervals allows further entries of the sequence to be recovered: [0, 0.25) stands for the beginning 00... of a sequence, [0.25, 0.5) stands for 01, [0.5, 0.75) for 10, and [0.75, 1) represents 11. By further subdivision, [0, 0.125) stands for the beginning 000..., [0.125, 0.25) for 001, and so on. Similar encodings can be found for alternative choices of η. Sequences over discrete sets Σ = {0, ..., d} ⊂ R can be uniquely encoded using this fractal partitioning if η < 1/d. For larger η, the subsets start to overlap, i.e. codes are no longer sorted according to their last symbols, and a code might stand for two or more different sequences. A very small η ≪ 1/d, in turn, results in an only sparsely used space; for example, the interval (d·η, 1] does not contain a valid code. Note that the explicit computation of this encoding stresses the superiority of the RSOM learning rule compared to the TKM update, as pointed out in [40]: the fractal code is a fixed point of the dynamics of RSOM training, whereas TKM converges towards the borders of the intervals, preventing the optimum fractal encoding scheme from developing on its own.
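The fractal code of a concrete sequence can be computed directly from the closed-form optimum weight above. The following small sketch (our illustration, not part of the original experiments) sorts all binary sequences of length 3 by their code and thereby reproduces the interval ordering just described:

```python
from itertools import product

def fractal_code(seq, eta=0.5):
    """Optimum weight for seq = (s_1, ..., s_t), s_1 most recent:
    w = sum_i (1-eta)^(i-1) * s_i / sum_i (1-eta)^(i-1)."""
    num = sum((1 - eta) ** i * s for i, s in enumerate(seq))
    den = sum((1 - eta) ** i for i in range(len(seq)))
    return num / den

# all binary sequences of length 3, sorted by their code: codes of
# sequences with s_1 = 0 all fall below those with s_1 = 1, then
# the ordering recurses on s_2, s_3, as in the Cantor-set intervals
for seq in sorted(product([0, 1], repeat=3), key=fractal_code):
    print(seq, round(fractal_code(seq), 4))
```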

Fractal encoding is reasonable, but limited: it is obviously restricted to discrete sequence entries, and real values or noise might destroy the encoded information. Fractal codes do not differentiate between sequences of different length; e.g. the code 0 gives optimum response to 0, 00, 000, and so forth. Sequences with this kind of encoding cannot be distinguished. In addition, the number of neurons does not influence the expressiveness of the context space. The range in which sequences are encoded is the same as the weight space. Thus, both the size of the weight space and the computational accuracy limit the number of different contexts, independently of the number of neurons in the network.

Based on these considerations, richer and in particular explicit representations of context have been proposed. The models that we introduce in the following extend the parameter space of each neuron j by an additional vector c_j, which is used to explicitly store the sequential context within which a sequence entry is expected. Depending on the model, the context c_j is contained in a representation space of different dimensionality. However, in all cases this space is independent of the weight space and extends the expressiveness of the models in comparison to TKM and RSOM. For each model, we will define the basic ingredients: What is the space of context representations? How is the distance between a sequence entry and neuron j computed, taking into account its temporal context c_j? How are the weights and contexts adapted?

The Recursive SOM (RecSOM) [41] equips each neuron n_j with a weight w_j ∈ R^n that represents the given sequence entry, as usual. In addition, a vector c_j ∈ R^N is provided, N denoting the number of neurons, which explicitly represents the contextual map activation of all neurons in the previous time step. Thus, the temporal context is represented in this model in an N-dimensional vector space. One can think of the context as an explicit storage of the activity profile of the whole map in the previous time step. More precisely, the distance is recursively computed by

$$d_{\text{RecSOM}}((s_1, \ldots, s_t), n_j) = \eta_1 \|s_1 - w_j\|^2 + \eta_2 \|C_{\text{RecSOM}}(s_2, \ldots, s_t) - c_j\|^2,$$

where η_1, η_2 > 0, and

$$C_{\text{RecSOM}}(s) = \bigl(\exp(-d_{\text{RecSOM}}(s, n_1)), \ldots, \exp(-d_{\text{RecSOM}}(s, n_N))\bigr)$$

constitutes the context. Note that this vector is almost the vector of distances of all neurons computed in the previous time step; these are exponentially transformed to avoid an explosion of the values. As before, the above distance can be decomposed into two parts: the winner computation, similar to the standard SOM, and, as in the case of RSOM and TKM, a term which assesses the context match. For RecSOM, the context match is a comparison of the current context when processing the sequence, i.e. the vector of distances of the previous time step, with the expected context c_j which is stored at neuron j. That is to say, RecSOM explicitly stores context vectors for each neuron and compares these stored vectors to the current context during the recursive computation. Since the entire map activation is taken into account, sequences of any given fixed length can be stored if enough neurons are provided. Thus, the representation space for context is no longer restricted by the weight space, and its capacity now scales with the number of neurons.

For RecSOM, training is done in Hebbian style for both weights and contexts. Denote by n_{j_0} the winner for sequence entry s_i; then the weight changes are

$$\Delta w_j = \epsilon \cdot h_\sigma(\text{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$

and the context adaptation is

$$\Delta c_j = \epsilon' \cdot h_\sigma(\text{nhd}(n_{j_0}, n_j)) \cdot (C_{\text{RecSOM}}(s_{i+1}, \ldots, s_t) - c_j).$$

The latter update rule makes sure that the context vectors of the winner neuron and its neighborhood become more similar to the current context vector C_RecSOM, which is computed when the sequence is processed. The learning rates are ε, ε′ ∈ (0,1). As demonstrated in [41], this richer representation of context allows a better quantization of time series data. In [41], various quantitative measures for evaluating trained recursive maps are proposed, such as the temporal quantization error and the specialization of neurons. With respect to these measures, RecSOM turns out to be clearly superior to TKM and RSOM in the experiments provided in [41].
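In code, one recursive RecSOM step consists of a distance computation against both weight and context vectors, followed by Hebbian adaptation of both. The sketch below is a minimal illustration under the definitions above (eta1, eta2, eps, eps_c, sigma are the model's hyperparameters), not the reference implementation:

```python
import numpy as np

def recsom_step(W, C, d_prev, s, eta1, eta2):
    """Recursive RecSOM distance: d_j = eta1*||s - w_j||^2
    + eta2*||context - c_j||^2, where the context is the exponentially
    transformed distance vector of the previous time step."""
    context = np.exp(-d_prev)                     # C_RecSOM, in R^N
    d = (eta1 * np.sum((W - s) ** 2, axis=1)
         + eta2 * np.sum((C - context) ** 2, axis=1))
    return d, context

def recsom_update(W, C, coords, j0, s, context, eps, eps_c, sigma):
    """Hebbian adaptation of weights and contexts around winner j0."""
    h = np.exp(-np.sum((coords - coords[j0]) ** 2, axis=1) / sigma ** 2)
    W += eps * h[:, None] * (s - W)
    C += eps_c * h[:, None] * (context - C)
```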



However, the dimensionality of the context for RecSOM equals the number of neurons N, making this approach computationally quite costly. The training of very large maps with several thousand neurons is no longer feasible for RecSOM. Another drawback is the exponential activity transfer function in the term C_RecSOM ∈ R^N: specialized neurons are characterized by having only one or a few well-matching predecessors contributing values of about 1 to C_RecSOM; for a large number N of neurons, however, the noise influence on C_RecSOM from other neurons destroys the valid context information, because even poorly matching neurons – contributing values slightly above 0 – are summed up in the distance computation.

The SOM for structured data (SOMSD) as proposed in [10,11] is an efficient and still powerful alternative. SOMSD represents the temporal context by the corresponding winner index of the previous time step. Assume that a regular l-dimensional lattice of neurons is given. Each neuron n_j is equipped with a weight w_j ∈ R^n and a value c_j ∈ R^l which represents a compressed version of the context, namely the location of the previous winner within the map [10]. The space in which context vectors are represented is thus the vector space R^l. The distance of a sequence s = (s_1, ..., s_t) from neuron n_j is recursively computed by

$$d_{\text{SOMSD}}((s_1, \ldots, s_t), n_j) = \eta_1 \|s_1 - w_j\|^2 + \eta_2 \|C_{\text{SOMSD}}(s_2, \ldots, s_t) - c_j\|^2,$$

where C_SOMSD(s) equals the location in the grid topology of the neuron n_j with smallest d_SOMSD(s, n_j). Note that the context C_SOMSD is an element of a low-dimensional vector space, usually only R². The distance between contexts is given by the Euclidean metric within this vector space. The learning dynamic of SOMSD is very similar to that of RecSOM: the current distance is defined as a mixture of two terms, the match of the neuron's weight with the current sequence entry, and the match of the neuron's context weight with the context currently computed in the model. Thereby, the current context is represented by the location of the winning neuron of the map in the previous time step. This dynamic imposes a temporal bias towards those neurons whose context vector matches the winner location of the previous time step. It relies on the fact that a lattice structure of neurons is defined and that a distance measure for locations within the map is available.

Due to the compressed context information, this approach is very efficient in comparison to RecSOM, and very large maps can also be trained. In addition, noise is suppressed in this compact representation. However, more complex context information is used than for TKM and RSOM, namely the location of the previous winner in the map. As for RecSOM, Hebbian learning takes place for SOMSD: weight vectors and contexts are adapted in the well-known correction manner, here by the formulas

$$\Delta w_j = \epsilon \cdot h_\sigma(\text{nhd}(n_{j_0}, n_j)) \cdot (s_i - w_j)$$

and

$$\Delta c_j = \epsilon' \cdot h_\sigma(\text{nhd}(n_{j_0}, n_j)) \cdot (C_{\text{SOMSD}}(s_{i+1}, \ldots, s_t) - c_j)$$

with learning rates ε, ε′ ∈ (0,1), where n_{j_0} denotes the winner for sequence entry s_i. As demonstrated in [11], a generalization of this approach to tree structures can reliably model structured objects and their respective topological ordering.
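To illustrate how cheap the compressed context makes one processing step, here is a minimal sketch for a rectangular lattice (our illustration; the grid location of the previous winner plays the role of C_SOMSD):

```python
import numpy as np

def somsd_step(W, C, coords, prev_loc, s, eta1, eta2):
    """One recursive SOMSD distance computation.

    W: (N, n) weights, C: (N, 2) context vectors,
    coords: (N, 2) grid locations, prev_loc: previous winner location.
    Returns the winner index and its location (the next context).
    """
    d = (eta1 * np.sum((W - s) ** 2, axis=1)
         + eta2 * np.sum((C - prev_loc) ** 2, axis=1))
    j0 = int(np.argmin(d))
    return j0, coords[j0]

def process_sequence(W, C, coords, seq, eta1, eta2):
    """Drive the map over a whole sequence, oldest entry first."""
    loc = coords.mean(axis=0)  # neutral initial context
    for s in seq:
        j0, loc = somsd_step(W, C, coords, loc, s, eta1, eta2)
    return j0  # winner for the most recent entry
```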

We would like to point out that, although these approaches seem different, they constitute instances of the same recursive computation scheme. As proved in [14], the underlying recursive update dynamics comply in all cases with

$$d((s_1, \ldots, s_t), n_j) = \eta_1 \|s_1 - w_j\|^2 + \eta_2 \|C(s_2, \ldots, s_t) - c_j\|^2,$$

where the specific similarity measures for weights and contexts are denoted by the generic ‖·‖ expression. The approaches differ with respect to the concrete choice of the context C: TKM and RSOM refer only to the neuron itself and are therefore restricted to local fractal codes within the weight space; RecSOM uses the whole map activation, which is powerful but also expensive and subject to random neuron activations; SOMSD relies on compressed information, the location of the winner. Note that standard supervised recurrent networks can also be put into this generic dynamic framework by choosing the context as the output of the sigmoidal transfer function [14]. In addition, alternative compression schemes, such as a representation of the context by the winner content, are possible [37].

To summarize this section, essentially four different models have been proposed for processing temporal information. The models are characterized by the way in which context is taken into account within the map:

Standard SOM: no context representation; standard distance computation; standard competitive learning.

TKM and RSOM: no explicit context representation; the distance computation recursively refers to the distance of the previous time step; competitive learning for the weights, whereby (for RSOM) the averaged signal is used.

RecSOM: explicit context representation as the N-dimensional activity profile of the previous time step; the distance is a mixture of the current match and the match between the context stored at the neuron and the (recursively computed) current context of the processed time series; competitive learning adapts the weight and context vectors.

SOMSD: explicit context representation as a low-dimensional vector, the location of the previously winning neuron in the map; the distance is computed recursively in the same way as for RecSOM, whereby a distance measure for locations in the map has to be provided; so far, the model is only available for standard rectangular Euclidean lattices; competitive learning adapts the weight and context vectors, whereby the context vectors are embedded in Euclidean space.



In the following, we focus on the context representation by the winner index, as proposed for SOMSD. This scheme offers a compact and efficient context representation. However, it relies heavily on the neighborhood structure of the neurons, and faithful topological ordering is essential for appropriate processing. Since for sequential data, like words in Σ*, the number of possible strings is an exponential function of their length, a Euclidean target grid with inherent power-law neighborhood growth is not suited for a topology-preserving representation. The reason is that the storage of temporal data is related to the representation of trajectories on the neural grid. String processing means beginning at a node that represents the start symbol; then, how many nodes n_s can, in the ideal case, uniquely be reached in a fixed number s of steps? In grids with 6 neighbors per neuron, the triangular tessellation of the Euclidean plane leads to a hexagonal superstructure, inducing the surprising answer of n_s = 6 for any choice of s > 0. Providing 7 neighbors per neuron yields the exponential branching n_s = 7 · 2^{s−1} of paths.

In this respect, it is interesting to note that RecSOM can also be combined with alternative lattice structures; in [41], a comparison is presented of RecSOM with a standard rectangular topology and with a data-optimum topology provided by neural gas (NG) [27,28]. The latter clearly leads to superior results. Unfortunately, it is not possible to combine the optimum topology of NG with SOMSD: for NG, no grid with straightforward neuron indexing exists. Therefore, context cannot easily be defined by referring back to the previous winner, because no similarity measure is available for indices of neurons within a grid topology.

Here, we extend SOMSD to grid structures with triangular grid connectivity in order to obtain greater flexibility in the lattice design. Apart from the standard Euclidean plane, the sphere and the hyperbolic plane are alternative popular two-dimensional manifolds. They differ from the Euclidean plane with respect to their curvature: the Euclidean plane is flat, whereas hyperbolic space has negative curvature and the sphere is curved positively. By computing the Euler characteristics of all compact connected surfaces, it can be shown that only seven have nonnegative curvature, implying that all but seven are locally isometric to the hyperbolic plane, which makes the study of hyperbolic spaces particularly interesting. (For an excellent toolbox and introduction to hyperbolic geometry see e.g. http://www.geom.uiuc.edu/docs/forum/hype/hype.html.)

The curvature has consequences for regular tessellations of the respective manifolds, as pointed out in [30]: the number of neighbors of a grid point in a regular tessellation of the Euclidean plane follows a power law, whereas the hyperbolic plane allows an exponential increase of the number of neighbors. The sphere yields compact lattices with vanishing neighborhoods, whereby a regular tessellation in which all vertices have the same number of neighbors is impossible (with the uninteresting exception of an approximation by one of the 5 Platonic solids). Since all these surfaces constitute two-dimensional manifolds, they can be approximated locally, within a cell of the tessellation, by a subset of the standard Euclidean plane without


too much contortion. A global isometric embedding, however, is not possible in general. Interestingly, for all such tessellations a data similarity measure is defined, and a possibly non-isometric visualization in the 2D plane can be achieved. While 6 neighbors per neuron lead to standard Euclidean triangular meshes, for a grid with 7 or more neighbors the graph becomes part of the two-dimensional hyperbolic plane. As already mentioned, exponential neighborhood growth is possible, and hence an adequate data representation can be expected for the visualization of domains with a high connectivity of the involved objects. The SOM with hyperbolic neighborhood (HSOM) has already proved well suited for text representation, as demonstrated for a non-recursive model in [29].

3 SOM for sequences (SOM-S)

In the following, we introduce the adaptation of SOMSD to sequences and to a general triangular grid structure: the SOM for sequences (SOM-S). Standard SOMs operate on a rectangular neuron grid embedded in a real-valued vector space. More flexibility in the topological setup can be obtained by describing the grid in terms of a graph: neural connections are realized by assigning each neuron a set of direct neighbors. The distance between two neurons is given by the length of a shortest path within the lattice of neurons, each edge being assigned unit length 1. The number of neighbors may vary (also within a single map). Fewer than 6 neighbors per neuron lead to a subsiding neighborhood, resulting in graphs with small numbers of nodes. Choosing more than 6 neighbors per neuron yields, as argued above, an exponential increase of the neighborhood size, which is convenient for representing sequences with potentially exponential context diversification.

Unlike the standard SOM or HSOM, we do not assume that a distance-preserving embedding of the lattice into the two-dimensional plane, or into another globally parameterized two-dimensional manifold with global metric structure such as the hyperbolic plane, exists. Rather, we assume that the distance of neurons within the grid is computed directly on the neighborhood graph, which might be obtained by any non-overlapping triangulation of the topological two-dimensional plane (since the lattice is fixed during training, these distances have to be computed only once). For our experiments, we have implemented a grid generator for a circular triangle meshing around a center neuron, which requires the desired number of neurons and the neighborhood degree n as parameters. Neurons at the lattice edge possess fewer than n neighbors, and if the chosen total number of neurons does not fill up the outer neuron circle, the neurons there are connected to the others in a maximally symmetric way. Figure 1 shows a small map with 7 neighbors for the inner neurons and a total of 29 neurons perfectly filling up the outer edge. For ≥ 7 neighbors, the exponential neighborhood increase can be observed, for which an embedding into the Euclidean plane is not possible without contortions; however, local projections in terms of a fish-eye magnification focus can be obtained (cf. [29]).



Fig. 1. Hyperbolic self-organizing map with context. Neuron n refers to the context given by the winner location in the map, indicated by the triangle of neurons N_1, N_2, and N_3 and the precise coordinates β_12, β_13. If the previous winner has been D_2, adaptation of the context takes place along the dotted line.

SOMSD adapts the location of the expected previous winner during training. For this purpose, we have to embed the triangular mesh structure into a continuous space. We achieve this by computing the lattice distances beforehand, and by approximating the distance of points within a triangle-shaped map patch by the standard Euclidean distance. Thus, positions in the lattice are represented by three neuron indices, which select a triangle of adjacent neurons, and two real numbers, which specify the position within the triangle. The recursive nature of the map is illustrated in Figure 1 for neuron n. This neuron is equipped with a weight w ∈ R^n and a context c that is given by a location within the triangle of neurons N_1, N_2, and N_3, expressing corner affinities by means of the linear combination parameters β_12 and β_13. The distance of a sequence s from neuron n is recursively computed by

$$d_{\text{SOM-S}}((s_1, \ldots, s_t), n) = \eta\,\|s_1 - w\|^2 + (1-\eta)\, g(C_{\text{SOM-S}}(s_2, \ldots, s_t), c),$$

where C_SOM-S(s) is the index of the neuron n_j in the grid with smallest distance d_SOM-S(s, n_j), and g measures the grid distance of the triangular position c_j = (N_1, N_2, N_3, β_12, β_13) to the winner as the shortest possible path in the mesh structure. Grid distances between neighboring neurons possess unit length, and the metric structure within the triangle N_1, N_2, N_3 is approximated by the Euclidean metric. The range of g is normalized by scaling with the inverse maximum grid distance. This mixture of hyperbolic grid distance and Euclidean distance is valid because hyperbolic space can locally be approximated by Euclidean space; for computational convenience, this approximation is applied to both the distance calculation and the update.



Training is carried out by presenting a pattern s = (s_1, ..., s_t), determining the winner n_{j_0}, and updating the weight and the context. Adaptation affects all neurons on the breadth-first search graph around the winning neuron, according to their grid distances, in a Hebbian style. Hence, for the sequence entry s_i, weight w_j is updated by Δw_j = ε · h_σ(nhd(n_{j_0}, n_j)) · (s_i − w_j). The learning rate ε is typically exponentially decreased during training; as above, h_σ(nhd(n_{j_0}, n_j)) describes the influence of the winner n_{j_0} on the current neuron n_j as a decreasing function of grid distance. The context update is analogous: the current context, expressed in terms of neuron triangle corners and coordinates, is moved towards the previous winner along a shortest path. This adaptation yields positions on the grid only; intermediate positions can be achieved by interpolation: if two neurons N_i and N_j exist in the triangle with the same distance, the midway is taken for the flat grids obtained by our grid generator. This explains why the update path for the current context towards D_2, depicted as the dotted line in Figure 1, is via D_1. Since the grid distances are stored in a static matrix, a fast calculation of shortest path lengths is possible. The parameter η in the recursive distance calculation controls the balance between pattern and context influence; since initially nothing is known about the temporal structure, this parameter starts at 1, indicating the absence of context and resulting in the standard SOM. During training, it is decreased to an application-dependent value that mediates the balance between the externally presented pattern and the internally gained model of historic contexts.

Thus, we can combine the flexibility of general triangular, and possibly hyperbolic, lattice structures with the efficient context representation proposed in [11].

4 Evaluation measures of SOM

Popular methods for evaluating the standard SOM are visual inspection, the identification of meaningful clusters, the quantization error, and measures for the topological ordering of the map. For recursive self-organizing maps, an additional dimension arises: the temporal dynamic stored in the context representations of the map.

4.1 Temporal quantization error

Using ideas of Voegtlin [41], we introduce a method to assess the implicit representation of temporal dependencies in the map, and to evaluate to what extent a faithful representation of the temporal data takes place. The general quantization error refers to the distortion of each map unit with respect to its receptive field, which measures the extent of data space coverage by the units. If temporal data are considered, the distortion needs to be assessed back in time. For a formal definition, assume that a time series (s_1, s_2, ..., s_t, ...) is presented to the network, again in reverse indexing notation, i.e. s_1 is the most recent entry of the time series. Let win_i denote the set of all time steps for which neuron i becomes the winner in the considered recursive map model. The mean activation of neuron i for time step t in the past is the value

$$A_i(t) = \sum_{j \in \text{win}_i} s_{j+t} \,/\, |\text{win}_i|.$$

Assume that neuron i becomes the winner for a sequence entry s_j. It can then be expected that s_j is, as for the standard SOM, close to the average A_i(0), because the map is trained with Hebbian learning. Temporal specialization takes place if, in addition, s_{j+t} is close to the average A_i(t) for t > 0. The temporal quantization error of neuron i at time step t back in the past is defined by

$$E_i(t) = \Bigl( \sum_{j \in \text{win}_i} \|s_{j+t} - A_i(t)\|^2 \Bigr)^{1/2}.$$

This measures the extent to which the values observed t time steps before a neuron wins coincide for that neuron. Temporal specialization of neuron i takes place if E_i(t) is small for t > 0. Since no temporal context is learned for the standard SOM, its temporal quantization error will be large for t > 0, merely reflecting specifics of the underlying time series such as smoothness or periodicity. For recursive models, this quantity allows us to assess the amount of temporal specialization. The temporal quantization error of the entire map for t time steps back into the past is defined as the average

$$E(t) = \sum_{i=1}^{N} E_i(t) / N.$$

This method allows us to evaluate whether the temporal dynamic in the recent past is faithfully represented.
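With winner indices recorded per time step, A_i(t), E_i(t), and E(t) can be computed directly. The sketch below is our NumPy rendering of the definitions above; note that with a forward-stored series, s_{j+t} under the paper's reverse indexing is the entry t steps before the winning step:

```python
import numpy as np

def temporal_quantization_error(x, winners, N, T):
    """E(t) for t = 0..T-1, following the definitions above.

    x:       (L, n) time series, stored in forward order
    winners: (L,) index of the winning neuron at each time step
    N:       number of neurons; T: how far back to evaluate
    """
    E = np.zeros(T)
    for t in range(T):
        total = 0.0
        for i in range(N):
            wins = np.where(winners == i)[0]
            wins = wins[wins >= t]          # need t steps of history
            if len(wins) == 0:
                continue                    # empty receptive field
            past = x[wins - t]              # entries t steps in the past
            A_it = past.mean(axis=0)        # mean activation A_i(t)
            total += np.sqrt(np.sum((past - A_it) ** 2))   # E_i(t)
        E[t] = total / N                    # average over all neurons
    return E
```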

4.2 Temporal models

After the training of a recursive map, it can be used to obtain an explicit, possibly approximative description of the underlying global temporal dynamics. This offers another possibility to evaluate the dynamics of the SOM, because we can compare the extracted temporal model to the original one, if available, or to a temporal model extracted directly from the data. In addition, a compressed description of the global dynamics extracted from a trained SOM is interesting for data mining tasks. In particular, it can be tested whether the clustering properties of the SOM, referred to by U-Matrix methods, transfer to the temporal domain.



Markov models constitute simple, though powerful, techniques for sequence processing and analysis [6,32]. Assume that Σ = {a_1, ..., a_d} is a finite alphabet. Next-symbol prediction refers to the task of anticipating the probability of a_i after having observed a sequence s = (s_1, ..., s_t) ∈ Σ* before; this is just the conditional probability P(a_i | s). For finite Markov models, a finite memory length l is sufficient to determine this probability, i.e. the probability

$$P(a_i \mid (s_1, \ldots, s_l, \ldots, s_t)) = P(a_i \mid (s_1, \ldots, s_l)), \qquad t \ge l,$$

depends only on the past l symbols instead of the whole context (s_1, ..., s_t). Markov models can be estimated from given data if the order l is fixed. It holds that

$$P(a_i \mid (s_1, \ldots, s_l)) = \frac{P((a_i, s_1, \ldots, s_l))}{\sum_j P((a_j, s_1, \ldots, s_l))} \qquad (1)$$

which means that the next-symbol probability can be estimated from the frequencies of (l+1)-grams.
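Equation (1) translates directly into counting: slide a window of length l+1 over the training sequence and normalize the counts per context. Here is a small sketch of this estimator (our illustration; contexts are stored most-recent-first to match the reverse indexing):

```python
from collections import Counter

def markov_from_ngrams(seq, l):
    """Estimate P(a | (s_1, ..., s_l)) from (l+1)-gram frequencies,
    cf. equation (1); seq is a list of symbols in forward order."""
    counts = Counter()
    for k in range(l, len(seq)):
        context = tuple(reversed(seq[k - l:k]))   # (s_1, ..., s_l)
        counts[(seq[k], context)] += 1
    probs = {}
    for (a, ctx), c in counts.items():
        total = sum(v for (b, c2), v in counts.items() if c2 == ctx)
        probs[(a, ctx)] = c / total
    return probs

# usage: a first-order model estimated from a short binary sequence
print(markov_from_ngrams([0, 1, 0, 1, 1, 0, 1, 0, 0, 1], l=1))
```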

We are interested in the question of whether a trained SOM-S can capture the essential next-symbol probabilities generated by simple Markov models. For this purpose, we train maps on Markov models and afterwards extract the transition probabilities entirely from the obtained maps. This extraction is possible because of the specific form of context in SOM-S. Given a finite alphabet Σ = {a_1, ..., a_d} for training, most neurons specialize during training and become winners for at least one or a few stimuli. Winner neurons represent the input sequence entries by their trained weight vectors. Usually, the weight w_i of neuron n_i is very close to a symbol a_j of Σ and can thus be identified with that symbol. In addition, the neurons represent their context by an explicit reference to the location of the winner in the previous time step. The context vectors stored in the neurons define an intermediate winning position in the map, encoded by the parameters (N_1, N_2, N_3, β_12, β_13) for the closest three neurons and the exact position within the triangle. We take this into account by extracting sequences corresponding to the averaged weights of all three potential winners of the previous time step. For the averaging, the contribution of each neuron to the interpolated position is considered. Repeating this back-referencing procedure recursively for each winner, weighted by its influence, yields an exponentially spreading number of potentially infinite time series for each neuron. This way, we obtain a probability distribution over time series that is representative of the history of each map neuron. 5

5 Interestingly, one can formally prove that every finite-length Markov model can in principle be approximated by some map in this way, i.e. for every Markov model of length l a map exists such that the above extraction procedure yields the original model up to small deviations. Assume a fixed length l and rational P(a_i | (s_1, ..., s_l)), and denote by q the smallest common denominator of the transition probabilities. Consider a map in which for each symbol a_i a cluster of neurons with weights w_j = a_i exists. These main clusters are divided into subclusters enumerated by s = (s_1, ..., s_l) ∈ Σ^l, with q · P(a_i | s) neurons for each possible s. The context of each such neuron refers to another neuron within a cluster belonging to s_1 and to a subcluster belonging to (s_2, ..., s_l, s_{l+1}) for some arbitrary s_{l+1}. Note that the clusters can thereby be chosen contiguous on the map, respecting the topological ordering of the neurons. The extraction mechanism leads to the original Markov model (with rational probabilities) based on this map.



The number of specialized neurons for each time series is correlated with the probability of these stimuli in the original data source. Therefore, we can simply take the mean of the probabilities over all neurons and obtain a global distribution over all histories represented in the map. Since the standard SOM has a magnification factor different from 1, the number of neurons which represent a symbol a_i deviates from the probability of a_i in the given data [31]. This leads to a slightly biased estimation of the sequence probabilities represented by the map. Nevertheless, we will use the above extraction procedure as a sufficiently close approximation to the true underlying distribution. This compromise is taken because the magnification factor for recurrent SOMs is not known, and the techniques of [31] for its computation cannot be transferred to recurrent models. Our experiments confirm that the global trend is still correct. We have extracted, for every finite memory length l, the probability distribution for words in Σ^{l+1} as they are represented in the map, and determined the transition probabilities of equation (1).

The method described above is a valuable tool for evaluating the representation capacity of the SOM for temporal structures. Obviously, fixed-order Markov models can be extracted better directly from the given data, avoiding problems such as the magnification factor of the SOM. Hence, this method merely serves as an alternative means of evaluating temporal self-organizing maps and their capability of representing temporal dynamics. The situation is different if real-valued elements are processed, as in the case of obtaining symbolic structure from noisy sequences. Then, a reasonable quantization of the sequence entries must be found before a Markov model can be extracted from the data. The standard SOM together with U-Matrix methods provides a valuable tool for finding meaningful clusters in a given set of continuous data. It is an interesting question whether this property transfers to the temporal domain, i.e. whether meaningful clusters of real-valued sequence entries can also be extracted from a trained recursive model. SOM-S allows us to combine both a reliable quantization of the sequence entries and the extraction mechanism for Markov models that takes into account the temporal structure of the data.

For the extraction we extend U-Matrix methods to recursive models as follows [38]: the standard U-Matrix assigns to each neuron the averaged distance of its weight vector to those of its direct lattice neighbors:

$$U(n_i) = \sum_{\mathrm{nhd}(n_i, n_j) = 1} \| w_i - w_j \|$$
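For a rectangular lattice, the U-value of each neuron can be computed as in the following sketch; the grid layout and array shapes are illustrative assumptions:

```python
import numpy as np

def u_matrix(weights):
    """U-value per neuron: summed weight-vector distance to the
    4-neighborhood on a rectangular lattice; weights: (rows, cols, dim)."""
    rows, cols, _ = weights.shape
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    U[r, c] += np.linalg.norm(weights[r, c] - weights[rr, cc])
    return U
```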



In a trained map, neurons spread in regions of the data space where a high sample density can be observed, resulting in large U-values at the borders between clusters. Consequently, the U-Matrix forms a 3D landscape on the lattice of neurons, with valleys corresponding to meaningful clusters and hills at the cluster borders. The U-Matrix of weight vectors can also be constructed for SOM-S. Based on this matrix, the sequence entries can be clustered into meaningful categories, based on which the extraction of Markov models as described above is possible. Note that the U-Matrix is built using only the weights assigned to the neurons, while the context information of SOM-S is ignored as yet.⁶ However, since context information is used for training, clusters emerge which are meaningful with respect to the temporal structure, and in this way they contribute implicitly to the topological ordering of the map and to the U-Matrix. Partially overlapping, noisy, and ambiguous input elements are separated during training, because the different temporal contexts contain enough information to activate and produce characteristic clusters on the map. Thus, the temporal structure captured by the training allows a reliable reconstruction of the input sequences, which could not have been achieved by the standard SOM architecture.

5 Experiments

5.1 Mackey-Glass time series

The first task is to learn the dynamic of the real-valued chaotic Mackey-Glass time series

$$\frac{dx}{d\tau} = b\,x(\tau) + \frac{a\,x(\tau - d)}{1 + x(\tau - d)^{10}}$$

using a = 0.2, b = −0.1, d = 17. This is the same setup as given in [41], making a comparison of the results possible.⁷ Three types of maps with 100 neurons have been trained: a 6-neighbor map without context, giving the standard SOM; a map with 6 neighbors and with context (SOM-S); and a 7-neighbor map providing a hyperbolic grid with context utilization (H-SOM-S). Each run has been computed with 1.5 · 10^5 presentations starting at random positions within the Mackey-Glass series, using a sample period of ∆t = 3; the neuron weights have been initialized white within [0.6, 1.4]. The context has been taken into account by decreasing the parameter from η = 1 to η = 0.97. The learning rate is exponentially decreased from 0.1 to 0.005 for weight and context updates. The initial neighborhood cooperativity is 10, which is annealed to 1 during training.
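For reproducibility, here is a coarse sketch of generating the series by Euler integration; the initial history, step size, and function name are our choices, and a smaller step yields a more faithful trajectory:

```python
import numpy as np

def mackey_glass(n, a=0.2, b=-0.1, d=17, dt=1.0, x0=1.2):
    """Euler integration of dx/dtau = b*x + a*x(tau-d)/(1 + x(tau-d)**10)."""
    hist = int(d / dt)
    x = np.full(n + hist, x0)      # constant initial history (assumption)
    for t in range(hist, n + hist - 1):
        delayed = x[t - hist]
        x[t + 1] = x[t] + dt * (b * x[t] + a * delayed / (1 + delayed ** 10))
    return x[hist:]

series = mackey_glass(5000)[::3]   # sample period of 3 as in the setup above
```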

Figure 2 shows the temporal quantization error for the above setups: the temporal quantization error is expressed by the average standard deviation of the given sequence and the mean unit receptive field for 29 time steps into the past. Similar to Voegtlin's results, we observe large cyclic oscillations for the standard SOM, driven by the periodicity of the training series. Since the SOM does not take contextual information into account, this quantization result can be seen as an upper bound for temporal models, at least for the indices > 0 reaching into the past (trivially, the SOM is a very good quantizer of scalar elements without history); the oscillating shape of the curve is explained by the continuity of the series and its quasi-periodic dynamic, and the extrema exist by the nature of the series rather than by special model properties. Obviously, the very restricted context of RSOM does not yield a long-term improvement of the temporal quantization error; however, the displayed error periodicity is anti-cyclic compared to the original series. Interestingly, the data-optimal topology of neural gas (NG), which also does not take contextual information into account, allows a reduction of the overall quantization error; however, the main characteristics, such as the periodicity, remain the same as for the standard SOM. RecSOM leads to a much better quantization error than RSOM and also NG. Thereby, the error is minimal for the immediate past (left side of the diagram) and increases going back in time, which is reasonable because of the weighting of the context influence by (1 − η). The increase of the quantization error is smooth, and the final value after 29 time steps is better than the default given by the standard SOM. In addition, almost no periodicity can be observed for RecSOM. SOM-S and H-SOM-S further improve the results: only some periodicity can be observed, and the overall quantization error increases smoothly for the past values. Note that these models are superior to RecSOM in this task while requiring less computational power. H-SOM-S allows a slightly better representation of the immediate past compared to SOM-S, due to the hyperbolic topology of the lattice structure that better matches the characteristics of the input data.

⁶ Preliminary experiments indicate that the context also orders topologically and yields meaningful clusters. The number of neurons in context clusters is, however, small compared to the total number of neurons, and statistically significant results could not be obtained.

⁷ We would like to thank T. Voegtlin for providing data for comparison.
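As a hedged sketch of our reading of this error measure: for every lag k, each unit's receptive field entry is the mean of the inputs k steps before that unit wins, and the reported error averages the standard deviations around these means over all units. Names and the exact weighting are assumptions:

```python
import numpy as np

def temporal_quantization_error(series, winners, n_units, depth=29):
    """Per-lag deviation between the sequence and the mean unit
    receptive fields, averaged over all units (our interpretation)."""
    series, winners = np.asarray(series), np.asarray(winners)
    err = np.zeros(depth + 1)
    for k in range(depth + 1):
        devs = []
        for u in range(n_units):
            t = np.nonzero(winners == u)[0]
            t = t[t >= k]                    # need k steps of history
            if len(t) > 0:
                devs.append(np.std(series[t - k]))
        err[k] = np.mean(devs) if devs else 0.0
    return err
```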

[Figure 2 about here: quantization error (0 to 0.2) over the index of past inputs (0 to 30; index 0: present) for SOM, RSOM, NG, RecSOM, SOM-S, and H-SOM-S.]

Fig. 2. Temporal quantization errors of different model setups for the Mackey-Glass series. Results indicated by ∗ are taken from [41].



5.2 Binary automata

The second experiment is also inspired by Voegtlin. A discrete 0/1 sequence generated by a binary automaton with P(0|1) = 0.4 and P(1|0) = 0.3 shall be learned (a generator sketch is given below). For discrete data, the specialization of a neuron can be defined as the longest sequence that still leads to an unambiguous winner selection. A high percentage of specialized neurons indicates that temporal context has been learned by the map. In addition, one can compare the distribution of specializations with the original distribution of strings as generated by the underlying probability. Figure 3 shows the specialization of a trained H-SOM-S. Training has been carried out with 3 · 10^6 presentations, increasing the context influence (1 − η) exponentially from 0 to 0.06. The remaining parameters have been chosen as in the first experiment. Finally, the receptive fields have been computed by providing an additional number of 10^6 test iterations. Putting more emphasis on the context results in a smaller number of active neurons representing rather long strings that cover only a small part of the total input space. If a Euclidean lattice is used instead of a hyperbolic neighborhood, the resulting quantizers differ only slightly, which indicates that the representation of binary symbols and their contexts in the 2-dimensional output space barely benefits from exponential branching. In the depicted run, 64 of the neurons express a clear profile, whereas the other neurons are located at sparse locations of the input data topology, between cluster boundaries, and thus do not win for the presented stimuli. The distribution corresponds nicely to the 100 most characteristic sequences of the probabilistic automaton, as indicated by the graph. Unlike for RecSOM (presented in [41]), neurons at interior nodes of the tree are also expressed for H-SOM-S. These nodes refer to transient states, which are represented by corresponding winners in the network. RecSOM, in contrast to SOM-S, does not rely on the winner index only, but uses a more complex representation: since the transient states are spared, longer sequences can be expressed by RecSOM.
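A minimal sketch of the generating automaton (function name and seeding are hypothetical):

```python
import random

def binary_sequence(n, p01=0.4, p10=0.3, seed=0):
    """0/1 sequence with P(0|1) = p01 and P(1|0) = p10."""
    rng = random.Random(seed)
    s, out = 0, []
    for _ in range(n):
        out.append(s)
        if s == 0:
            s = 1 if rng.random() < p10 else 0
        else:
            s = 0 if rng.random() < p01 else 1
    return out
```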

[Figure 3 about here: binary tree (depth up to 11) of the 100 most likely sequences, compared with the H-SOM-S of 100 neurons and its 64 specialized neurons.]

Fig. 3. Receptive fields of an H-SOM-S compared to the most probable sub-sequences of the binary automaton. Left-hand branches denote 0, right is 1.


Type           P(0)          P(1)          P(0|0)  P(1|0)  P(0|1)  P(1|1)
Automaton 1    4/7 ≈ 0.571   3/7 ≈ 0.429   0.7     0.3     0.4     0.6
Map (98/100)   0.571         0.429         0.732   0.268   0.366   0.634
Automaton 2    2/7 ≈ 0.286   5/7 ≈ 0.714   0.8     0.2     0.08    0.92
Map (138/141)  0.297         0.703         0.75    0.25    0.12    0.88
Automaton 3    0.5           0.5           0.5     0.5     0.5     0.5
Map (138/141)  0.507         0.493         0.508   0.492   0.529   0.471

Table 1
Results for binary automata extraction with different transition probabilities. The extracted probabilities clearly follow the original ones.

In addition to the examination of neuron specialization, the whole map representation can be characterized by comparing the input symbol transition statistics with the learned context-neuron relations. While the current symbol is coded by the winning neuron's weight, the previous symbol is represented by the average of the weights of the winner's context triangle neurons. The two obtained values, the neuron's state and the average state of the neuron's context, are clearly expressed in the trained map: only few neurons contain values in the indeterminate interval [1/3, 2/3], and most neurons specialize on values very close to 0 or 1. Results for the reconstruction of three automata can be found in table 1. For the reconstruction we have used the algorithm described in section 4.2 with memory length 1. The left column indicates the number of expressed neurons and the total number of neurons in the map. Note that the automata can be well reobtained from the trained maps. Again, the temporal dependencies are clearly captured by the maps.

5.3 Reber grammar

In a third experiment we have used more structured symbolic sequences as generated by the Reber grammar illustrated in figure 4. The 7 symbols have been coded in a 6-dimensional Euclidean space by points that relate to one another as the four corners of a tetrahedron do in three dimensions: all points have the same distance from each other.

Fig. 4. State graph of the Reber grammar.

For training and testing we have taken the concatenation of randomly generated words, thus preparing sequences of 3 · 10^6 and 10^6 input vectors, respectively.
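One standard way to obtain such an equidistant embedding is a regular simplex; the following sketch is our construction for illustration, not necessarily the authors' original one:

```python
import numpy as np

def simplex_embedding(k):
    """k points in R^(k-1) with equal pairwise distances: center the
    standard basis of R^k (all mutual distances sqrt(2)) and express it
    in an orthonormal basis of the (k-1)-dimensional subspace it spans."""
    E = np.eye(k) - 1.0 / k        # centered basis vectors, rank k-1
    _, _, Vt = np.linalg.svd(E)    # leading rows of Vt span that subspace
    return E @ Vt[:k - 1].T        # coordinates in R^(k-1)

pts = simplex_embedding(7)         # 7 symbols in 6 dimensions
d01 = np.linalg.norm(pts[0] - pts[1])   # every pairwise distance is sqrt(2)
```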

The map has a map radius of 5 and contains 617 neurons on a hyperbolic grid. For the initialization and the training, the same parameters as in the previous experiment were used, except for an initially larger neighborhood range of 14, corresponding to the larger map. Context influence was taken into account by decreasing η from 1 to 0.8 during training. A number of 338 neurons developed a specialization for Reber strings, with an average length of 7.23 characters. Figure 5 shows that the neuron specializations produce strict clusters on the circular grid, ordered in a topological way by the last character. In agreement with the grammar, the letter T takes the largest sector on the map. The underlying hyperbolic lattice gives rise to sectors, because they clearly minimize the boundaries between the 7 classes. The symbol separation is further emphasized by the existence of idle neurons between the boundaries, which can be seen as analogous to large values in a U-Matrix. Since neuron specialization takes place from the most common states (which are the 7 root symbols) to the increasingly special cases, the central nodes have fallen idle after having served as signposts during training; finally, the most specialized nodes with their associated strings are found at the lattice edge on the outer ring. Much in contrast to the hyperbolic target lattice ordered in this way, the result for the Euclidean grid in figure 7 shows a neuron arrangement in the form of polymorphic coherent patches.

Similar to the binary automata learning tasks, we have analyzed the map representation by reconstructing the trained data, backtracking all possible context sequences of each neuron up to length 3. Only 118 of all 343 combinatorially possible trigrams are realized. In a ranked table, the 33 most likely strings cover all attainable Reber trigrams. In the log-probability plot of figure 6 there is a leap between entry number 33 (TSS, valid) and 34 (XSX, invalid), emphasizing the presence of the Reber characteristic. The correlation of the probabilities of Reber trigrams and their relative frequencies found in the map is 0.75. An explicit comparison of the probabilities of valid Reber strings can be found in figure 8. The values deviate from the true probabilities, in particular for cycles of the Reber graph, such as consecutive letters T and S, or the VPX cycle. This effect is due to the magnification factor of the SOM being different from 1, which is further amplified when sequences are processed in the proposed recursive manner.



[Figure 5 about here: the specialized Reber strings of the trained H-SOM-S arranged on the hyperbolic lattice; idle neurons appear as dots.]

Fig. 5. Arrangement of Reber words on a hyperbolic lattice structure. The words are arranged according to their most recent symbols (shown on the right of the sequences). The hyperbolic lattice yields a sector partitioning.

[Figure 6 about here: log-probability (roughly −6 to −1) over the index of the 3-letter word (0 to 110).]

Fig. 6. Likelihood of extracted trigrams. The most probable combinations are given by valid trigrams, and a gap in the likelihood can be observed at the first invalid combination.

[Figure 7 about here: the specialized Reber strings of the trained SOM-S arranged on the Euclidean lattice; idle neurons appear as asterisks.]

Fig. 7. Arrangement of Reber words on a Euclidean lattice structure. The words are arranged according to their most recent symbols (shown on the right of the sequences). Patches emerge according to the most recent symbol. Within the patches, an ordering according to the preceding symbols can be observed.

[Figure 8 about here.]

Fig. 8. Frequency reconstruction of trigrams from the Reber grammar.


5.4 Finite memory models

In a final series of experiments, we examine a SOM-S trained on Markov models with noisy input sequence entries. We investigate the possibility of extracting temporal dependencies of real-valued sequences from a trained map. The Markov model possesses a memory length of 2, as depicted in figure 9. The basic symbols are denoted by a, b, and c. These are embedded in two dimensions, disrupted by noise, as follows: a stands for (0, 0) + µ, b for (1, 0) + µ, and c for (0, 1) + µ, where µ is independent Gaussian noise with standard deviation σ_g, a variable to be tested in the experiments. The symbols are denoted right to left, i.e. ab indicates that the currently emitted symbol is a, after having observed symbol b in the previous step. Thus, b and c are always succeeded by a, whereas a is succeeded with probability x by b and with probability (1 − x) by c if the past symbol was b, and vice versa if the last symbol was c. The transition probability x is varied between the experiments.
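A minimal sketch of this data source under the stated embedding; names, seeding, and the start state are our assumptions:

```python
import numpy as np

def noisy_markov(n, x=0.4, sigma=0.1, seed=0):
    """Order-2 Markov source over a, b, c embedded in R^2 with
    Gaussian noise of standard deviation sigma."""
    rng = np.random.default_rng(seed)
    emb = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (0.0, 1.0)}
    prev, cur = 'b', 'a'               # start in state ab (arbitrary choice)
    seq = []
    for _ in range(n):
        seq.append(np.asarray(emb[cur]) + rng.normal(0.0, sigma, 2))
        if cur == 'a':                 # successor of a depends on prev
            if prev == 'b':
                nxt = 'b' if rng.random() < x else 'c'
            else:
                nxt = 'c' if rng.random() < x else 'b'
        else:                          # b and c are always followed by a
            nxt = 'a'
        prev, cur = cur, nxt
    return np.array(seq)
```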

We train a SOM-S with a regular rectangular two-dimensional lattice structure and 100 neurons on a generated Markov series. The context parameter was decreased from η = 0.97 to η = 0.93, the neighborhood radius was decreased from σ = 5 to σ = 0.5, and the learning rate was annealed from 0.02 to 0.005. A number of 1000 patterns are presented in 15000 cycles. U-Matrix clustering has been calculated with a level of the landscape chosen such that half the neurons are contained in valleys. The neurons in the same valley are assigned to the same cluster, and the number of distinct clusters is determined. Afterwards, all remaining neurons are assigned to their closest cluster.
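A sketch of this clustering step on a rectangular grid, reusing the hypothetical u_matrix helper sketched in section 4; the flood-fill and the nearest-cluster assignment on the grid are our reading of the procedure:

```python
import numpy as np
from scipy.ndimage import label

def valley_clusters(U):
    """Cut the U-landscape at its median so that half the neurons lie in
    valleys, label connected valleys, then attach every remaining neuron
    to the nearest labeled cluster on the grid (assumption)."""
    level = np.median(U)
    valleys, n_clusters = label(U <= level)
    vy, vx = np.nonzero(valleys)            # valley coordinates
    for r, c in zip(*np.nonzero(valleys == 0)):
        i = np.argmin((vy - r) ** 2 + (vx - c) ** 2)
        valleys[r, c] = valleys[vy[i], vx[i]]
    return valleys, n_clusters
```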

First, we choose a noise level of σ_g = 0.1 such that almost no overlap can be observed, and we investigate this setup with different x between 0 and 0.8. In all the results, three distinct clusters, corresponding to the three symbols, are found with the U-Matrix method. The extraction of the order-2 Markov models indicates that the global transition probabilities are correctly represented in the maps. Table 2 shows the corresponding extracted probabilities. Thereby, the exact probabilities cannot be recovered because of the magnification factor of the SOM being different from 1. However, the global trend is clearly found, and the extracted probabilities are in good agreement with the previously chosen values.

In a second experiment, the transition probability is fixed to x = 0.4, but the noise level is modified, choosing σ_g between 0.1 and 0.5. All training parameters are chosen as in the previous experiment. Note that a noise level of σ_g = 0.3 already yields much overlap of the classes, as depicted in figure 10. Nevertheless, three clusters can be detected in all of the cases, and the transition probabilities can be recovered, except for a noise level of 0.5, for which the training scenario degenerates to an almost deterministic case, making a the most dominant state. Table 3 summarizes the extracted probabilities.

[Figure 9 about here: state graph over ab, ac, ba, ca with transitions weighted x, 1 − x, and 1 as described above.]

Fig. 9. Markov automaton with 3 basic states and a finite order of 2 used to train the map.


Fig. 10. Symbols a, b, c, which are embedded in R^2 as a = (0, 0) + µ, b = (1, 0) + µ, and c = (0, 1) + µ, subject to noise µ with different standard deviations: the noise levels are 0.1, 0.3, and 0.4. The latter two noise levels show considerable overlap of the classes which represent the symbols.

x        0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
P(a|ab)  0     0.01  0     0.01  0     0.04  0     0.04  0.01
P(b|ab)  0     0.08  0.3   0.31  0.38  0.55  0.68  0.66  0.78
P(c|ab)  1     0.91  0.7   0.68  0.62  0.41  0.32  0.3   0.21
P(a|ac)  0     0     0     0     0     0.01  0.01  0     0.01
P(b|ac)  1     0.81  0.8   0.66  0.52  0.55  0.32  0.31  0.24
P(c|ac)  0     0.19  0.2   0.34  0.48  0.44  0.67  0.69  0.75

Table 2
Transition probabilities extracted from the trained map. The noise level was fixed to 0.1 and different generating transition probabilities x were used.

noise    0.1   0.2   0.3   0.4   0.5   true
P(a|ab)  0.01  0     0     0.1   0.98  0
P(b|ab)  0.42  0.49  0.4   0.24  0.02  0.4
P(c|ab)  0.57  0.51  0.6   0.66  0.02  0.6
P(a|ac)  0.01  0     0     0.09  0     0
P(b|ac)  0.59  0.6   0.44  0.39  0     0.6
P(c|ac)  0.4   0.4   0.56  0.52  0     0.4

Table 3
Probabilities extracted from the trained map with fixed input transition probabilities and different noise levels. For a noise level of 0.5, the extraction mechanism breaks down and the symbol a becomes most dominant. For smaller noise levels, extraction of the symbols can still be performed even for overlapping clusters, because of the temporal differentiation of the clusters in recursive models.



6 Conclusions

We have presented a self-organizing map with a neural back-reference to the previously active sites and with a flexible topological structure of the neuron grid. For context representation, the compact and powerful SOMSD model proposed in [11] has been used. Unlike for TKM and RSOM, much more flexibility and expressiveness is obtained, because the context is represented in the space spanned by the neurons, and not only in the domain of the weight space. Compared to RecSOM, which is based on very extensive contexts, the SOMSD model is much more efficient. However, SOMSD requires an appropriate topological representation of the symbols, measuring distances of contexts in the grid space. We have therefore extended the map configuration to more general triangular lattices, thus also making hyperbolic models possible, as introduced in [30]. Our SOM-S approach has been evaluated on several data series including discrete and real-valued entries. Two experimental setups have been taken from [41] to allow a direct comparison with different models. As pointed out, the compact model introduced here improves on the capacity of simple leaky integrator networks like TKM and RSOM and shows results competitive with the more complex RecSOM.

Since the context of SOM-S directly refers to the previous winner, temporal contexts can be extracted from a trained map. An extraction scheme to obtain Markov models of fixed order has been presented, and its reliability has been confirmed in three experiments. As demonstrated, this mechanism can be applied to real-valued sequences, expanding U-Matrix methods to the recursive case.

So far, the topological structure of context formation has not been taken into account during the extraction. Context clusters, in addition to weight clusters, provide more information, which might be used for the determination of appropriate orders of the models, or for the extraction of more complex settings like hidden Markov models. We are currently investigating these issues experimentally. However, preliminary results indicate that Hebbian training, as introduced in this article, allows the reliable extraction of finite memory models only. More sophisticated training algorithms should be developed for more complex temporal dependencies.

Interestingly, the proposed context model can be interpreted as the development of long-range synaptic connections, leading to more specialized map regions. Statistical counterparts to unsupervised sequence processing, like the Generative Topographic Mapping Through Time (GTMTT) [5], incorporate similar ideas by describing temporal data dependencies through hidden Markov latent space models. Such a context affects the prior distribution on the space of neurons. Due to computational restrictions, the transition probabilities of GTMTT are usually limited to only local connections. Thus, long-range connections like those in the presented context model do not emerge; rather, visualizations similar (though more powerful) to TKM and RSOM arise. It could be interesting to develop more efficient statistical counterparts which also allow the emergence of interpretable long-range connections such as those of the deterministic SOM-S.



References

[1] G. Barreto and A. Araújo. Time in self-organizing maps: An overview of models. Int. Journ. of Computer Research, 10(2):139–179, 2001.

[2] G. de A. Barreto, A. F. R. Araujo, and S. C. Kremer. A taxonomy for spatiotemporal connectionist networks revisited: the unsupervised case. Neural Computation, 15(6):1255–1320, 2003.

[3] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2):218–226, 1997.

[4] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215–235, 1998.

[5] C. M. Bishop, G. E. Hinton, and C. K. I. Williams. GTM through time. In Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., pages 111–116, 1997.

[6] P. Bühlmann and A. J. Wyner. Variable length Markov chains. Annals of Statistics, 27:480–513, 1999.

[7] O. A. Carpinteiro. A hierarchical self-organizing map for sequence recognition. Neural Processing Letters, 9(3):209–220, 1999.

[8] G. Chappell and J. Taylor. The temporal Kohonen map. Neural Networks, 6:441–445, 1993.

[9] I. Farkas and R. Miikkulainen. Modeling the self-organization of directional selectivity in the primary visual cortex. In Proceedings of ICANN'99, Edinburgh, Scotland, pages 251–256, 1999.

[10] M. Hagenbuchner, A. C. Tsoi, and A. Sperduti. A supervised self-organising map for structured data. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organising Maps, pages 21–28. Springer, 2001.

[11] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks, 14(3):491–505, 2003.

[12] B. Hammer. On the learnability of recursive data. Mathematics of Control, Signals, and Systems, 12:62–79, 1999.

[13] B. Hammer, A. Micheli, and A. Sperduti. A general framework for unsupervised processing of structured data. In M. Verleysen, editor, European Symposium on Artificial Neural Networks 2002, pages 389–394. D Facto, 2002.

[14] B. Hammer, A. Micheli, M. Strickert, and A. Sperduti. A general framework for unsupervised processing of structured data. To appear in: Neurocomputing.

[15] B. Hammer, A. Micheli, and A. Sperduti. A general framework for self-organizing structure processing neural networks. Technical Report TR-03-04, Università di Pisa, 2003.

[16] J. Joutsensalo and A. Miettinen. Self-organizing operator map for nonlinear dimension reduction. In Proceedings ICNN'95, 1:111–114. IEEE, 1995.

[17] J. Kangas. On the analysis of pattern sequences by self-organizing maps. PhD thesis, Helsinki University of Technology, Espoo, Finland, 1994.

[18] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM – self-organizing maps of document collections. Neurocomputing, 21(1):101–117, 1998.

[19] S. Kaski and J. Sinkkonen. A topography-preserving latent variable model with learning metrics. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 224–229. Springer, 2001.

[20] T. Kohonen. The 'neural' phonetic typewriter. Computer, 21(3):11–22, 1988.

[21] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 2001.

[22] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Recurrent SOM with local linear models in time series prediction. In M. Verleysen, editor, 6th European Symposium on Artificial Neural Networks, pages 167–172. De Facto, 1998.

[23] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Time series prediction using recurrent SOM with local linear models. International Journal of Knowledge-based Intelligent Engineering Systems, 2(1):60–68, 1998.

[24] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review. Neural Computation, 13(2):249–306, 2001.

[25] J. Laaksonen, M. Koskela, S. Laakso, and E. Oja. PicSOM – content-based image retrieval with self-organizing maps. Pattern Recognition Letters, 21(13-14):1199–1207, 2000.

[26] J. Lampinen and E. Oja. Self-organizing maps for spatial and temporal AR models. In M. Pietikäinen and J. Röning, editors, Proceedings 6th SCIA, pages 120–127, Helsinki, Finland, 1989.

[27] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507–522, 1994.

[28] T. Martinetz, S. G. Berkovich, and K. J. Schulten. 'Neural-gas' networks for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558–569, 1993.

[29] J. Ontrup and H. Ritter. Text categorization and semantic browsing with self-organizing maps on non-Euclidean spaces. In L. D. Raedt and A. Siebes, editors, Proceedings of PKDD-01, pages 338–349. Springer, 2001.

[30] H. Ritter. Self-organizing maps on non-Euclidean spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97–110. Elsevier, 1999.

[31] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, 1992.

[32] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25:117–150, 1996.

[33] J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217–239, 2002.

[34] P. Sommervuo. Self-organizing maps for signal and symbol sequences. PhD thesis, Helsinki University of Technology, 2000.

[35] A. Sperduti. Neural networks for adaptive processing of structured data. In Proc. ICANN 2001, pages 5–12. Springer, 2001.

[36] M. Strickert, T. Bojer, and B. Hammer. Generalized relevance LVQ for time series. In Proc. ICANN'2001, 677–638. Springer, 2001.

[37] M. Strickert and B. Hammer. Neural gas for sequences. In Proc. WSOM'03, pages 53–57, 2003.

[38] A. Ultsch and C. Vetter. Selforganizing feature maps versus statistical clustering: A benchmark. Research Report No. 9, Dep. of Mathematics, University of Marburg, 1994.

[39] M. Varsta, J. del R. Milan, and J. Heikkonen. A recurrent self-organizing map for temporal sequence processing. In Proc. ICANN'97, pages 421–426. Springer, 1997.

[40] M. Varsta, J. Heikkonen, and J. Lampinen. Analytical comparison of the temporal Kohonen map and the recurrent self organizing map. In M. Verleysen, editor, ESANN'2000, pages 273–280. De Facto, 2000.

[41] T. Voegtlin. Recursive self-organizing maps. Neural Networks, 15(8-9):979–991, 2002.
