
MIT Encyclopedia of the Cognitive Sciences - Cryptome


Figure 2. A generic recurrent network. The presence of cycles in the graph distinguishes these networks from the class of feedforward networks. Note that there is no requirement of reciprocity as in the Hopfield or Boltzmann networks.

constraints between nodes. The resulting equations of constraint, known as "mean field equations," generally turn out, once again, to have the form of the perceptron update rule, although the sharp decision of Eq. (2) is replaced with a smoother nonlinear function (Peterson and Anderson 1987). Mean field methods can also be utilized with decreasing T. If T is decreased to zero, this idea, referred to as "deterministic annealing," can be applied to the Hopfield network. In fact, deterministic annealing has become the method of choice for the update of Hopfield networks, replacing the simple dynamics of Eq. (2).
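The mean-field annealing scheme described above can be sketched numerically. In this sketch, everything concrete is an illustrative assumption, not from the text: the tanh nonlinearity plays the role of the smoother function, the temperature schedule, Hebbian weights, and function names are invented for the example.

```python
import numpy as np

def deterministic_annealing(J, s0, temps, sweeps=50):
    """Mean-field annealing for a Hopfield-style network (sketch).

    The mean-field update s_i = tanh((1/T) * sum_j J_ij s_j) replaces a
    hard threshold with a smooth nonlinearity; lowering T toward zero
    gradually recovers near-binary states.  The tanh choice and the
    schedule are illustrative assumptions.
    """
    s = s0.astype(float).copy()
    for T in temps:
        for _ in range(sweeps):
            s = np.tanh((J @ s) / T)   # synchronous mean-field sweep
    return s

# Tiny symmetric network storing the pattern (+1, +1, -1) via a Hebbian
# outer product, with zero self-connections as in Hopfield networks.
p = np.array([1.0, 1.0, -1.0])
J = np.outer(p, p)
np.fill_diagonal(J, 0.0)

s = deterministic_annealing(J, np.array([0.3, 0.1, -0.2]),
                            temps=[2.0, 1.0, 0.5, 0.1])
print(np.sign(s))  # the state settles into the stored pattern
```

At high T the states are pulled toward zero; as T drops below the critical value the stored pattern becomes a stable fixed point and the state saturates toward it.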

General recurrent networks are usually specified by drawing a directed graph (see figure 2). In such graphs, arbitrary connectivity patterns are allowed; that is, there is no requirement that nodes are connected reciprocally. We associate a real-valued weight J_ij with the link from node j to node i, letting J_ij equal zero if there is no link.

At time t, the ith node in the network has an activation value S_i[t], which can be either a discrete value or a continuous value. Generally the focus is on discrete-time systems (see also AUTOMATA), in which t is a discrete index, although continuous-time systems are also studied (see also CONTROL THEORY). The update rule defining the dynamics of the network is typically of the following form:

$$
S_i[t+1] \;=\; f\!\left(\sum_j J_{ij}\, S_j[t]\right), \tag{3}
$$
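This update rule can be sketched in a few lines. The weight values below and the choice f = tanh are illustrative assumptions; note that J is deliberately asymmetric, since recurrent networks impose no reciprocity requirement.

```python
import numpy as np

def step(S, J, f=np.tanh):
    """One synchronous update S_i[t+1] = f(sum_j J_ij S_j[t])."""
    return f(J @ S)

# A two-node network with no reciprocity: node 0 drives node 1
# (J_10 = 0.5), node 1 inhibits node 0 (J_01 = -1.0), and node 1
# has a self-loop (J_11 = 0.8).  Values are illustrative.
J = np.array([[0.0, -1.0],
              [0.5,  0.8]])
S = np.array([1.0, 0.0])
for t in range(3):
    S = step(S, J)
print(S)  # activations after three synchronous updates
```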

where the function f is generally taken to be a smooth nonlinear function.

General recurrent networks can show complex patterns of dynamic behavior (including limit cycles and chaotic patterns), and it is difficult to place conditions on the weights J_ij that guarantee particular kinds of desired behavior. Thus, researchers interested in the time-varying behavior of recurrent networks have generally utilized learning algorithms as a method of "programming" the network by providing examples of desired behavior (Giles, Kuhn, and Williams 1994).

A general-purpose learning algorithm for recurrent networks, known as backpropagation-in-time, can be obtained by a construction that "unrolls" the recurrent network (Rumelhart et al. 1986). The unrolled network has T + 1 layers of N nodes each, obtained by copying the N nodes of the recurrent network at every time step from t = 0 to t = T (see figure 3). The connections in the unrolled network are feedforward connections that are copies of the recurrent connections in the original network. The result is an unrolled network that is a standard feedforward network. Applying the standard algorithm for feedforward networks, in particular backpropagation, yields the backpropagation-in-time algorithm.

Figure 3. An "unrolled" recurrent network. The nodes S_i in the recurrent network in figure 2 are copied in each of T + 1 time slices. These copies represent the time-varying activations S_i[t]. The weights in each slice are time-invariant; they are copies of the corresponding weights in the original network. Thus, for example, the weights between the topmost nodes in each slice are all equal to J_ii, the value of the weight from unit S_i to itself in the original network.
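The unrolling construction can be sketched as follows. The quadratic final-state loss and tanh nonlinearity are illustrative assumptions; the essential point is that every time slice shares the same weights J, so the slice-by-slice gradients are accumulated into a single weight gradient.

```python
import numpy as np

def bptt_grad(J, s0, target, T):
    """Gradient of a final-state loss w.r.t. the shared weights J,
    obtained by unrolling T time steps and backpropagating through the
    copies (backpropagation-in-time).  The loss and nonlinearity are
    illustrative choices."""
    # Forward pass: store every slice's activation, as in the unrolled net.
    S = [s0]
    for t in range(T):
        S.append(np.tanh(J @ S[-1]))
    loss = 0.5 * np.sum((S[-1] - target) ** 2)

    # Backward pass: ordinary feedforward backprop through the T slices.
    # Because every slice is a copy of the same J, its gradient is summed.
    dJ = np.zeros_like(J)
    delta = (S[-1] - target) * (1 - S[-1] ** 2)   # dL/d(pre-activation)
    for t in range(T - 1, -1, -1):
        dJ += np.outer(delta, S[t])               # contribution of slice t
        delta = (J.T @ delta) * (1 - S[t] ** 2)   # propagate one slice back
    return loss, dJ

# Illustrative weights, initial state, and target.
J = np.array([[0.2, -0.4],
              [0.7,  0.1]])
s0 = np.array([0.5, -0.3])
target = np.array([0.1, 0.2])
loss, dJ = bptt_grad(J, s0, target, T=5)
print(loss, dJ)
```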

Backpropagation-in-time and similar algorithms have a general difficulty in training networks to hold information over lengthy time intervals (Bengio, Simard, and Frasconi 1994). Essentially, gradient-based methods utilize the derivative of the state transition function of the dynamic system, and for systems that are able to hold information over lengthy intervals this derivative tends rapidly to zero. Many new ideas in recurrent network research, including the use of embedded memories and particular forms of prior knowledge, have arisen as researchers have tried to combat this problem (Frasconi et al. 1995; Omlin and Giles 1996).

Finally, there has been substantial work on the use of recurrent networks to represent finite automata and the problem of learning regular languages (Giles et al. 1992).
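As a concrete illustration of the representational point, a finite automaton can be embedded in a recurrent network by encoding the automaton's state as a one-hot activation vector and letting the input symbol select the transition weights. The two-state parity automaton below is a standard textbook construction, not taken from the cited work.

```python
import numpy as np

# Transition matrices for the parity automaton over {0, 1}:
# input 0 leaves the state alone; input 1 swaps the two states.
M = {0: np.eye(2),
     1: np.array([[0.0, 1.0],
                  [1.0, 0.0]])}

def run(word):
    S = np.array([1.0, 0.0])   # one-hot start state: even parity
    for x in word:
        S = M[x] @ S           # recurrent update with input-selected weights
    return int(np.argmax(S))   # 0 = even number of 1s, 1 = odd

print(run([1, 0, 1, 1]))  # three 1s -> odd parity -> 1
```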

—Michael I. Jordan<br />

References

Bengio, Y., P. Simard, and P. Frasconi. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5: 157–166.

Frasconi, P., M. Gori, M. Maggini, and G. Soda. (1995). Unified integration of explicit rules and learning by example in recurrent networks. IEEE Transactions on Knowledge and Data Engineering 7: 340–346.

Geman, S., and D. Geman. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721–741.
