COMPUTATION AND NEURAL SYSTEMS SERIES

SERIES EDITOR
Christof Koch, California Institute of Technology

EDITORIAL ADVISORY BOARD MEMBERS
Dana Anderson, University of Colorado, Boulder
Michael Arbib, University of Southern California
Dana Ballard, University of Rochester
James Bower, California Institute of Technology
Gerard Dreyfus, Ecole Superieure de Physique et de Chimie Industrielles de la Ville de Paris
Rolf Eckmiller, University of Düsseldorf
Kunihiko Fukushima, Osaka University
Walter Heiligenberg, Scripps Institute of Oceanography, La Jolla
Shaul Hochstein, Hebrew University, Jerusalem
Alan Lapedes, Los Alamos National Laboratory
Carver Mead, California Institute of Technology
Guy Orban, Catholic University of Leuven
Haim Sompolinsky, Hebrew University, Jerusalem
John Wyatt, Jr., Massachusetts Institute of Technology

The series editor, Dr. Christof Koch, is Assistant Professor of Computation and Neural Systems at the California Institute of Technology. Dr. Koch works at both the biophysical level, investigating information processing in single neurons and in networks such as the visual cortex, as well as studying and implementing simple resistive networks for computing motion, stereo, and color in biological and artificial systems.


Neural Networks
Algorithms, Applications, and Programming Techniques

James A. Freeman
David M. Skapura
Loral Space Information Systems
and
Adjunct Faculty, School of Natural and Applied Sciences
University of Houston at Clear Lake

Addison-Wesley Publishing Company
Reading, Massachusetts • Menlo Park, California • New York
Don Mills, Ontario • Wokingham, England • Amsterdam • Bonn
Sydney • Singapore • Tokyo • Madrid • San Juan • Milan • Paris


Library of Congress Cataloging-in-Publication Data

Freeman, James A.
  Neural networks : algorithms, applications, and programming techniques
  / James A. Freeman and David M. Skapura.
    p. cm.
  Includes bibliographical references and index.
  ISBN 0-201-51376-5
  1. Neural networks (Computer science)  2. Algorithms.
  I. Skapura, David M.  II. Title.
  QA76.87.F74 1991
  006.3-dc20                                 90-23758
                                                  CIP

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.

The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

Copyright © 1991 by Addison-Wesley Publishing Company, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.

1 2 3 4 5 6 7 8 9 10-MA-95 94 93 92 91


PREFACE

The appearance of digital computers and the development of modern theories of learning and neural processing both occurred at about the same time, during the late 1940s. Since that time, the digital computer has been used as a tool to model individual neurons as well as clusters of neurons, which are called neural networks. A large body of neurophysiological research has accumulated since then. For a good review of this research, see Neural and Brain Modeling by Ronald J. MacGregor [21]. The study of artificial neural systems (ANS) on computers remains an active field of biomedical research.

Our interest in this text is not primarily neurological research. Rather, we wish to borrow concepts and ideas from the neuroscience field and to apply them to the solution of problems in other areas of science and engineering. The ANS models that are developed here may or may not have neurological relevance. Therefore, we have broadened the scope of the definition of ANS to include models that have been inspired by our current understanding of the brain, but that do not necessarily conform strictly to that understanding.

The first examples of these new systems appeared in the late 1950s. The most common historical reference is to the work done by Frank Rosenblatt on a device called the perceptron. There are other examples, however, such as the development of the Adaline by Professor Bernard Widrow.

Unfortunately, ANS technology has not always enjoyed the status in the fields of engineering or computer science that it has gained in the neuroscience community. Early pessimism concerning the limited capability of the perceptron effectively curtailed most research that might have paralleled the neurological research into ANS. From 1969 until the early 1980s, the field languished. The appearance, in 1969, of the book Perceptrons, by Marvin Minsky and Seymour Papert [26], is often credited with causing the demise of this technology. Whether this causal connection actually holds continues to be a subject for debate. Still, during those years, isolated pockets of research continued. Many of the network architectures discussed in this book were developed by researchers who remained active through the lean years. We owe the modern renaissance of neural-network technology to the successful efforts of those persistent workers.

Today, we are witnessing substantial growth in funding for neural-network research and development.


Conferences dedicated to neural networks and a new professional society have appeared, and many new educational programs at colleges and universities are beginning to train students in neural-network technology.

In 1986, another book appeared that has had a significant positive effect on the field. Parallel Distributed Processing (PDP), Vols. I and II, by David Rumelhart and James McClelland [23], and the accompanying handbook [22] are the place most often recommended to begin a study of neural networks. Although biased toward physiological and cognitive-psychology issues, it is highly readable and contains a large amount of basic background material.

PDP is certainly not the only book in the field, although many others tend to be compilations of individual papers from professional journals and conferences. That statement is not a criticism of these texts. Researchers in the field publish in a wide variety of journals, making accessibility a problem. Collecting a series of related papers in a single volume can overcome that problem. Nevertheless, there is a continuing need for books that survey the field and are more suitable to be used as textbooks. In this book, we attempt to address that need.

The material from which this book was written was originally developed for a series of short courses and seminars for practicing engineers. For many of our students, the courses provided a first exposure to the technology. Some were computer-science majors with specialties in artificial intelligence, but many came from a variety of engineering backgrounds. Some were recent graduates; others held Ph.D.s. Since it was impossible to prepare separate courses tailored to individual backgrounds, we were faced with the challenge of designing material that would meet the needs of the entire spectrum of our student population. We retain that ambition for the material presented in this book.

This text contains a survey of neural-network architectures that we believe represents a core of knowledge that all practitioners should have. We have attempted, in this text, to supply readers with solid background information, rather than to present the latest research results; the latter task is left to the proceedings and compendia, as described later. Our choice of topics was based on this philosophy.

It is significant that we refer to the readers of this book as practitioners. We expect that most of the people who use this book will be using neural networks to solve real problems. For that reason, we have included material on the application of neural networks to engineering problems.
Moreover, we have included sections that describe suitable methodologies for simulating neural-network architectures on traditional digital computing systems. We have done so because we believe that the bulk of ANS research and applications will be developed on traditional computers, even though analog VLSI and optical implementations will play key roles in the future.

The book is suitable both for self-study and as a classroom text. The level is appropriate for an advanced undergraduate or beginning graduate course in neural networks. The material should be accessible to students and professionals in a variety of technical disciplines. The mathematical prerequisites are the standard set of courses in calculus, differential equations, and advanced engineering mathematics normally taken during the first 3 years in an engineering curriculum.


These prerequisites may make computer-science students uneasy, but the material can easily be tailored by an instructor to suit students' backgrounds. There are mathematical derivations and exercises in the text; however, our approach is to give an understanding of how the networks operate, rather than to concentrate on pure theory.

There is a sufficient amount of material in the text to support a two-semester course. Because each chapter is virtually self-contained, there is considerable flexibility in the choice of topics that could be presented in a single semester. Chapter 1 provides necessary background material for all the remaining chapters; it should be the first chapter studied in any course. The first part of Chapter 6 (Section 6.1) contains background material that is necessary for a complete understanding of Chapters 7 (Self-Organizing Maps) and 8 (Adaptive Resonance Theory). Other than these two dependencies, you are free to move around at will without being concerned about missing required background material.

Chapter 3 (Backpropagation) naturally follows Chapter 2 (Adaline and Madaline) because of the relationship between the delta rule, derived in Chapter 2, and the generalized delta rule, derived in Chapter 3. Nevertheless, these two chapters are sufficiently self-contained that there is no need to treat them in order.

To achieve full benefit from the material, you must do programming of neural-network simulation software and must carry out experiments training the networks to solve problems. For this reason, you should have the ability to program in a high-level language, such as Ada or C. Prior familiarity with the concepts of pointers, arrays, linked lists, and dynamic memory management will be of value. Furthermore, because our simulators emphasize efficiency in order to reduce the amount of time needed to simulate large neural networks, you will find it helpful to have a basic understanding of computer architecture, data structures, and assembly language concepts.

In view of the availability of commercial hardware and software that come with a development environment for building and experimenting with ANS models, our emphasis on the need to program from scratch requires explanation. Our experience has been that large-scale ANS applications require highly optimized software due to the extreme computational load that neural networks place on computing systems.
Specialized environments often place a significant overhead on the system, resulting in decreased performance. Moreover, certain issues—such as design flexibility, portability, and the ability to embed neural-network software into an application—become much less of a concern when programming is done directly in a language such as C.

Chapter 1, Introduction to ANS Technology, provides background material that is common to many of the discussions in following chapters. The two major topics in this chapter are a description of a general neural-network processing model and an overview of simulation techniques. In the description of the processing model, we have adhered, as much as possible, to the notation in the PDP series. The simulation overview presents a general framework for the simulations discussed in subsequent chapters.


Following this introductory chapter is a series of chapters, each devoted to a specific network or class of networks. There are nine such chapters:

Chapter 2, Adaline and Madaline
Chapter 3, Backpropagation
Chapter 4, The BAM and the Hopfield Memory
Chapter 5, Simulated Annealing: Networks discussed include the Boltzmann completion and input-output networks
Chapter 6, The Counterpropagation Network
Chapter 7, Self-Organizing Maps: includes the Kohonen topology-preserving map and the feature-map classifier
Chapter 8, Adaptive Resonance Theory: Networks discussed include both ART1 and ART2
Chapter 9, Spatiotemporal Pattern Classification: discusses Hecht-Nielsen's spatiotemporal network
Chapter 10, The Neocognitron

Each of these nine chapters contains a general description of the network architecture and a detailed discussion of the theory of operation of the network. Most chapters contain examples of applications that use the particular network. Chapters 2 through 9 include detailed instructions on how to build software simulations of the networks within the general framework given in Chapter 1. Exercises based on the material are interspersed throughout the text. A list of suggested programming exercises and projects appears at the end of each chapter.

We have chosen not to include the usual pseudocode for the neocognitron network described in Chapter 10. We believe that the complexity of this network makes the neocognitron inappropriate as a programming exercise for students.

To compile this survey, we had to borrow ideas from many different sources. We have attempted to give credit to the original developers of these networks, but it was impossible to define a source for every idea in the text. To help alleviate this deficiency, we have included a list of suggested readings after each chapter. We have not, however, attempted to provide anything approaching an exhaustive bibliography for each of the topics that we discuss.

Each chapter bibliography contains a few references to key sources and supplementary material in support of the chapter. Often, the sources we quote are older references, rather than the newest research on a particular topic. Many of the later research results are easy to find: Since 1987, the majority of technical papers on ANS-related topics has congregated in a few journals and conference proceedings.


In particular, the journals Neural Networks, published by the International Neural Network Society (INNS), and Neural Computation, published by MIT Press, are two important periodicals. A newcomer at the time of this writing is the IEEE special-interest group on neural networks, which has its own periodical.

The primary conference in the United States is the International Joint Conference on Neural Networks, sponsored by the IEEE and INNS. This conference series was inaugurated in June of 1987, sponsored by the IEEE. The conferences have produced a number of large proceedings, which should be the primary source for anyone interested in the field. The proceedings of the annual conference on Neural Information Processing Systems (NIPS), published by Morgan-Kaufmann, are another good source. There are other conferences as well, both in the United States and in Europe. As a comprehensive bibliography of the field, Casey Klimausauskas has compiled The 1989 Neuro-Computing Bibliography, published by MIT Press [17].

Finally, we believe this book will be successful if our readers gain

• A firm understanding of the operation of the specific networks presented
• The ability to program simulations of those networks successfully
• The ability to apply neural networks to real engineering and scientific problems
• A sufficient background to permit access to the professional literature
• The enthusiasm that we feel for this relatively new technology and the respect we have for its ability to solve problems that have eluded other approaches

ACKNOWLEDGMENTS

As this page is being written, several associates are outside our offices, discussing the New York Giants' win over the Buffalo Bills in Super Bowl XXV last night. Their comments describing the affair range from the typical superlatives, "The Giants' offensive line overwhelmed the Bills' defense," to denials of any skill, training, or teamwork attributable to the participants: "They were just plain lucky."

By way of analogy, we have now arrived at our Super Bowl. The text is written, the artwork done, the manuscript reviewed, the editing completed, and the book is now ready for typesetting. Undoubtedly, after the book is published, many will comment on the quality of the effort, although we hope no one will attribute the quality to "just plain luck." We have survived the arduous process of publishing a textbook, and like the teams that went to the Super Bowl, we have succeeded because of the combined efforts of many, many people. Space does not allow us to mention each person by name, but we are deeply grateful to everyone who has been associated with this project.


There are, however, several individuals who have gone well beyond the normal call of duty, and we would now like to thank these people by name. First of all, Dr. John Engvall and Mr. John Frere of Loral Space Information Systems were kind enough to encourage us in the exploration of neural-network technology and in the development of this book. Mr. Gary McIntire, Ms. Sheryl Knotts, and Mr. Matt Hanson, all of the Loral Space Information Systems Artificial Intelligence Laboratory, proofread early versions of the manuscript and helped us to debug our algorithms. We would also like to thank our reviewers: Dr. Marijke Augusteijn, Department of Computer Science, University of Colorado; Dr. Daniel Kammen, Division of Biology, California Institute of Technology; Dr. E. L. Perry, Loral Command and Control Systems; Dr. Gerald Tesauro, IBM Thomas J. Watson Research Center; and Dr. John Vittal, GTE Laboratories, Inc. We found their many comments and suggestions quite useful, and we believe that the end product is much better because of their efforts.

We received funding for several of the applications described in the text from sources outside our own company. In that regard, we would like to thank Dr. Hossein Nivi of the Ford Motor Company, and Dr. Jon Erickson, Mr. Ken Baker, and Mr. Robert Savely of the NASA Johnson Space Center.

We are also deeply grateful to our publishers, particularly Mr. Peter Gordon, Ms. Helen Goldstein, and Mr. Mark McFarland, all of whom offered helpful insights and suggestions and also took the risk of publishing two unknown authors. We also owe a great debt to our production staff, specifically, Ms. Loren Hilgenhurst Stevens, Ms. Mona Zeftel, and Ms. Mary Dyer, who guided us through the maze of details associated with publishing a book, and to our patient copy editor, Ms. Lyn Dupre, who taught us much about the craft of writing.

Finally, to Peggy, Carolyn, Geoffrey, Deborah, and Danielle, our wives and children, who patiently accepted the fact that we could not be all things to them and published authors, we offer our deepest and most heartfelt thanks.

Houston, Texas
J. A. F.
D. M. S.


CONTENTS

Chapter 1  Introduction to ANS Technology
  1.1  Elementary Neurophysiology
  1.2  From Neurons to ANS
  1.3  ANS Simulation
  Bibliography

Chapter 2  Adaline and Madaline
  2.1  Review of Signal Processing
  2.2  Adaline and the Adaptive Linear Combiner
  2.3  Applications of Adaptive Signal Processing
  2.4  The Madaline
  2.5  Simulating the Adaline
  Bibliography

Chapter 3  Backpropagation
  3.1  The Backpropagation Network
  3.2  The Generalized Delta Rule
  3.3  Practical Considerations
  3.4  BPN Applications
  3.5  The Backpropagation Simulator
  Bibliography

Chapter 4  The BAM and the Hopfield Memory
  4.1  Associative-Memory Definitions
  4.2  The BAM


  4.3  The Hopfield Memory
  4.4  Simulating the BAM
  Bibliography

Chapter 5  Simulated Annealing
  5.1  Information Theory and Statistical Mechanics
  5.2  The Boltzmann Machine
  5.3  The Boltzmann Simulator
  5.4  Using the Boltzmann Simulator
  Bibliography

Chapter 6  The Counterpropagation Network
  6.1  CPN Building Blocks
  6.2  CPN Data Processing
  6.3  An Image-Classification Example
  6.4  The CPN Simulator
  Bibliography

Chapter 7  Self-Organizing Maps
  7.1  SOM Data Processing
  7.2  Applications of Self-Organizing Maps
  7.3  Simulating the SOM
  Bibliography

Chapter 8  Adaptive Resonance Theory
  8.1  ART Network Description
  8.2  ART1
  8.3  ART2
  8.4  The ART1 Simulator
  8.5  ART2 Simulation
  Bibliography

Chapter 9  Spatiotemporal Pattern Classification
  9.1  The Formal Avalanche
  9.2  Architectures of Spatiotemporal Networks (STNs)


  9.3  The Sequential Competitive Avalanche Field
  9.4  Applications of STNs
  9.5  STN Simulation
  Bibliography

Chapter 10  The Neocognitron
  10.1  Neocognitron Architecture
  10.2  Neocognitron Data Processing
  10.3  Performance of the Neocognitron
  10.4  Addition of Lateral Inhibition and Feedback to the Neocognitron
  Bibliography


Chapter 1
Introduction to ANS Technology

When the only tool you have is a hammer, every problem you encounter tends to resemble a nail.
—Source unknown

Why can't we build a computer that thinks? Why can't we expect machines that can perform 100 million floating-point calculations per second to be able to comprehend the meaning of shapes in visual images, or even to distinguish between different kinds of similar objects? Why can't that same machine learn from experience, rather than repeating forever an explicit set of instructions generated by a human programmer?

These are only a few of the many questions facing computer designers, engineers, and programmers, all of whom are striving to create more "intelligent" computer systems. The inability of the current generation of computer systems to interpret the world at large does not, however, indicate that these machines are completely inadequate. There are many tasks that are ideally suited to solution by conventional computers: scientific and mathematical problem solving; database creation, manipulation, and maintenance; electronic communication; word processing, graphics, and desktop publication; even the simple control functions that add intelligence to and simplify our household tools and appliances are handled quite effectively by today's computers.

In contrast, there are many applications that we would like to automate, but have not automated due to the complexities associated with programming a computer to perform the tasks. To a large extent, the problems are not unsolvable; rather, they are difficult to solve using sequential computer systems. This distinction is important. If the only tool we have is a sequential computer, then we will naturally try to cast every problem in terms of sequential algorithms. Many problems are not suited to this approach, however, causing us to expend a great deal of effort on the development of sophisticated algorithms, perhaps even failing to find an acceptable solution.


In the remainder of this text, we will examine many parallel-processing architectures that provide us with new tools that can be used in a variety of applications. Perhaps, with these tools, we will be able to solve more easily currently difficult-to-solve, or unsolved, problems. Of course, our proverbial hammer will still be extremely useful, but with a full toolbox we should be able to accomplish much more.

As an example of the difficulties we encounter when we try to make a sequential computer system perform an inherently parallel task, consider the problem of visual pattern recognition. Complex patterns consisting of numerous elements that, individually, reveal little of the total pattern, yet collectively represent easily recognizable (by humans) objects, are typical of the kinds of patterns that have proven most difficult for computers to recognize. For example, examine the illustration presented in Figure 1.1. If we focus strictly on the black splotches, the picture is devoid of meaning. Yet, if we allow our perspective to encompass all the components, we can see the image of a commonly recognizable object in the picture. Furthermore, once we see the image, it is difficult for us not to see it whenever we again see this picture.

Now, let's consider the techniques we would apply were we to program a conventional computer to recognize the object in that picture. The first thing our program would attempt to do is to locate the primary area or areas of interest in the picture. That is, we would try to segment or cluster the splotches into groups, such that each group could be uniquely associated with one object. We might then attempt to find edges in the image by completing line segments. We could continue by examining the resulting set of edges for consistency, trying to determine whether or not the edges found made sense in the context of the other line segments. Lines that did not abide by some predefined rules describing the way lines and edges appear in the real world would then be attributed to noise in the image and thus would be eliminated. Finally, we would attempt to isolate regions that indicated common textures, thus filling in the holes and completing the image.

The illustration of Figure 1.1 is one of a dalmatian seen in profile, facing left, with head lowered to sniff at the ground. The image indicates the complexity of the type of problem we have been discussing.
Since the dog is illustrated as a series of black spots on a white background, how can we write a computer program to determine accurately which spots form the outline of the dog, which spots can be attributed to the spots on his coat, and which spots are simply distractions?

An even better question is this: How is it that we can see the dog in the image quickly, yet a computer cannot perform this discrimination? This question is especially poignant when we consider that the switching time of the components in modern electronic computers is more than seven orders of magnitude faster than that of the cells that comprise our neurobiological systems.


Figure 1.1  The picture is an example of a complex pattern. Notice how the image of the object in the foreground blends with the background clutter. Yet, there is enough information in this picture to enable us to perceive the image of a commonly recognizable object. Source: Photo courtesy of Ron James.

This question is partially answered by the fact that the architecture of the human brain is significantly different from the architecture of a conventional computer. Whereas the response time of the individual neural cells is typically on the order of a few tens of milliseconds, the massive parallelism and interconnectivity observed in the biological systems evidently account for the ability of the brain to perform complex pattern recognition in a few hundred milliseconds.

In many real-world applications, we want our computers to perform complex pattern recognition problems, such as the one just described. Since our conventional computers are obviously not suited to this type of problem, we therefore borrow features from the physiology of the brain as the basis for our new processing models. Hence, the technology has come to be known as artificial neural systems (ANS) technology, or simply neural networks. Perhaps the models we discuss here will enable us eventually to produce machines that can interpret complex patterns such as the one in Figure 1.1.

In the next section, we will discuss aspects of neurophysiology that contribute to the ANS models we will examine. Before we do that, let's first consider how an ANS might be used to formulate a computer solution to a pattern-matching problem similar to, but much simpler than, the problem of recognizing the dalmatian in Figure 1.1.


Specifically, the problem we will address is recognition of hand-drawn alphanumeric characters. This example is particularly interesting for two reasons:

• Even though a character set can be defined rigorously, people tend to personalize the manner in which they write the characters. This subtle variation in style is difficult to deal with when an algorithmic pattern-matching approach is used, because it combinatorially increases the size of the legal input space to be examined.

• As we will see in later chapters, the neural-network approach to solving the problem not only can provide a feasible solution, but also can be used to gain insight into the nature of the problem.

We begin by defining a neural-network structure as a collection of parallel processors connected together in the form of a directed graph, organized such that the network structure lends itself to the problem being considered. Referring to Figure 1.2 as a typical network diagram, we can schematically represent each processing element (or unit) in the network as a node, with connections between units indicated by the arcs. We shall indicate the direction of information flow in the network through the use of the arrowheads on the connections.

To simplify our example, we will restrict the number of characters the neural network must recognize to the 10 decimal digits, 0, 1, ..., 9, rather than using the full ASCII character set. We adopt this constraint only to clarify the example; there is no reason why an ANS could not be used to recognize all characters, regardless of case or style.

Since our objective is to have the neural network determine which of the 10 digits a particular hand-drawn character is, we can create a network structure that has 10 discrete output units (or processors), one for each character to be identified. This strategy simplifies the character-discrimination function of the network, as it allows us to use a network that contains binary units on the output layer (e.g., for any given input pattern, our network should activate one and only one of the 10 output units, representing which of the 10 digits that we are attempting to recognize the input most resembles). Furthermore, if we insist that the output units behave according to a simple on-off strategy, the process of converting an input signal to an output signal becomes a simple majority function.

Based on these considerations, we now know that our network should contain 10 binary units as its output structure. Similarly, we must determine how we will model the character input for the network.
Keeping in mind that we have already indicated a preference for binary output units, we can again simplify our task if we model the input data as a vector containing binary elements, which will allow us to use a network with only one type of processing unit. To create this type of input, we borrow an idea from the video world and pixelize the character. We will arbitrarily size the pixel image as a 10 x 8 matrix, using a 1 to represent a pixel that is "on," and a 0 to represent a pixel that is "off."


Figure 1.2  This schematic represents the character-recognition problem described in the text. In this example, application of an input pattern on the bottom layer of processors can cause many of the second-layer, or hidden-layer, units to activate. The activity on the hidden layer should then cause exactly one of the output-layer units to activate—the one associated with the pattern being identified. You should also note the large number of connections needed for this relatively small network.

Furthermore, we can dissect this matrix into a set of row vectors, which can then be concatenated into a single row vector of dimension 80. Thus, we have now defined the dimension and characteristics of the input pattern for our network.

At this point, all that remains is to size the number of processing units (called hidden units) that must be used internally, to connect them to the input and output units already defined using weighted connections, and to train the network with example data pairs.¹ This concept of learning by example is extremely important. As we shall see, a significant advantage of an ANS approach to solving a problem is that we need not have a well-defined process for algorithmically converting an input to an output. Rather, all that we need for most networks is a collection of representative examples of the desired translation. The ANS then adapts itself to reproduce the desired outputs when presented with the example inputs.

¹ Details of how this training is accomplished will occupy much of the remainder of the text.
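To make the input and output encodings just described concrete, the following C fragment sketches how a 10 x 8 character bitmap might be flattened into the 80-element binary input vector, and how the 10 output units might be read to identify the digit. This is only a minimal illustrative sketch; the names pixelize and read_digit are our own and do not correspond to the simulator routines developed in later chapters.

    #define ROWS  10
    #define COLS   8
    #define N_IN  (ROWS * COLS)   /* 80 input elements          */
    #define N_OUT 10              /* one output unit per digit  */

    /* Concatenate the row vectors of a 10 x 8 pixel image into a
       single input vector of dimension 80, as described in the text. */
    void pixelize(const int image[ROWS][COLS], int vector[N_IN])
    {
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                vector[r * COLS + c] = image[r][c] ? 1 : 0;
    }

    /* Read the 10 output units: ideally exactly one unit is active,
       and its index identifies the digit; if several units show some
       activity, the unit with the largest activation wins.           */
    int read_digit(const double output[N_OUT])
    {
        int best = 0;
        for (int i = 1; i < N_OUT; i++)
            if (output[i] > output[best])
                best = i;
        return best;
    }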


In addition, as our example network illustrates, an ANS is robust in the sense that it will respond with an output even when presented with inputs that it has never seen before, such as patterns containing noise. If the input noise has not obliterated the image of the character, the network will produce a good guess using those portions of the image that were not obscured and the information that it has stored about how the characters are supposed to look. The inherent ability to deal with noisy or obscured patterns is a significant advantage of an ANS approach over a traditional algorithmic solution. It also illustrates a neural-network maxim: The power of an ANS approach lies not necessarily in the elegance of the particular solution, but rather in the generality of the network to find its own solution to particular problems, given only examples of the desired behavior.

Once our network is trained adequately, we can show it images of numerals written by people whose writing was not used to train the network. If the training has been adequate, the information propagating through the network will result in a single element at the output having a binary 1 value, and that unit will be the one that corresponds to the numeral that was written. Figure 1.3 illustrates characters that the trained network can recognize, as well as several it cannot.

In the previous discussion, we alluded to two different types of network operation: training mode and production mode. The distinct nature of these two modes of operation is another useful feature of ANS technology. If we note that the process of training the network is simply a means of encoding information about the problem to be solved, and that the network spends most of its productive time being exercised after the training has completed, we will have uncovered a means of allowing automated systems to evolve without explicit reprogramming.

As an example of how we might benefit from this separation, consider a system that utilizes a software simulation of a neural network as part of its programming. In this case, the network would be modeled in the host computer system as a set of data structures that represents the current state of the network. The process of training the network is simply a matter of altering the connection weights systematically to encode the desired input-output relationships.
If we code the network simulator such that the data structures used by the network are allocated dynamically, and are initialized by reading of connection-weight data from a disk file, we can also create a network simulator with a similar structure in another, off-line computer system. When the on-line system must change to satisfy new operational requirements, we can develop the new connection weights off-line by training the network simulator in the remote system. Later, we can update the operational system by simply changing the connection-weight initialization file from the previous version to the new version produced by the off-line system.
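As a minimal sketch of this idea, the following C function allocates the connection-weight storage dynamically and initializes it from a disk file. The assumed file format (a count followed by that many floating-point weights) and the name load_weights are illustrative only; they are not the interface of the simulators developed later in the book.

    #include <stdio.h>
    #include <stdlib.h>

    /* Allocate a weight array dynamically and initialize it from a file
       produced by the off-line training system.  Returns NULL on error. */
    double *load_weights(const char *filename, size_t *count)
    {
        FILE *fp = fopen(filename, "r");
        if (fp == NULL)
            return NULL;

        if (fscanf(fp, "%zu", count) != 1) {
            fclose(fp);
            return NULL;
        }

        double *weights = malloc(*count * sizeof *weights);
        if (weights != NULL) {
            for (size_t i = 0; i < *count; i++) {
                if (fscanf(fp, "%lf", &weights[i]) != 1) {
                    free(weights);
                    weights = NULL;
                    break;
                }
            }
        }

        fclose(fp);
        return weights;
    }

Updating the operational system then amounts to nothing more than replacing the weight file and rerunning this initialization; the program itself does not change.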


Figure 1.3  Handwritten characters vary greatly. (a) These characters were recognized by the network in Figure 1.2; (b) these characters were not recognized.

These examples hint at the ability of neural networks to deal with complex pattern-recognition problems, but they are by no means indicative of the limits of the technology. In later chapters, we will describe networks that can be used to diagnose problems from symptoms, networks that can adapt themselves to model a topological mapping accurately, and even networks that can learn to recognize and reproduce a temporal sequence of patterns. All these networks are based on the simple building blocks discussed previously, and derived from the topics we shall discuss in the next two sections.

Finally, the distinction made between the artificial and natural systems is intentional. We cannot overemphasize the fact that the ANS models we will examine bear only a perfunctory resemblance to their biological counterparts. What is important about these models is that they all exhibit the useful behaviors of learning, recognizing, and applying relationships between objects and patterns of objects in the real world. In this regard, they provide us with a whole new set of tools that we can use to solve "difficult" problems.


1.1 ELEMENTARY NEUROPHYSIOLOGY

From time to time throughout this text, we shall cite specific results from neurobiology that pertain to a particular ANS architecture. There are also basic concepts that have a more universal significance. In this regard, we look first at individual neurons, then at the synaptic junctions between neurons. We describe the McCulloch-Pitts model of neural computation, and examine its specific relationship to our neural-network models. We finish the section with a look at Hebb's theory of learning. Bear in mind that the following discussion is a simplified overview; the subject of neurophysiology is vastly more complicated than is the picture we paint here.

1.1.1 Single-Neuron Physiology

Figure 1.4 depicts the major components of a typical nerve cell in the central nervous system. The membrane of a neuron separates the intracellular plasma from the interstitial fluid external to the cell. The membrane is permeable to certain ionic species, and acts to maintain a potential difference between the intracellular fluid and the extracellular fluid.

Figure 1.4  The major structures of a typical nerve cell include dendrites, the cell body, and a single axon. The axon of many neurons is surrounded by a membrane called the myelin sheath. Nodes of Ranvier interrupt the myelin sheath periodically along the length of the axon. Synapses connect the axons of one neuron to various parts of other neurons.


The membrane accomplishes this task primarily by the action of a sodium-potassium pump. This mechanism transports sodium ions out of the cell and potassium ions into the cell. Other ionic species present are chloride ions and negative organic ions.

All the ionic species can diffuse across the cell membrane, with the exception of the organic ions, which are too large. Since the organic ions cannot diffuse out of the cell, their net negative charge makes chloride diffusion into the cell unfavorable; thus, there will be a higher concentration of chloride ions outside of the cell. The sodium-potassium pump forces a higher concentration of potassium inside the cell and a higher concentration of sodium outside the cell.

The cell membrane is selectively more permeable to potassium ions than to sodium ions. The chemical gradient of potassium tends to cause potassium ions to diffuse out of the cell, but the strong attraction of the negative organic ions tends to keep the potassium inside. The result of these opposing forces is that an equilibrium is reached where there are significantly more sodium and chloride ions outside the cell, and more potassium and organic ions inside the cell. Moreover, the resulting equilibrium leaves a potential difference across the cell membrane of about 70 to 100 millivolts (mV), with the intracellular fluid being more negative. This potential, called the resting potential of the cell, is depicted schematically in Figure 1.5.

Figure 1.5  This figure illustrates the resting potential developed across the cell membrane of a neuron. The relative sizes of the labels for the ionic species indicate roughly the relative concentration of each species in the regions internal and external to the cell.

Figure 1.6 illustrates a neuron with several incoming connections, and the potentials that occur at various locations. The figure shows the axon with a covering called a myelin sheath. This insulating layer is interrupted at various points by the nodes of Ranvier.

Excitatory inputs to the cell reduce the potential difference across the cell membrane. The resulting depolarization at the axon hillock alters the permeability of the cell membrane to sodium ions. As a result, there is a large influx of positive sodium ions into the cell, contributing further to the depolarization. This self-generating effect results in the action potential.


Figure 1.6  Connections to the neuron from other neurons occur at various locations on the cell that are known as synapses. Nerve impulses through these connecting neurons can result in local changes in the potential in the cell body of the receiving neuron. These potentials, called graded potentials or input potentials, can spread through the main body of the cell. They can be either excitatory (decreasing the polarization of the cell) or inhibitory (increasing the polarization of the cell). The input potentials are summed at the axon hillock. If the amount of depolarization at the axon hillock is sufficient, an action potential is generated; it travels down the axon away from the main cell body.

Nerve fibers themselves are poor conductors. The transmission of the action potential down the axon is a result of a sequence of depolarizations that occur at the nodes of Ranvier. As one node depolarizes, it triggers the depolarization of the next node. The action potential travels down the fiber in a discontinuous fashion, from node to node. Once an action potential has passed a given point, that point is incapable of being reexcited for about 1 millisecond, while it is restored to its resting potential. This refractory period limits the frequency of nerve-pulse transmission to about 1000 per second.


1.1.2 The Synaptic Junction

Let's take a brief look at the activity that occurs at the connection between two neurons, called the synaptic junction or synapse. Communication between neurons occurs as a result of the release by the presynaptic cell of substances called neurotransmitters, and of the subsequent absorption of these substances by the postsynaptic cell. Figure 1.7 shows this activity. When the action potential arrives at the presynaptic membrane, changes in the permeability of the membrane cause an influx of calcium ions. These ions cause the vesicles containing the neurotransmitters to fuse with the presynaptic membrane and to release their neurotransmitters into the synaptic cleft.

Figure 1.7  Neurotransmitters are held in vesicles near the presynaptic membrane. These chemicals are released into the synaptic cleft and diffuse to the postsynaptic membrane, where they are subsequently absorbed.


The neurotransmitters diffuse across the junction and join to the postsynaptic membrane at certain receptor sites. The chemical action at the receptor sites results in changes in the permeability of the postsynaptic membrane to certain ionic species. An influx of positive species into the cell will tend to depolarize the resting potential; this effect is excitatory. If negative ions enter, a hyperpolarization effect occurs; this effect is inhibitory. Both effects are local effects that spread a short distance into the cell body and are summed at the axon hillock. If the sum is greater than a certain threshold, an action potential is generated.

1.1.3 Neural Circuits and Computation

Figure 1.8 illustrates several basic neural circuits that are found in the central nervous system. Figures 1.8(a) and (b) illustrate the principles of divergence and convergence in neural circuitry. Each neuron sends impulses to many other neurons (divergence), and receives impulses from many neurons (convergence). This simple idea appears to be the foundation for all activity in the central nervous system, and forms the basis for most neural-network models that we shall discuss in later chapters.

Notice the feedback paths in the circuits of Figure 1.8(b), (c), and (d). Since synaptic connections can be either excitatory or inhibitory, these circuits facilitate control systems having either positive or negative feedback. Of course, these simple circuits do not adequately portray the vast complexity of neuroanatomy.

Figure 1.8 These schematics show examples of neural circuits in the central nervous system. The cell bodies (including the dendrites) are represented by the large circles. Small circles appear at the ends of the axons. Illustrated in (a) and (b) are the concepts of divergence and convergence. Shown in (b), (c), and (d) are examples of circuits with feedback paths.

Now that we have an idea of how individual neurons operate and of how they are put together, we can pose a fundamental question: How do these relatively simple concepts combine to give the brain its enormous abilities? The first significant attempt to answer this question was made in 1943, through the seminal work by McCulloch and Pitts [24]. This work is important for many reasons, not the least of which is that the investigators were the first people to treat the brain as a computational organism.

The McCulloch-Pitts theory is founded on five assumptions:

1. The activity of a neuron is an all-or-none process.
2. A certain fixed number of synapses (> 1) must be excited within a period of latent addition for a neuron to be excited.
3. The only significant delay within the nervous system is synaptic delay.
4. The activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time.
5. The structure of the interconnection network does not change with time.

Assumption 1 identifies the neurons as being binary: They are either on or off. We can therefore define a predicate, N_i(t), which denotes the assertion that the ith neuron fires at time t.
The notation ¬N_i(t) denotes the assertion that the ith neuron did not fire at time t. Using this notation, we can describe


the action of certain networks using propositional logic. Figure 1.9 shows five simple networks. We can write simple propositional expressions to describe the behavior of the first four (the fifth one appears in Exercise 1.1). Figure 1.9(a) describes precession: neuron 2 fires after neuron 1. The expression is N_2(t) = N_1(t − 1). Similarly, the expressions for parts (b) through (d) of this figure are

• N_3(t) = N_1(t − 1) ∨ N_2(t − 1) (disjunction),
• N_3(t) = N_1(t − 1) & N_2(t − 1) (conjunction), and
• N_3(t) = N_1(t − 1) & ¬N_2(t − 1) (conjoined negation).

One of the powerful proofs in this theory was that any network that does not have feedback connections can be described in terms of combinations of these four


simple expressions, and vice versa. Figure 1.9(e) is an example of a network made from a combination of the networks in parts (a) through (d).

Figure 1.9 These drawings are examples of simple McCulloch-Pitts networks that can be defined in terms of the notation of propositional logic. Large circles with labels represent cell bodies. The small, filled circles represent excitatory connections; the small, open circles represent inhibitory connections. The networks illustrate (a) precession, (b) disjunction, (c) conjunction, and (d) conjoined negation. Shown in (e) is a combination of networks (a)-(d).

Although the McCulloch-Pitts theory has turned out not to be an accurate model of brain activity, the importance of the work cannot be overstated. The theory helped to shape the thinking of many people who were influential in the development of modern computer science. As Anderson and Rosenfeld point out, one critical idea was left unstated in the McCulloch-Pitts paper: Although neurons are simple devices, great computational power can be realized when these neurons are suitably connected and are embedded within the nervous system [2].
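The four primitive expressions are simple to try out in code. The short sketch below is illustrative only; the function names, the two-input restriction, and the exhaustive loop over firing states are our own assumptions and are not part of the McCulloch-Pitts formulation itself.

    /* A minimal sketch of the McCulloch-Pitts primitives in C.          */
    /* Each neuron is binary (0 or 1); each output at time t depends     */
    /* only on the firing state of the input neurons at time t-1.        */
    #include <stdio.h>

    int precession(int n1_prev)                { return n1_prev; }             /* N2(t) = N1(t-1)              */
    int disjunction(int n1_prev, int n2_prev)  { return n1_prev || n2_prev; }  /* N3(t) = N1(t-1) v N2(t-1)    */
    int conjunction(int n1_prev, int n2_prev)  { return n1_prev && n2_prev; }  /* N3(t) = N1(t-1) & N2(t-1)    */
    int conjoined_negation(int n1_prev, int n2_prev)
                                               { return n1_prev && !n2_prev; } /* N3(t) = N1(t-1) & not N2(t-1) */

    int main(void)
    {
        int n1, n2;
        for (n1 = 0; n1 <= 1; n1++)
            for (n2 = 0; n2 <= 1; n2++)
                printf("N1=%d N2=%d : prec=%d disj=%d conj=%d conneg=%d\n",
                       n1, n2,
                       precession(n1),
                       disjunction(n1, n2),
                       conjunction(n1, n2),
                       conjoined_negation(n1, n2));
        return 0;
    }

Running the loop over all four firing histories reproduces the truth tables implied by the propositional expressions above.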


Exercise 1.1: Write the propositional expressions for N_3(t) and N_4(t) of Figure 1.9(e).

Exercise 1.2: Construct McCulloch-Pitts networks for the following expressions:

1. N_3(t) = N_2(t − 2) & ¬N_1(t − 3)
2. N_4(t) = [N_2(t − 1) & ¬N_1(t − 1)] ∨ [N_3(t − 1) & ¬N_1(t − 1)] ∨ [N_2(t − 1) & N_3(t − 1)]

1.1.4 Hebbian Learning

Biological neural systems are not born preprogrammed with all the knowledge and abilities that they will eventually have. A learning process that takes place over a period of time somehow modifies the network to incorporate new information. In the previous section, we began to see how a relatively simple neuron might result in a sophisticated computational device. In this section, we shall explore a relatively simple learning theory that suggests an elegant answer to this question: How do we learn?

The basic theory comes from a 1949 book by Hebb, Organization of Behavior. The main idea was stated in the form of an assumption, which we reproduce here for historical interest:

When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. [10, p. 50]

As with the McCulloch-Pitts model, this learning law does not tell the whole story. Nevertheless, it appears in one form or another in many of the neural-network models that exist today.

To illustrate the basic idea, we consider the example of classical conditioning, using the familiar experiment of Pavlov. Figure 1.10 shows three idealized neurons that participate in the process.

Suppose that the excitation of C, caused by the sight of food, is sufficient to excite B, causing salivation. Furthermore, suppose that, in the absence of additional stimulation, the excitation of A, resulting from hearing a bell, is not sufficient to cause the firing of B.

Let's allow C to cause B to fire by showing food to the subject, and while B is still firing, stimulate A by ringing a bell. Because B is still firing, A is now participating in the excitation of B, even though by itself A would be insufficient to cause B to fire. In this situation, Hebb's assumption dictates that some change occur between A and B, so that A's influence on B is increased.


Figure 1.10 Two neurons, A and C, are stimulated by the sensory inputs of sound and sight, respectively. The third neuron, B, causes salivation. The two synaptic junctions are labeled S_BA and S_BC.

If the experiment is repeated often enough, A will eventually be able to cause B to fire even in the absence of the visual stimulation from C. Then, if the bell is rung, but no food is shown, salivation will still occur, because the excitation due to A alone is now sufficient to cause B to fire.

Because the connection between neurons is through the synapse, it is reasonable to guess that whatever changes occur during learning take place there. Hebb theorized that the area of the synaptic junction increased. More recent theories assert that an increase in the rate of neurotransmitter release by the presynaptic cell is responsible. In any event, changes certainly occur at the synapse. If either the pre- or postsynaptic cell were altered as a whole, other responses could be reinforced that are unrelated to the conditioning experiment.

Thus we conclude our brief look at neurophysiology. Before moving on, however, we reiterate a caution and issue a challenge to you. On the one hand, although there are many analogies between the basic concepts of neurophysiology and the neural-network models described in this book, we caution you not to portray these systems as actually modeling the brain. We prefer to say that these networks have been inspired by our current understanding of neurophysiology. On the other hand, it is often too easy for engineers, in their pursuit of solutions to specific problems, to ignore completely the neurophysiological foundations of the technology. We believe that this tendency is unfortunate. Therefore, we challenge ANS practitioners to keep abreast of the developments in neurobiology so as to be able to incorporate significant results into their systems. After all, what better model is there than the one example of a neural network with existing capabilities that far surpass any of our artificial systems?
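Before leaving neurophysiology, a small numerical illustration of Hebb's postulate in the conditioning experiment may help. In the sketch below, the connection from the bell neuron A to the salivation neuron B is strengthened whenever A and B fire together; the threshold, learning rate, and initial weights are our own illustrative assumptions, not values given in the text.

    /* A minimal Hebbian-conditioning sketch (illustrative values only).   */
    /* B fires when its net input exceeds a threshold.  Whenever A (bell)  */
    /* and B (salivation) are active together, the weight w_BA increases,  */
    /* so that eventually the bell alone is enough to fire B.              */
    #include <stdio.h>

    int main(void)
    {
        double w_BA  = 0.2;     /* bell -> B, initially too weak        */
        double w_BC  = 1.0;     /* sight of food -> B, strong           */
        double theta = 0.8;     /* firing threshold of B (assumed)      */
        double eta   = 0.1;     /* learning rate (assumed)              */
        int trial;

        for (trial = 1; trial <= 10; trial++) {
            int a = 1, c = 1;                        /* bell and food presented together */
            int b = (w_BA * a + w_BC * c) > theta;   /* does B fire?                     */
            if (a && b)
                w_BA += eta;                         /* Hebbian reinforcement of S_BA    */
            printf("trial %2d: w_BA = %.2f, bell alone fires B? %s\n",
                   trial, w_BA, (w_BA > theta) ? "yes" : "no");
        }
        return 0;
    }

After a few paired presentations the bell-to-B weight alone exceeds the threshold, which is the behavioral signature of the conditioned response described above.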


Exercise 1.3: The analysis of high-dimensional data sets is often a complex task. One way to simplify the task is to use the Karhunen-Loeve (KL) matrix, which is defined as

K_ij = (1/N) Σ_μ x_i^μ x_j^μ

where N is the number of vectors, and x_i^μ is the ith component of the μth vector. The KL matrix extracts the principal components, or directions of maximum information (correlation), from a data set. Determine the relationship between the KL formulation and the popular version of the Hebb rule known as the Oja rule:

dw_i(t)/dt = O(t) I_i(t) − O^2(t) w_i(t)

where O(t) is the output of a simple, linear processing element; I_i(t) are the inputs; and w_i(t) are the synaptic strengths. (This exercise was suggested by Dr. Daniel Kammen, California Institute of Technology.)

1.2 FROM NEURONS TO ANS

In this section, we make a transition from some of the ideas gleaned from neurobiology to the idealized structures that form the basis of most ANS models. We first describe a general artificial neuron that incorporates most features we shall need for future discussions of specific models. Later in the section, we take a brief look at a particular example of an ANS called the perceptron. The perceptron was the result of an early attempt to simulate neural computation in order to perform complex tasks. We shall examine in particular what several limitations of this approach are and how they might be overcome.

1.2.1 The General Processing Element

The individual computational elements that make up most artificial neural-system models are rarely called artificial neurons; they are more often referred to as nodes, units, or processing elements (PEs). All these terms are used interchangeably throughout this book.

Another point to bear in mind is that it is not always appropriate to think of the processing elements in a neural network as being in a one-to-one relationship with actual biological neurons. It is sometimes better to imagine a single processing element as representative of the collective activity of a group of neurons. Not only will this interpretation help us to avoid the trap of speaking as though our systems were actual brain models, but also it will make the problem more tractable when we are attempting to model the behavior of some biological structure.

Figure 1.11 shows our general PE model. Each PE is numbered, the one in the figure being the ith. Having cautioned you not to make too many biological


analogies, we shall now ignore our own advice and make a few ourselves. For example, like a real neuron, the PE has many inputs, but has only a single output, which can fan out to many other PEs in the network. The input the ith PE receives from the jth PE is indicated as x_j (note that this value is also the output of the jth node, just as the output generated by the ith node is labeled x_i). Each connection to the ith PE has associated with it a quantity called a weight or connection strength. The weight on the connection from the jth node to the ith node is denoted w_ij. All these quantities have analogues in the standard neuron model: The output of the PE corresponds to the firing frequency of the neuron, and the weight corresponds to the strength of the synaptic connection between neurons. In our models, these quantities will be represented as real numbers.

Figure 1.11 This structure represents a single PE in a network. The input connections are modeled as arrows from other processing elements. Each input connection has associated with it a quantity, w_ij, called a weight. There is a single output value, which can fan out to other units.


Notice that the inputs to the PE are segregated into various types. This segregation acknowledges that a particular input connection may have one of several effects. An input connection may be excitatory or inhibitory, for example. In our models, excitatory connections have positive weights, and inhibitory connections have negative weights. Other types are possible. The terms gain, quenching, and nonspecific arousal describe other, special-purpose connections; the characteristics of these other connections will be described later in the book. Excitatory and inhibitory connections are usually considered together, and constitute the most common forms of input to a PE.

Each PE determines a net-input value based on all its input connections. In the absence of special connections, we typically calculate the net input by summing the input values, gated (multiplied) by their corresponding weights. In other words, the net input to the ith unit can be written as

net_i = Σ_j x_j w_ij     (1.1)

where the index, j, runs over all connections to the PE. Note that excitation and inhibition are accounted for automatically by the sign of the weights. This sum-of-products calculation plays an important role in the network simulations that we will be describing later. Because there is often a very large number of interconnects in a network, the speed at which this calculation can be performed usually determines the performance of any given network simulation.

Once the net input is calculated, it is converted to an activation value, or simply activation, for the PE. We can write this activation value as

a_i(t) = F_i(a_i(t − 1), net_i(t))     (1.2)

to denote that the activation is an explicit function of the net input. Notice that the current activation may depend on the previous value of the activation, a_i(t − 1). (Because of the emphasis on digital simulations in this text, we generally consider time to be measured in discrete steps; the notation t − 1 indicates one time step prior to time t.) We include this dependence in the definition for generality. In the majority of cases, the activation and net input are identical, and the terms often are used interchangeably. Sometimes, activation and net input are not the same, and we must pay attention to the difference. For the most part, however, we will be able to use activation to mean net input, and vice versa.

Once the activation of the PE is calculated, we can determine the output value by applying an output function:

x_i = f_i(a_i)     (1.3)

Since, usually, a_i = net_i, this function is normally written as

x_i = f_i(net_i)     (1.4)

One reason for belaboring the issue of activation versus net input is that the term activation function is sometimes used to refer to the function, f_i, that


converts the net input value, net_i, to the node's output value, x_i. In this text, we shall consistently use the term output function for f_i() of Eqs. (1.3) and (1.4). Be aware, however, that the literature is not always consistent in this respect.

When we are describing the mathematical basis for network models, it will often be useful to think of the network as a dynamical system—that is, as a system that evolves over time. To describe such a network, we shall write differential equations that describe the time rate of change of the outputs of the various PEs. For example, ẋ_i = g_i(x_i, net_i) represents a general differential equation for the output of the ith PE, where the dot above the x refers to differentiation with respect to time. Since net_i depends on the outputs of many other units, we actually have a system of coupled differential equations.

As an example, let's look at the equation

ẋ_i = −x_i + f_i(net_i)

for the output of the ith processing element. We apply some input values to the PE so that net_i > 0. If the inputs remain for a sufficiently long time, the output value will reach an equilibrium value, when ẋ_i = 0, given by

x_i = f_i(net_i)

which is identical to Eq. (1.4). We can often assume that input values remain until equilibrium has been achieved.

Once the unit has a nonzero output value, removal of the inputs will cause the output to return to zero. If net_i = 0, then

ẋ_i = −x_i

which means that x_i → 0.

It is also useful to view the collection of weight values as a dynamical system. Recall the discussion in the previous section, where we asserted that learning is a result of the modification of the strength of synaptic junctions between neurons. In an ANS, learning usually is accomplished by modification of the weight values. We can write a system of differential equations for the weight values, ẇ_ij = G_i(w_ij, x_i, x_j, ...), where G_i represents the learning law. The learning process consists of finding weights that encode the knowledge that we want the system to learn. For most realistic systems, it is not easy to determine a closed-form solution for this system of equations. Techniques exist, however, that result in an acceptable approximation to a solution. Proving the existence of stable solutions to such systems of equations is an active area of research in neural networks today, and probably will continue to be so for some time.
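The approach to equilibrium described above is easy to see numerically. The sketch below integrates ẋ_i = −x_i + f_i(net_i) with a simple Euler step; the step size, the particular output function, and the net-input value are our own illustrative choices, not prescriptions from the text.

    /* A minimal Euler-integration sketch of  dx/dt = -x + f(net).        */
    /* With a constant net input, x relaxes to f(net); with net = 0, the  */
    /* output decays back toward zero.                                    */
    #include <stdio.h>

    static double f(double net) { return net > 0.0 ? net : 0.0; }  /* assumed output function */

    int main(void)
    {
        double x = 0.0, net = 0.75, dt = 0.1;
        int step;

        for (step = 0; step < 100; step++)   /* input present: x approaches f(net) = 0.75 */
            x += dt * (-x + f(net));
        printf("with input applied,  x = %f\n", x);

        net = 0.0;
        for (step = 0; step < 100; step++)   /* input removed: x decays toward 0          */
            x += dt * (-x + f(net));
        printf("after input removed, x = %f\n", x);
        return 0;
    }

The first loop settles very close to f(net), matching Eq. (1.4); the second loop shows the output returning to zero once the inputs are removed, as claimed above.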


units, the outputs of that layer can be thought of as an n-dimensional vector, x = (x_1, x_2, ..., x_n)^t, where the t superscript means transpose. In our notation, vectors written in boldface type, such as x, will be assumed to be column vectors. When they are written in row form, the transpose symbol will be added to indicate that the vector is actually to be thought of as a column vector. Conversely, the notation x^t indicates a row vector.

Suppose the n-dimensional output vector of the previous paragraph provides the input values to each unit in an m-dimensional layer (a layer with m units). Each unit on the m-dimensional layer will have n weights associated with the connections from the previous layer. Thus, there are m n-dimensional weight vectors associated with this layer; there is one n-dimensional weight vector for each of the m units. The weight vector of the ith unit can be written as w_i = (w_i1, w_i2, ..., w_in)^t. A superscript can be added to the weight notation to distinguish between weights on different layers.

The net input to the ith unit can be written in terms of the inner product, or dot product, of the input vector and the weight vector. For vectors of equal dimensions, the inner product is defined as the sum of the products of the corresponding components of the two vectors. In the notation of the previous section,

net_i = Σ_{j=1..n} x_j w_ij

where n is the number of connections to the ith unit. This equation can be written succinctly in vector notation as

net_i = x · w_i

or

net_i = x^t w_i

Also note that, because of the rules of multiplication of vectors, x^t w_i = w_i^t x.

We shall often speak of input vectors and output vectors and weight vectors, but we tend to reserve the vector notation for cases where it is particularly appropriate. Additional vector concepts will be introduced later as needed. In the next section, we shall use the notation presented here to describe a neural-network model that has an important place in history: the perceptron.

1.2.3 The Perceptron: Part 1

The device known as the perceptron was invented by psychologist Frank Rosenblatt in the late 1950s. It represented his attempt to "illustrate some of the fundamental properties of intelligent systems in general, without becoming too


deeply enmeshed in the special, and frequently unknown, conditions which hold for particular biological organisms" [29, p. 387]. Rosenblatt believed that the connectivity that develops in biological networks contains a large random element. Thus, he took exception to previous analyses, such as the McCulloch-Pitts model, where symbolic logic was employed to analyze rather idealized structures. Rather, Rosenblatt believed that the most appropriate analysis tool was probability theory. He developed a theory of statistical separability that he used to characterize the gross properties of these somewhat randomly interconnected networks.

The photoperceptron is a device that responds to optical patterns. We show an example in Figure 1.12. In this device, light impinges on the sensory (S) points of the retina structure. Each S point responds in an all-or-nothing manner to the incoming light. Impulses generated by the S points are transmitted to the associator (A) units in the association layer. Each A unit is connected to a random set of S points, called the A unit's source set, and the connections may be either excitatory or inhibitory. The connections have the possible values, +1, −1, and 0. When a stimulus pattern appears on the retina, an A unit becomes active if the sum of its inputs exceeds some threshold value. If active, the A unit produces an output, which is sent to the next layer of units.

Figure 1.12 A simple photoperceptron has a sensory (S) area, an association (A) area, and a response (R) area. The connections shown between units in the various areas are illustrative, and are not meant to be an exhaustive representation.

In a similar manner, A units are connected to response (R) units in the response layer. The pattern of connectivity is again random between the layers, but there is the addition of inhibitory feedback connections from the response


a reinforcement process whereby the output of A units was either increased or decreased depending on whether or not the A units contributed to the correct response of the perceptron for a given pattern. A pattern was applied to the retina, and the stimulus was propagated through the layers until a response unit was activated. If the correct response unit was active, the output of the contributing A units was increased. If the incorrect R unit was active, the output of the contributing A units was decreased.

Using such a scheme, Rosenblatt was able to show that the perceptron could classify patterns successfully in what he termed a differentiated environment, where each class consisted of patterns that were in some sense similar to one another. The perceptron was also able to respond consistently to random patterns, but its accuracy diminished as the number of patterns that it attempted to learn increased.

Rosenblatt's work resulted in the proof of an important result known as the perceptron convergence theorem. The theorem is proved for a perceptron with one R unit that is learning to differentiate patterns of two distinct classes. It states, in essence, that, if the classification can be learned by the perceptron, then the procedure we have described guarantees that it will be learned in a finite number of training cycles.

Unfortunately, perceptrons caused a fair amount of controversy at the time they were described. Unrealistic expectations and exaggerated claims no doubt played a part in this controversy. The end result was that the field of artificial neural networks was almost entirely abandoned, except by a few die-hard researchers. We hinted at one of the major problems with perceptrons when we suggested that there were conditions attached to the successful operation of the perceptron. In the next section, we explore and evaluate these considerations.

Exercise 1.4: Consider a perceptron with one R unit and N_a association units, a_μ, which is attempting to learn to differentiate patterns, S_i, each of which falls into one of two categories. For one category, the R unit gives an output of +1; for the other, it gives an output of −1. Let w_μ be the output of the μth A unit. Further, let p_i be ±1, depending on the class of S_i, and let e_μi be 1 if a_μ is in the source set for S_i, and 0 otherwise. Show that the successful classification of patterns S_i requires that the following condition be satisfied:

p_i (Σ_μ e_μi w_μ − θ) > 0

where θ is the threshold value of the R unit.

1.2.4 The Perceptron: Part 2

In 1969, a book appeared that some people consider to have sounded the death knell for neural networks. The book was aptly entitled Perceptrons: An Introduction to Computational Geometry and was written by Marvin Minsky and


Seymour Papert, both of MIT [26]. They presented an astute and detailed analysis of the perceptron in terms of its capabilities and limitations. Whether their intention was to defuse popular support for neural-network research remains a matter for debate. Nevertheless, the analysis is as timely today as it was in 1969, and many of the conclusions and concerns raised continue to be valid.

In particular, one of the points made in the previous section—a point treated in detail in Minsky and Papert's book—is the idea that there are certain restrictions on the class of problems for which the perceptron is suitable. Perceptrons can differentiate patterns only if the patterns are linearly separable. The meaning of the term linearly separable should become clear shortly. Because many classification problems do not possess linearly separable classes, this condition places a severe restriction on the applicability of the perceptron.

Minsky and Papert departed from the probabilistic approach championed by Rosenblatt, and returned to the ideas of predicate calculus in their analysis of the perceptron. Their idealized perceptron appears in Figure 1.14.

Figure 1.14 The simple perceptron structure is similar in structure to the general processing element shown in Figure 1.11. Note the addition of a threshold condition on the output. If the net input is greater than the threshold value, the output of the device is +1; otherwise, the output is 0.

The set Φ = {φ_1, φ_2, ..., φ_n} is a set of predicates. In the predicates' simplest form, φ_i = 1 if the ith point of the retina is on, and φ_i = 0 otherwise. Each of the input predicates is weighted by a number from the set {α_1, α_2, ..., α_n}. The output of the perceptron is 1 when Σ_i α_i φ_i > θ,


where θ is the threshold value. This type of node is called a linear threshold unit.

Figure 1.15 shows a simple two-layer network of this type, whose output we would like to be the XOR function of its two inputs.

Figure 1.15 This two-layer network has two nodes on the input layer with input values x_1 and x_2 that can take on values of 0 or 1. We would like the network to be able to respond to the inputs such that the output o is the XOR function of the inputs, as indicated in the table.

    x_1   x_2   o
     0     0    0
     0     1    1
     1     0    1
     1     1    0

The output-node activation is

net = w_1 x_1 + w_2 x_2

and the output value o is

o = f(net) = 1 if w_1 x_1 + w_2 x_2 > θ, and 0 if w_1 x_1 + w_2 x_2 ≤ θ

The problem is to select values of the weights such that each pair of input values results in the proper output value. This task cannot be done. Let's look at the equation

θ = w_1 x_1 + w_2 x_2     (1.5)


This equation is the equation of a line in the x_1, x_2 plane. That plane is illustrated in Figure 1.16, along with the four points that are the possible inputs to the network. We can think of the problem as one of subdividing this space into regions and then attaching labels to the regions that correspond to the right answer for points in that region. We plot Eq. (1.5) for some values of θ, w_1, and w_2, as in Figure 1.16. The line can separate the plane into at most two distinct regions. We can then classify points in one region as belonging to the class having an output of 1, and those in the other region as belonging to the class having an output of 0; however, there is no way to arrange the position of the line so that the correct two points for each class both lie in the same region. (Try it.) The simple linear threshold unit cannot correctly perform the XOR function.

Figure 1.16 This figure shows the x_1, x_2 plane with the four points, (0,0), (1,0), (0,1), and (1,1), which make up the four input vectors for the XOR problem. The line θ = w_1 x_1 + w_2 x_2 divides the plane into two regions but cannot successfully isolate the set of points (0,0) and (1,1) from the points (0,1) and (1,0).

Exercise 1.5: A linear node is one whose output is equal to its activation. Show that a network such as the one in Figure 1.15, but with a linear output node, also is incapable of solving the XOR problem.

Before showing a way to overcome this difficulty, we digress for a moment to introduce the concept of hyperplanes. This idea shows up occasionally in the literature and can be useful in the evaluation of the performance of certain neural networks. We have already used the concept to analyze the XOR problem.


In familiar three-dimensional space, a plane is an object of two dimensions. A single plane can separate three-dimensional space into two distinct regions; two planes can result in three or four distinct regions, depending on their relative orientation, and so on. By extension, in an n-dimensional space, hyperplanes are objects of n − 1 dimensions. (An n-dimensional space is usually referred to as a hyperspace.) Suitable arrangement of hyperplanes allows an n-dimensional space to be partitioned into various distinct regions.

Many real problems involve the separation of regions of points in a hyperspace into individual categories, or classes, which must be distinguished from other classes. One way to make these distinctions is to select hyperplanes that separate the hyperspace into the proper regions. This task might appear difficult to perform in a high-dimensional space (higher than two, that is)—and it is. Fortunately, as we shall see later, certain neural networks can learn the proper partitioning, so we don't have to figure it out in advance.

In a general n-dimensional space, the equation of a hyperplane can be written as

Σ_i a_i x_i = C

where the a_i's and C are constants, with at least one a_i ≠ 0, and the x_i's are the coordinates of the space.

Exercise 1.6: What are the general equations for the hyperplanes in two- and three-dimensional spaces? What geometric figures do these equations describe?

Let's return to the XOR problem to see how we might approach a solution. The graph in Figure 1.16 suggests that we could partition the space correctly if we had three regions. One region would belong to one output class, and the other two would belong to the second output class. There is no reason why disjoint regions cannot belong to the same class. Figure 1.17 shows a network of linear threshold units that performs the proper partitioning, along with the corresponding hyperspace diagram. You should verify that the network does indeed give the correct results.
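The short program below carries out that verification. The particular weights and thresholds are one illustrative choice of linear threshold units that realizes the three-region partition; they are not necessarily the values shown in Figure 1.17. One hidden unit computes an OR-like boundary, the other an AND-like boundary, and the output unit responds only when the first is on and the second is off.

    /* A minimal check that a two-layer network of linear threshold units */
    /* solves XOR.  Weights and thresholds are illustrative choices.      */
    #include <stdio.h>

    static int threshold(double net, double theta) { return net > theta ? 1 : 0; }

    int main(void)
    {
        int x1, x2;
        for (x1 = 0; x1 <= 1; x1++) {
            for (x2 = 0; x2 <= 1; x2++) {
                int h1 = threshold(1.0 * x1 + 1.0 * x2, 0.5);  /* on if either input is on  */
                int h2 = threshold(1.0 * x1 + 1.0 * x2, 1.5);  /* on only if both are on    */
                int o  = threshold(1.0 * h1 - 1.0 * h2, 0.5);  /* on if h1 on and h2 off    */
                printf("x1=%d x2=%d  ->  o=%d\n", x1, x2, o);
            }
        }
        return 0;
    }

Running the loop over all four input patterns prints 0, 1, 1, 0, which is exactly the XOR column of the truth table given earlier.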


Figure 1.17 This network successfully solves the XOR problem. The hidden layer provides for two lines that can be used to separate the plane into three regions. The two regions containing the points (0,0) and (1,1) are associated with a network output of 0. The central region is associated with a network output of 1.

The addition of the two hidden-layer, or middle-layer, units gave the network the needed flexibility to solve the problem. In fact, the existence of this hidden layer gives us the ability to construct networks that can solve complex problems.

This simple example is not intended to imply that all criticisms of the perceptron could be answered by the addition of hidden layers in the structure. It is intended to suggest that the technology continues to evolve toward systems with increasingly powerful computational abilities. Nevertheless, many concerns raised by Minsky and Papert should not be dismissed lightly. You can refer to the epilog of the 1988 reprinting of their book for a synopsis [27]. We briefly describe a second concern here.

The subject of that concern is scaling. Many demonstrations of neural networks rely on the solution of what Minsky and Papert call toy problems—that is, problems that are only shadows of real-world items. Moving from these toy problems to real-world problems is often thought to be only a matter of time; we need only to wait until bigger, faster networks can be constructed. Several of the examples used in this text fall into the category of toy problems. Minsky and Papert claim that many networks suffer undesirable effects when scaled up to a large size. We raise this particular issue here, not because we necessarily believe that scale-up problems will defeat us, but because we wish to call attention to scaling as an issue that still must be resolved. We suspect that scaling problems do exist, but that there is a solution—perhaps one suggested by the architecture of the brain itself.

It seems plausible to us (and to Minsky and Papert) that the brain is composed of many different parallel, distributed systems, performing well-defined functions, but under the control of a serial-processing system at one or more


levels. To address the issue of scaling, we may need to learn how to combine small networks and to place them under the control of other networks.

Of course, a "small" network in the brain challenges our current simulation capabilities, so we do not know exactly what the limitations are. The technology, although over 30 years old at this writing, is still emerging and deserves close scrutiny. We should always be aware of both the strengths and the limitations of our tools.

1.3 ANS SIMULATION

We will now consider several techniques for simulating ANS processing models using conventional programming methodologies. After presenting the design guidelines and goals that you should consider when implementing your own neural-network simulators, we will discuss the data structures that will be used throughout the remainder of this text as the basis for the network-simulation algorithms presented as a part of each chapter.

1.3.1 The Need for ANS Simulation

Most of the ANS models that we will examine in subsequent chapters share the basic concepts of distributed and highly interconnected PEs. Each network model will build on these simple concepts, implementing a unique learning law, an interconnection scheme (e.g., fully interconnected, sparsely interconnected, unidirectional, and bidirectional), and a structure, to provide systems that are tailored to specific kinds of problems.

If we are to explore the possibilities of ANS technology, and to determine what its practical benefits and limitations are, we must develop a means of testing as many as possible of these different network models. Only then will we be able to determine accurately whether or not an ANS can be used to solve a particular problem. Unfortunately, we do not have access to a computer system designed specifically to perform massively parallel processing, such as is found in all the ANS models we will study. However, we do have access to a tool that can be programmed rapidly to perform any type of algorithmic process, including simulation of a parallel-processing system. This tool is the familiar sequential computer.

Because we will study several different neural-network architectures, it is important for us to consider the aspects of code portability and reusability early in the implementation of our simulator. Let us therefore focus our attention on the characteristics common to most of the ANS models, and implement those characteristics as data structures that will allow our simulator to migrate to the widest variety of network models possible. The processing that is unique to the different neural-network models can then be implemented to use the data structures we will develop here.
In this manner, we reduce to a minimum the amount of reprogramming needed to implement other network models.


1.3.2 Design Guidelines for Simulators

As we begin simulating neural networks, one of the first observations we will usually make is that it is necessary to design the simulation software such that the network can be sized dynamically. Even when we use only one of the network models described in this text, the ability to specify the number of PEs needed "on the fly," and in what organization, is paramount. The justification for this observation is based on the idea that it is not desirable to have to reprogram and recompile an ANS application simply because you want to change the network size. Since dynamic memory-allocation tools exist in most of the current generation of programming languages, we will use them to implement the network data structures.

The next observation you will probably make when designing your own simulator is that, at run time, the computer's central processing unit (CPU) will spend most of its time in the computation of the net_i, the input-activation term described earlier. To understand why this is so, consider how a uniprocessor computer will simulate a neural network. A program will have to be written to allow the CPU to time multiplex between units in the network; that is, each unit in the ANS model will share the CPU for some period. As the computer visits each node, it will perform the input computation and output translation function before moving on to the next unit. As we have already seen, the computation that produces the net_i value at each unit is normally a sum-of-products calculation—a very time-consuming operation if there is a large number of inputs at each node.

Compounding the problem, the sum-of-products calculation is done using floating-point numbers, since the network simulation is essentially a digital representation of analog signals. Thus, the CPU will have to perform two floating-point operations (a multiply and an add) for every input to each unit in the network. Given the large number of nodes in some networks, each with potentially hundreds or thousands of inputs, it is easy to see that the computer must be capable of performing several million floating-point operations per second (MFLOPS) to simulate an ANS of moderate size in a reasonable amount of time. Even assuming the computer has the floating-point hardware needed to improve the performance of the simulator, we, as programmers, must optimize the computer's ability to perform this computation by designing our data structures appropriately.

We now offer a final guideline for those readers who will attempt to implement many different network models using the data structures and processing concepts presented here.
To a large extent, our simulator design philosophy is based on networks that have a uniform interconnection strategy; that is, units in one layer have all been fully connected to units in another layer. However, many of the networks we present in this text will rely on different interconnection schemes. Units may be only sparsely interconnected, or may have connections to units outside of the next sequential layer. We must take these notions into account as we define our data structures, or we may well end up with a unique set of data structures for each network we implement.
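Before turning to the data structures themselves, here is a minimal sketch of the first guideline above: sizing a layer's arrays dynamically at run time with standard C allocation calls. The structure and function names are our own illustrative choices, not the ones developed later in the text.

    /* A minimal sketch of dynamic network sizing in C.  A layer's arrays  */
    /* are allocated at run time from user-supplied dimensions, so that    */
    /* changing the network size never requires recompilation.             */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        int      n_units;    /* units in this layer                          */
        int      n_inputs;   /* connections arriving at each unit            */
        double  *outputs;    /* one output value per unit                    */
        double **weights;    /* weights[i][j]: weight from input j to unit i */
    } Layer;

    Layer *make_layer(int n_units, int n_inputs)
    {
        int i;
        Layer *l    = malloc(sizeof *l);
        l->n_units  = n_units;
        l->n_inputs = n_inputs;
        l->outputs  = calloc(n_units, sizeof *l->outputs);
        l->weights  = malloc(n_units * sizeof *l->weights);
        for (i = 0; i < n_units; i++)
            l->weights[i] = calloc(n_inputs, sizeof **l->weights);
        return l;
    }

    int main(void)
    {
        Layer *hidden = make_layer(10, 4);   /* sizes chosen "on the fly" */
        printf("allocated %d units with %d inputs each\n",
               hidden->n_units, hidden->n_inputs);
        return 0;
    }

A production simulator would also check the allocation results and free the memory; the point here is only that the dimensions are ordinary run-time parameters.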


1.3.3 Array-Based ANS Data Structures

Figure 1.18 A two-layer network illustrating signal propagation is illustrated here. Each unit on the input layer generates a single output signal that is propagated through the connecting weights to each unit on the subsequent layer. Note that for each second-layer unit, the connection weights to the input layer can be modeled as a sequential array (or list) of values.

The observation made earlier that data will be processed as a sum of products (or as the inner product between two vectors) implies that the network data ought to be arranged in groups of linearly sequential arrays, each containing homogeneous data. The rationale behind this arrangement is that it is much faster to step through an array of data sequentially than it is to have to look up the address of every new value, as would be done if a linked-list approach were used. This grouping also is much more memory efficient than is a linked-list data structure, since there is no need to store pointers in the arrays. However, this efficiency is bought at the expense of algorithm generality, as we shall show later.

As an illustration of why arrays are more efficient than are linked records, consider the neural-network model shown in Figure 1.18. The input value present at the ith node in the upper layer is the sum of the modulated outputs received from every unit in the lower layer. To simulate this structure using data organized in arrays, we can model the connections and node outputs as values in two arrays, which we will call weights and outputs, respectively. (Symbols that refer to variables, arrays, or code are identified in the text by the use of the typewriter typeface.) The data in these arrays will be sequentially arranged so as to correspond one to one with the item being modeled, as shown in Figure 1.19. Specifically, the output from the first input unit will be stored in the first location in the outputs array, the second in the second, and so on. Similarly, the weight associated with the connection between the first input unit and the unit of interest, w_i1, will be


located as the first value in the ith weights array, weights[i]. Notice the index we now associate with the weights array. This index indicates that there will be many such arrays in the network, each containing a set of connection weights. The index here indicates that this array is one of these connection arrays—specifically, the one associated with the inputs to the ith network unit. We will expand on this notion later, as we extend the data structures to model a complete network.

Figure 1.19 An array data structure is illustrated for the computation of the net_i term. Here, we have organized the connection-weight values as a sequential array of values that map one to one to the array containing unit output values.

The process needed to compute the aggregate input at the ith unit in the upper layer, net_i, is as follows. We begin by setting two pointers to the first location of the outputs and weights[i] arrays, and setting a local accumulator to zero. We then perform the computation by multiplying the values located in memory at each of the two array pointers, adding the resulting product to the local accumulator, incrementing both of the pointers, and repeating this sequence for all values in the arrays.
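In C, the pointer-stepping computation just described might look like the following sketch. The array names mirror the weights and outputs arrays above, but the sizes and values are illustrative only.

    /* A minimal sketch of the net_i computation over sequential arrays.   */
    /* Two pointers step through weights[i] and outputs in lock step while */
    /* a local accumulator collects the sum of products.                   */
    #include <stdio.h>

    #define N_INPUTS 6

    int main(void)
    {
        double outputs[N_INPUTS]   = { 0.9, 0.1, 0.4, 0.0, 0.7, 0.3 };
        double weights_i[N_INPUTS] = { 0.5, -0.2, 0.1, 0.8, -0.4, 0.05 };
        double *o = outputs, *w = weights_i;   /* pointers to first elements */
        double net_i = 0.0;                    /* local accumulator          */
        int j;

        for (j = 0; j < N_INPUTS; j++)
            net_i += *w++ * *o++;              /* multiply, accumulate, step */

        printf("net_i = %f\n", net_i);
        return 0;
    }

Because each pass through the loop only multiplies, adds, and increments two pointers, the inner loop stays as small as the discussion that follows requires.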


We therefore choose to emphasize efficiency in our simulator design; that is why we indicated earlier that the arrays ought to be constructed with homogeneous data.

This structure will do nicely for the general case of a fully interconnected network, but how can we adapt it to account for networks where the units are not fully interconnected? There are two strategies that can be employed to solve this dilemma:

• Implementation of a parallel index array to specify connectivity
• Use of a universal "zero" value to act as a placeholder connection

In the first case, an array with the same length as the weights[i] array is constructed and coexists with the weights[i] array. This array contains an integer index specifying the offset into the outputs array where the output from the transmitting unit is located. Such a structure, along with the network it describes, is illustrated in Figure 1.20. You should examine the diagram carefully to convince yourself that the data structure does implement the network structure shown.

Figure 1.20 This sparse network is implemented using an index array. In this example, we calculate the input value at unit i by multiplying each value in the weights array with the value found in the output array at the offset indicated by the value in the indices array.
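As a concrete sketch of the first strategy (the index array of Figure 1.20), the loop below pairs each weight with an integer offset into the outputs array. It is illustrative only; the identifiers indices_i and n_connections are assumptions rather than names from the text.

    #include <stddef.h>

    /* Net input for a sparsely connected unit: each weight is paired with
       the offset of the transmitting unit's value in the outputs array. */
    double compute_net_sparse(const double *weights_i, const int *indices_i,
                              const double *outputs, size_t n_connections)
    {
        double net = 0.0;

        for (size_t k = 0; k < n_connections; k++) {
            /* indices_i[k] locates the output of the unit feeding the kth
               connection, so missing connections simply never appear.    */
            net += weights_i[k] * outputs[indices_i[k]];
        }
        return net;
    }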


In the second case, if we could specify to the network that a connection had a zero weight, the contribution to the total input of the node that it feeds would be zero. Therefore, the only reason for the existence of this connection would be that it acts as a placeholder, allowing the weights[i] array to maintain its one-to-one correspondence of location to connection. The cost of this implementation is the amount of time consumed performing a useless multiply-accumulate operation, and, in very sparsely connected networks, the large amount of wasted memory space. In addition, as we write the code needed to implement the learning law associated with the different network models, our algorithms must take a universal zero value into account and must not allow it to participate in the adaptation process; otherwise, the placeholder connection will be changed as the network adapts and will eventually become an active participant in the signal-propagation process.

When is one approach preferable to the other? There is no absolute rule that will cover the wide variety of computers that will be the target machines for many ANS applications. In our experience, though, the break-even point is when the network is missing one-half of its interconnections. The desired approach therefore depends largely on how completely interconnected the network being simulated is. Whereas the "placeholder" approach consumes less memory and CPU time when only relatively few connections are missing, the index-array approach is much more efficient in very sparsely connected networks.

1.3.4 Linked-List ANS Data Structures

Many computer languages, such as Ada, LISP, and Modula-2, are designed to implement dynamic memory structures primarily as lists of records containing many different types of data. One type of data common to all records is the pointer type. Each record in the linked list will contain a pointer to the next record in the chain, thereby creating a threaded list of records. Each list is then completely described as a set of records that each contain pointers to other similar records, or contain null pointers. Linked lists offer a processing advantage over the dynamic array structures described previously in the algorithm generality they allow for neural-network simulation.
Unfortunately, they also suffer from two disadvantages serious enough to limit their applicability to our simulator: excessive memory consumption and a significantly reduced processing rate for signal propagation.

To illustrate the disadvantages of the linked-list approach, we consider the two-layer network model and associated data structure depicted in Figure 1.21. In this example, each network unit is represented by an N_i record, and each connection is modeled as a C_ij record. Since each connection is simultaneously part of two unique lists (the input list to a unit on the upper layer and the output list from a unit on the lower layer), each connection record must contain at least two separate pointers to maintain the lists.
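A minimal C rendering of such a record structure might look like the following. The field names are hypothetical, but the two pointers carried by every connection record (one for the input list, one for the output list) are exactly the overhead the text describes.

    #include <stddef.h>

    /* One record per network unit and one per connection (Figure 1.21). */
    typedef struct connection {
        double weight;                /* the connection weight             */
        struct unit *source;          /* unit whose output feeds this link */
        struct connection *next_in;   /* next link in the receiver's
                                         input list                        */
        struct connection *next_out;  /* next link in the sender's
                                         output list                       */
    } connection;

    typedef struct unit {
        double output;                /* last computed output value        */
        struct connection *inputs;    /* head of this unit's input list    */
        struct connection *outputs;   /* head of this unit's output list   */
    } unit;

    /* Net input computed by chasing pointers instead of stepping an array. */
    double compute_net_list(const unit *u)
    {
        double net = 0.0;
        for (const connection *c = u->inputs; c != NULL; c = c->next_in)
            net += c->weight * c->source->output;
        return net;
    }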


Figure 1.21 A linked-list implementation of a two-layer network is shown here. In this model, each network unit accesses its connections by following pointers from one connection record to the next. Here, the set of connection records modeling the input connections to unit N_i are shown, with links from all input-layer units.

Obviously, just the memory needed to store those pointers will consume twice as much memory space as is needed to store the connection-weight values. This need for extra memory results in roughly a three-fold reduction in the size of the network simulation that can be implemented when compared to the array-based model.4 Similarly, the need to store pointers is not restricted to this particular structure; it is common to any linked-list data structure.

The linked-list approach is also less efficient than is the array model at run time. The CPU must perform many more data fetches in the linked-list approach (to fetch pointers), whereas in the array structure, the auto-postincrement addressing mode can be used to access the next connection implicitly. For very sparsely connected networks (or a very small network), this overhead is not significant. For a large network, however, the number of extra memory cycles required due to the large number of connections in the network will quickly overwhelm the host computer system for most ANS simulations.

On the bright side, the linked-list data structure is much more tolerant of "nonstandard" network connectivity schemes; that is, once the code has been written to enable the CPU to step through a standard list of input connections, no code modification is required to step through a nonstandard list. In this case, all the overhead is imposed on the software that constructs the original data structure for the network to be simulated. Once it is constructed, the CPU does not know (or care) whether the connection list implements a fully interconnected network or a sparsely connected network.

4. This description is an obvious oversimplification, since it does not consider potential differences in the amount of memory used by pointers and floating-point numbers, virtual-memory systems, or other techniques for extending physical memory.


It simply follows the list to the end, then moves on to the next unit and repeats the process.

1.3.5 Extension of ANS Data Structures

Now that we have defined two possible structures for performing the input computations at each node in the network, we can extend these basic structures to implement an entire network. Since the array structure tends to be more efficient for computing input values at run time on most computers, we will implement the connection weights and node outputs as dynamically allocated arrays. Similarly, any additional parameters required by the different networks and associated with individual connections will also be modeled as arrays that coexist with the connection-weight arrays.

Now we must provide a higher-level structure to enable us to access the various instances of these arrays in a logical and efficient manner. We can easily create an adequate model for our integrated network structure if we adopt a few assumptions about how information is processed in a "standard" neural network:

• Units in the network can always be coerced into layers of units having similar characteristics, even if there is only one unit in some layers.
• All units in any layer must be processed completely before the CPU can begin simulating units in any other layer.
• The number of layers that our network simulator will support is indefinite, limited only by the amount of memory available.
• The processing done at each layer will usually involve the input connections to a node, and will only rarely involve output connections from a node (see Chapter 3 for an exception to this assumption).

Based on these assumptions, let us presume that the layer will be the network structure that binds the units together. Then, a layer will consist of a record that contains pointers to the various arrays that store the information about the nodes on that layer. Such a layer model is presented in Figure 1.22. Notice that, whereas the layer record will locate the node-output array directly, the connection arrays are accessed indirectly through an intermediate array of pointers. The reason for this intermediate structure is again related to our desire to optimize the data structures for efficient computation of the net_i value for each node. Since each node on the layer will produce exactly one output, the outputs for all the nodes on any layer can be stored in a single array. However, each node will also have many input connections, each with weights unique to that node.
We must therefore construct our data structures to allow input-weight arrays to be identified uniquely with specific nodes on the layer. The intermediate weight-pointer array satisfies the need to associate input weights with the appropriate node (via the position of the pointer in the intermediate array), while allowing the input weights for each node to be modeled as sequential arrays, thus maintaining the desired efficiency in the network data structures.
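One way to express the layer record in C is sketched below. The member names follow the spirit of Figure 1.22, but the exact field list is an assumption made for illustration; real network models will add their own per-layer parameters.

    #include <stddef.h>

    /* A layer groups one output array with one weight array per unit
       (Figure 1.22). weight_ptrs is the intermediate array of pointers. */
    typedef struct layer {
        size_t   n_units;       /* number of units on this layer            */
        size_t   n_inputs;      /* connections arriving at each unit        */
        double  *outputs;       /* outputs[j] is unit j's output value      */
        double **weight_ptrs;   /* weight_ptrs[j] -> unit j's input weights */
    } layer;

    /* Forward pass for one layer: net_j is the inner product of unit j's
       weight array with the previous layer's outputs. */
    void propagate_layer(const layer *lower, layer *upper)
    {
        for (size_t j = 0; j < upper->n_units; j++) {
            double net = 0.0;
            const double *w = upper->weight_ptrs[j];
            for (size_t k = 0; k < upper->n_inputs; k++)
                net += w[k] * lower->outputs[k];
            upper->outputs[j] = net;   /* identity output function assumed */
        }
    }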


Figure 1.22 The layer structure is used to model a collection of nodes with similar function. In this example, the weight values of all input connections to the first processing unit (o_1) are stored sequentially in the w_1j array, connections to the second unit (o_2) in the w_2j array, and so on, enabling rapid sequential access to these values during the input-computation operation.

Finally, let us consider how we might model an entire network. Since we have decided that any network can be constructed from a set of layers, we will model the network as a record that contains both global data and pointers to locate the first and last elements in a dynamically allocated array of layer records. This approach allows us to create a network of arbitrary depth while providing us with a means of immediate access to the two most commonly accessed layers in the network: the input and output layers.

Such a data structure, along with the network structure it represents, is depicted in Figure 1.23.
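A corresponding network record might be declared as follows. This is only a sketch under the assumptions listed above; it reuses the hypothetical layer record from the previous fragment, and the field names and accessor macros are illustrative, not part of the book's pseudocode.

    #include <stddef.h>

    struct layer;                 /* the layer record sketched earlier */

    /* The network is a dynamically allocated array of layer records plus
       whatever global data a particular model requires (Figure 1.23). */
    typedef struct network {
        double        learning_rate;  /* example of model-wide global data */
        size_t        n_layers;       /* depth of the network              */
        struct layer *layers;         /* layers[0] .. layers[n_layers-1]   */
    } network;

    /* Immediate access to the two most frequently used layers. */
    #define INPUT_LAYER(net)  (&(net)->layers[0])
    #define OUTPUT_LAYER(net) (&(net)->layers[(net)->n_layers - 1])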


Figure 1.23 The figure shows a neural network (a) as implemented by our data structures (the layers, weight_ptrs, and outputs arrays, with their input connections), and (b) as represented schematically.

By modeling the data in this way, we will allow for networks of arbitrary size and complexity, while optimizing the data structures for efficient run-time operation in the inner loop, the computation of the net_i term for each node in the network.

1.3.6 A Final Note on ANS Simulation

Before we move on to examine specific ANS models, we must mention that the earlier discussion of the ANS simulator data structures is meant to provide you with only an insight into how to go about simulating neural networks on conventional computers. We have specifically avoided any detailed discussion of the data structures needed to implement a simulator (such as might be found in a conventional computer-science textbook). Likewise, we have avoided any analysis of how much more efficient one technique may be over another. We have taken this approach because we believe that it is more important to convey the ideas of what must be done to simulate an ANS model than to advocate how to implement that model.


Table 1.1 Pseudocode language definitions.

    Syntax              Intended Meaning
    ^type               a pointer to an object type
    ^type[]             a pointer to the first object in an array
    record.slot         access a field "slot" in a record
    pointer^.name       access a field "name" through a pointer
    length(^type[])     get number of items in the array
    { text }            curly braces enclose comments

This philosophy carries through the remainder of the text as well, specifically in the sections in each chapter that describe how to implement the learning algorithms for the network being discussed. Rather than presenting algorithms that might indicate a preference for a specific computer language, we have opted to develop our own pseudocode descriptions for the algorithms.5 We hope that you will have little difficulty translating our simulator algorithms to your own preferred data structures and programming languages.

Part of the purpose of this text, however, is to illustrate the design of the algorithms needed to construct simulators for the various neural-network models we shall present. For that reason, the algorithms developed in this text will be rather detailed, perhaps more so than some people prefer. We have elected to use detailed algorithms so that we can illustrate, wherever possible, algorithmic enhancements that should be made to improve the performance of the simulator. However, with this detail comes a responsibility. Since we have elected to present pseudocode algorithms, we are obligated to develop a syntax that precisely describes the actions we intend the computer to perform. For that reason, you should become comfortable with the notations described in Table 1.1, as we will adhere to this syntax in the description of the various algorithms and data structures throughout the remainder of this text.

One final note for programmers who prefer the C language. You should be aware of the fundamental differences between the mathematical summations that will be described in the ANS-specific chapters that follow and the C for(;;) construct. Although it is mathematically correct to describe a summation with an index, i, starting at 1 and running through n, in C it is preferable to use arrays that start at index 0 and run through n - 1. Those readers who will be writing simulators in C must account for this difference in their code, and should be alert to this difference as they read the theoretical sections of this text.

5. We have thus ensured that we have offended everyone equally.
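For readers working in C, the rough correspondence between the Table 1.1 notations and C constructs can be summarized in a short fragment. The struct and field names below are invented for illustration only, and the mapping is a suggestion rather than a definition from the text; note in particular that C has no built-in counterpart to length(), so an explicit element count must be carried alongside each array.

    #include <stddef.h>

    typedef struct example {
        double value;        /* pseudocode record.slot  ->  C record.slot  */
    } example;

    typedef struct holder {
        example *item;       /* pseudocode ^type    ->  C type *           */
        double  *values;     /* pseudocode ^type[]  ->  C pointer to the
                                first element of an array                  */
        size_t   n_values;   /* pseudocode length(array) has no direct C
                                counterpart; carry an explicit count       */
    } holder;

    double first_value(const holder *h)
    {
        return h->values[0]; /* pseudocode pointer^.name -> C pointer->name */
    }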


Suggested Readings

There are many books appearing that cover various aspects of ANS technology from widely differing perspectives. We shall not attempt to give an exhaustive bibliography here; rather, we shall indicate sources that we have found to be useful, for one reason or another.

Neurocomputing: Foundations of Research is an excellent source of papers having an historical interest, as well as of more modern works [2]. Perhaps the most widely read books are the PDP series, edited by Rumelhart and McClelland [23, 22]. Volume III of that series contains a disk with software that can be used for experiments with the technology. An earlier work, Parallel Models of Associative Memory, contains papers of the same type as those found in the PDP series [13]. We have also included in the bibliography other books, some that are collections and some that are monographs, on the general topic of ANS [1, 7, 6, 8, 12, 16, 25, 27, 28, 31]. On a less technical level is the recent work by Caudill and Butler, which provides an excellent review of the subject [5]. An excellent introduction to neurophysiology is given by Lavine [19]. This book presents the basic terminology and technical details of neurophysiology without the excruciating detail of a medical text. A more comprehensive review of neural modeling is given in the book by MacGregor [21]. For a cognitive-psychology viewpoint, the works by Anderson and Baron are both well written and thought provoking [3, 4].

An excellent review article is that by Richard Lippmann [20]. This article gives an overview of several of the algorithms presented in later chapters of this text. You might also want to read the Scientific American article by David Tank and John Hopfield [30]. Two other well-written and informative review articles appear in the first edition of the journal Neural Networks [9, 18]. For a comparison between traditional classification techniques and neural-network classifiers, see the papers by Huang and Lippmann [15, 14]. You can get an idea of the types of applications to which neural-network technology may apply from the paper by Hecht-Nielsen [11].

Bibliography

[1] Igor Aleksander, editor. Neural Computing Architectures. MIT Press, Cambridge, MA, 1989.
[2] James A. Anderson and Edward Rosenfeld, editors. Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA, 1988.
[3] John R. Anderson. The Architecture of Cognition. Cognitive Science Series. Harvard University Press, Cambridge, MA, 1983.
[4] Robert J. Baron. The Cerebral Computer. Lawrence Erlbaum Associates, Hillsdale, NJ, 1987.


[5] Maureen Caudill and Charles Butler. Naturally Intelligent Systems. MIT Press, Cambridge, MA, 1990.
[6] John S. Denker, editor. Neural Networks for Computing: AIP Conference Proceedings 151. American Institute of Physics, New York, 1986.
[7] Rolf Eckmiller and Christoph v. d. Malsburg, editors. Neural Computers. NATO ASI Series F: Computer and Systems Sciences. Springer-Verlag, Berlin, 1988.
[8] Stephen Grossberg, editor. Neural Networks and Natural Intelligence. MIT Press, Cambridge, MA, 1988.
[9] Stephen Grossberg. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1(1):17-62, 1988.
[10] Donald O. Hebb. The Organization of Behavior. Wiley, New York, 1949.
[11] Robert Hecht-Nielsen. Neurocomputer applications. In Rolf Eckmiller and Christoph v. d. Malsburg, editors, Neural Computers, pages 445-454. Springer-Verlag, 1988. NATO ASI Series F: Computers and System Sciences, Vol. 41.
[12] Robert Hecht-Nielsen. Neurocomputing. Addison-Wesley, Reading, MA, 1990.
[13] Geoffrey E. Hinton and James A. Anderson, editors. Parallel Models of Associative Memory. Lawrence Erlbaum Associates, Hillsdale, NJ, 1981.
[14] William Y. Huang and Richard P. Lippmann. Comparison between neural net and conventional classifiers. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. iv.485-iv.494, June 1987.
[15] William Y. Huang and Richard P. Lippmann. Neural net and traditional classifiers. In Proceedings of the Conference on Neural Information Processing Systems, Denver, CO, November 1987.
[16] Tarun Khanna. Foundations of Neural Networks. Addison-Wesley, Reading, MA, 1990.
[17] C. Klimasauskas. The 1989 Neuro-Computing Bibliography. MIT Press, Cambridge, MA, 1989.
[18] Teuvo Kohonen. An introduction to neural computing. Neural Networks, 1(1):3-16, 1988.
[19] Robert A. Lavine. Neurophysiology: The Fundamentals. The Collamore Press, Lexington, MA, 1983.
[20] Richard P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, pp. 4-22, April 1987.
[21] Ronald J. MacGregor. Neural and Brain Modeling. Academic Press, San Diego, CA, 1987.


[22] James McClelland and David Rumelhart. Explorations in Parallel Distributed Processing. MIT Press, Cambridge, MA, 1986.
[23] James McClelland and David Rumelhart. Parallel Distributed Processing, volumes 1 and 2. MIT Press, Cambridge, MA, 1986.
[24] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133, 1943.
[25] Carver A. Mead and M. A. Mahowald. A silicon model of early visual processing. Neural Networks, 1(1):91-98, 1988.
[26] Marvin Minsky and Seymour Papert. Perceptrons. MIT Press, Cambridge, MA, 1969.
[27] Marvin Minsky and Seymour Papert. Perceptrons: Expanded Edition. MIT Press, Cambridge, MA, 1988.
[28] Yoh-Han Pao. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA, 1989.
[29] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.
[30] David W. Tank and John J. Hopfield. Collective computation in neuronlike circuits. Scientific American, 257(6):104-114, 1987.
[31] Philip D. Wasserman. Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.


CHAPTER 2
Adaline and Madaline

Signal processing developed as an engineering discipline with the advent of electronic communication. Initially, analog filters using resistor-inductor-capacitor (RLC) circuits were designed to remove noise from the communication signals. Today, signal processing has evolved into a many-faceted technology, with the emphasis having shifted from tuned-circuit implementation to digital signal processors (DSPs) that can perform the same types of filtering applications by executing convolution filters implemented in software. The basis for the industry remains the design and implementation of filters to perform noise removal from information-bearing signals.

In this chapter, we will focus on a specific type of filter, called the Adaline (and the multiple-Adaline, or Madaline), developed by Bernard Widrow of Stanford University. As we will see, the Adaline model is similar to that of a single PE in an ANS.

2.1 REVIEW OF SIGNAL PROCESSING

We begin our discussion of the Adaline and Madaline networks with a review of basic signal-processing theory. An understanding of this material is essential if we are to appreciate the operation and applications of these networks. However, this material is also typically covered as part of an undergraduate curriculum in information coding and data communication. Therefore, readers already comfortable with signal-processing concepts may skip this first section without fear of missing material relevant to the Adaline and Madaline topics.

For those readers who are not familiar with the techniques commonly used to implement electronic communications and signal processing, we shall begin by describing briefly the data-encoding and modulation schemes used in an amplitude-modulation (AM) radio transmission. As part of this discussion, we shall illustrate the need for filters in the communications industry. We will then


review the concepts of the frequency domain, the four basic filter types, and Fourier analysis. This preliminary section concludes with a brief overview of digital signal processing, because many of the concepts realized in digital filters are directly applicable to the Adaline and Madaline (and many other) neural networks.

2.1.1 Signal Processing and Filters

Signal processing is an engineering discipline that deals primarily with the implementation of filters to remove or reduce unwanted frequency components from an information-bearing signal. Let's consider, for example, an AM radio broadcast. Electronic communication techniques, whether for audio signals or other data, consist of signal encoding and modulation. Information to be transmitted (in this case, audible sounds, such as voice or music) can be encoded electronically by an analog signal that exactly reproduces the frequencies and amplitudes of the original sounds. Since the sounds being encoded represent a continuum from silence through voice to music, the instantaneous frequency of the encoded signal will vary with time, ranging from 0 to approximately 10,000 hertz (Hz).

Rather than attempt to transmit this encoded signal directly, we transform the signal into a form more suitable for radio transmission. We accomplish this transformation by modulating the amplitude of a high-frequency carrier signal with the analog information signal. This process is illustrated in Figure 2.1. Here, the carrier is nothing more than a sine wave with a frequency much greater than that of the information signal. For AM radio, the carrier frequency will be in the range of 550 to 1600 kilohertz (kHz). Since the frequency of the carrier is significantly greater than the maximum frequency of the information signal, little information is lost by this modulation. The modulated signal can then be transmitted to a receiving station (or broadcast to anyone with a radio receiver), where the signal is demodulated and reproduced as sound.

The most obvious reason for a filter in AM radio is that different people have different preferences in music and entertainment. Therefore, the government and the communication industry have allowed many different radio stations to operate in the same geographical area, so that everyone's tastes in entertainment can be accommodated.
With so many different radio stations all broadcasting in close proximity, how is it that we can listen to only one station at a time? The answer is to allow each receiver to be tuned by the user to a selectable frequency. In tuning the radio, we are essentially changing the frequency-response characteristics of a bandpass filter inside the radio. This filter allows only the signals from the station in which we are interested to pass, while eliminating all the other signals being broadcast within the spectrum of the AM radio.

To illustrate how the bandpass filter operates, we will change our reference from the time domain to the frequency domain. We begin by constructing a two-axis graph, where the x axis represents increasing frequencies and the y axis represents decreasing attenuation in a unit called the decibel (dB). Such a


Figure 2.1 Typical information-encoding and amplitude-modulation techniques for electronic communication are shown. (a) The carrier wave has a frequency much higher than that of (b) the information-bearing signal. (c) The carrier wave is modulated by the information-bearing signal.


graph is illustrated in Figure 2.2(a). For the AM radio example, let us imagine that there are seven AM radio stations, labeled A through G, operating in the area where we are listening. The frequencies at which these stations transmit are graphed as vertical lines located on the frequency axis at the point corresponding to their transmitting, or carrier, frequency. The amplitude of the lines, as illustrated in Figure 2.2(a), is almost 0 dB, indicating that each station is transmitting at full power, and each can be received equally well.

Now we will tune a bandpass filter to select one of the seven stations. The frequency response of a typical bandpass filter is illustrated in Figure 2.2(b). Notice that the frequency-response curve is such that all frequencies that fall outside the inverted notch are attenuated to very small magnitudes, whereas frequencies within the passband are allowed to pass with very little attenuation, hence the name "bandpass filter." To tune our radio receiver to any one of the seven broadcasting stations, we simply adjust the frequency response of the filter such that the carrier frequency of the desired station is within the passband.

Figure 2.2 These are frequency-domain graphs of (a) AM radio reception of seven different stations, and (b) the frequency response of the tuning filter and the magnitude of the received signals after filtering.


As another example of the use of filters in the communication industry, consider the problem of echo suppression in long-distance telephone communication. As indicated in Figure 2.3, the problem is caused by the interaction between the amplifiers and series coupling used on both ends of the line, and the delay time required to transmit the voice information between the switching office and the communications satellite in geostationary orbit, 23,000 miles above the earth. Specifically, you hear an echo of your own voice in the telephone when you speak. The signal carrying your voice arrives at the receiving telephone approximately 270 milliseconds after you speak. This delay is the amount of time required by the microwave signal to travel the 46,000 miles between the transmitting station, the satellite, and the receiving station on the ground. Once received and routed to the destination telephone, the signal is again amplified and reproduced as sound on the receiving handset. Unfortunately, it is also often picked up by the transmitter at the receiving end, due to imperfections in the devices used to decouple the incoming signals. It can then be reamplified and fed back to you approximately 1/2 second after you spoke. The result is echo. Obviously, a simple bandpass filter cannot be used to remove the echo, because there is no way to distinguish the echoed signal from valid signals.

To solve problems such as these, the communications industry has developed many different types of filters.

Figure 2.3 Echo can occur in long-distance telecommunications. The incoming signal is retransmitted due to coupling leakage; the returning signal is the original delayed by 500 milliseconds, resulting in an "echo."


These filters not only are used in electronic communications, but also have an application base that includes radar and sonar imaging, electronic warfare, and medical technology. However, all the application-specific filter implementations can be grouped into four general filter types: lowpass, highpass, bandpass, and bandstop. The characteristic frequency response of these filters is depicted in Figure 2.4. The adaptive filter, which is the subject of the remainder of the chapter, has characteristics unique to the application it serves. It can reproduce the characteristics of any of the four basic filter types, alone or in combination. As we shall show later, the adaptive filter is ideally suited to the telephone-echo problem just discussed.

Figure 2.4 Frequency-response characteristics of the four basic filter types (lowpass, highpass, bandpass, and bandstop) are shown.


2.1.2 Fourier Analysis and the Frequency Domain

To analyze a signal-processing problem that requires a filter, we must leave the time domain and find a tool for translating our filter models into the frequency domain, because most of the signals we will analyze cannot be completely understood in the time domain. For example, most signals consist not only of a fundamental frequency, but also of harmonics that must be considered, or they consist of many discrete frequency components that must be accounted for by the filters we design. There are many tools that we can use to help understand the frequency-domain nature of signals. One of the most commonly used is the Fourier series. It has been shown that any periodic signal can be modeled as an infinite series of sines and cosines. The Fourier series, which describes the frequency-domain nature of periodic signals, is given by the equation

    x(t) = \sum_{n=0}^{\infty} a_n \cos(2\pi n f_0 t) + \sum_{n=0}^{\infty} b_n \sin(2\pi n f_0 t)

where f_0 is the fundamental frequency and the coefficients a_n and b_n determine the contribution of each harmonic; with the appropriate choice of coefficients, the series describes a typical square wave.

As illustrated in Figure 2.5, if we algebraically add together the first three sinusoidal components of this Fourier series, we produce a signal that already strongly resembles the square wave. However, we should notice that the resultant signal also exhibits ripples in both active regions. These ripples will remain to some extent unless we complete the infinite series. Since that is obviously not practical, we must eventually truncate the series and settle for some amount of ripple in the resulting signal.

It turns out that this truncation exactly corresponds to the behavior we observed when transmitting a square wave across an electromagnetic medium. As it is impossible to have a medium of infinite bandwidth, it follows that it is impossible to transmit all the frequency components of a square wave.
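The truncation effect can be seen numerically by summing the first few sine terms of a square-wave expansion. The 4/(n*pi) coefficients used below are the standard result for an odd square wave of unit amplitude and are stated here as background rather than quoted from the text; the function name and the choice of three terms are arbitrary.

    #include <math.h>
    #include <stdio.h>

    /* Partial Fourier sum for a unit-amplitude square wave of fundamental
       frequency f0: only odd harmonics appear, with coefficients 4/(n*pi). */
    double square_wave_partial(double t, double f0, int n_terms)
    {
        const double PI = 3.14159265358979323846;
        double x = 0.0;
        for (int k = 0; k < n_terms; k++) {
            int n = 2 * k + 1;                       /* odd harmonics only */
            x += (4.0 / (n * PI)) * sin(2.0 * PI * n * f0 * t);
        }
        return x;
    }

    int main(void)
    {
        /* With three terms the waveform already looks square but shows the
           ripple discussed in the text (compare Figure 2.5). */
        for (double t = 0.0; t < 1.0; t += 0.05)
            printf("%.2f  %+.3f\n", t, square_wave_partial(t, 1.0, 3));
        return 0;
    }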


Figure 2.5 The first three frequency-domain components of a square wave are shown. Notice that the sine waves each have different magnitudes, as indicated by the coordinates on the y axis, even though they are graphed to the same height.

Thus, when we transmit a periodic square wave, we can observe the frequency-domain effects in the time-domain signal as overshoot, undershoot, and ripple.

This example shows that the Fourier series can be a powerful tool in helping us to understand the frequency-domain nature of any periodic signal, and to predict ahead of time what transmission effects we must consider as we design filters for our signal-processing applications.

We can also apply Fourier analysis to aperiodic signals, by evaluating the Fourier integral, which is given by

    X(f) = \int_{-\infty}^{\infty} x(t) e^{-j 2\pi f t} \, dt


We will not, however, belabor this point. Our purpose here is merely to understand the frequency-domain nature of signals. Readers interested in investigating Fourier analysis further are referred to Kaplan [4].

2.1.3 Filter Implementation and Digital Signal Processing

Early implementations of the four basic filters were predominantly tuned RLC circuits. This approach had a basic limitation, however, in that the filters had only a very small range of adjustability. Aside from our being able to change the resonant frequency of the filter by adjusting a variable capacitor or inductor, the filters were pretty much fixed once implemented, leaving little room for change as applications became more sophisticated.

The next step in the evolution of filter design came about with the advent of digital computer systems and, just recently, with the availability of microcomputer chips with architectures custom-tailored for signal-processing applications. The basic concept underlying digital filter implementation is the idea that a continuous analog signal can be sampled periodically, quantized, and processed by a fairly standard computer system. This approach, illustrated in Figure 2.6, overcame the limitation of fixed implementation, because changing the filter was simply a matter of rewriting the software for the computer. We will therefore concentrate on what goes on within the software simulation of the analog filter.

We assume that the computer implementation of the filter is a discrete-time, linear, time-invariant system. Systems that satisfy these constraints can perform a transformation on an input signal, based on some predefined criteria, to produce an output that corresponds to the input as though it had passed through an analog filter.

Figure 2.6 Discrete-time sampling of a continuous signal is shown; the original signal, plotted against time, is sampled at regular intervals to produce discrete samples.


Thus, a computer, executing a program that applies a given transformation operation, R, to discrete, digitized approximations of a continuous input signal, x(n), can produce an output value y(n) for each input sample, where n is the discrete timestep variable. In its role in performing this transformation, the computer can be thought of as a digital filter. Moreover, any filter can be completely characterized by its response, h(n), to the unit impulse function, represented as \delta(n). More precisely,

    h(n) = R[\delta(n)]

The benefit of this formulation is that, once the system response to the unit impulse is known, the system output for any input is given by

    y(n) = R[x(n)] = \sum_{i} h(i) \, x(n - i)

where x(n) is the system input.

This equation is meaningful to us in that it describes a convolution sum between the input signal and the unit impulse response of the system. The process can be pictured as a window sliding past a scene of interest. As illustrated in Figure 2.7, for each time step, the system output is produced by transposing and shifting h(n) one position to the right. The summation is then performed over all nonzero values of x(n) for the finite length of the filter. In this manner, we can realize the filter by repetitively performing floating-point multiplications and additions, coupled with sample time delays and shift operations. Repetitive mathematical operations are what computers do best; therefore, the convolution sum provides us with a mechanism for building the digital equivalent of analog filters. Readers interested in learning more about digital signal processing are referred to Oppenheim and Schafer [5] or Hamming [3].

It is sufficient for our purposes to note that the convolution sum is a sum-of-products operation similar to the type of operation an ANS PE performs when computing its input activation signal. Specifically, the Adaline uses exactly this sum-of-products calculation, without the sample time delays and shift operations, to determine how much input stimulation it receives from an instantaneous input signal. As we shall see in the next section, the Adaline extends the basic filter operation one step further, in that it has implemented within itself a means of adapting the weighting coefficients to allow it to increase or decrease the stimulation it receives the next time it is presented with the same signal.

The ability of the Adaline to adapt its weighting coefficients is extremely useful.
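The convolution sum translates almost directly into code. The fragment below is a generic finite-impulse-response loop written for illustration only; the function and argument names are not from the text, and a causal filter (h defined for timesteps 0 through m-1, with x treated as zero before time 0) is assumed.

    #include <stddef.h>

    /* y(n) = sum over i of h(i) * x(n - i), for a filter whose impulse
       response has m samples; input samples before time 0 are taken as 0. */
    double convolve_at(const double *h, size_t m, const double *x, size_t n)
    {
        double y = 0.0;
        for (size_t i = 0; i < m; i++) {
            if (i <= n)                  /* x(n - i) exists only for i <= n */
                y += h[i] * x[n - i];
        }
        return y;
    }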
When writing a digital filter program on a computer, the programmer must know exactly how to specify the filtering algorithm and what the details of the signal characteristics are. If modifications are desired, or if the signal characteristics change, reprogramming is required. When the programmer uses an Adaline, the problem shifts to one of being able to specify the desired output signal, given a particular input signal.


Figure 2.7 Convolution-sum calculation is shown. (a) The process begins by determining the desired response of the filter to the unit impulse function at eight discrete timesteps. (b) The input signal is sampled and quantized eight times. (c) The output of the filter is produced for each timestep by multiplication of each term in (a) with the corresponding value of (b) for all valid timesteps.

The Adaline takes the input and the desired output, and adjusts itself so that it can perform the desired transformation. Furthermore, if the signal characteristics change, the Adaline can adapt automatically. We shall now expand these ideas, and begin our investigation of the Adaline.

2.2 ADALINE AND THE ADAPTIVE LINEAR COMBINER

The Adaline is a device consisting of a single processing element; as such, it is not technically a neural network. Nevertheless, it is a very important structure that deserves close study. Moreover, we will show how it can form the basis of a network in a later section.


The term Adaline is an acronym; however, its meaning has changed somewhat over the years. Initially called the ADAptive LInear NEuron, it became the ADAptive LINear Element when neural networks fell out of favor in the late 1960s. It is almost identical in structure to the general PE described in Chapter 1. Figure 2.8 shows the Adaline structure. There are two basic modifications required to make the general PE structure into an Adaline. The first modification is the addition of a connection with weight, w_0, which we refer to as the bias term. This term is a weight on a connection that has its input value always equal to 1. The inclusion of such a term is largely a matter of experience. We show it here for completeness, but it will not appear in the discussion of the next sections. We shall resurrect the idea of a bias term in Chapter 3, on the backpropagation network.

The second modification is the addition of a bipolar condition on the output. The dashed box in Figure 2.8 encloses a part of the Adaline called the adaptive linear combiner (ALC). If the output of the ALC is positive, the Adaline output is +1. If the ALC output is negative, the Adaline output is -1. Because much of the interesting processing takes place in the ALC portion of the Adaline, we shall concentrate on the ALC. Later, we shall add back the binary output condition.

The processing done by the ALC is that of the typical processing element described in the previous chapter. The ALC performs a sum-of-products calculation using the input and weight vectors, and applies an output function to get a single output value.

Figure 2.8 The complete Adaline consists of the adaptive linear combiner, in the dashed box, and a bipolar output function (a threshold producing +1 or -1 according to the sign of the combiner output y). The adaptive linear combiner resembles the general PE described in Chapter 1.


Using the notation in Figure 2.8, the output is

    y = \sum_{j=1}^{n} w_j x_j + w_0

where w_0 is the bias weight. If we make the identification x_0 = 1, we can rewrite the preceding equation as

    y = \sum_{j=0}^{n} w_j x_j

or, in vector notation,

    y = \mathbf{w}'\mathbf{x}    (2.1)

The output function in this case is the identity function, as is the activation function. The use of the identity function as both output and activation functions means that the output is the same as the activation, which is the same as the net input to the unit.

The Adaline (or the ALC) is ADAptive in the sense that there exists a well-defined procedure for modifying the weights in order to allow the device to give the correct output value for the given input. What output value is correct depends on the particular processing function being performed by the device. The Adaline (or the ALC) is LINear because the output is a simple linear function of the input values. It is a NEuron only in the very limited sense of the PEs described in the previous chapter. The Adaline could also be said to be a LINear Element, avoiding the NEuron issue altogether. In the next section, we look at a method to train the Adaline to perform a given processing function.

2.2.1 The LMS Learning Rule

Given an input vector, x, it is straightforward to determine a set of weights, w, that will result in a particular output value, y. Suppose we have a set of input vectors, {x_1, x_2, ..., x_L}, each having its own, perhaps unique, correct or desired output value, d_k, k = 1, ..., L. The problem of finding a single weight vector that can successfully associate each input vector with its desired output value is no longer simple. In this section, we develop a method called the least-mean-square (LMS) learning rule, which is one method of finding the desired weight vector. We refer to this process of finding the weight vector as training the ALC. The learning rule can be embedded in the device itself, which can then self-adapt as inputs and desired outputs are presented to it. Small adjustments are made to the weight values as each input-output combination is processed until the ALC gives correct outputs. In a sense, this procedure is a true training procedure, because we do not need to calculate the value of the weight vector explicitly. Before describing the training process in detail, let's perform the calculation manually.
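Before turning to the training calculation, Eq. (2.1) and the bipolar output stage can be written out in a few lines of C. This is a sketch of the computation only, not the book's simulator; the function names alc_output and adaline_output and the argument n_plus_1 are assumptions, and the convention x[0] = 1 carries the bias weight w[0], as in the identification made above.

    /* ALC output: the inner product of Eq. (2.1), with x[0] = 1 so that
       w[0] acts as the bias term. */
    double alc_output(const double w[], const double x[], int n_plus_1)
    {
        double y = 0.0;
        for (int j = 0; j < n_plus_1; j++)   /* j = 0, 1, ..., n */
            y += w[j] * x[j];
        return y;
    }

    /* Full Adaline output: the bipolar threshold applied to the ALC output.
       Mapping an exactly zero ALC output to +1 is an arbitrary choice here. */
    int adaline_output(const double w[], const double x[], int n_plus_1)
    {
        return (alc_output(w, x, n_plus_1) >= 0.0) ? +1 : -1;
    }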


Calculation of w*.  To begin, let's state the problem a little differently: Given examples, (x_1, d_1), (x_2, d_2), ..., (x_L, d_L), of some processing function that associates input vectors, x_k, with (or maps to) the desired output values, d_k, what is the best weight vector, w*, for an ALC that performs this mapping?

To answer this question, we must first define what it is that constitutes the best weight vector. Clearly, once the best weight vector is found, we would like the application of each input vector to result in the precise, corresponding output value. Thus, we want to eliminate, or at least to minimize, the difference between the desired output and the actual output for each input vector. The approach we select here is to minimize the mean squared error for the set of input vectors.

If the actual output value is y_k for the kth input vector, then the corresponding error term is ε_k = d_k - y_k. The mean squared error, or expectation value of the error, is defined by

    ξ = ⟨ε_k²⟩ = (1/L) Σ_{k=1}^{L} ε_k²                      (2.2)

where L is the number of input vectors in the training set.¹

Using Eq. (2.1), we can expand the mean squared error as follows:

    ⟨ε_k²⟩ = ⟨(d_k - w'x_k)²⟩                                (2.3)
           = ⟨d_k²⟩ + w'⟨x_k x_k'⟩w - 2⟨d_k x_k'⟩w           (2.4)

In going from Eq. (2.3) to Eq. (2.4), we have made the assumption that the training set is statistically stationary, meaning that any expectation values vary slowly with respect to time. This assumption allows us to factor out the weight vectors from the expectation-value terms in Eq. (2.4).

Exercise 2.1: Give the details of the derivation that leads from Eq. (2.3) to Eq. (2.4), along with the justification for each step. Why are the factors d_k and x_k' left together in the last term in Eq. (2.4), rather than shown as the product of the two separate expectation values?

Define a matrix R = ⟨x_k x_k'⟩, called the input correlation matrix, and a vector p = ⟨d_k x_k⟩. Further, make the identification ξ = ⟨ε_k²⟩. Using these definitions, we can rewrite Eq. (2.4) as

    ξ = ⟨d_k²⟩ + w'Rw - 2p'w                                 (2.5)

This equation shows ξ as an explicit function of the weight vector, w. In other words, ξ = ξ(w).

¹ Widrow and Stearns use the notation E[ε_k²] for the expectation value; also, the term exemplars will sometimes be seen as a synonym for training set.
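To make R, p, and ξ(w) concrete, the following sketch (the two-input training set is invented purely for illustration) estimates R = ⟨x_k x_k'⟩ and p = ⟨d_k x_k⟩ as averages over the training set and then evaluates Eq. (2.5) for one particular weight vector.

#include <stdio.h>

#define L 4      /* number of training pairs (hypothetical data) */
#define N 2      /* number of weights                            */

int main(void)
{
    double x[L][N] = { {1.0, 2.0}, {2.0, 1.0}, {-1.0, 1.0}, {0.5, -0.5} };
    double d[L]    = {  3.0,        2.0,        1.0,         0.0 };
    double R[N][N] = {{0}}, p[N] = {0}, d2 = 0.0;
    double w[N]    = { 0.5, 0.5 };        /* weight vector to evaluate   */
    int i, j, k;

    for (k = 0; k < L; k++) {             /* sample averages over the set */
        d2 += d[k] * d[k] / L;
        for (i = 0; i < N; i++) {
            p[i] += d[k] * x[k][i] / L;
            for (j = 0; j < N; j++)
                R[i][j] += x[k][i] * x[k][j] / L;
        }
    }

    /* xi(w) = <d^2> + w'Rw - 2 p'w, as in Eq. (2.5) */
    double xi = d2;
    for (i = 0; i < N; i++) {
        xi -= 2.0 * p[i] * w[i];
        for (j = 0; j < N; j++)
            xi += w[i] * R[i][j] * w[j];
    }
    printf("xi(w) = %f\n", xi);
    return 0;
}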


To find the weight vector corresponding to the minimum mean squared error, we differentiate Eq. (2.5), evaluate the result at w*, and set the result equal to zero:

    dξ/dw = 2Rw - 2p                                         (2.6)
    2Rw* - 2p = 0
         Rw* = p                                             (2.7)
          w* = R⁻¹p                                          (2.8)

Notice that, although ξ is a scalar, dξ/dw is a vector. Equation (2.6) is an expression of the gradient of ξ, ∇ξ, which is the vector

    ∇ξ = (∂ξ/∂w_1, ∂ξ/∂w_2, ..., ∂ξ/∂w_n)'

All that we have done by the procedure is to show that we can find a point where the slope of the function, ξ(w), is zero. In general, that point may be a minimum or a maximum point. In the example that follows, we show a simple case where the ALC has only two weights. In that situation, the graph of ξ(w) is a paraboloid. Furthermore, it must be concave upward, since all combinations of weights must result in a nonnegative value for the mean squared error, ξ. This result is general and is obtained regardless of the dimension of the weight vector. In the case of dimensions higher than two, the paraboloid is known as a hyperparaboloid.

Suppose we have an ALC with two inputs and various other quantities defined as follows:

    R = [ 3  1 ]      p = [ 4 ]      ⟨d_k²⟩ = 10
        [ 1  4 ]          [ 5 ]

Rather than inverting R, we use Eq. (2.7) to find the optimum weight vector:

    [ 3  1 ] w* = [ 4 ]                                      (2.9)
    [ 1  4 ]      [ 5 ]

This equation results in two equations for w_1* and w_2*:

    3w_1* +  w_2* = 4
     w_1* + 4w_2* = 5

The solution is w* = (1, 1)'. The graph of ξ as a function of the two weights is shown in Figure 2.9.

Exercise 2.2: Show that the minimum value of the mean squared error can be written as

    ξ_min = ⟨d_k²⟩ - p'w*
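As a quick check of the example, the 2-by-2 system of Eq. (2.9) can be solved directly; the following sketch (illustrative only) confirms that w* = (1, 1)'.

#include <stdio.h>

int main(void)
{
    double R[2][2] = { {3.0, 1.0}, {1.0, 4.0} };
    double p[2]    = { 4.0, 5.0 };

    /* w* = R^{-1} p, using the explicit inverse of a 2x2 matrix */
    double det = R[0][0]*R[1][1] - R[0][1]*R[1][0];
    double w1  = ( R[1][1]*p[0] - R[0][1]*p[1]) / det;
    double w2  = (-R[1][0]*p[0] + R[0][0]*p[1]) / det;

    printf("w* = (%f, %f)\n", w1, w2);    /* prints (1, 1) */
    return 0;
}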


Figure 2.9  For an ALC with only two weights, the error surface is a paraboloid. The weights that minimize the error occur at the bottom of the paraboloidal surface.

Exercise 2.3: Determine an explicit equation for ξ as a function of w_1 and w_2, using the example in the text. Use it to find ∇ξ, the optimum weight vector, w*, and the minimum mean squared error, ξ_min, and prove that the paraboloid is concave upward.

In the next section, we shall examine a method for finding the optimum weight vector by an iterative procedure. This procedure allows us to avoid the often-difficult calculations necessary to determine the weights manually.

Finding w* by the Method of Steepest Descent.  As you might imagine, the analytical calculation to determine the optimum weights for a problem is rather difficult in general. Not only does the matrix manipulation get cumbersome for large dimensions, but also each component of R and p is itself an expectation value. Thus, explicit calculations of R and p require knowledge of the statistics of the input signals. A better approach would be to let the ALC find the optimum weights itself by having it search over the weight surface to find the minimum. A purely random search might not be productive or efficient, so we shall add some intelligence to the procedure.


Begin by assigning arbitrary values to the weights. From that point on the weight surface, determine the direction of the steepest slope in the downward direction. Change the weights slightly so that the new weight vector lies farther down the surface. Repeat the process until the minimum has been reached. This procedure is illustrated in Figure 2.10. Implicit in this method is the assumption that we know what the weight surface looks like in advance. We do not know, but we will see shortly how to get around this problem.

Typically, the weight vector does not initially move directly toward the minimum point. The cross-section of the paraboloidal weight surface is usually elliptical, so the negative gradient may not point directly at the minimum point, at least initially. The situation is illustrated more clearly in the contour plot of the weight surface in Figure 2.11.

Figure 2.10  We can use this diagram to visualize the steepest-descent method. An initial selection for the weight vector results in an error, ξ_start. The steepest-descent method consists of sliding this point down the surface toward the bottom, always moving in the direction of the steepest downward slope.


Figure 2.11  In the contour plot of the weight surface of Figure 2.10, the direction of steepest descent is perpendicular to the contour lines at each point, and this direction does not always point to the minimum point.

Because the weight vector is variable in this procedure, we write it as an explicit function of the timestep, t. The initial weight vector is denoted w(0), and the weight vector at timestep t is w(t). At each step, the next weight vector is calculated according to

    w(t + 1) = w(t) + Δw(t)                                  (2.10)

where Δw(t) is the change in w at the tth timestep.

We are looking for the direction of the steepest descent at each point on the surface, so we need to calculate the gradient of the surface (which gives the direction of the steepest upward slope). The negative of the gradient is in the direction of steepest descent. To get the magnitude of the change, multiply the gradient by a suitable constant, μ. The appropriate value for μ will be discussed later. This procedure results in the following expression:

    w(t + 1) = w(t) + μ(-∇ξ(w(t)))                           (2.11)
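To see Eq. (2.11) in action, the sketch below iterates it on the two-weight example, using the exact gradient 2Rw - 2p of Eq. (2.6). The starting point and the value of μ are arbitrary; as discussed next, in practice we usually cannot compute this exact gradient.

#include <stdio.h>

int main(void)
{
    double R[2][2] = { {3.0, 1.0}, {1.0, 4.0} };
    double p[2]    = { 4.0, 5.0 };
    double w[2]    = { 0.0, 0.0 };      /* arbitrary starting point */
    double mu      = 0.1;               /* step-size constant       */
    int t, i;

    for (t = 0; t < 50; t++) {
        double grad[2];
        for (i = 0; i < 2; i++)         /* grad = 2Rw - 2p, Eq. (2.6)          */
            grad[i] = 2.0*(R[i][0]*w[0] + R[i][1]*w[1]) - 2.0*p[i];
        for (i = 0; i < 2; i++)         /* w(t+1) = w(t) - mu*grad, Eq. (2.11) */
            w[i] -= mu * grad[i];
    }
    printf("w after descent: (%f, %f)\n", w[0], w[1]);   /* approaches (1, 1) */
    return 0;
}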


All that is necessary to complete the discussion is to determine the value of ∇ξ(w(t)) at each successive iteration step.

The value of ∇ξ(w(t)) was determined analytically in the previous section. Equation (2.6) or Eq. (2.9) could be used here to determine ∇ξ(w(t)), but we would have the same problem that we had with the analytical determination of w*: We would need to know both R and p in advance. This knowledge is equivalent to knowing what the weight surface looks like in advance. To circumvent this difficulty, we use an approximation for the gradient that can be determined from information that is known explicitly at each iteration.

For each step in the iteration process, we perform the following (a small sketch of the resulting training loop appears after the list):

1. Apply an input vector, x_k, to the Adaline inputs.

2. Determine the value of the error squared, ε_k²(t), using the current value of the weight vector:

       ε_k²(t) = (d_k - w'(t) x_k)²                          (2.12)

3. Calculate an approximation to ∇ξ(t), by using ε_k²(t) as an approximation for ⟨ε_k²⟩:

       ∇ξ(t) ≈ ∇ε_k²(t) = -2 ε_k(t) x_k

4. Update the weight vector by an amount proportional to the negative of this approximate gradient:

       w(t + 1) = w(t) + 2 μ ε_k(t) x_k

5. Repeat steps 1 through 4 with the next input vector, continuing until the error has been reduced to an acceptable value.
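The loop in steps 1 through 5 is the LMS, or Widrow-Hoff, procedure. The following minimal sketch trains a three-weight ALC (one bias weight plus two inputs) on a small, made-up linear mapping; the training data and the value of μ are invented for illustration.

#include <stdio.h>

#define N 3   /* weights, including the bias weight w[0]        */
#define L 4   /* number of (hypothetical) training pairs        */

int main(void)
{
    /* x[k][0] is the constant bias input of 1 */
    double x[L][N] = { {1, 0, 0}, {1, 0, 1}, {1, 1, 0}, {1, 1, 1} };
    double d[L]    = { -1, 1, 1, 3 };        /* targets from d = -1 + 2*x1 + 2*x2 */
    double w[N]    = { 0, 0, 0 };
    double mu      = 0.05;
    int pass, k, j;

    for (pass = 0; pass < 200; pass++) {
        for (k = 0; k < L; k++) {
            double y = 0.0, err;
            for (j = 0; j < N; j++) y += w[j] * x[k][j];   /* ALC output, Eq. (2.1) */
            err = d[k] - y;                                /* eps_k = d_k - y_k     */
            for (j = 0; j < N; j++)                        /* LMS update, step 4    */
                w[j] += 2.0 * mu * err * x[k][j];
        }
    }
    printf("w = (%f, %f, %f)\n", w[0], w[1], w[2]);        /* approaches (-1, 2, 2) */
    return 0;
}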


Figure 2.12  The hypothetical path taken by a weight vector as it searches for the minimum error using the LMS algorithm is not a smooth curve, because the gradient is being approximated at each point. Note also that step sizes get smaller as the minimum-error solution is approached.

2.2.2 Practical Considerations

There are several questions to consider when we are attempting to use the ALC to solve a particular problem:

• How many training vectors are required to solve a particular problem?
• How is the expected output generated for each training vector?
• What is the appropriate dimension of the weight vector?
• What should be the initial values for the weights?
• Is a bias weight required?
• What happens if the signal statistics vary with time?


• What is the appropriate value for μ?
• How do we determine when to stop training?

The answers to these questions depend on the specific problem being addressed, so it is difficult to give well-defined responses that apply in all cases. Moreover, for a specific case, the answers are not necessarily independent.

Consider the dimension of the weight vector. If there are a well-defined number of inputs—say, from multiple sensors—then there would be one weight for each input. The question would be whether to add a bias weight. Figure 2.13 depicts this case, with the bias term added, in a somewhat standard form that shows the variability of the weights, the error term, and the feedback from the output to the weights. As for the bias term itself, including it sometimes helps convergence of the weights to an acceptable solution. It is perhaps best thought of as an extra degree of freedom, and its use is largely a matter of experimentation with the specific application.

A situation different from the previous paragraph arises if there is only a single input signal, say from a single electrocardiograph (EKG) sensor.

Figure 2.13  This figure shows a standard diagram of the ALC with multiple inputs and a bias term. Weights are indicated as variable resistors to emphasize the adaptive nature of the device. Calculation of the error, ε, is shown explicitly as the addition of a negative of the output signal to the desired output value.


For example, an ALC can be used to remove noise from the input signal in order to give a cleaner signal at the output. In a case such as this one, the ALC is arranged in a configuration known as a transverse filter. In this configuration, the input signal is sampled at several points in time, rather than from several sensors at a single time. Figure 2.14 shows the ALC arranged as a transverse filter.

For the transverse filter, each additional sample in time represents another degree of freedom that can be used to fit the input signal to the desired output signal. Thus, if you cannot get a good fit with a small number of samples, try a few more.

Figure 2.14  In an ALC arranged as a transverse filter, the individual samples are provided by n - 1, presumably equal, time delays, t_D. The ALC sees the signal at the current time, as well as its value at the previous n - 1 sample times. When data is initially applied, remember to wait at least n·t_D for data to be present at all of the ALC's inputs.
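A transverse filter of the kind shown in Figure 2.14 is easy to sketch in code. In the fragment below (the constant signal level, the noise amplitude, and μ are invented for illustration), a tapped delay line feeds a five-weight ALC that is trained by the LMS rule to recover a constant signal level from noisy samples; the weights settle near 1/TAPS each, that is, near a simple average of the taps.

#include <stdio.h>
#include <stdlib.h>

#define TAPS 5   /* the ALC sees the current sample and 4 delayed samples */

int main(void)
{
    double delay[TAPS] = {0}, w[TAPS] = {0};
    double mu = 0.005, C = 3.0;
    int t, j;

    for (t = 0; t < 20000; t++) {
        /* shift the tapped delay line and insert the newest sample */
        for (j = TAPS - 1; j > 0; j--) delay[j] = delay[j-1];
        delay[0] = C + 0.3 * ((double)rand() / RAND_MAX - 0.5);   /* signal + noise */

        double y = 0.0;
        for (j = 0; j < TAPS; j++) y += w[j] * delay[j];   /* filter output          */

        double err = C - y;                                /* desired: the clean level */
        for (j = 0; j < TAPS; j++)
            w[j] += 2.0 * mu * err * delay[j];             /* LMS update             */
    }
    for (j = 0; j < TAPS; j++) printf("w[%d] = %f\n", j, w[j]);  /* near 1/TAPS each */
    return 0;
}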


On the other hand, if you get good convergence with your first choice, try one with fewer samples to see whether you get a significant speedup in convergence and still have satisfactory results (you may be surprised to find that the results are better in some cases). Moreover, the bias weight is probably superfluous in this case.

Earlier, we alluded to a relationship between training time and the dimension of the weight vector, especially for the software simulations that we consider in this text: More weights generally mean longer training times. This relationship must constantly be balanced against other factors, such as the acceptability of the solution. As stated in the previous paragraph, using more weights does not always result in a better solution. Furthermore, there are other factors that affect both the training time and the acceptability of the solution.

The parameter μ is one factor that has a significant effect on training. If μ is too large, convergence will never take place, no matter how long the training period is. If the statistics of the input signal are known, it is possible to show that the value of μ is restricted to the range

    0 < μ < 1/λ_max

where λ_max is the largest eigenvalue of the matrix R, the input correlation matrix discussed in Section 2.2.1 [9]. Although it is not always reasonable to expect these statistics to be known, there are cases where they can be estimated. The text by Widrow and Stearns contains many examples. In this text, we propose a more heuristic approach: Pick a value for μ such that a weight does not change by more than a small fraction of its current value. This rule is admittedly vague, but experience appears to be the best teacher for selecting an appropriate value for μ.

As training proceeds, the error value ε_k will diminish (hopefully), resulting in smaller and smaller weight changes, and, hence, in a slower convergence toward the minimum of the weight surface. It is sometimes useful to increase the value of μ during these periods to speed convergence. Bear in mind, however, that a larger μ may mean that the weights might bounce around the bottom of the weight surface, giving an overall error that is unacceptable. Here again, experience is necessary to enable us to judge effectively.

One method of compensating for differences in problems is to use normalized input vectors: instead of x_k, use x_k/|x_k|. Another tactic is to scale the desired output value. These methods help particularly when we are selecting initial weight values or a value for μ. In most cases, weights can be initialized to random values of small real numbers—say, between -1.0 and 1.0.
The value of μ is usually best kept significantly less than 1; a value of 0.1 or even 0.05 may be reasonable for some problems, but values considerably less may be required. The question of when to stop training is largely a matter of the requirements on the output of the system. You determine the amount of error that you can tolerate on the output signal, and train until the observed error is consistently less than the required value. Since the mean squared error is the value used to derive the training algorithm, that is the quantity that usually determines when


a system has converged to its minimum-error solution. Alternatively, observing individual errors is often necessary, since the system performance may have a requirement that no error exceed a certain amount. Nevertheless, a mean squared error that falls as the iteration number increases is probably your best indication that the system is converging toward a solution.

We usually assume that the input signals are statistically stationary, and, therefore, ⟨ε_k²⟩ is essentially a constant after the optimum weight values have been determined. During training, ⟨ε_k²⟩ will hopefully decrease toward a stable solution. Suppose, however, that the input signal statistics change somewhat over time, or undergo some discontinuity: Additional training would be required to compensate.

One way to deal with this situation is to cease or resume training conditionally, based on the current value of ⟨ε_k²⟩. If the signal statistics change, training can be reinitiated until ⟨ε_k²⟩ is again reduced to an acceptable value. This method presumes that a method of error measurement is available.

Provided that the input signals are statistically stationary, choosing the number of input vectors to use during training may be relatively simple. You can use real, time-sequenced inputs as training vectors, provided that you know the desired output for each input vector. If it is possible to identify a sample of input vectors that adequately reproduces the statistical distribution of the actual inputs, it may be possible to train on this set in a shorter time. The accuracy of the training depends on how well the selected set of training vectors models the distribution of the entire input-signal space.

The other, related question is how to go about determining the desired output for a given input vector. As with many questions discussed in this section, this depends on the specific details of the problem. Fortunately, for some problems, knowing the desired result is easy compared to finding an algorithm for transforming the inputs into the desired result. The ALC will often solve the difficult part. The "easy" part is left to the engineer.

Exercise 2.4: A lowpass filter can be constructed with an Adaline having two weights. Consider a simple case of the removal of random noise from a constant signal. The constant signal level is C = 3, and the random noise signal has a constant power, ⟨r²⟩ = 0.025.
Assume that the random noise is completely uncorrelated with the constant input signal. Calculate the optimum weight vector and the mean squared error in the output after the optimum weight vector has been found. By finding the eigenvalues of the matrix R, determine the maximum value of the constant μ for use in the LMS algorithm.

2.3 APPLICATIONS OF ADAPTIVE SIGNAL PROCESSING

Up to now, we have been concerned with the Adaline minus the threshold condition on the output. In Section 2.4, on the Madaline, we will replace the threshold condition and examine networks of Adalines. In this section, we will


look at a few examples of adaptive signal processing using only the ALC portion of the Adaline.

2.3.1 Echo Cancellation in Telephone Circuits

You may have experienced the phenomenon of echo in telephone conversations: you hear the words you speak into the mouthpiece a fraction of a second later in the earphone of the telephone. The echo tends to be most noticeable on long-distance calls, especially those over satellite links where transmission delays can be a significant fraction of a second.

Telephone circuits contain devices called hybrids that are intended to isolate incoming signals from outgoing signals, thus avoiding the echo effect. Unfortunately, these circuits do not always perform perfectly, due to causes such as impedance mismatches, resulting in some echo back to the speaker. Even when the echo signal has been attenuated by a substantial amount, it still may be audible, and hence an annoyance to the speaker.

Certain echo-suppression devices rely on relays that open and close circuits in the outgoing lines so that incoming voice signals are not sent back to the speaker. When transmission delays are long, as with satellite communications, these echo suppressors can result in a loss of parts of words. This choppy-speech effect is perhaps more familiar than the echo effect. An adaptive filter can be used to remove the echo effect without the choppiness of the relays used in other echo-suppression circuits [9, 7].

Figure 2.15 is a block diagram of a telephone circuit with an adaptive filter used as an echo-suppression device. The echo is caused by a leakage of the incoming voice signal to the output line through the hybrid circuit. This leakage adds to the output signal coming from the microphone. The output of the adaptive filter, y, is subtracted from the outgoing signal, s + n', where s is the outgoing pure voice signal and n' is the noise, or echo, caused by leakage of the incoming voice signal through the hybrid circuit. The success of the echo cancellation depends on how well the adaptive filter can mimic the leakage through the hybrid circuit.

Notice that the input to the filter is a copy of the incoming signal, n, and that the error is a copy of the outgoing signal,

    ε = s + n' - y                                           (2.16)

We assume that y is correlated with the noise, n', but not with the pure voice signal, s. If the quantity n' - y is nonzero, some echo still remains in the outgoing signal. Squaring and taking expectation values of both sides of Eq.
(2.16) gives

    ⟨ε²⟩ = ⟨s²⟩ + ⟨(n' - y)²⟩ + 2⟨s(n' - y)⟩                  (2.17)
         = ⟨s²⟩ + ⟨(n' - y)²⟩                                 (2.18)

Equation (2.18) follows, since s is not correlated with either y or n', resulting in the last term in Eq. (2.17) being equal to zero.
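A minimal simulation of this arrangement is sketched below. The hybrid leakage path, the signals, and all constants are hypothetical; the adaptive filter sees only the incoming signal n, adapts on the outgoing-line error s + n' - y of Eq. (2.16), and the uncanceled echo power ⟨(n' - y)²⟩ is estimated over the last thousand samples.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TAPS 8   /* adaptive filter length (illustrative) */

int main(void)
{
    double w[TAPS] = {0}, x[TAPS] = {0};     /* filter weights and its delay line     */
    double leak[3] = { 0.5, -0.3, 0.1 };     /* hypothetical hybrid leakage path      */
    double nline[3] = {0};                   /* recent incoming samples for the leak  */
    double mu = 0.01, resid = 0.0;
    int t, j;

    for (t = 0; t < 20000; t++) {
        double n = sin(0.05 * t);                              /* incoming (far-end) voice  */
        double s = 0.5 * ((double)rand() / RAND_MAX - 0.5);    /* outgoing (near-end) voice */

        nline[2] = nline[1]; nline[1] = nline[0]; nline[0] = n;
        double echo = leak[0]*nline[0] + leak[1]*nline[1] + leak[2]*nline[2];  /* n'  */

        for (j = TAPS - 1; j > 0; j--) x[j] = x[j-1];          /* filter input is n     */
        x[0] = n;
        double y = 0.0;
        for (j = 0; j < TAPS; j++) y += w[j] * x[j];

        double e = s + echo - y;                               /* outgoing line, Eq. (2.16) */
        for (j = 0; j < TAPS; j++) w[j] += 2.0 * mu * e * x[j];/* LMS update             */

        if (t >= 19000) resid += (echo - y) * (echo - y);      /* uncanceled echo power  */
    }
    printf("mean uncanceled echo power: %g\n", resid / 1000.0);
    return 0;
}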


Figure 2.15  This figure is a schematic of a telephone circuit using an adaptive filter to cancel echo. The adaptive filter is depicted as a box; the slanted arrow represents the adjustable weights.

The signal power, ⟨s²⟩, is determined by the source of the voice signal—say, some amplifier at the telephone switching station local to the sender. Thus, ⟨s²⟩ is not directly affected by changes in ⟨ε²⟩. The adaptive filter attempts to minimize ⟨ε²⟩, and, in doing so, minimizes ⟨(n' - y)²⟩, the power of the uncanceled noise on the outgoing line.

Since there is only one input to the adaptive filter, the device would be configured as a transverse filter. Widrow and Stearns [9] suggest sampling the incoming signal at a rate of 8 kHz and using 128 weight values.

2.3.2 Other Applications

Rather than go into the details of the many applications that can be addressed by these adaptive filters, we refer you once again to the excellent text by Widrow and Stearns. In this section, we shall simply suggest a few broad areas where adaptive filters can be used in addition to the echo-cancellation application we have discussed.

Figure 2.16 shows an adaptive filter that is used to predict the future value of a signal based on its present value. A second example is shown in Figure 2.17. In this example, the adaptive filter learns to reproduce the output from some plant based on inputs to the system. This configuration has many uses as an adaptive control system. The plant could represent many things, including a human operator. In that case, the adaptive filter could learn how to respond to changing conditions by watching the human operator. Eventually, such a device might result in an automated control system, leaving the human free for more important tasks.²

² Such as training another adaptive filter with the Standard & Poors 500.


Figure 2.16  This schematic shows an adaptive filter used to predict signal values. The input signal used to train the network is a delayed value of the actual signal; that is, it is the signal at some past time. The expected output is the current value of the signal. The adaptive filter attempts to minimize the error between its output and the current signal, based on an input of the signal value from some time in the past. Once the filter is correctly predicting the current signal based on the past signal, the current signal can be used directly as an input without the delay. The filter will then make a prediction of the future signal value.

Figure 2.17  This example shows an adaptive filter used to model the output from a system, called the plant. Inputs to the filter are the same as those to the plant. The filter adjusts its weights based on the difference between its output and the output of the plant.
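The prediction configuration of Figure 2.16 differs from the earlier sketches only in where the input and desired signals are taken: the filter input is the signal one step in the past, and the desired output is the current sample. The signal, the number of taps, and μ below are arbitrary choices for illustration.

#include <stdio.h>
#include <math.h>

#define TAPS 4

int main(void)
{
    double w[TAPS] = {0}, x[TAPS] = {0};
    double mu = 0.02, err = 0.0;
    int t, j;

    for (t = 1; t < 5000; t++) {
        for (j = TAPS - 1; j > 0; j--) x[j] = x[j-1];
        x[0] = sin(0.2 * (t - 1));               /* input: the signal one step in the past */

        double y = 0.0;
        for (j = 0; j < TAPS; j++) y += w[j] * x[j];

        err = sin(0.2 * t) - y;                  /* desired output: the current value      */
        for (j = 0; j < TAPS; j++) w[j] += 2.0 * mu * err * x[j];   /* LMS update          */
    }
    printf("one-step prediction error at the end of training: %f\n", err);
    return 0;
}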


Another useful application of these devices is in adaptive beam-forming antenna arrays. Although the term antenna is usually associated with electromagnetic radiation, we broaden the definition here to include any spatial array of sensors. The basic task here is to learn to steer the array. At any given time, a signal may be arriving from any given direction, but antennae usually are directional in their reception characteristics: They respond to signals in some directions, but not in others. The antenna array with adaptive filters learns to adjust its directional characteristics in order to respond to the incoming signal no matter what the direction is, while reducing its response to unwanted noise signals coming in from other directions.

Of course, we have only touched on the number of applications for these devices. Unlike many other neural-network architectures, this is a relatively mature device with a long history of success. In the next section, we replace the binary output condition on the ALC circuit so that the latter becomes, once again, the complete Adaline.

2.4 THE MADALINE

As you can see from the discussion in Chapter 1, the Adaline resembles the perceptron closely; it also has some of the same limitations as the perceptron. For example, a two-input Adaline cannot compute the XOR function. Combining Adalines in a layered structure can overcome this difficulty, as we did in Chapter 1 with the perceptron. Such a structure is illustrated in Figure 2.18.

Exercise 2.5: What logic function is being computed by the single Adaline in the output layer of Figure 2.18? Construct a three-input Adaline that computes the majority function.

2.4.1 Madaline Architecture

Madaline is the acronym for Many Adalines. Arranged in a multilayered architecture as illustrated in Figure 2.19, the Madaline resembles the general neural-network structure shown in Chapter 1. In this configuration, the Madaline could be presented with a large-dimensional input vector—say, the pixel values from a raster scan. With suitable training, the network could be taught to respond with a binary +1 on one of several output nodes, each of which corresponds to a different category of input image. Examples of such categorization are {cat, dog, armadillo, javelina} and {Flogger, Tom Cat, Eagle, Fulcrum}. In such a network, each of four nodes in the output layer corresponds to a single class. For a given input pattern, a node would have a +1 output if the input pattern corresponded to the class represented by that particular node. The other three nodes would have a -1 output.
If the input pattern were not a member of any known class, the results from the network could be ambiguous.

To train such a network, we might be tempted to begin with the LMS algorithm at the output layer. Since the network is presumably trained with previously identified input patterns, the desired output vector is known. What


Figure 2.18  Many Adalines (the Madaline) can compute the XOR function of two inputs. Note the addition of the bias terms to each Adaline. A positive analog output from an ALC results in a +1 output from the associated Adaline; a negative analog output results in a -1. Likewise, any inputs to the device that are binary in nature must use ±1 rather than 1 and 0.

we do not know is the desired output for a given node on one of the hidden layers. Furthermore, the LMS algorithm would operate on the analog outputs of the ALC, not on the bipolar output values of the Adaline. For these reasons, a different training strategy has been developed for the Madaline.

2.4.2 The MRII Training Algorithm

It is possible to devise a method of training a Madaline-like structure based on the LMS algorithm; however, the method relies on replacing the linear threshold output function with a continuously differentiable function (the threshold function is discontinuous at 0; hence, it is not differentiable there). We will take up the study of this method in the next chapter. For now, we consider a method known as Madaline rule II (MRII). The original Madaline rule was an earlier


Figure 2.19  Many Adalines can be joined in a layered neural network such as this one.

method that we shall not discuss here. Details can be found in references given at the end of this chapter.

MRII resembles a trial-and-error procedure with added intelligence in the form of a minimum disturbance principle. Since the output of the network is a series of bipolar units, training amounts to reducing the number of incorrect output nodes for each training input pattern. The minimum disturbance principle enforces the notion that those nodes that can affect the output error while incurring the least change in their weights should have precedence in the learning procedure. This principle is embodied in the following algorithm:

1. Apply a training vector to the inputs of the Madaline and propagate it through to the output units.

2. Count the number of incorrect values in the output layer; call this number the error.

3. For all units on the output layer,

   a. Select the first previously unselected node whose analog output is closest to zero. (This node is the node that can reverse its bipolar output


      with the least change in its weights—hence the term minimum disturbance.)

   b. Change the weights on the selected unit such that the bipolar output of the unit changes.

   c. Propagate the input vector forward from the inputs to the outputs once again.

   d. If the weight change results in a reduction in the number of errors, accept the weight change; otherwise, restore the original weights.

4. Repeat step 3 for all layers except the input layer.

5. For all units on the output layer,

   a. Select the previously unselected pair of units whose analog outputs are closest to zero.

   b. Apply a weight correction to both units, in order to change the bipolar output of each.

   c. Propagate the input vector forward from the inputs to the outputs.

   d. If the weight change results in a reduction in the number of errors, accept the weight change; otherwise, restore the original weights.

6. Repeat step 5 for all layers except the input layer.

If necessary, the sequence in steps 5 and 6 can be repeated with triplets of units, or quadruplets of units, or even larger combinations, until satisfactory results are obtained. Preliminary indications are that pairs are adequate for modest-sized networks with up to 25 units per layer [8].

At the time of this writing, the MRII was still undergoing experimentation to determine its convergence characteristics and other properties. Moreover, a new learning algorithm, MRIII, has been developed. MRIII is similar to MRII, but the individual units have a continuous output function, rather than the bipolar threshold function [2]. In the next section, we shall use a Madaline architecture to examine a specific problem in pattern recognition.

2.4.3 A Madaline for Translation-Invariant Pattern Recognition

Various Madaline structures have been used recently to demonstrate the applicability of this architecture to adaptive pattern recognition having the properties of translation invariance, rotation invariance, and scale invariance. These three properties are essential to any robust system that would be called on to recognize objects in the field of view of optical or infrared sensors, for example. Remember, however, that even humans do not always instantly recognize objects that have been rotated to unfamiliar orientations, or that have been scaled significantly smaller or larger than their everyday size. The point is that there may be alternatives to training in instantaneous recognition at all angles and scale factors. Be that as it may, it is possible to build neural-network devices that exhibit these characteristics to some degree.


Figure 2.20 shows a portion of a network that is used to implement translation-invariant recognition of a pattern [7]. The retina is a 5-by-5-pixel array on which bit-mapped representation of patterns, such as the letters of the alphabet, can be placed. The portion of the network shown is called a slab. Unlike a layer, a slab does not communicate with other slabs in the network, as will be seen shortly. Each Adaline in the slab receives the identical 25 inputs from the retina, and computes a bipolar output in the usual fashion; however, the weights on the 25 Adalines share a unique relationship.

Consider the weights on the top-left Adaline as being arranged in a square matrix duplicating the pixel array on the retina. The Adaline to the immediate

Figure 2.20  This single slab of Adalines will give the same output (either +1 or -1) for a particular pattern on the retina, regardless of the horizontal or vertical alignment of that pattern on the retina. All 25 individual Adalines are connected to a single Adaline that computes the majority function: If most of the inputs are +1, the majority element responds with a +1 output. The network derives its translation-invariance properties from the particular configuration of the weights. See the text for details.


right of the top-left pixel has the identical set of weight values, but translated one pixel to the right: The rightmost column of weights on the first unit wraps around to the left to become the leftmost column on the second unit. Similarly, the unit below the top-left unit also has the identical weights, but translated one pixel down. The bottom row of weights on the first unit becomes the top row of the unit under it. This translation continues across each row and down each column in a similar manner. Figure 2.21 illustrates some of these weight matrices.

Figure 2.21  The weight matrix in the upper left is the key weight matrix. All other weight matrices on the slab are derived from this matrix. The matrix to the right of the key weight matrix represents the matrix on the Adaline directly to the right of the one with the key weight matrix. Notice that the fifth column of the key weight matrix has wrapped around to become the first column, with the other columns shifting one space to the right. The matrix below the key weight matrix is the one on the Adaline directly below the Adaline with the key weight matrix. The matrix diagonal to the key weight matrix represents the matrix on the Adaline at the lower right of the slab.
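The weight-sharing scheme just described can be verified with a short program. In the sketch below, the key weight matrix and the retina pattern are random and purely illustrative; each Adaline's weight matrix is the key matrix circularly shifted to that Adaline's position, and the slab's majority output is computed for a pattern and for a wrapped-around translation of it. The two outputs agree, since translating the pattern merely permutes which Adaline produces which output.

#include <stdio.h>
#include <stdlib.h>

#define N 5   /* 5-by-5 retina and 5-by-5 slab of Adalines */

/* Output of the Adaline at slab position (r,c): its weight matrix is the
   key matrix circularly shifted r rows down and c columns to the right. */
static int adaline_out(double key[N][N], int ret[N][N], int r, int c)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += key[(i - r + N) % N][(j - c + N) % N] * ret[i][j];
    return (sum >= 0.0) ? +1 : -1;
}

/* Majority element over the whole slab of 25 Adalines. */
static int slab_out(double key[N][N], int ret[N][N])
{
    int votes = 0;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            votes += adaline_out(key, ret, r, c);
    return (votes >= 0) ? +1 : -1;
}

int main(void)
{
    double key[N][N];
    int pat[N][N], shifted[N][N];

    for (int i = 0; i < N; i++)               /* random key weight matrix */
        for (int j = 0; j < N; j++)
            key[i][j] = (double)rand() / RAND_MAX - 0.5;

    for (int i = 0; i < N; i++)               /* a bipolar test pattern   */
        for (int j = 0; j < N; j++)
            pat[i][j] = (rand() & 1) ? +1 : -1;

    /* translate the pattern 2 pixels right and 1 down, with wraparound */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            shifted[(i + 1) % N][(j + 2) % N] = pat[i][j];

    printf("slab output, original pattern: %d\n", slab_out(key, pat));
    printf("slab output, shifted pattern : %d\n", slab_out(key, shifted));
    return 0;
}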


Because of this relationship among the weight matrices, a single pattern on the retina will elicit identical responses from the slab, independent of the pattern's translational position on the retina. We encourage you to reflect on this result for a moment (perhaps several moments), to convince yourself of its validity.

The majority node is a single Adaline that computes a binary output based on the outputs of the majority of the Adalines connecting to it. Because of the translational relationship among the weight vectors, the placement of a particular pattern at any location on the retina will result in the identical output from the majority element (we impose the restriction that patterns that extend beyond the retina boundaries will wrap around to the opposite side, just as the various weight matrices are derived from the key weight matrix). Of course, a pattern different from the first may elicit a different response from the majority element. Because only two responses are possible, the slab can differentiate two classes of input patterns. In terms of hyperspace, a slab is capable of dividing hyperspace into two regions.

To overcome the limitation of only two possible classes, the retina can be connected to multiple slabs, each having different key weight matrices (Widrow and Winter's term for the weight matrix on the top-left element of each slab). Given the binary nature of the output of each slab, a system of n slabs could differentiate 2^n different pattern classes. Figure 2.22 shows four such slabs producing a four-dimensional output capable of distinguishing 16 different input-pattern classes with translational invariance.

Let's review the basic operation of the translation-invariance network in terms of a specific example. Consider the 16 letters A through P as the input patterns we would like to identify, regardless of their up-down or left-right translation on the 5-by-5-pixel retina. These translated retina patterns are the inputs to the slabs of the network. Each retina pattern results in an output pattern from the invariance network that maps to one of the 16 input classes (in this case, each class represents a letter). By using a lookup table, or other method, we can associate the 16 possible outputs from the invariance network with one of the 16 possible letters that can be identified by the network.

So far, nothing has been said concerning the values of the weights on the Adalines of the various slabs in the system. That is because it is not actually necessary to train those nodes in the usual sense. In fact, each key weight matrix can be chosen at random, provided that each input-pattern class results in a unique output vector from the invariance network.
Using the example of the previous paragraph, any translation of one of the letters should result in the same output from the invariance network. Furthermore, any pattern from a different class (i.e., a different letter) must result in a different output vector from the network. This requirement means that, if you pick a random key weight matrix for a particular slab and find that two letters give the same output pattern, you can simply pick a different weight matrix.

As an alternative to random selection of key weight matrices, it may be possible to optimize selection by employing a training procedure based on the MRII. Investigations in this area are ongoing at the time of this writing [7].


Figure 2.22  Each of the four slabs in the system depicted here will produce a +1 or a -1 output value for every pattern that appears on the retina. The output vector is a four-digit binary number, so the system can potentially differentiate up to 16 different classes of input patterns.

2.5 SIMULATING THE ADALINE

As we shall for the implementation of all other network simulators we will present, we shall begin this section by describing how the general data structures are used to model the Adaline unit and Madaline network. Once the basic architecture has been presented, we will describe the algorithmic process needed to propagate signals through the Adaline. The section concludes with a discussion of the algorithms needed to cause the Adaline to self-adapt according to the learning laws described previously.

2.5.1 Adaline Data Structures

It is appropriate that the Adaline is the first test of the simulator data structures we presented in Chapter 1, for two reasons:

1. Since the forward propagation of signals through the single Adaline is virtually identical to the forward propagation process in most of the other networks we will study, it is beneficial for us to observe the Adaline to


gain a better understanding of what is happening in each unit of a larger network.

2. Because the Adaline is not a network, its implementation exercises the versatility of the network structures we have defined.

As we have already seen, the Adaline is only a single processing unit. Therefore, some of the generality we built into our network structures will not be required. Specifically, there will be no real need to handle multiple units and layers of units for the Adaline. Nevertheless, we will include the use of those structures, because we would like to be able to extend the Adaline easily into the Madaline.

We begin by defining our network record as a structure that will contain all the parameters that will be used globally, as well as pointers to locate the dynamic arrays that will contain the network data. In the case of the Adaline, a good candidate structure for this record will take the form

record Adaline =
    mu     : float;     {Storage for stability term}
    input  : ^layer;    {Pointer to input layer}
    output : ^layer;    {Pointer to output layer}
end record

Note that, even though there is only one unit in the Adaline, we will use two layers to model the network. Thus, the input and output pointers will point to different layer records. We do this because we will use the input layer as storage for holding the input signal vector to the Adaline. There will be no connections associated with this layer, as the input will be provided by some other process in the system (e.g., a time-multiplexed analog-to-digital converter, or an array of sensors).

Conversely, the output layer will contain one weight array to model the connections between the input and the output (recall that our data structures presume that PEs process input connections primarily). Keeping in mind that we would like to extend this structure easily to handle the Madaline network, we will retain the indirection to the connection-weight array provided by the weight_ptr array described in Chapter 1. Notice that, in the case of the Adaline, however, the weight_ptr array will contain only one value, the pointer to the input connection array.

There is one other thing to consider that may vary between Adaline units. As we have seen previously, there are two parts to the Adaline structure: the linear ALC and the bipolar Adaline units.
To distinguish between them, we define an enumerated type to classify each Adaline neuron:

type NODE_TYPE : {linear, binary};

We now have everything we need to define the layer record structure for the Adaline. A prototype structure for this record is as follows.


2.5 Simulat<strong>in</strong>g the Adal<strong>in</strong>erecord layer =activation : NODE_TYPE {k<strong>in</strong>d of Adal<strong>in</strong>e node}outs: ~float[]; {po<strong>in</strong>ter to unit output array}weights : ""float[]; {<strong>in</strong>direct access to weight arrays}end recordF<strong>in</strong>ally, three dynamically allocated arrays are needed to conta<strong>in</strong> the outputof the Adal<strong>in</strong>e unit, the weight_ptrs <strong>and</strong> the connection weights values.We will not specify the structure of these arrays, other than to <strong>in</strong>dicate that theouts <strong>and</strong> weights arrays will both conta<strong>in</strong> float<strong>in</strong>g-po<strong>in</strong>t values, whereas theweight_ptr array will store memory addresses <strong>and</strong> must therefore conta<strong>in</strong>memory po<strong>in</strong>ter types. The entire data structure for the Adal<strong>in</strong>e simulator isdepicted <strong>in</strong> Figure 2.23.2.5.2 Signal Propagation Through the Adal<strong>in</strong>eIf signals are to be propagated through the Adal<strong>in</strong>e successfully, two activitiesmust occur: We must obta<strong>in</strong> the <strong>in</strong>put signal vector to stimulate the Adal<strong>in</strong>e,<strong>and</strong> the Adal<strong>in</strong>e must perform its <strong>in</strong>put-summation <strong>and</strong> output-transformationfunctions. S<strong>in</strong>ce the orig<strong>in</strong> of the <strong>in</strong>put signal vector is somewhat applicationspecific, we will presume that the user will provide the code necessary to keepthe data located <strong>in</strong> the outs array <strong>in</strong> the Adal<strong>in</strong>e. <strong>in</strong>puts layer current.We shall now concentrate on the matter of comput<strong>in</strong>g the <strong>in</strong>put stimulationvalue <strong>and</strong> transform<strong>in</strong>g it to the appropriate output. We can accomplish thistask through the application of two algorithmic functions, which we will namesunu<strong>in</strong>puts <strong>and</strong> compute_output. The algorithms for these functions areas follows:outputsweightsFigure 2.23The Adal<strong>in</strong>e simulator data structure is shown.


function sum_inputs (INPUTS : ^float[]; WEIGHTS : ^float[]) return float
var sum : float;     {local accumulator}
    temp : float;    {scratch memory}
    ins : ^float[];  {local pointer}
    wts : ^float[];  {local pointer}
    i : integer;     {iteration counter}
begin
  sum = 0;                        {initialize accumulator}
  ins = INPUTS;                   {locate input array}
  wts = WEIGHTS;                  {locate connection array}
  for i = 1 to length(wts) do     {for all weights in array}
    temp = ins[i] * wts[i];       {modulate input}
    sum = sum + temp;             {sum modulated inputs}
  end do;
  return (sum);                   {return the modulated sum}
end function;

function compute_output (INPUT : float; ACT : NODE_TYPE) return float
begin
  if (ACT = linear)          {if the Adaline is a linear unit}
  then return (INPUT)        {then just return the input}
  else                       {otherwise}
    if (INPUT >= 0.0)        {if the input is positive}
    then return (1.0)        {then return a binary true}
    else return (-1.0);      {else return a binary false}
end function;

2.5.3 Adapting the Adaline

Now that our simulator can forward propagate signal information, we turn our attention to the implementation of the learning algorithms. Here again we assume that the input signal pattern is placed in the appropriate array by an application-specific process. During training, however, we will need to know what the target output d_k is for every input vector, so that we can compute the error term for the Adaline.

Recall that, during training, the LMS algorithm requires that the Adaline update its weights after every forward propagation for a new input pattern. We must also consider that the Adaline application may need to adapt the


Adaline while it is running. Based on these observations, there is no need to store or accumulate errors across all patterns within the training algorithm. Thus, we can design the training algorithm merely to adapt the weights for a single pattern. However, this design decision places on the application program the responsibility for determining when the Adaline has trained sufficiently.

This approach is usually acceptable because of the advantages it offers over the implementation of a self-contained training loop. Specifically, it means that we can use the same training function to adapt the Adaline initially or while it is on-line. The generality of the algorithm is a particularly useful feature, in that the application program merely needs to detect a condition requiring adaptation. It can then sample the input that caused the error and generate the correct response "on the fly," provided we have some way of knowing that the error is increasing and can generate the correct desired values to accommodate retraining. These values, in turn, can then be input to the Adaline training algorithm, thus allowing adaptation at run time. Finally, it also reduces the housekeeping chores that must be performed by the simulator, since we will not need to maintain a list of expected outputs for all training patterns.

We must now define algorithms to compute the squared error term, ε²(t), and the approximation of the gradient of the error surface, and to update the connection weights of the Adaline. We can again simplify matters by combining the computation of the error and the update of the connection weights into one function, as there is no need to compute the former without performing the latter. We now present the algorithms to accomplish these functions:

function compute_error (A : Adaline; TARGET : float) return float
var temp1 : float;   {scratch memory}
    temp2 : float;   {scratch memory}
    err : float;     {error term for unit}
begin
  temp1 = sum_inputs (A.input^.outs, A.output^.weights);
  temp2 = compute_output (temp1, A.output^.activation);
  err = TARGET - temp2;    {compute the signed error}
  return (err);            {return error}
end function;

function update_weights (A : Adaline; ERR : float) return void
var grad : float;    {the gradient of the error}
    ins : ^float[];  {pointer to inputs array}
    wts : ^float[];  {pointer to weights array}
    i : integer;     {iteration counter}


begin
  ins = A.input^.outs;          {locate start of input vector}
  wts = A.output^.weights;      {locate start of connections}
  for i = 1 to length(wts) do   {for all connections, do}
    grad = -2 * ERR * ins[i];          {approximate gradient}
    wts[i] = wts[i] - grad * A.mu;     {update connection}
  end do;
end function;

2.5.4 Completing the Adaline Simulator

The algorithms we have just defined are sufficient to implement an Adaline simulator in both learning and operational modes. To offer a clean interface to any external program that must call our simulator to perform an Adaline function, we can combine the modules we have described into two higher-level functions. These functions will perform the two types of activities the Adaline must perform: forward_propagate and adapt_Adaline.

function forward_propagate (A : Adaline) return void
var temp1 : float;   {scratch memory}
begin
  temp1 = sum_inputs (A.input^.outs, A.output^.weights);
  A.output^.outs[1] = compute_output (temp1, A.output^.activation);
end function;

function adapt_Adaline (A : Adaline; TARGET : float) return float
var err : float;   {error for this pattern; train until small}
begin
  forward_propagate (A);              {apply input signal}
  err = compute_error (A, TARGET);    {compute error}
  update_weights (A, err);            {adapt Adaline}
  return (err);
end function;

2.5.5 Madaline Simulator Implementation

As we have discussed earlier, the Madaline network is simply a collection of binary Adaline units, connected together in a layered structure. However, even though they share the same type of processing unit, the learning strategies


implemented for the Madaline are significantly different, as described in Section 2.3.2. Using that discussion as a guide, along with the discussion of the data structures needed, we leave the algorithm development for the Madaline network to you as an exercise.

In this regard, you should note that the layered structure of the Madaline lends itself directly to our simulator data structures. As illustrated in Figure 2.24, we can implement a layer of Adaline units as easily as we created a single Adaline. The major differences here will be the length of the outs arrays in the layer records (since there will be more than one Adaline output per layer), and the length and number of connection arrays (there will be one weights array for each Adaline in the layer, and the weight_ptr array will be extended by one slot for each new weights array).

Similarly, there will be more layer records as the depth of the Madaline increases, and, for each layer, there will be a corresponding increase in the number of outs, weights, and weight_ptr arrays. Based on these observations, one fact that becomes immediately apparent is the combinatorial growth of both the memory consumed and the computer time required to support a linear growth in network size. This relationship between computer resources and model sizing is true not only for the Madaline, but for all ANS models we will study. It is for these reasons that we have stressed optimization in data structures.

Figure 2.24 Madaline data structures are shown. (The figure depicts the layer record for a layer of Adaline units, with its outs, weights, and weight_ptr arrays.)
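To make the data flow concrete, the following is a minimal Python sketch; it is not part of the original pseudocode, and all names in it are our own. It mirrors the Adaline processing and LMS adaptation just described, and shows how a layer of binary Adalines could reuse the same routines for a Madaline.

def sum_inputs(ins, wts):
    # modulate each input by its connection weight and accumulate the results
    return sum(x * w for x, w in zip(ins, wts))

def compute_output(net, kind):
    # 'linear' returns the ALC output; 'binary' applies the bipolar threshold
    if kind == "linear":
        return net
    return 1.0 if net >= 0.0 else -1.0

def adapt_adaline(ins, wts, target, mu):
    # one LMS step: forward propagate, compute the error, update every weight
    err = target - compute_output(sum_inputs(ins, wts), "linear")
    for i in range(len(wts)):
        wts[i] += 2.0 * mu * err * ins[i]    # w := w - mu * grad, with grad = -2 * err * x
    return err

def madaline_layer(ins, weight_arrays):
    # a Madaline layer is several binary Adalines sharing the same input vector
    return [compute_output(sum_inputs(ins, w), "binary") for w in weight_arrays]

As in the simulator design above, the application program would call adapt_adaline repeatedly, with its own choice of mu, until the returned error is acceptably small.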


Programming Exercises

2.1. Extend the Adaline simulator to include the bias unit, θ, as described in the text.

2.2. Extend the simulator to implement a three-layer Madaline using the algorithms discussed in Section 2.3.2. Be sure to use the binary Adaline type. Test the operation of your simulator by training it to solve the XOR problem described in the text.

2.3. We have indicated that the network stability term, μ, can greatly affect the ability of the Adaline to converge on a solution. Using four different values for μ of your own choosing, train an Adaline to eliminate noise from an input sinusoid ranging from 0 to 2π (one way to do this is to use a scaled random-number generator to provide the noise). Graph the curve of training iterations versus μ.

Suggested Readings

The authoritative text by Widrow and Stearns is the standard reference for the material contained in this chapter [9]. The original delta-rule derivation is contained in a 1960 paper by Widrow and Hoff [6], which is also reprinted in the collection edited by Anderson and Rosenfeld [1].

Bibliography

[1] James A. Anderson and Edward Rosenfeld, editors. Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA, 1988.
[2] David Andes, Bernard Widrow, Michael Lehr, and Eric Wan. MRIII: A robust algorithm for training analog neural networks. In Proceedings of the International Joint Conference on Neural Networks, pages I-533 to I-536, January 1990.
[3] Richard W. Hamming. Digital Filters. Prentice-Hall, Englewood Cliffs, NJ, 1983.
[4] Wilfred Kaplan. Advanced Calculus, 3rd edition. Addison-Wesley, Reading, MA, 1984.
[5] Alan V. Oppenheim and Ronald W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 1975.
[6] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, New York, pages 96-104, 1960. IRE.
[7] Bernard Widrow and Rodney Winter. Neural nets for adaptive filtering and adaptive pattern recognition. Computer, 21(3):25-39, March 1988.


[8] Rodney Winter and Bernard Widrow. MADALINE RULE II: A training algorithm for neural networks. In Proceedings of the IEEE Second International Conference on Neural Networks, San Diego, CA, 1:401-408, July 1988.
[9] Bernard Widrow and Samuel D. Stearns. Adaptive Signal Processing. Signal Processing Series. Prentice-Hall, Englewood Cliffs, NJ, 1985.


Backpropagation

There are many potential computer applications that are difficult to implement because many problems are unsuited to solution by a sequential process. Applications that must perform some complex data translation, yet have no predefined mapping function to describe the translation process, or those that must provide a "best guess" as output when presented with noisy input data, are but two examples of problems of this type.

An ANS that we have found to be useful in addressing problems requiring recognition of complex patterns and performing nontrivial mapping functions is the backpropagation network (BPN), formalized first by Werbos [11], and later by Parker [8] and by Rumelhart and McClelland [7]. This network, illustrated generically in Figure 3.1, is designed to operate as a multilayer, feedforward network, using the supervised mode of learning.

The chapter begins with a discussion of an example problem, mapping character images to ASCII, which appears simple but can quickly overwhelm traditional approaches. Then, we look at how the backpropagation network operates to solve such a problem. Following that discussion is a detailed derivation of the equations that govern the learning process in the backpropagation network. From there, we describe some practical applications of the BPN as reported in the literature. The chapter concludes with details of the BPN software simulator within the context of the general design given in Chapter 1.

3.1 THE BACKPROPAGATION NETWORK

To illustrate some problems that often arise when we are attempting to automate complex pattern-recognition applications, let us consider the design of a computer program that must translate a 5 x 7 matrix of binary numbers representing the bit-mapped pixel image of an alphanumeric character to its equivalent eight-bit ASCII code. This basic problem, pictured in Figure 3.2, appears to be relatively trivial at first glance. Since there is no obvious mathematical function


Figure 3.1 The general backpropagation network architecture is shown. (Figure labels: "Output read in parallel," "Input applied in parallel," and bias units marked "Always 1.")

that will perform the desired translation, and because it would undoubtedly take too much time (both human and computer time) to perform a pixel-by-pixel correlation, the best algorithmic solution would be to use a lookup table.

The lookup table needed to solve this problem would be a one-dimensional linear array of ordered pairs, each taking the form:

record AELEMENT =
  pattern : long integer;
  ascii : byte;
end record;

Figure 3.2 Each character image is mapped to its corresponding ASCII code. (The figure shows the 5 x 7 pixel image of the letter "A," its 35-bit binary representation, the equivalent value 0951FC631 in base 16, and the ASCII code 65 in base 10.)
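As a concrete illustration of this encoding, here is a minimal Python sketch; it is not part of the original text, and the particular bit map shown is our own assumption. It flattens a 5 x 7 character image into the 35-bit pattern field of one such ordered pair.

A_BITMAP = ["00100",
            "01010",
            "10001",
            "11111",
            "10001",
            "10001",
            "10001"]    # a hypothetical 5 x 7 image of the letter "A", one row per string

def to_pattern(rows):
    # move the seven rows into a single row and read the result as a binary number
    return int("".join(rows), 2)

aelement = {"pattern": to_pattern(A_BITMAP), "ascii": 65}   # one (pattern, ascii) pair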


The first field, pattern, is the numeric equivalent of the bit-pattern code, which we generate by moving the seven rows of the matrix into a single row and considering the result to be a 35-bit binary number. The second field, ascii, is the ASCII code associated with the character. The array would contain exactly the same number of ordered pairs as there were characters to convert. The algorithm needed to perform the conversion process would take the following form:

function TRANSLATE (INPUT : long integer; LUT : ^AELEMENT[]) return ascii;
{performs pixel-matrix to ASCII character conversion}
var TABLE : ^AELEMENT[];
    found : boolean;
    i : integer;
begin
  TABLE = LUT;                    {locate translation table}
  found = false;                  {translation not found yet}
  for i = 1 to length(TABLE) do   {for all items in table}
    if TABLE[i].pattern = INPUT
    then found = true; exit;      {translation found, quit loop}
  end;
  if found
  then return TABLE[i].ascii      {return ascii}
  else return 0;
end;

Although the lookup-table approach is reasonably fast and easy to maintain, there are many situations that occur in real systems that cannot be handled by this method. For example, consider the same pixel-image-to-ASCII conversion process in a more realistic environment. Let's suppose that our character-image scanner alters a random pixel in the input image matrix due to noise when the image was read. This single pixel error would cause the lookup algorithm to return either a null or the wrong ASCII code, since the match between the input pattern and the target pattern must be exact.

Now consider the amount of additional software (and, hence, CPU time) that must be added to the lookup-table algorithm to improve the ability of the computer to "guess" at which character the noisy image should have been. Single-bit errors are fairly easy to find and correct. Multibit errors become increasingly difficult as the number of bit errors grows. To complicate matters even further, how could our software compensate for noise on the image if that noise happened to make an "O" look like a "Q", or an "E" look like an "F"? If our character-conversion system had to produce an accurate output all the time,


92 Backpropagationan <strong>in</strong>ord<strong>in</strong>ate amount of CPU time would be spent elim<strong>in</strong>at<strong>in</strong>g noise from the<strong>in</strong>put pattern prior to attempt<strong>in</strong>g to translate it to ASCII.One solution to this dilemma is to take advantage of the parallel nature ofneural networks to reduce the time required by a sequential processor to performthe mapp<strong>in</strong>g. In addition, system-development time can be reduced because thenetwork can learn the proper algorithm without hav<strong>in</strong>g someone deduce thatalgorithm <strong>in</strong> advance.3.1.1 The Backpropagation ApproachProblems such as the noisy image-to-ASCII example are difficult to solve bycomputer due to the basic <strong>in</strong>compatibility between the mach<strong>in</strong>e <strong>and</strong> the problem.Most of today's computer systems have been designed to perform mathematical<strong>and</strong> logic functions at speeds that are <strong>in</strong>comprehensible to humans. Eventhe relatively unsophisticated desktop microcomputers commonplace today canperform hundreds of thous<strong>and</strong>s of numeric comparisons or comb<strong>in</strong>ations everysecond.However, as our previous example illustrated, mathematical prowess is notwhat is needed to recognize complex patterns <strong>in</strong> noisy environments. In fact,an algorithmic search of even a relatively small <strong>in</strong>put space can prove to betime-consum<strong>in</strong>g. The problem is the sequential nature of the computer itself;the "fetch-execute" cycle of the von Neumann architecture allows the mach<strong>in</strong>eto perform only one operation at a time. In most cases, the time requiredby the computer to perform each <strong>in</strong>struction is so short (typically about onemillionthof a second) that the aggregate time required for even a large programis <strong>in</strong>significant to the human users. However, for applications that must searchthrough a large <strong>in</strong>put space, or attempt to correlate all possible permutations ofa complex pattern, the time required by even a very fast mach<strong>in</strong>e can quicklybecome <strong>in</strong>tolerable.What we need is a new process<strong>in</strong>g system that can exam<strong>in</strong>e all the pixels <strong>in</strong>the image <strong>in</strong> parallel. Ideally, such a system would not have to be programmedexplicitly; rather, it would adapt itself to "learn" the relationship between a set ofexample patterns, <strong>and</strong> would be able to apply the same relationship to new <strong>in</strong>putpatterns. This system would be able to focus on the features of an arbitrary <strong>in</strong>putthat resemble other patterns seen previously, such as those pixels <strong>in</strong> the noisyimage that "look" like a known character, <strong>and</strong> to ignore the noise. Fortunately,such a system exists; we call this system the backpropagation network (BPN).3.1.2 BPN OperationIn Section 3.2, we will cover the details of the mechanics of backpropagation.A summary description of the network operation is appropriate here, to illustratehow the BPN can be used to solve complex pattern-match<strong>in</strong>g problems. To beg<strong>in</strong>with, the network learns a predef<strong>in</strong>ed set of <strong>in</strong>put-output example pairs by us<strong>in</strong>ga two-phase propagate-adapt cycle. 
After an input pattern has been applied as a stimulus to the first layer of network units, it is propagated through each upper


3.2 The Generalized Delta Rule 93layer until an output is generated. This output pattern is then compared to thedesired output, <strong>and</strong> an error signal is computed for each output unit.The error signals are then transmitted backward from the output layer toeach node <strong>in</strong> the <strong>in</strong>termediate layer that contributes directly to the output. However,each unit <strong>in</strong> the <strong>in</strong>termediate layer receives only a portion of the total errorsignal, based roughly on the relative contribution the unit made to the orig<strong>in</strong>aloutput. This process repeats, layer by layer, until each node <strong>in</strong> the network hasreceived an error signal that describes its relative contribution to the total error.Based on the error signal received, connection weights are then updated by eachunit to cause the network to converge toward a state that allows all the tra<strong>in</strong><strong>in</strong>gpatterns to be encoded.The significance of this process is that, as the network tra<strong>in</strong>s, the nodes<strong>in</strong> the <strong>in</strong>termediate layers organize themselves such that different nodes learnto recognize different features of the total <strong>in</strong>put space. After tra<strong>in</strong><strong>in</strong>g, whenpresented with an arbitrary <strong>in</strong>put pattern that is noisy or <strong>in</strong>complete, the units <strong>in</strong>the hidden layers of the network will respond with an active output if the new<strong>in</strong>put conta<strong>in</strong>s a pattern that resembles the feature the <strong>in</strong>dividual units learnedto recognize dur<strong>in</strong>g tra<strong>in</strong><strong>in</strong>g. Conversely, hidden-layer units have a tendency to<strong>in</strong>hibit their outputs if the <strong>in</strong>put pattern does not conta<strong>in</strong> the feature that theywere tra<strong>in</strong>ed to recognize.As the signals propagate through the different layers <strong>in</strong> the network, theactivity pattern present at each upper layer can be thought of as a pattern withfeatures that can be recognized by units <strong>in</strong> the subsequent layer. The outputpattern generated can be thought of as a feature map that provides an <strong>in</strong>dicationof the presence or absence of many different feature comb<strong>in</strong>ations at the <strong>in</strong>put.The total effect of this behavior is that the BPN provides an effective means ofallow<strong>in</strong>g a computer system to exam<strong>in</strong>e data patterns that may be <strong>in</strong>completeor noisy, <strong>and</strong> to recognize subtle patterns from the partial <strong>in</strong>put.Several researchers have shown that dur<strong>in</strong>g tra<strong>in</strong><strong>in</strong>g, BPNs tend to develop<strong>in</strong>ternal relationships between nodes so as to organize the tra<strong>in</strong><strong>in</strong>g data <strong>in</strong>toclasses of patterns [5]. This tendency can be extrapolated to the hypothesisthat all hidden-layer units <strong>in</strong> the BPN are somehow associated with specificfeatures of the <strong>in</strong>put pattern as a result of tra<strong>in</strong><strong>in</strong>g. Exactly what the associationis may or may not be evident to the human observer. 
What is important is that the network has found an internal representation that enables it to generate the desired outputs when given the training inputs. This same internal representation can be applied to inputs that were not used during training. The BPN will classify these previously unseen inputs according to the features they share with the training examples.


Figure 3.3 serves as the reference for most of the discussion. The BPN is a layered, feedforward network that is fully interconnected by layers. Thus, there are no feedback connections and no connections that bypass one layer to go directly to a later layer. Although only three layers are used in the discussion, more than one hidden layer is permissible.

A neural network is called a mapping network if it is able to compute some functional relationship between its input and its output. For example, if the input to a network is the value of an angle, and the output is the cosine of that angle, the network performs the mapping θ → cos(θ). For such a simple function, we do not need a neural network; however, we might want to perform a complicated mapping where we do not know how to describe the functional relationship in advance, but we do know of examples of the correct mapping.

Figure 3.3 The three-layer BPN architecture follows closely the general network description given in Chapter 1. The bias weights, θ^h_j and θ^o_k, and the bias units are optional. The bias units provide a fictitious input value of 1 on a connection to the bias weight. We can then treat the bias weight (or simply, bias) like any other weight: It contributes to the net-input value to the unit, and it participates in the learning process like any other weight.
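The bias convention mentioned in the caption can be captured in a few lines. The following Python sketch is not part of the original text and its names are our own; it simply appends the fictitious input of 1 so that the bias is learned as an ordinary weight.

def augment(x):
    # append the fictitious unit whose output is always 1
    return list(x) + [1.0]

def net_input(weight_row, x):
    # the final entry of weight_row then plays the role of the bias term
    return sum(w * xi for w, xi in zip(weight_row, augment(x)))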


In this situation, the power of a neural network to discover its own algorithms is extremely useful.

Suppose we have a set of P vector-pairs, (x_1, y_1), (x_2, y_2), ..., (x_P, y_P), which are examples of a functional mapping y = φ(x) : x ∈ R^N, y ∈ R^M. We want to train the network so that it will learn an approximation o = y' = φ'(x). We shall derive a method of doing this training that usually works, provided the training-vector pairs have been chosen properly and there is a sufficient number of them. (Definitions of properly and sufficient will be given in Section 3.3.) Remember that learning in a neural network means finding an appropriate set of weights. The learning technique that we describe here resembles the problem of finding the equation of a line that best fits a number of known points. Moreover, it is a generalization of the LMS rule that we discussed in Chapter 2. For a line-fitting problem, we would probably use a least-squares approximation. Because the relationship we are trying to map is likely to be nonlinear, as well as multidimensional, we employ an iterative version of the simple least-squares method, called a steepest-descent technique.

To begin, let's review the equations for information processing in the three-layer network in Figure 3.3. An input vector, x_p = (x_{p1}, x_{p2}, ..., x_{pN})^t, is applied to the input layer of the network. The input units distribute the values to the hidden-layer units. The net input to the jth hidden unit is

\mathrm{net}^h_{pj} = \sum_{i=1}^{N} w^h_{ji} x_{pi} + \theta^h_j     (3.1)

where w^h_{ji} is the weight on the connection from the ith input unit, and θ^h_j is the bias term discussed in Chapter 2. The "h" superscript refers to quantities on the hidden layer. Assume that the activation of this node is equal to the net input; then, the output of this node is

i_{pj} = f^h_j(\mathrm{net}^h_{pj})     (3.2)

The equations for the output nodes are

\mathrm{net}^o_{pk} = \sum_{j=1}^{L} w^o_{kj} i_{pj} + \theta^o_k     (3.3)

o_{pk} = f^o_k(\mathrm{net}^o_{pk})     (3.4)

where the "o" superscript refers to quantities on the output layer.

The initial set of weight values represents a first guess as to the proper weights for the problem. Unlike some methods, the technique we employ here does not depend on making a good first guess. There are guidelines for selecting the initial weights, however, and we shall discuss them in Section 3.3. The basic procedure for training the network is embodied in the following description:


1. Apply an input vector to the network and calculate the corresponding output values.
2. Compare the actual outputs with the correct outputs and determine a measure of the error.
3. Determine in which direction (+ or -) to change each weight in order to reduce the error.
4. Determine the amount by which to change each weight.
5. Apply the corrections to the weights.
6. Repeat items 1 through 5 with all the training vectors until the error for all vectors in the training set is reduced to an acceptable value.

In Chapter 2, we described an iterative weight-change law for networks with no hidden units and linear output units, called the LMS rule or delta rule:

w_i(t+1) = w_i(t) + 2\mu \varepsilon_k x_{ki}     (3.5)

where μ is a positive constant, x_{ki} is the ith component of the kth training vector, and ε_k is the difference between the actual output and the correct value, ε_k = (d_k - y_k). Equation 3.5 is just the component form of Eq. (2.15).

A similar equation results when the network has more than two layers, or when the output functions are nonlinear. We shall derive the results explicitly in the next sections.

3.2.1 Update of Output-Layer Weights

In the derivation of the delta rule, the error for the kth input vector is ε_k = (d_k - y_k), where the desired output is d_k and the actual output is y_k. In this chapter, we adopt a slightly different notation that is somewhat inconsistent with the notation we used in Chapter 2. Because there are multiple units in a layer, a single error value, such as ε_k, will not suffice for the BPN. We shall define the error at a single output unit to be δ_{pk} = (y_{pk} - o_{pk}), where the subscript "p" refers to the pth training vector, and "k" refers to the kth output unit. In this case, y_{pk} is the desired output value, and o_{pk} is the actual output from the kth unit. The error that is minimized by the GDR is the sum of the squares of the errors for all output units:

E_p = \frac{1}{2} \sum_{k=1}^{M} \delta_{pk}^2     (3.6)

The factor of 1/2 in Eq. (3.6) is there for convenience in calculating derivatives later. Since an arbitrary constant will appear in the final result, the presence of this factor does not invalidate the derivation.

To determine the direction in which to change the weights, we calculate the negative of the gradient of E_p, ∇E_p, with respect to the weights, w_{kj}. Then,


we can adjust the values of the weights such that the total error is reduced. It is often useful to think of E_p as a surface in weight space. Figure 3.4 shows a simple example where the network has only two weights.

To keep things simple, we consider each component of ∇E_p separately. From Eq. (3.6) and the definition of δ_{pk},

\frac{\partial E_p}{\partial w^o_{kj}} = -(y_{pk} - o_{pk}) \frac{\partial o_{pk}}{\partial w^o_{kj}}     (3.7)

and

\frac{\partial E_p}{\partial w^o_{kj}} = -(y_{pk} - o_{pk}) \frac{\partial f^o_k}{\partial(\mathrm{net}^o_{pk})} \frac{\partial(\mathrm{net}^o_{pk})}{\partial w^o_{kj}}     (3.8)

Figure 3.4 This hypothetical surface in weight space hints at the complexity of these surfaces in comparison with the relatively simple hyperparaboloid of the Adaline (see Chapter 2). The gradient, ∇E_p, at point z appears along with the negative of the gradient. Weight changes should occur in the direction of the negative gradient, which is the direction of the steepest descent of the surface at the point z. Furthermore, weight changes should be made iteratively until E_p reaches the minimum point.


where we have used Eq. (3.4) for the output value, o_{pk}, and the chain rule for partial derivatives. For the moment, we shall not try to evaluate the derivative of f^o_k, but instead will write it simply as f^{o\prime}_k(\mathrm{net}^o_{pk}). The last factor in Eq. (3.8) is

\frac{\partial(\mathrm{net}^o_{pk})}{\partial w^o_{kj}} = \frac{\partial}{\partial w^o_{kj}} \left( \sum_{j=1}^{L} w^o_{kj} i_{pj} + \theta^o_k \right) = i_{pj}     (3.9)

Combining Eqs. (3.8) and (3.9), we have for the negative gradient

-\frac{\partial E_p}{\partial w^o_{kj}} = (y_{pk} - o_{pk}) f^{o\prime}_k(\mathrm{net}^o_{pk})\, i_{pj}     (3.10)

As far as the magnitude of the weight change is concerned, we take it to be proportional to the negative gradient. Thus, the weights on the output layer are updated according to

w^o_{kj}(t+1) = w^o_{kj}(t) + \Delta_p w^o_{kj}(t)     (3.11)

where

\Delta_p w^o_{kj}(t) = \eta (y_{pk} - o_{pk}) f^{o\prime}_k(\mathrm{net}^o_{pk})\, i_{pj}     (3.12)

The factor η is called the learning-rate parameter. We shall discuss the value of η in Section 3.3. For now, it is sufficient to note that it is positive and is usually less than 1.

Let's go back to look at the function f^o_k. First, notice the requirement that the function f^o_k be differentiable. This requirement eliminates the possibility of using a linear threshold unit such as we described in Chapter 2, since the output function for such a unit is not differentiable at the threshold value.

There are two forms of the output function that are of interest here:

f^o_k(\mathrm{net}^o_{pk}) = \mathrm{net}^o_{pk}

f^o_k(\mathrm{net}^o_{pk}) = (1 + e^{-\mathrm{net}^o_{pk}})^{-1}

The first function defines the linear output unit. The latter function is called a sigmoid, or logistic, function; it is illustrated in Figure 3.5. The choice of output function depends on how you choose to represent the output data. For example, if you want the output units to be binary, you use a sigmoid output function, since the sigmoid is output-limiting and quasi-bistable but is also differentiable. In other cases, either a linear or a sigmoid output function is appropriate.

In the first case, f^{o\prime}_k = 1; in the second case, f^{o\prime}_k = f^o_k(1 - f^o_k) = o_{pk}(1 - o_{pk}). For these two cases, we have

w^o_{kj}(t+1) = w^o_{kj}(t) + \eta (y_{pk} - o_{pk})\, i_{pj}     (3.13)


Figure 3.5 This graph shows the characteristic S-shape of the sigmoid function.

for the linear output, and

w^o_{kj}(t+1) = w^o_{kj}(t) + \eta (y_{pk} - o_{pk})\, o_{pk}(1 - o_{pk})\, i_{pj}     (3.14)

for the sigmoidal output.

We want to summarize the weight-update equations by defining a quantity

\delta^o_{pk} = (y_{pk} - o_{pk}) f^{o\prime}_k(\mathrm{net}^o_{pk}) = \delta_{pk} f^{o\prime}_k(\mathrm{net}^o_{pk})     (3.15)

We can then write the weight-update equation as

w^o_{kj}(t+1) = w^o_{kj}(t) + \eta\, \delta^o_{pk}\, i_{pj}     (3.16)

regardless of the functional form of the output function, f^o_k.

We wish to make a comment regarding the relationship between the gradient-descent method described here and the least-squares technique. If we were trying to make the generalized delta rule entirely analogous to a least-squares method, we would not actually change any of the weight values until all of the training patterns had been presented to the network once. We would simply accumulate the changes as each pattern was processed, sum them, and make one update to the weights. We would then repeat the process until the error was acceptably low. The error that this process minimizes is

E = \sum_{p=1}^{P} E_p     (3.17)

where P is the number of patterns in the training set. In practice, we have found little advantage to this strict adherence to analogy with the least-squares


method. Moreover, you must store a large amount of information to use this method. We recommend that you perform weight updates as each training pattern is processed.

Exercise 3.1: A certain network has output nodes called quadratic neurons. The net input to such a neuron is

\mathrm{net}_k = \sum_j w_{kj} (i_j - v_{kj})^2

The output function is sigmoidal. Both w_{kj} and v_{kj} are weights, and i_j is the jth input value. Assume that the w weights and v weights are independent.

a. Determine the weight-update equations for both types of weights.
b. What is the significance of this type of node? Hint: Consider a single unit of the type described here, having two inputs and a linear-threshold output function. With what geometric figure does this unit partition the input space?

(This exercise was suggested by Gary McIntire, Loral Space Information Systems, who derived and implemented this network.)

3.2.2 Updates of Hidden-Layer Weights

We would like to repeat for the hidden layer the same type of calculation as we did for the output layer. A problem arises when we try to determine a measure of the error of the outputs of the hidden-layer units. We know what the actual output is, but we have no way of knowing in advance what the correct output should be for these units. Intuitively, the total error, E_p, must somehow be related to the output values on the hidden layer. We can verify our intuition by going back to Eq. (3.6):

E_p = \frac{1}{2} \sum_k (y_{pk} - o_{pk})^2

We know that i_{pj} depends on the weights on the hidden layer through Eqs. (3.1) and (3.2). We can exploit this fact to calculate the gradient of E_p with respect to the hidden-layer weights:

\frac{\partial E_p}{\partial w^h_{ji}} = \frac{1}{2} \sum_k \frac{\partial}{\partial w^h_{ji}} (y_{pk} - o_{pk})^2 = -\sum_k (y_{pk} - o_{pk}) \frac{\partial o_{pk}}{\partial(\mathrm{net}^o_{pk})} \frac{\partial(\mathrm{net}^o_{pk})}{\partial i_{pj}} \frac{\partial i_{pj}}{\partial(\mathrm{net}^h_{pj})} \frac{\partial(\mathrm{net}^h_{pj})}{\partial w^h_{ji}}     (3.18)


Each of the factors in Eq. (3.18) can be calculated explicitly from previous equations. The result is

\frac{\partial E_p}{\partial w^h_{ji}} = -\sum_k (y_{pk} - o_{pk}) f^{o\prime}_k(\mathrm{net}^o_{pk})\, w^o_{kj}\, f^{h\prime}_j(\mathrm{net}^h_{pj})\, x_{pi}     (3.19)

Exercise 3.2: Verify the steps between Eqs. (3.18) and (3.19).

We update the hidden-layer weights in proportion to the negative of Eq. (3.19):

\Delta_p w^h_{ji} = \eta\, f^{h\prime}_j(\mathrm{net}^h_{pj})\, x_{pi} \sum_k (y_{pk} - o_{pk}) f^{o\prime}_k(\mathrm{net}^o_{pk})\, w^o_{kj}     (3.20)


1. Apply the input vector, x_p = (x_{p1}, x_{p2}, ..., x_{pN})^t, to the input units.
2. Calculate the net-input values to the hidden-layer units:

\mathrm{net}^h_{pj} = \sum_{i=1}^{N} w^h_{ji} x_{pi} + \theta^h_j

3. Calculate the outputs from the hidden layer:

i_{pj} = f^h_j(\mathrm{net}^h_{pj})

4. Move to the output layer. Calculate the net-input values to each unit:

\mathrm{net}^o_{pk} = \sum_{j=1}^{L} w^o_{kj} i_{pj} + \theta^o_k

5. Calculate the outputs:

o_{pk} = f^o_k(\mathrm{net}^o_{pk})

6. Calculate the error terms for the output units:

\delta^o_{pk} = (y_{pk} - o_{pk}) f^{o\prime}_k(\mathrm{net}^o_{pk})

7. Calculate the error terms for the hidden units:

\delta^h_{pj} = f^{h\prime}_j(\mathrm{net}^h_{pj}) \sum_k \delta^o_{pk} w^o_{kj}

Notice that the error terms on the hidden units are calculated before the connection weights to the output-layer units have been updated.
8. Update weights on the output layer:

w^o_{kj}(t+1) = w^o_{kj}(t) + \eta\, \delta^o_{pk}\, i_{pj}

9. Update weights on the hidden layer:

w^h_{ji}(t+1) = w^h_{ji}(t) + \eta\, \delta^h_{pj}\, x_{pi}

The order of the weight updates on an individual layer is not important. Be sure to calculate the error term

E_p = \frac{1}{2} \sum_{k=1}^{M} \delta_{pk}^2

since this quantity is the measure of how well the network is learning. When the error is acceptably small for each of the training-vector pairs, training can be discontinued.
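For readers who want to see these nine steps in executable form, the following is a minimal Python sketch. It is not part of the original text; all names, the choice of sigmoid units on both layers, and the learning rate are our own, and the optional bias is folded in as a final weight on each unit.

import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_h, w_o):
    # steps 2-3: hidden-layer net inputs and outputs (last weight of each row is the bias)
    i_p = [sigmoid(sum(w * xi for w, xi in zip(row, x + [1.0]))) for row in w_h]
    # steps 4-5: output-layer net inputs and outputs
    o_p = [sigmoid(sum(w * ii for w, ii in zip(row, i_p + [1.0]))) for row in w_o]
    return i_p, o_p

def train_step(x, y, w_h, w_o, eta=0.25):
    i_p, o_p = forward(x, w_h, w_o)                     # step 1 is supplying x
    # step 6: output-layer error terms, using f'(net) = o(1 - o) for the sigmoid
    d_o = [(yk - ok) * ok * (1.0 - ok) for yk, ok in zip(y, o_p)]
    # step 7: hidden-layer error terms, computed before any weights are changed
    d_h = [ij * (1.0 - ij) * sum(dk * w_o[k][j] for k, dk in enumerate(d_o))
           for j, ij in enumerate(i_p)]
    # step 8: update output-layer weights (including the bias weight)
    for k, row in enumerate(w_o):
        for j, ij in enumerate(i_p + [1.0]):
            row[j] += eta * d_o[k] * ij
    # step 9: update hidden-layer weights (including the bias weight)
    for j, row in enumerate(w_h):
        for i, xi in enumerate(x + [1.0]):
            row[i] += eta * d_h[j] * xi
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o_p))   # E_p before the update

Repeated calls to train_step over the whole training set, until the returned E_p values are all acceptably small, correspond to repeating steps 1 through 9 as described above.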


3.3 Practical Considerations 1033.3 PRACTICAL CONSIDERATIONSThere are some topics that we omitted from previous sections so as not to divertyour attention from the ma<strong>in</strong> ideas. Before mov<strong>in</strong>g on to the discussion ofapplications <strong>and</strong> the simulator, let's pick up these loose ends.3.3.1 Tra<strong>in</strong><strong>in</strong>g DataWe promised a def<strong>in</strong>ition of the terms sufficient <strong>and</strong> properly regard<strong>in</strong>g theselection of tra<strong>in</strong><strong>in</strong>g-vector pairs for the BPN. Unfortunately, there is no s<strong>in</strong>gledef<strong>in</strong>ition that applies to all cases. As with many aspects of neural-networksystems, experience is often the best teacher. As you ga<strong>in</strong> facility with us<strong>in</strong>gnetworks, you will also ga<strong>in</strong> an appreciation for how to select <strong>and</strong> preparetra<strong>in</strong><strong>in</strong>g sets. Thus, we shall give only a few guidel<strong>in</strong>es here.In general, you can use as many data as you have available to tra<strong>in</strong> thenetwork, although you may not need to use them all. From the available tra<strong>in</strong><strong>in</strong>gdata, a small subset is often all that you need to tra<strong>in</strong> a network successfully.The rema<strong>in</strong><strong>in</strong>g data can be used to test the network to verify that the network canperform the desired mapp<strong>in</strong>g on <strong>in</strong>put vectors it has never encountered dur<strong>in</strong>gtra<strong>in</strong><strong>in</strong>g.If you are tra<strong>in</strong><strong>in</strong>g a network to perform <strong>in</strong> a noisy environment, such as thepixel-image-to-ASCII example, then <strong>in</strong>clude some noisy <strong>in</strong>put vectors <strong>in</strong> thedata set. Sometimes the addition of noise to the <strong>in</strong>put vectors dur<strong>in</strong>g tra<strong>in</strong><strong>in</strong>ghelps the network to converge even if no noise is expected on the <strong>in</strong>puts.The BPN is good at generalization. What we mean by generalization here isthat, given several different <strong>in</strong>put vectors, all belong<strong>in</strong>g to the same'class, a BPNwill learn to key off of significant similarities <strong>in</strong> the <strong>in</strong>put vectors. Irrelevantdata will be ignored. As an example, suppose we want to tra<strong>in</strong> a network todeterm<strong>in</strong>e whether a bipolar number of length 5 is even or odd. With onlya small set of examples used for tra<strong>in</strong><strong>in</strong>g, the BPN will adjust its weights sothat a classification will be made solely on the basis of the value of the leastsignificant bit <strong>in</strong> the number: The network learns to ignore the irrelevant data<strong>in</strong> the other bits.In contrast to generalization, the BPN will not extrapolate well. If a BPNis <strong>in</strong>adequately or <strong>in</strong>sufficiently tra<strong>in</strong>ed on a particular class of <strong>in</strong>put vectors,subsequent identification of members of that class may be unreliable. Make surethat the tra<strong>in</strong><strong>in</strong>g data cover the entire expected <strong>in</strong>put space. 
Dur<strong>in</strong>g the tra<strong>in</strong><strong>in</strong>gprocess, select tra<strong>in</strong><strong>in</strong>g-vector pairs r<strong>and</strong>omly from the set, if the problem lendsitself to this strategy. In any event, do not tra<strong>in</strong> the network completely with<strong>in</strong>put vectors of one class, <strong>and</strong> then switch to another class: The network willforget the orig<strong>in</strong>al tra<strong>in</strong><strong>in</strong>g.If the output function is sigmoidal, then you will have to scale the outputvalues. Because of the form of the sigmoid function, the network outputs cannever reach 0 or 1. Therefore, use values such as 0.1 <strong>and</strong> 0.9 to represent thesmallest <strong>and</strong> largest output values. You can also shift the sigmoid so that, for


104 Backpropagationexample, the limit<strong>in</strong>g values become ±0.4. Moreover, you can change the slopeof the l<strong>in</strong>ear portion of the sigmoid curve by <strong>in</strong>clud<strong>in</strong>g a multiplicative constant<strong>in</strong> the exponential. There are many such possibilities that depend largely on theproblem be<strong>in</strong>g solved.3.3.2 Network Siz<strong>in</strong>gJust how many nodes are needed to solve a particular problem? Are three layersalways sufficient? As with the questions concern<strong>in</strong>g proper tra<strong>in</strong><strong>in</strong>g data, thereare no strict answers to questions such as these. Generally, three layers aresufficient. Sometimes, however, a problem seems to be easier to solve withmore than one hidden layer. In this case, easier means that the network learnsfaster.The size of the <strong>in</strong>put layer is usually dictated by the nature of the application.You can often determ<strong>in</strong>e the number of output nodes by decid<strong>in</strong>g whether youwant analog values or b<strong>in</strong>ary values on the output units. Section 3.4 conta<strong>in</strong>stwo examples that illustrate both of these situations.Determ<strong>in</strong><strong>in</strong>g the number of units to use <strong>in</strong> the hidden layer is not usually asstraightforward as it is for the <strong>in</strong>put <strong>and</strong> output layers. The ma<strong>in</strong> idea is to useas few hidden-layer units as possible, because each unit adds to the load on theCPU dur<strong>in</strong>g simulations. Of course, <strong>in</strong> a system that is fully implemented <strong>in</strong>hardware (one processor per process<strong>in</strong>g element), additional CPU load<strong>in</strong>g is notas much of a consideration (<strong>in</strong>terprocessor communication may be a problem,however). We hesitate to offer specific guidel<strong>in</strong>es except to say that, <strong>in</strong> ourexperience, for networks of reasonable size (hundreds or thous<strong>and</strong>s of <strong>in</strong>puts),the size of the hidden layer needs to be only a relatively small fraction of thatof the <strong>in</strong>put layer. If the network fails to converge to a solution it may be thatmore hidden nodes are required. If it does converge, you might try fewer hiddennodes <strong>and</strong> settle on a size on the basis of overall system performance.It is also possible to remove hidden units that are superfluous. If youexam<strong>in</strong>e the weight values on the hidden nodes periodically as the networktra<strong>in</strong>s, you will see that weights on certa<strong>in</strong> nodes change very little from theirstart<strong>in</strong>g values. These nodes may not be participat<strong>in</strong>g <strong>in</strong> the learn<strong>in</strong>g process, <strong>and</strong>fewer hidden units may suffice. There is also an automatic method, developedby Rumelhart, for prun<strong>in</strong>g unneeded nodes from the network. 13.3.3 Weights <strong>and</strong> Learn<strong>in</strong>g ParametersWeights should be <strong>in</strong>itialized to small, r<strong>and</strong>om values—say between ±0.5—asshould the bias terms, #,,, that appear <strong>in</strong> the equations for the net <strong>in</strong>put to aunit. 
It is common practice to treat this bias value as another weight, which is

¹Unscheduled talk given at the Second International Conference on Neural Networks, San Diego, June 1988.


connected to a fictitious unit that always has an output of 1. To see how this scheme works, recall Eq. (3.3):

\mathrm{net}^o_{pk} = \sum_{j=1}^{L} w^o_{kj} i_{pj} + \theta^o_k

By making the definitions θ^o_k = w^o_{k(L+1)} and i_{p(L+1)} = 1, we can write

\mathrm{net}^o_{pk} = \sum_{j=1}^{L+1} w^o_{kj} i_{pj}

So θ^o_k is treated just like a weight, and it participates in the learning process as a weight. Another possibility is simply to remove the bias terms altogether; their use is optional.

Selection of a value for the learning-rate parameter, η, has a significant effect on the network performance. Usually, η must be a small number, on the order of 0.05 to 0.25, to ensure that the network will settle to a solution. A small value of η means that the network will have to make a large number of iterations, but that is the price to be paid. It is often possible to increase the size of η as learning proceeds. Increasing η as the network error decreases will often help to speed convergence by increasing the step size as the error approaches a minimum, but the network may bounce around too far from the actual minimum value if η gets too large.

Another way to increase the speed of convergence is to use a technique called momentum. When calculating the weight-change value, Δ_p w, we add a fraction of the previous change. This additional term tends to keep the weight changes going in the same direction; hence the term momentum. The weight-change equations on the output layer then become

w^o_{kj}(t+1) = w^o_{kj}(t) + \eta\, \delta^o_{pk}\, i_{pj} + \alpha \Delta_p w^o_{kj}(t-1)     (3.24)

with a similar equation on the hidden layer. In Eq. (3.24), α is the momentum parameter, and it is usually set to a positive value less than 1. The use of the momentum term is also optional.

A final topic concerns the possibility of converging to a local minimum in weight space. Figure 3.6 illustrates the idea. Once a network settles on a minimum, whether local or global, learning ceases. If a local minimum is reached, the error at the network outputs may still be unacceptably high. Fortunately, this problem does not appear to cause much difficulty in practice. If a network stops learning before reaching an acceptable solution, a change in the number of hidden nodes or in the learning parameters will often fix the problem; or we can simply start over with a different set of initial weights. When a network reaches an acceptable solution, there is no guarantee that it has reached the global minimum rather than a local one. If the solution is acceptable from an error standpoint, it does not matter whether the minimum is global or local, or even whether the training was halted at some point before a true minimum was reached.
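The momentum term of Eq. (3.24) amounts to keeping the previous weight change around and adding a scaled copy of it to the new one. The short Python sketch below is not part of the original text; its names and the particular values of eta and alpha are our own illustrative assumptions.

def momentum_update(weights, delta, inputs, prev_change, eta=0.25, alpha=0.9):
    # weights and prev_change are parallel lists for one unit; delta is its error term
    new_change = [eta * delta * x + alpha * pc for x, pc in zip(inputs, prev_change)]
    new_weights = [w + ch for w, ch in zip(weights, new_change)]
    return new_weights, new_change     # keep new_change around for the next pattern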


Figure 3.6 This graph shows a cross-section of a hypothetical error surface in weight space (the error E_p plotted against a weight w). The point z_min is called the global minimum. Notice, however, that there are other minimum points, z_1 and z_2. A gradient-descent search for the global minimum might accidentally find one of these local minima instead of the global minimum.

Exercise 3.5: Consider a three-layer BPN with all weights initialized to the same value on every unit. Prove that this network will never be able to learn anything. Interpret this result in terms of the error surface in weight space.

3.4 BPN APPLICATIONS

The BPN is a versatile tool that is readily applied to a number of diverse problems. To a large extent, its versatility is due to the general nature of the network learning process. As we discussed in the previous section, there are only two equations needed to backpropagate error signals within the network; which of the two is used depends on whether the processing unit receiving the error signal contributes directly to the output. Those units that do not connect directly to the output use the same error-propagation mechanism regardless of where they are in the network structure.

The generality offered by this common process allows the arrangement and connectivity of individual units within the network to vary dramatically. Similarly, because of the variety of network structures that can be created and trained successfully using the backpropagation algorithms, this network-learning technique can be applied to many different kinds of problems. In the remainder of this section, we will describe two such applications, selected to illustrate the diversity of the BPN network architecture.


3.4.1 Data Compression

As our first example, let's consider the common problem of data compression. Specifically, we would like to find a way to reduce the data needed to encode and reproduce accurately a moderately high-resolution video image, so that we might transmit these images over low- to medium-bandwidth communication equipment. Although there are many algorithmic approaches to performing data compression, most of these are designed to deal with static data, such as ASCII text, or with display images that are fairly consistent, such as computer graphics. Because video data rarely contain regular, well-defined forms (and even less frequently contain empty space), video data compression is a difficult problem from an algorithmic viewpoint.

Conversely, as originally described in [1], a neural-network approach is ideal for a video data-reduction application, because a BPN can be trained easily to map a set of patterns from an n-dimensional space to an m-dimensional space. Since any video image can be thought of as a matrix of picture elements (pixels), it naturally follows that the image can also be conceptualized as a vector in n-space. If we limit the video to be encoded to monochromatic images, each image can be represented as a vector of elements, each element representing the gray-scale value of a single pixel (0 through 255).

Network Architecture for Data Compression. The first step in solving this problem is to find a way to structure our network so that it will perform the desired data compression. We would like to select a network architecture that provides a reasonable data-reduction factor (say, four-to-one), while still enabling us to recover a close approximation of the original image from the encoded form. The network illustrated in Figure 3.7 will satisfy both of these requirements.

At first glance, it may seem unusual that the proposed network has a one-to-one correspondence between input and output units. After all, did we not indicate that data compression was the desired objective? On further investigation, the strategy implied by the network architecture becomes apparent: since there are fewer hidden units than input units, the hidden layer must represent the compressed form of the data. This is exactly the plan of attack. By providing an image vector as the input stimulation pattern, the network will propagate the input through the hidden units to the output. Since the hidden layer contains only one-quarter as many processing units as the input layer, the output values produced by the hidden-layer units can be thought of as the encoded form of the input.
Furthermore, by propagating the output of the hidden-layer units forward to the output layer, we have implemented a mechanism for reconstructing the original image from the encoded form, as well as for training the network.

During training, the network will be shown examples of random pixel vectors taken from representative video images. Each vector will be used as both


the input to the network and the target output. Using the backpropagation process, the network will develop the internal weight coding so that the image is compressed into one-quarter of its original size at the outputs of the hidden units. If we then read out the values produced by the hidden-layer units in our network and transmit those values to our receiving station, we can reconstruct the original image by propagating the compressed image to the output units in an identical network. Such a system is depicted in Figure 3.8.

Figure 3.7  This BPN will do four-to-one data compression.

Network Sizing. There are two problems remaining to be solved for this application: the first is the network-sizing problem, and the second is the generation of the sample data sets needed to train the network. We will address the network-sizing aspect first.

It is unrealistic to expect to create a network that will contain an input unit for every pixel in a single video image. Even if we restricted ourselves to the relatively low resolution of the 525-line scan rate specified by the National Television System Committee (NTSC) for commercial television, our network would have to have 336,000 input units (525 lines × 640 pixels). Moreover, the entire network would contain roughly 750,000 processing units (336,000 + 336,000/4 + 336,000) and 50 billion connections. As we have mentioned in earlier chapters, simulating a network containing a large number of units and a vast number


Figure 3.8  In this example, the output activity pattern from the second layer of units is transferred over the transmission medium (a cable) to a receiving station, where it is applied as the output of another layer of units that forms the top half of the three-layer network. The receiving network then reconstructs the transmitted image from the compressed form, using the inverse mapping function contained in the connection weights to the top half of the network.

of connections on anything less than a dedicated supercomputer is much too time-consuming to be considered practical.

Our sizing strategy is therefore somewhat less ambitious; we will restrict the size of our input and output spaces to 64 pixels. Thus, the hidden layer will contain 16 units. Although this might appear to be a significant compromise over the full-image network, the smaller network offers two practical benefits over its larger counterpart:

* It is easy to simulate on small computers.
* It makes obtaining training data for the smaller network easier (see the sketch below).
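Training vectors for this 64-input network come from cutting real frames into 8-by-8 patches, as described below, and the hidden-layer outputs of the trained network serve as the compressed code. The following Python/NumPy sketch is purely illustrative (the helper function and the random stand-in weights are our assumptions, not the original implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def image_to_blocks(frame):
    """Split a 2-D gray-scale frame into non-overlapping 8x8 blocks and
    return each block as a 64-element vector scaled to [0, 1]."""
    rows, cols = frame.shape
    blocks = []
    for r in range(0, rows - rows % 8, 8):
        for c in range(0, cols - cols % 8, 8):
            blocks.append(frame[r:r + 8, c:c + 8].ravel() / 255.0)
    return np.array(blocks)          # several thousand training vectors per full frame

# stand-in weights for the trained 64-16-64 network described in the text
rng = np.random.default_rng(0)
w_hidden = rng.normal(scale=0.1, size=(16, 64))   # input -> hidden (the encoder)
w_output = rng.normal(scale=0.1, size=(64, 16))   # hidden -> output (the decoder)

block = rng.uniform(size=64)                      # one 8x8 patch as a pixel vector
code = sigmoid(w_hidden @ block)                  # 16 values: the compressed form
reconstruction = sigmoid(w_output @ code)         # 64 values: the reconstructed patch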


Training the Network. By now, it is probably evident why the smaller network is easier to simulate. Why obtaining training data for the smaller network is easier is probably not as obvious. If we consider the nature of the application the network is attempting to address, we can see that the network is trying to learn a mapping from n-space to n/4-space and the inverse mapping back to n-space. Since the number of possible permutations of the input pattern is significantly smaller in 64 dimensions than it is in 336,000 dimensions, it follows that far fewer random training patterns are needed for the network to learn to reproduce the input in 64-space.² It is also easier to generate random images for training the network by using 64 inputs, because a single complete video image can be subdivided into about 5000 8 × 8 pixel matrices, each of which can be used to train the network.

Based on these observations, our approach of downsizing the network has solved both of the remaining issues. We now have a network that is easy to manage and that offers a means of acquiring the training data sets from readily obtainable sources.

Exercise 3.6: Assume that the system described in this section has been built using a BPN simulator on a single-processor computer. What are the processing requirements to allow entire video images to be sent at standard frame rates (30 Hz interlaced, 60 Hz noninterlaced)?

3.4.2 Paint-Quality Inspection

Visual inspection of painted surfaces, such as automobile body panels, is currently a very time-consuming and labor-intensive process. To reduce the amount of time required to perform this inspection, one of the major U.S. automobile manufacturers reflects a laser beam off the painted panel and onto a projection screen. Since the light source is a coherent beam, the amount of scatter observed in the reflected image of the laser provides an indication of the quality of the paint finish on the car. A poor paint job is one that contains ripples, looks like "orange peel," or lacks shine. A laser beam reflected off a panel with a poor finish will therefore be relatively diffuse. Conversely, a good-quality paint finish will be relatively smooth and will exhibit a bright luster. A laser light reflected off a high-quality paint finish will appear to an observer as very close to uniform throughout its image.
Figure 3.9 illustrates the kind of differences typically observed as a result of performing this test.

We have now seen how it might be possible to automate the quality inspection of a painted surface. However, our system design has presumed that

² Encoding all of the patterns in 64-space is obviously not feasible. Practically, the best encoding we can hope for using an ANS is an output mapping that resembles the desired output within some margin of error.


Figure 3.9  The scatter typically observed when a laser beam is reflected off painted sheet-metal surfaces. (a) A poor-quality paint finish. (b) A better-quality paint finish.

there is an "observer" present to assess the paint quality by performing a visual inspection of the reflected laser image. In the past, this part of the inspection process would have been performed primarily by humans, because conventional computer-programming techniques that could be used to automate the "observation" and scoring process suffered from a lack of flexibility and were not particularly robust.³ To illustrate why an algorithmic analysis of the reflected laser image might be considered inflexible, consider that such a program would have to examine every pixel in the input image, correlate features of each pixel (such as brightness) with those observed in a multitude of neighboring pixels, and assess the coherency of the image as a whole. Small, localized perturbations in the image might represent relatively minor problems, such as a fingerprint on the paint panel. The complexity of such a program makes it difficult to modify.

By using a BPN to perform the quality-scoring application, we have constructed a system that captures the expertise of the human inspectors, and is relatively easy to maintain and update. To improve the performance of the system, we have coupled algorithmic techniques to simplify the problem, illustrating once again that difficult problems are much easier to solve when we can work with a complete set of tools. We shall now describe the system we developed to address this application.

³ An algorithmic solution to this problem does exist, and has been successfully applied to the problem described. However, the amount of time (and hence money) needed to maintain and update that system, should the need arise, probably would be prohibitive.


Automatic Paint QA System Concept. To automate the paint inspection process, a video system was easily substituted for the human visual system. However, we were then faced with the problem of trying to create a BPN to examine and score the paint quality given the video input. To accomplish the examination, we constructed the system illustrated in Figure 3.10. The input video image was run through a video frame-grabber to record a snapshot of the reflected laser image. This snapshot contained an image 400 by 75 pixels in size, each pixel stored as one of 256 values representing its intensity. To keep the size of the network needed to solve the problem manageable, we elected to take 10 sample images from the snapshot, each sample consisting of a 30-by-30-pixel square centered on a region of the image with the brightest intensity. This approach allowed us to reduce the input size of the BPN to 900 units (down from the 30,000 units that would have been required to process the entire image). The desired output was to be a numerical score in the range of 1 through 20 (a 1 represented the best possible paint finish; a 20 represented the worst). To produce that type of score, we constructed the BPN with one output unit, with that unit producing a linear output that was interpreted as the scaled paint score. Internally, 50 sigmoidal units were used on a single hidden layer. In addition, the input and hidden layers each contained threshold ([9]) units used to bias the units on the hidden and output layers, respectively.

Once the network was constructed (and trained), 10 sample images were taken from the snapshot using two different sampling techniques. In the first test, the samples were selected randomly from the image (in the sense that their position on the beam image was random); in the second test, 10 sequential samples were taken, so as to ensure that the entire beam was examined.⁴ In both cases, the input sample was propagated through the trained BPN, and the score produced as output by the network was averaged across the 10 trials. The average score, as well as the range of scores produced, was then provided to the user for comparison and interpretation.

Training the Paint QA Network. At the time of the development of this application, this network was significantly larger than any other network we had yet trained. Consider the size of the network used: 901 inputs, 51 hiddens, 1 output, producing a network with 45,101 connections, each modeled as a floating-point number.
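The 30-by-30 sampling step might be sketched as follows. This is purely illustrative Python; the function, its arguments, and the brightest-region heuristic are our assumptions, not the authors' implementation:

import numpy as np

def sample_windows(image, n_samples=10, win=30, rng=None):
    """Cut n_samples win-by-win patches from a 2-D gray-scale image,
    each centered near a bright region of the beam, as described in the text."""
    rng = rng or np.random.default_rng()
    rows, cols = image.shape                 # e.g., 75 x 400 for the laser snapshot
    half = win // 2
    samples = []
    for _ in range(n_samples):
        # pick a candidate center at random, then snap it toward the brightest
        # pixel in its neighborhood so the window covers part of the beam
        r = int(rng.integers(half, rows - half))
        c = int(rng.integers(half, cols - half))
        neigh = image[r - half:r + half, c - half:c + half]
        dr, dc = np.unravel_index(np.argmax(neigh), neigh.shape)
        r = int(np.clip(r - half + dr, half, rows - half))
        c = int(np.clip(c - half + dc, half, cols - half))
        samples.append(image[r - half:r + half, c - half:c + half].ravel() / 255.0)
    return np.array(samples)                 # shape (n_samples, win*win), e.g. (10, 900)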
Similarly, the unit output values were modeled as floating-point numbers, since each element in the input vector represented a pixel intensity value (scaled between 0 and 1), and the network output unit was linear.

The number of training patterns with which we had to work was a function of the number of control paint panels to which we had access (18), as well as of the number of sample images we needed from each panel to acquire a relatively complete training set (approximately 6600 images per panel). During training,

⁴ Results of the tests were consistent with scores assessed for the same paint panels by the human experts, within a relatively minor error range, regardless of the sample-selection technique used.
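As a back-of-envelope check of the training cost discussed below (our arithmetic, using the 901-50-1 architecture and the per-connection operation counts quoted in the text):

connections = 901 * 50 + 51 * 1         # 45,101 weights in the paint-QA network
flops_per_pattern = 8 * connections     # ~2 FLOPs/connection forward + ~6 backward
print(connections, flops_per_pattern)   # 45101, 360808 -- roughly 360,000 per pattern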


Figure 3.10  The BPN system constructed to perform paint-quality assessment. (Block diagram: frame grabber → 400 × 75 video image → input-selection algorithm → 30 × 30 pixel image → backpropagation network → user interface.) In this example, the BPN was merely a software simulation of the network described in the text. Inputs were provided to the network through an array structure located in system memory by a pointer argument supplied as input to the simulation routine.

the samples were presented to the network randomly to ensure that no single paint panel dominated the training.

From these numbers, we can see that there was a great deal of computer time consumed during the training process. For example, one training epoch (a single training pass through all training patterns) required the host computer to perform approximately 13.5 million connection updates, which translates into roughly 360,000 floating-point operations (FLOPs) per pattern (2 FLOPs per connection during forward propagation, 6 FLOPs during error propagation), or 108 million FLOPs per epoch. You can now understand why we have emphasized efficiency in our simulator design.

Exercise 3.7: Estimate the number of floating-point operations required to simulate a BPN that used the entire 400-by-75-pixel image as input. Assume 50 hidden-layer units and one output unit, with threshold units on the input and hidden layers as described previously.

We performed the network training for this application on a dedicated LISP computer workstation. It required almost 2 weeks of uninterrupted computation


for the network to converge on that machine. However, once the network was trained, we ported the paint QA application to an 80386-based desktop computer by simply transferring the network connection weights to a disk file and copying the file onto the disk on the desktop machine. Then, for demonstration and later paint QA applications, the network was utilized in a production mode only. The dual-phased nature of the BPN allowed the latter to be employed in a relatively low-cost delivery system, without loss of any of the benefits associated with a neural-network solution as compared to traditional software techniques.

3.5 THE BACKPROPAGATION SIMULATOR

In this section, we shall describe the adaptations to the general-purpose neural simulator presented in Chapter 1, and shall present the detailed algorithms needed to implement a BPN simulator. We shall begin with a brief review of the general signal- and error-propagation process through the BPN, then shall relate that process to the design of the simulator program.

3.5.1 Review of Signal Propagation

In a BPN, signals flow bidirectionally, but in only one direction at a time. During training, there are two types of signals present in the network: during the first half-cycle, modulated output signals flow from input to output; during the second half-cycle, error signals flow from the output layer to the input layer. In the production mode, only the feedforward, modulated output signal is utilized.

Several assumptions have been incorporated into the design of this simulator. First, the output function on all hidden- and output-layer units is assumed to be the sigmoid function. This assumption is also implicit in the pseudocode for calculating error terms for each unit. In addition, we have included the momentum term in the weight-update calculations. These assumptions imply the need to store weight updates at one iteration, for use on the next iteration. Finally, bias values have not been included in the calculations. The addition of these is left as an exercise at the end of the chapter.

In this network model, the input units are fan-out processors only. That is, the units in the input layer perform no data conversion on the network input pattern. They simply act to hold the components of the input vector within the network structure. Thus, the training process begins when an externally provided input pattern is applied to the input layer of units. Forward signal propagation then occurs according to the following sequence of activities:

1. Locate the first processing unit in the layer immediately above the current layer.
2. Set the current input total to zero.
3. Compute the product of the first input connection weight and the output from the transmitting unit.


4. Add that product to the cumulative total.
5. Repeat steps 3 and 4 for each input connection.
6. Compute the output value for this unit by applying the output function f(x) = 1/(1 + e^{-x}), where x is the cumulative input total.
7. Repeat steps 2 through 6 for each unit in this layer.
8. Repeat steps 1 through 7 for each layer in the network.

Once an output value has been calculated for every unit in the network, the values computed for the units in the output layer are compared to the desired output pattern, element by element. At each output unit, an error value is calculated. These error terms are then fed back to all other units in the network structure through the following sequence of steps:

1. Locate the first processing unit in the layer immediately below the output layer.
2. Set the current error total to zero.
3. Compute the product of the first output connection weight and the error provided by the unit in the upper layer.
4. Add that product to the cumulative error.
5. Repeat steps 3 and 4 for each output connection.
6. Multiply the cumulative error by o(1 − o), where o is the output value of the hidden-layer unit produced during the feedforward operation.
7. Repeat steps 2 through 6 for each unit on this layer.
8. Repeat steps 1 through 7 for each layer.
9. Locate the first processing unit in the layer above the input layer.
10. Compute the weight-change value for the first input connection to this unit by multiplying a fraction (the learning rate) of the cumulative error at this unit by the input value to this unit.
11. Modify the weight-change term by adding a momentum term equal to a fraction of the weight-change value from the previous iteration.
12. Save the new weight-change value as the old weight-change value for this connection.
13. Change the connection weight by adding the new connection-weight-change value to the old connection weight.
14. Repeat steps 10 through 13 for each input connection to this unit.
15. Repeat steps 10 through 14 for each unit in this layer.
16. Repeat steps 10 through 15 for each layer in the network.

(A compact vectorized rendering of these two passes appears in the sketch below.)
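The step lists above collapse into a few matrix operations. The following Python/NumPy sketch is purely illustrative (a three-layer network, sigmoid outputs, no bias terms, momentum included); it is ours, not the book's pseudocode:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, w_h, w_o, dw_h_prev, dw_o_prev, eta=0.1, alpha=0.9):
    """One forward/backward pass for a three-layer BPN, following the steps above."""
    # forward propagation
    hidden = sigmoid(w_h @ x)                    # hidden-layer outputs
    output = sigmoid(w_o @ hidden)               # output-layer outputs

    # error terms: output layer first, then backpropagated to the hidden layer
    delta_o = output * (1.0 - output) * (target - output)
    delta_h = hidden * (1.0 - hidden) * (w_o.T @ delta_o)

    # weight changes with momentum, saved for the next iteration
    dw_o = eta * np.outer(delta_o, hidden) + alpha * dw_o_prev
    dw_h = eta * np.outer(delta_h, x) + alpha * dw_h_prev
    return w_h + dw_h, w_o + dw_o, dw_h, dw_o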


3.5.2 BPN Special Considerations

In Chapter 1, we emphasized that our simulator was designed to optimize the signal-propagation process through the network by organizing the input connections to each unit as linear sequential arrays. Thus, it becomes possible to perform the input sum-of-products calculation in a relatively straightforward manner. We simply step through the appropriate connection and unit output arrays, summing products as we go. Unfortunately, this structure does not lend itself easily to the backpropagation of errors that must be performed by this network.

To understand why there is a problem, consider that the output connections from each unit are being used to sum the error products during the learning process. Thus, we must jump between arrays to access output-connection values that are contained in the input-connection arrays of the units above, rather than stepping through arrays as we did during the forward-propagation phase. Because the computer must now explicitly compute where to find the next connection value, error propagation is much less efficient, and, hence, training is significantly slower than is production-mode operation.

3.5.3 BPN Data Structures

We begin our discussion of the BPN simulator with a presentation of the backpropagation network data structures that we will require. Although the BPN is similar in structure to the Madaline network described in Chapter 2, it is also different in that it requires the use of several additional parameters that must be stored on a connection or network-unit basis. Based on our knowledge of how the BPN operates, we shall now propose a record of data that will define the top-level structure of the BPN simulator:

record BPN =
    INUNITS  : ^layer;      {locate input layer}
    OUTUNITS : ^layer;      {locate output units}
    LAYERS   : ^layer[];    {dynamically sized network}
    alpha    : float;       {the momentum term}
    eta      : float;       {the learning rate}
end record;

Figure 3.11 illustrates the relationship between the network record and all subordinate structures, which we shall now discuss. As we complete our discussion of the data structures, you should refer to Figure 3.11 to clarify some of the more subtle points.

Inspection of the BPN record structure reveals that this structure is designed to allow us to create networks containing more than just three layers of units. In practice, BPNs that require more than three layers to solve a problem are not prevalent. However, there are several examples cited in the literature referenced at the end of this chapter where multilayer BPNs were utilized, so we


Figure 3.11  The BPN data structure is shown without the arrays for the error and last_delta terms for clarity. As before, the network is defined by a record containing pointers to the subordinate structures, as well as network-specific parameters. In this diagram, only three layers are illustrated, although many more hidden layers could be added by simple extension of the layer_ptr array.

have included the capability to construct networks of this type in our simulator design.

It is obvious that the BPN record contains the information that is of global interest to the units in the network: specifically, the alpha (α) and eta (η) terms. However, we must now define the layer structure that we will use to construct the remainder of the network, since it is the basis for locating all information used to define the units on each layer. To define the layer structure, we must remember that the BPN has two different types of operation, and that different information is needed in each phase. Thus, the layer structure contains pointers to two different sets of arrays: one set used during forward propagation, and one set used during error propagation. Armed with this understanding, we can now define the layer structure for the BPN:

record layer =
    outputs    : ^float[];     {locate output array}
    weights    : ^^float[];    {locate connection array(s)}
    errors     : ^float[];     {locate error terms for layer}
    last_delta : ^^float[];    {locate previous delta terms}
end record;

During the forward-propagation phase, the network will use the information contained in the outputs and weights arrays, just as we saw in the design


of the Adaline simulator. However, during the backpropagation phase, the BPN requires access to an array of error terms (one for each of the units on the layer) and to the list of change parameters used during the previous learning pass (stored on a connection basis). By combining the access mechanisms to all these terms in the layer structure, we can continue to keep processing efficient, at least during the forward-propagation phase, as our data structures will be exactly as described in Chapter 1. Unfortunately, activity during the backpropagation phase will be inefficient, because we will be accessing different arrays rather than accessing sequential locations within the arrays. However, we will have to live with the inefficiency incurred here, since we have elected to model the network as a set of arrays.

3.5.4 Forward Signal-Propagation Algorithms

The following four algorithms will implement the feedforward signal-propagation process in our network simulator model. They are presented in a bottom-up fashion, meaning that each is defined before it is used.

The first procedure will serve as the interface routine between the host computer and the BPN simulation. It assumes that the user has defined an array of floating-point numbers that indicates the pattern to be applied to the network as inputs.

procedure set_inputs (INPUTS, NET_IN : ^float[])
{copy the input values into the net input layer}
var
    temp1 : ^float[];    {a local pointer}
    temp2 : ^float[];    {a local pointer}
    i     : integer;     {iteration counter}
begin
    temp1 = NET_IN;      {locate net input layer}
    temp2 = INPUTS;      {locate input values}

    for i = 1 to length(NET_IN) do    {for all input values, do}
        temp1[i] = temp2[i];          {copy input to net input}
    end do;
end;

The next routine performs the forward signal propagation between any two layers, located by the pointer values passed into the routine. This routine embodies the calculations done in Eqs. (3.1) and (3.2) for the hidden layer, and in Eqs. (3.3) and (3.4) for the output layer.


procedure propagate_layer (LOWER, UPPER : ^layer)
{propagate signals from the lower to the upper layer}
var
    inputs   : ^float[];    {size of input layer}
    current  : ^float[];    {size of current layer}
    connects : ^float[];    {step through inputs}
    sum      : real;        {accumulate products}
    i, j     : integer;     {iteration counters}
begin
    inputs  = LOWER^.outputs;    {locate lower layer}
    current = UPPER^.outputs;    {locate upper layer}

    for i = 1 to length(current) do           {for all units in layer}
        sum = 0;                              {reset accumulator}
        connects = UPPER^.weights^[i];        {find start of weight array}

        for j = 1 to length(inputs) do        {for all inputs to unit}
            sum = sum + inputs[j] * connects[j];    {accumulate products}
        end do;

        current[i] = 1.0 / (1.0 + exp(-sum));       {generate output}
    end do;
end;

The next procedure performs the forward signal propagation for the entire network. It assumes the input layer contains a valid input pattern, placed there by a higher-level call to set_inputs.

procedure propagate_forward (NET : BPN)
{perform the forward signal propagation for net}
var
    upper : ^layer;     {pointer to upper layer}
    lower : ^layer;     {pointer to lower layer}
    i     : integer;    {layer counter}
begin
    for i = 1 to length(NET.layers) - 1 do    {for all adjacent layer pairs}
        lower = NET.layers[i];                {get pointer to lower layer}
        upper = NET.layers[i+1];              {get pointer to next layer}
        propagate_layer (lower, upper);       {propagate forward}
    end do;
end;
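As a design note, the nested loops in propagate_layer compute a matrix-vector product followed by the sigmoid, so in an array language the whole procedure collapses to one line. An illustrative NumPy equivalent (ours, not part of the simulator):

import numpy as np

def propagate_layer(weights, lower_outputs):
    # weights[i][j] is the connection from lower-layer unit j to upper-layer unit i,
    # matching the per-unit connection arrays described above
    return 1.0 / (1.0 + np.exp(-(weights @ lower_outputs)))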


The final routine needed for forward propagation will extract the output values generated by the network and copy them into an external array specified by the calling program. This routine is the complement of the set_inputs routine described earlier.

procedure get_outputs (NET_OUTS, OUTPUTS : ^float[])
{copy the net output values into the outputs specified}
var
    temp1 : ^float[];    {a local pointer}
    temp2 : ^float[];    {a local pointer}
    i     : integer;     {iteration counter}
begin
    temp1 = NET_OUTS;    {locate net output layer}
    temp2 = OUTPUTS;     {locate output values array}

    for i = 1 to length(NET_OUTS) do    {for all outputs, do}
        temp2[i] = temp1[i];            {copy net output to output array}
    end do;
end;

3.5.5 Error-Propagation Routines

The backward propagation of error terms is similar to the forward propagation of signals. The major difference here is that error signals, once computed, are backpropagated through the output connections from a unit, rather than through its input connections.

If we allow an extra array to contain the error terms associated with each unit within a layer, similar to our data structure for unit outputs, the error-propagation procedure can be accomplished in three routines. The first will compute the error term for each unit on the output layer. The second will backpropagate errors from a layer with known errors to the layer immediately below. The third will use the error term at any unit to update the output connection values from that unit.

The pseudocode designs for these routines are as follows. The first calculates the values of \delta^o_{pk} on the output layer, according to Eq. (3.15).

procedure compute_output_error (NET : BPN; TARGET : ^float[])
{compare output to target, update errors accordingly}
var
    errors  : ^float[];    {used to store error values}
    outputs : ^float[];    {access to network outputs}
    i       : integer;     {iteration counter}
begin
    errors = NET.OUTUNITS^.errors;      {find error array}


    outputs = NET.OUTUNITS^.outputs;    {get pointer to unit outputs}

    for i = 1 to length(outputs) do     {for all output units}
        errors[i] = outputs[i] * (1 - outputs[i]) * (TARGET[i] - outputs[i]);
    end do;
end;

In the backpropagation network, the terms η and α will be used globally to govern the update of all connections. For that reason, we have extended the network record to include these parameters. We will refer to these values as "eta" and "alpha," respectively. We now provide an algorithm for backpropagating the error term to any unit below the output layer in the network structure. This routine calculates \delta^h_{pj} for hidden-layer units according to Eq. (3.22).

procedure backpropagate_error (UPPER, LOWER : ^layer)
{backpropagate errors from an upper to a lower layer}
var
    senders   : ^float[];    {source errors}
    receivers : ^float[];    {receiving errors}
    connects  : ^float[];    {pointer to connection arrays}
    unit      : float;       {unit output value}
    i, j      : integer;     {indices}
begin
    senders   = UPPER^.errors;    {known errors}
    receivers = LOWER^.errors;    {errors to be computed}

    for i = 1 to length(receivers) do      {for all receiving units}
        receivers[i] = 0;                  {init error accumulator}

        for j = 1 to length(senders) do    {for all sending units}
            connects = UPPER^.weights^[j];           {locate connection array}
            receivers[i] = receivers[i] + senders[j] * connects[i];
        end do;

        unit = LOWER^.outputs[i];          {get unit output}
        receivers[i] = receivers[i] * unit * (1 - unit);
    end do;
end;

Finally, we must now step through the network structure once more to adjust the connection weights. We move from the input layer to the output layer.


Here again, to improve performance, we process only input connections, so our simulator can once more step through sequential arrays, rather than jumping from array to array as we had to do in the backpropagate_error procedure. This routine incorporates the momentum term discussed in Section 3.3. Specifically, alpha is the momentum parameter, and delta refers to the weight-change values; see Eq. (3.24).

procedure adjust_weights (NET : BPN)
{update all connection weights based on new error values}
var
    current : ^layer;      {access layer data record}
    inputs  : ^float[];    {array of input values}
    units   : ^float[];    {access units in layer}
    weights : ^float[];    {connections to unit}
    delta   : ^float[];    {pointer to delta arrays}
    error   : ^float[];    {pointer to error arrays}
    i, j, k : integer;     {iteration indices}
begin
    for i = 2 to length(NET.layers) do       {starting at first computed layer}
        current = NET.layers[i];             {get pointer to layer}
        units   = NET.layers[i]^.outputs;    {step through units}
        inputs  = NET.layers[i-1]^.outputs;  {access input array}

        for j = 1 to length(units) do            {for all units in layer}
            weights = current^.weights^[j];      {find input connections}
            delta   = current^.last_delta^[j];   {locate previous delta terms}
            error   = NET.layers[i]^.errors;     {access unit errors}

            for k = 1 to length(weights) do      {for all connections}
                delta[k] = (inputs[k] * NET.eta * error[j])
                           + (NET.alpha * delta[k]);    {new change; save for next pass}
                weights[k] = weights[k] + delta[k];     {update connection weight}
            end do;
        end do;
    end do;
end;


3.5.6 The Complete BPN Simulator

We have now implemented the algorithms needed to perform the backpropagation function. All that remains is to implement a top-level routine that calls our signal-propagation procedures in the correct sequence to allow the simulator to be used. For production-mode operation after training, this routine would take the following general form:

begin
    call set_inputs to stimulate the network with an input.
    call propagate_forward to generate an output.
    call get_outputs to examine the output generated.
end

During training, the routine would be extended to this form:

begin
    while network error is larger than some predefined limit do
        call set_inputs to apply a training input.
        call propagate_forward to generate an output.
        call compute_output_error to determine errors.
        call backpropagate_error to update error values.
        call adjust_weights to modify the network.
    end do
end.

Programming Exercises

3.1. Implement the backpropagation network simulator using the pseudocode examples provided. Test the network by training it to solve the character-recognition problem described in Section 3.1. Use a 5-by-7-character matrix as input, and train the network to recognize all 36 alphanumeric characters (26 uppercase letters and 10 digits). Describe the network's tolerance to noisy inputs after training is complete.

3.2. Modify the BPN simulator developed in Programming Exercise 3.1 to implement linear units in the output layer only. Rerun the character-recognition example, and compare the network response with the results obtained in Programming Exercise 3.1. Be sure to compare both the training and the production behaviors of the networks.

3.3. Using the XOR problem described in Chapter 1, determine how many hidden units are needed by a sigmoidal, three-layer BPN to learn the four conditions completely.

3.4. The BPN simulator adjusts its internal connection weights after every training pattern. Modify the simulator design to implement true steepest descent by adjusting weights only after all training patterns have been examined. Test


your modifications on the XOR problem set from Chapter 1, and on the character-identification problem described in this chapter.

3.5. Modify your BPN simulator to incorporate the bias terms. Follow the suggestion in Section 3.3 and consider the bias terms to be weights connected to a fictitious unit that always has an output of 1. Train the network using the character-recognition example. Note any differences in the training or performance of the network when compared to those of earlier implementations.

Suggested Readings

Both Chapter 8 of PDP [7] and Chapter 5 of the PDP Handbook [6] contain discussions of backpropagation and of the generalized delta rule. They are good supplements to the material in this chapter. The books by Wasserman [10] and Hecht-Nielsen [4] also contain treatments of the backpropagation algorithm. Early accounts of the algorithm can be found in the report by Parker [8] and the thesis by Werbos [11].

Cottrell and colleagues [1] describe the image-compression technique discussed in Section 3.4 of this chapter. Gorman and Sejnowski [3] have used backpropagation to classify SONAR signals. This article is particularly interesting for its analysis of the weights on the hidden units in their network. A famous demonstration system that uses a backpropagation network is Terry Sejnowski's NETtalk [9]. In this system, a neural network replaces a conventional system that translates ASCII text into phonemes for eventual speech production. Audio tapes of the system while it is learning are reminiscent of the behavior patterns seen in human children while they are learning to talk. An example of a commercial visual-inspection system is given in the paper by Glover [2].

Because the backpropagation algorithm is so expensive computationally, people have made numerous attempts to speed convergence. Many of these attempts are documented in the various proceedings of IEEE/INNS conferences. We hesitate to recommend any particular method, since we have not yet found one that results in a network as capable as the original.

Bibliography

[1] G. W. Cottrell, P. Munro, and D. Zipser. Image compression by backpropagation: An example of extensional programming. Technical Report ICS 8702, Institute for Cognitive Science, University of California, San Diego, CA, February 1987.

[2] David E. Glover. Optical Fourier/electronic neurocomputer machine vision inspection system. In Proceedings of the Vision '88 Conference, Dearborn, MI, June 1988. Society of Manufacturing Engineers.


[3] R. Paul Gorman and Terrence J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1(1):76-90, 1988.

[4] Robert Hecht-Nielsen. Neurocomputing. Addison-Wesley, Reading, MA, 1990.

[5] Geoffrey E. Hinton and Terrence J. Sejnowski. Neural network architectures for AI. Tutorial No. MP2, AAAI87, Seattle, WA, July 1987.

[6] James McClelland and David Rumelhart. Explorations in Parallel Distributed Processing. MIT Press, Cambridge, MA, 1986.

[7] James McClelland and David Rumelhart. Parallel Distributed Processing, volumes 1 and 2. MIT Press, Cambridge, MA, 1986.

[8] D. B. Parker. Learning logic. Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT, Cambridge, MA, April 1985.

[9] Terrence J. Sejnowski and Charles R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987.

[10] Philip D. Wasserman. Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.

[11] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, Cambridge, MA, August 1974.


4  The BAM and the Hopfield Memory

The subject of this chapter is a type of ANS called an associative memory. When you read a bit further, you may wonder why the backpropagation network discussed in the previous chapter was not included in this category. In fact, the definition of an associative memory, which we shall present shortly, does apply to the backpropagation network in certain circumstances. Nevertheless, we have chosen to delay the formal discussion of associative memories until now. Our definitions and discussion will be slanted toward the two varieties of memories treated in this chapter: the bidirectional associative memory (BAM), and the Hopfield memory. You should be able to generalize the discussion to cover other network models.

The concept of an associative memory is a fairly intuitive one: associative memory appears to be one of the primary functions of the brain. We easily associate the face of a friend with that friend's name, or a name with a telephone number.

Many devices exhibit associative-memory characteristics. For example, the memory bank in a computer is a type of associative memory: it associates addresses with data. An object-oriented program (OOP) with inheritance can exhibit another type of associative memory. Given a datum, the OOP associates other data with it, through the OOP's inheritance network. This type of memory is called a content-addressable memory (CAM). The CAM associates data with addresses of other data; it does the opposite of the computer memory bank.

The Hopfield memory, in particular, played an important role in the current resurgence of interest in the field of ANS. Probably as much as any other single factor, the efforts of John Hopfield, of the California Institute of Technology, have had a profound, stimulating effect on the scientific community in the area


of ANS. Before describing the BAM and the Hopfield memory, we shall present a few definitions in the next section.

4.1 ASSOCIATIVE-MEMORY DEFINITIONS

In this section, we review some basic definitions and concepts related to associative memories. We shall begin with a discussion of Hamming distance, not because the concept is likely to be new to you, but because we want to relate it to the more familiar Euclidean distance, in order to make the notion of Hamming distance more plausible. Then we shall discuss a simple associative memory called the linear associator.

4.1.1 Hamming Distance

Figure 4.1 shows a set of points that form the three-dimensional Hamming cube. In general, Hamming space can be defined by the expression

H^n = \{ x = (x_1, x_2, \ldots, x_n)^t \in R^n : x_i \in \{\pm 1\} \}    (4.1)

In words, n-dimensional Hamming space is the set of n-dimensional vectors, with each component an element of the real numbers, R, subject to the condition that each component is restricted to the values ±1. This space has 2^n points, all equidistant from the origin of Euclidean space.

Many neural-network models use the concept of the distance between two vectors. There are, however, many different measures of distance. In this section, we shall define the distance measure known as Hamming distance and shall show its relationship to the familiar Euclidean distance between points. In later chapters, we shall explore other distance measures.

Let x = (x_1, x_2, \ldots, x_n)^t and y = (y_1, y_2, \ldots, y_n)^t be two vectors in n-dimensional Euclidean space, subject to the restriction that x_i, y_i \in \{\pm 1\}, so that x and y are also vectors in n-dimensional Hamming space. The Euclidean distance between the two vector endpoints is

d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}

Since x_i, y_i \in \{\pm 1\}, then (x_i - y_i)^2 \in \{0, 4\}. Thus, the Euclidean distance can be written as

d = \sqrt{4 \cdot (\text{number of mismatched components of } x \text{ and } y)}


Figure 4.1  This figure shows the Hamming cube in three-dimensional space. The entire three-dimensional Hamming space, H^3, comprises the eight points having coordinate values of either -1 or +1. In this three-dimensional space, no other points exist.

We define the Hamming distance as

h = \text{number of mismatched components of } x \text{ and } y    (4.2)

or the number of bits that are different between x and y.¹

The Hamming distance is related to the Euclidean distance by the equation

d = 2\sqrt{h}    (4.3)

or

h = \frac{d^2}{4}    (4.4)

¹ Even though the components of the vectors are ±1, rather than 0 and 1, we shall use the term bits to represent one of the vector components. We shall refer to vectors having components of ±1 as being bipolar, rather than binary. We shall reserve the term binary for vectors whose components are 0 and 1.
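A quick numerical check of Eqs. (4.2) through (4.4), using a pair of made-up bipolar vectors (illustrative Python, not from the text):

import numpy as np

x = np.array([ 1, -1,  1,  1])
y = np.array([ 1,  1, -1,  1])

h = int(np.sum(x != y))             # Hamming distance: 2 mismatched components
d = float(np.linalg.norm(x - y))    # Euclidean distance: sqrt(4 + 4) = 2.828...

print(h, d, 2 * np.sqrt(h))         # prints 2 2.828... 2.828..., so d = 2*sqrt(h)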


We shall use the concept of Hamming distance a little later in our discussion of the BAM. In the next section, we shall take a look at the formal definition of the associative memory and the details of the linear-associator model.

Exercise 4.1: Determine the Euclidean distance between (1, 1, 1, 1, 1)^t and (-1, -1, 1, -1, 1)^t. Use this result to determine the Hamming distance with Eq. (4.4).

4.1.2 The Linear Associator

Suppose we have L pairs of vectors, \{(x_1, y_1), (x_2, y_2), \ldots, (x_L, y_L)\}, with x_i \in R^n and y_i \in R^m. We call these vectors exemplars, because we will use them as examples of correct associations. We can distinguish three types of associative memories:

1. Heteroassociative memory: Implements a mapping, \Phi, of x to y such that \Phi(x_i) = y_i, and, if an arbitrary x is closer to x_i than to any other x_j, j = 1, \ldots, L, then \Phi(x) = y_i. In this and the following definitions, closer means with respect to Hamming distance.

2. Interpolative associative memory: Implements a mapping, \Phi, of x to y such that \Phi(x_i) = y_i, but, if the input vector differs from one of the exemplars by the vector d, such that x = x_i + d, then the output of the memory also differs from one of the exemplars by some vector e: \Phi(x) = \Phi(x_i + d) = y_i + e.

3. Autoassociative memory: Assumes y_i = x_i and implements a mapping, \Phi, of x to x such that \Phi(x_i) = x_i, and, if some arbitrary x is closer to x_i than to any other x_j, j = 1, \ldots, L, then \Phi(x) = x_i.

Building such a memory is not such a difficult task mathematically if we make the further restriction that the vectors, x_i, form an orthonormal set.² To build an interpolative associative memory, we define the function

\Phi(x) = (y_1 x_1^t + y_2 x_2^t + \cdots + y_L x_L^t) x    (4.5)

If x_2 is the input vector, then

\Phi(x_2) = y_1 \delta_{12} + y_2 \delta_{22} + \cdots + y_L \delta_{L2}

since x_i^t x_j = \delta_{ij}, where \delta_{ij} = 1 if i = j, and \delta_{ij} = 0 if i \neq j.


All the δ_ij terms in the preceding expression vanish, except for δ_22, which is equal to 1. The result is perfect recall of y_2.

If the input vector is different from one of the exemplars, such that x = x_i + d, then the output is

    Φ(x) = Φ(x_i + d) = y_i + e

where

    e = (y_1 x_1^t + y_2 x_2^t + ... + y_L x_L^t) d

Note that there is nothing in the discussion of the linear associator that requires that the input or output vectors be members of Hamming space: The only requirement is that they be orthonormal. Furthermore, notice that there was no training involved in the definition of the linear associator. The function that mapped x into y was defined by the mathematical expression in Eq. (4.5). Most of the models we discuss in this chapter share this characteristic; that is, they are not trained in the sense that an Adaline or backpropagation network is trained.

In the next section, we take up the discussion of BAM. This model utilizes the distributed processing approach, discussed in the previous chapters, to implement an associative memory.

4.2 THE BAM

The BAM consists of two layers of processing elements that are fully interconnected between the layers. The units may, or may not, have feedback connections to themselves. The general case is illustrated in Figure 4.2.

4.2.1 BAM Architecture

As in other neural network architectures, in the BAM architecture there are weights associated with the connections between processing elements. Unlike in many other architectures, these weights can be determined in advance if all of the training vectors can be identified.

We can borrow the procedure from the linear-associator model to construct the weight matrix. Given L vector pairs that constitute the set of exemplars that we would like to store, we can construct the matrix:

    w = y_1 x_1^t + y_2 x_2^t + ... + y_L x_L^t        (4.6)

This equation gives the weights on the connections from the x layer to the y layer. For example, the value w_23 is the weight on the connection from the


third unit on the x layer to the second unit on the y layer. To construct the weights for the x layer units, we simply take the transpose of the weight matrix, w^t.

Figure 4.2  The BAM shown here has n units on the x layer, and m units on the y layer. For convenience, we shall call the x vector the input vector, and call the y vector the output vector. In this network, x ∈ H^n, and y ∈ H^m. All connections between units are bidirectional, with weights at each end. Information passes back and forth from one layer to the other, through these connections. Feedback connections at each unit may not be present in all BAM architectures.

We can make the BAM into an autoassociative memory by constructing the weight matrix as

    w = x_1 x_1^t + x_2 x_2^t + ... + x_L x_L^t

In this case, the weight matrix is square and symmetric.

4.2.2 BAM Processing

Once the weight matrix has been constructed, the BAM can be used to recall information (e.g., a telephone number), when presented with some key information (a name corresponding to a particular telephone number). If the desired information is only partially known in advance or is noisy (a misspelled name such as "Simth"), the BAM may be able to complete the information (giving the proper spelling, "Smith," and the correct telephone number).
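Equation (4.6) translates directly into a few lines of code. The fragment below is a small Python sketch of our own (numpy is assumed, and the function names are arbitrary) that builds the heteroassociative weight matrix, along with the autoassociative variant just described. Because storage is simply a sum of outer products, another exemplar pair can be added later by adding one more outer product to w.

    import numpy as np

    def bam_weights(x_exemplars, y_exemplars):
        # Eq. (4.6): w = y1 x1^t + y2 x2^t + ... + yL xL^t
        # Rows correspond to y-layer units, columns to x-layer units;
        # the x-layer weights are simply the transpose, w^t.
        return sum(np.outer(y, x) for x, y in zip(x_exemplars, y_exemplars))

    def autoassociative_weights(x_exemplars):
        # Autoassociative case: w = x1 x1^t + ... + xL xL^t (square and symmetric).
        return sum(np.outer(x, x) for x in x_exemplars)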


.^Although we consistently beg<strong>in</strong> with the X-to-y propagation, you could beg<strong>in</strong> <strong>in</strong> the other direction.4.2 The BAM 133To recall <strong>in</strong>formation us<strong>in</strong>g the BAM, we perform the follow<strong>in</strong>g steps:1. Apply an <strong>in</strong>itial vector pair, (x 0 , yo), to the process<strong>in</strong>g elements of the BAM.2. Propagate the <strong>in</strong>formation from the x layer to the y layer, <strong>and</strong> update thevalues on the y-layer units. We shall see how this propagation is doneshortly. 33. Propagate the updated y <strong>in</strong>formation back to the x layer <strong>and</strong> update theunits there.4. Repeat steps 2 <strong>and</strong> 3 until there is no further change <strong>in</strong> the units on eachlayer.This algorithm is what gives the BAM its bidirectional nature. The terms <strong>in</strong>put<strong>and</strong> output refer to different quantities, depend<strong>in</strong>g on the current direction ofthe propagation. For example, <strong>in</strong> go<strong>in</strong>g from y to x, the y vector is consideredas the <strong>in</strong>put to the network, <strong>and</strong> the x vector is the output. The opposite is truewhen propagat<strong>in</strong>g from x to y.If all goes well, the f<strong>in</strong>al, stable state will recall one of the exemplarsused to construct the weight matrix. S<strong>in</strong>ce, <strong>in</strong> this example, we assume weknow someth<strong>in</strong>g about the desired x vector, but perhaps know noth<strong>in</strong>g aboutthe associated y vector, we hope that the f<strong>in</strong>al output is the exemplar whosex, vector is closest <strong>in</strong> Hamm<strong>in</strong>g distance to the orig<strong>in</strong>al <strong>in</strong>put vector, x 0 . Thisscenario works well provided we have not overloaded the BAM with exemplars.If we try to put too much <strong>in</strong>formation <strong>in</strong> a given BAM, a phenomenon known ascrosstalk occurs between exemplar patterns. Crosstalk occurs when exemplarpatterns are too close to each other. The <strong>in</strong>teraction between these patterns canresult <strong>in</strong> the creation of spurious stable states. In that case, the BAM couldstabilize on mean<strong>in</strong>gless vectors. If we th<strong>in</strong>k <strong>in</strong> terms of a surface <strong>in</strong> weightspace, as we did <strong>in</strong> Chapters 2 <strong>and</strong> 3, the spurious stable states correspond tom<strong>in</strong>ima that appear between the m<strong>in</strong>ima that correspond to the exemplars.4.2.3 BAM MathematicsThe basic process<strong>in</strong>g done by each unit of the BAM is similar to that done bythe general process<strong>in</strong>g element discussed <strong>in</strong> the first chapter. The units computesums of products of the <strong>in</strong>puts <strong>and</strong> weights to determ<strong>in</strong>e a net-<strong>in</strong>put value, net.On the y layer,net*' = wx (4.7)where net ;i/ is the vector of net-<strong>in</strong>put values on the y layer. In terms of the<strong>in</strong>dividual units, y t ,


    net_i^y = Σ_{j=1}^{n} w_ij x_j        (4.8)

On the x layer,

    net^x = w^t y        (4.9)

    net_j^x = Σ_{i=1}^{m} w_ij y_i        (4.10)

The quantities n and m are the dimensions of the x and y layers, respectively.

The output value for each processing element depends on the net-input value, and on the current output value on the layer. The new value of y at timestep t + 1, y(t + 1), is related to the value of y at timestep t, y(t), by

    y_i(t+1) = +1        if net_i^y > 0
               y_i(t)    if net_i^y = 0        (4.11)
               -1        if net_i^y < 0

Similarly, x(t + 1) is related to x(t) by

    x_i(t+1) = +1        if net_i^x > 0
               x_i(t)    if net_i^x = 0        (4.12)
               -1        if net_i^x < 0

Let's illustrate BAM processing with a specific example. Let

    x_1 = (1, -1, -1, 1, -1, 1, 1, -1, -1, 1)^t  and  y_1 = (1, -1, -1, -1, -1, 1)^t
    x_2 = (1, 1, 1, -1, -1, -1, 1, 1, -1, -1)^t  and  y_2 = (1, 1, 1, 1, -1, -1)^t

We have purposely made these vectors rather long to minimize the possibility of crosstalk. Hand calculation of the weight matrix is tedious when the vectors are long, but the weight matrix is fairly sparse.

The weight matrix is calculated from Eq. (4.6). The result is

        [  2   0   0   0  -2   0   2   0  -2   0 ]
        [  0   2   2  -2   0  -2   0   2   0  -2 ]
    w = [  0   2   2  -2   0  -2   0   2   0  -2 ]
        [  0   2   2  -2   0  -2   0   2   0  -2 ]
        [ -2   0   0   0   2   0  -2   0   2   0 ]
        [  0  -2  -2   2   0   2   0  -2   0   2 ]


For our first trial, we choose an x vector with a Hamming distance of 1 from x_1: x_0 = (-1, -1, -1, 1, -1, 1, 1, -1, -1, 1)^t. This situation could represent noise on the input vector. The starting y_0 vector is one of the training vectors, y_2: y_0 = (1, 1, 1, 1, -1, -1)^t. (Note that in a realistic problem, you may not have prior knowledge of the output vector. Use a random bipolar vector if necessary.)

We will propagate first from x to y. The net inputs to the y units are net^y = (4, -12, -12, -12, -4, 12)^t. The new y vector is y_new = (1, -1, -1, -1, -1, 1)^t, which is also one of the training vectors. Propagating back to the x layer we get x_new = (1, -1, -1, 1, -1, 1, 1, -1, -1, 1)^t. Further passes result in no change, so we are finished. The BAM successfully recalled the first training set.

Exercise 4.2: Repeat the calculation just shown, but begin with the y-to-x propagation. Is the result what you expected?

For our second example, we choose the following initial vectors:

    x_0 = (-1, 1, 1, -1, 1, 1, 1, -1, 1, -1)^t
    y_0 = (-1, 1, -1, 1, -1, -1)^t

The Hamming distances of the x_0 vector from the training vectors are h(x_0, x_1) = 7 and h(x_0, x_2) = 5. For the y_0 vector, the values are h(y_0, y_1) = 4 and h(y_0, y_2) = 2. Based on these results, we might expect that the BAM would settle on the second exemplar as a final solution.

We start again by propagating from x to y, and the new y vector is y_new = (-1, 1, 1, 1, 1, -1)^t. Propagating back from y to x, we get x_new = (-1, 1, 1, -1, 1, -1, -1, 1, 1, -1)^t. Further propagation does not change the results. If you examine these output vectors, you will notice that they do not match any of the exemplars. Furthermore, they are actually the complement of the first training set, (x_new, y_new) = (x_1^c, y_1^c), where the "c" superscript refers to the complement. This example illustrates a basic property of the BAM: If you encode an exemplar, (x, y), you also encode its complement, (x^c, y^c).

The best way to familiarize yourself with the properties of a BAM is to work through many examples. Thus, we recommend the following exercises.

Exercise 4.3: Using the same weight matrix as in Exercise 4.2, experiment with several different input vectors to investigate the characteristics of the BAM. In particular, evaluate the difference between starting with x-to-y propagation, and y-to-x propagation. Pick starting vectors that have various Hamming distances from the exemplar vectors. In addition, try adding more exemplars to the weight matrix. You can add more exemplars to the weight matrix by a simple additive process. How many exemplars can you add before crosstalk becomes a significant problem?
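If you would rather let a computer do the bookkeeping for these experiments, the following continuation of the earlier Python sketch (again our own illustration) implements the threshold rule of Eqs. (4.11) and (4.12) and the alternating propagation of Section 4.2.2, and reproduces the first trial above.

    def threshold(net, prev):
        # Eqs. (4.11)/(4.12): +1 above zero, -1 below zero, unchanged at zero.
        return np.where(net > 0, 1, np.where(net < 0, -1, prev))

    def bam_recall(w, x, y, max_passes=100):
        # Alternate x-to-y and y-to-x passes until neither layer changes.
        for _ in range(max_passes):
            y_new = threshold(w @ x, y)          # propagate x to y, Eq. (4.7)
            x_new = threshold(w.T @ y_new, x)    # propagate y back to x, Eq. (4.9)
            if np.array_equal(x_new, x) and np.array_equal(y_new, y):
                break
            x, y = x_new, y_new
        return x, y

    x1 = np.array([1, -1, -1, 1, -1, 1, 1, -1, -1, 1]); y1 = np.array([1, -1, -1, -1, -1, 1])
    x2 = np.array([1, 1, 1, -1, -1, -1, 1, 1, -1, -1]); y2 = np.array([1, 1, 1, 1, -1, -1])
    w = bam_weights([x1, x2], [y1, y2])

    x0 = np.array([-1, -1, -1, 1, -1, 1, 1, -1, -1, 1])   # Hamming distance 1 from x1
    xf, yf = bam_recall(w, x0, y2.copy())
    print(xf, yf)   # settles on the first exemplar, (x1, y1)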


136 The BAM <strong>and</strong> the Hopfield MemoryExercise 4.4: Construct an autoassociative BAM us<strong>in</strong>g the follow<strong>in</strong>g tra<strong>in</strong><strong>in</strong>gvectors:xj =(!,-!,-1,1,-1,1)* <strong>and</strong> x 2 = (1,1,1, -1, -1, -1)*Determ<strong>in</strong>e the output us<strong>in</strong>g XQ = (1,1,1,1,—1,1)*, which is a Hamm<strong>in</strong>g distanceof two from each tra<strong>in</strong><strong>in</strong>g vector. Try XQ = (—1,1,1,—1,1,—1)', whichis a complement of one of the tra<strong>in</strong><strong>in</strong>g vectors. Experiment with this network<strong>in</strong> accordance with the <strong>in</strong>structions <strong>in</strong> Exercise 4.3. In addition, try sett<strong>in</strong>g thediagonal elements of the weight matrix equal to zero. Does do<strong>in</strong>g so have anyeffect on the operation of the BAM?4.2.4 BAM Energy FunctionIn the previous two chapters, we discussed an iterative process for f<strong>in</strong>d<strong>in</strong>g weightvalues that are appropriate for a particular application. Dur<strong>in</strong>g those discussions,each po<strong>in</strong>t <strong>in</strong> weight space had associated with it a certa<strong>in</strong> error value. Thelearn<strong>in</strong>g process was an iterative attempt to f<strong>in</strong>d the weights which m<strong>in</strong>imizedthe error. To ga<strong>in</strong> an underst<strong>and</strong><strong>in</strong>g of the process, we exam<strong>in</strong>ed simple caseshav<strong>in</strong>g two weights so that each weight vector corresponded to a po<strong>in</strong>t on anerror surface <strong>in</strong> three dimensions. The height of the surface at each po<strong>in</strong>tdeterm<strong>in</strong>ed the error associated with that weight vector. To m<strong>in</strong>imize the error,we began at some given start<strong>in</strong>g po<strong>in</strong>t <strong>and</strong> moved along the surface until wereached the deepest valley on the surface. This m<strong>in</strong>imum po<strong>in</strong>t corresponded tothe weights that resulted <strong>in</strong> the smallest error value. Once these weights werefound, no further changes were permitted <strong>and</strong> tra<strong>in</strong><strong>in</strong>g was complete.Dur<strong>in</strong>g the tra<strong>in</strong><strong>in</strong>g process, the weights form a dynamical system. That is,the weights change as a function of time, <strong>and</strong> those changes can be representedas a set of coupled differential equations.For the BAM that we have been discuss<strong>in</strong>g <strong>in</strong> the last few sections, a slightlydifferent situation occurs. The weights are calculated <strong>in</strong> advance, <strong>and</strong> are notpart of a dynamical system. On the other h<strong>and</strong>, an unknown pattern presentedto the BAM may require several passes before the network stabilizes on a f<strong>in</strong>alresult. In this situation, the x <strong>and</strong> y vectors change as a function of time, <strong>and</strong>they form a dynamical system.In both of the dynamical systems described, we are <strong>in</strong>terested <strong>in</strong> severalaspects of system behavior: Does a solution exist? If it does, will the systemconverge to it <strong>in</strong> a f<strong>in</strong>ite time? What is the solution? Up to now we have beenprimarily concerned with the last of those three questions. 
We shall now lookat the first two.For the simple examples discussed so far, the question of the existenceof a solution is academic. We found solutions; therefore, they must exist.Nevertheless, we may have been simply lucky <strong>in</strong> our choice of problems. Itis still a valid question to ask whether a BAM, or for that matter, any othernetwork, will always converge to a stable solution. The technique discussed here


is fairly easy to apply to the BAM. Unfortunately, many network architectures do not have convergence proofs. The lack of such a proof does not mean that the network will not function properly, but there is no guarantee that it will converge for any given problem.

In the theory of dynamical systems, a theorem can be proved concerning the existence of stable states that uses the concept of a function called a Lyapunov function, or energy function. We shall present a nonrigorous version here, which is useful for our purposes. If a bounded function of the state variables of a dynamical system can be found, such that all state changes result in a decrease in the value of the function, then the system has a stable solution.⁴ This function is called a Lyapunov function, or energy function. In the case of the BAM, such a function exists. We shall call it the BAM energy function; it has the form

    E(x, y) = -y^t w x        (4.13)

or, in terms of components,

    E = - Σ_{i=1}^{m} Σ_{j=1}^{n} y_i w_ij x_j        (4.14)

We shall now state an important theorem about the BAM energy function that will help to answer our questions about the existence of stable solutions of the BAM processing equations. The theorem has three parts:

1. Any change in x or y during BAM processing results in a decrease in E.
2. E is bounded below by E_min = - Σ_ij |w_ij|.
3. When E changes, it must change by a finite amount.

Items 1 and 2 prove that E is a Lyapunov function, and that the dynamical system has a stable state. In particular, item 2 shows that E can decrease only to a certain value; it can't continue down to negative infinity, so that eventually the x and y vectors must stop changing. Item 3 prevents the possibility that changes in E might be infinitesimally small, resulting in an infinite amount of time spent before the minimum E is reached.

In essence, the weight matrix determines the contour of a surface, or landscape, with hills and valleys, much like the ones we have discussed in previous chapters. Figure 4.3 illustrates a cross-sectional view of such a surface. The analogy of the E function as an energy function results from an analysis of how the BAM operates. The initial state of the BAM is determined by the choice of the starting vectors, (x and y). As the BAM processes the data, x and y change, resulting in movement of the energy over the landscape, which is guaranteed by the BAM energy theorem to be downward.

⁴ See Hirsch and Smale [3] or Beltrami [1] for a more rigorous version of the theorem.
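The energy function itself is a one-line computation. Continuing the Python sketch from earlier in the chapter (our illustration), the two functions below evaluate Eq. (4.13) and the bound from part 2 of the theorem. Evaluating bam_energy after each half-pass of bam_recall for the example of Section 4.2.3 shows the value never increasing, and it ends at the bound, -64, for that weight matrix.

    def bam_energy(w, x, y):
        # Eq. (4.13): E(x, y) = -y^t w x
        return -float(y @ w @ x)

    def bam_energy_bound(w):
        # Part 2 of the BAM energy theorem: E is bounded below by -sum_ij |w_ij|.
        return -float(np.sum(np.abs(w)))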


CDCUJ138 The BAM <strong>and</strong> the Hopfield MemoryD)Figure 4.3This figure shows a cross-section of a BAM energy l<strong>and</strong>scape<strong>in</strong> two dimensions. The particular topography results fromthe choice of exemplar vectors that go <strong>in</strong>to mak<strong>in</strong>g up theweight matrix. Dur<strong>in</strong>g process<strong>in</strong>g, the BAM energy value willmove from its start<strong>in</strong>g po<strong>in</strong>t down the energy hill to the nearestm<strong>in</strong>imum, while the BAM outputs move from state a to stateb. Notice that the m<strong>in</strong>ima reached need not be the global, orlowest, m<strong>in</strong>ima on the l<strong>and</strong>scape.Initially, the changes <strong>in</strong> the calculated values of E(x, y) are large. As thex <strong>and</strong> y vectors reach their stable state, the value of E changes by smalleramounts, <strong>and</strong> eventually stops chang<strong>in</strong>g when the m<strong>in</strong>imum po<strong>in</strong>t is reached.This situation corresponds to a physical system such as a ball roll<strong>in</strong>g down ahill <strong>in</strong>to a valley, but with enough friction that, by the time the ball reachesthe bottom, it has no more energy <strong>and</strong> therefore it stops. Thus, the BAMresembles a dissipative dynamic system <strong>in</strong> which the E function corresponds tothe energy of the physical system. Remember that the weight matrix determ<strong>in</strong>esthe contour of this energy l<strong>and</strong>scape; that is, it determ<strong>in</strong>es how many energyvalleys there are, how far apart they are, how deep they are, <strong>and</strong> whether thereare any unexpected valleys (i.e., spurious states).We need to clarify one po<strong>in</strong>t. We have been illustrat<strong>in</strong>g these concepts us<strong>in</strong>ga two-dimensional cross-section of an energy l<strong>and</strong>scape, <strong>and</strong> us<strong>in</strong>g the familiar


term valley to refer to the locations of the minima. A more precise term would be basin. In fact, the literature on dynamical systems refers to these locations as basins of attraction.

To solidify the concept of BAM energy, we return to the examples of the previous section. First, notice that according to part two of the BAM energy theorem, the minimum value of E is -64, found by summing the negatives of all the magnitudes of the components of the matrix. A calculation of E for each of the two training vectors shows that both pairs sit at the bottom of basins having this same value of E. Our first trial vectors were x_0 = (-1, -1, -1, 1, -1, 1, 1, -1, -1, 1)^t and y_0 = (1, 1, 1, 1, -1, -1)^t. The energy of this system is E = -y_0^t w x_0 = 40.

The first propagation results in y_new = (1, -1, -1, -1, -1, 1)^t, and a new energy value E = -y_new^t w x_0 = -56. Propagation back to the x layer resulted in x_new = (1, -1, -1, 1, -1, 1, 1, -1, -1, 1)^t. The energy is now E = -y_new^t w x_new = -64. At this point, no further passes through the system are necessary, since -64 is the lowest possible energy. Since any further change in x or y would have to lower the energy, according to the theorem, no such changes are possible.

Exercise 4.5: Perform the BAM energy calculation on the second example from Section 4.2.3.

Proof of the BAM Energy Theorem.  In this section, we prove the first part of the BAM energy theorem. We present this proof because it is both clever and easy to understand. The proof is not essential to your understanding of the remaining material, so you may skip it if you wish.

We begin with Eq. (4.14), which is reproduced here:

    E = - Σ_{i=1}^{m} Σ_{j=1}^{n} y_i w_ij x_j

According to the theorem, any change in x or y must result in a decrease in the value of E. For simplicity, we first consider a change in a single component of y, specifically y_k. We can rewrite Eq. (4.14) showing the term with y_k explicitly:

    E = - y_k Σ_{j=1}^{n} w_kj x_j - Σ_{i≠k} Σ_{j=1}^{n} y_i w_ij x_j        (4.15)

Now, make the change y_k -> y_k^new. The new energy value is

    E^new = - y_k^new Σ_{j=1}^{n} w_kj x_j - Σ_{i≠k} Σ_{j=1}^{n} y_i w_ij x_j        (4.16)


Since only y_k has changed, the second terms on the right sides of Eqs. (4.15) and (4.16) are identical. In that case we can write the change in energy as

    ΔE = (E^new - E) = (y_k - y_k^new) Σ_{j=1}^{n} w_kj x_j        (4.17)

For convenience, we recall the state-change equation, Eq. (4.11), that determines the new value of y_k: y_k^new is +1 if Σ_j w_kj x_j > 0, -1 if Σ_j w_kj x_j < 0, and unchanged if the sum is zero.

The first possibility is that y_k = +1 and y_k^new = -1. In that case, (y_k - y_k^new) > 0. But, according to the procedure for calculating y_k^new, this transition can occur only if Σ_{j=1}^{n} w_kj x_j < 0. Therefore, the value of ΔE is the product of one factor that is greater than zero and one that is less than zero. The result is that ΔE < 0.

The second possibility is that y_k = -1 and y_k^new = +1. Then, (y_k - y_k^new) < 0, and the transition can occur only if Σ_{j=1}^{n} w_kj x_j > 0. Again, ΔE is the product of one factor less than zero and one greater than zero. In both cases where y_k changes, ΔE decreases. Note that, for the case where y_k does not change, both factors in the equation for ΔE are zero, so the energy does not change unless one of the vectors changes.

Equation (4.17) can be extended to cover the situation where more than one component of the y vector changes. If we write Δy_i = (y_i - y_i^new), then the equation that replaces Eq. (4.17) for the general case where any or all components of y can change is

    ΔE = (E^new - E) = Σ_{i=1}^{m} Δy_i Σ_{j=1}^{n} w_ij x_j        (4.18)

This equation is a sum of m terms, one for each possible Δy_i, which are either negative or zero depending on whether or not y_i changed. Thus, in the general case, E must decrease if y changes.

Exercise 4.6: Prove part 2 of the BAM energy theorem.

Exercise 4.7: Prove part 3 of the BAM energy theorem.

Exercise 4.8: Suppose we have defined an autoassociative BAM whose weight matrix is calculated according to

    w = x_1 x_1^t + x_2 x_2^t + ... + x_L x_L^t

where the x_i are not necessarily orthonormal. Show that the weight matrix can be written as

    w = aI + S


4.3 The Hopfield Memory 141where a is a constant, I is the identity matrix, <strong>and</strong> S is identical to the weightmatrix, but with zeros on the diagonal. For an arbitrary <strong>in</strong>put vector, x, showthat the value of the BAM energy function isE=-(3- x'Sxwhere /3 is a constant. From this result, deduce that the change <strong>in</strong> energy, AE,dur<strong>in</strong>g a state change, is <strong>in</strong>dependent of the diagonal elements on the weightmatrix.4.3 THE HOPFIELD MEMORYIn this section we describe two versions of an ANS, which we call the Hopfieldmemory. We shall show that you can consider the Hopfield memory as aderivative of the BAM, although we doubt that that was the way the Hopfieldmemory orig<strong>in</strong>ated. The two versions are the discrete Hopfield memory, <strong>and</strong>the cont<strong>in</strong>uous Hopfield memory, depend<strong>in</strong>g on whether the unit outputs area discrete or a cont<strong>in</strong>uous function of the <strong>in</strong>puts respectively.4.3.1 Discrete Hopfield MemoryIn the discussion of the previous sections, we def<strong>in</strong>ed an autoassociative BAM asone which stored <strong>and</strong> recalled a set of vectors {\i, x 2 ,..., XL }• The prescriptionfor determ<strong>in</strong><strong>in</strong>g the weights was to calculate the correlation matrix:w =Figure 4.4 illustrates a BAM that performs this autoassociative function.We po<strong>in</strong>ted out <strong>in</strong> the previous section that the weight matrix for an autoassociativememory is square <strong>and</strong> symmetric, which means, for example, thatthe weights w\ 2 <strong>and</strong> wi\ are equal. S<strong>in</strong>ce each of the two layers has the samenumber of units, <strong>and</strong> the connection weight from the nth unit on layer 1 to thenth unit on layer 2 is the same as the connection weight from the nth unit onlayer 2 back to the nth unit on layer 1, it is possible to reduce the autoassociativeBAM structure to one hav<strong>in</strong>g only a s<strong>in</strong>gle layer. Figure 4.5 illustrates thisstructure. A somewhat different render<strong>in</strong>g appears <strong>in</strong> Figure 4.6. The figureshows a fully connected network, without the feedback from each unit to itself.The major difference between the architecture of Figure 4.5 <strong>and</strong> that ofFigure 4.6 is the existence of the external <strong>in</strong>put signals, /,. This additionmodifies the calculation of the net <strong>in</strong>put to a given unit by the <strong>in</strong>clusion ofthe /, ; term. In this case,net,; = (4.19)


142 The BAM <strong>and</strong> the Hopfield Memoryx layerx layerFigure 4.4 The autoassociative BAM architecture has an equal number ofunits on each layer. Note that we have omitted the feedbackterms to each unit.x layerFigure 4.5 The autoassociative BAM can be reduced to a s<strong>in</strong>gle-layerstructure. Notice that, when the reduction is carried out, thefeedback connections to each unit reappear.


4.3 The Hopfield Memory 143Figure 4.6This figure shows the Hopfield-memory architecture withoutthe feedback connections to each unit. Elim<strong>in</strong>at<strong>in</strong>g theseconnections explicitly forces the weight matrix to have zeroson the diagonal. We have also added external <strong>in</strong>put signals,/,:, to each unit.Moreover, we can allow the threshold condition on the output value to take ona value other than zero. Then, <strong>in</strong>stead of Eq. (4.12), we would have+ 1 netj >Xi(t) net z —-1 net;


144 The BAM <strong>and</strong> the Hopfield MemoryThe factor of i did not appear <strong>in</strong> the energy equation of the BAM. In the BAM,both forward <strong>and</strong> backward passes contributed equally to the total energy of thesystem. In the Hopfield memory, there is only a s<strong>in</strong>gle layer, hence, half theenergy that there is <strong>in</strong> the BAM.Exercise 4.9: Beg<strong>in</strong>n<strong>in</strong>g with Eq. (4.22), show that the Hopfield network willalways converge to a stable state by prov<strong>in</strong>g a theorem similar to the BAMenergy theorem of the previous section. This exercise assumes that you haveread the section on the proof of the BAM energy theorem.4.3.2 Cont<strong>in</strong>uous Hopfield ModelHopfield's <strong>in</strong>tent was to extend his discrete-memory model by <strong>in</strong>corporat<strong>in</strong>gsome results from neurobiology that make his PEs more closely resemble actualneurons. For example, it is known that real neurons have a cont<strong>in</strong>uous,graded output response as a function of their <strong>in</strong>puts, rather than the two-state,on-or-off b<strong>in</strong>ary output. By us<strong>in</strong>g this modification <strong>and</strong> others, Hopfield constructeda new, cont<strong>in</strong>uous-memory model that had the same useful propertiesof an associative memory that the discrete model showed. Moreover, there is ananalogous electronic circuit us<strong>in</strong>g nonl<strong>in</strong>ear amplifiers <strong>and</strong> resistors, which suggeststhe possibility of build<strong>in</strong>g these associative memory circuits us<strong>in</strong>g VLSItechnology.To develop the cont<strong>in</strong>uous model, we shall def<strong>in</strong>e ui to be the net <strong>in</strong>put tothe ith PE. One possible biological analog of u t is the summed action potentialsat the axon hillock of a neuron. In the case of the neuron, the output of the cellwould be a series of potential spikes whose mean frequency versus total actionpotential resembles the sigmoid curve <strong>in</strong> Figure 4.7(a). For use <strong>in</strong> the Hopfieldmodel, the PE output function isv t - g t (Xu,,) = - (1 + tanh(AiO) (4.23)where A is a constant called the ga<strong>in</strong> parameter. This relationship is illustrated<strong>in</strong> Figure 4.7(b) for several values of the ga<strong>in</strong> parameter, A.In real neurons, there will be a time delay between the appearance of theoutputs, Vj, of other cells, <strong>and</strong> the result<strong>in</strong>g net <strong>in</strong>put, u t , to a cell. This delayis caused by the resistance <strong>and</strong> capacitance of the cell membrane <strong>and</strong> the f<strong>in</strong>iteconductance of the synapse between the jth <strong>and</strong> ith cells. These ideas are<strong>in</strong>corporated <strong>in</strong>to the circuit shown <strong>in</strong> Figure 4.8.Each amplifier has an <strong>in</strong>put resistance, p, <strong>and</strong> an <strong>in</strong>put capacitance, C, asshown. Also shown are the external signals, /,-. In the case of an actual circuit,the external signals would supply a constant current to each amplifier.The net-<strong>in</strong>put current to each amplifier is the sum of the <strong>in</strong>dividual currentcontributions from other units, plus the external-<strong>in</strong>put current, m<strong>in</strong>us leakageacross the <strong>in</strong>put resistor, p. 
The contribution from each connect<strong>in</strong>g unit is thevoltage value across the resistor at the connection, divided by the connection


4.3 The Hopfield Memory 145A =50(a)Figure 4.7 (a) This sigmoid curve approximates the output of a neuron <strong>in</strong>response to the total action potential, (b) This series of graphshas been drawn us<strong>in</strong>g Eq. (4.23), for three different values ofthe ga<strong>in</strong> parameter, X. Note that, for large values of the ga<strong>in</strong>,the function approaches the step function used <strong>in</strong> the discretemodel.resistance. For the connection from the j'th unit to the z'th, this contributionwould be (vj - u,}/R t} = (v, - u,)T,j. The leakage current is Ui/p. Thus, thetotal contribution from all connect<strong>in</strong>g units <strong>and</strong> the external <strong>in</strong>put isIT, = —Rearrang<strong>in</strong>g slightly giveswhere -J- — ^ , 7^7 is tne parallel comb<strong>in</strong>ation of the <strong>in</strong>put resistor <strong>and</strong> theconnection-matrix resistors. We can treat the circuit as a transient RC circuit,<strong>and</strong> can f<strong>in</strong>d the value of w, from the equation that describes the charg<strong>in</strong>g ofthe capacitor as a result of the net-<strong>in</strong>put current. That equation is(4.24)These equations, one for each unit <strong>in</strong> the memory circuit, completely describethe time evolution of the system. If each PE is given an <strong>in</strong>itial value,«,(0), these equations can be solved on a digital computer us<strong>in</strong>g the numericaltechniques for <strong>in</strong>itial value problems. Do not forget to apply the output function,Eq. (4.23), to each u, to obta<strong>in</strong> the correspond<strong>in</strong>g amplifier output, v-,.


146 The BAM <strong>and</strong> the Hopfield MemoryDetail ofconnectionTo other unitsNonl<strong>in</strong>ear amplifiersInvert<strong>in</strong>gamplifiersFrom other unitsFigure 4.8 In this circuit diagram for the cont<strong>in</strong>uous Hopfield memory,amplifiers with a sigmoid output characteristic are used as thePEs. The black circles at the <strong>in</strong>tersection po<strong>in</strong>ts of the l<strong>in</strong>esrepresent connections between PEs. At each connection, weplace a resistor hav<strong>in</strong>g a value R t j — l/|Tjj|, where we haveused Hopfield's notation T;, to represent the weight matrix.S<strong>in</strong>ce all real resistor values are positive, <strong>in</strong>vert<strong>in</strong>g amplifiersare used to simulate <strong>in</strong>hibitory signals. Thus, a PE consists oftwo amplifiers. If the output of a particular element excitessome other element, then the connection is made with thesignal from the non<strong>in</strong>vert<strong>in</strong>g amplifier. If the connection is<strong>in</strong>hibitory, it is made from the <strong>in</strong>vert<strong>in</strong>g amplifier.Cont<strong>in</strong>uous-Model Energy Function. Like the BAM, the discrete Hopfieldmemory always converges to a stable po<strong>in</strong>t <strong>in</strong> Hamm<strong>in</strong>g space: one of the 2"vertices of the Hamm<strong>in</strong>g hypercube. 5 The energy function that allows us toanalyze the cont<strong>in</strong>uous model is= ~ £ £ 1 £ Ji.£ (4.25)5 As we are now work<strong>in</strong>g with b<strong>in</strong>ary vectors, rather than bipolar, we must th<strong>in</strong>k of the Hamm<strong>in</strong>ghypercube as be<strong>in</strong>g made up of po<strong>in</strong>ts whose values are {0, + 1} rather than ±1.


4.3 The Hopfield Memory 147In Eq. (4.25), g~ l (v) = u is the <strong>in</strong>verse of the function v — g(u). It is graphed<strong>in</strong> Figure 4.9, along with the <strong>in</strong>tegral of g~^v), as a function of v.To show that Eq. (4.25) is an appropriate Lyapunov function for the system,we shall take the time derivative of Eq. (4.25) assum<strong>in</strong>g 7^ is symmetric:dE_~dtNotice that the quantity <strong>in</strong> parentheses <strong>in</strong> Eq. (4.26) is identical to the right-h<strong>and</strong>side of Eq. (4.24). Then,dE r—* dvi duidt ^—*idt dtBecause u z = g~ l (v{), we can use the cha<strong>in</strong> rule to write<strong>and</strong> Eq. (4.26) becomesduj _ dg~ l (vj)dvtdt dvi dtdE ^d9tdt(4.27)Figure 4.9(a) shows that g~\Vi) is a monotonically <strong>in</strong>creas<strong>in</strong>g function of v l <strong>and</strong>therefore its derivative is positive everywhere. All the factors <strong>in</strong> the summationof Eq. (4.27) are positive, so dE/dt must decrease as the system evolves. The-1Figure 4.9(a) The graph shows the <strong>in</strong>verse of the nonl<strong>in</strong>ear outputfunction, (b) The graph shows the <strong>in</strong>tegral of the nonl<strong>in</strong>earoutput function.


148 The BAM <strong>and</strong> the Hopfield Memorysystem eventually reaches a stable configuration, where dE/dt — dvi/dt = 0.We have assumed, for this argument, that E is bounded. It would be necessaryto establish this fact to ensure that E will eventually stop chang<strong>in</strong>g.Effects of the Nonl<strong>in</strong>ear Output Function. If we make the assumption thatthe external <strong>in</strong>puts /, are all zero, then the energy equation for the cont<strong>in</strong>uousmodel, Eq. (4.25), is identical to that of the discrete model, Eq. (4.22), exceptfor the contribution19i (v)dv. (4.28)R, LThis term alters the energy l<strong>and</strong>scape such that the stable po<strong>in</strong>ts of the systemno longer lie exactly at the corners of the Hamm<strong>in</strong>g hypercube. The value of thega<strong>in</strong> parameter determ<strong>in</strong>es how close the stable po<strong>in</strong>ts come to the hypercubecorners. In the limit of very high ga<strong>in</strong>, A —> oo, Eq. (4.28) is driven to zero,<strong>and</strong> the cont<strong>in</strong>uous model becomes identical to the discrete model. For f<strong>in</strong>itega<strong>in</strong>, the stable po<strong>in</strong>ts move toward the <strong>in</strong>terior of the hypercube. As the ga<strong>in</strong>becomes smaller, these stable po<strong>in</strong>ts may coalesce. Ultimately, as A —> 0, onlya s<strong>in</strong>gle stable po<strong>in</strong>t exists for the system. Thus, prudent choice of the ga<strong>in</strong>parameter is necessary for successful operation.4.3.3 The Travel<strong>in</strong>g-Salesperson ProblemIn this section, we shall exam<strong>in</strong>e the application of the cont<strong>in</strong>uous Hopfieldmemory to a class of problems known as optimization problems. Simply put,these problems are typically posed <strong>in</strong> terms of f<strong>in</strong>d<strong>in</strong>g the best way to do someth<strong>in</strong>g,subject to certa<strong>in</strong> constra<strong>in</strong>ts. The best solution is generally def<strong>in</strong>ed bya specific criterion. For example, there might be a cost associated with eachpotential solution, <strong>and</strong> the best solution is the one that m<strong>in</strong>imizes the cost whilestill fulfill<strong>in</strong>g all the requirements for an acceptable solution. In fact, <strong>in</strong> manycases optimization problems are described <strong>in</strong> terms of a cost function.One such optimization problem is the travel<strong>in</strong>g-salesperson problem (TSP).In its simplest form, a salesperson must make a circuit through a certa<strong>in</strong> numberof cities, visit<strong>in</strong>g each only once, while m<strong>in</strong>imiz<strong>in</strong>g the total distance traveled.The problem is to f<strong>in</strong>d the right sequence of cities to visit. The constra<strong>in</strong>ts arethat all cities are visited, each is visited only once, <strong>and</strong> the salesperson returnsto the start<strong>in</strong>g po<strong>in</strong>t at the end of the trip. The cost function to be m<strong>in</strong>imized isthe total distance traveled <strong>in</strong> the course of the trip.Notice that m<strong>in</strong>imum distance is not a constra<strong>in</strong>t <strong>in</strong> this problem. Constra<strong>in</strong>tsmust be satisfied, whereas m<strong>in</strong>imum distance is only a desirable end. 
We can<strong>in</strong>clude m<strong>in</strong>imum distance as a constra<strong>in</strong>t if we dist<strong>in</strong>guish between two types ofconstra<strong>in</strong>ts: weak <strong>and</strong> strong. Weak constra<strong>in</strong>ts, such as m<strong>in</strong>imum distance forthe TSP, are conditions that we desire, but may not achieve. Strong constra<strong>in</strong>tsare conditions that must be satisfied, or the solution we obta<strong>in</strong> will be <strong>in</strong>valid.


4.3 The Hopfield Memory 149The TSP is computationally <strong>in</strong>tensive if an exhaustive search is to be performedcompar<strong>in</strong>g all possible routes to f<strong>in</strong>d the best one. For an n-city tour,there are n\ possible paths. Due to degeneracies, the number of dist<strong>in</strong>ct solutionsis less than n\. The term dist<strong>in</strong>ct <strong>in</strong> this case refers to tours with differenttotal distances. For a given tour, it does not matter which of the n cities is thestart<strong>in</strong>g po<strong>in</strong>t, <strong>in</strong> terms of the total distance traveled. This degeneracy reducesthe number of dist<strong>in</strong>ct tours by a factor of n. Similarly, for a given circuit, itdoes not matter <strong>in</strong> which of two directions the salesperson travels. This factfurther reduces the number of dist<strong>in</strong>ct tours by a factor of two. Thus, for ann-city tour, there are nl/2n dist<strong>in</strong>ct circuits to consider.For a five-city tour, there would be 120/10 = 12 dist<strong>in</strong>ct tours—hardlya problem worthy of solution by a computer! A 10-city tour, however, has3628800/20 = 181440 dist<strong>in</strong>ct tours; a 30-city tour has over 4 x 10 30 possibilities.Add<strong>in</strong>g a s<strong>in</strong>gle city to a tour results <strong>in</strong> an <strong>in</strong>crease <strong>in</strong> the number ofdist<strong>in</strong>ct tours by a factor of(n+ l)!/2(n+ 1)n!/2n ~ "Thus, a 31-city tour requires that we exam<strong>in</strong>e 31 times as many tours as wemust for a 30-city tour. The amount of computation time required by a digitalcomputer to solve this problem grows exponentially with the number of cities.The problem belongs to a class of problems known as NP-complete [2]. Becauseof the computational burden, it is often the case with optimization problems thata good solution found quickly is more desirable than the best solution found toolate to be of use.The Hopfield memory is well suited for this type of problem. The characteristicof <strong>in</strong>terest is the rapid m<strong>in</strong>imization of an energy function, E. Althoughthe network is guaranteed to converge to a m<strong>in</strong>imum of the energy function,there is no guarantee that it will converge to the lowest energy m<strong>in</strong>imum: Thesolution will likely be a good one, but not necessarily the best. Because the PEs<strong>in</strong> the memory operate <strong>in</strong> parallel (<strong>in</strong> the actual circuit), computation time ism<strong>in</strong>imized. In fact, add<strong>in</strong>g another city would not significantly affect the timerequired to determ<strong>in</strong>e a solution, assum<strong>in</strong>g the existence of the actual circuitswith parallel PEs (the simulation of this system on a digital computer may stillrequire a considerable amount of time).To use the Hopfield memory for this application, we must f<strong>in</strong>d a way tomap the problem onto the network architecture. This task is not a simple one.For example, there is no longer a set of well-def<strong>in</strong>ed vectors whose correlationmatrix determ<strong>in</strong>es the connection weights for the network.The first item is to develop a representation of the problem solutions thatfits an architecture hav<strong>in</strong>g a s<strong>in</strong>gle array of PEs. 
We develop it by allow<strong>in</strong>ga set of n PEs (for an n-city tour) to represent the n possible positions for agiven city <strong>in</strong> the sequence of the tour. For a five-city tour, the outputs of fivePEs could convey this <strong>in</strong>formation. For example, the five elements: 00010would <strong>in</strong>dicate that the city <strong>in</strong> question was the fourth to be visited on the tour


150 The BAM <strong>and</strong> the Hopfield Memory(we assume the high-ga<strong>in</strong> limit for this discussion). There would be five suchgroups, each hav<strong>in</strong>g <strong>in</strong>formation about the position of one city. Figure 4.10illustrates this representation.We shall follow the matrix format of Figure 4.10(b) henceforth, so theoutputs will be labeled v X i, where the "X" subscript refers to the city, <strong>and</strong> the"i" subscript refers to a position on the tour. To account for the constra<strong>in</strong>t thatthe tour beg<strong>in</strong> <strong>and</strong> end at the same city, we must def<strong>in</strong>e vx( n +\) — vx\ <strong>and</strong>v xo = vxn- The need for these def<strong>in</strong>itions will become clear shortly.To def<strong>in</strong>e the connection weight matrix, we shall beg<strong>in</strong> with the energyfunction. Once it is def<strong>in</strong>ed, a suitable connection-weight matrix can be deduced.An energy function must be constructed that satisfies the follow<strong>in</strong>g criteria:1. Energy m<strong>in</strong>ima must favor states that have each city only once on the tour.B01000 10000 00010 00001 00100(a)1 234501000 A10000 B00010 C00001 D00100 E(b)Figure 4.10(a) In this representation scheme for the output vectors <strong>in</strong> afive-city TSP problem, five units are associated with each ofthe five cities. The cities are labeled A through E. The positionof the 1 with<strong>in</strong> any group of five represents the locationof that particular city <strong>in</strong> the sequence of the tour. For thisexample, the sequence is B-A-E-C-D, with the return to Bassumed. Notice that N = n 2 PEs are required to representthe <strong>in</strong>formation for an n-city tour, (b) This matrix provides analternative way of look<strong>in</strong>g at the units. The PEs are arranged<strong>in</strong> a two-dimensional matrix configuration, with each rowrepresent<strong>in</strong>g a city <strong>and</strong> each column represent<strong>in</strong>g a positionon the sequence of the tour.


2. Energy minima must favor states that have each position on the tour only once. For example, two different cities cannot be the second city on the tour.

3. Energy minima must favor states that include all n cities.

4. Energy minima must favor states with the shortest total distances.

The energy function that satisfies these conditions is not easy to construct. We shall state the entire function, then analyze each term to show its contribution. The energy equation is

    E = (A/2) Σ_X Σ_i Σ_{j≠i} v_Xi v_Xj
      + (B/2) Σ_i Σ_X Σ_{Y≠X} v_Xi v_Yi
      + (C/2) ( Σ_X Σ_i v_Xi - n )^2
      + (D/2) Σ_X Σ_{Y≠X} Σ_i d_XY v_Xi (v_{Y,i+1} + v_{Y,i-1})        (4.29)

In the last term of this equation, if i = 1, then the quantity v_{Y0} appears. If i = n, then the quantity v_{Y,n+1} appears. These results establish the need for the definitions, v_{X(n+1)} = v_{X1} and v_{X0} = v_{Xn}, discussed previously.

Before we dissect this rather formidable energy equation, first note that d_XY represents the distance between city X and city Y, and that d_XY = d_YX. Furthermore, if the parameters, A, B, C, and D, are positive, then E is nonnegative.⁶

⁶ The parameters, A, B, C, and D, in Eq. (4.29) have nothing to do with the cities labeled A, B, C, and D.

Let's take the first term in Eq. (4.29) and consider the five-city tour proposed in Figure 4.10. The product terms, v_Xi v_Xj, refer to a single city, X. The inner two sums result in a sum of products, v_Xi v_Xj, for all combinations of i and j as long as i ≠ j. After the network has stabilized on a solution, and in the high-gain limit where v ∈ {0, 1}, all such terms will be zero if and only if there is a single v_Xi = 1 and all other v_Xi = 0. That is, the contribution of the first term in Eq. (4.29) will be zero if and only if a single city appears in each row of the PE matrix. This condition corresponds to the constraint that each city appear only once on the tour. Any other situation results in a positive value for this term in the energy equation.

Using a similar analysis, we can show that the second term in Eq. (4.29) will be zero if and only if each column of the PE matrix contains a single value of 1. This condition corresponds to the second constraint on the problem: that each position on the tour has a unique city associated with it.

In the third term, Σ_X Σ_i v_Xi is a simple summation of all n^2 output values. There should be only n of these terms that have a value of 1; all others should be zero.


Then the third term will be zero, since Σ_X Σ_i v_Xi = n. If more or less than n terms are equal to 1, then the sum will be greater or less than n, and the contribution of the third term in Eq. (4.29) will be greater than zero.

The final term in Eq. (4.29) computes a value proportional to the distance traveled on the tour. Thus, a minimum-distance tour results in a minimum contribution of this term to the energy function.

Exercise 4.10: For the five-city example in Figure 4.10(b), show that the last term in Eq. (4.29) is

    D (d_BA + d_AE + d_EC + d_CD + d_DB)

The result of this analysis is that the energy function of Eq. (4.29) will be minimized only when a solution satisfies all four of the constraints (three strong, one weak) listed previously. Now we wish to construct a weight matrix that corresponds to this energy function so that the network will compute solutions properly.

As is often the case, it is easier to specify what the network should not do than to say what it should do. Therefore, the connection weight matrix is defined solely in terms of inhibitions between PEs. Instead of the double index that has been used up to now to describe the weight matrix, we adopt a four-index scheme that corresponds to the double-index scheme on the output vectors. The connection-matrix elements are the n^2 by n^2 quantities, T_{Xi,Yj}, where X and Y refer to the cities, and i and j refer to the positions on the tour.

The first term of the energy function is zero if and only if a single element in each row of the output-unit matrix is 1. This situation is favored if, when one unit in a row is on (i.e., it has a large output value relative to the others), then it inhibits the other units in the same row. This situation is essentially a winner-take-all competition, where the rich get richer at the expense of the poor. Consider the quantity

    -A δ_XY (1 - δ_ij)

where δ_uv = 1 if u = v, and δ_uv = 0 if u ≠ v. The first delta is zero except on a single row where X = Y. The quantity in parentheses is 1 unless i = j. That factor ensures that a unit inhibits all other units on its row but does not inhibit itself. Therefore, if this quantity represented a connection weight between units, all units on a particular row would have an inhibitory connection of strength -A to all other units on the same row. As the network evolved toward a solution, if one unit on a row began to show a larger output value than the others, it would tend to inhibit the other units on the same row. This situation is called lateral inhibition.

The second term in the energy function is zero if and only if a single unit in each column of the output-unit matrix is 1. The quantity


    -B δ_ij (1 - δ_XY)

causes a lateral inhibition between units on each column. The first delta ensures that this inhibition is confined to each column, where i = j. The second delta ensures that each unit does not inhibit itself.

The contribution of the third term in the energy equation is perhaps not so intuitive as the first two. Because it involves a sum of all of the outputs, it has a rather global character, unlike the first two terms, which were localized to rows and columns. Thus, we include a global inhibition, -C, such that each unit in the network is inhibited by this constant amount.

Finally, recall that the last term in the energy function contains information about the distance traveled on the tour. The desire to minimize this term can be translated into connections between units that inhibit the selection of adjacent cities in proportion to the distance between those cities. Consider the term

    -D d_XY (δ_{j,i+1} + δ_{j,i-1})

For a given column, j (i.e., for a given position on the tour), the two delta terms ensure that inhibitory connections are made only to units on adjacent columns. Units on adjacent columns represent cities that might come either before or after the cities on column j. The factor -D d_XY ensures that the units representing cities farther apart will receive the largest inhibitory signal.

We can now define the entire connection matrix by adding the contributions of the previous four paragraphs:

    T_{Xi,Yj} = -A δ_XY (1 - δ_ij) - B δ_ij (1 - δ_XY) - C - D d_XY (δ_{j,i+1} + δ_{j,i-1})        (4.30)

The inhibitory connections between units are illustrated graphically in Figure 4.11.

To find a solution to the TSP, we must return to the equations that describe the time evolution of the network. Equation (4.24) is the one we want:

    C (du_i/dt) = Σ_{j=1}^{N} T_ij v_j - u_i/R_i + I_i

Here, we have used N as the summation limit to avoid confusion with the n previously defined. Because all of the terms in T_ij contain arbitrary constants, and I_i can be adjusted to any desired values, we can divide this equation by C and write

    du_i/dt = Σ_{j=1}^{N} T_ij v_j - u_i/τ + I_i

where τ = RC, the system time constant, and we have assumed that R_i = R for all i.
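For a digital simulation it is often convenient to build the n^2-by-n^2 matrix of Eq. (4.30) explicitly. The Python sketch below is our own illustration: it uses 0-based indices, storing unit (X, i) at row X*n + i, and it wraps the neighbor deltas around the ends of the tour, consistent with the definitions v_{X,n+1} = v_{X,1} and v_{X,0} = v_{X,n}.

    import numpy as np

    def tsp_connection_matrix(d, A, B, C, D):
        # Eq. (4.30); d is the n-by-n intercity distance matrix (zero diagonal assumed).
        n = d.shape[0]
        T = np.zeros((n * n, n * n))
        for X in range(n):
            for i in range(n):
                for Y in range(n):
                    for j in range(n):
                        same_city = 1.0 if X == Y else 0.0
                        same_pos = 1.0 if i == j else 0.0
                        # delta_{j,i+1} + delta_{j,i-1}, with wraparound on the tour.
                        neighbors = (1.0 if j == (i + 1) % n else 0.0) + \
                                    (1.0 if j == (i - 1) % n else 0.0)
                        T[X * n + i, Y * n + j] = (-A * same_city * (1.0 - same_pos)
                                                   - B * same_pos * (1.0 - same_city)
                                                   - C
                                                   - D * d[X, Y] * neighbors)
        return T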


Figure 4.11  This schematic illustrates the pattern of inhibitory connections between PEs for the TSP problem: Unit a illustrates the inhibition between units on a single row, unit b shows the inhibition within a single column, and unit c shows the inhibition of units in adjacent columns. The global inhibition is not shown.

A digital simulation of this system requires that we integrate the above set of equations numerically. For a sufficiently small value of Δt, we can write

    Δu_i = Δt [ Σ_{j=1}^{N} T_ij v_j - u_i/τ + I_i ]        (4.31)

Then, we can iteratively update the u_i values according to

    u_i(t + 1) = u_i(t) + Δu_i        (4.32)


where $\Delta u_i$ is given by Eq. (4.31). The final output values are then calculated using the output function

$$v_i = g_i(u_i) = \frac{1}{2}\left(1 + \tanh(\lambda u_i)\right)$$

Notice that, in these equations, we have returned to the subscript notation used in the discussion of the general system: $v_i$ rather than $v_{Yj}$. In the double-subscript notation, we have

$$u_{Xi}(t+1) = u_{Xi}(t) + \Delta u_{Xi} \qquad (4.33)$$

and

$$v_{Xi} = g_{Xi}(u_{Xi}) = \frac{1}{2}\left(1 + \tanh(\lambda u_{Xi})\right) \qquad (4.34)$$

If we substitute $T_{Xi,Yj}$ from Eq. (4.30) into Eq. (4.31), and define the external inputs as $I_{Xi} = Cn'$, with $n'$ a constant, and C equal to the C in Eq. (4.30), the result takes on an interesting form (see Exercise 4.11):

$$\Delta u_{Xi} = \Delta t\left[-\frac{u_{Xi}}{\tau} - A\sum_{j\neq i} v_{Xj} - B\sum_{Y\neq X} v_{Yi} - C\left(\sum_{Y}\sum_{j} v_{Yj} - n'\right) - D\sum_{Y} d_{XY}\left(v_{Y,i+1} + v_{Y,i-1}\right)\right] \qquad (4.35)$$

Exercise 4.11: Assume that $n' = n$ in Eq. (4.35). Then, the sum of terms, $-A(\ldots) - B(\ldots) - C(\ldots) - D(\ldots)$, has a simple relationship to the TSP energy function in Eq. (4.29). What is that relationship?

Exercise 4.12: Using the double-subscript notation on the outputs of the PEs, $v_{C3}$ refers to the output of the unit that represents city C in position 3 of the tour. This unit is also element $v_{33}$ of the output-unit matrix. What is the general equation that converts the dual subscripts of the matrix notation, $v_{jk}$, into the proper single subscript of the vector notation, $v_i$?

Exercise 4.13: There are 25 possible connections to unit $v_{C3} = v_{33}$ from other units in the five-city tour problem. Determine the values of the resistors, $R_{ij} = 1/|T_{ij}|$, that form those connections.

To complete the solution of the TSP, suitable values for the constants must be chosen, along with the initial values of the $u_{Xi}$. Hopfield [6] provides parameters suitable for a 10-city problem: A = B = 500, C = 200, D = 500, $\tau = 1$, $\lambda = 50$, and $n' = 15$. Notice that it is not necessary to choose $n' = n$. Because $n'$ enters the equations through the external inputs, $I_i = Cn'$, it can be used as another adjustable parameter. These parameters must be empirically chosen, and those for a 10-city tour will not necessarily work for tours of different sizes.
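The update loop implied by Eqs. (4.31) through (4.35) can be sketched in a few lines of array code. The sketch below uses the 10-city parameters just quoted; the time step, iteration count, noise amplitude, and the wraparound treatment of adjacent positions are illustrative assumptions rather than values from the text, and the small random perturbation of the initial u values anticipates the discussion that follows.

import numpy as np

def solve_tsp(d, A=500.0, B=500.0, C=200.0, D=500.0,
              tau=1.0, lam=50.0, n_prime=15.0,
              dt=1e-5, steps=2000, noise=0.02, seed=0):
    """Iterate Eqs. (4.31)-(4.35); v[X, i] is the output for city X in tour position i."""
    n = d.shape[0]
    rng = np.random.default_rng(seed)
    v00 = 1.0 / n                                   # chosen so the initial outputs sum to n
    u00 = np.arctanh(2.0 * v00 - 1.0) / lam
    u = u00 + noise * (rng.random((n, n)) - 0.5)    # random nudge off the unstable equilibrium
    for _ in range(steps):
        v = 0.5 * (1.0 + np.tanh(lam * u))          # Eq. (4.34)
        du = -u / tau
        du -= A * (v.sum(axis=1, keepdims=True) - v)          # row (one position per city) term
        du -= B * (v.sum(axis=0, keepdims=True) - v)          # column (one city per position) term
        du -= C * (v.sum() - n_prime)                          # global n-units-on term
        du -= D * (d @ (np.roll(v, -1, axis=1) + np.roll(v, 1, axis=1)))  # tour-length term
        u = u + dt * du                              # Eqs. (4.31) and (4.33)
    return 0.5 * (1.0 + np.tanh(lam * u))

Convergence to a valid tour is not guaranteed on every trial; as the text notes, different random initializations can settle into different final states.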


We might be tempted to make all of the initial values of the $u_{Xi}$ equal to a constant $u_{00}$ such that, at t = 0,

$$\sum_{X}\sum_{i} v_{Xi} = n$$

because that is what we expect that particular sum to be when the network has stabilized on a solution. Assigning initial values in that manner, however, has the effect of placing the system on an unstable equilibrium point, much like a ball placed at the exact top of a hill. Without at least a slight nudge, the ball would remain there forever. Given that nudge, however, the ball would roll down the hill. We can give our TSP system a nudge by adding a random noise term to the $u_{00}$ values, so that $u_{Xi} = u_{00} + \delta u_{Xi}$, where $\delta u_{Xi}$ is the random noise term, which may be different for each unit.

In the ball-on-the-hill analogy, the direction of the nudge determines the direction in which the ball rolls off the hill. Likewise, different random-noise selections for the initial $u_{Xi}$ values may result in different final stable states. Refer back to the discussion of optimization problems earlier in this section, where we said that a good solution now may be better than the best solution later. Hopfield's solution to the TSP may not always find the best solution (the one with the shortest distance possible), but repeated trials have shown that the network generally settles on tours at or near the minimum distance. Figure 4.12 shows a graphical representation of how a network would evolve toward a solution.

We have discussed this example at great length to show both the power and the complexity of the Hopfield network. The example also illustrates a general principle about neural networks: For a given problem, finding an appropriate representation of the data or constraints is often the most difficult part of the solution.

4.4 SIMULATING THE BAM

As you may already suspect, the implementation of the BAM network simulator will be straightforward. The only difficulty is the implementation of bidirectional connections between the layers, and, with a little finesse, this is a relatively easy problem to overcome. We shall begin by describing the general nature of the problems associated with modeling bidirectional connections in a sequential memory array. From there, we will present the data structures needed to overcome these problems while remaining compatible with our basic simulator. We conclude this section with a presentation of the algorithms needed to implement the BAM.

4.4.1 Bidirectional-Connection Considerations

Let us first consider the basic data structures we have defined for our simulator. We have assumed that all network PEs will be organized into layers, with connections primarily between the layers. Further, we have decided that the


Figure 4.12 This sequence of diagrams illustrates the convergence of the Hopfield network for a 10-city TSP tour. The output values, $v_{Xi}$, are represented as squares at each location in the output-unit matrix (cities A through J down the rows, tour positions 1 through 10 across the columns). The size of the square is proportional to the magnitude of the output value. (a, b, c) At the intermediate steps, the system has not yet settled on a valid tour. The magnitude of the output values for these intermediate steps can be thought of as the current estimate of the confidence that a particular city will end up in a particular position on the tour. (d) The network has stabilized on the valid tour DHIFGEAJCB. Source: Reprinted with permission of Springer-Verlag, Heidelberg, from J. J. Hopfield and D. W. Tank, "Neural computation of decisions in optimization problems." Biological Cybernetics, 52:141-152, 1985.


individual PEs within any layer will be simulated by processing inputs, with no provision for processing output connections. With respect to modeling bidirectional connections, we are faced with the dilemma of using a single connection as input to two different PEs. Thus, our parallel array structures for modeling network connections are no longer valid.

As an example, consider the weight matrix illustrated on page 136 as part of the discussion in Section 4.2. For clarity, we will consider this matrix as being an R x C array, where

R = rows = 6

and

C = columns = 10.

Next, consider the implementation of this matrix in computer memory, as depicted in Figure 4.13. Since memory is organized as a one-dimensional linear array of cells (or bytes, words, etc.), most modern computer languages will allocate and maintain this matrix as a one-dimensional array of R vectors, each C cells long, arranged sequentially in the computer memory.⁷ In this implementation, access to each row vector requires at least one multiplication (row index x number of columns per row) and an addition (to determine the memory address of the row, offset from the base address of the array). However, once the beginning of the row has been located, access to the individual components within the vector is simply an increment operation.

In the column-vector case, access to the data is not quite as easy. Simply put, each component of the column vector must be accessed by performance of a multiplication (as before, to access the appropriate row), plus an addition to locate the appropriate cell. The penalty imposed by this approach is such that, for the entire column vector to be accessed, R multiplications must be performed. To access each element in the matrix as a component of a column vector, we must do R x C multiplications, or one for each element, a time-consuming process.

4.4.2 BAM Simulator Data Structures

Since we have chosen to use the array-based model for our basic network data structure, we are faced with the complicated (and CPU-time-consuming) problem of accessing the network weight matrix first as a set of row vectors for the propagation from layer x to layer y, then accessing weights as a set of column vectors for the propagation in the other direction. Further complicating the situation is the fact that we have chosen to isolate the weight vectors in our network data structure, accessing each array indirectly through the intermediate

⁷ FORTRAN, which uses a column-major array organization, is the notable exception.


weight_ptr array. If we hold strictly to this scheme, we must significantly modify the design of our simulator to allow access to the connections from both layers of PEs, a situation illustrated in Figure 4.14. As shown in this diagram, all the connection weights will be contained in a set of arrays associated with one layer of PEs. The connections back to the other layer must then be individually accessed by indexing into each array to extract the appropriate element.

Figure 4.13 The row-major structure used to implement a matrix is shown. In this technique, memory is allocated sequentially so that column values within the same row are adjacent; in the 6 x 10 example, row r begins at address BA + r(10), where BA is the base address of the array. This structure allows the computer to step through all values in a single row by simply incrementing a memory pointer.
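The address arithmetic behind the row-major layout of Figure 4.13 can be made concrete with a few lines of code. This sketch is not part of the simulator; it simply stores the 6 x 10 example in a flat list and contrasts the cost of row access with that of column access.

R, C = 6, 10                       # rows and columns of the example matrix
flat = list(range(R * C))          # row-major storage: element (r, c) lives at offset r*C + c

def read_row(r):
    base = r * C                   # one multiplication locates the row...
    return [flat[base + c] for c in range(C)]    # ...then each column is just an increment

def read_column(c):
    # every element of a column needs its own multiplication (plus an addition)
    return [flat[r * C + c] for r in range(R)]

print(read_row(2)[:3], read_column(3)[:3])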


Figure 4.14 This bidirectional connection implementation uses our standard data structures. Here, the connection arrays located by the layer y structure are identical to those previously described for the backpropagation simulator. However, the pointers associated with the layer x structure locate the connection in the first weights array that is associated with the column weight vector. Hence, stepping through connections to layer x requires locating the connection in each weights array at the same offset from the beginning of the array as the first connection.

To solve this dilemma, let's now consider a slight modification to the conceptual model of the BAM. Until now, we have considered the connections between the layers as one set of bidirectional paths; that is, signals can pass from layer x to layer y as well as from layer y to layer x. If we instead consider the connections as two sets of unidirectional paths, we can logically implement the same network if we simply connect the outputs of the x layer to the inputs on the y layer, and, similarly, connect the outputs of the y layer to the inputs on the x layer. To complete this model, we must initialize the connections from x to y with the predetermined weight matrix, while the connections from y to x must contain the transpose of the weight matrix. This strategy allows us to process only inputs at each PE, and, since the connections are always accessed in the desired row-major form, allows efficient signal propagation through the simulator, regardless of direction.
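A compact way to see the double-matrix idea is to hold the forward weights and their transpose side by side, so that propagation in either direction becomes a row-oriented pass. The sketch below illustrates the strategy only; it is not the book's simulator code, the 6 x 10 dimensions simply echo the earlier example, and the random weights and patterns are placeholders.

import numpy as np

rng = np.random.default_rng(1)
W_xy = rng.integers(-3, 4, size=(6, 10))   # weights for x -> y propagation (6 y-units, 10 x-units)
W_yx = W_xy.T.copy()                       # transpose, stored separately for y -> x propagation

x = rng.choice([-1, 1], size=10)
y = np.sign(W_xy @ x)                      # forward pass: each y-unit reads one row of W_xy
x_back = np.sign(W_yx @ y)                 # reverse pass: each x-unit reads one row of W_yx
# (a zero net input would keep the old output value in the real BAM; omitted here for brevity)
print(y, x_back)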


The disadvantage to this approach is that it consumes twice as much memory as does the single-matrix implementation. There is not much that we can do to solve this problem other than reverting to the single-matrix model. Even a linked-list implementation will not solve the problem, as it will require approximately three times the memory of the single-matrix model. Thus, in terms of memory consumption, the single-matrix model is the most efficient implementation. However, as we have already seen, there are performance issues that must be considered when we use the single matrix. We therefore choose to implement the double matrix, because run-time performance, especially in a large network application, must be good enough to prevent long periods of dead time while the human operator waits for the computer to arrive at a solution.

The remainder of the network is completely compatible with our generic network data structures. For the BAM, we begin by defining a network with two layers:

record BAM =
  X : ^layer;       {pointer to first layer record}
  Y : ^layer;       {pointer to second layer record}
end record;

As before, we now consider the implementation of the layers themselves. In the case of the BAM, a layer structure is simply a record used to contain pointers to the outputs and weight_ptr arrays. Such a record is defined by the structure

record LAYER =
  OUTS    : ^integer[];    {pointer to node outputs array}
  WEIGHTS : ^^integer[];   {pointer to weight_ptr array}
end record;

Notice that we have specified integer values for the outputs and weights in the network. This is a benefit derived from the binary nature of the network, and from the fact that the individual connection weights are given by the dot product between two integer vectors, resulting in an integer value. We use integers in this model, since most computers can process integer values much faster than they can floating-point values. Hence, the performance improvement of the simulator for large BAM applications justifies the use of integers.

We now define the three arrays needed to store the node outputs, the connection weights, and the intermediate weight_ptr. These arrays will be sized dynamically to conform to the desired BAM network structure. In the case of the outputs arrays, one will contain x integer values, whereas the other must be sized to contain y integers.
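For readers following along in a higher-level language, the records above map naturally onto something like the following Python sketch; the class and field names simply mirror the pseudocode and are my own choices, not part of the text's simulator.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Layer:
    outs: List[int] = field(default_factory=list)           # node outputs array
    weights: List[List[int]] = field(default_factory=list)  # one weight array per unit (weight_ptr)

@dataclass
class BAM:
    x: Layer    # first layer record
    y: Layer    # second layer record

# a BAM with a 10-element x layer and a 6-element y layer
net = BAM(x=Layer(outs=[0] * 10, weights=[[0] * 6 for _ in range(10)]),
          y=Layer(outs=[0] * 6, weights=[[0] * 10 for _ in range(6)]))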
The weight_ptr array will contain a memory pointer for each PE on the layer; that is, x pointers will be required to locate the connection arrays for each node on the x layer, and y pointers for the connections to the y layer. Conversely, each of the weights arrays must be sized to accommodate an integer value for each connection to the layer from the input layer. Thus, each


weights array on the x layer will contain y values, whereas the weights arrays on the y layer will each contain x values. The complete BAM data structure is illustrated in Figure 4.15.

Figure 4.15 The data structures for the BAM simulator are shown. Notice the difference in the implementation of the connection arrays in this model and in the single-matrix model described earlier.

4.4.3 BAM Initialization Algorithms

As we have noted earlier, the BAM is different from most of the other ANS networks discussed in this text, in that it is not trained; rather, it is initialized. Specifically, it is initialized from the set of training vectors that it will be required to recall. To develop this algorithm, we use the formula used previously to generate the weight matrix for the BAM, given by Eq. (4.6), and repeated here


for reference:

$$\mathbf{w} = \sum_{i=1}^{L} \mathbf{y}_i\mathbf{x}_i^t$$

We can translate this general formula into one that can be used to determine any specific connection weight, given L training pairs to be encoded in the BAM. This new equation, which will form the basis of the routine that we will use to initialize the connection weights in the BAM simulation, is given by

$$w_{rc} = \sum_{i=1}^{L} \mathbf{y}_i[r]\,\mathbf{x}_i[c] \qquad (4.36)$$

where the variables r and c denote the row and column position of the weight value of interest. We assume that, for purposes of computer simulation, each of the training vectors x and y are one-dimensional arrays of length C and R, respectively. We also presume that the calculation will be performed only to determine the weights for the connections from layer x to layer y. Once the values for these connections are determined, the connections from y to x are simply the transpose of this weight matrix.

Using this equation, we can now write a routine to determine any weight value for the BAM. The following algorithm presumes that all the training pairs to be encoded are contained in two external, two-dimensional matrices named XT and YT. These arrays will contain the patterns to be encoded in the BAM, organized as L instances of either x or y vectors. Thus, the dimensions of the XT and YT initialization matrices are L x C and L x R, respectively.

function weight (r, c, L : integer; XT, YT : ^integer[][]) return integer;
{determine and return weight value for position r,c}

var i    : integer;         {loop iteration counter}
    x, y : ^integer[][];    {local array pointers}
    sum  : integer;         {local accumulator}

begin
  sum = 0;         {initialize accumulator}
  x = XT;          {initialize x pointer}
  y = YT;          {initialize y pointer}

  for i = 1 to L do                    {for all training pairs}
    sum = sum + y[i][r] * x[i][c];
  end do;

  return (sum);    {return the result}
end function;

The weight function allows us to compute the value to be associated with any particular connection. We will now extend that basic function into a general routine to initialize all the weights arrays for all the input connections


to the PEs in layer y. This algorithm uses two functions, called rows_in and cols_in, that return the number of rows and columns in a given matrix. The implementation of these two algorithms is left to the reader as an exercise.

procedure initialize (Y : ^layer; XT, YT : ^integer[][]);
{initialize all input connections to a layer, Y}

var connects : ^integer[];    {connection pointer}
    units    : ^^integer[];   {pointers into weight_ptrs}
    L, R, C  : integer;       {size-of variables}
    i, j     : integer;       {iteration counters}

begin
  units = Y^.WEIGHTS;      {locate weight_ptr array}
  L = rows_in (XT);        {number of training patterns}
  R = cols_in (YT);        {dimension of Y vector}
  C = cols_in (XT);        {dimension of X vector}

  for i = 1 to length(units) do          {for all units on layer}
    connects = units[i];                 {get pointer to weight array}

    for j = 1 to length(connects) do     {for all connections to unit}
      connects[j] = weight (i, j, L, XT, YT);   {initialize weight}
    end do;
  end do;
end procedure;

We indicated earlier that the connections from layer y to layer x could be initialized by use of the transpose of the weight matrix computed for the inputs to layer y. We could develop another routine to copy data from one set of arrays to another, but inspection of the initialize algorithm just described indicates that another algorithm will not be needed. By substituting the x layer record for the y, and swapping the order of the two input training arrays, we can use initialize to initialize the input connections to the x layer as well, and we will have reduced the amount of code needed to implement the simulator. On the other hand, the transpose operation is a relatively easy algorithm to write, and, since it involves only copying data from one array to another, it is also extremely fast. We therefore leave to you the choice of which of these two approaches to use to complete the BAM initialization.

4.4.4 BAM Signal Propagation

Now that we have created and initialized the BAM, we need only to implement an algorithm to perform the signal propagation through the network. Here again, we would like this routine to be general enough to propagate signals to either layer of the BAM. We will therefore design the routine such that the direction


of signal propagation will be determined by the order of the input arguments to the routine. For simplicity, we also assume that the layer outputs arrays have been initialized to contain the patterns to be propagated.

Before we proceed, however, note that the desired algorithm generality implies that this routine will not be sufficient to implement completely the iterated signal-propagation function needed to allow the BAM to stabilize. This iteration must be performed by a higher-level routine. We will therefore design the unidirectional BAM propagation routine as a function that returns the number of patterns changed in the receiving layer, so that the iterated propagation routine can easily determine when to stop.

With these concerns in mind, we can now design the unidirectional signal-propagation routine. Such a routine will take this general form:

function propagate (X, Y : ^layer) return integer;
{propagate signals from layer X to layer Y}

var changes, i, j : integer;     {local counters}
    ins, outs : ^integer[];      {local pointers}
    connects  : ^integer[];      {locate connections}
    sum       : integer;         {sum of products}

begin
  outs = Y^.OUTS;      {locate start of Y array}
  changes = 0;         {initialize counter}

  for i = 1 to length (outs) do    {for all output units}
    ins = X^.OUTS;                 {locate X outputs}
    connects = Y^.WEIGHTS[i];      {find connections}
    sum = 0;                       {initial sum}

    for j = 1 to length(ins) do    {for all inputs}
      sum = sum + ins[j] * connects[j];
    end do;

    if (sum < 0)                   {if negative sum}
      then sum = -1                {use -1 as output}
    else if (sum > 0)              {if positive sum}
      then sum = 1                 {use 1 as output}
    else sum = outs[i];            {else use old output}

    if (sum != outs[i])            {if unit changed}
      then changes = changes + 1;  {number of changes}

    outs[i] = sum;                 {store new output}
  end do;

  return (changes);
end function;
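With array arithmetic, the initialization and propagation routines above collapse to a few lines. The following sketch is an alternative rendering in Python rather than the pseudocode itself; it assumes the training pairs are stored as bipolar (plus or minus 1) row vectors in matrices XT (L x C) and YT (L x R), as described, and the tiny example patterns are placeholders.

import numpy as np

def initialize(XT, YT):
    """Return (W_xy, W_yx), where W_xy[r, c] = sum_i YT[i, r] * XT[i, c], per Eq. (4.36)."""
    W_xy = YT.T @ XT            # R x C weight matrix for x -> y propagation
    return W_xy, W_xy.T         # the y -> x weights are the transpose

def propagate(ins, outs, W):
    """One unidirectional pass; returns (new outputs, number of units that changed)."""
    net = W @ ins
    new = np.where(net > 0, 1, np.where(net < 0, -1, outs))   # zero net input keeps old output
    return new, int(np.sum(new != outs))

# tiny example: two training pairs
XT = np.array([[1, -1, 1, -1], [1, 1, -1, -1]])
YT = np.array([[1, -1, 1], [-1, 1, 1]])
W_xy, W_yx = initialize(XT, YT)

x = np.array([1, -1, 1, 1])     # a noisy version of the first x pattern
y = np.zeros(3, dtype=int)
for _ in range(10):             # iterate until no unit changes on a full pass
    y, c1 = propagate(x, y, W_xy)
    x, c2 = propagate(y, x, W_yx)
    if c1 + c2 == 0:
        break
print(x, y)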


To complete the BAM simulator, we will need a top-level routine to perform the bidirectional signal propagation. We will use the propagate routine described previously to perform the signal propagation between layers, and we will iterate until no units change state on two successive passes, as that will indicate that the BAM has stabilized. Here again, we assume that the input vectors have been initialized by an external process prior to calling recall.

procedure recall (net : BAM);
{propagate signals in the BAM until stabilized}

var delta : integer;        {how many units change}

begin
  delta = 100;              {arbitrary non-zero value}

  while (delta != 0)        {until two successive passes}
  do
    delta = 0;              {reset to zero}
    delta = delta + propagate (net^.X, net^.Y);
    delta = delta + propagate (net^.Y, net^.X);
  end do;
end procedure;

Programming Exercises

4.1. Define the pseudocode algorithms for the functions rows_in and cols_in as described in the text.

4.2. Implement the BAM simulator described in Section 4.4, adding a routine to initialize the input vectors from patterns read from a data file. Test the BAM with the two training vectors described in Exercise 4.4 in Section 4.2.3.

4.3. Modify the BAM simulator so that the initial direction of signal propagation can be specified by the user at run time. Repeat Exercise 4.2, starting signal propagation first from x to y, then from y to x. Describe the results for each case.

4.4. Develop an encoding scheme to represent the following training pairs for a BAM application. Initialize your simulator with the training data, and then apply a "noisy" input pattern to the input. (Hint: One way to do this exercise is to encode each character as a seven-bit ASCII code, letting -1 represent a logic 0 and +1 represent a logic 1.) Does your BAM return the correct results?

    x      y
    CAT    TABBY
    DOG    ROVER
    FLY    PESKY
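As a starting point for Exercise 4.4, the hinted seven-bit ASCII encoding can be produced as shown below. The helper name and the most-significant-bit-first ordering are my own choices, not part of the exercise.

def encode(word):
    """Encode a word as a bipolar vector: 7 bits per character, 0 -> -1, 1 -> +1."""
    bits = []
    for ch in word:
        for k in range(6, -1, -1):           # seven-bit ASCII, most significant bit first
            bits.append(1 if (ord(ch) >> k) & 1 else -1)
    return bits

x_patterns = [encode(w) for w in ("CAT", "DOG", "FLY")]        # 21 components each
y_patterns = [encode(w) for w in ("TABBY", "ROVER", "PESKY")]  # 35 components each
print(len(x_patterns[0]), len(y_patterns[0]))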


Suggested Readings

Introductory articles on the BAM, by Bart Kosko, appear in the IEEE ICNN proceedings and Byte magazine [9, 10]. Two of Kosko's papers discuss how to make the BAM weights adaptive [8, 11].

The Scientific American article by Tank and Hopfield provides a good introduction to the Hopfield network as we have discussed the latter in this chapter [13]. It is also worthwhile to review some of the earlier papers that discuss the development of the network and the use of the network for optimization problems such as the TSP [4, 5, 6, 7].

The issue of the information storage capacity of associative memories is treated in detail in the paper by Kuh and Dickinson [12].

The paper by Tagliarini and Page, on solving constraint-satisfaction problems with neural networks, is a good complement to the discussion of the TSP in this chapter [14].

Bibliography

[1] Edward J. Beltrami. Mathematics for Dynamic Modeling. Academic Press, Orlando, FL, 1987.

[2] M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman, New York, 1979.

[3] Morris W. Hirsch and Stephen Smale. Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, New York, 1974.

[4] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79:2554-2558, April 1982.

[5] John J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81:3088-3092, May 1984.

[6] John J. Hopfield and David W. Tank. "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52:141-152, 1985.

[7] John J. Hopfield and David W. Tank. Computing with neural circuits: A model. Science, 233:625-633, August 1986.

[8] Bart Kosko. Adaptive bidirectional associative memories. Applied Optics, 26(23):4947-4960, December 1987.

[9] Bart Kosko. Competitive bidirectional associative memories. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, II:759-766, June 1987.

[10] Bart Kosko. Constructing an associative memory. Byte, 12(10):137-144, September 1987.


[11] Bart Kosko. Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics, 18(1):49-60, January-February 1988.

[12] Anthony Kuh and Bradley W. Dickinson. Information capacity of associative memories. IEEE Transactions on Information Theory, 35(1):59-68, January 1989.

[13] David W. Tank and John J. Hopfield. Collective computation in neuronlike circuits. Scientific American, 257(6):104-114, December 1987.

[14] Gene A. Tagliarini and Edward W. Page. Solving constraint satisfaction problems with neural networks. In Proceedings of the First IEEE Conference on Neural Networks, San Diego, CA, III:741-747, June 1987.


C H A P T E R  5

Simulated Annealing

The neural networks discussed in Chapters 2, 3, and 4 relied on the minimization of some function during either the learning process (Adaline and backpropagation) or the recall process (BAM and Hopfield network). The technique employed to perform this minimization is essentially the opposite of a standard heuristic used to find the maximum of a function. That technique is known as hill climbing.

The term hill climbing derives from a simple analogy. Imagine that you are standing at some unknown location in an area of hilly terrain, with the goal of walking up to the highest peak. The problem is that it is foggy and you cannot see more than a few feet in any direction. Barring obvious solutions, such as waiting for the fog to lift, a logical way to proceed would be to begin walking in the steepest upward direction possible. If you walk only upward at each step, you will eventually reach a spot where the only possible way to go is down. At this point, you are at the top of a hill. The question that remains is whether this hill is indeed the highest hill possible. Unfortunately, without further extensive exploration, that question cannot be answered.

The methods we have used to minimize energy or error functions in previous chapters often suffer from a similar problem: If only downward steps are allowed, when the minimum is reached it may not be the lowest minimum possible. The lowest minimum is referred to as the global minimum, and any other minimum that exists is called a local minimum.

It is not always necessary, or even desirable, to reach the global minimum during a search. In one instance, it is impossible to reach any but the global minimum. In the case of the Adaline, the error surface was shown to be a hyperparaboloid with a single minimum. Thus, finding a local minimum is impossible. In the BAM and discrete Hopfield model, we store items at the vertices of the Hamming hypercube, each of which occupies a minimum of the energy surface. When recalling an item, we begin with some partial information and seek the local minimum nearest to the starting point. Hopefully,


the item stored at that local minimum will represent the complete item of interest. The point we reach may or may not lie at the global minimum of the energy function. Thus, we do not care whether the minimum that we reach is global; we desire only that it correspond to the data in which we are interested.

The generalized delta rule, used as the learning algorithm for the backpropagation network, performs gradient descent down an error surface with a topology that is not well understood. It is possible, as is seen occasionally in practice, that the system will end up in a local minimum. The effect is that the network appears to stop learning; that is, the error does not continue to decrease with additional training. Whether or not this situation is acceptable depends on the value of the error when the minimum is reached. If the error is acceptable, then it does not matter whether or not the minimum is global. If the error is unacceptable, the problem often can be remedied by retraining of the network with different learning parameters, or with a different random weight initialization. In the case of backpropagation, we see that finding the global minimum is desirable, but we can live with a local minimum in many cases.

A further example of local-minima effects is found in the continuous Hopfield memory as the latter is used to perform an optimization calculation. The traveling-salesperson problem is a well-defined problem subject to certain constraints. The salesperson must visit each city once and only once on the tour. This restriction is known as a strong constraint. A violation of this constraint is not permitted in any real solution. An additional constraint is that the total distance traveled must be minimized. Failure to find the solution with the absolute minimum distance does not invalidate the solution completely. Any solution that does not have the minimum distance results in a penalty or cost increase. It is up to the individual to decide how much cost is acceptable in return for a relatively quick solution. The minimum-distance requirement is an example of a weak constraint; it is desirable, but not absolutely necessary. Finding the absolute shortest route corresponds to finding the global minimum of the energy function.
As with backpropagation, we would like to find the global minimum, but will settle for a local minimum, provided the cost is not too high.

In the following sections, we shall present one method for reducing the possibility of falling into a local minimum. That method is called simulated annealing because of its strong analogy to the physical annealing process done to metals and other substances. Along the way, we shall briefly explore a few concepts in information theory, and discuss the relationship between information theory and a branch of physics known as statistical mechanics. Because we do not expect that you are an information theorist or a physicist, the discussion is somewhat brief. However, we do assume a knowledge of basic probability theory, a discussion of which can be found in many fundamental texts.


5.1 INFORMATION THEORY AND STATISTICAL MECHANICS

In this section we shall present a few topics from the fields of information theory and statistical mechanics. We choose to discuss only those topics that have relevance to the discussion of simulated annealing, so that the treatment is brief.

5.1.1 Information-Theory Concepts

Every computer scientist understands what a bit is. It is a binary digit, a thing that has a value of either 1 or 0. Memory in a digital computer is implemented as a series of bits joined together logically to form bytes, or words. In the mathematical discipline of information theory, however, a bit is something else.

Suppose some event, e, occurs with some probability, P(e). If we observe that e has occurred, then, according to information theory, we have received

$$I(e) = \log_2\frac{1}{P(e)} \qquad (5.1)$$

bits of information, where $\log_2$ refers to the log to the base 2.

You may need to get used to this notion. For example, suppose that P(e) = 1/2, so there is a 50-percent chance that the event occurs. In that case, $I(e) = \log_2 2 = 1$ bit. We can, therefore, define a bit as the amount of information received when one of two equally probable alternatives is specified. If we know for sure that an event will occur, its occurrence provides us with no information: $\log_2 1 = 0$. Some reflection on these ideas will help you to understand the intent of Eq. (5.1). The most information is received when we have absolutely no clue regarding whether the event will occur. Notice also that bits can occur in fractional quantities.

Suppose we have an information source, which has a sequential output of symbols from the set $S = \{s_1, s_2, \ldots, s_q\}$, with each symbol occurring with a fixed probability, $\{P(s_1), P(s_2), \ldots, P(s_q)\}$. A simple example would be an automatic character generator that types letters according to a certain probability distribution. If the probability of sending each symbol is independent of symbols previously sent, then we have what is called a zero-memory source. For such an information source, the amount of information received from each symbol is

$$I(s_i) = \log_2\frac{1}{P(s_i)} \qquad (5.2)$$

The average amount of information received per symbol is

$$\sum_{i=1}^{q} P(s_i)\,I(s_i) \qquad (5.3)$$
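A two-line computation, shown here only to make Eqs. (5.1) through (5.3) concrete; the probabilities are made-up examples, not data from the text.

import math

def bits(p):
    """Information received when an event of probability p is observed, Eq. (5.1)."""
    return math.log2(1.0 / p)

print(bits(0.5))     # 1.0 bit: one of two equally likely alternatives
print(bits(0.125))   # 3.0 bits: a rarer event carries more information

# average information per symbol for a four-symbol source, Eq. (5.3)
P = [0.5, 0.25, 0.125, 0.125]
print(sum(p * bits(p) for p in P))   # 1.75 bits, less than log2(4) = 2 for equiprobable symbols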


Equation (5.3) is the definition of the entropy, H(S), of the source, S:

$$H(S) = \sum_{i=1}^{q} P(s_i)\log_2\frac{1}{P(s_i)} \qquad (5.4)$$

In a physical system, entropy is associated with a measure of disorder in the system. It is the same in information theory. The most disorderly system is the one where all symbols occur with equal probability. Thus, the maximum information (maximum entropy) is received from such a system.

Exercise 5.1: Show that the average amount of information received from a zero-memory source having q symbols occurring with equal probabilities, 1/q, is

$$H = \log_2 q \qquad (5.5)$$

Exercise 5.2: Consider two sources, each of which sends a sequence of symbols whose possible values are the 26 letters of the English alphabet and the "space" character. The first source, $S_1$, sends the letters with equal probability. The second source, $S_2$, sends letters with the probabilities equal to their relative frequency of occurrence in English text. Which source transmits the most information? How many bits of information per symbol are transmitted by each source on the average?

We can demonstrate explicitly that the maximum entropy occurs for a source whose symbol probabilities are all equal. Suppose we have two sources, $S_1$ and $S_2$, each containing q symbols, where the symbol probabilities are $\{P_{1i}\}$ and $\{P_{2i}\}$, $i = 1,\ldots,q$, and the probabilities are normalized so that $\sum_i P_{1i} = \sum_i P_{2i} = 1$. The difference in entropy between these two sources is

$$H_1 - H_2 = -\sum_{i=1}^{q} P_{1i}\log_2 P_{1i} + \sum_{i=1}^{q} P_{2i}\log_2 P_{2i}$$

By using the trick of adding and subtracting the same quantity from the right side of the equation, we can write

$$H_1 - H_2 = -\sum_{i=1}^{q}\left[P_{1i}\log_2 P_{1i} - P_{1i}\log_2 P_{2i} + P_{1i}\log_2 P_{2i} - P_{2i}\log_2 P_{2i}\right]$$
$$= -\sum_{i=1}^{q}\left[P_{1i}\log_2\frac{P_{1i}}{P_{2i}} + (P_{1i} - P_{2i})\log_2 P_{2i}\right] \qquad (5.6)$$
$$= -\sum_{i=1}^{q} P_{1i}\log_2\frac{P_{1i}}{P_{2i}} - \sum_{i=1}^{q}(P_{1i} - P_{2i})\log_2 P_{2i}$$

If we identify $S_2$ as a source with equiprobable symbols, then $H_2 = H = \log_2 q$, as in Eq. (5.5). Since $\log_2 P_{2i} = \log_2\frac{1}{q}$ is independent of i, and


$$\sum_{i=1}^{q}(P_{1i} - P_{2i}) = \sum_{i=1}^{q} P_{1i} - \sum_{i=1}^{q} P_{2i} = 1 - 1 = 0$$

the second term in Eq. (5.6) vanishes.


$S_1$. In other words, the calculation of the effective distance from $S_2$ to $S_1$ places more emphasis on those symbols that have a higher probability of occurrence in $S_1$, which is why the term asymmetric was used in the definition of G. If the identities of $P_{1i}$ and $P_{2i}$ were reversed in Eqs. (5.9) and (5.10), the calculated value of G would be different. In that case, we would be finding the distance from $S_1$ to $S_2$, instead of the other way around.

We now conclude our brief introduction to information theory. In the next section, we discuss some elementary concepts in statistical mechanics. You should be alert to similarities between the discussion in the present section and that in the next.

5.1.2 Statistical-Mechanics Concepts

Statistical mechanics is a branch of physics that deals with systems containing a large number of particles, usually so large that the gross properties of the system cannot be determined by evaluation of the contributions of each particle individually. Instead, overall properties are determined by the average behavior of a number of identical systems. A collection of identical systems is called an ensemble, and is characterized by the average of its constituent systems, along with the statistical fluctuations about the average. For example, the average energy of an ensemble is the statistical average of the energies of the ensemble's constituent systems, each weighted by the probability that the system has a particular energy. If $P_r$ is the probability that a system has an energy, $E_r$, then the average energy of the ensemble of such systems is

$$\langle E\rangle = \sum_{r} P_r E_r \qquad (5.11)$$

An important class of systems are those that are in thermal contact with a much larger system called a heat reservoir. A heat reservoir is characterized by the fact that any thermal interaction with the smaller system in question results in only infinitesimal changes in the properties of the reservoir, whereas the smaller system may undergo substantial change, until an equilibrium condition is reached. An example would be a warm bottle of wine submersed in a large, cold lake. As an equilibrium is reached, the temperature of the wine might change considerably, whereas the overall temperature of the lake probably would change by only an immeasurable amount.¹

If we were to examine a large number of identical wine bottles, immersed in identical cold lakes (or the same lake if it were very big compared to the total volume of wine), we would find some variation in the total energy contained in the wine of different bottles. Furthermore, we would find that the probability,

¹ We have been using terms such as energy, equilibrium, and temperature without supplying precise definitions.
We are relying on your intuitive understanding of these concepts, and of others that will appear later, so that we can avoid a lengthy treatment of the subject at hand.


$P_r$, that a wine bottle was at a certain energy, $E_r$, was proportional to an exponential factor:

$$P_r = Ce^{-\beta E_r}$$

where $\beta$ is a parameter that depends on the temperature of the lake. Since the sum of all such probabilities must be 1, $\sum_r P_r = 1$, the proportionality constant must be

$$C = \frac{1}{\sum_r e^{-\beta E_r}}$$

so

$$P_r = \frac{e^{-\beta E_r}}{\sum_r e^{-\beta E_r}} \qquad (5.12)$$

The quantity, $C^{-1}$, has a special name and symbol in statistical mechanics: It is called the partition function, and the usual symbol for it is Z:

$$Z = \sum_r e^{-\beta E_r} \qquad (5.13)$$

We can write the average energy of our wine bottle as

$$\langle E\rangle = \frac{1}{Z}\sum_r e^{-\beta E_r}E_r \qquad (5.14)$$

The exponential factor, $e^{-\beta E_r}$, is called the Boltzmann factor, and the probability distribution in Eq. (5.12) is called the Boltzmann distribution. Ensembles whose properties follow the Boltzmann distribution are called canonical ensembles. The factor $\beta$ is related to the absolute temperature (temperature in Kelvins) by

$$\beta = \frac{1}{k_B T} \qquad (5.15)$$

where $k_B$ is a constant known as the Boltzmann constant.

In a physical system, entropy is defined by essentially the same relationship that we saw in the section on information theory:

$$S = -k_B\sum_r P_r\ln P_r \qquad (5.16)$$

Although we discussed information theory first, the development of statistical mechanics preceded that of information theory. The development of information theory began essentially with the recognition of the relationship between statistical mechanics and information through the concept of entropy [7]. Equation (5.16) differs from the information-theory definition of entropy in Eq. (5.4) by only a constant multiplier. Note that the conversion between the natural log and the base-2 log is given by

$$\ln x = \log_e x = \frac{\log_2 x}{\log_2 e}$$
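To see Eqs. (5.12) through (5.15) in action, the short sketch below computes the canonical probabilities and average energy of a made-up three-level system at two temperatures; the energy values and the unit choice of k_B = 1 are illustrative assumptions.

import numpy as np

def canonical(E, T, kB=1.0):
    """Return (probabilities, partition function, average energy) for energies E at temperature T."""
    beta = 1.0 / (kB * T)                   # Eq. (5.15)
    boltz = np.exp(-beta * np.asarray(E))   # Boltzmann factors
    Z = boltz.sum()                         # Eq. (5.13)
    P = boltz / Z                           # Eq. (5.12)
    return P, Z, (P * E).sum()              # Eq. (5.14)

E = [0.0, 1.0, 2.0]
for T in (5.0, 0.1):
    P, Z, Eavg = canonical(E, T)
    print(T, P.round(3), round(Eavg, 3))    # as T falls, probability piles up on the lowest energy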


In the previous section, we showed that the maximum entropy (information) is obtained from a source with equiprobable symbols. We can make a similar argument for physical systems. Suppose we have two systems with entropies $S_1 = -k_B\sum_r P_{1r}\ln P_{1r}$ and $S_2 = -k_B\sum_r P_{2r}\ln P_{2r}$, such that both systems have the same average energy:

$$\sum_r P_{1r}E_r = \sum_r P_{2r}E_r$$

Furthermore, assume that the $P_{2r}$ probabilities follow the canonical distribution in Eq. (5.12). By performing an analysis identical to that of the previous section, we can show that the difference in entropies of the two systems is given by

$$S_1 - S_2 = -k_B\sum_r P_{1r}\ln\frac{P_{1r}}{P_{2r}} \qquad (5.17)$$

Using the inequality, $\ln x \le x - 1$, we can show that $S_1$ will always be less than $S_2$ unless $P_{1r}$ also follows the canonical distribution.

Exercise 5.3: Derive Eq. (5.17) assuming that $P_{2r}$ is the canonical distribution. Using the inequality, $\ln x \le x - 1$, show that the maximum entropy for the system, $S_1$, with a given average energy, occurs when the probability distribution of that system is the canonical distribution.

In previous chapters, we have exploited the analogy of an energy function associated with various neural networks, and we have seen in this chapter that the concept of entropy applies to both physical and information systems. We wish to extend the analogy along the following lines: If our neural networks have energy and entropy, is it possible to define a temperature parameter that has meaning for neural networks? If so, what is the benefit to be gained by defining such a parameter? Can we place our neural network in contact with a fictitious heat reservoir, and again, is there some advantage to this analogy? These questions, and their answers, form the basis of the discussion in the next section.

5.1.3 Annealing: Real and Simulated

Our intuition should tell us that a lump of material at a high temperature has a higher energy state than an identical lump at a lower temperature. Suppose we wish to reduce the energy of the material to its lowest possible value. Simply lowering the temperature to absolute zero will not necessarily ensure that the material is in its lowest possible energy configuration. Let's consider the example of a silicon boule being grown in a furnace to be used as a substrate for integrated-circuit devices. It is highly desirable that the crystal structure be a perfect, regular crystal lattice at ambient temperature. Once the silicon boule is formed, it must be properly cooled to ensure that the crystal lattice will form properly. Rapid cooling can result in many imperfections within the crystal structure, or in a substance that is glasslike, with no regular crystalline structure


at all. Both of these configurations have a higher energy than does the crystal with the perfect lattice structure: They represent local energy minima.

An annealing process must be used to find the global energy minimum. The temperature of the boule must be lowered gradually, giving atoms within the structure time to rearrange themselves into the proper configuration. At each temperature, sufficient time must be allowed so that the material reaches an equilibrium. In equilibrium, the material follows the canonical probability distribution:

$$P(E_r) = \frac{e^{-E_r/k_BT}}{\sum_r e^{-E_r/k_BT}} \qquad (5.18)$$
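The gradual cooling just described is usually coded as an explicit temperature schedule. The skeleton below is a generic illustration rather than the procedure developed later in this chapter; the starting temperature, the geometric decay factor, and the per-temperature step count are arbitrary assumptions.

def anneal(update_at_temperature, t_start=10.0, t_stop=0.1,
           decay=0.9, sweeps_per_temperature=100):
    """Lower T geometrically, letting the system equilibrate at each temperature."""
    T = t_start
    while T > t_stop:
        for _ in range(sweeps_per_temperature):   # allow time to reach equilibrium at this T
            update_at_temperature(T)
        T *= decay                                # cool slightly and repeat

anneal(lambda T: None)   # plug in a stochastic unit-update rule, such as the one described below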


If we give the system a gentle shaking, then, once the ball gets to the global-minimum side, it is less likely to acquire sufficient energy to get back across to the local-minimum side. However, because of the gentle shaking, it might take a very long time before the ball gets just the right push to get it over to the global-minimum side in the first place.

Annealing represents a compromise between vigorous shaking and gentle shaking. At high temperatures, the large thermal energy corresponds to vigorous shaking; low temperatures correspond to gentle shaking. To anneal an object, we raise the temperature, then gradually lower it back to ambient temperature, allowing the object to reach equilibrium at each stop along the way. The technique of gradually lowering the temperature is the best way to ensure that a local minimum can be avoided without having to spend an infinite amount of time waiting for a transition out of a local minimum.

Exercise 5.4: Beginning with Eq. (5.18), find an equation that expresses the relative probability of the system being in the energy states, $E_a$ or $E_b$, at some given temperature, T. Use this result to prove that, as $T \to 0$, the lower energy state is highly favored over the higher energy state.

We can postulate that it is possible to extend the analogy between information theory and statistical mechanics to allow us to place our neural network (information system) in contact with a heat reservoir at some, as yet undefined, temperature. If so, then we can perform a simulated annealing process whereby we gradually lower the system temperature while processing takes place in the network, in the hopes of avoiding a local minimum on the energy landscape.

To perform this process, we must simulate the effects of temperature on our system. In a physical system, molecules have an average kinetic energy proportional to the temperature of the system. Individual molecules may have more or less kinetic energy than the average, and random collisions may cause a molecule either to gain or to lose energy. We can simulate this behavior in a neural network by adding a stochastic element to the processing.

Let's look at an example from Chapter 4. Suppose we had a Hopfield network with binary outputs and we were seeking the network output with the lowest possible energy. According to the recipe in Chapter 4, each output node would be updated deterministically, depending on the sign of the net input to that unit and its current value. As we saw, this procedure usually leads to a solution at the nearest local minimum of energy.
Instead of this deterministic procedure, let's heat the system to a temperature, T, and let the output value of each unit be determined stochastically according to the Boltzmann distribution. For a single unit, x_i, if the energy of the network is E_a when x_i = 1, and E_b when x_i = 0, then, regardless of the previous state of x_i, let x_i = 1 with a probability of

p_i = \frac{e^{-E_a/k_B T}}{e^{-E_a/k_B T} + e^{-E_b/k_B T}}    (5.19)

which can be rewritten as

p_i = \frac{1}{1 + e^{-\Delta E_i/T}}    (5.20)


where ΔE_i = E_b - E_a, and T is a parameter that is the analog of temperature. This operation ensures that, every so often, a unit will update so as to increase the energy of the system, thus helping the system get out of local-minimum valleys. Equation (5.19) results directly from the canonical distribution. Because only one unit is changing, there are only two potential states open to the system, which explains the two-term sum in the denominator.

As processing continues, the temperature is reduced gradually. In the end, there will be a high probability that the system is in a global energy minimum. The actual mechanics of this process will be discussed in the next section. Rather than continue using the Hopfield memory, we switch our attention to the particular network architecture known as the Boltzmann machine.

5.2 THE BOLTZMANN MACHINE

There are many similarities between the architecture and PEs of the Boltzmann machine, and those of other neural networks that we have discussed previously. However, there is a fundamental difference in the way we must think of the Boltzmann machine. The output of individual PEs in the Boltzmann machine is a stochastic function of the inputs, rather than a deterministic function: The output of a given node is calculated using probabilities, rather than a threshold or sigmoid output function. Moreover, we shall see that the training procedure does not depend solely on finding an energy minimum on an energy landscape. Rather, the learning algorithm will combine an energy minimization with an entropy maximization consistent with the use of the Boltzmann distribution to describe energy-state probabilities. These differences are a direct result of applying analogies from statistical mechanics to the neural network.

5.2.1 Basic Architecture and Processing

There are two different Boltzmann machine architectures of interest here. One we shall call the Boltzmann completion network, and the other we shall call the Boltzmann input-output network. For the moment, we shall concentrate on the Boltzmann completion architecture, which appears in Figure 5.2.

The function of the Boltzmann completion network is to learn a set of input patterns, and then to be able to supply missing parts of the patterns when a partial, or noisy, input pattern is processed. The input vectors are binary, with each component an element of {0, 1}.

As with the discrete Hopfield memory, the system energy can be calculated from

E = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} w_{ij} x_i x_j    (5.21)


Figure 5.2  In the Boltzmann completion architecture, there are two layers of units: visible and hidden. The network is fully interconnected between layers and among units on each layer. The connections are bidirectional and the weights are symmetric, w_ij = w_ji.

where n is the total number of units in the network, and x_k is the output of the kth unit. The energy difference between the system with x_k = 0 and x_k = 1 is given by

\Delta E_k = (E_{x_k=0} - E_{x_k=1}) = \sum_{j \ne k} w_{kj} x_j    (5.22)

Notice that the summation in Eq. (5.22) is identical to the usual definition of the net-input value to unit k, so that

\Delta E_k = net_k    (5.23)

We shall assume for the moment that we already have an appropriate set of weights that encodes a set of binary vectors into the Boltzmann completion network. Suppose that the vector, x = (0, 1, 1, 0, 0, 1, 0)^t, is one of the vectors learned by the network. We would like to be able to recall this vector given only partial knowledge, for example, the vector, x' = (0, u, 1, 0, u, 1, 0)^t, where u represents an unknown component. The recall procedure will be performed using a simulated-annealing technique with x' as the starting vector on the visible units. The procedure is described by the following algorithm:


1. Force the outputs of all known visible units to the values specified by the initial input vector, x'.

2. Assign all unknown visible units, and all hidden units, random output values from the set {1, 0}.

3. Select a unit, x_k, at random and calculate its net-input value, net_k.

4. Regardless of the current value of the unit, assign the output value, x_k = 1, with probability p_k = 1/(1 + e^{-net_k/T}). This stochastic choice can be implemented by comparison of the value of p_k to that of a number, z, selected randomly from a uniform distribution between zero and one. If z < p_k, then set x_k = 1. The parameter, T, acts as the temperature of the system. More will be said about the temperature in Section 5.2.3.

5. Repeat steps 3 and 4 until all units have had some probability of being selected for update. This number of unit-updates defines a processing cycle. For example, in a 10-unit network, 10 random unit selections would be a processing cycle. Completing a single processing cycle does not guarantee that every unit has been updated.

6. Repeat step 5 for several processing cycles, until thermal equilibrium has been reached at the given temperature, T. The number of processing cycles required to reach equilibrium is not easy to specify. Usually, we guess the number of processing cycles required to reach equilibrium.

7. Lower the temperature, T, and repeat steps 3 through 7.

Once the temperature has been reduced to a small value, the network will stabilize. The final result will be the outputs of the visible units.

The set of temperatures, along with the corresponding number of processing cycles at each temperature, constitutes the annealing schedule for the network. For example, {2 processing cycles at a temperature of 10, 2 processing cycles at 8, 4 processing cycles at 6, 4 processing cycles at 4, 8 processing cycles at 2} may be an appropriate annealing schedule for a certain problem. By performing an annealing during pattern recall, we hope to avoid shallow, local minima. The final result should be a deeper, local minimum consistent with the known components of the initial input vector, x'. We expect the final result to be the original vector, x.

An alternative formulation of the Boltzmann machine is the Boltzmann input-output network shown in Figure 5.3. This network functions as a heteroassociative memory. During the recall process, the input-vector units are clamped permanently and are not updated during the annealing process.
All hidden units and output units are updated according to the simulated-annealing procedure described previously for the Boltzmann completion network.
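To make the recall procedure concrete, here is a minimal Python sketch of steps 1 through 7 above, written for the completion network. It is illustrative only: the weight matrix w, the unknown marker None, and the function name recall are our own choices and are not part of the simulator developed later in this chapter.

import math
import random

def recall(w, pattern, schedule):
    """Anneal a Boltzmann completion network to fill in unknown components.

    w        -- symmetric weight matrix w[i][j] over all n units, hidden units last
    pattern  -- visible-unit values: 0, 1, or None (None marks an unknown component)
    schedule -- annealing schedule as (temperature, processing_cycles) pairs, hottest first
    """
    n = len(w)
    clamped = {i for i, v in enumerate(pattern) if v is not None}    # step 1
    x = [v if v is not None else random.randint(0, 1)                # steps 1-2
         for v in pattern + [None] * (n - len(pattern))]

    for T, cycles in schedule:                                       # step 7
        for _ in range(cycles):                                      # step 6
            for _ in range(n):                 # one processing cycle, step 5
                k = random.randrange(n)                              # step 3
                if k in clamped:
                    continue                   # known visible units stay fixed
                net_k = sum(w[k][j] * x[j] for j in range(n) if j != k)
                arg = max(-50.0, min(50.0, net_k / T))               # guard against overflow
                p_k = 1.0 / (1.0 + math.exp(-arg))                   # step 4
                x[k] = 1 if random.random() < p_k else 0
    return x[:len(pattern)]                    # outputs of the visible units

With the example schedule quoted above, the call would be recall(w, x_prime, [(10, 2), (8, 2), (6, 4), (4, 4), (2, 8)]).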


Figure 5.3  For the Boltzmann input-output network, the visible units are separated into input and output units. There are no connections among input units, and the connections from input units to other units in the network are unidirectional. All other connections are bidirectional, as in the Boltzmann completion network.

5.2.2 Learning in Boltzmann Machines

Now that we have investigated the processing done to recall previously stored patterns in Boltzmann-machine architectures, let's return to the problem of storing the patterns in the network. As was the case with pattern recall, learning in Boltzmann machines is accomplished using a simulated-annealing technique. This technique has fundamental differences from learning techniques investigated in previous chapters because of its stochastic nature. The learning algorithm for Boltzmann machines employs a gradient-descent technique, similar to previous learning methods, although the function being minimized is no longer identified with the energy of the network as described by Eq. (5.21).

Throughout this chapter, we have emphasized that the output values of the processing elements have a probabilistic character. Thus, the output of the network as a whole also has a probabilistic character. When training the Boltzmann machine, we supply examples that are representative of the entire


population of possible input vectors. The learning algorithm that we employ must cause the network to form a model of the entire population of input patterns based on these examples. Unfortunately, there are often many different models that are consistent with the examples. The question is how to choose from among the various models.

One method of choosing among different models is to insist that the model of the population produced by the network will result in the most homogeneous distribution of input patterns consistent with the examples supplied. We can illustrate this concept with an example.

Suppose that we know that the first component of a three-component input vector will have a value of +1 in 40 percent of all vectors in the population. Out of eight possible three-component vectors, four have their first component as +1. There are an infinite number of ways that the probabilities of occurrence of those four vectors can combine to yield a total probability of 40 percent for the occurrence of +1 in the first position. One example is P{1,0,0} = 10%, P{1,0,1} = 4%, P{1,1,0} = 8%, P{1,1,1} = 18%. The most homogeneous distribution would be to assign equal probabilities to each of the four vectors, such that P{1,0,0} = 10%, P{1,0,1} = 10%, P{1,1,0} = 10%, P{1,1,1} = 10%. The rationale for this choice is that the information available gives us no reason to assign a higher probability of occurrence to any one of the vectors.

If a Boltzmann completion network learns the most homogeneous distribution, then repeated trials with an input vector of {u, 0, 0}, where u is unknown, should result in a final output of {1, 0, 0} in approximately 10 trials out of every 100 (the more trials, the closer the results will be to 10 out of 100).

Recall from Section 5.1.1 that the asymmetric divergence,

G = \sum_i P_{1i} \ln \frac{P_{1i}}{P_{2i}}

measures the difference between an information source, S_1, with symbol probabilities, P_{1i}, and a source, S_2, with symbol probabilities, P_{2i}. Suppose the probabilities, P_{1i}, represent our knowledge of the probability distribution of a source. If the probabilities, P_{2i}, represent the most homogeneous distribution, then a learning algorithm that discovered a set of weights, w_ij, that minimized G would satisfy our desire to have the network learn the most homogeneous distribution consistent with what is known about the distribution. Such a learning algorithm would maximize the entropy of the system; so far, however, we have not said anything about the energy of the system. The following exercise is a prelude to that discussion.

Exercise 5.5: In the discussion of information theory in Section 5.1.1, we showed that the distribution with the maximum possible entropy was the one with equal symbol probabilities, P_i = 1/q, where q is the number of symbols. Consider a physical system in equilibrium, at some finite temperature, T.


The distribution function for that physical system is the canonical distribution function given by Eq. (5.12). Show that, in the limit as T → ∞, the physical system has equiprobable states, P_r = 1/q, where q is the number of possible states of the system.

As demonstrated in Exercise 5.5, an information system with equiprobable symbols is analogous to a physical system in equilibrium at an infinite temperature. If the temperature of the physical system is finite, the distribution with the maximum entropy (most homogeneous) is the canonical distribution. Furthermore, if the temperature is reduced gradually, lower-energy states will be favored (recall the discussion in Section 5.1.2). A reasonable approach to training the Boltzmann machine can now be constructed:

1. Artificially raise the temperature of the neural network to some finite value.

2. Anneal the system until equilibrium is reached at some low temperature value. (This minimum temperature should not be zero.)

3. Adjust the weights of the network so that the difference between the observed probability distribution and the canonical distribution is reduced (i.e., change the weights to reduce G).

4. Repeat steps 1 through 3 until the weights no longer change.

This procedure is a combination of gradient descent in the function, G, and simulated annealing as described in Section 5.2.1.

To perform a gradient descent in G, we must learn how changes in the weight values affect that function. Suppose we have a set of vectors, {V_a}, that we would like a Boltzmann completion network to learn. These vectors would appear as the outputs of the visible units in the network. Define {H_b} as the set of vectors appearing on the hidden units.

We can impose our knowledge about the V_a vectors by clamping the outputs of the visible units to each V_a according to our knowledge of the probability of occurrence of V_a. By clamping, we mean that the output values are fixed and do not change, even though other units may be changing according to the stochastic model presented earlier. Thus, the probability that the visible units will be clamped to the vector V_a is P+(V_a), where the "+" superscript indicates that the visible units have been clamped. Units on the hidden layer may undergo change while a single vector is clamped to the visible units. The probability that vector V_a is clamped to the visible units, and that vector H_b appears on the hidden units, is P+(V_a ∧ H_b), and

P^+(V_a) = \sum_b P^+(V_a \wedge H_b)

The reason that we must account for the hidden-layer vector is that the energy of the system depends on all of the units in the network, not just on the visible units. The total energy of the system with V_a on the visible units and


H_b on the hidden units is

E_{ab} = -\frac{1}{2} \sum_{i} \sum_{j \ne i} w_{ij} x_i^{ab} x_j^{ab}    (5.24)

where x_i^{ab} denotes the output of the ith unit when V_a is on the visible units and H_b is on the hidden units.


and the derivative of the partition function is

\frac{\partial}{\partial w_{ij}} \sum_{m,n} e^{-E_{mn}/T} = \frac{1}{T} \sum_{m,n} x_i^{mn} x_j^{mn} e^{-E_{mn}/T}    (5.30)

Substituting Eqs. (5.29) and (5.30) into Eq. (5.28) yields

\frac{\partial P^-(V_a)}{\partial w_{ij}} = \frac{1}{T} \left[ \sum_b x_i^{ab} x_j^{ab} P^-(V_a \wedge H_b) - P^-(V_a) \sum_{m,n} x_i^{mn} x_j^{mn} P^-(V_m \wedge H_n) \right]    (5.31)

where we have made use of the definition of P-(V_a ∧ H_b) and the definition of P-(V_a). Equation (5.31) can now be substituted into Eq. (5.27) to give

\frac{\partial G}{\partial w_{ij}} = -\frac{1}{T} \left[ \sum_a \frac{P^+(V_a)}{P^-(V_a)} \sum_b x_i^{ab} x_j^{ab} P^-(V_a \wedge H_b) - \sum_a P^+(V_a) \sum_{m,n} x_i^{mn} x_j^{mn} P^-(V_m \wedge H_n) \right]    (5.32)

To simplify this equation, we first note that \sum_a P^+(V_a) = 1. Next, we shall use a result from probability theory:

P^+(V_a \wedge H_b) = P^+(H_b | V_a) P^+(V_a)

Stated in words, this equation means that the probability of having V_a on the visible layer and H_b on the hidden layer is equal to the probability of having H_b on the hidden layer given that V_a was on the visible layer, times the probability that V_a is on the visible layer. An analogous definition and statement can be made for P-(V_a):

P^-(V_a \wedge H_b) = P^-(H_b | V_a) P^-(V_a)

If V_a is on the visible layer, then the probability that H_b will occur on the hidden layer should not depend on whether V_a got there by being clamped to that state or by free-running to that state. Therefore, it must be true that

P^+(H_b | V_a) = P^-(H_b | V_a)


Then

\frac{P^-(V_a \wedge H_b)}{P^-(V_a)} = \frac{P^+(V_a \wedge H_b)}{P^+(V_a)}

and

P^-(V_a \wedge H_b) \frac{P^+(V_a)}{P^-(V_a)} = P^+(V_a \wedge H_b)

Using these results, we can write

\frac{\partial G}{\partial w_{ij}} = -\frac{1}{T} \left( p^+_{ij} - p^-_{ij} \right)    (5.33)

where

p^+_{ij} = \sum_{a,b} P^+(V_a \wedge H_b) \, x_i^{ab} x_j^{ab}

and

p^-_{ij} = \sum_{a,b} P^-(V_a \wedge H_b) \, x_i^{ab} x_j^{ab}    (5.34)

The interpretation of Eqs. (5.33) and (5.34) will be given shortly. For now, recall that weight changes occur in the direction of the negative gradient of G. Weight updates are calculated according to

\Delta w_{ij} = \epsilon \left( p^+_{ij} - p^-_{ij} \right)    (5.35)

where ε is a constant.

The quantities, p-_ij and p+_ij, are called co-occurrence probabilities because they compute the frequency that x_i^ab and x_j^ab are both active (an output value of 1) averaged over all possible combinations of the patterns, V_a and H_b. Thus, p+_ij is the co-occurrence probability when the V_a patterns are being clamped on the visible units, and p-_ij is the co-occurrence probability when the network is free-running. As seen in Eq. (5.35), the weights will continue to change as long as the two co-occurrence probabilities differ.

Several paragraphs ago, we described a simple algorithm for training a Boltzmann machine. We shall now expand that algorithm to include the method for determining the weight-update values as specified by Eq. (5.35):

1. Clamp one training vector to the visible units of the network.

2. Anneal the network according to the annealing schedule until equilibrium is reached at the desired minimum temperature.

3. Continue to run the network for several more processing cycles. After each processing cycle, determine which pairs of connected units are on simultaneously.

4. Average the co-occurrence results from step 3.


5. Repeat steps 1 through 4 for all training vectors, and average the co-occurrence results to get an estimate of p+_ij for each pair of connected units.

6. Unclamp the visible units, and anneal the network until equilibrium is reached at the desired minimum temperature.

7. Continue to run the network for several more processing cycles. After each processing cycle, determine which pairs of connected units are on simultaneously.

8. Average the co-occurrence results from step 7.

9. Repeat steps 6 through 8 for the same number of times as was done in step 5, and average the co-occurrence results to get an estimate of p-_ij for each pair of connected units.

10. Calculate and apply the appropriate weight changes. (The entire sequence from step 1 to step 10 defines a sweep.)

11. Repeat steps 1 through 10 until p+_ij - p-_ij is sufficiently small.

An alternative way to decide when to halt training is to perform a test procedure after each sweep. Clamp partial or noisy input vectors to the visible units, anneal the network, and see how well the network reproduces the correct vector. When the performance is adequate, training can be stopped.

Although it is not possible to give an exact procedure for determining the annealing schedule, it is possible to provide guidelines and suggestions. The next section provides these guidelines, with other practical information about the simulation of the Boltzmann machine.

Exercise 5.6: Modify the Boltzmann learning algorithm to accommodate the Boltzmann input-output network described in Section 5.2.1.

5.2.3 Practical Considerations

If there is a single word that describes the learning process in a Boltzmann machine simulation, that word is slow. Even relatively small networks may require thousands of processing cycles to learn a set of input patterns adequately. This situation is exacerbated by the fact that the annealing schedule must incorporate a very slow reduction in temperature if the global minimum is to be found. Studies performed on simulated annealing by Geman and Geman showed that the temperature must be reduced in proportion to the inverse log of the time:

T(t_n) = \frac{T_0}{\ln t_n}

where T_0 is the starting temperature, and the discrete-time variable, t_n, represents the nth processing cycle [2]. Annealing a single input pattern from a temperature of 40 to a temperature of 0.5 requires an impractical e^80 processing cycles to ensure that the global minimum has been found (at T = 0.5).


Fortunately, we do not need to follow this prescription to obtain good results, but, other than the results of Geman and Geman, annealing schedules must be determined ad hoc.

Examples of annealing schedules used to solve a few small problems are given in a paper by Ackley, Hinton, and Sejnowski [1]. In a problem called the 8-3-8 encoder problem, they used a Boltzmann input-output network with 16 visible units and three hidden units. They clamped an eight-bit binary number to the input nodes. Only one of the bits in the input vector was allowed to be on at any given time. The problem was to train the network to reproduce the identical vector on the output nodes. Thus, the three hidden nodes learned a three-digit binary code for the eight possible input vectors. The annealing schedule was {2 processing cycles at a temperature of 20, 2 at 15, 2 at 12, 4 at 10}. Several thousand sweeps were required to train the network to perform the encoding task.

The annealing technique that we have described is not the only possible method. The bibliography at the end of this chapter contains references that describe other methods. One method in particular that we wish to note here is described by Szu [8]. Szu's method is based on the Cauchy distribution, rather than the Boltzmann distribution. The Cauchy distribution has the same general shape as the Boltzmann distribution, but does not fall off as sharply at large energies. The implication is that the Cauchy machine may occasionally take a rather large jump out of a local minimum. The advantage of this approach is that the global minimum can be reached with a much shorter annealing schedule. For the Cauchy machine, the annealing temperature should follow the inverse of the time:

T(t_n) = \frac{T_0}{1 + t_n}

This annealing schedule is much faster than is the one for the corresponding Boltzmann machine.

5.3 THE BOLTZMANN SIMULATOR

As you may have inferred from the previous discussion, simulating the Boltzmann network on a conventional computer will consume enormous amounts of both system memory and CPU time. For these reasons (as well as for brevity), we will limit our discussion of the simulator to only the most important data structures and algorithms needed to implement the Boltzmann network. The development of the less difficult or application-specific algorithms is left to you. We begin by defining the data structures that we must add to our generic simulator to implement the Boltzmann network. We then develop the algorithms that will be used to run the network simulator, and conclude the chapter with a presentation of an example problem that the Boltzmann network can be used to solve.
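Before turning to the data structures, a short Python sketch may help to show why the Cauchy schedule is so much faster than the inverse-log schedule. It is purely illustrative; the starting temperature of 40 and target of 0.5 are taken from the example above, and the function names are ours.

import math

def boltzmann_temp(t_n, t0=40.0):
    """Inverse-log cooling (Geman and Geman): T(t_n) = T0 / ln(t_n)."""
    return t0 / math.log(t_n)

def cauchy_temp(t_n, t0=40.0):
    """Inverse-linear cooling (Szu): T(t_n) = T0 / (1 + t_n)."""
    return t0 / (1.0 + t_n)

# Processing cycles needed to cool from T0 = 40 down to T = 0.5:
#   inverse-log:    ln(t_n) = 80  ->  t_n = e^80 (about 5.5e34 cycles)
#   inverse-linear: 1 + t_n = 80  ->  t_n = 79 cycles
print(boltzmann_temp(math.exp(80.0)))   # 0.5
print(cauchy_temp(79.0))                # 0.5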


5.3.1 The Modified Boltzmann Network

The Boltzmann network, as we have discussed throughout this chapter, is similar in structure to the BAM and Hopfield networks described in Chapter 4. In particular, the Boltzmann network consists of a set of PEs, each completely interconnected with every other unit in the network.² This type of network arrangement is usually diagrammed as a set of units with bidirectional connections to every other unit, as illustrated earlier in Figure 5.2. For purposes of simulation, however, we will instead create the network structure with a single layer of PEs, each having a unidirectional input connection from every other unit. We adopt this convention because it offers us many insights into how to go about implementing the Boltzmann simulator program. However, you should note that this alternative model is functionally identical to the network containing bidirectional connections.

As an example, consider the two different views of the same Boltzmann network illustrated in Figure 5.4. Notice that, in the modified view shown in Figure 5.4(b), the network structure is much busier (and more confusing) than it is in the standard arrangement. If we look beneath the surface, though, and consider only the processing going on at the level of each network unit, the modified view reveals that each PE behaves in a familiar manner: an aggregate input stimulation provided by many modulated input signals is converted to a single output that is, in turn, modulated to provide an input to every other network unit. This type of processing is exactly what our simulator is designed to model efficiently.

Based on this observation, our basic data structures for the network simulator will not require modification. We need only to introduce some new structures to account for how the network learns, and to incorporate them into the algorithms that we must develop to propagate signals through the network.

5.3.2 Boltzmann Data Structures

There are two significant differences between the structure of the Boltzmann network and that of other ANS networks we have discussed previously. The Boltzmann model is trained by using an annealing schedule to determine how many training passes to complete at every temperature interval, and the Boltzmann network is not a layered network. We will account for these differences separately as we modify our simulator data structures to implement the Boltzmann network.
We begin by implementing a mechanism that will allow the simulator to follow an annealing schedule during signal propagation and training. Since it is impractical for us, as program designers, to try to accommodate every possible application by implementing a universal annealing schedule, we

² The Boltzmann input-output model is an exception, due to its lack of interconnection between input units.


will instead require the user to provide the annealing schedule for each application. The implementation of the schedule is relatively simple if we observe that the latter is nothing more than a variable-length set of ordered pairs of temperature and training passes. Thus, we can construct an annealing schedule of indefinite size if we define a record structure to store each ordered pair, and dynamically allocate an array of these records to create the schedule. Our structure for the annealing pair is given by the declaration:

record ANNEAL =
   TEMPERATURE : float;     {temperature}
   PASSES : integer;        {training passes}
end record;

Figure 5.4  A Boltzmann network is shown, (a) from the traditional view, and (b) modified to conform to the unidirectional connection format.


We complete the annealing schedule by defining an array of ANNEAL records, each specified by the user. As in the case of other dynamically allocated arrays discussed in this text, we do not presuppose the use of any particular computer language; therefore, the actual specification and mechanism for initialization of the SCHEDULE array is left to you. For purposes of discussion, however, the declaration of a typical annealing schedule array might take the form

record SCHEDULE (SIZE) =
   LENGTH : integer = SIZE;
   STEP : array[1..SIZE] of ANNEAL;
end record;

where SIZE is the user-provided length of the array.

Before we can define the remaining data structures needed to implement the Boltzmann network, we must take into account the second difference between the Boltzmann and other ANS network models we have discussed previously. Specifically, the Boltzmann machine is a nonlayered network, and has dedicated input and output units that are subsets of units within the same layer. Moreover, the inputs and outputs may overlap, as in the case of the Boltzmann completion network, in which the input units are the output units. An example of this type of network is the one that will simply fill in missing data elements from an incomplete input. On the other hand, the network may have two discrete sets of units, one acting as dedicated input units, the other as dedicated output units.

To account for these differences, we can make use of the fact that the outputs from all of the units in the network will be modeled as a linearly sequential array of data values in the computer memory, arranged from first to last, going from low memory toward high memory. This fact holds true because, as we have observed, the Boltzmann network can be thought of as containing a single layer of units. Furthermore, this sequential structure holds for all of the intermediate network structures that we must use (e.g., the unit weights array). If we therefore adopt the convention that all units in the network must be arranged sequentially, with the visible units occurring first, we can simply locate the first unit of the network input by using an integer index. Once the first unit is found, all other input units will follow in sequence. The same technique can be used to locate the network outputs as well.
An additional benefit derived from the use of this method is the ability to provide either discrete or overlapping network inputs and outputs. Unfortunately, our convention will introduce some other difficulties, because the number of units dedicated to input and output is no longer defined explicitly by the number of elements in an array, as it is in other network models. We must therefore compensate for the loss of layer identity by constructing one additional record that will be used to locate and size the network input and output units. This record will take the form

record DEDICATED =
   FIRST : integer;     {index to first dedicated unit}
   LENGTH : integer;    {number of units needed}
end record;
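For readers following along in a higher-level language, here is one possible Python rendering of the structures defined so far; the class and field names simply mirror the pseudocode declarations above and are not part of the book's simulator.

from dataclasses import dataclass
from typing import List

@dataclass
class Anneal:
    temperature: float      # temperature for this step
    passes: int             # training passes at this temperature

@dataclass
class Schedule:
    step: List[Anneal]      # ordered hottest-first, like SCHEDULE.STEP

    @property
    def length(self) -> int:
        return len(self.step)

@dataclass
class Dedicated:
    first: int              # index of first dedicated (input or output) unit
    length: int             # number of units in the group

# Example: the schedule {2 cycles at 10, 2 at 8, 4 at 6, 4 at 4, 8 at 2}
schedule = Schedule(step=[Anneal(10, 2), Anneal(8, 2), Anneal(6, 4),
                          Anneal(4, 4), Anneal(2, 8)])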


We will also need another structure to help us determine the required weight changes for the network during training. Since this network adapts its connections as a function of the probability that any two units are on simultaneously, we must have a means of collecting the statistics about how often that situation occurs for every two units in the network. The most straightforward approach to collecting these statistics is to construct an array that will be used to store and calculate the co-occurrence parameters, p+_ij and p-_ij. For purposes of simulation, we must maintain a count value for every combination of two units in the network. Thus, we must be able to store information about the relationships between N units taken two at a time. The storage required for all this information will consume an array of dimension N(N - 1)/2, with each element a floating-point number that will be used as both a count and an average. We will elaborate on this dual functionality as part of the discussion of the Boltzmann learning algorithms (see Section 5.3.3). For now, the declaration for this structure will again be dependent on the language of implementation; for that reason, it will not be presented here. For clarity, we will name the array CO-OCCURRENCE; the declaration of the SCHEDULE.STEP array provided earlier is an example of how this array might be constructed.

Having created an array to collect the CO-OCCURRENCE statistics about our Boltzmann machine, we now observe that it alone will not be sufficient for our simulation. A single array will allow us to accumulate the frequency with which any two network units are on together; however, these statistics must be gathered and averaged for every training pattern, as well as for the aggregate. Furthermore, we must collect network statistics during different runs (clamped to compute p+_ij and unclamped for p-_ij). Fortunately, all these statistics can be combined, thus minimizing the amount of storage needed to preserve the data. However, the information from each run must be isolated so that the appropriate Δw_ij can be computed, according to Eq. (5.35). We shall therefore allocate two such arrays, and shall create another record to locate each at run time. Such a record is given by

record COOCCURRENCE =
   CLAMPED : ^float[];      {locate array for p+ calculation}
   UNCLAMPED : ^float[];    {locate array for p- calculation}
end record;

Before pressing on, let us consider how to utilize effectively the arrays we have just created. To organize each array such that it is both meaningful and efficient, we must determine a means of associating array locations, and therefore computer memory, with network connections.
This association is best accomplished implicitly, by allocating the first N - 1 locations in the array to the connections between the first and all other network units. The next N - 2 slots will be given to the second unit, and so on. As an example, consider a network with five units, labeled A through E; such a network is depicted in


Figure 5.5(a). The first four entries in the CO-OCCURRENCE array for this network would be mapped to the connections between units A-B, A-C, A-D, and A-E, as shown in Figure 5.5(b). Likewise, the next three slots would be mapped to the connections between units B-C, B-D, and B-E; the next two to C-D and C-E; and the last one to D-E. By using the arrays in this manner, we can collect co-occurrence statistics about the network by starting at the first input unit and sequentially scanning all other units in the network. After completing this initial pass, we can complete the network scan by merely incrementing our array pointer to access the second unit, then the third, fourth, ..., nth units.

Figure 5.5  The CO-OCCURRENCE array is shown (a) depicted as a sequential memory array, and (b) with its mapping to the connections in the Boltzmann network.

We can now specify the remaining data structures needed to implement the Boltzmann network simulator. We begin by defining the top-level record structure used to define the Boltzmann network:

record BOLTZMANN =
   UNITS : integer;            {number of units in network}
   CLAMPED : boolean;          {true=clamped; false=unclamped}
   INPUTS : DEDICATED;         {locate and size network input}
   OUTPUTS : DEDICATED;        {locate and size network output}
   NODES : ^LAYER;             {pointer to layer structure}
   TEMPERATURE : float;        {current network temperature}
   CURRENT : integer;          {step in annealing schedule}


   ANNEALING : ^SCHEDULE[];    {pointer to user-defined schedule}
   STATISTICS : COOCCURRENCE;  {pointers to statistics arrays}
end record;

Figure 5.6 provides an illustration of how the values in the BOLTZMANN structure interact to specify a Boltzmann network. Here, as in other network models, the layer structure is the gateway to the network-specific data structures. All that is needed to gain access to the layer-specific data are pointers to the appropriate arrays.

Figure 5.6  Organization of the Boltzmann network using the defined data structure is shown. In this example, the input and output units are the same, and the network is in the third step of its annealing schedule.

Thus, the structure for the layer record is given by


record LAYER =
   outs : ^float[];        {pointer to unit outputs array}
   weights : ^^float[];    {pointers in weight_ptr array}
end record;

where outs is a pointer used to locate the beginning of the unit outputs array in memory, and weights is a pointer to the intermediate weight_ptr array, which is used in turn to locate each of the input connection arrays in the system. Since the Boltzmann network requires only one layer of PEs, we will need only one layer pointer in the BOLTZMANN record. All these low-level data structures are exactly the same as those specified in the generic simulator discussed in Chapter 1.

5.3.3 Boltzmann Algorithms

Let us assume, for now, that our network data structures contain valid weights for all the connections, and that the user has initialized the annealing schedule to contain the information given in Table 5.1; in other words, the network data structures represent a trained network. We must now create the programs that the host computer will execute to simulate the network in production mode. We shall start by developing the information-recall routines.

   Temperature   Passes
        5           4
        4           6
        3           7
        2           6
        1           4

Table 5.1  The annealing schedule for the simulator example.

Boltzmann Production Algorithms. Remember that information recall in the Boltzmann network consists of a sequence of steps where we first apply an input to the network, raise the temperature to some predefined level, and anneal the network while slowly lowering the temperature. In this example, we would initially raise the temperature to 5 and would perform four stochastic signal propagations; we would then lower the temperature to 4 and would perform six signal propagations, and so on. After completing the four required signal propagations when the temperature of the network is 1, we can consider the network annealed. At this point, we simply read the output values from the visible units.

If, however, we think about the process we just described, we can decompose the information-recall problem into three lower-level subroutines:


apply_input   A routine used to take a user-provided or training input and apply it to the network, and to initialize the output from all unknown units to a random state.

anneal   A routine used to stimulate the Boltzmann network according to the previously initialized annealing schedule.

get_output   A function used to locate the start of the output array in the computer memory, so that the network response can be accessed.

Since the anneal routine is the place where most of the processing is accomplished, we shall concentrate on the development of just that routine, leaving the design of the other two algorithms to you. The mathematics of the Boltzmann network tells us that the annealing process, in production mode, consists of two major functions that are repeated until the network has stabilized at a low temperature. These functions, described next, can each be implemented as subroutines that are called by the parent anneal process.

set_temp   A procedure used to set the current network temperature and annealing-schedule pass count to values specified in the overall annealing schedule.

propagate   A function used to perform one signal propagation through the entire network, using the current temperature and probabilistic unit selection. This routine should be capable of performing the signal propagation regardless of the network state (clamped or unclamped).

Signal Propagation in the Boltzmann Network. We shall now define the most basic of the needed subroutines, the propagate procedure. The algorithm for this procedure, which follows, presumes that the user-provided apply_input and not-yet-defined set_temp functions have been executed to initialize the outputs of the network's units and temperature parameter to the desired states.

procedure propagate (NET:BOLTZMANN)
{perform one signal propagation pass through network.}

var unit : integer;         {randomly selected unit}
    p : float;              {probability of unit being on}
    neti : float;           {net input to unit}
    threshold : integer;    {point at which unit turns on}
    i, j : integer;         {iteration counters}
    inputs : ^float[];      {pointer to unit outputs array}
    connects : ^float[];    {pointer to unit weights array}
    unclamped : integer;    {index to first unclamped unit}
    firstc : integer;       {index to first connection}

begin
   {locate the first nonvisible unit, assuming first index = 1}
   unclamped = NET.OUTPUTS.FIRST + NET.OUTPUTS.LENGTH - 1;


198 Simulated Anneal<strong>in</strong>gif (NET.INPUTS.FIRST = NET.OUTPUTS.FIRST)then firstc = NET.INPUTS.FIRST {Boltzmann completion}else firstc = NET.INPUTS.LENGTH + 1;{Boltzmann <strong>in</strong>put-output}end if;for i = 1 to NET.UNITSdoif (NET.CLAMPED)then{for as many units <strong>in</strong> network}{if network is clamped}{select an undamped unit}unit = r<strong>and</strong>om (NET. UNITS - undamped)+ undamped;else{select any unit}unit = r<strong>and</strong>om(NET.UNITS);end if;neti =0;{<strong>in</strong>itialize <strong>in</strong>put}<strong>in</strong>puts = NET.NODES".CUTS; {locate <strong>in</strong>puts}connects = NET.NODES".WEIGHTS[i]';{<strong>and</strong> connections}for j = firstc to NET.UNITS{all connections to unit}do{compute sum of products}neti = neti + <strong>in</strong>puts[j] * connects[j];end do;{this next statement is used to improveperformance, as described <strong>in</strong> the text}if (NET.INPUTS.FIRST = NET.OUTPUTS.FIRST)or (i >= firstc)thenneti = neti - <strong>in</strong>puts[i] * connects[i];{no connection}end if;p = 1.0 / (1.0+ exp(-neti / NET.TEMPERATURE));threshold = round (p * 10000); {convert to <strong>in</strong>teger}if (r<strong>and</strong>om(lOOOO)


Before we move on to the next routine, there are three aspects of the propagate procedure that bear further discussion: the selection mechanism for unit update, the computation of the neti term, and the method we have chosen for determining when a unit is or is not active.

In the first case, the Boltzmann network must be able to run with its inputs either clamped or free-running. So that we do not need to have different propagate routines for each mode, we simply use a Boolean variable in the network record to indicate the current mode of operation, and enable the propagate routine to select a unit for update accordingly. If the network is clamped, we cannot select an input or output unit for update. We account for these differences by assuming that the visible units to the network are the first N units in the layer. We thus can be assured that the visible units will not change if we simply select a random unit from the set of units that do not include the first N units. We accomplish this selection by decreasing the range of the random-number generator to the number of network units minus N, and then adding N to the result. Since we have decided that all our arrays will use the first N indices to locate the visible units, generating a random index greater than N will always select a random unit beyond the range of the visible units. However, if the network is unclamped, any unit must be available for update. Inspection of the algorithm for propagate will reveal that these two cases are handled by the if-then-else clause at the beginning of the routine.

Second, there are two salient points regarding the computation of the neti term with respect to the propagate routine. The first point is that connections between input units are processed only when the network is configured as a Boltzmann completion network. In the Boltzmann input-output mode, connections between input units do not exist. This structure conforms to the mathematical model described earlier. The second point about the calculation of the neti term is that we have obviously wasted computer time by processing a connection from each unit to itself twice, once as part of the summation loop during the calculation of the neti value, and once to subtract it out after the total neti has been calculated. The reason we have chosen to implement the algorithm in this manner is, again, to improve performance. Even though we have consumed computer time by processing a nonexistent connection for every unit in the network, we have used far less time than would be required to disallow the computation of the missing connection selectively during every iteration of the summation loop.
Furthermore, we can easily eliminate the error introduced in the input summation by processing the nonexistent connection by subtracting out just that term after completing the loop, prior to updating the output of the unit. You might also observe that we have wasted memory by allocating space for the connections between each unit and itself. We have chosen to implement the network in this fashion to simplify processing, and thus to improve performance as described.

As an example of why it is desirable to optimize the code at the expense of wasted memory, consider the alternative case where only valid connections


are modeled. Since no unit has a connection to itself, but all units have outputs maintained in the same array, the code to process all input connections to a unit would have to be written as two different loops: one for those input PEs that precede the current unit, where the array indices for outputs and connections correspond one-to-one, and one loop for inputs from units that follow, where unit outputs are displaced by one array entry from the corresponding connection. This situation occurs because we have organized the unit outputs and connections as linearly sequential arrays in memory. Such a situation is illustrated in Figure 5.7.

Figure 5.7  The illustration shows array processing (a) when memory is allocated for all possible connections, and (b) when memory is not allocated for intra-unit connections. In (a), the code necessary to perform this input summation simply computes the input value for all connections, then eliminates the error introduced by processing the nonexistent connection to itself. In (b), the code must be more selective about accessing connections, since the one-to-one mapping of connections to units is lost. Obviously, approach (a) is our preferred method, since it will execute much faster than approach (b).
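A small Python sketch may make the trade-off concrete. It is purely illustrative (the array and function names are ours): version (a) sums over every slot, including the unit's own, and then subtracts the self term once; version (b) must test the index on every iteration of the loop.

def net_input_a(outputs, weights, k):
    """Approach (a): sum over all slots, then remove the self term once."""
    neti = sum(o * w for o, w in zip(outputs, weights))
    return neti - outputs[k] * weights[k]     # undo the nonexistent k-to-k term

def net_input_b(outputs, weights, k):
    """Approach (b): test every index to skip the self connection."""
    return sum(outputs[j] * weights[j]
               for j in range(len(outputs)) if j != k)

# Both return the same net input to unit k; (a) avoids the per-iteration
# branch, which is the performance argument made in the text.
outs = [1, 0, 1, 1]
w_k  = [0.0, 0.5, -0.2, 0.3]     # weights into unit k, with the k-to-k slot unused
print(net_input_a(outs, w_k, 0), net_input_b(outs, w_k, 0))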


5.3 The Boltzmann Simulator 201F<strong>in</strong>ally, with respect to decid<strong>in</strong>g when to activate the output of a unit, recallthat the Boltzmann network differs from the other networks that we have studied<strong>in</strong> that PEs are activated stochastically rather than determ<strong>in</strong>istically. Recall thatthe equationdef<strong>in</strong>es how we calculate the probability that a unit x/t is active with respectto its <strong>in</strong>put stimulation (net,t). However, simply know<strong>in</strong>g the probability thata unit will generate an output does not guarantee that the unit will generate anoutput. We must therefore implement a mechanism that allows the computer totranslate the calculated probability <strong>in</strong>to a unit output that occurs with the sameprobability; <strong>in</strong> effect, we must let the computer roll the dice to determ<strong>in</strong>e whenan output is active <strong>and</strong> when it is not.One method for do<strong>in</strong>g this is to make use of the pseudor<strong>and</strong>om-number generatoravailable <strong>in</strong> most high-level computer languages. Here, we take advantageof the fact that the computed probability, p^, will always be a fractional numberrang<strong>in</strong>g between zero <strong>and</strong> one, as illustrated by the graph depicted <strong>in</strong> Figure 5.8.We can map p/, to an <strong>in</strong>teger threshold value between zero <strong>and</strong> some arbitrarilylarge number by simply multiply<strong>in</strong>g the ceil<strong>in</strong>g value by the computed probability<strong>and</strong> round<strong>in</strong>g the result <strong>in</strong>to an <strong>in</strong>teger. We then generate a r<strong>and</strong>om numberbetween zero <strong>and</strong> the selected ceil<strong>in</strong>g, <strong>and</strong>, if the probability does not exceedthe threshold value just computed, the output of the unit is set to one. Assum<strong>in</strong>gthat the pseudor<strong>and</strong>om-number generator has a uniform probability distributionacross the <strong>in</strong>terval of <strong>in</strong>terest, the r<strong>and</strong>om number produced will not exceed thethreshold value with a probability equal to the specified value, pk. Thus, wenow have a means of stochastically activat<strong>in</strong>g unit outputs <strong>in</strong> the network.-20 -15 -10 0Net <strong>in</strong>putFigure 5.8Shown here is a graph of the probability, p k , that the /cth unitis on at five different temperatures, T.


202 Simulated Anneal<strong>in</strong>gBoltzmann Learn<strong>in</strong>g <strong>Algorithms</strong>. There are five additional functions thatmust be def<strong>in</strong>ed to tra<strong>in</strong> the Boltzmann network:set_temp A function used to update the parameters <strong>in</strong> the BOLTZMANN recordto reflect the network temperature at the current step, as specified <strong>in</strong> theanneal<strong>in</strong>g schedule.pplus A function used to compute <strong>and</strong> average the co-occurrence probabilitiesfor a network with clamped <strong>in</strong>puts after it has reached equilibrium at them<strong>in</strong>imum temperature.pm<strong>in</strong>us A function similar to pplus, but used when the network is freerunn<strong>in</strong>g.update_connections The procedure that modifies the connection weights<strong>in</strong> the network to tra<strong>in</strong> the Boltzmann simulator.The implementation of the set-temp function is straightforward, as def<strong>in</strong>edhere:procedure set_temp (NET:BOLTZMANN; N:<strong>in</strong>teger){set the temperature <strong>and</strong> schedule step}beg<strong>in</strong>NET.CURRENT = N; {set current step}NET.TEMPERATURE = NET.ANNEALING".STEP[N].TEMPERATURE;end procedure;On the other h<strong>and</strong>, the estimation of the pj <strong>and</strong> p~ terms is complex,<strong>and</strong> each must be accomplished <strong>in</strong> two steps: <strong>in</strong> the first, statistics about theco-occurrence between network units must be gathered <strong>and</strong> averaged for eachtra<strong>in</strong><strong>in</strong>g pattern; <strong>in</strong> the second, the statistics across all tra<strong>in</strong><strong>in</strong>g patterns arecollected. This separation provides a natural breakpo<strong>in</strong>t for algorithm development.We can therefore def<strong>in</strong>e two algorithms, sum_cooccurrence <strong>and</strong>pplus, that respectively address the two steps identified.We shall now turn our attention to the computation of the co-occurrenceprobability, pj, when the <strong>in</strong>put to the network is clamped to an arbitrary <strong>in</strong>putvector, V a . As we did with propagate, we will assume that the <strong>in</strong>put patternhas been placed on the <strong>in</strong>put units by an earlier call to set-<strong>in</strong>puts. Furthermore,we shall assume that the statistics arrays have been <strong>in</strong>itialized by anearlier call to a user-supplied rout<strong>in</strong>e that we refer to as zero-Statistics.procedure sum_cooccurrence (NET:BOLTZMANN){accumulate co-occurence statistics for thespecified network}var i,j,k : <strong>in</strong>teger;connect : <strong>in</strong>teger;stats : "float[];{loop counters}{co-occurence <strong>in</strong>dex}{po<strong>in</strong>ter to statistics array}


5.3 The Boltzmann Simulator 203beg<strong>in</strong>if (NET.CLAMPED) {if network is clamped}then stats = NET.STATISTICS.CLAMPEDelse stats = NET.STATISTICS.UNCLAMPED;end if;for i = 1 to 5dopropagate (NET);{arbitrary number of cycles}{run the network once}connect = 1; {start at first pair}for j = 1 to NET.UNITS{for all units <strong>in</strong> network}end do;end do;end procedure;doif (NET.NODES.OUTS"[j] = 1) {if unit is on}thenfor k = j to NET.UNITS {for rest of units}doif (NET.NODES.OUTS~[k] = 1)thenstats"[connect] = stats"[connect] + 1;end if;connect = next (connect);end do;else{skip to next unit connect}connect = connect + (NET.UNITS - j);end if;Notice that the sum_cooccurrence rout<strong>in</strong>e does not average the accumulatedresults after complet<strong>in</strong>g the exam<strong>in</strong>ation. We delay this computationto the pplus rout<strong>in</strong>e so that we can cont<strong>in</strong>ue to use the clamped array tocollect statistics across all patterns. If we averaged the results after each cycle,we would be forced to ma<strong>in</strong>ta<strong>in</strong> different arrays for each pattern, thus <strong>in</strong>creas<strong>in</strong>gthe need for storage at a near-exponential rate. In addition, note that, by us<strong>in</strong>ga po<strong>in</strong>ter to the appropriate statistics array, we have generalized the rout<strong>in</strong>e sothat it may be used to collect statistics for the network <strong>in</strong> either the clamped orundamped modes of operation.Before we def<strong>in</strong>e the algorithm needed to estimate the pt. term for theBoltzmann network, we will make a few assumptions. S<strong>in</strong>ce the total number oftra<strong>in</strong><strong>in</strong>g patterns that the network must learn will depend on the application, we


204 Simulated Anneal<strong>in</strong>gmust write the code so that the computer will calculate the co-occurrence statisticsfor a variable number of tra<strong>in</strong><strong>in</strong>g patterns. We must therefore assume thatthe tra<strong>in</strong><strong>in</strong>g data are available to the simulator from some external source (suchas a global array or disk file) that we will refer to as PATTERNS, <strong>and</strong> that thetotal number of tra<strong>in</strong><strong>in</strong>g patterns conta<strong>in</strong>ed <strong>in</strong> this source is obta<strong>in</strong>able througha call to an application-def<strong>in</strong>ed function that we will call how_many. We alsopresume that you will provide the rout<strong>in</strong>es to <strong>in</strong>itialize the co-occurrencearrays to zero, <strong>and</strong> set the outputs of the <strong>in</strong>put network units to the state specifiedby the Uh pattern <strong>in</strong> the PATTERNS data source. We will refer to theseprocedures as <strong>in</strong>itialize_arrays <strong>and</strong> set-<strong>in</strong>puts, respectively. Basedon these assumptions, we shall now def<strong>in</strong>e our algorithm for comput<strong>in</strong>g pplus:procedure pplus (NET:BOLTZMANN)var trials : <strong>in</strong>teger;i : <strong>in</strong>teger;{average over trials}{loop counter}beg<strong>in</strong>trials = how_many (PATTERNS) * 5;{five sums per pattern}for i = 1 to trials{for all trials}doNET.STATISTICS.CLAMPED"[i] = {average results}NET.STATISTICS.CLAMPED'[i] / trials;end do;end procedure;The implementation of pm<strong>in</strong>us is similar to the pplus algorithm, <strong>and</strong> isleft to you as an exercise.5.3.4 The Complete Boltzmann SimulatorNow that we have def<strong>in</strong>ed all the lower-level functions needed to implement theBoltzmann network, we shall describe the algorithms needed to tie everyth<strong>in</strong>gtogether. As previously stated, the two user-provided rout<strong>in</strong>es (set-<strong>in</strong>puts<strong>and</strong> get .outputs) are assumed to <strong>in</strong>itialize <strong>and</strong> recover <strong>in</strong>put <strong>and</strong> output datato or from the simulator for an external process. However, we have yet todef<strong>in</strong>e the two <strong>in</strong>termediate rout<strong>in</strong>es that will be used to perform the networksimulation given the externally provided <strong>in</strong>puts. We now beg<strong>in</strong> to correct thatdeficiency by describ<strong>in</strong>g the algorithm for the anneal process.procedure anneal (NET:BOLTZMANN){perform one pass through anneal<strong>in</strong>g schedule forcurrent <strong>in</strong>put}var passes : <strong>in</strong>teger; {passes at current temperature}steps : <strong>in</strong>teger; {number of steps <strong>in</strong> schedule}i, j : <strong>in</strong>teger; {loop counters}


5.3 The Boltzmann Simulator 205beg<strong>in</strong>steps = NET.ANNEALING".LENGTH;{steps <strong>in</strong> schedule}for i = 1 to steps {for all steps <strong>in</strong> schedule}dopasses = NET.ANNEALING".STEP[i].PASSES;set_temp (NET, i); {set current anneal<strong>in</strong>g step}for j = 1 to passes {for all passes <strong>in</strong> step}dopropagate (NET);{perform required process<strong>in</strong>g cycles}end do;end do;end procedure;All that rema<strong>in</strong>s to complete the learn<strong>in</strong>g-mode algorithms is a rout<strong>in</strong>e toupdate the connection weights <strong>in</strong> the network accord<strong>in</strong>g to the statistics collecteddur<strong>in</strong>g the anneal<strong>in</strong>g process. This rout<strong>in</strong>e will compute <strong>and</strong> apply the Aw termfor each connection <strong>in</strong> the network. To simplify the program, we assume thatthe e constant conta<strong>in</strong>ed <strong>in</strong> Eq. (5.35) will always be 0.3.procedure update_connections (NET:BOLTZMANN){update all connections based on cooccurence statistics}var connect : "float []; {po<strong>in</strong>ter to connection array}pp, pm : float[];{statistics arrays}dupconnect : "float[];{po<strong>in</strong>ter to duplicate connection}i, j, stat : <strong>in</strong>teger; {iteration <strong>in</strong>dices}beg<strong>in</strong>pp = NET.STATISTICS.CLAMPED";{locate pplus statistics}pm = NET.STATISTICS.UNCLAMPED";{locate pm<strong>in</strong>us statistics}stat = 1;{start at first statistic}Lfor i = 1 to NET.UNITS {for all units <strong>in</strong> network}doconnect = NET.NODES.WEIGHTS"[i];{locate connections}for j = (i + 1) to NET.UNITS{for all connections}doconnect"[j] = 0.3 * (pptstat] - pm[stat]);stat = stat + 1;{next statistic}dupconnect = NET.NODES.WEIGHTS"[j] ;{locate tw<strong>in</strong>}


206 Simulated Anneal<strong>in</strong>gend do;end do;end procedure;dupconnect'[i] = connect"[j];{copy to tw<strong>in</strong>}Notice that the update_connections rout<strong>in</strong>e modifies two connectionvalues dur<strong>in</strong>g every iteration, because we are model<strong>in</strong>g bidirectional connectionsas two unidirectional connections, <strong>and</strong> each must always conta<strong>in</strong> the same value.Given the data structures we have def<strong>in</strong>ed for our simulator, we must preserve thebidirectional nature of the network connections by always modify<strong>in</strong>g the values<strong>in</strong> two different arrays, such that these arrays always conta<strong>in</strong> the same data. Thealgorithm for update_connections satisfies this requirement by locat<strong>in</strong>g theassociated tw<strong>in</strong> connection dur<strong>in</strong>g every update cycle, <strong>and</strong> copy<strong>in</strong>g the new valuefrom the current connection to the tw<strong>in</strong> connection, as illustrated <strong>in</strong> Figure 5.9.We shall now describe the algorithm used to tra<strong>in</strong> the Boltzmann simulator.Here, as before, we assume that the tra<strong>in</strong><strong>in</strong>g patterns to be learned are conta<strong>in</strong>ed<strong>in</strong> a globally accessible storage array named PATTERNS, <strong>and</strong> that the numberof patterns <strong>in</strong> this array is obta<strong>in</strong>able through a call to an application-def<strong>in</strong>edrout<strong>in</strong>e, howjnany. Notice that <strong>in</strong> this function, we call the user-suppliedrout<strong>in</strong>e, zero-Statistics, to <strong>in</strong>itialize the statistics arrays.outs/ Tw<strong>in</strong> connectionsTw<strong>in</strong> connectionsFigure 5.9Updat<strong>in</strong>g of the connections <strong>in</strong> the Boltzmann simulator isshown. The weights <strong>in</strong> the arrays highlighted by the darkenedboxes represent connections modified by one pass through theupdate_connections procedure.i


5.4 Us<strong>in</strong>g the Boltzmann Simulator 207function learn (NET:BOLTZMANN){cause network to learn <strong>in</strong>put PATTERNS}var i : <strong>in</strong>teger;{iteration counters}beg<strong>in</strong>NET.CLAMPED = true;zero_statistics (NET);{clamp visible units){<strong>in</strong>it statistics arrays}for i = 1 to how_many (PATTERNS)doset_<strong>in</strong>puts (NET, PATTERNS, i);anneal (NET);{apply anneal<strong>in</strong>g schedule)sum_cooccurrence (NET); {collect statistics)end do;pplus (NET);{estimate p+)NET.CLAMPED = false;{unclamp visible units}for i = 1 to how_many (PATTERNS)doset_<strong>in</strong>puts (NET, PATTERNS, i);anneal (NET);{apply anneal<strong>in</strong>g schedule}sum_cooccurrence (NET); {collect statistics)end do;pm<strong>in</strong>us (NET);update_connections (NET);end procedure;{estimate p-}{modify connections}The algorithm necessary to have the network recall a pattern given an <strong>in</strong>putpattern (production mode) is straightforward, <strong>and</strong> now depends on only therout<strong>in</strong>es def<strong>in</strong>ed by the user to apply the new <strong>in</strong>put pattern to the network <strong>and</strong> toread the result<strong>in</strong>g output. These rout<strong>in</strong>es, apply_<strong>in</strong>puts <strong>and</strong> get.outputs,respectively, are comb<strong>in</strong>ed with anneal to generate the desired output, asshown next:procedure recall (NET:BOLTZMANN; INVEC,OUTVEC:"float[]){stimulate the network to generate an output from <strong>in</strong>put)beg<strong>in</strong>apply_<strong>in</strong>puts (NET, INVEC);anneal (NET);get_output (NET, OUTVEC);end procedure;{set the <strong>in</strong>put){generate output){return output)5.4 USING THE BOLTZMANN SIMULATORWith the exception of the backpropagation network described <strong>in</strong> Chapter 3,the Boltzmann network is probably the most general-purpose network of thosediscussed <strong>in</strong> this text. It can be used either as an associative memory or as


208 Simulated Anneal<strong>in</strong>ga mapp<strong>in</strong>g network, depend<strong>in</strong>g only on whether the output units overlap the<strong>in</strong>put units. These two operat<strong>in</strong>g modes encompass most of the common problemsto which ANS systems have been successfully applied. Unfortunately,the Boltzman network also has the dist<strong>in</strong>ction of be<strong>in</strong>g the slowest of all thesimulators. Nevertheless, there are several applications that can be addressedus<strong>in</strong>g the Boltzmann network; <strong>in</strong> this section, we describe one.This application uses the Boltzmann <strong>in</strong>put-output model to associate patternsfrom "symptom" space with patterns <strong>in</strong> "diagnosis" space.5.4.1 Boltzmann Symptom-Diagnosis ApplicationLet's consider a specific example of a symptom-diagnosis application. We willuse an automobile diagnostic application as the basis for our example. Specifically,we will focus on an application that will diagnose why a car will notstart. We first def<strong>in</strong>e the various symptoms to be considered:• Does noth<strong>in</strong>g: Noth<strong>in</strong>g happens when the key is turned <strong>in</strong> the ignitionswitch.• Clicks: A loud click<strong>in</strong>g noise is generated when the key is turned.• Gr<strong>in</strong>ds: A loud gr<strong>in</strong>d<strong>in</strong>g noise is generated when the key is turned.• Cranks: The eng<strong>in</strong>e cranks as though try<strong>in</strong>g to start, but the eng<strong>in</strong>e doesnot run on its own.• No spark: Remov<strong>in</strong>g one of the spark-plug wires <strong>and</strong> hold<strong>in</strong>g the term<strong>in</strong>alnear the block while crank<strong>in</strong>g the eng<strong>in</strong>e produces no spark.• Cable hot: After the eng<strong>in</strong>e has been cranked, the cable runn<strong>in</strong>g from thebattery to the starter solenoid is hot.• No gas: Remov<strong>in</strong>g the fuel l<strong>in</strong>e from the carburetor (fuel <strong>in</strong>jector) <strong>and</strong>crank<strong>in</strong>g the eng<strong>in</strong>e produces no gas flow out of the fuel l<strong>in</strong>e.Next, we consider the possible causes of the problem, based on thesymptoms:• Battery: The battery is dead.• Solenoid: The starter solenoid is defective.• Starter: The starter motor is defective.• Wires: The ignition wires are defective.• Distributor: The distributor rotor or cap is corroded.• Fuel pump: The fuel pump is defective.Although our list is not a complete representation of all possible problems,any one or a comb<strong>in</strong>ation of these problems could be <strong>in</strong>dicated by the symptoms.To complete our example, we shall construct a matrix <strong>in</strong>dicat<strong>in</strong>g the


5.4 Us<strong>in</strong>g the Boltzmann Simulator 209LikelycausesWSymptoms^X £ «CD-o'5 c"oCDc n)en -Q0)'Q.0.15Does noth<strong>in</strong>gXClicksXXXGr<strong>in</strong>dsXXXXCranksXXXNo sparkXXCable hotXXNo gasXFigure 5.10For the automobile diagnostic problem, we map symptomsto causes.mapp<strong>in</strong>g of the symptoms to the probable causes. This matrix is illustrated <strong>in</strong>Figure 5.10.An exam<strong>in</strong>ation of this matrix <strong>in</strong>dicates the variety of problems that canbe <strong>in</strong>dicated by any one symptom. The matrix also illustrates the problemwe encounter when we attempt to program a system to perform the diagnosticfunction: There rarely is a one-to-one correspondence between symptoms <strong>and</strong>causes. To be successful, our automated diagnostic system must be able tocorrelate many different symptoms, <strong>and</strong>, <strong>in</strong> the event that some symptoms maybe overlooked or absent, must be able to "fill <strong>in</strong> the blanks" of the problembased on just the <strong>in</strong>dicated symptoms.5.4.2 The Boltzmann SolutionWe will now exam<strong>in</strong>e how a Boltzmann network can be applied to the symptomdiagnosisexample we have created. The first step is to construct the networkarchitecture that will solve the problem for us. S<strong>in</strong>ce we would like to be ableto provide the network with observed symptoms, <strong>and</strong> to have it respond withprobable cause, a good c<strong>and</strong>idate architecture would be to map each symptomdirectly to an <strong>in</strong>dividual <strong>in</strong>put PE, <strong>and</strong> each probable causes to an <strong>in</strong>divid-


210 Simulated Anneal<strong>in</strong>gual output PE. S<strong>in</strong>ce our application requires outputs that are different fromthe <strong>in</strong>puts, we select the Boltzmann <strong>in</strong>put-output network as the best c<strong>and</strong>idate.Us<strong>in</strong>g the data from our example, we will need a network with seven <strong>in</strong>putunits <strong>and</strong> six output units. That leaves only the number of <strong>in</strong>ternal units undeterm<strong>in</strong>ed.In this case, there is noth<strong>in</strong>g to <strong>in</strong>dicate how many hidden units willbe required to solve the problem, <strong>and</strong> no external <strong>in</strong>terface considerations thatwill limit the number of hidden units (as there were <strong>in</strong> the data-compressionexample described <strong>in</strong> Chapter 3). We therefore arbitrarily size the Boltzmannnetwork such that it conta<strong>in</strong>s 14 <strong>in</strong>ternal units. If tra<strong>in</strong><strong>in</strong>g <strong>in</strong>dicates that we needmore units <strong>in</strong> order to converge, they can be added at a later time. If we needfewer units, the extras can be elim<strong>in</strong>ated later, although there is no overwhelm<strong>in</strong>greason to remove them <strong>in</strong> such a small network other than improv<strong>in</strong>g theperformance of the simulator.Next, we must def<strong>in</strong>e the data sets to be used to tra<strong>in</strong> the network. Referr<strong>in</strong>gaga<strong>in</strong> to our example matrix, we can consider the data <strong>in</strong> the row vectors ofthe matrix as seven-dimensional <strong>in</strong>put patterns; that is, for each probable-causeoutput that we would like the network to learn, there are seven possible symptomsthat <strong>in</strong>dicate the problem by their existence or absence. This approach willprovide six tra<strong>in</strong><strong>in</strong>g-vector pairs, each consist<strong>in</strong>g of a seven-element symptompattern <strong>and</strong> a six-element problem-<strong>in</strong>dication pattern.We let the existence of a symptom be <strong>in</strong>dicated by a 1, <strong>and</strong> the absence of asymptom be represented by a 0. For any given <strong>in</strong>put vector, the correct cause (orcauses) is <strong>in</strong>dicated by a logic 1 <strong>in</strong> the proper position of the output vector. Thetra<strong>in</strong><strong>in</strong>g-vector pairs produced by the mapp<strong>in</strong>g <strong>in</strong> the symptom-problem matrixfor this example are illustrated <strong>in</strong> Figure 5.11. If you compare Figures 5.11 <strong>and</strong>5.10, you will notice slight differences. You should conv<strong>in</strong>ce yourself that thedifferences are justified.SymptomsLikely causes<strong>in</strong>putsoutputs0100000 1000001000000 1000000 1 0 0 0 1 0 0100000 1 1 0 0 1 0 0010000 0 1 1 1 0 0001 1 0 0 10 0 0 1 1 0000001Figure 5. 11These tra<strong>in</strong><strong>in</strong>g-vector pairs are used for the automobilediagnostic problem.


Suggested Read<strong>in</strong>gs 211All that rema<strong>in</strong>s from this po<strong>in</strong>t is to tra<strong>in</strong> the network on these data pairsus<strong>in</strong>g the Boltzmann algorithms. Once tra<strong>in</strong>ed, the network will produce anoutput identify<strong>in</strong>g the probable cause <strong>in</strong>dicated by the <strong>in</strong>put symptom map.The network will do this when the <strong>in</strong>put is equivalent to one of the tra<strong>in</strong><strong>in</strong>g<strong>in</strong>puts, as expected, <strong>and</strong> it will produce an output <strong>in</strong>dicat<strong>in</strong>g the likely causeof the problem when the <strong>in</strong>put is similar to, but different from, any tra<strong>in</strong><strong>in</strong>g<strong>in</strong>put. This application illustrates the "best-guess" capability of the network <strong>and</strong>highlights the network's ability to deal with noisy or <strong>in</strong>complete data <strong>in</strong>puts.Programm<strong>in</strong>g Exercises5.1. Develop the pseudocode design for the set-<strong>in</strong>puts, apply.<strong>in</strong>puts,<strong>and</strong>get_outputs rout<strong>in</strong>es.5.2. Develop the pseudocode design for the pm<strong>in</strong>us rout<strong>in</strong>e.5.3. The pplus <strong>and</strong> pm<strong>in</strong>us as described are largely redundant <strong>and</strong> can becomb<strong>in</strong>ed <strong>in</strong>to a s<strong>in</strong>gle rout<strong>in</strong>e. Develop the pseudocode design for such arout<strong>in</strong>e.5.4. Implement the Boltzmann simulator <strong>and</strong> test it with the automotive diagnosticdata described <strong>in</strong> Section 5.4. Compare your results with ours, <strong>and</strong>discuss reasons for any differences.5.5. Implement the Boltzmann simulator <strong>and</strong> test it on an application of yourown choos<strong>in</strong>g. Describe the application <strong>and</strong> your choice of tra<strong>in</strong><strong>in</strong>g data,<strong>and</strong> discuss reasons why the test did or did not succeed.5.6. Modify the simulator to conta<strong>in</strong> two additional variable parameters,epsilon (c) <strong>and</strong> cycles, as part of the network record structure.Epsilon will be used to calculate the connection-weight change, <strong>in</strong>steadof the hard-coded 0.3 constant described <strong>in</strong> the text, <strong>and</strong> cyclesshould be used to specify the number of iterations performed dur<strong>in</strong>g thesum-cooccurrence rout<strong>in</strong>e (<strong>in</strong>stead of the five we specified). Retra<strong>in</strong>the network us<strong>in</strong>g the automotive diagnostic data with a different valuefor epsilon, then change cycles, <strong>and</strong> then change both parameters.Describe any performance variations that you observed.Suggested Read<strong>in</strong>gsThe orig<strong>in</strong> of modern <strong>in</strong>formation theory is described <strong>in</strong> a paper by Shannon,which is itself repr<strong>in</strong>ted <strong>in</strong> a collection of papers on the mathematical theoryof communications [7]. A good textbook on statistical mechanics is the one byReif [6]. 
A detailed development of the learn<strong>in</strong>g algorithm for the Boltzmannmach<strong>in</strong>e is given <strong>in</strong> the paper by H<strong>in</strong>ton <strong>and</strong> Sejnowski <strong>in</strong> the PDF series [3].Another worthwhile paper is the one by Ackley, H<strong>in</strong>ton, <strong>and</strong> Sejnowski [1].An early account of us<strong>in</strong>g simulated anneal<strong>in</strong>g to solve optimization problems


212 Bibliographyis given <strong>in</strong> a paper by Kirkpatrick, Gelatt, <strong>and</strong> Vecchi [5]. The concept ofus<strong>in</strong>g the Cauchy distribution to speed the anneal<strong>in</strong>g process is discussed <strong>in</strong> apaper by Szu [8]. A Byte magaz<strong>in</strong>e article by H<strong>in</strong>ton conta<strong>in</strong>s an algorithm forthe Boltzmann mach<strong>in</strong>e that is slightly different from the one presented <strong>in</strong> thischapter [4].Bibliography[1] David H. Ackley, Geoffrey E. H<strong>in</strong>ton, <strong>and</strong> Terrence J. Sejnowski. A learn<strong>in</strong>galgorithm for Boltzmann mach<strong>in</strong>es. In James A. Anderson <strong>and</strong> EdwardRosenfeld, editors, Neurocomput<strong>in</strong>g. MIT Press, Cambridge, MA, pages638-650, 1988. Repr<strong>in</strong>ted from Cognitive Science 9:147-169, 1985.[2] Stuart Geman <strong>and</strong> Donald Geman. Stochastic relaxation, Gibbs distributions,<strong>and</strong> the Bayesian restoration of images. In James A. Anderson<strong>and</strong> Edward Rosenfeld, editors, Neurocomput<strong>in</strong>g. MIT Press, Cambridge,MA, pages 614-634, 1988. Repr<strong>in</strong>ted from IEEE Transactions of PatternAnalysis <strong>and</strong> Mach<strong>in</strong>e Intelligence PAMI-6: 721-741, 1984.[3] G. E. H<strong>in</strong>ton <strong>and</strong> T. J. Sejnowski. Learn<strong>in</strong>g <strong>and</strong> relearn<strong>in</strong>g <strong>in</strong> Boltzmannmach<strong>in</strong>es. In David E. Rumelhart <strong>and</strong> James L. McClell<strong>and</strong>, editors,Parallel Distributed Process<strong>in</strong>g: Explorations <strong>in</strong> the Microstructure ofCognition. MIT Press, Cambridge, MA, pages 282-317, 1986.[4] Geoffrey E. H<strong>in</strong>ton. Learn<strong>in</strong>g <strong>in</strong> parallel networks. Byte, 10(4):265-273,April 1985.[5] S. Kirkpatrick, Jr., C. D. Gelatt, <strong>and</strong> M. P. Vecchi. Optimization by simulatedanneal<strong>in</strong>g. In James A. Anderson <strong>and</strong> Edward Rosenfeld, editors,Neurocomput<strong>in</strong>g. MIT Press, Cambridge, MA, pages 554-568, 1988.Repr<strong>in</strong>ted from Science 220: 671-680, 1983.[6] F. Reif. Fundamentals of Statistical <strong>and</strong> Thermal Physics. McGraw-Hillseries <strong>in</strong> fundamental physics. McGraw-Hill, New York, 1965.[7] C. E. Shannon. The mathematical theory of communication. In C. E. Shannon<strong>and</strong> W. Weaver, editors, The Mathematical Theory of Communication.University of Ill<strong>in</strong>ois Press, Urbana, IL, pages 29-125, 1963.[8] Harold Szu. Fast simulated anneal<strong>in</strong>g. In John S. Denker, editor, <strong>Neural</strong><strong>Networks</strong> for Comput<strong>in</strong>g. American Institute of Physics, New York, pages420-425, 1986.


C H A P T E RThe CounterpropagationNetworkThe Counterpropagation network (CPN) is the most recently developed of themodels that we have discussed so far <strong>in</strong> this text. The CPN is not so much anew discovery as it is a novel comb<strong>in</strong>ation of previously exist<strong>in</strong>g network types.Hecht-Nielsen synthesized the architecture from a comb<strong>in</strong>ation of a structureknown as a competitive network <strong>and</strong> Grossberg's outstar structure [5, 6]. Althoughthe network architecture, as it appears <strong>in</strong> its orig<strong>in</strong>ally published form<strong>in</strong> Figure 6.1, seems rather formidable, we shall see that the operation of thenetwork is quite straightforward.Given a set of vector pairs, (xi,yi), (^2^2), • • •, (xz,,yt), the CPN can learnto associate an x vector on the <strong>in</strong>put layer with a y vector at the output layer.If the relationship between x <strong>and</strong> y can be described by a cont<strong>in</strong>uous function,, such that y = ^(x), the CPN will learn to approximate this mapp<strong>in</strong>g for anyvalue of x <strong>in</strong> the range specified by the set of tra<strong>in</strong><strong>in</strong>g vectors. Furthermore, ifthe <strong>in</strong>verse of $ exists, such that x is a function of y, then the CPN will also learnthe <strong>in</strong>verse mapp<strong>in</strong>g, x = 3>~'(y).' For a great many cases of practical <strong>in</strong>terest,the <strong>in</strong>verse function does not exist. In these situations, we can simplify thediscussion of the CPN by consider<strong>in</strong>g only the forward-mapp<strong>in</strong>g case, y = $(x).In Figure 6.2, we have reorganized the CPN diagram <strong>and</strong> have restrictedour consideration to the forward-mapp<strong>in</strong>g case. The network now appears as'We are us<strong>in</strong>g the term function <strong>in</strong> its strict mathematical sense. If y is a function of x. then everyvalue of x corresponds to one <strong>and</strong> only one value of y. Conversely, if x is a function of y, thenevery value of y corresponds to one <strong>and</strong> only one value of x. An example of a function whose<strong>in</strong>verse is not a function is, y = x 2 , — oo < x < oo. A somewhat more abstract, but perhapsmore <strong>in</strong>terest<strong>in</strong>g, situation is a function that maps images of animals to the name of the animal. Forexample, "CAT" = 3>("picture of cat"). Each picture represents only one animal, but each animalcorresponds to many different pictures.213


214 The Counterpropagation Network" O-t,LayersFigure 6.1This spiderlike diagram of the CRN architecture has five layers:two <strong>in</strong>put layers (1 <strong>and</strong> 5), one hidden layer (3), <strong>and</strong> twooutput layers (2 <strong>and</strong> 4). The CRN gets its name from the factthat the <strong>in</strong>put vectors on layers 1 <strong>and</strong> 2 appear to propagatethrough the network <strong>in</strong> opposite directions. Source: Repr<strong>in</strong>tedwith permission from Robert Hecht-Nielsen, ' 'Counterpropagationnetworks." In Proceed<strong>in</strong>gs of the IEEE First InternationalConference on <strong>Neural</strong> <strong>Networks</strong>. San Diego, CA, June 1987.©1987 IEEE.a three-layer architecture, similar to, but not exactly like, the backpropagationnetwork discussed <strong>in</strong> Chapter 3. An <strong>in</strong>put vector is applied to the units onlayer 1. Each unit on layer 2 calculates its net-<strong>in</strong>put value, <strong>and</strong> a competitionis held to determ<strong>in</strong>e which unit has the largest net-<strong>in</strong>put value. That unit isthe only unit that sends a value to the units <strong>in</strong> the output layer. We shallpostpone a detailed discussion of the process<strong>in</strong>g until we have exam<strong>in</strong>ed thevarious components of the network.CPNs are <strong>in</strong>terest<strong>in</strong>g for a number of reasons. By comb<strong>in</strong><strong>in</strong>g exist<strong>in</strong>g networktypes <strong>in</strong>to a new architecture, the CPN h<strong>in</strong>ts at the possibility of form<strong>in</strong>gother, useful networks from bits <strong>and</strong> pieces of exist<strong>in</strong>g structures. Moreover,<strong>in</strong>stead of employ<strong>in</strong>g a s<strong>in</strong>gle learn<strong>in</strong>g algorithm throughout the network, theCPN uses a different learn<strong>in</strong>g procedure on each layer. The learn<strong>in</strong>g algorithmsallow the CPN to tra<strong>in</strong> quite rapidly with respect to other networks that we havestudied so far. The tradeoff is that the CPN may not always yield sufficient


6.1 CPN Build<strong>in</strong>g Blocks 215y' Output vectorLayer 3Layer 2Layer 1x Input vectory Input vectorFigure 6.2 The forward-mapp<strong>in</strong>g CPN is shown. Vector pairs fromthe tra<strong>in</strong><strong>in</strong>g set are applied to layer 1. After tra<strong>in</strong><strong>in</strong>g iscomplete, apply<strong>in</strong>g the vectors (x, 0) to layer 1 will result <strong>in</strong> anapproximation, y', to the correspond<strong>in</strong>g y vector, at the outputlayer, layer 3. See Section 6.2 for more details of the tra<strong>in</strong><strong>in</strong>gprocedure.accuracy for some applications. Nevertheless, the CPN rema<strong>in</strong>s a good choicefor some classes of problems, <strong>and</strong> it provides an excellent vehicle for rapid prototyp<strong>in</strong>gof other applications. In the next section, we shall exam<strong>in</strong>e the variousbuild<strong>in</strong>g blocks from which the CPN is constructed.6.1 CPN BUILDING BLOCKSThe PEs <strong>and</strong> network structures that we shall study <strong>in</strong> this section play animportant role <strong>in</strong> many of the subsequent chapters <strong>in</strong> this text. For that reason,we present this <strong>in</strong>troductory material <strong>in</strong> some detail. There are four majorcomponents: an <strong>in</strong>put layer that performs some process<strong>in</strong>g on the <strong>in</strong>put data, aprocess<strong>in</strong>g element called an <strong>in</strong>star, a layer of <strong>in</strong>stars known as a competitivenetwork, <strong>and</strong> a structure known as an outstar. In Section 6.2, we shall returnto the discussion of the CPN.


216 The Counterpropagation Network6.1.1 The Input LayerDiscussions of neural networks often ignore the <strong>in</strong>put-layer process<strong>in</strong>g elements,or consider them simply as pass-through units, responsible only for distribut<strong>in</strong>g<strong>in</strong>put data to other process<strong>in</strong>g elements. Computer simulations of networksusually arrange for all <strong>in</strong>put data to be scaled or normalized to accommodatecalculations by the computer's CPU. For example, <strong>in</strong>put-value magnitudes mayhave to be scaled to prevent overflow error dur<strong>in</strong>g the sum-of-product calculationsthat dom<strong>in</strong>ate most network simulations. Biological systems do not havethe benefits of such preprocess<strong>in</strong>g; they must rely on <strong>in</strong>ternal mechanisms toprevent saturation of neural activities by large <strong>in</strong>put signals. In this section,we shall exam<strong>in</strong>e a mechanism of <strong>in</strong>teraction among process<strong>in</strong>g elements thatovercomes this noise-saturation dilemma [2]. Although the mechanism hassome neurophysiological plausibility, we shall not exam<strong>in</strong>e any of the biologicalimplications of the model.Exam<strong>in</strong>e the layer of process<strong>in</strong>g elements shown <strong>in</strong> Figure 6.3. There isone <strong>in</strong>put value, /;, for each of the n units on the layer. The total <strong>in</strong>put pattern<strong>in</strong>tensity is given by 7 = 5^/j. Correspond<strong>in</strong>g to each Ij, we shall def<strong>in</strong>e aquantity(6.1)Figure 6.3 This layer of <strong>in</strong>put units has n process<strong>in</strong>g elements,{v\, V2,..., v n }. Each <strong>in</strong>put value, I u is connected withan excitatory connection (arrow with a plus sign) to itscorrespond<strong>in</strong>g process<strong>in</strong>g element, v- t . Each Ii is connectedalso to every other process<strong>in</strong>g element, Vk, k ^ i, withan <strong>in</strong>hibitory connection (arrow with a m<strong>in</strong>us sign). Thisarrangement is known as on-center, off-surround. The outputof the layer is proportional to the normalized reflectancepattern.


6.1 CRN Build<strong>in</strong>g Blocks 217The vector, (0i, 0 2 , . . . , ©„)*, is called a reflectance pattern. Notice that thispattern is normalized <strong>in</strong> the sense that J^ 0; = 1 .The reflectance pattern is <strong>in</strong>dependent of the total <strong>in</strong>tensity of the correspond<strong>in</strong>g<strong>in</strong>put pattern. For example, the reflectance pattern correspond<strong>in</strong>g tothe image of a person's face would be <strong>in</strong>dependent of whether the person werebe<strong>in</strong>g viewed <strong>in</strong> bright sunlight or <strong>in</strong> shade. We can usually recognize a familiarperson <strong>in</strong> a wide variety of light<strong>in</strong>g conditions, even if we have not seen herpreviously <strong>in</strong> the identical situation. This experience suggests that our memorystores <strong>and</strong> recalls reflectance patterns.The outputs of the process<strong>in</strong>g elements <strong>in</strong> Figure 6.3 are governed by theset of differential equations,(6.2)where 0 < x,;(0) < B, <strong>and</strong> A,B > 0. Each process<strong>in</strong>g element receives anet excitation (on-center) of (B - x z )/,; from its correspond<strong>in</strong>g <strong>in</strong>put value, l % .The addition of <strong>in</strong>hibitory connections (off-surround), -xJk, from other unitsis responsible for prevent<strong>in</strong>g the activity of the process<strong>in</strong>g element from ris<strong>in</strong>g<strong>in</strong> proportion to the absolute pattern <strong>in</strong>tensity, /j.Once an <strong>in</strong>put pattern is applied, the process<strong>in</strong>g elements quickly reach anequilibrium state (x/ = 0) withwhere we have used the def<strong>in</strong>ition of 0, <strong>in</strong> Eq. (6.1). The output pattern isnormalized, s<strong>in</strong>ceBI(6.3)which is always less than B. Thus, the pattern of activity that develops acrossthe <strong>in</strong>put units is proportional to the reflectance pattern, rather than to the orig<strong>in</strong>al<strong>in</strong>put pattern.After the <strong>in</strong>put pattern is removed, the activities of the units do not rema<strong>in</strong>at their equilibrium values, nor do they return immediately to zero. The activitypattern persists for some time while the term —Axi reduces the activitiesgradually back to a value of zero.An <strong>in</strong>put layer of the type discussed <strong>in</strong> this section is used for both the x-vector <strong>and</strong> y-vector portions of the CPN <strong>in</strong>put layer shown <strong>in</strong> Figure 6. 1 . Whenperform<strong>in</strong>g a digital simulation of the CPN, we can simplify the program bynormaliz<strong>in</strong>g the <strong>in</strong>put vectors <strong>in</strong> software. Whether the <strong>in</strong>put-pattern normalizationis accomplished us<strong>in</strong>g Eq. (6.2), or by some preprocess<strong>in</strong>g <strong>in</strong> software,depends on the particular implementation of the network.


218 The Counterpropagation NetworkExercise 6.1:1. Solve Eq. (6.2) to f<strong>in</strong>d Xi(t) explicitly, assum<strong>in</strong>g x;(0) = 0 <strong>and</strong> a constant<strong>in</strong>put pattern, I.2. Assume that the <strong>in</strong>put pattern is removed at t = t', <strong>and</strong> f<strong>in</strong>d x,(i) for t > t'.3. Draw the graph of x t (i) from t = 0 to some £ » £'. What determ<strong>in</strong>es howquickly x t (t) (a) reaches its equilibrium value, <strong>and</strong> (b) decays back to zeroafter the <strong>in</strong>put pattern is removed?Exercise 6.2: Investigate the equationsx, = -Axi + (B - Xi)Iias a possible alternative to Eq. (6.2) for the <strong>in</strong>put-layer process<strong>in</strong>g elements. Fora constant reflectance pattern, what happens to the activation of each process<strong>in</strong>gelement as the total pattern <strong>in</strong>tensity, /, <strong>in</strong>creases?Exercise 6.3: Consider the equations±i = -Ax, +(B- Xi)Ii - (x, + C)^2 I kk^,.which differ from Eq. (6.2) by an additional <strong>in</strong>hibition term, C^fc _^ Ik'.1. Suppose /, = 0, but X!fr/i Ik > 0- Show that x/ can assume a negativevalue. Does this result make sense? (Consider what it means for a realneuron to have zero activation <strong>in</strong> terms of the neuron's rest<strong>in</strong>g potential.)2. Show that the system suppresses noise by requir<strong>in</strong>g that the reflectancevalues, 9,;, be greater than some m<strong>in</strong>imum value before they will excite apositive activity <strong>in</strong> the process<strong>in</strong>g element.6.1.2 The InstarThe hidden layer of the CPN comprises a set of process<strong>in</strong>g elements known as<strong>in</strong>stars. In this section, we shall discuss the <strong>in</strong>star <strong>in</strong>dividually. In the follow<strong>in</strong>gsection, we shall exam<strong>in</strong>e the set of <strong>in</strong>stars that operate together to form theCPN hidden layer.The <strong>in</strong>star is a s<strong>in</strong>gle process<strong>in</strong>g element that shares its general structure<strong>and</strong> process<strong>in</strong>g functions with many other process<strong>in</strong>g elements described <strong>in</strong> thistext. We dist<strong>in</strong>guish it by the specific way <strong>in</strong> which it is tra<strong>in</strong>ed to respond to<strong>in</strong>put data.Let's beg<strong>in</strong> with a general process<strong>in</strong>g element, such as the one <strong>in</strong> Figure6.4(a). If the arrow represent<strong>in</strong>g the output is ignored, the process<strong>in</strong>gelement can be redrawn <strong>in</strong> the starlike configuration of Figure 6.4(b). The<strong>in</strong>ward-po<strong>in</strong>t<strong>in</strong>g arrows identify the <strong>in</strong>star structure, but we restrict the use ofthe term <strong>in</strong>star to those units whose process<strong>in</strong>g <strong>and</strong> learn<strong>in</strong>g are governed by the


6.1 CPN Build<strong>in</strong>g Blocks 219(a)(b)Figure 6.4This figure shows (a) the general form of the process<strong>in</strong>g elementwith <strong>in</strong>put vector I, weight vector w, <strong>and</strong> output value y; <strong>and</strong>(b) the <strong>in</strong>star form of the process<strong>in</strong>g element <strong>in</strong> (a). Notice thatthe arrow represent<strong>in</strong>g the output is miss<strong>in</strong>g, although it is stillpresumed to exist.equations <strong>in</strong> this section. The net-<strong>in</strong>put value is calculated, as usual, by the dotproduct of the <strong>in</strong>put <strong>and</strong> weight vectors, net = I • w. We shall assume that the<strong>in</strong>put vector, I, <strong>and</strong> the weight vector, w, have been normalized to a length of 1.The output of the <strong>in</strong>star is governed by the equationy = —ay + b net (6.4)where a, b > 0. The dynamic behavior of y is illustrated <strong>in</strong> Figure 6.5.We can solve Eq. (6.4) to get the output as a function of time. Assum<strong>in</strong>gthe <strong>in</strong>itial output is zero, <strong>and</strong> that a nonzero <strong>in</strong>put vector is present from timet = 0 until time t,y(t)=-net(l~e~" t } (6.5)a \ /The equilibrium value of y(t) isif* = -net (6.6)If the <strong>in</strong>put vector is removed at time t 1 , after equilibrium has been reached,theny(t) = y e< >e- a(t - t " > (6.7)for t > t'.


220 The Counterpropagation Networkf(net)aQ.OTimeInput vector appliedInput vector removedFigure 6.5This graph illustrates the output response of an <strong>in</strong>star. Whenthe <strong>in</strong>put vector is nonzero, the output rises to an equilibriumvalue of (b/a)net. If the <strong>in</strong>put vector is removed, the outputfalls exponentially with a time constant of I/a.Notice that, for a given a <strong>and</strong> b, the output at equilibrium will be largerwhen the net-<strong>in</strong>put value is larger. Figure 6.6 shows a diagram of the weightvector, w, of an <strong>in</strong>star, <strong>and</strong> an <strong>in</strong>put vector, I. The net-<strong>in</strong>put value determ<strong>in</strong>eshow close to each other the two vectors are as measured by the angle betweenthem, 6. The largest equilibrium output will occur when the <strong>in</strong>put <strong>and</strong> weightvectors are perfectly aligned (8 = 0).If we want the <strong>in</strong>star to respond maximally to a particular <strong>in</strong>put vector,we must arrange that the weight vector be made identical to the desired <strong>in</strong>putvector. The <strong>in</strong>star can learn to respond maximally to a particular <strong>in</strong>put vectorif the <strong>in</strong>itial weight vector is allowed to change accord<strong>in</strong>g to the equation\V — — CW •dly (6.8)where y is the output of the <strong>in</strong>star, <strong>and</strong> c, d > 0. Notice the relationship betweenEq. (6.8) <strong>and</strong> the Hebbian learn<strong>in</strong>g rule discussed <strong>in</strong> Chapter 1. The second termon the right side of Eq. (6.8) conta<strong>in</strong>s the product of the <strong>in</strong>put <strong>and</strong> the outputof a process<strong>in</strong>g element. Thus, when both are large, the weight on the <strong>in</strong>putconnection is re<strong>in</strong>forced, as predicted by Hebb's theory.Equation 6.8 is difficult to <strong>in</strong>tegrate because y is a complex function oftime through Eq. (6.5). We can try to simplify the problem by assum<strong>in</strong>g that y


6.1 CPN Build<strong>in</strong>g Blocks 221Initial w(a)(b)Figure 6.6 This figure shows an example of an <strong>in</strong>put vector <strong>and</strong> a weightvector on an <strong>in</strong>star. (a) This figure illustrates the relationshipbetween the <strong>in</strong>put <strong>and</strong> weight vectors of an <strong>in</strong>star. S<strong>in</strong>ce thevectors are normalized, net = I • w = ||I||||w||cosd — cosd. (b)The <strong>in</strong>star learns an <strong>in</strong>put vector by rotat<strong>in</strong>g its weight vectortoward the <strong>in</strong>put vector until both vectors are aligned.reaches its equilibrium value much faster than changes <strong>in</strong> w can occur. Then,y = y"> — (a/6)net. Because net = w • I, Eq. (6.8) becomesw == —cw dl(w-I) (6.9)where we have absorbed the factor a/6 <strong>in</strong>to the constant d. Although Eq. (6.9)is still not directly solvable, the assumption that changes to weights occur moreslowly than do changes to other parameters is important. We shall see more ofthe utility of such an assumption <strong>in</strong> Chapter 8. Figure 6.7 illustrates the solutionto Eq. (6.9) for a simple two-dimensional case.An alternative approach to Eq. (6.8) beg<strong>in</strong>s with the observation that, <strong>in</strong> theabsence of an <strong>in</strong>put vector, I, the weight vector will cont<strong>in</strong>uously decay awaytoward zero (w = —cw). This effect can be considered as a forgetfulness onthe part of the process<strong>in</strong>g element. To avoid this forgetfulness, we can modifyEq. (6.8) such that any change to the weight vector depends on whether there isan <strong>in</strong>put vector there to be learned. If an <strong>in</strong>put vector is present, then net = w • Iwill be nonzero. Instead of Eq. (6.8), we can use as the learn<strong>in</strong>g law,w = (-cw + dl)[/(net) (6.10)where[/(net) =1 net > 00 net = 0


222 The Counterpropagation Network0.866' Initial w =/ (0.5,0.866)/ 0.5Figure 6.7Time step, tGiven an <strong>in</strong>put vector I = (0,1) <strong>and</strong> an <strong>in</strong>itial weight vector,w(0) = (0.5,0.866), the components, w\ <strong>and</strong> wi, of the weightvector evolve <strong>in</strong> time accord<strong>in</strong>g to Eq. (6.9), as shown <strong>in</strong> thegraph. The weight vector eventually aligns itself to the <strong>in</strong>putvector such that w = (0.1) = I. For this example, c = d — 1.Equation (6.10) can be <strong>in</strong>tegrated directly for t/(net) = 1. Notice thatw f ' q = (d/c)l, mak<strong>in</strong>g c = d a condition that must be satisfied for w to evolvetoward an exact match with I. Us<strong>in</strong>g this fact, we can rewrite Eq. (6.10) <strong>in</strong> aform more suitable for later digital simulations:Aw = a(I — w)(6.H)In Eq. (6.11), we have used the approximation dvi/dt « Aw/At, <strong>and</strong> have leta = cAt. An approximate solution to Eq. (6.10) would bevr(t+ l) = (6.12)for a < 1; see Figure 6.8.A s<strong>in</strong>gle <strong>in</strong>star learn<strong>in</strong>g a s<strong>in</strong>gle <strong>in</strong>put vector does not provide an <strong>in</strong>terest<strong>in</strong>gcase. Let's consider the situation where we have a number of <strong>in</strong>put vectors,all relatively close together <strong>in</strong> what we shall call a cluster. A cluster mightrepresent items of a s<strong>in</strong>gle class. We would like the <strong>in</strong>star to learn some formof representative vector of the class: the average, for example. Figure 6.9illustrates the idea.Learn<strong>in</strong>g takes place <strong>in</strong> an iterative fashion:1. Select an <strong>in</strong>put vector, /,, at r<strong>and</strong>om, accord<strong>in</strong>g to the probability distributionof the cluster. (If the cluster is not uniformly distributed, you shouldselect <strong>in</strong>put vectors more frequently from regions where there is a higherdensity of vectors.)2. Calculate a(I — w), <strong>and</strong> update the weight vector accord<strong>in</strong>g to Eq. (6.12).


6.1 CPN Build<strong>in</strong>g Blocks 223Aw = cc(I - w)Figure 6.8 The quantity (I - w) is a vector that po<strong>in</strong>ts from w toward I.In Eq. (6.12), w moves <strong>in</strong> discrete timesteps toward I. Noticethat w does not rema<strong>in</strong> normalized.3. Repeat steps 1 <strong>and</strong> 2 for a number of times equal to the number of <strong>in</strong>putvectors <strong>in</strong> the cluster.4. Repeat step 3 several times.The last item <strong>in</strong> this list is admittedly vague. There is a way to calculate anaverage error as learn<strong>in</strong>g proceeds, which can be used as a criterion for halt<strong>in</strong>gthe learn<strong>in</strong>g process (see Exercise 6.4). In practice, error values are rarelyused, s<strong>in</strong>ce the <strong>in</strong>star is never used as a st<strong>and</strong>-alone unit, <strong>and</strong> other criteria willdeterm<strong>in</strong>e when to stop the tra<strong>in</strong><strong>in</strong>g.It is also a good idea to reduce the value of a as tra<strong>in</strong><strong>in</strong>g proceeds. Oncethe weight vector has reached the middle of the cluster, outly<strong>in</strong>g <strong>in</strong>put vectorsmight pull w out of the area if a is very large.When the weight vector has reached an average position with<strong>in</strong> the cluster,it should stay generally with<strong>in</strong> a small region around that average position. Inother words, the average change <strong>in</strong> the weight vector, (Aw), should becomevery small. Movements of w <strong>in</strong> one direction should generally be offset bymovements <strong>in</strong> the opposite direction. If we assume (Aw) = 0, then Eq. (6.11)shows thatwhich is what we wanted.(w) = (I)


224 The Counterpropagation NetworkFigure 6.9(a)This figure illustrates how an <strong>in</strong>star learns to respond to acluster of <strong>in</strong>put vectors, (a) To learn a cluster of <strong>in</strong>put vectors,we select the <strong>in</strong>itial weight vector to be equal to some memberof the cluster. This <strong>in</strong>itialization ensures that the weight vectoris <strong>in</strong> the right region of space, (b) As learn<strong>in</strong>g proceeds, theweight vector will eventually settle down to some small regionthat represents an average, (w), of all the <strong>in</strong>put vectors.(b)Now that we have seen how an <strong>in</strong>star can learn the average of a cluster of<strong>in</strong>put vectors, we can talk about layers of <strong>in</strong>stars. Instars can be grouped together<strong>in</strong>to what is known as a competitive network. The competitive network formsthe middle layer of the CPN <strong>and</strong> is the subject of the next section.Exercise 6.4: For a given <strong>in</strong>put vector, we can def<strong>in</strong>e the <strong>in</strong>star error as themagnitude of the difference between the <strong>in</strong>put vector <strong>and</strong> the weight vector:£ 2 = ||I, — w||. Show that the mean squared error can be written as(e 2 } =2(1- (cos9,))where 9, is the angle between I; <strong>and</strong> w.6.1.3 Competitive <strong>Networks</strong>In the previous section, we saw how an <strong>in</strong>dividual <strong>in</strong>star could learn to respondto a certa<strong>in</strong> group of <strong>in</strong>put vectors clustered together <strong>in</strong> a region of space.Suppose we have several <strong>in</strong>stars grouped <strong>in</strong> a layer, as shown <strong>in</strong> Figure 6.10,each of which responds maximally to a group of <strong>in</strong>put vectors <strong>in</strong> a differentregion of space. We can say that this layer of <strong>in</strong>stars classifies any <strong>in</strong>put vector,because the <strong>in</strong>star with the strongest response for any given <strong>in</strong>put identifies theregion of space <strong>in</strong> which the <strong>in</strong>put vector lies.


Figure 6.10 A layer of instars arranged in a competitive network. Each unit receives the input vector I = (I_1, I_2, ..., I_n)^t, and the ith unit has associated with it the weight vector w_i = (w_i1, w_i2, ..., w_in)^t. Net-input values are calculated in the usual way: net_i = I · w_i. The winner of the competition is the unit with the largest net input.

Rather than our examining the response of each instar to determine which is the largest, our task would be simpler if the instar with the largest response were the only unit to have a nonzero output. This effect can be accomplished if the instars compete with one another for the privilege of turning on. Since there is no external judge to decide which instar has the largest net input, the units must decide among themselves who is the winner. This decision process requires communication among all the units on the layer; it also complicates the analysis, since there are more inputs to each unit than just the input from the previous layer. In the following discussion, we shall be focusing on unit activities, rather than on unit output values.

Figure 6.11 illustrates the interconnections that implement competition among the instars. The unit activations are determined by differential equations. There is a variety of forms that these differential equations can take; one of the simplest is

ẋ_i = −Ax_i + (B − x_i)[f(x_i) + net_i] − x_i Σ_{k≠i} [f(x_k) + net_k]    (6.13)

where A, B > 0 and f(·) is an output function that we shall specify shortly [2]. This equation should be compared to Eq. (6.2). We can convert Eq. (6.2) to


Figure 6.11 An on-center off-surround system for implementing competition among a group of instars. Each unit receives a positive feedback signal to itself and sends an inhibitory signal to all other units on the layer. The unit whose weight vector most closely matches the input vector sends the strongest inhibitory signals to the other units and receives the largest positive feedback from itself.

Eq. (6.13) by replacing every occurrence of I_j in Eq. (6.2) with f(x_j) + net_j, for all j. The relationship between the constants A and B, and the form of the function, f(x_j), determine how the solutions to Eq. (6.13) evolve in time. We shall now look at specific cases.

Equation (6.13) is somewhat easier to analyze if we convert it to a pair of equations: one that describes the reflectance variables, X_i = x_i / Σ_k x_k, and one that describes the total pattern intensity, x = Σ_k x_k. First, rearrange Eq. (6.13) as follows:

ẋ_i = −Ax_i + B[f(x_i) + net_i] − x_i Σ_k [f(x_k) + net_k]

Next, sum over i to get

ẋ = −Ax + (B − x) Σ_k [f(x_k) + net_k]    (6.14)

Now substitute x_i = xX_i into Eq. (6.13) and use Eq. (6.14) to simplify the result. If we make the definition g(w) = w⁻¹ f(w), then we get

xẊ_i = BxX_i Σ_k X_k [g(xX_i) − g(xX_k)] + B(1 − X_i) net_i − BX_i Σ_{k≠i} net_k    (6.15)

We can now evaluate the asymptotic behavior of Eqs. (6.14) and (6.15). Let's begin with the simple, linear case of f(w) = w. Since g(w) = w⁻¹ f(w), g(w) = 1 and the first term on the right of Eq. (6.15) is zero. The reflectance variables stabilize at

X_i^eq = net_i / Σ_k net_k,   so that   x_i^eq = X_i^eq x^eq


where x^eq comes from setting Eq. (6.14) equal to zero. This equation shows that the units will accurately register any pattern presented as an input. Now let's look at what happens after the input pattern is removed.

Let net_i = 0 for all i. Then, Ẋ_i is identically zero for all time and the reflectance variables remain constant. The unit activations now depend only on x, since x_i = xX_i. Equation (6.14) reduces to

ẋ = (B − A − x)x

If B < A, then ẋ < 0 and x → 0. However, if B > A, then x → B − A and the activity pattern becomes stored permanently on the units. This behavior is unlike the behavior of the input units described by Eq. (6.2), where removal of the input pattern always resulted in a decay of the activities. We shall call this storage effect short-term memory (STM). Even though the effect appears to be permanent, we can assume the existence of a reset mechanism that will remove the current pattern so that another can be stored. Figure 6.12 illustrates a simple example for the linear output function.

For our second example, we assume a faster-than-linear output function, f(w) = w². In this case g(w) = w. Notice that the first term on the right of Eq. (6.15) contains the factor [g(xX_i) − g(xX_k)]. For the quadratic output function, this expression reduces to x[X_i − X_k]. If X_i > X_k for all values of k ≠ i, then the first term in Eq. (6.15) is an excitatory term. If X_i < X_k for k ≠ i, then the first term in Eq. (6.15) becomes an inhibitory term. Thus, this network tends to enhance the activity of the unit with the largest value of X_i. This effect is illustrated in Figure 6.13. After the input pattern is removed, X_i will be greater than zero for only the unit with the largest value of X_i.

Exercise 6.5: Use Eq. (6.14) to show that, after the input pattern has been removed, the total activity of the network, x, is bounded by the value of B.

We now have an output function that can be used to implement a winner-take-all competitive network. The quadratic output function (or, for that matter, any function f(w) = wⁿ, where n > 1) can be used in off-surround, inhibitory connections to suppress all inputs except the largest. This effect is the ultimate in noise suppression: The network assumes that everything except the largest signal is noise.

It is possible to combine the qualities of noise suppression with the ability to store an accurate representation of an input vector: Simply arrange for the output function to be faster than linear for small activities, and linear for larger activities. If we add the additional constraint that the unit output must be bounded at all times, we must have the output function increase at a less-than-linear rate for large values of activity. This combination results in a sigmoid output function, as illustrated in Figure 6.14.

The mathematical analysis of Eqs.
(6.14) and (6.15) with a sigmoid output function is considerably more complicated than it was for the other two cases. All the cases considered here, as well as many others, are treated in depth by


Figure 6.12 This series of figures shows the result of applying a certain input vector to units having a linear output function. (a) The graph of the output function, f(w) = w. (b) This figure shows the result of the sustained application of the input vector. The units reach equilibrium activities as shown. (c) After removal of the input vector, the units reach an equilibrium such that the pattern is stored in STM.

Grossberg [4]. Use of the sigmoid results in the existence of a quenching threshold (QT). Units whose net inputs are above the QT will have their activities enhanced. The effect is one of contrast enhancement. An extreme example is illustrated in Figure 6.14.

Reference back to Figure 6.1 or 6.2 will reveal that there are no obvious interconnections among the units on the competitive middle layer. In a digital simulation of a competitive network, the actual interconnections are unnecessary. The CPU can act as an external judge to determine which unit has the largest net-


Figure 6.13 This series of figures is analogous to those in Figure 6.12, but with units having a quadratic output function. (a) The graph of the quadratic output function. (b) While the input vector is present, the network tends to enhance the activity of the unit with the largest activity. For the given input pattern, the unit activities reach the equilibrium values shown. (c) After the input pattern is removed, all activities but the largest decay to zero.

input value. The winning unit would then be assigned an output value of 1. The situation is similar for the input layer. In a software simulation, we do not need on-center off-surround interactions to normalize the input vector; that can also be done easily by the CPU. These considerations aside, attention to the underlying theory is essential to understanding. When digital simulations give way to neural-network integrated circuitry, such an understanding will be required.
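Before leaving the competitive dynamics, the two regimes of Eq. (6.14) after the input is removed (decay toward 0 when B < A, STM storage at B − A when B > A) are easy to verify numerically. The following Euler-integration sketch is illustrative only; the constants, step size, and starting activity are arbitrary assumptions.

    # Euler integration of x' = (B - A - x) * x, the total-activity equation
    # after the input pattern is removed (linear output function).
    def settle(A, B, x0=0.5, dt=0.01, steps=5000):
        x = x0
        for _ in range(steps):
            x += dt * (B - A - x) * x
        return x

    print(settle(A=2.0, B=1.0))   # B < A: total activity decays toward 0
    print(settle(A=1.0, B=3.0))   # B > A: activity is stored at about B - A = 2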


Figure 6.14 This figure is analogous to Figures 6.12 and 6.13, but with units having a sigmoid output function. (a) The sigmoid output function combines noise suppression at low activities, linear pattern storage at intermediate values, and a bounded output at large activity values. (b) When the input vector is present, the unit activities reach an equilibrium value, as shown. (c) After removal of the input vector, the activities above a certain threshold are enhanced, whereas those below the threshold are suppressed.

6.1.4 The Outstar

The final leg of our journey through CPN components brings us to Grossberg's outstar structure. As Figure 6.15 shows, an outstar is composed of all of the units in the CPN outer layer and a single hidden-layer unit. Thus, the outer-layer units participate in several different outstars: one for each unit in the middle layer.


Figure 6.15 This figure illustrates the outstar and its relationship to the CPN architecture. (a) The dotted line encompasses one of the outstar structures in the CPN. The line is intentionally drawn through the middle-layer unit to indicate the dual functionality of that unit. Each middle-layer unit combines with the outer layer to form an individual outstar. (b) A single outstar unit is shown. The output units of the outstar have two inputs: z, from the connecting unit of the previous layer, and y_i, which is the training input. The training input is present during only the learning period. The output of the outstar is the vector y' = (y'_1, y'_2, ..., y'_n)^t. (c) The outstar is redrawn in a suggestive configuration. Note that the arrows point outward in a configuration that is complementary to the instar.


In Chapter 1, we gave a brief description of Pavlovian conditioning in terms of Hebbian learning. Grossberg argues that the outstar is the minimal neural architecture capable of classical conditioning [3]. Consider the outstar shown in Figure 6.16. Initially, the conditioned stimulus (CS) (e.g., a ringing bell) is assumed to be unable to elicit a response from any of the units to which it is connected. An unconditioned stimulus (UCS) (the sight of food) can cause an unconditioned response (UCR) (salivation). If the CS is present while the UCS is causing the UCR, then the strength of the connection from the CS unit to the UCR unit will also be increased, in keeping with Hebb's theory (see Chapter 1). Later, the CS will be able to cause a conditioned response (CR) (the same as the UCR), even if the UCS is absent.

The behavior of the outstars in the CPN resembles this classical conditioning. During the training period, the winner of the competition on the hidden layer turns on, providing a single CS to the output units. The UCS is supplied by the y-vector portion of the input layer. Because we want the network to learn the actual y vector, the UCR will be the same as the y vector, within a constant multiplicative factor. After training is complete, the appearance of the CS will cause the CR value (the y' vector) to appear at the output units, even though the UCS values will be all zero.

In the CPN, the hidden layer participates in both the instar and outstar structures of the network. The function of the competitive instars is to recognize an input pattern through a winner-take-all competition. Once a winner has been declared, that unit becomes the CS for an outstar. The outstar associates some value or identity with the input pattern. The instar and outstar complement each other in this fashion: The instar recognizes an input pattern and classifies it; the outstar identifies or names the selected class. This behavior led one researcher to note that the instar is dumb, whereas the outstar is blind [9].

The equations that govern the processing and learning of the outstar are similar in form to those for the instar. During the training process, the output values of the outstar can be calculated from

ẏ'_i = −a y'_i + b y_i + c net_i

which is similar to Eq. (6.4) for the instar except for the additional term due to the training input, y_i. The parameters a, b, and c are all assumed to be positive. The value of net_i is calculated in the usual way as the sum of products of weights and input values from the connecting units.
For the outstar and the CPN, only a single connecting unit has a nonzero output at any given time. Even though each output unit of the outstar has an entire weight vector associated with it, the net input reduces to a single term, w_ij z, where z is the output of the connecting unit. In the case of the CPN, z = 1. Therefore, we can write

ẏ'_i = −a y'_i + b y_i + c w_ij    (6.16)

where the jth unit on the hidden layer is the connecting unit. In its most general form, the parameters a, b, and c in Eq. (6.16) are functions of time. Here, we


Figure 6.16 This figure shows an implied classical-conditioning scenario. (a) During the conditioning period, the CS and the UCS excite one of the output units simultaneously. (b) After conditioning, the presence of the CS alone can excite the CR without exciting any of the other output units.


shall consider them to be constants for simplicity. For the remainder of this discussion, we shall drop the j subscript from the weight.

After training is complete, no further changes in w_i take place and the training inputs, y_i, are absent. Then, the output of the outstar is

ẏ'_i = −a y'_i + c w_i    (6.17)

where w_i is the fixed weight value found during training.

The weights evolve according to an equation almost identical to Eq. (6.10) for the instar:

ẇ_i = (−d w_i + e y_i z) U(z)    (6.18)

Notice that the second term in Eq. (6.18) contains the training input, y_i, not the unit output, y'_i. The U(z) function ensures that no unlearning takes place when there is no training input present, or when the connecting unit is off (z = 0). Since both z and U(z) are 1 for the middle-layer unit that wins the competition, Eq. (6.18) becomes

ẇ_i = −d w_i + e y_i    (6.19)

for the connections from the winning, middle-layer unit. Connections from other middle-layer units do not participate in the learning.

Recall that a given instar can learn to recognize a cluster of input vectors. If the desired CPN outputs (the y_i's) corresponding to each vector in the cluster are all identical, then the weights eventually reach the equilibrium values:

w_i^eq = (e/d) y_i

If, on the other hand, each input vector in the cluster has a slightly different output value associated with it, then the outstar will learn the average of all of the associated output values:

w_i^eq = (e/d) ⟨y_i⟩

Using the latter value for the equilibrium weight, Eq. (6.17) shows that the output after training reaches a steady-state value of

y'_i^eq = (c/a)(e/d) ⟨y_i⟩

Since we presumably want y'_i^eq = ⟨y_i⟩, we can require that a = c in Eq. (6.17) and that d = e in Eq. (6.18). Then,

y'_i^eq = w_i^eq = ⟨y_i⟩    (6.20)

For the purpose of digital simulation, we can approximate the solution to Eq. (6.19) by

w_i(t + 1) = w_i(t) + β(y_i − w_i(t))    (6.21)

following the same procedure that led to Eq. (6.12) for the instar.
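As an illustration of Eq. (6.21) (a sketch with made-up numbers, not the book's code): if the winning hidden unit is paired with slightly different desired outputs y on different presentations, the outstar weight drifts toward their average ⟨y⟩, exactly as the equilibrium analysis above predicts.

    import numpy as np

    rng = np.random.default_rng(1)

    beta = 0.1
    w = 0.0                                             # one outstar weight from the winning unit
    targets = 0.8 + 0.05 * rng.standard_normal(500)     # desired outputs y_i (illustrative)

    for y in targets:
        w = w + beta * (y - w)                          # Eq. (6.21)

    print(w, targets.mean())                            # w settles near <y>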


6.2 CPN DATA PROCESSING

We are now in a position to combine the component structures from the previous section into the complete CPN. We shall still consider only the forward-mapping CPN for the moment. Moreover, we shall assume that we are performing a digital simulation, so it will not be necessary to model explicitly the interconnects for the input layer or the competitive layer.

6.2.1 Forward Mapping

Assume that all training has occurred and that the network is now in a production mode. We have an input vector, I, and we would like to find the corresponding y vector. The processing is depicted in Figure 6.17 and proceeds according to the following algorithm:

1. Normalize the input vector, x_i = I_i / (Σ_k I_k²)^(1/2).

2. Apply the input vector to the x-vector portion of layer 1. Apply a zero vector to the y-vector portion of layer 1.

3. Since the input vector is already normalized, the input layer only distributes it to the units on layer 2.

4. Layer 2 is a winner-take-all competitive layer. The unit whose weight vector most closely matches the input vector wins and has an output value of 1. All other units have outputs of 0. The output of each unit can be calculated according to

   z_i = 1 if net_i > net_j for all j ≠ i, and z_i = 0 otherwise

5. The single winner on layer 2 excites an outstar. Each unit in the outstar quickly reaches an equilibrium value equal to the value of the weight on the connection from the winning layer-2 unit [see Eq. (6.20)]. If the ith unit wins on the middle layer, then the output layer produces an output vector y' = (w_1i, w_2i, ..., w_mi)^t, where m represents the number of units on the output layer. A simple way to view this processing is to realize that the equilibrium output of the outstar is equal to the outstar's net input,

   y'_k = Σ_j w_kj z_j    (6.23)

Since z_j = 0 unless j = i, then y'_k^eq = w_ki z_i = w_ki, which is consistent with the results obtained in Section 6.1.

This simple algorithm uses equilibrium, or asymptotic, values of node activities and outputs. We thus avoid the need to solve numerically all the corresponding differential equations.
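A compact way to see the whole production pass is the following Python sketch, written under the same assumptions as the algorithm above; the weight matrices are made-up placeholders, not trained values. W_hidden holds one instar weight vector per row, and W_out holds the outstar weights with one row per output unit, so column i of W_out is the output pattern associated with hidden unit i.

    import numpy as np

    def cpn_forward(I, W_hidden, W_out):
        """Forward-mapping CPN production pass (digital-simulation shortcut)."""
        x = I / np.linalg.norm(I)        # step 1: normalize the input vector
        net = W_hidden @ x               # steps 2-3: distribute to layer 2
        winner = int(np.argmax(net))     # step 4: winner-take-all (z_winner = 1)
        return W_out[:, winner]          # step 5 / Eq. (6.23): recall the stored pattern

    # Illustrative weights (assumptions, not trained values).
    W_hidden = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
    W_out = np.array([[0.2, 0.9],
                      [0.8, 0.1]])
    print(cpn_forward(np.array([3.0, 1.0]), W_hidden, W_out))   # -> [0.2, 0.8]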


Figure 6.17 This figure shows a summary of the processing done on an input vector by the CPN. The input vector, (x_1, x_2, ..., x_n)^t, is distributed to all units on the competitive layer. The ith unit wins the competition and has an output of 1; all other competitive units have an output of 0. This competition effectively selects the proper output vector by exciting a single connection to each of the outstar units on the output layer.

6.2.2 Training the CPN

Here again, we assume that we are performing a digital simulation of the CPN. Although this assumption does not eliminate the need to find numerical solutions to the differential equations, we can still take advantage of prenormalized input vectors and an external judge to determine winners on the competitive layer. We shall also assume that a set of training vectors has been defined adequately. We shall have more to say on that subject in a later section. Because there are two different learning algorithms in use in the CPN, we shall look at each one independently. In fact, it is a good idea to train the competitive layer completely before beginning to train the output layer.


The competitive-layer units train according to the instar learning algorithm described in Section 6.1. Since there will typically be many instars on the competitive layer, the iterative training process described earlier must be amended slightly. Here, as in Section 6.1, we assume that a cluster of input vectors forms a single class. Now, however, we have the situation where we may have several clusters of vectors, each cluster representing a different class. Our learning procedure must be such that each instar learns (wins the competition) for all the vectors in a single cluster. To accomplish the correct classification for each class of input vectors, we must proceed as follows:

1. Select an input vector from among all the input vectors to be used for training. The selection should be random according to the probability distribution of the vectors.

2. Normalize the input vector and apply it to the CPN competitive layer.

3. Determine which unit wins the competition by calculating the net-input value for each unit and selecting the unit with the largest (the unit whose weight vector is closest to the input vector in an inner-product sense).

4. Calculate α(x − w) for the winning unit only, and update that unit's weight vector according to Eq. (6.12):

   w(t + 1) = w(t) + α(x − w)

5. Repeat steps 1 through 4 until all input vectors have been processed once.

6. Repeat step 5 until all input vectors have been classified properly. When this situation exists, one instar unit will win the competition for all input vectors in a certain cluster. Note that there might be more than one cluster corresponding to a single class of input vectors.

7. Test the effectiveness of the training by applying input vectors from the various classes that were not used during the training process itself. If any misclassifications occur, additional training passes through step 6 may be required, even though all the training vectors are being classified correctly. If training ends too abruptly, the win region of a particular unit may be offset too much from the center of the cluster, and outlying vectors may be misclassified.
We define an instar's win region as the region of vector space containing vectors for which that particular instar will win the competition. (See Figure 6.18.)

An issue that we have overlooked in our discussion is the question of initialization of the weight vectors. For all but the simplest problems, random initial weight vectors will not be adequate. We already hinted at an initialization method earlier: Set each weight vector equal to a representative of one of the clusters. We shall have more to say on this issue in the next section.

Once satisfactory results have been obtained on the competitive layer, training of the outstar layer can occur. There are several ways to proceed based on the nature of the problem.
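Before turning to the output layer, the competitive-layer procedure of steps 1 through 6 above can be sketched in a few lines of Python. Everything here is illustrative: the two clusters, the number of hidden units, the fixed α, and the stopping rule (a fixed number of passes instead of checking classifications) are all assumptions.

    import numpy as np

    rng = np.random.default_rng(2)

    def normalize(v):
        return v / np.linalg.norm(v)

    # Two illustrative clusters (classes); weight vectors are initialized to
    # one representative of each cluster, as the text recommends.
    class_a = [normalize(np.array([1.0, 0.2]) + 0.05 * rng.standard_normal(2)) for _ in range(25)]
    class_b = [normalize(np.array([0.1, 1.0]) + 0.05 * rng.standard_normal(2)) for _ in range(25)]
    data = np.vstack(class_a + class_b)

    W = np.vstack([class_a[0], class_b[0]])   # one instar weight vector per row
    alpha = 0.1

    for epoch in range(20):                   # step 6: repeat passes through the data
        rng.shuffle(data)                     # step 1: random presentation order
        for x in data:                        # x is already normalized (step 2)
            winner = int(np.argmax(W @ x))    # step 3: largest net input
            W[winner] += alpha * (x - W[winner])   # step 4: Eq. (6.12), winner only

    print(W)   # each row ends up near the centroid of one cluster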


Figure 6.18 In this drawing, three clusters of vectors represent three distinct classes: A, B, and C. Normalized, these vectors end on the unit hypersphere. After training, the weight vectors on the competitive layer have settled near the centroid of each cluster. Each weight vector has a win region, represented (although not accurately) by the circles drawn on the surface of the sphere around each cluster. Note that one of the B vectors encroaches into C's win region, indicating that erroneous classification is possible in some cases.

Suppose that each cluster of input vectors represents a class, and all of the vectors in a cluster map to the identical output vector. In this case, no iterative training algorithm is necessary. We need only to determine which hidden unit wins for a particular class. Then, we simply assign the weight vector on the appropriate connections to the output layer to be equal to the desired output vector. That is, if the ith hidden unit wins for all input vectors of the class for which A is the desired output vector, then we set w_ki = A_k, where w_ki is the weight on the connection from the ith hidden unit to the kth output unit.

If each input vector in a cluster maps to a different output vector, then the outstar learning procedure will enable the outstar to reproduce the average of


those output vectors when any member of the class is presented to the inputs of the CPN. If the average output vector for each class is known or can be calculated in advance, then a simple assignment can be made as in the previous paragraph: Let w_ki = ⟨A_k⟩.

If the average of the output vectors is not known, then an iterative procedure can be used based on Eq. (6.21):

1. Apply a normalized input vector, x_k, and its corresponding output vector, y_k, to the x and y inputs of the CPN, respectively.

2. Determine the winning competitive-layer unit.

3. Update the weights on the connections from the winning competitive unit to the output units according to Eq. (6.21):

   w_i(t + 1) = w_i(t) + β(y_ki − w_i(t))

4. Repeat steps 1 through 3 until all vectors of all classes map to satisfactory outputs.

6.2.3 Practical Considerations

In this section, we shall examine several aspects of CPN design and operation that will influence the results obtained using this network. The CPN is deceptively simple in its operation, and there are several pitfalls. Most of these pitfalls can be avoided through a careful analysis of the problem being solved before an attempt is made to model the problem with the CPN. We cannot cover all eventualities in this section. Instead, we shall attempt to illustrate the possibilities in order to raise your awareness of the need for careful analysis.

The first consideration is actually a combination of two: the number of hidden units required, and the number of exemplars, or training vectors, needed for each class. It stands to reason that there must be at least as many hidden nodes as there are classes to be learned. We have been assuming that each class of input vectors can be identified with a cluster of vectors. It is possible, however, that two completely disjoint regions of space contain vectors of the same class. In such a situation, more than one competitive node would be required to identify the input vectors of a single class. Unfortunately, for problems with large dimensions, it may not always be possible to determine that such is the case in advance. This possibility is one reason why more than one representative for each class should be used during training, and also why the training should be verified with other representative input vectors.

Suppose that a misclassification of a test vector does occur after all of the training vectors are classified correctly. There are several possible reasons for this error. One possibility is that the set of exemplars did not adequately represent the class, so the hidden-layer weight vector did not find the true centroid.
Equivalently, training may not have continued for a sufficient time to center the weight vector properly; this situation is illustrated in Figure 6.19.
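For reference, the iterative output-layer procedure described above (steps 1 through 4, built on Eq. (6.21)) might be coded roughly as follows. This is an illustrative sketch only: the trained competitive weights W_hidden, the training pairs, β, and the fixed number of epochs are all assumptions.

    import numpy as np

    def train_outstars(pairs, W_hidden, n_out, beta=0.1, epochs=50):
        """Iteratively train output-layer (outstar) weights via Eq. (6.21).

        pairs:    list of (x, y) training vectors, with x already normalized.
        W_hidden: trained competitive-layer weight vectors, one per row.
        Returns W_out of shape (n_out, n_hidden); column i holds the output
        pattern learned for hidden unit i.
        """
        W_out = np.zeros((n_out, W_hidden.shape[0]))
        for _ in range(epochs):
            for x, y in pairs:                           # step 1
                winner = int(np.argmax(W_hidden @ x))    # step 2
                # step 3: Eq. (6.21), applied only to the winner's outgoing weights
                W_out[:, winner] += beta * (y - W_out[:, winner])
        return W_out

If every vector in a cluster carries the same target, each column simply converges to that target; if the targets differ slightly, each column settles near their average, as Eq. (6.21) predicts.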


Figure 6.19 In this example, weight vector w_1 learns class 1 and w_2 learns class 2. The input vectors of each class extend over the regions shown. Since w_2 has not learned the true centroid of class 2, an outlying vector, x_2, is actually closer to w_1 and is classified erroneously as a member of class 1.

One solution to these situations is to add more units on the competitive layer. Caution must be used, however, since the problem may be exacerbated. A unit added whose weight vector appears at the intersection between two classes may cause misclassification of many input vectors of the original two classes. If a threshold condition is added to the competitive units, a greater amount of control exists over the partitioning of the space into classes. A threshold prevents a unit from winning if the input vector is not within a certain minimum angle, which may be different for each unit. Such a condition has the effect of limiting the size of the win region of each unit.

There are also problems that can occur during the training period itself. For example, if the distribution of the vectors of each class changes with time, then competitive units that were coded originally for one class may get recoded to represent another. Moreover, after training, moving distributions will result in serious classification errors. Another situation is illustrated in Figure 6.20. The problem there manifests itself in the form of a stuck vector; that is, one unit that never seems to win the competition for any input vector.

The stuck-vector problem leads us to an issue that we touched on earlier: the initialization of the competitive-unit weight vectors. We stated in the previous section that a good strategy for initialization is to assign each weight vector to be identical to one of the prototype vectors for each class. The primary motivation for using this strategy is to avoid the stuck-vector problem.

The extreme case of the stuck-vector problem can occur if the weight vectors are initialized to random values. Training with weight vectors initialized in this manner could result in all but one of the weight vectors becoming stuck. A


Figure 6.20 This figure illustrates the stuck-vector problem. (a) In this example, we would like w_1 to learn the class represented by x_1, and w_2 to learn x_2. (b) Initial training with x_1 has brought w_1 closer to x_2 than w_2 is. Thus, w_1 will win for either x_1 or x_2, and w_2 will never win.

single weight vector would win for every input vector, and the network would not learn to distinguish between any of the classes of input vectors.

This rather peculiar occurrence arises due to a combination of two factors: (1) in a high-dimensional space, random vectors are all nearly orthogonal to one another (their dot products are near 0), and (2) it is not unlikely that all input vectors for a particular problem are clustered within a single region of space. If these conditions prevail, then it is possible that only one of the random weight vectors lies within the same region as the input vectors. Any input vector would have a large dot product with that one weight vector only, since all other weight vectors would be in orthogonal regions.
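The first of these two factors is easy to check numerically. The sketch below (illustrative, not from the book) draws pairs of random unit vectors in 3 and in 1000 dimensions and compares the typical size of their dot products.

    import numpy as np

    rng = np.random.default_rng(3)

    def mean_abs_dot(dim, trials=2000):
        """Average |cos| of the angle between pairs of random unit vectors."""
        total = 0.0
        for _ in range(trials):
            a = rng.standard_normal(dim); a /= np.linalg.norm(a)
            b = rng.standard_normal(dim); b /= np.linalg.norm(b)
            total += abs(a @ b)
        return total / trials

    print(mean_abs_dot(3))      # roughly 0.5: random 3-D vectors are often well aligned
    print(mean_abs_dot(1000))   # roughly 0.025: high-dimensional vectors are nearly orthogonal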


Another approach to dealing with a stuck vector is to endow the competitive units with a conscience. Suppose that the probability that a particular unit wins the competition was inversely proportional to the number of times that unit won in the past. If a unit wins too often, it simply shuts down, allowing others to win for a change. Incorporating this feature can unstick a stuck vector resulting from a situation such as the one shown in Figure 6.20.

In contrast to the competitive layer, the layer of outstars on the output layer has few potential problems. Weight vectors can be randomized initially, or set equal to 0 or to some other convenient value. In fact, the only real concern is the value of the parameter, β, in the learning law, Eq. (6.21). Since Eq. (6.21) is a numerical approximation to the solution of a differential equation, β should be kept suitably small (0 < β ≪ 1).


Figure 6.21 The full CPN architecture is redrawn from Figure 6.1. Both x and y input vectors are fully connected to the competitive layer. The x inputs are connected to the x' output units, and the y inputs are connected to the y' outputs.

Both x and y input vectors must be normalized for the full CPN. As in the forward-mapping CPN, both x and y are applied to the input units during the training process. After training, an input of (x, 0) will result in an output of y' = Φ(x), and an input of (0, y) will result in an output of x'.

Because both x and y vectors are connected to the hidden layer, there are two weight vectors associated with each unit. One weight vector, r, is on the connections from the x inputs; another weight vector, s, is on the connections from the y inputs.

Each unit on the competitive layer calculates its net input according to

net_i = r_i · x + s_i · y

The output of the competitive-layer units is

z_i = 1 if net_i = max_j{net_j}, and z_i = 0 otherwise

During the training process,

ṙ_i = α_x (x − r_i)
ṡ_i = α_y (y − s_i)


As with the forward-mapping network, only the winning unit is allowed to learn for a given input vector.

Like the input layer, the output layer is split into two distinct parts. The y' units have weight vectors w_i, and the x' units have weight vectors v_i. The learning laws have the same outstar form as Eq. (6.21):

ẇ_ij = β_y (y_i − w_ij) z_j

and

v̇_ij = β_x (x_i − v_ij) z_j

Once again, only weights for which z_j ≠ 0 are allowed to learn.

Exercise 6.6: What will be the result, after training, of an input of (x_o, y_b), where x_o = Φ'(y_b) and y_b = …?

6.3 AN IMAGE-CLASSIFICATION EXAMPLE

In this section, we shall look at an example of how the CPN can be used to classify images into categories. In addition, we shall see how a simple modification of the CPN will allow the network to perform some interpolation at the output layer.

The problem is to determine the angle of rotation of the principal axis of an object in two dimensions, directly from the raw video image of the object [1]. In this case, the object is a model of the Space Shuttle that can be rotated 360 degrees about a single axis of rotation. Numerical algorithms as well as pattern-matching techniques exist that will solve this problem. The neural-network solution possesses some interesting advantages, however, that may recommend it over these traditional approaches.


Figure 6.22 The system architecture for the spacecraft orientation system is shown. The video camera and frame-grabber capture a 256-by-256-pixel image of the model. That image is reduced to 32-by-32 pixels by a pixel-averaging technique, and is then thresholded to produce a binary image. The resulting 1024-component vector is used as the input to the neural network, which responds by giving the sine and cosine of the rotation angle of the principal axis of the model. These output values are converted to an angle that is sent as part of a command string to a mechanical robot assembly. The command sequence causes the robot to reach out and pick up the model. The angle is used to roll the robot's wrist to the proper orientation, so that the robot can grasp the model perpendicular to the long axis. Source: Reprinted with permission from James A. Freeman, "Neural networks for machine vision: the spacecraft orientation demonstration." Exponent: Ford Aerospace Technical Journal, Fall 1988.


It would seem that this network is limited to classifying all input patterns into only one of 12 categories. An input pattern representing a rotation of 32 degrees, for example, probably would be classified as a 30-degree pattern by this network. One way to remedy this deficiency would be to add more units on the middle layer, allowing for a finer categorization of the input images.

An alternative approach is to allow the output units to perform an interpolation for patterns that do not match one of the training patterns to within a certain tolerance. For this interpolative scheme to be accomplished, more than one unit on the competitive layer must share in winning for each input vector. Recall that the output-layer units calculate their output values according to Eq. (6.23): y'_k = Σ_j w_kj z_j. In the normal case, where the ith hidden unit wins, y'_k = w_ki, since z_j = 1 for j = i and z_j = 0 otherwise. Suppose two competitive units shared in winning: the ones with the two closest matching patterns. Further, let the output of those units be proportional to how close the input pattern is; that is, z_j ∝ cos θ_j for the two winning units. If we restrict the total output from the middle layer to unity, then the output values from the output layer would be

y'_k = w_ki z_i + w_kj z_j

where the ith and jth units on the middle layer were the winners, and

z_i + z_j = 1

The network output is a linear interpolation of the outputs that would be obtained from the two patterns that exactly matched the two hidden units that shared the victory.

Using this technique, the network will classify successfully input patterns representing rotation angles it had never seen during the training period. In our experiments, the average error was approximately ±3°. However, since a simple linear interpolation scheme is used, the error varied from almost 0 to as much as 10 degrees. Other interpolation schemes could result in considerably higher accuracy over the entire range of input patterns.

One of the benefits of using the neural-network approach to pattern matching is robustness in the presence of noise or of contradictory data. An example is shown in Figure 6.23, where the network was able to respond correctly, even though a substantial portion of the image was obscured.

It is unlikely that someone would use a neural network for a simple orientation determination. The methodology can be extended to more realistic cases, however, where the object can be rotated in three dimensions.
In such cases, the time required to construct and train a neural network may be significantly less than the time required for development of algorithms that perform the identical tasks.
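A small sketch of the two-winner interpolation just described follows. Everything numeric is an illustrative assumption: two hidden units are tuned to 0° and 30°, their outstar columns store (sin, cos) of those angles, and the two best matches split a total output of 1 in proportion to cos θ.

    import numpy as np

    def interpolated_output(x, W_hidden, W_out):
        """Two-winner CPN recall: blend the stored patterns of the two best matches."""
        x = x / np.linalg.norm(x)
        net = W_hidden @ x                   # cos(theta_j) for normalized weights and input
        i, j = np.argsort(net)[-2:]          # indices of the two closest matches
        z = np.zeros(len(net))
        z[[i, j]] = net[[i, j]] / (net[i] + net[j])   # shares proportional to cos(theta), summing to 1
        return W_out @ z                     # y'_k = w_ki z_i + w_kj z_j

    # Hidden units tuned to 0 and 30 degrees; output columns store (sin, cos) of those angles.
    angles = np.radians([0.0, 30.0])
    W_hidden = np.column_stack([np.cos(angles), np.sin(angles)])   # one unit vector per row
    W_out = np.vstack([np.sin(angles), np.cos(angles)])            # columns: stored (sin, cos)

    x = np.array([np.cos(np.radians(17)), np.sin(np.radians(17))]) # a pattern "between" the two
    print(interpolated_output(x, W_hidden, W_out))                 # close to (sin 17°, cos 17°)

The recovered angle is off by a couple of degrees, which is consistent with the simple linear interpolation error described above.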


Figure 6.23 These figures show 32-by-32-pixel arrays of two different input vectors for the spacecraft-orientation system. (a) This is a bit-mapped image of the space-shuttle model at an angle of 150° as measured clockwise from the vertical. (b) The obscured image was used as an input vector to the spacecraft-orientation system. The CPN responded with an angle of 149°. Source: Reprinted with permission from James A. Freeman, "Neural networks for machine vision: the spacecraft orientation demonstration." Exponent: Ford Aerospace Technical Journal, Fall 1988.

6.4 THE CPN SIMULATOR

Even though it utilizes two different learning rules, the CPN is perhaps the least complex of the layered networks we will simulate, primarily because of the aspect of competition implemented on the single hidden layer. Furthermore, if we assume that the host computer system ensures that all input pattern vectors are normalized prior to presentation to the network, it is only the hidden layer that contains any special processing considerations: the input layer is simply a fan-out layer, and each unit on the output layer merely performs a linear summation of its active inputs. The only complication in the simulation is the determination of the winning unit(s), and the generation of the appropriate output for each of the hidden-layer units. In the remainder of this section, we will describe the algorithms necessary to construct the restricted CPN simulator. Then, we shall describe the extensions that must be made to implement the complete CPN. We conclude the chapter with thoughts on alternative methods of initializing and training the network.


6.4.1 The CPN Data Structures

Due to the similarity of the CPN simulator to the BPN discussed in Chapter 3, we will use those data structures as the basis for the CPN simulator. The only modification we will require is to the top-level network record specification. The reason for this modification should be obvious by now; since we have consistently used the network record as the repository for all network-specific parameters, we must include the CPN-specific data in the CPN's top-level declaration. Thus, the CPN can be defined by the following record structure:

record CPN =
   INPUTS  : ^layer;   {pointer to input layer record}
   HIDDENS : ^layer;   {pointer to hidden layer record}
   OUTPUTS : ^layer;   {pointer to output layer record}
   ALPHA   : float;    {Kohonen learning parameter}
   BETA    : float;    {Grossberg learning parameter}
   N       : integer;  {number of winning units allowed}
end record;

where the layer record and all lower-level structures are identical to those defined in Chapter 3. A diagram illustrating the complete structure defined for this network is shown in Figure 6.24.

6.4.2 CPN Algorithms

Since forward signal propagation through the CPN is easiest to describe, we shall begin with that aspect of our simulator. Throughout this discussion, we will assume that

• The network simulator has been initialized so that the internal data structures have been allocated and contain valid information

• The user has set the outputs of the network input units to a normalized vector to be propagated through the network

• Once the network generates its output, the user application reads the output vector from the appropriate array and uses that output accordingly

Recall from our discussion in Section 6.2.1 that processing in the CPN essentially starts in the hidden layer. Since we have assumed that the input vector is both normalized and available in the network data structures, signal propagation begins by having the computer calculate the total input stimulation received by each unit on the hidden layer. The unit (or units, in the case where N > 1) with the largest aggregate input is declared the winner, and the output from that unit is set to 1. The outputs from all losing units are simultaneously set to 0.

Once processing on the hidden layer is complete, the network output is calculated by performance of another sum-of-products at each unit on the output layer. In this case, the dot product between the connection weight vector to the unit in question and the output vector formed by all the hidden-layer units is


Figure 6.24 The complete data structure for the CPN is shown. These structures are representative of all the layered networks that we simulate in this text.

computed and used directly as the output for that unit. Since the hidden layer in the CPN is a competitive layer, the input computation at the output layer takes on a significance not usually found in an ANS; rather than combining feature indications from many units, which may be either excitatory or inhibitory (as in the BPN), the output units in the CPN are merely recalling features as stored in the connections between the winning hidden unit(s) and themselves. This aspect of memory recall is further illustrated in Figure 6.25.

Armed with this knowledge of network operation, there are a number of things we can do to make our simulation more efficient. For example, since we know that only a limited number of units (normally only one) in the hidden layer will be allowed to win the competition, there is really no point in forcing the computer to calculate the total input to every unit in the output layer. A much more efficient approach would be simply to allow the computer to remember which hidden-layer unit(s) won the competition, and to restrict the


input calculation at each output unit to that unit's connections to the winning unit(s).

Figure 6.25   This figure shows the process of information recall in the output layer of the CPN. Each unit on the output layer receives an active input only from the winning unit(s) on the hidden layer. Since the connections between the winning hidden unit and each unit on the output layer contain the output value that was associated with the input pattern that won the competition during training, the process of computing the input at each unit on the output layer is nothing more than a selection of the appropriate output pattern from the set of available patterns stored in the input connections.

Also, we can consider the process of determining the winning hidden unit(s). In the case where only one unit is allowed to win (N = 1), determining the winner can be done easily as part of calculating the input to each hidden-layer unit; we simply need to compare the input just calculated to the value saved as the previously largest input. If the current input exceeds the older value, the current input replaces the older value, and processing continues with the next unit. After we have completed the input calculation for all hidden-layer units, the unit whose output matches the largest value saved can be declared the winner.²

² This approach ignores the case where ties between hidden-layer units confuse the determination of the winner. In such an event, other criteria must be used to select the winner.

On the other hand, if we allow more than one unit to win the competition (N > 1), the problem of determining the winning hidden units is more complicated. One problem we will encounter is the determination of how many units will be allowed to win simultaneously. Obviously, we will never have to allow all hidden units to win, but for how many possible winners must we account in


our simulator design? Also, we must address the issue of ranking the hidden-layer units so that we may determine which unit(s) had a greater response to the input; specifically, should we simply process all the hidden-layer units first, and sort them afterward, or should we attempt to rank the units as we process them?

The answer to these questions is truly application dependent; for our purposes, however, we will assume that we must account for no more than three winning units (0 < N < 4) in our simulator design. This being the case, we can also assume that it is more efficient to keep track of up to three winning units as we go, rather than trying to sort through all hidden units afterward.

CPN Production Algorithms.   Using the assumptions described, we are now ready to construct the algorithms for performing the forward signal propagation in the CPN. Since the processing on each of the two active layers is different (recall that the input layer is fan-out only), we will develop two different signal-propagation algorithms: prop_to_hidden and prop_to_output.

procedure prop_to_hidden (NET:CPN; FIRST,SECOND,THIRD:INTEGER)
   {propagate to hidden layer, returning indices to 3 winners}
var units    : ^float[];   {pointer to unit outputs}
    invec    : ^float[];   {pointer to input units}
    connects : ^float[];   {pointer to connection array}
    best     : float;      {the current best match}
    i, j     : integer;    {iteration counters}
begin
   best = -100;                              {initialize best choice}
   units = NET.HIDDENS^.OUTS;                {locate output array}

   for i = 1 to length (units)               {for all hidden units}
   do
      units[i] = 0;                          {initialize accumulator}
      invec = NET.INPUTS^.OUTS;              {locate input array}
      connects = NET.HIDDENS^.WEIGHTS[i]^;   {locate inputs}

      for j = 1 to length (connects) do
         units[i] = units[i] + connects[j] * invec[j];
      end do;

      rank (units[i], FIRST, SECOND, THIRD);
   end do;

   compete (NET.HIDDENS^.OUTS, FIRST, SECOND, THIRD);
end procedure;
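For readers who want to cross-check the pseudocode above against a conventional language, the following short sketch expresses the same hidden-layer competition in Python with NumPy. It is only an illustration under our stated assumptions (a normalized input vector, and a hidden-layer weight matrix stored with one row of weights per unit); the function and variable names are hypothetical and are not part of the simulator developed in this text.

   import numpy as np

   def hidden_competition(invec, hidden_weights, n_winners=1):
       # Net input at every hidden unit: the dot product of the input vector
       # with that unit's weight vector (one row of hidden_weights).
       net = hidden_weights @ invec
       # Indices of the n_winners largest net inputs (ties are ignored here,
       # just as they are in the footnote to the preceding discussion).
       winners = np.argsort(net)[-n_winners:]
       # Losing units output 0; winning units share a total output of 1 in
       # proportion to their net inputs, mirroring the compete procedure.
       outputs = np.zeros_like(net)
       outputs[winners] = net[winners] / net[winners].sum()
       return outputs, winners

With n_winners = 1 this reduces to setting the single winner's output to 1 and all other outputs to 0, which is the behavior described earlier for the forward pass.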


The prop_to_hidden procedure makes calls to two as-yet-undefined routines, rank and compete. The purpose of these routines is to sort the current input with the current best three choices, and to generate the appropriate output for all units in the specified layer, respectively. Because the design of the rank procedure is fairly straightforward, it is left to you as an exercise. On the other hand, the compete process must do the right thing no matter how many winners are allowed, making it somewhat involved. We therefore present the design for compete in its entirety.

procedure compete (UNITS:^float[]; FIRST,SECOND,THIRD:INTEGER)
   {generate outputs for all UNITS using competitive function}
var outputs : ^float[];           {step through output array}
    sum     : float;              {local accumulator}
    win, place, show : float;     {store outputs}
    i       : integer;            {iteration counter}
begin
   outputs = UNITS;               {locate output array}
   sum = outputs[FIRST];          {initialize accumulator}
   win = outputs[FIRST];          {save winning value}

   if (SECOND != 0)               {if a second winner}
   then                           {add its contribution}
      sum = sum + outputs[SECOND];
      place = outputs[SECOND];    {save second-place value}

      if (THIRD != 0)             {if a third-place winner}
      then                        {add its contribution}
         sum = sum + outputs[THIRD];
         show = outputs[THIRD];   {save third-place value}
      end if;
   end if;

   for i = 1 to length (outputs)  {for all hidden units}
   do
      outputs[i] = 0;             {set outputs to zero}
   end do;

   outputs[FIRST] = win / sum;    {set winner's output}

   if (SECOND != 0)               {now update second winner}
   then
      outputs[SECOND] = place / sum;

      if (THIRD != 0)             {and third place}
      then
         outputs[THIRD] = show / sum;
      end if;
   end if;
end procedure;
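Although the design of rank is left to you, the general idea can be seen in a few lines. The Python fragment below keeps a running list of the three best (value, index) pairs seen so far and merges each newly computed net input into it; the names are hypothetical, and the fragment is offered only as a hint at one possible approach, not as the routine used by the simulator in this text.

   def rank3(value, index, best):
       # 'best' is a list of (value, index) pairs, largest value first.
       best.append((value, index))
       best.sort(key=lambda pair: pair[0], reverse=True)  # largest net input first
       del best[3:]                                       # keep at most three winners
       return best

   # Example: merging net inputs one unit at a time, as prop_to_hidden would.
   best = []
   for i, net in enumerate([0.12, 0.87, 0.45, 0.91]):
       best = rank3(net, i, best)
   # best is now [(0.91, 3), (0.87, 1), (0.45, 2)]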


Before we move on to the prop_to_output routine, you should note that the compete procedure relies on the fact that the values of SECOND and THIRD are nonzero if and only if more than one unit wins the competition. Since it is assumed that these values are set as part of the rank procedure, you should take care to ensure that these variables are manipulated according to the number of winning units indicated by the value in the NET.N variable.

Let us now consider the process of propagating information to the output layer in the CPN. Once we have completed the signal propagation to the hidden layer, the outputs on the hidden layer will be nonzero only from the winning units. As we have discussed before, we could now proceed to perform a complete input summation at every unit on the output layer, but that would prove to be needlessly time-consuming. Since we have designed the prop_to_hidden procedure to return the index of the winning unit(s), we can assume that the top-level routine to propagate information through the network completely will have access to that information prior to calling the procedure to propagate information to the output layer. We can therefore code the prop_to_output procedure so that only those connections between the winning units and the output units are processed. Also, notice that the successful use of this procedure relies on the values of the SECOND and THIRD variables being nonzero only if more than one winner was allowed.

procedure prop_to_output (NET:CPN; FIRST,SECOND,THIRD:INTEGER)
   {generate outputs for units on the output layer}
var units    : ^float[];   {locate output units}
    hidvec   : ^float[];   {locate hidden units}
    connects : ^float[];   {locate connections}
    i        : integer;    {iteration counter}
begin
   units = NET.OUTPUTS^.OUTS;    {start of output array}
   hidvec = NET.HIDDENS^.OUTS;   {start of hidden array}

   for i = 1 to length (units)   {for all output units}
   do
      connects = NET.OUTPUTS^.WEIGHTS[i]^;
      units[i] = hidvec[FIRST] * connects[FIRST];

      if (SECOND != 0)           {if there is a second winning unit}
      then
         units[i] = units[i] + hidvec[SECOND] * connects[SECOND];

         if (THIRD != 0)         {if there is a third winning unit}


         then
            units[i] = units[i] + hidvec[THIRD] * connects[THIRD];
         end if;
      end if;
   end do;
end procedure;

You may now be asking yourself how we can be assured that the hidden-layer units will be using the appropriate output values for any number of winning units. To answer this question, we recall that we specified that at least one unit is guaranteed to win, and that, at most, three will share in the victory. Inspection of the compete and prop_to_output routines shows that, with only one winning unit, the output of all non-winning units will be 0, whereas the winning unit will generate a 1. As we increase the number of units that we allow to win, the strength of the output from each of the winning units is proportionally decreased, so that the relative contribution from all winning units will linearly interpolate between output patterns the network was trained to produce.

Now we are prepared to define the top-level algorithm for forward signal propagation in the CPN. As before, we assume that the input vector has been set previously by an application-specific input routine.

procedure propagate (NET:CPN)
   {perform a forward signal propagation in the CPN}
var first, second, third : integer;   {indices for winning units}
begin
   prop_to_hidden (NET, first, second, third);
   prop_to_output (NET, first, second, third);
end procedure;

CPN Learning Algorithms.   There are two significant differences between forward signal propagation and learning in the CPN: during learning, only one unit on the hidden layer can win the competition, and, quite obviously, the network connection weights are updated. Yet, even though they are different, much of the activity that must be performed during learning is identical to the forward signal propagation. As you will see, we will be able to reuse the production-mode algorithms to a large extent as we develop our learning-mode procedures.

We shall begin by training the hidden-layer units to recognize our input patterns. Having completed that activity, we will proceed to train the output layer to reproduce the target outputs from the specified inputs. Let us first consider the process of training the hidden layer in the CPN. Assuming the input-layer units have been initialized to contain a normalized vector to be


learned, we can define the learning algorithm for the hidden-layer units in the following manner.

procedure learn_input (NET:CPN)
   {update the connections to the winning hidden unit}
var winner : integer;             {used to locate winning unit}
    dummy1, dummy2 : integer;     {dummy variables for second and third}
    connects : ^float[];          {locate connection array}
    units    : ^float[];          {locate input array}
    i        : integer;           {iteration counter}
begin
   dummy1 = 0;                    {no need for second or}
   dummy2 = 0;                    {third winning unit}
   prop_to_hidden (NET, winner, dummy1, dummy2);

   units = NET.INPUTS^.OUTS;                   {locate input array}
   connects = NET.HIDDENS^.WEIGHTS[winner]^;   {locate connections}

   for i = 1 to length (connects)              {for all connections to winner}
   do
      connects[i] = connects[i] + NET.ALPHA * (units[i] - connects[i]);
   end do;
end procedure;

Notice that this algorithm has no access to information that would indicate when the competitive layer has been trained sufficiently. Unlike in many of the other networks we have studied, there is no error measure to indicate convergence. For that reason, we have chosen to design this training algorithm so that it performs only one training pass in the competitive layer. A higher-level routine to train the entire network must therefore be coded such that it can reasonably determine when the competitive layer has completed its training.

Moving on to the output layer, we will construct our training algorithm for this part of the network such that only those connections from the winning hidden unit to the output layer are adapted. This approach will allow us to complete the training of the competitive layer before starting the training on the accretive layer. Our hope is that this approach will enable the CPN to classify inputs correctly while it is training the outputs, so that the network is not confused.

procedure learn_output (NET:CPN; TARGET:^float[])
   {train the output layer to reproduce the specified vector}
var winner : integer;             {used to locate winning unit}
    dummy1, dummy2 : integer;     {dummy variables for second and third}


    connects : ^float[];          {locate connection array}
    units    : ^float[];          {locate output array}
    i        : integer;           {iteration counter}
begin
   dummy1 = 0;                    {no need for second or}
   dummy2 = 0;                    {third winning unit}
   prop_to_hidden (NET, winner, dummy1, dummy2);

   units = NET.OUTPUTS^.OUTS;     {locate output array}

   for i = 1 to length (units)    {for all output units}
   do
      connects = NET.OUTPUTS^.WEIGHTS[i]^;   {locate connections}
      connects[winner] = connects[winner] +
                         NET.BETA * (TARGET[i] - connects[winner]);
   end do;
end procedure;

As with learn_input, learn_output performs only one training pass and makes no assessment of error. When the CPN is used, it is the application that makes the determination as to when training is complete. As an example, consider the spacecraft-orientation system described in Section 6.3. This network was constructed to learn the image of the space shuttle at 12 different orientations, producing the scaled sine and cosine of the angle between the reference position and the image position as output. Using our CPN algorithms, the training process for this application might have taken the following form:

procedure learn (NET:CPN; IMAGEFILE:disk file)
   {teach the NET using data in the IMAGEFILE}
var iopairs : array [1..12, 1..1026] of float;
    target  : array [1..2] of float;
    status  : array [1..12] of boolean;
    done    : boolean;
    i, j    : integer;
begin
   NET.N = 1;                                   {force only one winner}
   READ_INPUT_FILE (IMAGEFILE, iopairs[1,1]);   {init array}

   done = false;                                {train at least once}
   SET_FALSE (status);                          {initialize status array}

   while (not done)                             {until training complete}
   do
      for i = 1 to 12                           {for each training pair}


      do
         j = random (12);                       {select a pattern at random}
         SET_INPUT (NET, iopairs[j,1]);
         learn_input (NET);                     {train competitive layer}
      end do;

      TEST_IF_INPUT_LEARNED (status, done);
   end do;

   done = false;                                {train output at least once}
   SET_FALSE (status);                          {initialize status array}

   while (not done)                             {until training complete}
   do
      for i = 1 to 12                           {for each training pair}
      do                                        {train accretive layer}
         SET_INPUT (NET, iopairs[i,1]);
         GET_TARGET (target, iopairs[i,1025]);
         learn_output (NET, target);
      end do;

      TEST_IF_OUTPUT_LEARNED (status, done);
   end do;
end procedure;

where the application-provided routines TEST_IF_INPUT_LEARNED and TEST_IF_OUTPUT_LEARNED perform the function of deciding when the CPN has been trained sufficiently. In the case of testing the competitive layer, this determination was accomplished by verifying that all 12 input patterns caused different hidden-layer units to win the competition. Similarly, the output test indicated success when no output unit generated an actual output different by more than 0.001 from the desired output for each of the 12 patterns. The other routines used in the application, READ_INPUT_FILE, SET_FALSE, SET_INPUT, and GET_TARGET, merely perform housekeeping functions for the system. The point is, however, that there is no general heuristic that we can use to determine when to stop training the CPN simulator. If we had had many more training pairs than hidden-layer units, the functions performed by TEST_IF_INPUT_LEARNED and TEST_IF_OUTPUT_LEARNED might have been altogether different.

6.4.3 Simulating the Complete CPN

Now that we have completed our discussion of the restricted CPN, let us turn our attention for a moment to the full CPN. In terms of simulating the complete network, there are only two differences between the restricted and complete network implementations:


1. The size of the network, in terms of number of units and connections
2. The use of the network from the application's perspective

Quite obviously, the number of units in the network has grown from N + H + M, where N and M specify the number of units in the input and output layers, respectively, to 2(N + M) + H. Similarly, the number of connections that must be maintained has doubled, expanding from H(N + M) to 2H(N + M). Therefore, the extension from the restricted to the complete CPN has slightly less than doubled the amount of computer memory needed to simulate the network. In addition, the extra units and connections place an enormous overhead on the amount of computer time needed to perform the simulation, in that there are now N + M extra connections to be processed at every hidden unit.

As illustrated in Figure 6.26, the complete CPN requires no modification to the algorithms we have just developed, other than to present both the input and output patterns as target vectors for the output layer. This assertion holds true assuming the user abides by the observation that, when going from input to output, the extra M units on the input layer are zeroed prior to performing the signal propagation. This being the case, the inputs from the extra units contribute nothing to the dot-product calculation at each hidden unit, effectively eliminating them from consideration in determining the winning unit. By the same token, the original N units must be zeroed prior to performance of the counterpropagation from the M new units to recover the original input.

6.4.4 Practical Considerations for the CPN Simulator

We earlier promised a discussion of the practical considerations of which we might take advantage when simulating the CPN. We shall now live up to that promise by offering insights into improving the performance of the CPN simulator.

Many times, a CPN application will require the network to function as an associative memory; that is, we expect the network to recall a specific output when presented with an input that is similar to a training input. Such an input could be the original input with noise added, or with part of the input missing. When constructing a CPN to act in this capacity, we usually create it with as many hidden units as there are items to store. In doing so, and in allowing only one unit to win the competition, we ensure that the network will always generate the exact output that was associated with the training input that most closely resembles the current input.

Having made this observation, we can now see how it is possible to reduce the amount of time needed to train the CPN by eliminating the need to train the competitive layer. We can do this reduction by initializing the connections to each hidden unit such that each input training pattern is mapped onto the connections of only one hidden unit. In essence, we will have trained the competitive layer by initializing the connections, in a manner similar to the process of initializing the BAM, described in Chapter 4. All that remains from


this point is the training of the output layer, which ought to conclude in fairly short order. An example of this type of training is shown in Figure 6.27.

Figure 6.26   The complete CPN processing model is shown. Using the algorithms developed for the limited CPN, processing begins with the application of an input pattern on the X_n units and by zeroing of the extra Y_m units, which we use to represent the output. In this case, the output units (Y'_m) will produce the output pattern associated with the given input pattern during training. Now, to produce the counterpropagation effect, the opposite situation occurs. The given output pattern is applied to the input units previously zeroed (Y_m), while the X_n units are zeroed. The output of the network, on the X'_n units, represents the input pattern associated with the given output pattern during training.

Another observation about the operation of the CPN can provide us with insight into improving the ability of the network to discriminate between similar vectors. As we have discussed, the competitive layer acts to select one of the many input patterns the network was trained to recognize. It does this selection by computing the dot product between the input vector, I, and the connection weight vector, w. Since these two vectors are normalized, the resulting value represents the cosine of the angle between them in n-space. However, this approach can lead to problems if we allow the use of the null vector, 0, as a valid input. Since 0 cannot be normalized, another method of


propagating signals to the hidden layer must be found. One alternative method that has shown promising results is to use the magnitude of the difference vector between the unnormalized I and w vectors as the input computation for each hidden unit. Specifically, let

    net_i = ||I − w_i||

be the input activation calculation performed at each hidden unit, rather than the traditional sum-of-products.

Use of this method prevents the CPN from creating duplicate internal mappings for similar, but different, input vectors. It also allows use of the null vector as a valid training input, something that we have found to be quite useful. Finally, at least one other researcher has indicated that this alternative method can improve the ability of the network to learn by reducing the number of training presentations needed to have the network classify input patterns properly [11].

Figure 6.27   A restricted CPN initialized to eliminate the training of the competitive layer is shown. Note that the number of hidden units is exactly equal to the number of training patterns to be learned.

Programming Exercises

6.1. Implement the CPN simulator described in the text and test it by training it to recognize the 36 uppercase ASCII characters from a pixel matrix representing the image of the character. For example, the 6x5 matrix illustrated next represents the pixel image of the character A. The equivalent ASCII


code for that character is 41₁₆. Thus, the ordered pair (08A8FE31₁₆, 41₁₆) represents one training pair for the network. Complete this example by training the network to recognize the other 35 alphanumeric characters from their pixel images. How many hidden-layer units are needed to precisely recall all 36 characters?

    . . X . .
    . X . X .
    X . . . X
    X X X X X
    X . . . X
    X . . . X       =  00100 01010 10001 11111 10001 10001

6.2. Without resizing the network, retrain the CPN described in Programming Exercise 6.1 to recognize both the upper- and lowercase ASCII alphabetic characters. Describe how accurately the CPN identifies all characters after training. Include in your discussion any reasons you might have to explain why the CPN misclassifies some characters.

6.3. Repeat Programming Exercise 6.1, but allow two hidden units to win the competition during production. Explain the network's behavior under these circumstances.

6.4. Recreate the spacecraft-orientation example in Section 6.3 using your CPN simulator. You may simplify the problem by using a smaller matrix to represent the video image. For example, the 5x5 matrix shown next might be used to represent the shuttle image in the vertical position. Train the network to recognize your image at 45-degree rotational increments around the circle. Let two units win the competition, as in the example, and let the two output units produce the scaled sine and cosine of the image angle. Test your network by obscuring the image (enter a vector with 0s in place of 1s), and describe the results.

    . . X . .
    . . X . .
    . . X . .
    . X X X .
    X X X X X       =  00100 00100 00100 01110 11111

6.5. Describe what would happen in Programming Exercise 6.4 if you only allowed one unit to win the competition. Describe what would happen if three units were allowed to win.

6.6. Implement the complete CPN simulator using the guidelines provided in the text. Train the network using the spacecraft-orientation data, and exercise the simulator in both the forward-propagation and counterpropagation modes. How well does the simulator produce the desired input pattern when given sine and cosine values that are not on a 45-degree angle?


Suggested Readings

The papers and book by Hecht-Nielsen are good companions to the material in this chapter on the CPN [5, 6, 8]. Hecht-Nielsen also has a paper that discusses some application areas appropriate to the CPN [7].

The instar, outstar, and avalanche networks are discussed in detail in the papers by Grossberg in the collection Studies of Mind and Brain [4]. Individual papers from this collection are listed in the bibliography.

Bibliography

[1] James A. Freeman. Neural networks for machine vision applications: The spacecraft orientation demonstration. exponent: Ford Aerospace Technical Journal, pp. 16-20, Fall 1988.
[2] Stephen Grossberg. How does a brain build a cognitive code? In Stephen Grossberg, editor, Studies of Mind and Brain. D. Reidel Publishing, Boston, pp. 1-52, 1982.
[3] Stephen Grossberg. Learning by neural networks. In Stephen Grossberg, editor, Studies of Mind and Brain. D. Reidel Publishing, Boston, pp. 65-156, 1982.
[4] Stephen Grossberg, editor. Studies of Mind and Brain, volume 70 of Boston Studies in the Philosophy of Science. D. Reidel Publishing Company, Boston, 1982.
[5] Robert Hecht-Nielsen. Counterpropagation networks. Applied Optics, 26(23):4979-4984, December 1987.
[6] Robert Hecht-Nielsen. Counterpropagation networks. In Maureen Caudill and Charles Butler, editors, Proceedings of the IEEE First International Conference on Neural Networks, pp. II-19-II-32, IEEE, Piscataway, NJ, June 1987.
[7] Robert Hecht-Nielsen. Applications of counterpropagation networks. Neural Networks, 1(2):131-139, 1988.
[8] Robert Hecht-Nielsen. Neurocomputing. Addison-Wesley, Reading, MA, 1990.
[9] David Hestenes. How the brain works: The next great scientific revolution. Presented at the Third Workshop on Maximum Entropy and Bayesian Methods in Applied Statistics, University of Wyoming, August 1983.
[10] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing, Chapter 8. MIT Press, Cambridge, MA, pp. 318-362, 1986.
[11] Donald Woods. Back and counter propagation aberrations. In Proceedings of the IEEE First International Conference on Neural Networks, pp. I-473-I-479, IEEE, San Diego, CA, June 1987.


C H A P T E R   7

Self-Organizing Maps

The cerebral cortex is arguably the most fascinating structure in all of human physiology. Although vastly complex on a microscopic level, the cortex reveals a consistently uniform structure on a macroscopic scale, from one brain to another. Centers for such diverse activities as thought, speech, vision, hearing, and motor functions lie in specific areas of the cortex, and these areas are located consistently relative to one another. Moreover, individual areas exhibit a logical ordering of their functionality. An example is the so-called tonotopic map of the auditory regions, where neighboring neurons respond to similar sound frequencies in an orderly sequence from high pitch to low pitch. Another example is the somatotopic map of motor nerves, represented artistically by the homunculus illustrated in Figure 7.1. Regions such as the tonotopic map and the somatotopic map can be referred to as ordered feature maps. The purpose of this chapter is to investigate a mechanism by which these ordered feature maps might develop naturally.

It appears likely that our genetic makeup predestines our neural development to a large extent. Whether the mechanisms that we shall describe here play a major or a minor role in the organization of neural tissue is not an issue for us. It was, however, an interest in discovering how such an organization might be learned that led Kohonen to develop many of the ideas that we present in this chapter [2].

The cortex is essentially a large (approximately 1 meter square in adult humans), thin (2 to 4 millimeters thick) sheet consisting of six layers of neurons of varying type and density. It is folded into its familiar shape to maximize packing density in the cranium. Since we are not so much concerned with the details of anatomy here, we shall consider an adequate model of the cortex to be a two-dimensional sheet of processing elements.

We saw in the previous chapter how on-center off-surround interactions among competitive processing elements could be used to construct a network that could classify clusters of input vectors. If you have not done so, please


read Sections 6.1.1 through 6.1.3, as they are prerequisites to the material in this chapter.

Figure 7.1   The homunculus depicts the relationship between regions of the somatotopic area of the cortex and the parts of the body that they control. Although somewhat distorted, the basic structure of the body is reflected in the organization of the cortex in this region. Source: Reprinted with permission of McGraw-Hill, Inc. from Charles R. Noback and Robert J. Demarest, The Human Nervous System: Basic Principles of Neurobiology, ©1981 by McGraw-Hill, Inc.

In the simple competitive layer of the counterpropagation network, units learn by a process of self-organization. Learning was accomplished by the application of input data alone; no expected-output data was used as a teacher to signal the network that it had made an error. We call this type of learning unsupervised learning, and the input data are called unlabeled data. In the CPN, there was also no indication that the physical position of units in the competitive layer reflected any special relationship among the classes of data being learned. Thus, we say that there was a random map of input classes to competitive units.

In contrast to our discussion of random mapping, here we shall see how a simple extension of the competitive algorithms from Chapter 6 results in a


topology-preserving map of the input data to the competitive units. Because of Kohonen's work in the development of the theory of competition, competitive processing elements are often referred to as Kohonen units.

As a simplified definition, we can say that, in a topology-preserving map, units located physically next to each other will respond to classes of input vectors that are likewise next to each other. Although it is easy to visualize units next to each other in a two-dimensional array, it is not so easy to determine which classes of vectors are next to each other in a high-dimensional space. Large-dimensional input vectors are, in a sense, projected down on the two-dimensional map in a way that maintains the natural order of the input vectors. This dimensional reduction could allow us to visualize easily important relationships among the data that otherwise might go unnoticed.

In the next section, we shall formalize some of the definitions presented in this section, and shall look at the mathematics of the topology-preserving map. Henceforth, we shall refer to the topology-preserving map as a self-organizing map (SOM).

7.1 SOM DATA PROCESSING

Lateral interactions among processing elements in the on-center off-surround scheme were modeled in Chapter 6 as a single positive-feedback connection to a central unit, and negative-feedback connections to all other units in the layer. In this chapter, we shall modify that model such that, during the learning process, the positive feedback will extend from the central (winning) unit to other units in some finite neighborhood around the central unit. In the competitive layer of the CPN, only the winning unit (the one whose weight vector most closely matched the input vector) was allowed to learn; in the SOM, all the units in the neighborhood that receive positive feedback from the winning unit participate in the learning process. Even if a neighboring unit's weight is orthogonal to the input vector, its weight vector will still change in response to the input vector. This simple addition to the competitive process is sufficient to account for the ordered mapping discussed in the previous section.

7.1.1 Unit Activations

The activations of the processing elements are defined by the set of equations

    ẏ_i = −r(y_i) + net_i + Σ_k z_ik y_k        (7.1)


The term net_i is the net input to unit i, calculated in the usual manner as the dot product of the input vector and the weight vector of the unit. The final term in Eq. (7.1) models the lateral interactions between units. The sum extends over all units in the system. If z_ik takes the form of the Mexican-hat function, as shown in Figure 7.2(a), then the network will exhibit a bubble of activity around the unit with the largest value of net input. Although the unit with the largest net-input value is technically the winner in the competitive system, neighboring units share in that victory.

7.1.2 The SOM Learning Algorithm

During the training period, each unit with a positive activity within the neighborhood of the winning unit participates in the learning process in a manner identical to the instar discussed in Chapter 6. We can describe the learning process by the equation

    ẇ_i = α(t)(x − w_i) U(y_i)        (7.2)

where w_i is the weight vector of the ith unit and x is the input vector. The function U(y_i) is zero unless y_i > 0, in which case U(y_i) = 1, ensuring that only those units with positive activity participate in the learning process. The factor α(t) is written as a function of time to anticipate our desire to change it as learning progresses (see the discussion in Section 6.2.3).

In the remainder of the discussion, we shall not explicitly show the inhibitory connections between units, and we shall ignore the far-reaching excitatory interactions as a first approximation. The resultant interaction function is shown in Figure 7.2(b).

To demonstrate the formation of an ordered feature map, we shall use an example in which units are trained to recognize their relative positions in two-dimensional space. The scenario is illustrated in the sequence in Figure 7.3. Each processing element is identified by its coordinates, (u, v), in two-dimensional space. Weight vectors for this example are also two-dimensional and are initially assigned to the processing elements randomly.

As with other competitive structures, a winning processing element is determined for each input vector based on the similarity between the input vector and the weight vector. For an input vector x, the winning unit can be determined by

    ||x − w_c|| = min_j { ||x − w_j|| }        (7.3)

where the index c refers to the winning unit. To keep subscripts to a minimum, we identify each unit in the two-dimensional array by a single subscript, as in Eq. (7.3).

Instead of updating the weights of the winning unit only, we define a physical neighborhood around the unit, and all units within this neighborhood participate in the weight-update process. As learning proceeds, the size of the neighborhood is diminished until it encompasses only a single unit.


Figure 7.2   These graphics illustrate two different models of lateral interactions among network units. (a) This curve, characteristic of the lateral interactions between certain neurons in the cortex, is referred to as the Mexican-hat function. A centrally excited processing element excites a small neighborhood around it with positive-feedback connections. As the lateral distance from the central node increases, the degree of excitation falls until it becomes an inhibition. This inhibition continues for a significantly longer distance. Finally, a weak positive feedback extends a considerable distance away from the central node. (b) We shall use this simple function as a first approximation to the Mexican-hat function. The distance, a, defines a neighborhood of units around the central unit that participate in learning along with the central unit.
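To make the two interaction profiles of Figure 7.2 concrete, the short Python sketch below gives one common functional form for a Mexican-hat curve and the simple step function that we adopt as its first approximation. The particular formula and the parameter names are illustrative assumptions, not the exact curves plotted in the figure.

   import math

   def mexican_hat(distance, sigma=1.0):
       # Qualitative Mexican-hat profile of Figure 7.2(a): strong excitation near
       # the central unit, inhibition at intermediate distances, fading with
       # distance (the weak far-field excitation described in the caption is
       # omitted in this simple form).
       d2 = (distance / sigma) ** 2
       return (1.0 - d2) * math.exp(-d2 / 2.0)

   def step_neighborhood(distance, a):
       # First approximation of Figure 7.2(b): units within distance a of the
       # winner receive positive feedback; units outside the neighborhood do not.
       return 1.0 if distance <= a else 0.0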


If c is the winning unit, and N_c is the list of unit indices that make up the neighborhood, then the weight-update equations are

    w_i(t + 1) = w_i(t) + α(t)(x − w_i(t))     if i ∈ N_c
    w_i(t + 1) = w_i(t)                         otherwise        (7.4)

Each weight vector participating in the update process rotates slightly toward the input vector, x. Once training has progressed sufficiently, the weight vector on each unit will converge to a value that is representative of the coordinates of the points near the physical location of the unit.

Exercise 7.1: In Eq. (7.3), the winning unit is determined by the minimum of the quantity ||x − w_i||. In Chapter 6, the winning unit was determined by the maximum of the quantity net_i = x · w_i. Under what circumstances are these two conditions equivalent?
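To see Eqs. (7.3) and (7.4) in a concrete form, the following Python sketch performs one training step of the kind just described: it finds the winning unit by Euclidean distance and moves the weight vector of every unit whose fixed grid position lies within the current neighborhood radius toward the input point. It is an illustrative sketch only, with hypothetical names; the grid positions and weight vectors are assumed to be stored as NumPy arrays with one row per unit.

   import numpy as np

   def som_train_step(x, weights, positions, radius, alpha):
       # Eq. (7.3): the winning unit c minimizes ||x - w_j||.
       c = np.argmin(np.linalg.norm(weights - x, axis=1))
       # N_c: every unit whose fixed physical position lies within 'radius'
       # of the winner's position.
       in_neighborhood = np.linalg.norm(positions - positions[c], axis=1) <= radius
       # Eq. (7.4): units in N_c move a fraction alpha toward x; all others
       # are left unchanged.
       weights[in_neighborhood] += alpha * (x - weights[in_neighborhood])
       return weights

Repeating this step over many randomly chosen input points, while gradually shrinking radius and alpha, produces the kind of map evolution sketched in Figure 7.3.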


Figure 7.3   This series of figures shows how the SOM develops as training takes place. (a) Processing elements on the SOM are arranged in a two-dimensional array at discrete points, (u, v)_p, within the continuous space. A two-dimensional weight vector, (w_u, w_v)_w, is assigned to each unit. The weight values correspond to physical locations within the space occupied by the processing elements, but they are assigned at random to the units. Thus, neighboring units in physical space may occupy unrelated locations in weight space. (b) During training, a point (u, v) is selected at random to be the input vector. A winning unit is determined by a simple Euclidean distance measure between the selected point and the unit's weight vector, and a neighborhood, N_c, is defined around the winning unit in physical space. The weight vectors in all units within the neighborhood change slightly toward the value of the input, (u, v). As training continues with different input points, the size of the neighborhood is decreased gradually until it encompasses only a single unit. (c) At the completion of training, the weight vector for each unit will be approximately equal to the physical coordinates of the unit.

Exercise 7.2: Is it acceptable to initialize the weight vectors on an SOM unit to random, unnormalized values? Explain your answer.

Kohonen has developed a clever way to illustrate the dynamics of the learning process for examples of the type described in Figure 7.3. Instead of plotting the position of the processing elements according to their physical location, we can plot them according to their location in weight space. We then draw connecting lines between units that are neighbors in physical space; Figure 7.4 illustrates this idea. As training progresses, the map evolves, as shown in Figure 7.5.

We have assumed throughout this discussion that the input points are selected at random from a uniformly distributed set within the rectangular area occupied by the processing elements. Suppose that the input points were selected from a different distribution, as illustrated in Figure 7.6. In this example, input points are selected from a uniform, triangular distribution.


The processing elements themselves are still in a two-dimensional physical array, but the weights form a map of the triangular region. Remember that the units do not change their physical location during training.

Figure 7.4   Each unit is identified by two sets of coordinates: fixed, physical-space coordinates in (u, v)_p space and modifiable weight-space coordinates in (w_u, w_v)_w space. A plot is made of points according to their position in weight space. Units are joined by lines according to their position in physical space. Neighboring units (0,0)_p, (0,1)_p, and (1,0)_p would be connected as shown with the thick lines.

In the introduction to this chapter, we spoke of using the SOM to perform a dimensional reduction on the input data by mapping high-dimensional input vectors onto the two-dimensional map. We can illustrate this idea graphically by mapping a two-dimensional region of points onto a one-dimensional array of processing elements. Figure 7.7 illustrates the case for two different distributions of input points. Remember that the input and weight vectors are both two-dimensional, whereas the physical location of each processing element is on a one-dimensional array.

All our examples to this point have used input points that are uniformly distributed within some region. There is nothing sacred about the uniform distribution. Input points can be distributed according to any distribution function. Once the SOM has been trained, the weight vectors will be organized into an approximation of the distribution function of the input vectors. Stated in more formal terms,

    The point density function of the weight vectors tends to approximate the probability density function p(x) of the input vectors, x, and the weight vectors tend to be ordered according to their mutual similarity. [2]

We shall not attempt to prove that the prescription described here will result in a topology-preserving map. Even for the simple one-dimensional case, such a


proof is extremely long and complex. You should consult Kohonen's excellent book on SOMs for details [2].

Figure 7.5   In this series of figures, a two-dimensional SOM develops from a space of input points selected randomly from a uniform rectangular distribution. (a) In the initial map, weight vectors have random values near the center of the map coordinates. (b) As the map begins to evolve, weights spread out from the center. (c) Eventually, the final structure of the map begins to emerge. (d) In the end, the relationship between the weight vectors mimics the relationship between the physical coordinates of the processing elements.

7.1.3 The Feature Map Classifier

An advantage of the SOM is that large numbers of unlabeled data can be organized quickly into a configuration that may illuminate underlying structure within the data. Following the self-organization process, it may be desirable


Figure 7.6 This sequence of diagrams illustrates the evolution of the map for a uniform, triangular distribution of input points. Notice that, in the final map, neighboring points in physical space end up with neighboring weights in weight space. This feature is the essence of the topology-preserving map.

Figure 7.7 These figures illustrate one-dimensional maps for two different input probability distributions. (a) This sequence shows the evolution of the one-dimensional SOM for a uniform distribution of points in a triangular region. Notice how processing elements that are physical neighbors (indicated by connecting lines) form weight vectors that are distributed in an orderly fashion throughout the input space. (b) For this sequence, input points are uniformly distributed within the K-shaped region. Because of the concavities, weight vectors can have values that are outside of the actual space of input points.


to associate certain inputs with certain output values, such as is done with the backpropagation and counterpropagation networks.

The architecture of the SOM can be extended by the addition of an association layer, as shown in Figure 7.8. We refer to this structure as a feature map classifier (FMC). Units on the output layer can be trained by any of several methods, including the delta rule, described in Chapter 2, or by the outstar procedure described in Chapter 6. An alternate training method, called maximum-likelihood training, is described by Huang and Lippmann [1].

Figure 7.8 In this representation of the feature map classifier, a layer of output units has been added to a two-dimensional SOM. The output units associate desired output values with certain input vectors. The SOM acts as a competitive network that classifies input vectors. The processing done by the FMC is virtually identical to that done by the forward-mapping CPN (see Chapter 6). Unlike in the CPN, it is not necessary to have the units on the competitive layer fully connected to the units on the output layer.
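As a concrete illustration of this idea (not the code developed later in this chapter), the following Python sketch attaches a layer of output weights to an already-trained, winner-take-all SOM layer and adjusts only the winning unit's outgoing connections with the delta rule. The array names and the learning rate are assumptions made for the example.

import numpy as np

def fmc_train_step(som_weights, out_weights, x, target, lrate=0.1):
    """One delta-rule step for a feature map classifier (sketch).

    som_weights : (n_units, n_inputs) weight vectors of the SOM layer
    out_weights : (n_outputs, n_units) association weights to the output layer
    x, target   : input vector and desired output vector
    """
    # Competition: the SOM unit whose weight vector is closest to x wins.
    winner = np.argmin(np.linalg.norm(som_weights - x, axis=1))

    # The competitive layer's output is 1 for the winner and 0 elsewhere,
    # so only the winner's outgoing connections need to be adjusted.
    y = out_weights[:, winner]                        # current FMC output
    out_weights[:, winner] += lrate * (target - y)    # delta rule
    return winner, y

Because the competitive layer produces a one-hot output, most of the connection matrix is never exercised on a given step, which is why the full connectivity mentioned in the caption is unnecessary.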


7.2 APPLICATIONS OF SELF-ORGANIZING MAPS

In this section, we discuss two applications that employ the SOM network. Each application illustrates a significantly different way in which the SOM can be used effectively.

7.2.1 The Neural Phonetic Typewriter

It has long been a goal of computer scientists to endow a machine with the ability to recognize and understand human speech. The possibilities for such a machine seem endless. Some progress has been made, yet the goal has been elusive. Despite many years of research, currently available commercial products are limited by their small vocabularies, by dependence on extensive training by a particular speaker, or by both.

The neural phonetic typewriter demonstrates the potential for neural networks to aid in the endeavor to build speaker-independent speech recognition into a computer. Moreover, it shows how neural-network technology can be merged with traditional signal processing and standard techniques in artificial intelligence to solve a particular problem. This device can transcribe speech into written text from an unlimited vocabulary, in real time, and with an accuracy of 92 to 97 percent [3]. It is therefore correctly called a typewriter; it does not purport to understand the meaning of the speech. Nevertheless, the neural phonetic typewriter should have a significant effect in the modern office. Training for an individual speaker requires the dictation of only about 100 words, and requires about 10 minutes of time on a personal computer.²

For this device, a two-dimensional array of nodes is trained using, as inputs, 15-component spectral analyses of spoken words sampled every 9.83 milliseconds. These input vectors are produced from a series of preprocessing steps performed on the audible sound. This preprocessing includes the use of a noise-canceling microphone, a 12-bit analog-to-digital conversion, a 256-point fast Fourier transform performed every 9.83 milliseconds, grouping of the spectral channels into 15 groups, and additional filtering and normalization of the resultant 15-component input vector.

Using Kohonen's clustering algorithm, nodes in a two-dimensional array were allowed to organize themselves in response to the input vectors. After training, the resulting map was calibrated by using the spectra of phonemes as input vectors.
Even though phonemes were not used explicitly to train the network, most nodes responded to a single phoneme, as shown in Figure 7.9. This response to phonemes is all the more striking since the phonemes last from 40 milliseconds to 400 milliseconds, in contrast to the 9.83-millisecond sampling interval used to train the network.

²Unfortunately for those of us whose native language is English, the only languages supported at present are Finnish and Japanese.
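The preprocessing pipeline described above can be sketched in a few lines of NumPy. This is only an illustrative approximation, not Kohonen's implementation: the window function, the 128-sample hop (about 9.83 milliseconds at an assumed 13.02-kHz sampling rate), and the equal-width channel grouping are assumptions made for the example.

import numpy as np

def spectral_frames(signal, frame_len=256, hop=128, n_channels=15):
    """Turn a mono waveform (NumPy array) into 15-component spectral vectors.

    A 256-point FFT is taken on successive frames; the magnitude spectrum is
    grouped into 15 channels, and each resulting vector is normalized,
    roughly mirroring the preprocessing described in the text.
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))          # 129 magnitude bins
        channels = np.array([c.mean()                  # group into 15 channels
                             for c in np.array_split(spectrum, n_channels)])
        frames.append(channels / (np.linalg.norm(channels) + 1e-12))
    return np.array(frames)

Each row of the returned array would then serve as one input vector to the two-dimensional map.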


Figure 7.9 This figure is a phonotopic map with the neurons, shown as circles, and the phonemes to which they learned to respond. A double label means that the node responds to more than one phoneme. Some phonemes, such as the plosives represented by k, p, and t, are difficult for the network to distinguish and are most prone to misclassification by the network. Source: Reprinted with permission from Teuvo Kohonen, "The neural phonetic typewriter." IEEE Computer, March 1988. ©1988 IEEE.

As a word is spoken, it is sampled, analyzed, and submitted to the network as a sequence of input vectors. As the nodes in the network respond, a path is traced on the map that corresponds to the sequence of input patterns. See Figure 7.10 for an example; the arrows in the figure correspond to the sampling time of 9.83 milliseconds. This path results in a phonetic transcription of the word, which can then be compared with known words or used as input to a rule-based system. The analysis can be carried out quickly and efficiently using a variety of techniques, including neural-network associative memories.

As words are spoken into the microphone, their transcription appears on the computer screen. We eagerly await the English-language version.

7.2.2 Learning Ballistic Arm Movements

In our second example, we describe a method developed by Ritter and Schulten that uses an SOM to learn how to perform ballistic movements of a simple robot arm [4, 5, 6]. Ballistic movements are initiated by short-duration torques applied at the joints of the arm. Once the torques have been applied, motion of the arm proceeds freely (hence the term ballistic movement). Since the torques are of such short duration, there is no control of the motion by way of a feedback


Figure 7.10 This illustration shows the sequence of responses from the phonotopic map resulting from the spoken Finnish word humppila. (Do not bother to look up the meaning of this word in your Finnish-English dictionary: humppila is the name of a place.) Source: Reprinted with permission from Teuvo Kohonen, "The neural phonetic typewriter." IEEE Computer, March 1988. ©1988 IEEE.

mechanism; thus, the torques that cause a particular motion must be known in advance. Figure 7.11 illustrates the simple, two-dimensional robot-arm model used in this example.

For a particular starting position, x, and a particular, desired end-effector velocity, u_desired, the required torques can be found from

    T = A(x) u_desired                                        (7.5)

where T is the vector (τ₁, τ₂)'.³ The tensor quantity, A (here, simply a two-dimensional matrix), is determined by the details of the arm and its configuration. Ritter and Schulten use Kohonen's SOM algorithm to learn the A(x) quantities. A mechanism for learning the A tensors would be useful in a real environment where aging effects and wear might alter the dynamics of the arm over time.

The first part of the method is virtually identical to the two-dimensional mapping example discussed in Section 7.1. Recall that, in that example, units

³Torque itself is a vector quantity, defined as the time-rate of change of the angular momentum vector. Our vector T is a composite of the magnitudes of two torque vectors, τ₁ and τ₂. The directions of τ₁ and τ₂ can be accounted for by their signs: τ > 0 implies a counterclockwise rotation of the joint, and τ < 0 implies a clockwise rotation.


Figure 7.11 This figure shows a schematic of a simple robot arm and its space of permitted movement. The arm consists of two massless segments of length 1.0 and 0.9, with unit point masses at its distal joint, d, and its end effector, e. The end effector begins at some randomly selected location, x, within the region R. The joint angles are θ₁ and θ₂. The desired movement of the arm is to have the end effector move at some randomly selected velocity, u_desired. For this movement to be accomplished, torques τ₁ and τ₂ must be applied at the joints.

learned to organize themselves such that their two-dimensional weight vectors corresponded to the physical coordinates of the associated unit. An input vector of (x₁, x₂) would then cause the largest response from the unit whose physical coordinates were closest to (x₁, x₂).

Ritter and Schulten begin with a two-dimensional array of units identified by their integer coordinates, (i, j), within the region R of Figure 7.11. Instead of using the coordinates of a selected point as inputs, they use the corresponding values of the joint angles. Given suitable restrictions on the values of θ₁ and θ₂, there will be a one-to-one correspondence between the joint-angle vector, θ = (θ₁, θ₂)', and the coordinate vector, x = (x₁, x₂)'. Other than this change of variables, and the use of a different model for the Mexican-hat function, the development of the map proceeds as described in Section 7.1:

1. Select a point x within R according to a uniform random distribution.

2. Determine the corresponding joint-angle vector, θ*.


3. Select the winning unit, y*, such that

   ||θ(y*) − θ*|| = min_y ||θ(y) − θ*||

4. Update the theta vector for all units according to

   θ(y, t+1) = θ(y, t) + h₁(y − y*, t)(θ* − θ(y, t))

The function h₁(y − y*, t) defines the model of the Mexican-hat function: it is a Gaussian function centered on the winning unit. Therefore, the neighborhood around the winning unit that gets to share in the victory encompasses all of the units. Unlike in the example in Section 7.1, however, the magnitude of the weight updates for the neighboring units decreases as a function of distance from the winning unit. Also, the width of the Gaussian is decreased as learning proceeds.

So far, we have not said anything about how the A(x) matrices are learned. That task is facilitated by association of one A tensor with each unit of the SOM network. Then, as winning units are selected according to the procedure given, A matrices are updated right along with the θ vectors. We can determine how to adjust the A matrices by using the difference between the desired motion, u_desired, and the actual motion, v, to determine successive approximations to A. In principle, we do not need a SOM to accomplish this adjustment. We could pick a starting location, then investigate all possible velocities starting from that point, and iterate A until it converges to give the expected movements. Then, we would select another starting point and repeat the exercise. We could continue this process until all starting locations have been visited and all As have been determined.

The advantage of using a SOM is that all A matrices are updated simultaneously, based on the corrections determined for only one starting location. Moreover, the magnitude of the corrections for neighboring units ensures that their A matrices are brought close to their correct values quickly, perhaps even before their associated units have been selected via the θ competition. So, to pick up the algorithm where we left off,

5. Select a desired velocity, u, with random direction and unit magnitude, ||u|| = 1. Execute an arm movement with torques computed from T = A(x)u, and observe the actual end-effector velocity, v.

6. Calculate an improved estimate of the A tensor for the winning unit:

   A(y*, t+1) = A(y*, t) + ε A(y*, t)(u − v)v'

   where ε is a positive constant less than 1.

7. Finally, update the A tensor for all units according to

   A(y, t+1) = A(y, t) + h₂(y − y*, t)(A(y*, t+1) − A(y, t))

   where h₂ is a Gaussian function whose width decreases with time.
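A compact way to see how steps 1 through 7 fit together is the following Python sketch of a single training step. It is a simplified illustration, not the authors' code: the helper functions (random_point_in_R, joint_angles_for, execute_movement), the Gaussian width schedule, and the constants are all assumptions made for the example.

import numpy as np

def som_arm_step(theta, A, t, random_point_in_R, joint_angles_for,
                 execute_movement, eps=0.5, sigma0=3.0, tau=200.0):
    """One training step of the SOM-based ballistic-arm learner (sketch).

    theta : (rows, cols, 2) float array of joint-angle vectors, one per unit
    A     : (rows, cols, 2, 2) float array of torque matrices, one per unit
    """
    x = random_point_in_R()                                   # step 1
    theta_star = joint_angles_for(x)                          # step 2

    dist = np.linalg.norm(theta - theta_star, axis=-1)
    winner = np.unravel_index(np.argmin(dist), dist.shape)    # step 3

    # Gaussian neighborhood centered on the winner, shrinking with time.
    rows, cols = np.indices(dist.shape)
    d2 = (rows - winner[0])**2 + (cols - winner[1])**2
    sigma = sigma0 * np.exp(-t / tau)
    h = np.exp(-d2 / (2.0 * sigma**2))

    theta += h[..., None] * (theta_star - theta)              # step 4

    u = np.random.randn(2)                                    # step 5
    u /= np.linalg.norm(u)
    v = execute_movement(x, A[winner] @ u)                    # observed velocity

    A_star = A[winner] + eps * A[winner] @ np.outer(u - v, v) # step 6
    A += h[..., None, None] * (A_star - A)                    # step 7
    return winner

Because h equals 1 at the winner and falls off smoothly with grid distance, the winner's A matrix becomes the improved estimate exactly, while its neighbors are pulled part of the way toward it, which is the cooperative effect described above.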


The result of using a SOM in this manner is a significant decrease in the convergence time for the A tensors. Moreover, the investigators reported that the system was more robust in the sense of being less sensitive to the initial values of the A tensors.

7.3 SIMULATING THE SOM

As we have seen, the SOM is a relatively uncomplicated network in that it has only two layers of units. Therefore, the simulation of this network will not tax the capacity of the general network data structures with which we have, by now, become familiar. The SOM, however, adds at least one interesting twist to the notion of the layer structure used by most other networks; this is the first time we have dealt with a layer of units that is organized as a two-dimensional matrix, rather than as a simple one-dimensional vector. To accommodate this new dimension, we will decompose the matrix conceptually into a single vector containing all the row vectors from the original matrix. As you will see in the following discussion, this matrix decomposition allows the SOM simulator to be implemented with minimal modifications to the general data structures described in Chapter 1.

7.3.1 The SOM Data Structures

From our theoretical discussion earlier in this chapter, we know that the SOM is structured as a two-layer network, with a single vector of input units providing stimulation to a rectangular array of output units. Furthermore, units in the output layer are interconnected to allow lateral inhibition and excitation, as illustrated in Figure 7.12(a). This network structure will be rather cumbersome to simulate if we attempt to model the network precisely as illustrated, because we will have to iterate on the row and column offsets of the output units. Since we have chosen to organize our network connection structures as discrete, single-dimension arrays accessed through an intermediate array, there is no straightforward means of defining a matrix of connection arrays without modifying most of the general network structures. We can, however, reduce the complexity of the simulation task by conceptually unpacking the matrix of units in the output layer, reforming them as a single layer of units organized as a long vector composed of the concatenation of the original row vectors, as sketched below.

In so doing, we will have essentially restructured the network such that it resembles the more familiar two-layer structure, as shown in Figure 7.12(b).
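The conceptual unpacking amounts to nothing more than row-major indexing. The following short Python sketch (the names are chosen for the example) shows the mapping between (row, column) grid coordinates and the single index used for the flattened output layer.

def unit_index(row, col, n_cols):
    """Map 0-based (row, col) grid coordinates to a flat, row-major index."""
    return row * n_cols + col

def grid_coords(index, n_cols):
    """Recover 0-based (row, col) grid coordinates from a flat index."""
    return divmod(index, n_cols)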
As we shall see, the benefit of restructuring the network in this manner is that it will enable us to efficiently locate, and update, the neighborhood surrounding the winning unit in the competition.

If we also observe that the connections between the units in the output layer can be simulated on the host computer system as an algorithmic determination of the winning unit (and its associated neighborhood), we can reduce the processing


Figure 7.12 The conceptual model of the SOM is shown (a) as described by the theoretical model, and (b) restructured to ease the simulation task. In (a), the input units feed a grid of output units arranged in rows 1 through 3; in (b), the same rows of units are concatenated into a single output vector above the input units.


model of the SOM network to a simple two-layer, feedforward structure. This reduction allows us to simulate the SOM by using exactly the data structures described in Chapter 1. The only network-specific structure needed to implement the simulator is then the top-level network-specification record. For the SOM, such a record takes the following form:

record SOM =
    ROWS    : integer;     {number of rows in output layer}
    COLS    : integer;     {ditto for columns}
    INPUTS  : ^layer;      {pointer to input layer structure}
    OUTPUTS : ^layer;      {pointer to output layer structure}
    WINNER  : integer;     {index to winning unit}
    deltaR  : integer;     {neighborhood row offset}
    deltaC  : integer;     {neighborhood column offset}
    TIME    : integer;     {discrete timestep}
end record;

7.3.2 SOM Algorithms

Let us now turn our attention to the process of implementing the SOM simulator. As in previous chapters, we shall begin by describing the algorithms needed to propagate information through the network, and shall conclude this section by describing the training algorithms. Throughout the remainder of this section, we presume that you are by now familiar with the data structures we use to simulate a layered network. Anyone not comfortable with these structures is referred to Section 1.4.

SOM Signal Propagation. In Section 6.4.4, we described a modification to the counterpropagation network that used the magnitude of the difference vector between the unnormalized input and weight vectors as the basis for determining the activation of a unit on the competitive layer. We shall now see that this approach is a viable means of implementing competition, since it is the basic method of stimulating output units in the SOM.

In the SOM, the input layer is provided only to store the input vector. For that reason, we can consider the process of forward signal propagation to be a matter of allowing the computer to visit all units in the output layer sequentially. At each output-layer unit, the computer calculates the magnitude of the difference vector between the output of the input layer and the weight vector formed by the connections between the input layer and the current unit. After completion of this calculation, the magnitude will be stored, and the computer will move on to the next unit on the layer.
Once all the output-layer units have been processed, the forward signal propagation is finished, and the output of the network will be the matrix containing the magnitude of the difference vector for each unit in the output layer.

If we also consider the training process, we can allow the computer to store locally an index (or pointer) to locate the output unit that had the smallest


difference-vector magnitude during the initial pass. That index can then be used to identify the winner of the competition. By adopting this approach, we can also use the routine used to forward propagate signals in the SOM during training with no modifications.

Based on this strategy, we shall define the forward signal-propagation algorithm to be the combination of two routines: one to compute the difference-vector magnitude for a specified unit on the output layer, and one to call the first routine for every unit on the output layer. We shall call these routines prop and propagate, respectively. We begin with the definition of prop.

function prop (NET:SOM; UNIT:integer) return float
{compute the magnitude of the difference vector for UNIT}

var invec, connects : ^float[];   {locate arrays}
    sum, mag : float;             {temporary variables}
    i : integer;                  {iteration counter}

begin
    invec = NET.INPUTS^.OUTS^;                  {locate input vector}
    connects = NET.OUTPUTS^.WEIGHTS^[UNIT];     {connections}
    sum = 0;                                    {initialize sum}

    for i = 1 to length(invec)                  {for all inputs}
    do                                          {square of difference}
        sum = sum + sqr(invec[i] - connects[i]);
    end do;

    mag = sqrt(sum);                            {compute magnitude}
    return (mag);                               {return magnitude}
end function;

Now that we can compute the output value for any unit on the output layer, let us consider the routine to generate the output for the entire network. Since we have defined our SOM network as a standard, two-layer network, the pseudocode definition for propagate is straightforward.

function propagate (NET:SOM) return integer
{propagate forward through the SOM, return the index to winner}

var outvec : ^float[];            {locate output array}
    winner : integer;             {winning unit index}
    smallest, mag : float;        {temporary storage}
    i : integer;                  {iteration counter}

begin
    outvec = NET.OUTPUTS^.OUTS^;  {locate output array}
    winner = 0;                   {initialize winner}
    smallest = 10000;             {arbitrarily high}

    for i = 1 to length(outvec)   {for all outputs}


    do
        mag = prop(NET, i);       {activate unit}
        outvec[i] = mag;          {save output}

        if (mag < smallest)       {if new winner is found}
        then
            winner = i;           {mark new winner}
            smallest = mag;       {save winning value}
        end if;
    end do;

    NET.WINNER = winner;          {store winning unit id}
    return (winner);              {identify winner}
end function;

SOM Learning Algorithms. Now that we have developed a means for performing the forward signal propagation in the SOM, we have also solved the largest part of the problem of training the network. As described by Eq. (7.4), learning in the SOM takes place by updating of the connections to the set of output units that fall within the neighborhood of the winning unit. We have already provided the means for determining the winner as part of the forward signal propagation; all that remains to be done to train the network is to develop the processes that define the neighborhood (N_c) and update the connection weights.

Unfortunately, the process of determining the neighborhood surrounding the winning unit is likely to be application dependent. For example, consider the two applications described earlier, the neural phonetic typewriter and the ballistic arm movement systems. Each implemented an SOM as the basic mechanism for solving their respective problems, but each also utilized a neighborhood-selection mechanism that was best suited to the application being addressed. It is likely that other problems would also require alternative methods better suited to determining the size of the neighborhood needed for each application. Therefore, we will not presuppose that we can define a universally acceptable function for N_c.

We will, however, develop the code necessary to describe a typical neighborhood-selection function, trusting that you will learn enough from the example to construct a function suitable for your applications. For simplicity, we will design the process as two functions: the first will return a true-false flag to indicate whether a certain unit is within the neighborhood of the winning unit at the current timestep, and the second will update the connection values at an output unit, if the unit falls within the neighborhood of the winning unit.

The first of these routines, which we call neighbor, will return a true flag if the row and column coordinates of the unit given as input fall within the range of units to be updated. This process proves to be relatively easy, in that the routine needs to perform only the following two tests:

    (R_w − ΔR) ≤ R ≤ (R_w + ΔR)
    (C_w − ΔC) ≤ C ≤ (C_w + ΔC)


Figure 7.13 A simple scheme is shown for dynamically altering the size of the neighborhood surrounding the winning unit. In this diagram, W denotes the winning unit for a given input vector. The neighborhood surrounding the winning unit is then given by the values contained in the variables deltaR and deltaC contained in the SOM record. As the values in deltaR and deltaC approach zero, the neighborhood surrounding the winning unit shrinks, until the neighborhood is precisely the winning unit.

where (R_w, C_w) are the row and column coordinates of the winning unit, (ΔR, ΔC) are the row and column offsets from the winning unit that define the neighborhood, and (R, C) are the row and column coordinates of the unit being tested.

For example, consider the situation illustrated in Figure 7.13. Notice that the boundary surrounding the winner's neighborhood shrinks with successively smaller values for (ΔR, ΔC), until the neighborhood is limited to the winner when (ΔR, ΔC) = (0, 0). Thus, we need only to alter the values for (ΔR, ΔC) in order to change the size of the winner's neighborhood.

So that we can implement this mechanism of neighborhood determination, we have incorporated two variables in the SOM record, named deltaR and deltaC, to allow the network record to keep the current values for the ΔR and ΔC terms. Having made this observation, we can now define the algorithm needed to implement the neighbor function.

function neighbor (NET:SOM; R,C,W:integer) return boolean
{return true if (R,C) is in the neighborhood of W}

var row, col,                {coordinates of winner}
    dR1, dC1,                {coordinates of lower boundary}
    dR2, dC2 : integer;      {coordinates of upper boundary}

begin


    row = (W-1) / NET.COLS;    {convert index of winner to row}
    col = (W-1) % NET.COLS;    {modulus finds column of winner}

    dR1 = max(1, (row - NET.deltaR));
    dR2 = min(NET.ROWS, (row + NET.deltaR));
    dC1 = max(1, (col - NET.deltaC));
    dC2 = min(NET.COLS, (col + NET.deltaC));

    return (((dR1 <= R) and (R <= dR2))
            and ((dC1 <= C) and (C <= dC2)));
end function;


    for i = 1 to NET.ROWS              {for all rows}
    do
        for j = 1 to NET.COLS          {and all columns}
        do
            if (neighbor (NET,i,j,winner))
            then
                {first locate the appropriate connection array}
                connect = NET.OUTPUTS^.WEIGHTS^[unit];

                {then locate the input layer output array}
                invec = NET.INPUTS^.OUTS^;

                upd = upd + 1;                 {count another update}

                for k = 1 to length(connect)   {for all connections}
                do
                    connect[k] = connect[k]
                               + (A * (invec[k] - connect[k]));
                end do;
            end if;

            unit = unit + 1;                   {access next unit}
        end do;
    end do;

    return (upd);                              {return update count}
end function;

7.3.3 Training the SOM

Like most other networks, the SOM will be constructed so that it initially contains random information. The network will then be allowed to self-adapt by being shown example inputs that are representative of the desired topology. Our computer simulation ought to mimic the desired network behavior if we simply follow these same guidelines when constructing and training the simulator.

There are two aspects of the training process that are relatively simple to implement, and we assume that you will provide them as part of the implementation of the simulator. These functions are the ones needed to initialize the SOM (initialize) and to apply an input vector (set_inputs) to the input layer of the network.

Most of the work to be done in the simulator will be accomplished by the previously defined routines, and we need to concern ourselves now with only the notion of deciding how and when to collapse the winning neighborhood as we train the network. Here again, this aspect of the design probably will be influenced by the specific application, so, for instructional purposes, we will restrict ourselves to a fairly easy application that allows each of the output layer units to be uniquely associated with a specific input vector.

For this example, let us assume that the SOM to be simulated has four rows of five columns of units in the output layer, and two units providing input. Such


Figure 7.14 This SOM network can be used to capture the organization of a two-dimensional image, such as a triangle, circle, or any regular polygon. In the programming exercises, we ask you to simulate this network structure to test the operation of your simulator program.

a network structure is depicted in Figure 7.14. We will code the SOM simulator so that the entire output layer is initially contained in the neighborhood, and we shall shrink the neighborhood by two rows and two columns after every 10 training patterns.

For the SOM, there are two distinct training sessions that must occur. In the first, we will train the network until the neighborhood has shrunk to the point that only one unit wins the competition. During the second phase, which occurs after all the training patterns have been allocated to a winning unit (although not necessarily different units), we will simply continue to run the training algorithm for an arbitrarily large number of additional cycles. We do this to try to ensure that the network has stabilized, although there is no absolute guarantee that it has. With this strategy in mind, we can now complete the simulator by constructing the routine to initialize and train the network.

procedure train (NET:SOM; NP:integer)
{train the network for each of NP patterns}

begin
    initialize(NET);              {reader-provided routine}

    for i = 1 to NP               {for all training patterns}
    do


        NET.deltaR = NET.ROWS / 2;    {initialize the row offset}
        NET.deltaC = NET.COLS / 2;    {ditto for columns}
        NET.TIME = 0;                 {reset time counter}

        set_inputs(NET,i);            {get training pattern}

        while (update(NET) > 1)       {loop until one winner}
        do
            NET.TIME = NET.TIME + 1;  {advance training counter}

            if (NET.TIME % 10 = 0)    {if shrink time}
            then
                {shrink the neighborhood, with a floor of (0,0)}
                NET.deltaR = max (0, NET.deltaR - 1);
                NET.deltaC = max (0, NET.deltaC - 1);
            end if;
        end do;
    end do;

    {now that all patterns have one winner, train some more}
    for i = 1 to 1000                 {for arbitrary passes}
    do
        for j = 1 to NP               {for all patterns}
        do
            set_inputs(NET, j);       {set training pattern}
            dummy = update(NET);      {train network}
        end do;
    end do;
end procedure;

Programming Exercises

7.1. Implement the SOM simulator. Test it by constructing a network similar to the one depicted in Figure 7.14. Train the network using the Cartesian coordinates of each unit in the output layer as training data. Experiment with different time periods to determine how many training passes are optimal before reducing the size of the neighborhood.

7.2. Repeat Programming Exercise 7.1, but this time extend the simulator to plot the network dynamics using Kohonen's method, as described in Section 7.1. If you do not have access to a graphics terminal, simply list out the connection-weight values to each unit in the output layer as a set of ordered pairs at various timesteps.

7.3. The converge algorithm given in the text is not very general, in that it will work only if the number of output units in the SOM is exactly equal


to the number of training patterns to be encoded. Redesign the routine to handle the case where the number of training patterns greatly outnumbers the number of output layer units. Test the algorithm by repeating Programming Exercise 7.1 and reducing the number of output units used to three rows of four units.

7.4. Repeat Programming Exercise 7.2, this time using three input units. Configure the output layer appropriately, and train the network to learn to map a three-dimensional cube in the first quadrant (all vertices should contain positive coordinates). Do not put the vertices of the cube at integer coordinates. Does the network do as well as the network in Programming Exercise 7.2?

Suggested Readings

The best supplement to the material in this chapter is Kohonen's text on self-organization [2]. Now in its second edition, that text also contains general background material regarding various learning methods for neural networks, as well as a review of the necessary mathematics.

Bibliography

[1] William Y. Huang and Richard P. Lippmann. Neural net and traditional classifiers. In Proceedings of the Conference on Neural Information Processing Systems, Denver, CO, November 1987.

[2] Teuvo Kohonen. Self-Organization and Associative Memory, volume 8 of Springer Series in Information Sciences. Springer-Verlag, New York, 1984.

[3] Teuvo Kohonen. The "neural" phonetic typewriter. Computer, 21(3):11-22, March 1988.

[4] H. Ritter and K. Schulten. Topology conserving mappings for learning motor tasks. In John S. Denker, editor, Neural Networks for Computing. American Institute of Physics, New York, pp. 376-380, 1986.

[5] H. Ritter and K. Schulten. Extending Kohonen's self-organizing mapping algorithm to learn ballistic movements. In Rolf Eckmiller and Christoph v.d. Malsburg, editors, Neural Computers. Springer-Verlag, Heidelberg, pp. 393-406, 1987.

[6] Helge J. Ritter, Thomas M. Martinetz, and Klaus J. Schulten. Topology-conserving maps for learning visuo-motor coordination. Neural Networks, 2(3):159-168, 1989.




C H A P T E R 8

Adaptive Resonance Theory

One of the nice features of human memory is its ability to learn many new things without necessarily forgetting things learned in the past. A frequently cited example is the ability to recognize your parents even if you have not seen them for some time and have learned many new faces in the interim. It would be highly desirable if we could impart this same capability to an ANS. Most networks that we have discussed in previous chapters will tend to forget old information if we attempt to add new information incrementally.

When developing an ANS to perform a particular pattern-classification operation, we typically proceed by gathering a set of exemplars, or training patterns, then using these exemplars to train the system. During the training, information is encoded in the system by the adjustment of weight values. Once the training is deemed to be adequate, the system is ready to be put into production, and no additional weight modification is permitted.

This operational scenario is acceptable provided the problem domain has well-defined boundaries and is stable. Under such conditions, it is usually possible to define an adequate set of training inputs for whatever problem is being solved. Unfortunately, in many realistic situations, the environment is neither bounded nor stable.

Consider a simple example. Suppose you intend to train a BPN to recognize the silhouettes of a certain class of aircraft. The appropriate images can be collected and used to train the network, which is potentially a time-consuming task depending on the size of the network required. After the network has learned successfully to recognize all of the aircraft, the training period is ended and no further modification of the weights is allowed.

If, at some future time, another aircraft in the same class becomes operational, you may wish to add its silhouette to the store of knowledge in your


network. To do this, you would have to retrain the network with the new pattern plus all of the previous patterns. Training on only the new silhouette could result in the network learning that pattern quite well, but forgetting previously learned patterns. Although retraining may not take as long as the initial training, it still could require a significant investment.

Moreover, if an ANS is presented with a previously unseen input pattern, there is generally no built-in mechanism for the network to be able to recognize the novelty of the input. The ANS doesn't know that it doesn't know the input pattern.

We have been describing what Stephen Grossberg calls the stability-plasticity dilemma [5]. This dilemma can be stated as a series of questions [6]: How can a learning system remain adaptive (plastic) in response to significant input, yet remain stable in response to irrelevant input? How does the system know to switch between its plastic and its stable modes? How can the system retain previously learned information while continuing to learn new things?

In response to such questions, Grossberg, Carpenter, and numerous colleagues developed adaptive resonance theory (ART), which seeks to provide answers. ART is an extension of the competitive-learning schemes that have been discussed in Chapters 6 and 7. The material in Section 6.1, especially, should be considered a prerequisite to the current chapter. We will draw heavily from those results, so you should review the material, if necessary, before proceeding.

In the competitive systems discussed in Chapter 6, nodes compete with one another, based on some specified criteria, and the winner is said to classify the input pattern. Certain instabilities can arise in these networks such that different nodes might respond to the same input pattern on different occasions. Moreover, later learning can wash away earlier learning if the environment is not statistically stationary or if novel inputs arise [9].

A key to solving the stability-plasticity dilemma is to add a feedback mechanism between the competitive layer and the input layer of a network. This feedback mechanism facilitates the learning of new information without destroying old information, automatic switching between stable and plastic modes, and stabilization of the encoding of the classes done by the nodes. The results from this approach are two neural-network architectures that are particularly suited for pattern-classification problems in realistic environments.
These network architectures are referred to as ART1 and ART2. ART1 and ART2 differ in the nature of their input patterns. ART1 networks require that the input vectors be binary. ART2 networks are suitable for processing analog, or gray-scale, patterns.

ART gets its name from the particular way in which learning and recall interplay in the network. In physics, resonance occurs when a small-amplitude vibration of the proper frequency causes a large-amplitude vibration in an electrical or mechanical system. In an ART network, information in the form of processing-element outputs reverberates back and forth between layers. If the proper patterns develop, a stable oscillation ensues, which is the neural-network


equivalent of resonance. During this resonant period, learning, or adaptation, can occur. Before the network has achieved a resonant state, no learning takes place, because the time required for changes in the processing-element weights is much longer than the time that it takes the network to achieve resonance.

A resonant state can be attained in one of two ways. If the network has learned previously to recognize an input vector, then a resonant state will be achieved quickly when that input vector is presented. During resonance, the adaptation process will reinforce the memory of the stored pattern. If the input vector is not immediately recognized, the network will rapidly search through its stored patterns looking for a match. If no match is found, the network will enter a resonant state whereupon the new pattern will be stored for the first time. Thus, the network responds quickly to previously learned data, yet remains able to learn when novel data are presented.

Much of Grossberg's work has been concerned with modeling actual macroscopic processes that occur within the brain in terms of the average properties of collections of the microscopic components of the brain (neurons). Thus, a Grossberg processing element may represent one or more actual neurons. In keeping with our practice, we shall not dwell on the neurological implications of the theory. There exists a vast body of literature available concerning this work. Work with these theories has led to predictions about neurophysiological processes, even down to the chemical-ion level, which have subsequently been proven true through research by neurophysiologists [6]. Numerous references are listed at the end of this chapter.

The equations that govern the operation of the ART networks are quite complicated. It is easy to lose sight of the forest while examining the trees closely. For that reason, we first present a qualitative description of the processing in ART networks. Once that foundation is laid, we shall return to a detailed discussion of the equations.

8.1 ART NETWORK DESCRIPTION

The basic features of the ART architecture are shown in Figure 8.1. Patterns of activity that develop over the nodes in the two layers of the attentional subsystem are called short-term memory (STM) traces because they exist only in association with a single application of an input vector. The weights associated with the bottom-up and top-down connections between F1 and F2 are called long-term memory (LTM) traces because they encode information that remains a part of the network for an extended period.

8.1.1 Pattern Matching in ART

To illustrate the processing that takes place, we shall describe a hypothetical sequence of events that might occur in an ART network.
The scenario is a simple pattern-matching operation during which an ART network tries to determine


Figure 8.1 The ART system is diagrammed. The two major subsystems are the attentional subsystem and the orienting subsystem. F1 and F2 represent two layers of nodes in the attentional subsystem. Nodes on each layer are fully interconnected to the nodes on the other layer. Not shown are interconnects among the nodes on each layer. Other connections between components are indicated by the arrows. A plus sign indicates an excitatory connection; a minus sign indicates an inhibitory connection. The function of the gain control and orienting subsystem is discussed in the text.

whether an input pattern is among the patterns previously stored in the network. Figure 8.2 illustrates the operation.

In Figure 8.2(a), an input pattern, I, is presented to the units on F1 in the same manner as in other networks: one vector component goes to each node. A pattern of activation, X, is produced across F1. The processing done by the units on this layer is a somewhat more complicated form of that done by the input layer of the CPN (see Section 6.1). The same input pattern excites both the orienting subsystem, A, and the gain control, G (the connections to G are not shown on the drawings). The output pattern, S, results in an inhibitory signal that is also sent to A. The network is structured such that this inhibitory signal exactly cancels the excitatory effect of the signal from I, so that A remains inactive. Notice that G supplies an excitatory signal to F1. The same signal is applied to each node on the layer and is therefore known as a nonspecific signal. The need for this signal will be made clear later.

The appearance of X on F1 results in an output pattern, S, which is sent through connections to F2. Each F2 unit receives the entire output vector, S,


from F1. The F2 units calculate their net-input values in the usual manner by summing the products of the input values and the connection weights. In response to inputs from F1, a pattern of activity, Y, develops across the nodes of F2. F2 is a competitive layer that performs a contrast enhancement on the input signal, like the competitive layer described in Section 6.1. The gain-control signals to F2 are omitted here for simplicity.

Figure 8.2 A pattern-matching cycle in an ART network is shown. The process evolves from the initial presentation of the input pattern in (a) to a pattern-matching attempt in (b), to reset in (c), to the final recognition in (d). Details of the cycle are discussed in the text.

In Figure 8.2(b), the pattern of activity, Y, results in an output pattern, U, from F2. This output pattern is sent as an inhibitory signal to the gain-control system. The gain control is configured such that, if it receives any inhibitory signal from F2, it ceases activity. U also becomes a second input pattern for the F1 units. U is transformed by the LTM traces on the top-down connections from F2 to F1. We shall call this transformed pattern V.

Notice that there are three possible sources of input to F1, but that only two appear to be used at any one time. The units on F1 (and F2 as well) are constructed so that they can become active only if two out of the possible


three sources of input are active. This feature is called the 2/3 rule; it plays an important role in ART, and we shall discuss it more fully later in this section.

Because of the 2/3 rule, only those F1 nodes receiving signals from both I and V will remain active. The pattern that remains on F1 is I ∩ V, the intersection of I and V. In Figure 8.2(b), the patterns mismatch and a new activity pattern, X*, develops on F1. Since the new output pattern, S*, is different from the original pattern, S, the inhibitory signal to A no longer cancels the excitation coming from I.

In Figure 8.2(c), A has become active in response to the mismatch of patterns on F1. A sends a nonspecific reset signal to all of the nodes on F2. These nodes respond according to their present state. If they are inactive, they do not respond. If they are active, they become inactive and they stay that way for an extended period of time. This sustained inhibition is necessary to prevent the same node from winning the competition during the next matching cycle. Since Y no longer appears, the top-down output and the inhibitory signal to the gain control also disappear.

In Figure 8.2(d), the original pattern, X, is reinstated on F1, and a new cycle of pattern matching begins. This time a new pattern, Y*, appears on F2. The nodes participating in the original pattern, Y, remain inactive due to the long-term effects of the reset signal from A.

This cycle of pattern matching will continue until a match is found, or until F2 runs out of previously stored patterns. If no match is found, the network will assign some uncommitted node or nodes on F2 and will begin to learn the new pattern.¹ Learning takes place through the modification of the weights, or the LTM traces. It is important to understand that this learning process does not switch on and off, but rather continues even while the pattern-matching process takes place. Any time signals are sent over connections, the weights associated with those connections are subject to modification. Why, then, do the mismatches not result in a loss of knowledge or the learning of incorrect associations? The reason is that the time required for significant changes to occur in the weights is very long with respect to the time required for a complete matching cycle. The connections participating in mismatches are not active long enough to affect the associated weights seriously.

When a match does occur, there is no reset signal and the network settles down into a resonant state as described earlier.
During this stable state, connections remain active for a sufficiently long time that the weights are strengthened. This resonant state can arise only when a pattern match occurs, or during the enlistment of new units on F2 in order to store a previously unknown pattern.

¹ In an actual ART network, the pattern-matching cycle may not visit all previously stored patterns before an uncommitted F2 node is chosen.
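The qualitative cycle just described can be condensed into a short sketch. The code below is our own summary, not Carpenter and Grossberg's formulation: it uses binary vectors, a simple overlap score in place of the properly weighted bottom-up competition, and it anticipates the vigilance test and fast-learning rules that Section 8.2 makes precise.

    # A minimal, informal sketch of one ART1-style matching cycle on binary
    # vectors (our simplification of the qualitative description above).
    def matching_cycle(I, templates, vigilance):
        """I: list of 0/1 values; templates: list of stored 0/1 templates."""
        inhibited = set()
        while True:
            # Bottom-up competition: here, simply the committed node whose
            # template overlaps I the most (Section 8.2 refines this choice).
            candidates = [(sum(a & b for a, b in zip(I, T)), j)
                          for j, T in enumerate(templates) if j not in inhibited]
            if not candidates:
                templates.append(list(I))        # enlist an uncommitted node
                return len(templates) - 1
            _, J = max(candidates)
            overlap = [a & b for a, b in zip(I, templates[J])]   # 2/3 rule: I AND template
            if sum(overlap) >= vigilance * sum(I):               # enough of I survives: resonance
                templates[J] = overlap                           # fast learning (Section 8.2)
                return J
            inhibited.add(J)                                     # mismatch: reset J, search again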


8.1.2 Gain Control in ART

Before continuing with a look at the dynamics of ART networks, we want to examine more closely the need for the gain-control mechanism. In the simple example discussed in the previous section, the existence of a gain control and the 2/3 rule appear to be superfluous. They are, however, quite important features of the system, as the following example illustrates.

Suppose the ART network of the previous section were only one in a hierarchy of networks in a much larger system. The F2 layer might be receiving inputs from a layer above it as well as from the F1 layer below. This hierarchical structure is thought to be a common one in biological neural systems. If the F2 layer were stimulated by an upper layer, it could produce a top-down output and send signals back to the F1 layer. It is possible that this top-down signal would arrive at F1 before an input signal, I, arrived at F1 from below. A premature signal from F2 could be the result of an expectation arising from a higher level in the hierarchy. In other words, F2 is indicating what it expects the next input pattern to be, before the pattern actually arrives at F1. Biological examples of this expectation phenomenon abound. For example, how often have you anticipated the next word that a person was going to say during a conversation?

The appearance of an early top-down signal from F2 presents us with a small dilemma. Suppose F1 produced an output in response to any single input vector, no matter what the source. Then the expectation signal arriving from F2 would elicit an output from F1, and the pattern-matching cycle would ensue without there ever having been an input vector to F1 from below. Now let's add in the gain control and the 2/3 rule.

According to the discussion in the previous section, if G exists, any signal coming from F2 results in an inhibition of G. Recall that G nonspecifically arouses every F1 unit. With the 2/3 rule in effect, inhibition of G means that a top-down signal from F2 cannot, by itself, elicit an output from F1. Instead, the F1 units become preconditioned, or sensitized, by the top-down pattern. In the biological language of Grossberg, the F1 units have received a subliminal stimulation from F2. If the expected input pattern is now received on F1 from below, this preconditioning results in an immediate resonance in the network.
Even if the input pattern is not the expected one, F1 will still provide an output, since it is receiving inputs from two out of the three possible sources, I, G, and F2.

If there is no expectation signal from F2, then F1 remains completely quiescent until it receives an input vector from below. Then, since G is not inhibited, F1 units are again receiving inputs from two sources, and F1 will send an output up to F2 to begin the matching cycle.

G and the 2/3 rule combine to permit the F1 layer to distinguish between an expectation signal from above and an input signal from below. In the former case, F1's response is subliminal; in the latter case, it is supraliminal; that is, it generates a nonzero output.
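The interplay of the gain control and the 2/3 rule can be captured in a small truth-table-style sketch. The code below is our own illustration, not part of the formal model; it simply counts active input sources for one F1 unit under the rule just described.

    # Our illustration of the 2/3 rule on a single F1 unit. The unit fires only
    # if at least two of its three possible input sources are active: the
    # bottom-up input bit I_i, the nonspecific gain control G, and the top-down
    # expectation bit V_i. G is active only while an input vector is present
    # and F2 is silent.
    def f1_output(I_i, V_i, input_present, f2_active):
        G = 1 if (input_present and not f2_active) else 0
        active_sources = int(I_i > 0) + int(V_i > 0) + G
        return 1 if active_sources >= 2 else 0

    # Top-down expectation alone (no input vector): G = 0, so F1 stays subliminal.
    assert f1_output(I_i=0, V_i=1, input_present=False, f2_active=True) == 0
    # Bottom-up input alone (F2 silent): I_i and G are both active, so F1 fires.
    assert f1_output(I_i=1, V_i=0, input_present=True, f2_active=False) == 1
    # Input plus matching expectation (F2 active, so G = 0): I_i and V_i carry it.
    assert f1_output(I_i=1, V_i=1, input_present=True, f2_active=True) == 1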


In the next section, we shall examine the equations that govern the operation of the ART1 network. We shall see explicitly how the gain control and 2/3 rule influence the processing-element activities. In Section 8.3, we shall extend the result to encompass the ART2 architecture.

Exercise 8.1: Based on the discussion in this section, describe how a gain-control signal on the F2 layer would function.

8.2 ART1

The ART1 architecture shares the same overall structure as that shown in Figure 8.1. Recall that all inputs to ART1 must be binary vectors; that is, they must have components that are elements of the set {0,1}. This restriction may appear to limit the utility of the network, but there are many problems having data that can be cast into binary format. The principles of operation of ART1 are similar to those of ART2, where analog inputs are allowed. Moreover, the restrictions and assumptions that we make for ART1 will simplify the mathematics a bit. We shall examine the attentional subsystem first, including the STM layers F1 and F2, and the gain-control mechanism, G.

8.2.1 The Attentional Subsystem

The dynamic equations for the activities of the processing elements on layers F1 and F2 both have the form

    \epsilon \dot{x}_k = -x_k + (1 - A x_k) J_k^+ - (B + C x_k) J_k^-        (8.1)

where J_k^+ is an excitatory input to the kth unit and J_k^- is an inhibitory input. The precise definitions of these terms, as well as those of the parameters A, B, and C, depend on which layer is being discussed, but all terms are assumed to be greater than zero. Henceforth we shall use x_{1i} to refer to activities on the F1 layer, and x_{2j} for activities on the F2 layer. Similarly, we shall add numbers to the parameter names to identify the layer to which they refer: for example, B_1, A_2. For convenience, we shall label the nodes on F1 with the symbol v_i and those on F2 with v_j. The subscripts i and j will be used exclusively to refer to the layers F1 and F2, respectively.

The factor \epsilon requires some explanation. Recall from the previous section that pattern-matching activities between layers F1 and F2 must occur much faster than the time required for the connection weights to change significantly. The \epsilon factor in Eq. (8.1) embodies that requirement. If we insist that 0 < \epsilon << 1, then the unit activities reach their equilibrium values quickly, long before the weights have had time to change appreciably.
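To see the effect of a small \epsilon numerically, Eq. (8.1) can be integrated directly. The short sketch below is purely illustrative; the parameter and input values are arbitrary choices of ours, not values prescribed by the text.

    # Illustrative only: integrate Eq. (8.1) with simple Euler steps to show
    # that the activity x_k settles to its equilibrium on a time scale set by
    # epsilon. Parameter values are arbitrary choices for this demonstration.
    def settle(J_plus, J_minus, A=1.0, B=1.5, C=5.0, eps=0.01, dt=0.001, steps=2000):
        x = 0.0
        for _ in range(steps):
            dx = (-x + (1.0 - A * x) * J_plus - (B + C * x) * J_minus) / eps
            x += dt * dx
        return x

    # Equilibrium predicted by setting the right-hand side of Eq. (8.1) to zero:
    # x = (J_plus - B*J_minus) / (1 + A*J_plus + C*J_minus).
    print(settle(J_plus=2.0, J_minus=1.0))   # approx. (2 - 1.5)/(1 + 2 + 5) = 0.0625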


Exercise 8.2: Show that the activities of the processing elements described by Eq. (8.1) have their values bounded within the interval [-B/C, 1/A], no matter how large the excitatory or inhibitory inputs may become.

Processing on F1. Figure 8.3 illustrates an F1 processing element with its various inputs and weight vectors. The units calculate a net-input value coming from F2 in the usual way:²

    V_i = \sum_j u_j z_{ij}        (8.2)

We assume the unit output function quickly rises to 1 for nonzero activities. Thus, we can approximate the unit output, s_i, with a binary step function:

    s_i = h(x_{1i}) = \begin{cases} 1 & x_{1i} > 0 \\ 0 & x_{1i} \le 0 \end{cases}        (8.3)

The total excitatory input, J_i^+, is given by

    J_i^+ = I_i + D_1 V_i + B_1 G        (8.4)

where D_1 and B_1 are constants.³ The inhibitory term, J_i^-, we shall set identically equal to 1. With these definitions, the equations for the F1 processing elements are

    \dot{x}_{1i} = -x_{1i} + (1 - A_1 x_{1i})(I_i + D_1 V_i + B_1 G) - (B_1 + C_1 x_{1i})        (8.5)

The output, G, of the gain-control system depends on the activities on other parts of the network. We can describe G succinctly with the equation

    G = \begin{cases} 1 & \text{if } \mathbf{I} \ne 0 \text{ and } \mathbf{U} = 0 \\ 0 & \text{otherwise} \end{cases}        (8.6)

In other words, if there is an input vector, and F2 is not actively producing an output vector, then G = 1. Any other combination of activity on I and F2 effectively inhibits the gain control from producing its nonspecific excitation to the units on F1.⁴

² The convention on weight indices that we have utilized consistently throughout this text is opposite to that used by Carpenter and Grossberg. In our notation, z_{ij} refers to the weight on the connection from the jth unit to the ith unit. In Carpenter and Grossberg's notation, z_{ij} would refer to the weight on the connection from the ith node to the jth.

³ Carpenter and Grossberg include the parameter D_1 in their calculation of V_i. In their notation, V_i = D_1 \sum_j u_j z_{ij}. You should bear this difference in mind when reading their papers.

⁴ In the original paper by Carpenter and Grossberg, the authors described four different ways of implementing the gain-control system. The method that we have shown here was chosen to be consistent with the general description of ART given in the previous sections [5].
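The F1 processing just described translates almost line for line into code. The sketch below is our own illustration of Eqs. (8.2) through (8.6); it uses the general equilibrium solution of Eq. (8.5), which reduces to the special cases examined in Eqs. (8.7) through (8.9) below, and it borrows the parameter values of the worked example later in this section (A1 = 1, B1 = 1.5, C1 = 5, D1 = 0.9).

    import numpy as np

    # Our sketch of ART1 F1 processing (Eqs. 8.2-8.6) in the fast-settling
    # regime, using the equilibrium of Eq. (8.5):
    #     x = (J+ - B1) / (1 + A1*J+ + C1),   J+ = I_i + D1*V_i + B1*G.
    A1, B1, C1, D1 = 1.0, 1.5, 5.0, 0.9

    def f1_layer(I, U, Z_td):
        """I: binary input vector; U: F2 output vector; Z_td[i, j]: top-down weights."""
        V = Z_td @ U                                   # Eq. (8.2): net input from F2
        G = 1.0 if (I.any() and not U.any()) else 0.0  # Eq. (8.6): gain control
        J_plus = I + D1 * V + B1 * G                   # Eq. (8.4)
        x = (J_plus - B1) / (1.0 + A1 * J_plus + C1)   # equilibrium of Eq. (8.5)
        s = (x > 0).astype(float)                      # Eq. (8.3): binary step output
        return x, s

    I = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
    x, s = f1_layer(I, U=np.zeros(6), Z_td=np.full((5, 6), 0.756))
    # With F2 silent, only the unit receiving I_i = 1 goes positive (x is about 0.118).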


Figure 8.3 This diagram shows a processing element, v_i, on the F1 layer of an ART1 network. The activity of the unit is x_{1i}. It receives a binary input value, I_i, from below, and an excitatory signal, G, from the gain control. In addition, the top-down signals, u_j, from F2 are gated by (multiplied by) the weights, z_{ij}. Outputs, s_i, from the processing element go up to F2 and across to the orienting subsystem, A.

Let's examine Eq. (8.5) for the four possible combinations of activity on I and F2. First, consider the case where there is no input vector and F2 is inactive. Equation (8.5) reduces to

    \dot{x}_{1i} = -x_{1i} - (B_1 + C_1 x_{1i})

In equilibrium, \dot{x}_{1i} = 0, and

    x_{1i} = \frac{-B_1}{1 + C_1}        (8.7)

Thus, units with no inputs are held in a negative activity state.

Now apply an input vector, I, but keep F2 inactive for the moment. In this case, both F1 and the gain control receive input signals from below. Since F2


is inactive, G is not inhibited. The equations for the F1 units become

    \dot{x}_{1i} = -x_{1i} + (1 - A_1 x_{1i})(I_i + B_1 G) - (B_1 + C_1 x_{1i})

When the units reach their equilibrium activities,

    x_{1i} = \frac{I_i}{1 + A_1 (I_i + B_1) + C_1}        (8.8)

where we have used G = 1. In this case, units that receive a nonzero input value from below also generate an activity value greater than zero and have a nonzero output value according to Eq. (8.3). Units that do not receive a nonzero input nevertheless have their activities raised to the zero level through the nonspecific excitation signal from G.

For the third scenario, we examine the case where an input pattern, I, and a top-down pattern, V, from F2 are coincident on the F1 layer. The equations for the unit activities are

    \dot{x}_{1i} = -x_{1i} + (1 - A_1 x_{1i})(I_i + D_1 V_i) - (B_1 + C_1 x_{1i})

Here, the equilibrium value

    x_{1i} = \frac{I_i + D_1 V_i - B_1}{1 + A_1 (I_i + D_1 V_i) + C_1}        (8.9)

requires a little effort to interpret. Whether x_{1i} is greater than, equal to, or less than zero depends on the relative values of the quantities in the numerator of Eq. (8.9). We can distinguish three cases of interest, which are determined by application of the 2/3 rule.

If a unit has a positive input value, I_i, and a large positive net input from above, V_i, then the 2/3 rule says that the unit activity should be greater than zero. For this to be the case, it must be true that I_i + D_1 V_i - B_1 > 0 in Eq. (8.9). To analyze this relation further, we shall anticipate some results from the discussion of processing on the F2 layer. Specifically, we shall assume that only a single node on F2 has a nonzero output at any given time, the maximum output of an F2 node is 1, and the maximum weight on any top-down connection is also 1. Since V_i = \sum_j u_j z_{ij} and only one u_j is nonzero, then V_i \le 1. Then, in the most extreme case, with V_i = 1 and I_i = 1, we must have 1 + D_1 - B_1 > 0, or

    B_1 < D_1 + 1        (8.10)

Any unit that does not receive a top-down signal from F2 must have a negative activity, even if it receives an input from below. In this case, we must have I_i - B_1 < 0, or

    B_1 > 1        (8.11)

Suppose F2 is producing a top-down output (perhaps as a result of inputs from higher levels), but there is not yet an input vector I. G is still inhibited in


this case. The equilibrium state is

    x_{1i} = \frac{D_1 V_i - B_1}{1 + A_1 D_1 V_i + C_1}        (8.12)

If a unit receives no net input from F2 (V_i = 0), then it remains at its most negative activity level, as in Eq. (8.7). If V_i > 0, then the unit's activity rises to some value above that of Eq. (8.7), but it must remain negative because we do not want the unit to have a nonzero output based on top-down inputs alone. Then, from the numerator of Eq. (8.12), we must have D_1 - B_1 < 0, or

    B_1 > D_1        (8.13)

Equations (8.10), (8.11), and (8.13) combine to give the overall condition,

    \max\{D_1, 1\} < B_1 < D_1 + 1        (8.14)

The ART1 parameters must satisfy the constraint of Eq. (8.14) to implement the 2/3 rule successfully and to distinguish between top-down and bottom-up input patterns.

Satisfying the constraints of Eq. (8.14) does not guarantee that a unit that receives both an input from below, I_i, and one from above, V_i, will have a positive value of activation. We must consider the case where V_i is less than its maximum value of 1 (even though u_j = 1, z_{ij} may be less than 1, resulting in V_i < 1). In such a case, the condition that Eq. (8.9) gives a positive value is

    I_i + D_1 V_i - B_1 > 0

Since I_i = 1, this relation defines a condition on V_i:

    V_i > \frac{B_1 - 1}{D_1}        (8.15)

Equation (8.15) tells us that the input to unit v_i due to a top-down signal from F2 must meet a certain threshold condition to ensure a positive activation of v_i, even if v_i receives a strong input, I_i, from below. We shall return to this result when we discuss the weights, or LTM traces, from F2 to F1.

Processing on F2. Figure 8.4 shows a typical processing element on the F2 layer. The gain-control input and the connection from the orienting subsystem are shown, but we shall not include them explicitly in the following analysis. The unit activations are described by an equation of the form of Eq. (8.1). We can specify the various terms as follows.

The net input received from the F1 layer is calculated as usual:

    T_j = \text{net}_j = \sum_i s_i z_{ji}        (8.16)
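Equation (8.16) is an ordinary weighted sum. As a quick illustration (our own, with a hypothetical weight matrix), the bottom-up net inputs for the whole F2 layer can be computed in one line; the competitive dynamics developed next will enhance the unit with the largest value, which argmax identifies here.

    import numpy as np

    # Our illustration of Eq. (8.16): bottom-up net inputs to the F2 layer.
    # Z_bu[j, i] is the weight from F1 unit v_i to F2 unit v_j (a hypothetical
    # matrix here); s is the binary F1 output vector from Eq. (8.3).
    s = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
    Z_bu = np.full((6, 5), 0.329)        # e.g., freshly initialized weights
    T = Z_bu @ s                         # T_j = sum_i s_i * z_ji
    winner = int(np.argmax(T))           # the node favored by the competition (Eq. 8.20)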


Figure 8.4 A processing element, v_j, is shown on the F2 layer of an ART1 network. The activity of the unit is x_{2j}. The unit v_j receives inputs from the F1 layer, the gain-control system, G, and the orienting subsystem, A. Bottom-up signals, s_i, from F1 are gated by the weights, z_{ji}. Outputs, u_j, are sent back down to F1. In addition, each unit receives a positive-feedback term from itself, g(x_{2j}), and sends an identical signal through an inhibitory connection to all other units on the layer.

The total excitatory input to v_j is

    J_j^+ = D_2 T_j + g(x_{2j})        (8.17)

The inhibitory input to each unit is

    J_j^- = \sum_{k \ne j} g(x_{2k})        (8.18)

Substituting these values into Eq. (8.1) yields

    \dot{x}_{2j} = -x_{2j} + (1 - A_2 x_{2j})\big(D_2 T_j + g(x_{2j})\big) - (B_2 + C_2 x_{2j}) \sum_{k \ne j} g(x_{2k})        (8.19)


Note the similarity between Eq. (8.19) and Eq. (6.13) from Section 6.1.3. Both equations describe a competitive layer with on-center off-surround interactions. In Section 6.1.3, we showed how the choice of the functional form of g(x) influenced the evolution of the activities of the units on the layer.⁵ Without repeating the analysis, we assume that the values of the various parameters in Eq. (8.19), and the functional form of g(x), have been chosen to enhance the activity of the single F2 node with the largest net-input value from F1 according to Eq. (8.16). The activities of all other nodes are suppressed to zero. The output of this winning node is given a value of one. We can therefore express the output values of F2 nodes as

    u_j = f(x_{2j}) = \begin{cases} 1 & T_j = \max_k\{T_k\} \ \forall k \\ 0 & \text{otherwise} \end{cases}        (8.20)

We need to clarify one final point concerning Figure 8.4. The processing element in that figure appears to violate our standard of a single output per node: the node sends an output of g(x_{2j}) to the F2 units, and an output of f(x_{2j}) to the F1 units. We can reconcile this discrepancy by allowing Figure 8.4 to represent a composite structure. We can arrange for unit v_j to have the single output value x_{2j}. This output can then be sent to two other processing elements: one that gives an output of g(x_{2j}), and one that gives an output of f(x_{2j}). By assuming the existence of these intermediate nodes, or interneurons, we can avoid violating the single-output standard. The node in Figure 8.4 then represents a composite of the v_j node and the two intermediate nodes.

Top-Down LTM Traces. The equations that describe the top-down LTM traces (weights on connections from F2 units to F1 units) should be somewhat familiar from the study of Chapter 6:

    \dot{z}_{ij} = \big(-z_{ij} + h(x_{1i})\big) f(x_{2j})        (8.21)

Since f(x_{2j}) is nonzero for only one value of j (one F2 node, v_J), Eq. (8.21) is nonzero only for connections leading down from that winning unit. If the jth F2 node is active and the ith F1 node is also active, then \dot{z}_{ij} = -z_{ij} + 1, and z_{ij} asymptotically approaches one. If the jth F2 node is active and the ith F1 node is not active, then \dot{z}_{ij} = -z_{ij}, and z_{ij} decays toward zero. We can summarize the behavior of z_{ij} as follows:

    \dot{z}_{ij} = \begin{cases} -z_{ij} + 1 & v_j \text{ active and } v_i \text{ active} \\ -z_{ij} & v_j \text{ active and } v_i \text{ inactive} \\ 0 & v_j \text{ inactive} \end{cases}        (8.22)

Recall from Eq. (8.15) that, if F2 is active, then v_i can be active only if it is receiving an input, I_i, from below and a sufficiently large net input, V_i,

⁵ Caution! The function g(x) in this section is the analog of f(x) in Section 6.1.3. In Section 6.1.3, g(x) was used to mean x^{-1} f(x). In this section, g(x) carries no such meaning.


from the F2 layer. Since only one F2 unit is active at a time, V_i = u_J z_{iJ} = z_{iJ}. Equation (8.15) now becomes a condition on the weight on the connection from v_J to v_i:

    z_{iJ} > \frac{B_1 - 1}{D_1}        (8.23)

Unless z_{iJ} exceeds the minimum value given by Eq. (8.23), it will decay toward zero even if v_J is active and v_i is receiving an input, I_i, from below.

From a practical standpoint, all top-down connection weights must be initialized to a value greater than the minimum given by Eq. (8.23) in order for any learning to take place in the network. Otherwise, any time a v_j is active on F2, all connections from it to any unit on F1 will decay toward zero; eventually, all the connection weights will be zero and the system will be useless.

If a resonant condition is allowed to continue for an extended period, top-down weights will approach their asymptotic values of 1 or 0, according to Eq. (8.22). For the purpose of our digital simulations, we shall assume that input patterns are maintained on F1 for a sufficient time to permit the top-down weights to equilibrate. Thus, as soon as a resonant state is detected, we can immediately set the appropriate top-down weights to their asymptotic values. If v_J is the winning F2 node, then

    z_{iJ} = \begin{cases} 1 & v_i \text{ active} \\ 0 & \text{otherwise} \end{cases}        (8.24)

Weights on connections from other nodes, v_j, j \ne J, are not changed. We refer to this model as fast learning. Weights on bottom-up connections also have a fast-learning model, which we shall discuss next.

Bottom-Up LTM Traces. The equations that describe the bottom-up LTM traces are slightly more complicated than were those for the top-down LTM traces. The weight on the connection from v_i on F1 to v_j on F2 is determined by

    \dot{z}_{ji} = K f(x_{2j}) \Big[ (1 - z_{ji}) L\, h(x_{1i}) - z_{ji} \sum_{k \ne i} h(x_{1k}) \Big]        (8.25)

where K and L are constants, f(x_{2j}) is the output of v_j, and h(x_{1i}) is the output of v_i. As was the case with the top-down LTM traces, the factor f(x_{2j}) ensures that only weights on the winning node are allowed to change. Other than that, Eq. (8.25) is an equation for a competitive system with on-center off-surround interactions. In this case, however, it is the individual weights that are competing among one another, rather than individual units.

Let's assume v_J is the winning unit on F2. There are two cases to consider. If v_i is active on F1, then h(x_{1i}) = 1; otherwise, h(x_{1i}) = 0. Weights on connections to other F2 units do not change since, for them, f(x_{2k}) = 0, k \ne J.

Before going further with Eq. (8.25), we wish to introduce a new notation. If the input pattern is I, then we shall define the magnitude of I as |I| = \sum_i I_i.


Since I_i is either 0 or 1, the magnitude of I is equal to the number of nonzero inputs. The output of F1 is the pattern S; its magnitude is |S| = \sum_i h(x_{1i}), which is also just the number of nonzero outputs on F1. The actual output pattern, S, depends on conditions in the network:

    \mathbf{S} = \begin{cases} \mathbf{I} & F_2 \text{ is inactive} \\ \mathbf{I} \cap \mathbf{V}^J & F_2 \text{ is active} \end{cases}        (8.26)

where the superscript on V^J means that v_J was the winning node on F2. We can interpret V^J as the binary pattern with a 1 at those positions where the input, V_i, from above is large enough to support the activation of v_i whenever v_i receives an input I_i from below.

Since |S| = \sum_i h(x_{1i}), then \sum_{k \ne i} h(x_{1k}) = \sum_k h(x_{1k}) - h(x_{1i}), which will be equal to either |S| - 1 or |S|, depending on whether v_i is active. We can now summarize the three cases of Eq. (8.25) as follows:

    \dot{z}_{ji} = \begin{cases} K\big[(1 - z_{ji})L - z_{ji}(|S| - 1)\big] & \text{if } v_i \text{ and } v_j \text{ are active} \\ -K\,z_{ji}|S| & \text{if } v_i \text{ is inactive and } v_j \text{ is active} \\ 0 & \text{if } v_j \text{ is inactive} \end{cases}        (8.27)

Remember, v_i can remain active only if it receives an input I_i from below and a sufficiently large input V_i from above. In the fast-learning case, weights on the winning F2 node, v_J, take on the asymptotic values given by

    z_{Ji} = \begin{cases} \dfrac{L}{L - 1 + |S|} & \text{if } v_i \text{ is active} \\ 0 & \text{if } v_i \text{ is inactive} \end{cases}        (8.28)

where we have L > 1 in order to keep L - 1 > 0.

By defining the bottom-up weights as we have in this section, we impart to the ART1 network an important characteristic that we can elucidate with the following scenario. Suppose one F2 node, v_{J1}, wins and learns during the presentation of an input pattern, I_1, and another node, v_{J2}, wins and learns input pattern I_2. Further, let input pattern I_1 be a subset of pattern I_2; that is, I_1 ⊂ I_2. When next we present either I_1 or I_2, we would like the appropriate node on F2 to win. Equation (8.28) ensures that the proper node will win by keeping the weights on the node that learned the subset pattern sufficiently larger than the weights on the node that learned the superset pattern.

For the subset pattern, I_1, weights on connections from active F1 nodes all take on the value

    z_{J1,i} = \frac{L}{L - 1 + |\mathbf{I}_1|}

where we have used the result |S| = |I_1 ∩ V^{J1}| = |I_1|. We have not yet shown that this result holds, but we will see shortly that it is true because of the way that we initialize the top-down weights on the F1 layer.

For the superset pattern, I_2,

    z_{J2,i} = \frac{L}{L - 1 + |\mathbf{I}_2|}


on connections from those active F1 nodes. Now let's calculate the net input to the two nodes for each of the input patterns to verify that node v_{J1} wins for pattern I_1 and node v_{J2} wins for pattern I_2.

The net inputs to v_{J1} and v_{J2} for input pattern I_1 are

    T_{J1,1} = \sum_i z_{J1,i}\, h(x_{1i,1}) = \frac{L|\mathbf{I}_1|}{L - 1 + |\mathbf{I}_1|}

and

    T_{J2,1} = \sum_i z_{J2,i}\, h(x_{1i,1}) = \frac{L|\mathbf{I}_1|}{L - 1 + |\mathbf{I}_2|}

respectively, where the extra subscript on T and h(x_{1i}) refers to the number of the pattern being presented. Since I_1 ⊂ I_2, |I_1| < |I_2|, and T_{J2,1} < T_{J1,1}, so v_{J1} wins, as desired. When I_2 is presented, the net inputs are

    T_{J1,2} = \sum_i z_{J1,i}\, h(x_{1i,2}) = \frac{L|\mathbf{I}_1|}{L - 1 + |\mathbf{I}_1|}

and

    T_{J2,2} = \sum_i z_{J2,i}\, h(x_{1i,2}) = \frac{L|\mathbf{I}_2|}{L - 1 + |\mathbf{I}_2|}

Notice that |I_1| appears in the numerator of the expression for T_{J1,2} instead of |I_2|. Recall that v_{J1} has learned only the subset pattern; therefore, bottom-up weights on that node are nonzero only for F1 units that represent the subset. This time, since |I_1| < |I_2|, we have T_{J2,2} > T_{J1,2}, so v_{J2} wins.

Exercise 8.3: The expression for the net-input values to the winning F2 units has the form

    T = \frac{a|\mathbf{I}|}{b + |\mathbf{I}|}

where a and b are both positive. Show that this function, called a Weber function, is an increasing function of |I| for |I| \ge 1.

On a network with uncommitted nodes in F2 (i.e., nodes that have not yet participated in any learning), we must also ensure that their weights, due to the initialization scheme, are not so great that they accidentally win over a node that has learned the pattern. Therefore, we must keep all initial weights below a


certain value. Since all patterns are a subset of the pattern containing all 1s, we should keep all initial weight values within the range

    0 < z_{ji}(0) < \frac{L}{L - 1 + M}        (8.29)

where M is the number of nodes on F1, and hence is the number of bottom-up connections to each F2 node. This condition ensures that an uncommitted node does not accidentally win over a node that has learned a particular input pattern.

Exercise 8.4: Let F2 node v_J learn an input pattern I according to Eq. (8.28). Assume that all other nodes have their weights initialized according to Eq. (8.29). Prove that presentation of I will activate v_J, rather than any other uncommitted node, v_j, j \ne J.

8.2.2 The Orienting Subsystem

The orienting subsystem in an ART network is responsible for sensing mismatches between bottom-up and top-down patterns on the F1 layer. Its operation can be modeled by the addition of terms to the dynamic equations that describe the activities of the F2 processing elements. Since our discussion has evolved from the dynamic equations to their asymptotic solutions and the fast-learning case, we shall not return to the dynamic equations at this point.

There are many ways to model the dynamics of the orienting subsystem. One example is the development by Ryan and Winter, and by Ryan, Winter, and Turner, listed in the Suggested Readings section at the end of this chapter. Our approach here will be to describe the details of the matching and reset process and the effects on the F2 units.

We can model the orienting subsystem as a single processing element, A, with an output to each unit on the F2 layer. The inputs to A are the outputs of the F1 units, S, and the input vector, I. The weights on the connections from the input vector are all equal to a value P; those on the connections from F1 are all equal to a value -Q. The net input to A is then P|I| - Q|S|. The output of A switches on if the net input becomes positive:

    P|\mathbf{I}| - Q|\mathbf{S}| > 0

or

    \frac{|\mathbf{S}|}{|\mathbf{I}|} < \frac{P}{Q}

The quantity P/Q is given the name vigilance parameter and is usually identified by the symbol \rho. Thus, activation of the orienting subsystem is prevented if

    \frac{|\mathbf{S}|}{|\mathbf{I}|} \ge \rho        (8.30)


Recall that |S| = |I| when F2 is inactive. The orienting subsystem must not send a reset signal to F2 at that time. From Eq. (8.30), we get a condition on the vigilance parameter:

    \rho \le 1

We also obtain a corresponding condition on P and Q:

    P \le Q
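The reset decision of Eq. (8.30) is simple enough to state in a few lines of code. The sketch below is ours; it assumes binary vectors and the fast-learning framework developed above.

    # Our sketch of the orienting subsystem's decision (Eq. 8.30): a reset fires
    # when the fraction of the input preserved on F1 falls below the vigilance rho.
    def reset_required(I, S, rho):
        """I: binary input vector; S: binary F1 output (I AND the top-down template)."""
        return sum(S) < rho * sum(I)      # equivalent to |S|/|I| < rho when |I| > 0

    # A perfect match never triggers a reset as long as rho <= 1, because
    # |S| = |I| when F2 is inactive.
    assert reset_required([1, 0, 1, 0], [1, 0, 1, 0], rho=0.9) is False
    assert reset_required([1, 0, 1, 0], [0, 0, 1, 0], rho=0.9) is True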


Having a value of \rho that is less than one also permits the possibility that the top-down pattern coded by F2 to represent a particular class may change as new input vectors are presented to the network. We can see this by referring back to Figure 8.5(a). We implicitly assumed that the top-down pattern on the right (with the extra dot) had been previously encoded (learned) by one of the F2 nodes. The appearance of the bottom-up pattern on the left (without the extra dot) does not cause a reset, so a resonance is established between the F1 layer and the winning node on F2 that produced the top-down pattern. During this resonance, weights can be changed. The F1 node corresponding to the feature of the center dot is not active, since that feature is missing from the input vector. According to Eq. (8.22), the top-down weight on that connection will decay away; in the fast-learning mode, we simply set it equal to zero. Similarly, the bottom-up weight on the F2 node will decay according to Eq. (8.27).

The recoding of top-down template patterns described in the previous paragraph could lead to instabilities in the learning process. These instabilities could manifest themselves as a continual change in the class of input vectors recognized, or encoded, by each F2 unit. Fortunately, ART was developed to combat this instability. We shall not prove it here, but it can be shown that category learning will stabilize in an ART1 network after at most a few recodings. We shall demonstrate this result in the next section, where we look at a numerical example of ART1 processing. The stability of the learning is a direct result of the use of the 2/3 rule described earlier in this chapter.

To complete the model of the orienting subsystem, we must consider its effects on the F2 units. When a pattern mismatch occurs, the orienting subsystem should inhibit the F2 unit that resulted in the nonmatching pattern, and should maintain that inhibition throughout the remainder of the matching cycle.

We have now concluded our presentation of the ART1 network. Before proceeding on to ART2, we shall summarize the entire ART1 model for the asymptotic and fast-learning case.

Exercise 8.5: Assume that an input pattern has resulted in an unsuccessful search for a match through all previously encoded F2 templates. On the next matching cycle, one of the previously uncommitted F2 nodes will win by default. Show that this situation will not result in a reset signal from A.
This proof demonstrates that ART1 will automatically enlist a new F2 node if the input pattern cannot be categorized with previously learned patterns.

8.2.3 ART1 Processing Summary

For this summary, we shall employ the asymptotic solutions to the dynamic equations and the fast-learning case for the weights. We shall also present a step-by-step calculation showing how the ART1 network learns and recalls patterns.

To begin with, we must determine the size of the F1 and F2 layers, and the values of the various parameters in the system. Let M be the number of units


on F1 and N be the number of units on F2. Other parameters must be chosen according to the following constraints:

    A_1 > 0
    C_1 > 0
    D_1 > 0
    \max\{D_1, 1\} < B_1 < D_1 + 1
    L > 1
    0 < \rho \le 1

Top-down weights are initialized according to

    z_{ij}(0) > \frac{B_1 - 1}{D_1}

and bottom-up weights according to

    0 < z_{ji}(0) < \frac{L}{L - 1 + M}

1. Apply an input vector I to F1. F1 activities are calculated according to

    x_{1i} = \frac{I_i}{1 + A_1 (I_i + B_1) + C_1}

2. Calculate the output vector for F1:

    s_i = h(x_{1i}) = \begin{cases} 1 & x_{1i} > 0 \\ 0 & x_{1i} \le 0 \end{cases}

3. Propagate S forward to F2 and calculate the activities according to

    T_j = \sum_{i=1}^{M} s_i z_{ji}


4. Only the winning F2 node has a nonzero output:

    u_j = \begin{cases} 1 & T_j = \max_k\{T_k\} \ \forall k \\ 0 & \text{otherwise} \end{cases}

   We shall assume that the winning node is v_J.

5. Propagate the output from F2 back to F1. Calculate the net inputs from F2 to the F1 units:

    V_i = \sum_{j=1}^{N} u_j z_{ij}

6. Calculate the new F1 activities according to

    x_{1i} = \frac{I_i + D_1 V_i - B_1}{1 + A_1 (I_i + D_1 V_i) + C_1}

7. Determine the new output values, s_i, as in step 2.

8. Determine the degree of match between the input pattern and the top-down template:

    \frac{|\mathbf{S}|}{|\mathbf{I}|} = \frac{\sum_i s_i}{\sum_i I_i}

9. If |S|/|I| < \rho, mark v_J as inactive, zero the outputs of F2, and return to step 1 using the original input pattern. If |S|/|I| \ge \rho, continue.

10. Update the bottom-up weights on v_J only:

    z_{Ji} = \begin{cases} \dfrac{L}{L - 1 + |S|} & \text{if } v_i \text{ is active} \\ 0 & \text{if } v_i \text{ is inactive} \end{cases}

11. Update the top-down weights coming from v_J only, to all F1 units:

    z_{iJ} = \begin{cases} 1 & \text{if } v_i \text{ is active} \\ 0 & \text{if } v_i \text{ is inactive} \end{cases}

12. Remove the input pattern. Restore all inactive F2 units. Return to step 1 with a new input pattern.

To see this algorithm in operation, let's perform a step-by-step calculation for a small example problem. We will use this example to see how subset and superset patterns are handled by the network. (A compact code sketch of this fast-learning procedure appears at the end of this section.)

We shall choose the dimensions of F1 and F2 as M = 5 and N = 6, respectively. Other parameters in the system are A_1 = 1, B_1 = 1.5, C_1 = 5, D_1 = 0.9, L = 3, \rho = 0.9.

We initialize the weights on the F1 units by adding a small, positive value (0.2, in this case) to (B_1 - 1)/D_1. Each unit on F1 then has the weight vector

    z_i = (0.756, 0.756, 0.756, 0.756, 0.756, 0.756)^t


Since L = 3 and M = 5, weights on F2 units are all initialized to slightly less than L/(L - 1 + M). Each weight is given the value 0.429 - 0.1 = 0.329. Each weight vector is then

    z_j = (0.329, 0.329, 0.329, 0.329, 0.329)^t

All F2 units are initialized to zero activity. F1 activities are initialized to -B_1/(1 + C_1) = -0.25:

    X(0) = (-0.25, -0.25, -0.25, -0.25, -0.25)^t

We can now begin actual processing. We shall start with the simple input vector,

    I_1 = (0, 0, 0, 1, 0)^t

Now, we follow the sequence of the algorithm:

1. After the input vector is applied, the F1 activities become

    X_1 = (0, 0, 0, 0.118, 0)^t

2. The output vector is S = (0, 0, 0, 1, 0)^t.

3. Propagating this output vector to F2, the net inputs to all F2 units will be identical:

    T = (0.329, 0.329, 0.329, 0.329, 0.329, 0.329)^t

4. Since all unit activities are equal, simply take the first unit as our winner. Then, the output from F2 is U = (1, 0, 0, 0, 0, 0)^t.

5. Propagate back to F1:

    V_i = z_i \cdot U

   Then, V = (0.756, 0.756, 0.756, 0.756, 0.756)^t.

6. Calculate the new activity values on F1 according to Eq. (8.9):

    X = (-0.123, -0.123, -0.123, 0.023, -0.123)^t

7. Only unit 4 has a positive activity, so the new outputs are S = (0, 0, 0, 1, 0)^t.

8. |S|/|I| = 1 > \rho.

9. There is no reset: resonance has been reached.

10. Update the bottom-up weights on F2 unit v_1 according to the fast-learning rule. Since L/(L - 1 + |S|) = 1, z_1 on F2 becomes (0, 0, 0, 1, 0)^t. Thus, unit v_1 encodes the input vector exactly. It is actually in the pattern of nonzero weights that


we are interested here. The fact that the single nonzero weight matches the input value is purely coincidental, as we shall see shortly.

11. Update the weights on the top-down connections. Only the first weight on each F1 unit changes. If each row describes the weights on one unit, then the weight matrix for the F1 units is

    0  0.756  0.756  0.756  0.756  0.756
    0  0.756  0.756  0.756  0.756  0.756
    0  0.756  0.756  0.756  0.756  0.756
    1  0.756  0.756  0.756  0.756  0.756
    0  0.756  0.756  0.756  0.756  0.756

That completes the cycle for the first input pattern. Now let's apply a second pattern that is orthogonal to the first, namely I_2 = (0, 0, 1, 0, 1)^t. We shall abbreviate the calculation somewhat.

When the input pattern is propagated to F2, the resultant activities are T = (0.000, 0.657, 0.657, 0.657, 0.657, 0.657)^t. Unit 1 definitely loses. We select unit 2 as the winner. The output of F2 is U = (0, 1, 0, 0, 0, 0)^t. Propagating back to F1, we get V = (0.756, 0.756, 0.756, 0.756, 0.756)^t and X = (-0.123, -0.123, 0.0234, -0.123, 0.0234)^t. The resulting output matches the input vector, (0, 0, 1, 0, 1)^t, so there is no reset.

Weights on the winning F2 unit are set to (0, 0, 0.75, 0, 0.75)^t. The second weight on each F1 unit is adjusted such that the new weight matrix on F1 is

    0  0  0.756  0.756  0.756  0.756
    0  0  0.756  0.756  0.756  0.756
    0  1  0.756  0.756  0.756  0.756
    1  0  0.756  0.756  0.756  0.756
    0  1  0.756  0.756  0.756  0.756

For reference, the weight matrix on F2 looks like

    0      0      0      1      0
    0      0      0.75   0      0.75
    0.329  0.329  0.329  0.329  0.329
    0.329  0.329  0.329  0.329  0.329
    0.329  0.329  0.329  0.329  0.329
    0.329  0.329  0.329  0.329  0.329

Now let's see what happens when we apply a vector that is a subset of I_2, namely I_3 = (0, 0, 0, 0, 1)^t. When this pattern is propagated forward, the activities on F2 become (0, 0.75, 0.329, 0.329, 0.329, 0.329)^t. Notice that the second unit wins the competition in this case, so the F2 output is (0, 1, 0, 0, 0, 0)^t. Going back to F1, the net inputs from the top-down pattern are V = (0, 0, 1, 0, 1)^t.


In this case, the equilibrium activities are X = (-0.25, -0.25, -0.087, -0.25, 0.051)^t, with only one positive activity. The new output pattern is (0, 0, 0, 0, 1)^t, which exactly matches the input pattern, so no reset occurs.

Even though unit 2 on F2 had previously encoded an input pattern, it gets recoded now to match the new input pattern, which is a subset of the original pattern. The new weight matrices appear as follows. For F1,

    0  0  0.756  0.756  0.756  0.756
    0  0  0.756  0.756  0.756  0.756
    0  0  0.756  0.756  0.756  0.756
    1  0  0.756  0.756  0.756  0.756
    0  1  0.756  0.756  0.756  0.756

For F2,

    0      0      0      1      0
    0      0      0      0      1
    0.329  0.329  0.329  0.329  0.329
    0.329  0.329  0.329  0.329  0.329
    0.329  0.329  0.329  0.329  0.329
    0.329  0.329  0.329  0.329  0.329

If we return to the superset vector, (0, 0, 1, 0, 1)^t, the initial forward propagation to F2 yields activities of (0.000, 1.000, 0.657, 0.657, 0.657, 0.657)^t, so unit 2 wins again. Going back to F1, V = (0, 0, 0, 0, 1)^t, and the equilibrium activities are (-0.25, -0.25, -0.071, -0.25, 0.051)^t. The new outputs are (0, 0, 0, 0, 1)^t. This time, we get a reset signal, since |S|/|I_2| = 0.5 < \rho. Thus, unit 2 on F2 is removed from the competition, and the matching cycle is repeated with the original input vector restored on F1.

Propagating forward a second time results in activities on F2 of (0.000, 0.000, 0.657, 0.657, 0.657, 0.657)^t, where we have forced unit 2's activity to zero as a result of the sustained inhibition from the orienting subsystem. We choose unit 3 as the winner this time, and it codes the input vector. On subsequent presentations of the subset and superset vectors, each will access the appropriate F2 unit directly, without the need for a search. This result can be verified by direct calculation with the example presented in this section.

Exercise 8.6: Verify the statement made in the previous paragraph that presentation of (0, 0, 1, 0, 1)^t and (0, 0, 0, 0, 1)^t results in immediate access to the corresponding nodes on F2, without reset, by performing the appropriate calculations.

Exercise 8.7: What does the weight matrix on F2 look like after unit 3 encodes the superset vector given in the example in this section?

Exercise 8.8: What do you expect will happen if we apply (0, 1, 1, 0, 1)^t to the example network? Note that the pattern is a superset of one already encoded. Verify your hypothesis by direct calculation.
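For readers who want to reproduce these numbers, the following is a compact sketch, in our own code rather than the authors' notation, of the asymptotic, fast-learning ART1 procedure summarized in steps 1 through 12 above. It uses the same parameter values as the worked example; indices are zero-based, so index 0 corresponds to "unit 1" in the text.

    import numpy as np

    # Our sketch of the fast-learning ART1 procedure (steps 1-12), with the same
    # parameters as the worked example: M = 5 F1 units, N = 6 F2 units.
    A1, B1, C1, D1, L, rho = 1.0, 1.5, 5.0, 0.9, 3.0, 0.9
    M, N = 5, 6
    Z_td = np.full((M, N), (B1 - 1.0) / D1 + 0.2)    # top-down weights (rows = F1 units), 0.756
    Z_bu = np.full((N, M), L / (L - 1.0 + M) - 0.1)  # bottom-up weights (rows = F2 units), 0.329

    def present(I):
        """Run one complete matching cycle for binary input I; return the winning F2 index."""
        inhibited = []
        while True:
            # Steps 1-2: F1 equilibrium activities and outputs with F2 inactive (G = 1).
            x = I / (1.0 + A1 * (I + B1) + C1)
            s = (x > 0).astype(float)
            # Step 3: bottom-up net inputs; step 4: winner-take-all, skipping reset nodes.
            T = Z_bu @ s
            T[inhibited] = -np.inf
            J = int(np.argmax(T))
            # Steps 5-7: top-down template, new F1 activities (G = 0), new outputs.
            V = Z_td[:, J]
            x = (I + D1 * V - B1) / (1.0 + A1 * (I + D1 * V) + C1)
            s = (x > 0).astype(float)
            # Steps 8-9: vigilance test.
            if s.sum() >= rho * I.sum():
                # Steps 10-11: fast learning on the winning node only.
                Z_bu[J] = np.where(s > 0, L / (L - 1.0 + s.sum()), 0.0)
                Z_td[:, J] = np.where(s > 0, 1.0, 0.0)
                return J
            inhibited.append(J)          # reset: inhibit J for the rest of this cycle

    print(present(np.array([0., 0., 0., 1., 0.])))   # -> 0 (first node encodes I1)
    print(present(np.array([0., 0., 1., 0., 1.])))   # -> 1 (second node encodes I2)
    print(present(np.array([0., 0., 0., 0., 1.])))   # -> 1 (node 1 is recoded by the subset)
    print(present(np.array([0., 0., 1., 0., 1.])))   # -> 2 (reset forces a previously unused node)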


8.3 ART2

On the surface, ART2 differs from ART1 only in the nature of the input patterns: ART2 accepts analog (or gray-scale) vector components as well as binary components. This capability represents a significant enhancement to the system. Beyond the surface difference between ART1 and ART2 lie architectural differences that give ART2 its ability to deal with analog patterns. These differences are sometimes more complex, and sometimes less complex, than the corresponding ART1 structures.

Aside from the obvious fact that binary and analog patterns differ in the nature of their respective components, ART2 must deal with additional complications. For example, ART2 must be able to recognize the underlying similarity of identical patterns superimposed on constant backgrounds having different levels. Compared in an absolute sense, two such patterns may appear entirely different when, in fact, they should be classified as the same pattern.

The price for this additional capability is primarily an increase in complexity on the F1 processing level. The ART2 F1 level consists of several sublevels and several gain-control systems. Processing on F2 is the same for ART2 as it was for ART1. As partial compensation for the added complexity on the F1 layer, the LTM equations are a bit simpler for ART2 than they were for ART1.

The developers of the architecture, Carpenter and Grossberg, have experimented with several variations of ART2. At the time of this writing, that work is continuing. The architecture we shall describe here is one of several variations reported by Carpenter and Grossberg [2].

8.3.1 ART2 Architecture

As we mentioned in the introduction to this section, ART2 bears a superficial resemblance to ART1. Both have an attentional subsystem and an orienting subsystem. The attentional subsystem of each architecture consists of two layers of processing elements, F1 and F2, and a gain-control system. The orienting subsystem of each network performs the identical function. Moreover, the basic differential equations that govern the activities of the individual processing elements are the same in both cases. To deal successfully with analog patterns in ART2, Carpenter and Grossberg have had to split the F1 layer into a number of sublayers containing both feedforward and feedback connections. Figure 8.6 shows the resulting structure.

8.3.2 Processing on F1

The activity of each unit on each sublayer of F1 is governed by an equation of the form

    \epsilon \dot{x}_k = -A x_k + (1 - B x_k) J_k^+ - (C + D x_k) J_k^-        (8.31)

where A, B, C, and D are constants. Equation (8.31) is almost identical to Eq. (8.1) from the ART1 discussion in Section 8.2.1. The only difference is


the appearance of the multiplicative factor in the first term on the right-hand side of Eq. (8.31). For the ART2 model presented here, we shall set B and C identically equal to zero. As with ART1, J_k^+ and J_k^- represent net excitatory and inhibitory factors, respectively. Likewise, we shall be interested in only the asymptotic solution, so

    x_k = \frac{J_k^+}{A + D J_k^-}        (8.32)

Figure 8.6 The overall structure of the ART2 network is the same as that of ART1. The F1 layer has been divided into six sublayers, w, x, u, v, p, and q. Each node labeled G is a gain-control unit that sends a nonspecific inhibitory signal to each unit on the layer it feeds. All sublayers on F1, as well as the r layer of the orienting subsystem, have the same number of units. Individual sublayers on F1 are connected unit to unit; that is, the layers are not fully interconnected, with the exception of the bottom-up connections to F2 and the top-down connections from F2.


The values of the individual quantities in Eq. (8.32) vary according to the sublayer being considered. For convenience, we have assembled Table 8.1, which shows all of the appropriate quantities for each F1 sublayer, as well as for the r layer of the orienting subsystem.

Table 8.1 Factors in Eq. (8.32) for each F1 sublayer and the r layer. I_i is the ith component of the input vector. The parameters a, b, c, and e are constants whose values will be discussed in the text. y_j is the activity of the jth unit on the F2 layer, and g(y) is the output function on F2. The function f(x) is described in the text.

    Layer   A   D   J+                          J-
    w       1   1   I_i + a u_i                 0
    x       e   1   w_i                         ||w||
    v       1   1   f(x_i) + b f(q_i)           0
    u       e   1   v_i                         ||v||
    p       1   1   u_i + sum_j g(y_j) z_ij     0
    q       e   1   p_i                         ||p||
    r       e   1   u_i + c p_i                 ||u|| + c ||p||

Based on the table, the activities on each of the six sublayers of F1 can be summarized by the following equations:

    w_i = I_i + a u_i        (8.33)

    x_i = \frac{w_i}{e + \|\mathbf{w}\|}        (8.34)

    v_i = f(x_i) + b f(q_i)        (8.35)

    u_i = \frac{v_i}{e + \|\mathbf{v}\|}        (8.36)

    p_i = u_i + \sum_j g(y_j) z_{ij}        (8.37)

    q_i = \frac{p_i}{e + \|\mathbf{p}\|}        (8.38)

We shall discuss the r layer of the orienting subsystem shortly. The parameter e is typically set to a positive number considerably less than 1. It has the effect


of keeping the activations finite when no input is present in the system. We do not require the presence of e for this discussion, so we shall set e = 0 for the remainder of the chapter.

The three gain-control units in F1 nonspecifically inhibit the x, u, and q sublayers. The inhibitory signal is equal to the magnitude of the input vector to those layers. The effect is that the activities of these three layers are normalized to unity by the gain-control signals. This method is an alternative to the on-center off-surround interaction scheme presented in Chapter 6 for normalizing activities.

The form of the function, f(x), determines the nature of the contrast enhancement that takes place on F1 (see Chapter 6). A sigmoid might be the logical choice for this function, but we shall stay with Carpenter's choice of

    f(x) = x    if x ≥ θ
    f(x) = 0    if 0 ≤ x < θ                 (8.39)

where θ is a positive constant less than one. We shall use θ = 0.2 in our subsequent examples.

It will be easier to see what happens on F1 during the processing of an input vector if we actually carry through a couple of examples, as we did with ART1. We shall set up a five-unit F1 layer. The constants are chosen as follows: a = 10; b = 10; c = 0.1. The first input vector is

    I_1 = (0.2, 0.7, 0.1, 0.5, 0.4)^t

We propagate this vector through the sublayers in the order of the equations given. As there is currently no feedback from u, w becomes a copy of the input vector:

    w = (0.2, 0.7, 0.1, 0.5, 0.4)^t

x is a normalized version of the same vector:

    x = (0.205, 0.718, 0.103, 0.513, 0.410)^t

In the absence of feedback from q, v is equal to f(x):

    v = (0.205, 0.718, 0, 0.513, 0.410)^t

Note that the third component is now zero, since its value fell below the threshold, θ. Because F2 is currently inactive, there is no top-down signal to F1. In that case, all the remaining three sublayers on F1 become copies of v:

    u = (0.205, 0.718, 0, 0.513, 0.410)^t
    p = (0.205, 0.718, 0, 0.513, 0.410)^t
    q = (0.205, 0.718, 0, 0.513, 0.410)^t
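This pass through the sublayers is easy to express in code. The following Python fragment is our own minimal sketch, not part of the text, of Eqs. (8.33) through (8.39) with e = 0; the names f1_pass and normalize are invented for the illustration, and two passes are made, in keeping with the discussion that follows.

import numpy as np

def f(x, theta=0.2):
    """Contrast-enhancement function of Eq. (8.39)."""
    return np.where(x >= theta, x, 0.0)

def normalize(v):
    """Divide a vector by its magnitude, as in Eqs. (8.34), (8.36), and (8.38) with e = 0."""
    return v / np.linalg.norm(v)

def f1_pass(I, u, q, top_down=0.0, a=10.0, b=10.0):
    """One pass through the w, x, v, u, p, q sublayers (Eqs. 8.33-8.38)."""
    w = I + a * u                  # Eq. (8.33)
    x = normalize(w)               # Eq. (8.34)
    v = f(x) + b * f(q)            # Eq. (8.35)
    u = normalize(v)               # Eq. (8.36)
    p = u + top_down               # Eq. (8.37); the top-down term is zero while F2 is inactive
    q = normalize(p)               # Eq. (8.38)
    return w, x, v, u, p, q

I1 = np.array([0.2, 0.7, 0.1, 0.5, 0.4])
u = q = np.zeros(5)
for _ in range(2):                 # two passes are generally adequate to stabilize F1
    w, x, v, u, p, q = f1_pass(I1, u, q)
print(np.round(p, 3))              # approximately (0.206, 0.722, 0.0, 0.516, 0.413)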


We cannot stop here, however, as both u and q are now nonzero. Beginning again at w, we find:

    w = (2.263, 7.920, 0.100, 5.657, 4.526)^t
    x = (0.206, 0.722, 0.009, 0.516, 0.413)^t
    v = (2.269, 7.942, 0.000, 5.673, 4.538)^t

where v now has contributions from the current x vector and the u vector from the previous time step. As before, the remaining three layers will be identical:

    u = (0.206, 0.723, 0.000, 0.516, 0.413)^t
    p = (0.206, 0.723, 0.000, 0.516, 0.413)^t
    q = (0.206, 0.723, 0.000, 0.516, 0.413)^t

Now we can stop, because further iterations through the sublayers will not change the results. Two iterations are generally adequate to stabilize the outputs of the units on the sublayers.

During the first iteration through F1, we assumed that there was no top-down signal from F2 that would contribute to the activation on the p sublayer of F1. This assumption may not hold for the second iteration. We shall see later, from our study of the orienting subsystem, that, by initializing the top-down weights to zero, z_ij(0) = 0, we prevent reset during the initial encoding by a new F2 unit. We shall assume that we are considering such a case in this example, so that the net input from any top-down connections sums to zero.

As a second example, we shall look at an input pattern that is a simple multiple of the first input pattern, namely,

    I_2 = (0.8, 2.8, 0.4, 2.0, 1.6)^t

which is each element of I_1 times four. Calculating through the F1 sublayers results in

    w = (0.800, 2.800, 0.400, 2.000, 1.600)^t
    x = (0.205, 0.718, 0.103, 0.513, 0.410)^t
    v = (0.205, 0.718, 0.000, 0.513, 0.410)^t
    u = (0.206, 0.722, 0.000, 0.516, 0.413)^t
    p = (0.206, 0.722, 0.000, 0.516, 0.413)^t
    q = (0.206, 0.722, 0.000, 0.516, 0.413)^t

The second time through gives

    w = (2.863, 10.020, 0.400, 7.160, 5.726)^t
    x = (0.206, 0.722, 0.029, 0.515, 0.412)^t
    v = (2.269, 7.942, 0.000, 5.672, 4.538)^t
    u = (0.206, 0.722, 0.000, 0.516, 0.413)^t
    p = (0.206, 0.722, 0.000, 0.516, 0.413)^t
    q = (0.206, 0.722, 0.000, 0.516, 0.413)^t
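Continuing the sketch given previously, we can check this scale invariance directly; the snippet below reuses the hypothetical f1_pass function and the I1 vector defined there.

I2 = 4 * I1                        # the same pattern at four times the amplitude
u = q = np.zeros(5)
for _ in range(2):
    w, x, v, u, p, q = f1_pass(I2, u, q)
print(np.round(p, 3))              # approximately (0.206, 0.722, 0.0, 0.516, 0.413), as for I1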


Notice that, after the v layer, the results are identical to the first example. Thus, it appears that ART2 treats patterns that are simple multiples of each other as belonging to the same class. For analog patterns, this would appear to be a useful feature. Patterns that differ only in amplitude probably should be classified together.

We can conclude from our analysis that F1 performs a straightforward normalization and contrast-enhancement function before pattern matching is attempted. To see what happens during the matching process itself, we must consider the details of the remainder of the system.

8.3.3 Processing on F2

Processing on F2 of ART2 is identical to that performed on ART1. Bottom-up inputs are calculated as in ART1:

    T_j = Σ_i p_i z_ji                       (8.40)

Competition on F2 results in contrast enhancement where a single winning node is chosen, again in keeping with ART1. The output function of F2 is given by

    g(y_j) = d    if T_j = max_k {T_k}
    g(y_j) = 0    otherwise                  (8.41)

This equation presumes that the set {T_k} includes only those nodes that have not been reset recently by the orienting subsystem.

We can now rewrite the equation for processing on the p sublayer of F1 as (see Eq. 8.37)

    p_i = u_i             if F2 is inactive
    p_i = u_i + d z_iJ    if the Jth node on F2 is active        (8.42)

8.3.4 LTM Equations

The LTM equations on ART2 are significantly less complex than are those on ART1. Both bottom-up and top-down equations have the same form:

    ż_ji = g(y_j) (p_i - z_ji)               (8.43)

for the bottom-up weights from v_i on F1 to v_j on F2, and

    ż_ij = g(y_j) (p_i - z_ij)               (8.44)

for top-down weights from v_j on F2 to v_i on F1. If v_J is the winning F2 node, then we can use Eq. (8.42) in Eqs. (8.43) and (8.44) to show that

    ż_Ji = d (u_i + d z_iJ - z_Ji)


and similarly

    ż_iJ = d (u_i + d z_iJ - z_iJ)

with all other ż_ij = ż_ji = 0 for j ≠ J. We shall be interested in the fast-learning case, so we can solve for the equilibrium values of the weights:

    z_Ji = z_iJ = u_i / (1 - d)              (8.45)

where we assume that 0 < d < 1.

We shall postpone the discussion of initial values for the weights until after the discussion of the orienting subsystem.

8.3.5 ART2 Orienting Subsystem

From Table 8.1 and Eq. (8.32), we can construct the equation for the activities of the nodes on the r layer of the orienting subsystem:

    r_i = (u_i + c p_i) / (||u|| + ||cp||)   (8.46)

where we once again have assumed that e = 0. The condition for reset is

    ρ / (e + ||r||) > 1                      (8.47)

where ρ is the vigilance parameter, as in ART1.

Notice that two F1 sublayers, p and u, participate in the matching process. As top-down weights change on the p layer during learning, the activity of the units on the p layer also changes. The u layer remains stable during this process, so including it in the matching process prevents reset from occurring while learning of a new pattern is taking place.

We can rewrite Eq. (8.46) in vector form as

    r = (u + cp) / (||u|| + ||cp||)

Then, from ||r|| = (r · r)^{1/2}, we can write

    ||r|| = [1 + 2||cp|| cos(u, p) + ||cp||^2]^{1/2} / (1 + ||cp||)        (8.48)

where cos(u, p) is the cosine of the angle between u and p. First, note that, if u and p are parallel, then Eq. (8.48) reduces to ||r|| = 1, and there will be no reset. As long as there is no output from F2, Eq. (8.37) shows that u = p, and there will be no reset in this case.

Suppose now that F2 does have an output from some winning unit, and that the input pattern needs to be learned, or encoded, by the F2 unit. We also do


not want a reset in this case. From Eq. (8.37), we see that p = u + d z_J, where the Jth unit on F2 is the winner and z_J = (z_1J, z_2J, ..., z_MJ)^t. If we initialize all the top-down weights, z_iJ, to zero, then the initial output from F2 will have no effect on the value of p; that is, p will remain equal to u.

During the learning process itself, z_J becomes parallel to u according to Eq. (8.45). Thus, p also becomes parallel to u, and again ||r|| = 1 and there is no reset.

As with ART1, a sufficient mismatch between the bottom-up input vector and the top-down template results in a reset. In ART2, the bottom-up pattern is taken at the u sublevel of F1, and the top-down template is taken at p.

Before returning to our numerical example, we must finish the discussion of weight initialization. We have already seen that top-down weights must be initialized to zero. Bottom-up weight initialization is the subject of the next section.

8.3.6 Bottom-Up LTM Initialization

We have been discussing the modification of LTM traces, or weights, in the case of fast-learning. Let's examine the dynamic behavior of the bottom-up weights during a learning trial. Assume that a particular F2 node has previously encoded an input vector such that z_Ji = u_i/(1 - d), and, therefore, ||z_J|| = ||u||/(1 - d) = 1/(1 - d), where z_J is the vector of bottom-up weights on the Jth F2 node. Suppose the same node wins for a slightly different input pattern, one for which the degree of mismatch is not sufficient to cause a reset. Then, the bottom-up weights will be recoded to match the new input vector. During this dynamic recoding process, ||z_J|| can decrease before returning to the value 1/(1 - d). During this decreasing period, ||r|| will also be decreasing. If other nodes have had their weight values initialized such that ||z_j(0)|| > 1/(1 - d), then the network might switch winners in the middle of the learning trial.

We must, therefore, initialize the bottom-up weight vectors such that

    ||z_j(0)|| ≤ 1 / (1 - d)

We can accomplish such an initialization by setting the weights to small random numbers. Alternatively, we could use the initialization

    z_ji(0) ≤ 1 / ((1 - d) √M)               (8.49)

This latter scheme has the appeal of a uniform initialization. Moreover, if we use the equality, then the initial values are as large as possible. Making the initial values as large as possible biases the network toward uncommitted nodes. Even if the vigilance parameter is too low to cause a reset otherwise, the network will choose an uncommitted node over a badly mismatched node. This mechanism helps stabilize the network against constant recoding.
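Pulling the pieces of the last few sections together, the following Python fragment is our own sketch, not code from the text, of a single bottom-up/top-down step: the F2 net inputs of Eq. (8.40), the winner-take-all competition, the p sublayer of Eq. (8.42), the reset test of Eqs. (8.46) and (8.47) with e = 0, and the fast-learning update of Eq. (8.45). The variable names are invented; z_bu and z_td hold the bottom-up and top-down weights.

import numpy as np

def art2_step(u, z_bu, z_td, d=0.9, c=0.1, rho=0.9):
    """One ART2 search step, assuming u is the stabilized F1 u-sublayer vector."""
    T = z_bu @ u                              # Eq. (8.40), with p = u before F2 fires
    J = int(np.argmax(T))                     # winner-take-all competition on F2
    p = u + d * z_td[:, J]                    # Eq. (8.42)
    r = (u + c * p) / (np.linalg.norm(u) + np.linalg.norm(c * p))   # Eq. (8.46)
    if rho / np.linalg.norm(r) > 1.0:         # Eq. (8.47): sufficient mismatch, so reset
        return J, False
    z_bu[J, :] = u / (1.0 - d)                # Eq. (8.45): fast-learning equilibrium weights
    z_td[:, J] = u / (1.0 - d)
    return J, True

In a complete simulator, a reset would mark node J as ineligible and the step would be repeated, as described in the processing summary that follows.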


Similar arguments lead to a constraint on the parameters c and d; namely,

    c d / (1 - d) ≤ 1

Collecting the parameter restrictions discussed so far, we have

    0 < d < 1
    c d / (1 - d) ≤ 1
    0 < θ < 1
    0 < ρ < 1
    0 ≤ e ≪ 1

8.3.7 ART2 Processing Summary

In this section, we summarize the steps required to process a pattern on the ART2 network, in the order in which they would be performed during a simulation.

1. Initialize all layer and sublayer outputs to zero vectors, and establish a cycle counter initialized to a value of one. Apply an input pattern, I.

2. Propagate to the w sublayer.

    w_i = I_i + a u_i

3. Propagate forward to the x sublayer.

    x_i = w_i / (e + ||w||)


4. Propagate forward to the v sublayer.

    v_i = f(x_i) + b f(q_i)

Note that the second term is zero on the first pass through, as q is zero at that time.

5. Propagate to the u sublayer.

    u_i = v_i / (e + ||v||)

6. Propagate forward to the p sublayer.

    p_i = u_i + d z_iJ

where the Jth node on F2 is the winner of the competition on that layer. If F2 is inactive, p_i = u_i. Similarly, if the network is still in its initial configuration, p_i = u_i because z_iJ(0) = 0.

7. Propagate to the q sublayer.

    q_i = p_i / (e + ||p||)

8. Repeat steps 2 through 7 as necessary to stabilize the values on F1.

9. Calculate the output of the r layer.

    r_i = (u_i + c p_i) / (e + ||u|| + ||cp||)

10. Determine whether a reset condition is indicated. If ρ/(e + ||r||) > 1, then send a reset signal to F2. Mark any active F2 node as ineligible for competition, reset the cycle counter to one, and return to step 2. If there is no reset, and the cycle counter is one, increment the cycle counter and continue with step 11. If there is no reset, and the cycle counter is greater than one, then skip to step 14, as resonance has been established.

11. Propagate the output of the p sublayer to the F2 layer. Calculate the net inputs to F2.

    T_j = Σ_i p_i z_ji

12. Only the winning F2 node has nonzero output.

    g(y_j) = d    if T_j = max_k {T_k}
    g(y_j) = 0    otherwise

Any nodes marked as ineligible by previous reset signals do not participate in the competition.


13. Repeat steps 6 through 10.

14. Modify the bottom-up weights on the winning F2 unit.

    z_Ji = u_i / (1 - d)

15. Modify the top-down weights coming from the winning F2 unit.

    z_iJ = u_i / (1 - d)

16. Remove the input vector. Restore all inactive F2 units. Return to step 1 with a new input pattern.

8.3.8 ART2 Processing Example

We shall be using the same parameters and input vector for this example that we used in Section 8.3.2. For that reason, we shall begin with the propagation of the p vector up to F2. Before showing the results of that calculation, we shall summarize the network parameters and show the initialized weights.

We established the following parameters earlier: a = 10; b = 10; c = 0.1; θ = 0.2. To that list we add the additional parameter, d = 0.9. We shall use N = 6 units on the F2 layer.

The top-down weights are all initialized to zero, so z_ij(0) = 0, as discussed in Section 8.3.5. The bottom-up weights are initialized according to Eq. (8.49): z_ji(0) = 0.5/((1 - d)√M) = 2.236, since M = 5.

Using I = (0.2, 0.7, 0.1, 0.5, 0.4)^t as the input vector, before propagation to F2 we have p = (0.206, 0.722, 0, 0.516, 0.413)^t. Propagating this vector forward to F2 yields a vector of activities across the F2 units of

    T = (4.151, 4.151, 4.151, 4.151, 4.151, 4.151)^t

Because all of the activities are the same, the first unit becomes the winner and the activity vector becomes

    T = (4.151, 0, 0, 0, 0, 0)^t

and the output of the F2 layer is the vector (0.9, 0, 0, 0, 0, 0)^t.

We now propagate this output vector back to F1 and cycle through the layers again. Since the top-down weights are all initialized to zero, there is no change on the sublayers of F1. We showed earlier that this condition will not result in a reset from the orienting subsystem; in other words, we have reached a resonant state. The weight vectors will now update according to the appropriate equations given previously. We find that the bottom-up weight matrix is

    ( 2.063  7.220  0.000  5.157  4.126 )
    ( 2.236  2.236  2.236  2.236  2.236 )
    ( 2.236  2.236  2.236  2.236  2.236 )
    ( 2.236  2.236  2.236  2.236  2.236 )
    ( 2.236  2.236  2.236  2.236  2.236 )
    ( 2.236  2.236  2.236  2.236  2.236 )


and the top-down matrix is

    ( 2.06284  0  0  0  0  0 )
    ( 7.21995  0  0  0  0  0 )
    ( 0.00000  0  0  0  0  0 )
    ( 5.15711  0  0  0  0  0 )
    ( 4.12568  0  0  0  0  0 )

Notice the expected similarity between the first row of the bottom-up matrix and the first column of the top-down matrix.

We shall not continue this example further. You are encouraged to build an ART2 simulator and experiment on your own.

8.4 THE ART1 SIMULATOR

In this section, we shall present the design for the ART network simulator. For clarity, we will focus on only the ART1 network in our discussion. The development of the ART2 simulator is left to you as an exercise. However, due to the similarities between the two networks, much of the material presented in this section will be applicable to the ART2 simulator. As in previous chapters, we begin this section with the development of the data structures needed to implement the simulator, and proceed to describe the pertinent algorithms. We conclude this section with a discussion of how the simulator might be adapted to implement the ART2 network.

8.4.1 ART1 Data Structures

The ART1 network is very much like the BAM network described in Chapter 4 of this text. Both networks process only binary input vectors. Both networks use connections that are initialized by performance of a calculation based on parameters unique to the network, rather than a random distribution of values. Also, both networks have two layers of processing elements that are completely interconnected between layers (the ART network augments the layers with the gain-control and reset units).

However, unlike in the BAM, the connections between layers in the ART network are not bidirectional. Rather, the network units here are interconnected by means of two sets of unidirectional connections. As shown in Figure 8.7, one set ties all the outputs of the elements on layer F1 to all the inputs on F2, and the other set connects all F2 unit outputs to inputs on layer F1. Thus, for reasons completely different from those used to justify the BAM data structures, it turns out that the interconnection scheme used to model the BAM is identical to the scheme needed to model the ART1 network.

As we saw in the case of the BAM, the data structures needed to implement this view of network processing fit nicely with the processing model provided by the generic simulator described in Chapter 1. To understand why this is so, recall the discussion in Section 4.5.2 where, in the case of the BAM, we claimed it was desirable to split the bidirectional connections between layers into two sets of


Figure 8.7    The diagram shows the interconnection strategy needed to simulate the ART1 network. Notice that only the connections between units on the F1 and F2 layers are needed. The host computer can perform the function of the gain-control and reset units directly, thus eliminating the need to model these structures in the simulator.

unidirectional connections, and to process each individually. By organizing the network data structures in this manner, we were able to simplify the calculations performed at each network unit, in that the computer had only input values to process. In the case of the BAM, splitting the connections was done to improve performance at the expense of additional memory consumption. We can now see that there was another benefit to organizing the BAM simulator as we did: The data structures used to model the modified BAM network can be ported directly to the ART1 simulator.

By using the interconnection data structures developed for the BAM as the basis of the ART1 network, we eliminate the need to develop a new set of data structures, and now need only to define the top-level network structure used to tie all the ART1-specific parameters together. To do this, we simply construct a record containing the pointers to the appropriate layer structures and the learning parameters unique to the ART1 network. A good candidate structure is given by the following declaration:

record ART1 =            {the network declaration}
begin
  F1 : ^layer;           {locate F1 layer structure}
  F2 : ^layer;           {locate F2 layer structure}
  A1 : float;            {A parameter for layer F1}
  B1 : float;            {B parameter for layer F1}
  C1 : float;            {C parameter for layer F1}
  D1 : float;            {D parameter for layer F1}
  L  : float;            {L parameter for network}


  rho  : float;          {vigilance parameter}
  F2W  : integer;        {index of winner on F2 layer}
  INH  : ^float[];       {F2 inhibited vector}
  magX : float;          {magnitude of vector on F1}
end record;

where A, B, C, D, and L are network parameters as described in Section 8.2. You should also note that we have incorporated three items in the network structure that will be used to simplify the simulation process. These values (F2W, INH, and magX) are used to provide immediate access to the winning unit on F2, to implement the inhibition mechanism from the attentional subsystem, and to store the computed magnitude of the template on layer F1, respectively. Furthermore, we have not specified the dimension of the INH array directly, so you should be aware that we assume that this array contains as many values as there are units on layer F2. We will use the INH array to selectively eliminate the input stimulation to each F2 layer unit, thus performing the reset function. We will elaborate on the use of this array in the following section.

As illustrated in Figure 8.8, this structure for the ART1 network provides us with access to all the network-specific data that we will require to complete our simulator. We shall now proceed to the development of the algorithms necessary to simulate the ART1 network.

8.4.2 ART1 Algorithms

As discussed in Section 8.2, it is desirable to simplify (as much as possible) the calculation of the unit activity within the network during digital simulation. For that reason, we will restrict our discussion of the ART1 algorithms to the asymptotic solution for the dynamic equations, and will implement the fast-learning case for the network weights.

Further, to clarify the implementation of the simulator, we will focus on the processing described in Section 8.2.3, and will use the data provided in that example as the basis for the algorithm design provided here. If you have not done so already, please review Section 8.2.3.

We begin by presuming that the network simulator has been constructed in memory and initialized according to the example data. We can define the algorithm necessary to perform the processing of the input vector on layer F1 as follows:

procedure prop_to_F1 (net:ART1; invec:^float[]);
{compute outputs for layer F1 for a given input vector}

var i    : integer;      {iteration counter}
    unit : ^float[];     {pointer to unit outputs}

begin
  unit = net.F1^.OUTS;   {locate unit outputs}

  for i = 1 to length(unit)    {for all F1 units}


  do
    unit[i] = invec[i] /
              (1 + net.A1 * (invec[i] + net.B1) + net.C1);

    if (unit[i] > 0)     {convert activation to output}
    then unit[i] = 1
    else unit[i] = 0;
    end if;
  end do;
end procedure;

Figure 8.8    The complete data structure for the ART1 simulator is shown. Notice that we have added an additional array to contain the INHibit data that will be used to suppress invalid pattern matches on the F2 layer. Compare this diagram with the declaration in the text for the ART1 record, and be sure you understand how this model implements the interconnection scheme for the ART1 network.
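For readers who prefer to work in a higher-level language, the same top-level structure can be mirrored directly. The following Python fragment is our own rough counterpart of the ART1 record; it is only an illustration, the class and field names are invented, and the caller is responsible for supplying the arrays and parameter values.

from dataclasses import dataclass
import numpy as np

@dataclass
class ART1Net:
    """Illustrative Python counterpart of the ART1 record."""
    f1_outs: np.ndarray      # outputs of the F1 layer units
    f2_outs: np.ndarray      # outputs of the F2 layer units
    f1_weights: np.ndarray   # top-down connections into F1 (one row per F1 unit)
    f2_weights: np.ndarray   # bottom-up connections into F2 (one row per F2 unit)
    A1: float                # A, B, C, and D parameters for layer F1
    B1: float
    C1: float
    D1: float
    L: float                 # L parameter for the network
    rho: float               # vigilance parameter
    F2W: int                 # index of winner on the F2 layer
    INH: np.ndarray          # F2 inhibit vector: 1 = eligible, 0 = inhibited
    magX: float              # magnitude of the template on F1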


Notice that the computation for the output of each unit on F1 requires no modulating connection weights. This calculation is consistent with the processing model for the ART1 network, but it is also of benefit since we must use the input connection arrays to each unit on F1 to hold the values associated with the connections from layer F2. This makes the simulation process efficient, in that we can model two different kinds of connections (the inputs from the external world, and the top-down connections from F2) in the memory space required for one set of connections (the standard input connections for a unit on a layer).

The next step in the simulation process is to propagate the signals from the F1 layer to the F2 layer. This signal propagation is the familiar sum-of-products operation, and each unit in the F2 layer will generate a nonzero output only if it had the highest activation level on the layer. For the ART1 simulation, however, we must also consider the effect of the inhibit signal to each unit on F2 from the attentional subsystem. We assume this inhibition status is represented by the values in the INH array, as initialized by a reader-provided routine to be discussed later, and further modified by network operation. We will use the values {0, 1} to represent the inhibition status for the network, with a zero indicating the F2 unit is inhibited, and a one indicating the unit is actively participating in the competition. Furthermore, as in the discussion of the counterpropagation-network simulator, we will find it desirable to know, after the signal propagation to the competitive layer has completed, which unit won the competition, so that it may be quickly accessed again during later processing. To accomplish all of these operations, we can define the algorithm for the signal propagation to all units on layer F2 as follows:

procedure prop_to_F2 (net:ART1);
{propagate signals from layer F1 to F2}

var i, j     : integer;      {iteration counters}
    unit     : ^float[];     {pointer to F2 unit outputs}
    inputs   : ^float[];     {pointer to F1 unit outputs}
    connects : ^float[];     {pointer to unit connections}
    largest  : float;        {largest activation}
    winner   : integer;      {index to winner}
    sum      : float;        {accumulator}

begin
  unit = net.F2^.OUTS;       {locate F2 output array}
  inputs = net.F1^.OUTS;     {locate F1 output array}
  largest = -100;            {initial largest activation}

  for i = 1 to length(unit)  {for all F2 units}
  do
    unit[i] = 0;             {deactivate unit output}
  end do;

  for i = 1 to length(unit)  {for all F2 units}
  do


    sum = 0;                            {reset accumulator}
    connects = net.F2^.WEIGHTS[i];      {locate connection array}

    for j = 1 to length(inputs)         {for all inputs to unit}
    do                                  {compute activation}
      sum = sum + inputs[j] * connects[j];
    end do;

    sum = sum * net.INH[i];             {inhibit if necessary}

    if (sum > largest)                  {if current winner}
    then
      winner = i;                       {remember this unit}
      largest = sum;                    {mark largest activation}
    end if;
  end do;

  unit[winner] = 1;                     {mark winner}
  net.F2W = winner;                     {remember winner}
end procedure;

Now we have to propagate from the winning unit on F2 back to all the units on F1. In theory, we perform this step by computing the inner product between the connection-weight vector and the vector formed by the outputs from all the units on F2. For our digital simulation, however, we can reduce the amount of time needed to perform this propagation by limiting the calculation to only those connections between the units on F1 and the single winning unit on F2. Further, since the output of the winning unit on F2 was set to one, we can again improve performance by eliminating the multiplication and using the connection weight directly. This new input from F2 is then used to calculate a new output value for the F1 units. The sequence of operations just described is captured in the following algorithm.

procedure prop_back_to_F1 (net:ART1; invec:^float[]);
{propagate signals from F2 winner back to F1 layer}

var i        : integer;      {iteration counter}
    winner   : integer;      {index of winning F2 unit}
    unit     : ^float[];     {locate F1 units}
    connects : ^float[];     {locate connections}
    X        : float;        {new input activation}
    Vi       : float;        {connection weight}

begin
  unit = net.F1^.OUTS;       {locate beginning of F1 outputs}
  winner = net.F2W;          {get index of winning unit}


  for i = 1 to length(unit)           {for all F1 units}
  do
    connects = net.F1^.WEIGHTS;       {locate connection arrays}
    Vi = connects[i]^[winner];        {get connection weight}

    X = (invec[i] + net.D1 * Vi - net.B1) /
        (1 + net.A1 * (invec[i] + net.D1 * Vi) + net.C1);

    if (X > 0)                        {is activation sufficient}
    then unit[i] = 1                  {to turn on unit output?}
    else unit[i] = 0;                 {if not, turn off}
  end do;
end procedure;

Now all that remains is to compare the output vector on F1 to the original input vector, and to update the network accordingly. Rather than trying to accomplish both of these operations in one function, we shall construct two functions (named match and update) that will determine whether a match has occurred between bottom-up and top-down patterns, and will update the network accordingly. These routines will both be constructed so that they can be called from a higher-level routine, which we call propagate. We first compute the degree to which the two vectors resemble each other. We shall accomplish this comparison as follows:

function match (net:ART1; invec:^float[]) return float;
{compare input vector to activation values on F1}

var i    : integer;      {iteration counter}
    unit : ^float[];     {locate outputs of F1 units}
    magX : float;        {the magnitude of the template}
    magI : float;        {the magnitude of the input}

begin
  unit = net.F1^.OUTS;   {access unit outputs}
  magX = 0;              {initialize magnitude}
  magI = 0;              {ditto}

  for i = 1 to length(unit)        {for all components of input}
  do
    magX = magX + unit[i];         {compute magnitude of template}
    magI = magI + invec[i];        {same for input vector}
  end do;

  net.magX = magX;                 {save magnitude for later use}
  return (magX / magI);            {return the match value}
end function;
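For binary vectors, the value returned by match is simply the fraction of active input components that survive in the F1 template. A small Python illustration of the same computation, with made-up vectors, follows.

import numpy as np

def match_ratio(template, invec):
    """|X| / |I| for binary vectors, as computed by the match function."""
    return template.sum() / invec.sum()

I = np.array([1, 0, 1, 1, 0])    # hypothetical input vector
X = np.array([1, 0, 1, 0, 0])    # hypothetical F1 template after the top-down pass
print(match_ratio(X, I))         # 0.666..., so any vigilance rho above 2/3 would force a reset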


Once resonance has been established (as indicated by the degree of the match found between the template vector and the input vector), we must update the connection weights in order to reinforce the memory of this pattern. This update is accomplished in the following manner.

procedure update (net:ART1);
{update the connection weights to remember a pattern}

var i        : integer;      {iteration counter}
    winner   : integer;      {index of winning F2 unit}
    unit     : ^float[];     {access to unit outputs}
    connects : ^float[];     {access to connection values}
    inputs   : ^float[];     {pointer to outputs of F1}

begin
  unit = net.F2^.OUTS;                  {update winning F2 unit first}
  winner = net.F2W;                     {index to winning unit}
  connects = net.F2^.WEIGHTS[winner];   {locate winner's connections}
  inputs = net.F1^.OUTS;                {locate outputs of F1 units}

  for i = 1 to length(connects)    {for all connections to F2 winner}
  do
    {update the connections to the unit according to Eq. (8.28)}
    connects[i] = (net.L / (net.L - 1 + net.magX)) * inputs[i];
  end do;

  for i = 1 to length(inputs)      {now do connections to F1}
  do
    connects = net.F1^.WEIGHTS[i];      {access connections}
    connects[winner] = inputs[i];       {update connections}
  end do;
end procedure;

You should note from inspection of the update algorithm that we have taken advantage of some characteristics of the ART1 network to enhance simulator performance in two ways:

• We update the connection weights to the winner on F2 by multiplying the computed value for each connection by the output of the F1 unit associated with the connection being updated. This operation makes use of the fact that the output from every F1 unit is always binary. Thus, connections are updated correctly regardless of whether they are connected to an active or inactive F1 unit.


• We update the top-down connections from the winning F2 unit to the units on F1 to contain the output value of the F1 unit to which they are connected. Again, this takes advantage of the binary nature of the unit outputs on F1 and allows us to eliminate a conditional test-and-branch operation in the algorithm.

With the addition of a top-level routine to tie them all together, the collection of algorithms just defined is sufficient to implement the ART1 network. We shall now complete the simulator design by presenting the implementation of the propagate routine. So that it remains consistent with our example, the top-level routine is designed to place an input vector on the network, and to perform the signal propagation according to the algorithm described in Section 8.2.3. Note that this routine uses a reader-provided routine (remove_inhibit) to set all the values in the ART1.INH array to one. This routine is necessary in order to guarantee that all F2 units participate in the signal-propagation activity for every new pattern presented to the network.

procedure propagate (net:ART1; invec:^float[]);
{perform a signal propagation with learning in the network}

var done : boolean;      {true when template found}

begin
  done = false;                    {start loop}
  remove_inhibit (net);            {enable all F2 units}

  while (not done)
  do
    prop_to_F1 (net, invec);       {update F1 layer}
    prop_to_F2 (net);              {determine F2 winner}
    prop_back_to_F1 (net, invec);  {send template back to F1}

    if (match(net, invec) < net.rho)   {if pattern does not match}
    then net.INH[net.F2W] = 0          {inhibit winner}
    else done = true;                  {else exit loop}
  end do;

  update (net);                    {reinforce template}
end procedure;

Note that the propagate algorithm does not take into account the case where all F2 units have been encoded and none of them match the current input pattern. In that event, one of two things should occur: Either the algorithm should attempt to combine two already-encoded patterns that exhibit some degree of similarity in order to free an F2 unit (difficult to implement), or the simulator should allow for growth in the number of network units. This second option can be accomplished as follows:


1. When the condition exists that requires an additional F2 unit, first allocate a new array of floats that contains enough room for all existing F2 units, plus some number of extra units.

2. Copy the current contents of the output array to the newly created array so that the existing n values occupy the first n values in the new array.

3. Change the pointer in the ART1 record structure to locate the new array as the output array for the F2 units.

4. Deallocate the old F2 output array (optional).

The design and implementation of such an algorithm is left to you as an exercise.

8.5 ART2 SIMULATION

As we discussed earlier in this chapter, the ART2 model varies from the ART1 network primarily in the implementation of the F1 layer. Rather than a single-layer structure of units, the F1 layer contains a number of sublayers that serve to remove noise, to enhance contrast, and to normalize an analog input pattern. We shall not find this structure difficult to model, as the F1 layer can be reduced to a superlayer containing many intermediate layer structures. In this case, we need only to be aware of the differences in the network structure as we implement the ART2 processing algorithms.

In addition, signals propagating through the ART2 network are primarily analog in nature, and hence must be modeled as floating-point numbers in our digital simulation. This condition creates a situation of which you must be aware when attempting to adapt the algorithms developed for the ART1 simulator to the ART2 model. Recall that, in several ART1 algorithms, we relied on the fact that network units were generating binary outputs in order to simplify processing. For example, consider the case where the input connection weights to layer F2 are being modified during learning (algorithm update). In that algorithm, we multiplied the corrected connection weight by the output of the unit from the F1 layer. We did this multiplication to ensure that the ART1 connections were updated to contain either the corrected connection value (if the F1 unit was on) or zero (if the F1 unit was off). This approach will not work in the ART2 model, because F1-layer units can now produce analog outputs.

Other than these two minor variations, the implementation of the ART2 simulator should be straightforward. Using the ART1 simulator and the ART2 discussion as a guide, we leave it as an exercise for you to develop the algorithms and data structures needed to create an ART2 simulator.
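The contrast between the two update styles can be made concrete. The fragment below is our own illustration with invented values: in ART1 the multiply-by-output trick of update works because the F1 outputs are 0 or 1, whereas in ART2 the weights must be set from the analog u-sublayer values themselves, as in Eq. (8.45).

import numpy as np

# ART1: binary F1 outputs let a single multiply zero the connections of inactive units.
L, magX = 3.0, 2.0                          # placeholder ART1 parameters
f1_out = np.array([1.0, 0.0, 1.0])          # binary F1 outputs
art1_bu = (L / (L - 1.0 + magX)) * f1_out   # inactive units automatically receive 0

# ART2: the u sublayer is analog, so there is no binary mask to exploit;
# the fast-learning weights are set directly to u / (1 - d).
d = 0.9
u = np.array([0.206, 0.722, 0.413])
art2_bu = u / (1.0 - d)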


Programming Exercises

8.1. Implement the ART1 simulator. Test it using the example data presented in Section 8.2.3. Does the simulator generate the same data values described in the example? Explain your answer.

8.2. Design and implement a function that can be incorporated in the propagate routine to account for the situation where all F2 units have been used and a new input pattern does not match any of the encoded patterns. Use the guidelines presented in the text for this algorithm. Show the new algorithm, and indicate where it should be called from inside the propagate routine.

8.3. Implement the ART2 simulator. Test it using the example data presented in Section 8.3.2. Does the simulator behave as expected? Describe the activity levels at each sublayer on F1 at different periods during the signal-propagation process.

8.4. Using the ART2 simulator constructed in Programming Exercise 8.3, describe what happens when all the inputs in a training pattern are scaled by a random noise function and are presented to the network after training. Does your ART2 network correctly classify the new input into the same category as it classifies the original pattern? How can you tell whether it does?

Suggested Readings

The most prolific writers of the neural-network community appear to be Stephen Grossberg, Gail Carpenter, and their colleagues. Starting with Grossberg's work in the 1970s, and continuing today, a steady stream of papers has evolved from Grossberg's early ideas. Many such papers have been collected into books. The two that we have found to be the most useful are Studies of Mind and Brain [10] and Neural Networks and Natural Intelligence [13]. Another collection is The Adaptive Brain, Volumes I and II [11, 12]. This two-volume compendium contains papers on the application of Grossberg's theories to models of vision, speech and language recognition and recall, cognitive self-organization, conditioning, reinforcement, motivation, attention, circadian rhythms, motor control, and even certain mental disorders such as amnesia.
Many of the papers that deal directly with the adaptive resonance networks were coauthored by Gail Carpenter [1, 5, 2, 3, 4, 6].

A highly mathematical paper by Cohen and Grossberg proved a convergence theorem regarding networks and the latter's ability to learn patterns [8]. Although important from a theoretical standpoint, this paper is recommended for only the hardy mathematician.

Applications using ART networks often combine the basic ART structures with other, related structures also developed by Grossberg and colleagues. This fact is one reason why specific application examples are missing from this chapter. Examples of these applications can be found in the papers by Carpenter


et al. [7], Kolodzy [15], and Kolodzy and van Allen [14]. An alternate method for modeling the orienting subsystem can be found in the papers by Ryan and Winter [16] and by Ryan, Winter, and Turner [17].

Bibliography

[1] Gail A. Carpenter and Stephen Grossberg. Associative learning, adaptive pattern recognition and cooperative-competitive decision making by neural networks. In H. Szu, editor, Hybrid and Optical Computing. SPIE, 1986.

[2] Gail A. Carpenter and Stephen Grossberg. ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23):4919-4930, December 1987.

[3] Gail A. Carpenter and Stephen Grossberg. ART2: Self-organization of stable category recognition codes for analog input patterns. In Maureen Caudill and Charles Butler, editors, Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. II-727-II-735, June 1987. IEEE.

[4] Gail A. Carpenter and Stephen Grossberg. Invariant pattern recognition and recall by an attentive self-organizing ART architecture in a nonstationary world. In Maureen Caudill and Charles Butler, editors, Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. II-737-II-745, June 1987. IEEE.

[5] Gail A. Carpenter and Stephen Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37:54-115, 1987.

[6] Gail A. Carpenter and Stephen Grossberg. The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 21(3):77-88, March 1988.

[7] Gail A. Carpenter, Stephen Grossberg, and Courosh Mehanian. Invariant recognition of cluttered scenes by a self-organizing ART architecture: CORT-X boundary segmentation. Neural Networks, 2(3):169-181, 1989.

[8] Michael A. Cohen and Stephen Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):815-826, September-October 1983.

[9] Stephen Grossberg. Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. In Stephen Grossberg, editor, Studies of Mind and Brain. D. Reidel Publishing, Boston, pp. 448-497, 1982.

[10] Stephen Grossberg. Studies of Mind and Brain, volume 70 of Boston Studies in the Philosophy of Science. D. Reidel Publishing Company, Boston, 1982.


[11] Stephen Grossberg, editor. The Adaptive Brain, Vol. I: Cognition, Learning, Reinforcement, and Rhythm. North Holland, Amsterdam, 1987.

[12] Stephen Grossberg, editor. The Adaptive Brain, Vol. II: Vision, Speech, Language, and Motor Control. North Holland, Amsterdam, 1987.

[13] Stephen Grossberg, editor. Neural Networks and Natural Intelligence. MIT Press, Cambridge, MA, 1988.

[14] P. Kolodzy and E. J. van Allen. Application of a boundary contour neural network to illusions and infrared imagery. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. IV-193-IV-202, June 1987. IEEE.

[15] Paul J. Kolodzy. Multidimensional machine vision using neural networks. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. II-747-II-758, June 1987. IEEE.

[16] T. W. Ryan and C. L. Winter. Variations on adaptive resonance. In Maureen Caudill and Charles Butler, editors, Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. II-767-II-775, June 1987. IEEE.

[17] T. W. Ryan, C. L. Winter, and C. J. Turner. Dynamic control of an artificial neural system: the property inheritance network. Applied Optics, 26(23):4961-4971, December 1987.


C H A P T E R  9

Spatiotemporal Pattern Classification

Many ANS architectures, such as backpropagation, adaptive resonance, and others discussed in previous chapters of this text, are applicable to the recognition of spatial information patterns: a two-dimensional, bit-mapped image of a handwritten character, for example. Input vectors presented to such a network were not necessarily time correlated in any way; if they were, that time correlation was incidental to the pattern-classification process. Individual patterns were classified on the basis of information contained within the pattern itself. The previous or subsequent pattern had no effect on the classification of the current input vector.

We presented an example in Chapter 7 where a sequence of spatial patterns could be encoded as a path across a two-dimensional layer of processing elements (see Section 7.2.1, on the neural phonetic typewriter). Nevertheless, the self-organizing map used in that example was not conditioned to respond to any particular sequence of input patterns; it just reported what the sequence was.

In this chapter, we shall describe ANS architectures that can deal directly with both the spatial and the temporal aspects of input signals. These networks encode information relating to the time correlation of spatial patterns, as well as the spatial pattern information itself. We define a spatiotemporal pattern (STP) as a time-correlated sequence of spatial patterns.

There are several application domains where STP recognition is important. One that comes to mind immediately is speech recognition, for which the STP could be the time-varying power spectrum produced by a multichannel audio spectrum analyzer. A coarse example of such an analyzer is represented by the bar-graph display of a typical graphic equalizer used in many home stereo systems. Each channel of the graphic equalizer responds to the sound


Figure 9.1    This figure is a digitized acoustic waveform of a male speaker making the utterance "zero" in ambient conditions. The sampling rate is 10 kHz. Source: Courtesy of Kamil Grajski, Loral Western Development Laboratories.

intensity within a certain frequency interval. Other applications, with characteristics similar to the speech example, include radar and sonar echoes.

Figures 9.1 and 9.2 illustrate an example of an STP from a spoken word. The waveform shown in Figure 9.1 has been converted into a sequence of power spectra in Figure 9.2. Each individual spectrum is a spatial pattern. The sequence of spatial patterns, one at each time slice, makes up the STP.

Much of the following discussion requires an understanding of the material contained in Chapter 6. If you have not already studied that material, you should do so before attempting to read this chapter.

9.1 THE FORMAL AVALANCHE

The foundation for the development of the network architectures in this chapter is the formal avalanche structure developed by Grossberg [2]. The network is shown in Figure 9.3. The structure of the network resembles the top two layers


Figure 9.2    This figure shows the spectrogram made from the waveform in Figure 9.1. The horizontal axis represents frequency, and the vertical axis represents the power level of the signal at each frequency. Time extends backward into the figure. Source: Courtesy of Kamil Grajski, Loral Western Development Laboratories.

of the CPN described in Chapter 6. This resemblance is not coincidental: Both networks use multiple outstars.¹

The CPN and the avalanche use their outstars in a slightly different manner. In the CPN, individual spatial patterns were presented to the input layer, a winning hidden-layer unit was found, and the corresponding outstar was excited. The outputs of the outstar were the previously learned mapping for the class of input pattern presented.

Instead of performing a recognition function, the avalanche demonstrates how a complex spatiotemporal pattern might be learned and recalled. An example would be the sequence of muscle contractions and extensions that result in the ability to play a particular piece on the piano. The pattern can be learned as follows (refer to Figure 9.3).

¹We hasten to add that the development of the avalanche preceded that of the CPN by many years, even though we discussed the CPN first in this text.


Figure 9.3  This figure shows Grossberg's formal avalanche structure. The network consists of multiple outstars, all sharing their output units. The network learns y(t) ≈ x(t). Once it is trained, the avalanche can replay y(t) in the proper sequence by exciting the outstars sequentially.

Assume x(t) = (x_1(t), x_2(t), ..., x_n(t))^t is the set of muscle commands required at time t, t_0 ≤ t ≤ t_1. Activate the node labeled t_0 and apply x(t_0) to be learned by the outstar's output units. The second set of commands, x(t_0 + Δt), is applied while activating the second outstar, labeled t_0 + Δt. Continue this process by activating successive outstars until all of the muscle commands have been learned in sequence. Once learned, the entire series of commands can be recalled by activation of the outstars in the proper sequence, t_0, t_0 + Δt, t_0 + 2Δt, ..., t_1, while a zero vector is applied to the x inputs.² Replay of the learned sequence can be initiated by stimulating the t_0 node. The t_0 node then stimulates the t_0 + Δt node, and so forth.

²See Section 6.1 for the details on the outstar training procedure.
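To make the learn-and-replay cycle concrete, the following Python fragment sketches the idea in its most idealized form: each outstar simply stores the command vector that is present while its node is active, and replay reads the stored vectors back in order. The command values, the number of time steps, and the store-by-copy shortcut are all illustrative assumptions; the actual outstar training equations are those of Section 6.1.

import numpy as np

# Hypothetical command sequence: n muscle commands at each of T time steps.
T, n = 5, 3
rng = np.random.default_rng(0)
commands = rng.random((T, n))        # x(t0), x(t0 + dt), ..., x(t1)

# "Training": in this idealized limit, the outstar active at each step simply
# stores the command vector applied to the output units at that step.
outstar_weights = commands.copy()

# Replay: stimulate the t0 node; each active node excites its successor, so
# the stored vectors are read out in order while zero is applied to the x inputs.
active = 0                           # node t0 is stimulated to start replay
while active < T:
    y = outstar_weights[active]      # output units reproduce x(t0 + active*dt)
    print(f"step {active}: y =", np.round(y, 3))
    active += 1                      # this node stimulates the next node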


For our purposes it does not matter whether the formal avalanche architecture bears much resemblance to actual biological systems; the mere fact that it represents a minimal architecture capable of learning and reproducing STPs makes this network structure interesting in its own right. In the next section, we shall return to a consideration of the pattern-recognition problem. We shall show how avalanche structures can be constructed that can identify complex STPs.

Exercise 9.1: The Moonlight Sonata requires approximately 13.5 minutes to perform. Assume that a formal avalanche structure is responsible for actuating the muscles that enable someone to perform this piece on the piano. Assume that the outstars are stimulated at a rate of one every 25 milliseconds. Estimate how many outstars would be required in the avalanche. Speculate on the likelihood that the formal avalanche is actually used in the brain for this purpose.

9.2 ARCHITECTURES OF SPATIOTEMPORAL NETWORKS (STNs)

We shall anchor the following discussion in familiar waters by using the example of speech recognition as the source for our STPs.

Figure 9.4 shows a simple experimental arrangement that generates STPs from spoken words. At each instant of time, the output of the spectrum analyzer consists of a vector whose components are the powers in the various channels. For example, at t = t_1,

    P_i(t_1) = (P_i1(t_1), P_i2(t_1), ..., P_in(t_1))^t

is the power spectrum for the ith word at time t_1. The STP for each word consists of a sequence of vectors of the form P_i(t_j). If P_i is the STP for the ith word, then

    P_i = {P_i(t_1), P_i(t_2), ..., P_i(t_m)}

Differences in volume can be accounted for by normalization of the individual power spectra: ||P_i(t_j)|| = 1.

To complete the picture, we must specify the number of vectors in the set, P_i. To represent a time-varying signal accurately, we must sample the power spectrum at a frequency at least two times the bandwidth of the original signal. Although we have restricted ourselves to speech, we can accommodate other time-varying signals in a similar manner.

9.2.1 STN for Speech Recognition

We shall first construct a network that is tuned to recognize a single word. Then, we shall show how to expand the network to accommodate other words. Figure 9.5 shows the basic structure.

To tune the network for a specific word, we must first perform a spectral analysis while someone actually speaks the word. To tune it more generally, we might take the average of the spectra produced by several different speakers. If the time required to say the word is t and we must sample at a rate of ν per second, then we need m = νt spectra to represent the word, and m PEs in our network.


Figure 9.4  This diagram shows the generation of power spectra from speech. The microphone acts as a transducer producing the electrical signal, s_i(t), from the sound-pressure waves. The subscript i identifies a single word. The spectrum analyzer measures the power contained in each of n frequency bins. P_i1(t) is the power in frequency bin 1 for the ith spoken word.

Let

    P_1 = {P_1(t_1), P_1(t_2), ..., P_1(t_m)}

be the STP for the word. Assign weight vectors to the network units according to z_1 = P_1(t_1), z_2 = P_1(t_2), ..., z_m = P_1(t_m), where ||P_1(t_j)|| = 1. The dimension of the weight vectors and, therefore, the number of inputs to each processing element, is defined by the number of channels in the spectrum analyzer.

Now let's have someone repeat the word into the microphone as we apply the resultant STP to the network to verify its response. We have not as yet specified the processing performed by each unit, so we shall do that as we proceed. Let Q_1 = {Q_1(t_1), Q_1(t_2), ..., Q_1(t_m)} be the STP from the output of the spectrum analyzer. Assume that each Q_1(t_i) is normalized to one. To simplify the notation, we make the definition Q_1i = Q_1(t_i).

Each Q_1i is applied to the inputs of every PE and is allowed to remain there for a time t equal to the sampling interval for the spectra. After that time, the next input vector, Q_1,i+1, is applied, and so on. During the time that a particular input vector is present on the unit inputs, each PE dynamically adjusts its activation and output value according to the prescription that we shall now describe.
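Before turning to the unit dynamics, it may help to see the tuning step in code. The short Python sketch below normalizes a hypothetical set of training spectra and stores them as the weight vectors z_1, ..., z_m; an incoming spectrum is normalized the same way, and the dot product z_i · Q then measures how well it matches each stored template. The array sizes and random values are assumptions made only for illustration.

import numpy as np

# Hypothetical training word: m sampled power spectra, each with n channels.
m, n = 4, 8
rng = np.random.default_rng(1)
spectra = rng.random((m, n))                     # P1(t1), ..., P1(tm)

# Normalize each spectrum so that ||P1(tj)|| = 1, then store it as the
# weight vector of the corresponding PE: z1 = P1(t1), ..., zm = P1(tm).
z = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)

# An incoming spectrum is normalized the same way before it is applied to
# every PE; the dot product z_i . Q measures the match to each template.
q = rng.random(n)
q /= np.linalg.norm(q)
print("match scores:", np.round(z @ q, 3))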


Figure 9.5  The avalanche network for single-word recognition is shown. The output of each unit is connected to each successive unit, from left to right, with a connection strength of d, where 0 < d < 1.

The net input to the ith PE is the match between the current input vector and the unit's weight vector, plus the weighted sum of the outputs of all preceding units:

    I_i = Q · z_i + d (x_1 + x_2 + ... + x_{i-1})            (9.1)

The output of each PE evolves in time according to

    ẋ_i = A(-a x_i + b [I_i - Γ]^+)                          (9.2)


In Eq. (9.2), a and b are both positive constants. The function [u]^+ is defined by

    [u]^+ = u   if u > 0
            0   if u ≤ 0                                     (9.3)

Γ takes on the role of a threshold value. For now, we will consider Γ to be some predefined constant value. The function A(u) is called an attack function. It is defined by the equation

    A(u) = u    if u > 0
           cu   if u ≤ 0                                     (9.4)

where 0 < c < 1. The attack function has the effect of causing rise and fall times for the PE output values to differ from each other. The response of a PE to a finite input value exceeding the threshold is illustrated in Figure 9.6.

The output of the last unit, x_m, is defined as the network output value; that is, y = x_m. The equations for the network are defined such that the value of y(t) at any time t provides a measure of how closely the STP being presented to the network matches the one previously stored in the network.

Unlike what we assumed for typical PEs in other networks, we shall not assume here that unit outputs can be approximated by the asymptotic values. In fact, we shall rely heavily on the dynamic nature of the output values for the successful operation of the network. Let's consider the case where Q_1 = P_1, by which we mean that Q_1i = P_1(t_i) for all i. We shall analyze the network's response as we apply the input vectors of Q_1 in succession. Assume that Γ is sufficiently large (say, 0.9) such that a fairly close match between Q_1i and z_i is required before a PE produces a nonzero output.

Figure 9.6  This graph shows the output of an STN PE in response to a net-input value above threshold. The total input, I_i, exceeds the threshold value for the duration Δt. The time constant for the rise is τ_r = 1/a, but the time constant for the fall is τ_f = 1/(ac), which is longer than the rise time since c < 1. PEs are arbitrarily made to saturate at an output value of one.
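A small numerical experiment shows the rise-and-fall behavior of Figure 9.6. The Python sketch below integrates Eqs. (9.2) through (9.4) for a single PE with Euler steps; the particular values chosen for a, b, c, Γ, and the step size are illustrative assumptions, not values from the text.

import numpy as np

def pe_response(I, gamma=0.9, a=1.0, b=1.0, c=0.2, dt=0.01):
    """Integrate x' = A(-a*x + b*[I - gamma]+) for one STN processing element.

    I is a sequence of net-input values, one per time step.  The parameter
    values used here are illustrative only.
    """
    x, xs = 0.0, []
    for inp in I:
        drive = -a * x + b * max(inp - gamma, 0.0)   # -a*x + b[I - gamma]+
        xdot = drive if drive > 0 else c * drive     # attack function A(u)
        x = min(x + xdot * dt, 1.0)                  # saturate output at one
        xs.append(x)
    return xs

# A net input above threshold for a finite interval: the output rises with
# time constant about 1/a and decays more slowly, with time constant 1/(a*c).
I = [1.5] * 300 + [0.0] * 300
trace = pe_response(I)
print(round(trace[299], 3), round(trace[-1], 3))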


The first input vector, Q_11, matches z_1 exactly, so x_1 begins to rise quickly. We shall assume that all other units remain quiescent. If Q_11 remains long enough, x_1 will saturate. In any case, as soon as Q_11 is removed and Q_12 is applied, x_1 will begin to decay. At the same time, x_2 will begin to rise because Q_12 matches the weight vector, z_2. Moreover, since x_1 will not have decayed away completely, it will contribute to the positive input value to unit 2. When Q_13 is applied, both x_1 and x_2 will contribute to the input to unit 3.

This cascade continues while all input vectors are applied to the network. Since the input vectors have been applied in the proper sequence, unit input values tend to be reinforced by the outputs of preceding units. By the time the final input vector is applied, the network output, represented by y = x_m, may already have reached saturation, even though contributions from units very early in the network may have decayed away.

To illustrate the effects of a pattern mismatch, let's examine the situation where the patterns of Q_1 are applied in reverse order. Since Q_1m matches z_m, x_m will begin a quick rise toward saturation, although its output is not being reinforced by any other units in the network. When Q_1,m-1 is applied, x_{m-1} turns on and sends its output value to contribute to x_m. The total input to the mth unit is d x_{m-1}, which, because d < 1, is unlikely to overcome the threshold Γ. Therefore, x_m will continue to decay away. By the time the last input vector is applied, x_m may have decayed away entirely, indicating that the network has not recognized the STP. Note that we have assumed here that Q_1,m-1 is orthogonal to the weight vector z_m, and thus there is no contribution to the input to the mth unit from the dot product of these two vectors. This assumption is acceptable for this discussion to illustrate the concepts behind the network operation. In practice, these vectors will not necessarily be orthogonal.

Figure 9.7 shows a graphic example of unit outputs for a pattern recognition by a simple, four-unit STN. Results from applying the identical input vectors, but in a random order, are shown in Figure 9.8. Notice how the activity pattern appears to flow smoothly from left to right in the first figure.

Because of the relatively rapid rise time of the unit activity, followed by a longer decay time, the STN has the property of being somewhat insensitive to the speed at which the STP is presented. We know that different speakers pronounce the same word at slightly different rates; thus, tolerance of time variation is a desirable characteristic of STNs.
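The contrast between Figures 9.7 and 9.8 can be reproduced qualitatively with a toy simulation. The Python sketch below runs a four-unit STN on orthonormal spectra, presented first in the stored order and then in reverse; all parameter values are assumptions chosen only to make the effect visible.

import numpy as np

def run_stn(patterns, z, a=1.0, b=2.0, c=0.1, d=0.5, gamma=0.9,
            steps_per_pattern=100, dt=0.01):
    """Toy simulation of a single-layer STN (parameter values illustrative)."""
    m = len(z)
    x = np.zeros(m)
    for q in patterns:
        for _ in range(steps_per_pattern):
            for i in range(m):
                I = z[i] @ q + d * x[:i].sum()        # template match + cascade
                drive = -a * x[i] + b * max(I - gamma, 0.0)
                x[i] += (drive if drive > 0 else c * drive) * dt
                x[i] = min(x[i], 1.0)
    return x[-1]                                       # y = output of last unit

# Four orthonormal spectra serve as both the stored templates and the input STP.
z = np.eye(4)
print("in order :", round(run_stn(z, z), 3))           # in-order presentation
print("reversed :", round(run_stn(z[::-1], z), 3))     # reversed presentation

With these assumed values, the in-order run leaves the last unit with an output several times larger than the reversed run, which is the behavior described above.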
Figure 9.9 shows the results of presenting an STP at two different rates to the same network.

Recall that the network we have been describing is capable of recognizing only a single STP. To distinguish words in some specified vocabulary, we would have to replicate the network for each word that we want the system to recognize. Figure 9.10 illustrates a system that can distinguish N words.

There are many aspects of the speech-recognition problem that we have overlooked in our discussion. For example, how do we account for both small words and large words? Moreover, some words are subsets of other words; will the system distinguish between a subset word spoken slowly and a superset


Figure 9.7  This sequence (from top to bottom) shows the results from a small STN when the input patterns are presented in the order that matches the order of the weight vectors. Unit activities are represented by the bar graphs above each unit. Notice that, by the time the last input pattern is presented, the network output, represented by the output of the rightmost unit, is high, indicating that the STP is a close match to that pattern encoded by the sequence of weights in the network.


9.2 Architectures of Spatiotemporal <strong>Networks</strong> (STNs) 351*4 = yFigure 9.8In this sequence, the network is the same as that shown <strong>in</strong>Figure 9.7, as are the <strong>in</strong>put patterns. In this case, however, the<strong>in</strong>put patterns are presented to the network out of sequence.By the time the last <strong>in</strong>put pattern is presented, the networkoutput is low, <strong>in</strong>dicat<strong>in</strong>g that the network has not recognizedthe sequence.


352 Spatiotemporal Pattern Classification* 4 = yt t t}Figure 9.9This sequence shows the results from the presentation of anSTP at two different rates. The activities represented by theunshaded bar graphs are for the case where each spatial patternis presented <strong>and</strong> held for a certa<strong>in</strong> time, t, before the nextpattern is presented. The activities represented by the shadedbar graphs are for the case where each spatial pattern ispresented for a time, t/2. Note that, <strong>in</strong> both cases, the networkoutput is high.


Figure 9.10  A layered STN for word recognition is shown. Each layer is tuned to recognize a single word. Each unit in each layer receives the identical series of input vectors as a word is spoken and analyzed by the spectrum analyzer. After each word is presented, the largest y value indicates the layer that most closely matched the STP.

word spoken quickly? Different words with the same ending could result in ambiguous results. The layered architecture that we have described represents only a conceptual foundation for a realistic speech-recognition system.

Exercise 9.2: With the system described in this section, two different STPs having the same final pattern vector could result in a large output from the same layer (since the rightmost unit is stimulated last in each case). Suggest modifications to the network that could alleviate this error.


Exercise 9.3: Define a set of specific requirements that you think should be met by any realistic speech-recognition system. Estimate the amount of computational resources that would be required by such a system, based on the architecture described in this section. Provide your answer in terms of operations per second, where an operation can be an add or a multiply. List all assumptions that you made to arrive at your estimate.

If we reflect on the architecture of our proposed speech-recognition system, it may appear somewhat inefficient to assign an entire layer of PEs to a single word. In the previous paragraph, we called attention to the fact that some words are subsets of others, so there is bound to be some redundancy in our system. Even at the level of individual PEs, there is likely to be a fair amount of repetition. By taking advantage of this fact, we might be able to reduce the computational load on our system. Figure 9.11 illustrates a simple example. Eliminating redundant units, as shown in the figure, will increase the efficiency of our network but may complicate the interpretation of the outputs from individual layers. For example, in the case of a subset word, the output of one entire layer may feed into the first unit of a layer representing the superset word.

Rather than continue this piecemeal analysis, we shall simply conclude that, ultimately, the most efficient network may be one in which many (or most, or all) units are connected to many (or most, or all) other units. In the next section, we shall describe such an architecture, and shall show a learning algorithm that allows a time sequence to be learned by the network.

Figure 9.11  This figure shows how to eliminate a redundant unit in a layered STN. Unit 2 on layers 1 and 2 have the same weight vector. By reconnecting the units as shown, we can eliminate one unit from the network.
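One way to exploit the redundancy just described is to scan the stored weight vectors of all word layers for units whose templates coincide, and to share those units as in Figure 9.11. The following Python sketch shows such a scan; the word names, template values, and tolerance are hypothetical.

import numpy as np

# Hypothetical stored templates: one matrix of weight vectors per word layer.
layers = {
    "word_A": np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]),
    "word_B": np.array([[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]),
}

def redundant_pairs(layers, tol=1e-6):
    """Yield pairs of units (in different layers) with nearly identical
    weight vectors; such units are candidates for sharing."""
    items = [(name, i, w) for name, mat in layers.items()
             for i, w in enumerate(mat)]
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            (na, ia, wa), (nb, ib, wb) = items[a], items[b]
            if na != nb and np.linalg.norm(wa - wb) < tol:
                yield (na, ia), (nb, ib)

print(list(redundant_pairs(layers)))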


9.3 THE SEQUENTIAL COMPETITIVE AVALANCHE FIELD

The network in Figure 9.12 represents the logical extrapolation from eliminating redundant elements in the layered STN described in the previous section. The structure is called a sequential, competitive avalanche field (SCAF) [3]. It is illustrated here as a one-dimensional array of PEs, although it could be constructed as a two-dimensional array. In a two-dimensional configuration, the SCAF resembles Kohonen's SOM described in Chapter 7. Both the SCAF and the SOM are made up of competing units, but the mechanism for mediating that competition differs in each case.

You should note from the architecture diagram that there are no outputs corresponding to the y values in the layered architecture described in the previous section. This fact complicates the interpretation of the network response to any particular STP. We shall first examine how the network responds to an STP, then shall discuss how the weight vectors are determined.

Figure 9.12  The sequential, competitive avalanche-field architecture. The output of each unit is connected to all other units through the weight vectors w_i. The z_i are the weight vectors on the input connections. The global parameter Γ mediates the competition in the network as described in the text.


The equations that govern the response of individual processing elements on the SCAF are almost identical to those for the units on the layered STN. The input to the ith node has three components, I_i^(1), I_i^(2), and Γ, where

    I_i^(1) = z_i · Q                                        (9.5)
    I_i^(2) = w_i · x                                        (9.6)

where w_ii = 0. Γ is called the system activity threshold; its value and function shall be described shortly. The total input to a node is calculated as

    I_i,net = I_i - Γ                                        (9.7)

where

    I_i = I_i^(1) + d I_i^(2)                                (9.8)

and d represents a gain term on the contribution from other units. Typically, we shall want 0 < d < 1.

The equation for the output of the node is given by the same attack function that we used on the single-layer STN in the previous section:

    ẋ_i = A(-a x_i + b [I_i - Γ]^+)

where a and b are positive constants.

Note that the output of each unit is dependent on the current input vector, and on the recent history of the input vectors. The parameters of the attack function are adjusted such that the output quickly saturates in response to a matching input vector. Furthermore, the extended decay time of node outputs means that the network retains some knowledge of the input vectors that were presented within some recent time interval. This feature gives the network the ability to encode the time correlation of the input vectors. Instead of a pattern of activity that marches across a layer, as in the case of the layered STN, a constellation of activity develops over selected nodes in the SCAF. Figure 9.13 illustrates the response of a two-dimensional SCAF.

Once again, we shall point out the similarity between the SCAF response and the response of the SOM from Chapter 7, the primary difference being that, in the SOM, each unit fires and turns off before the next unit in the sequence fires, whereas, in the SCAF, units that have previously fired decay away slowly, so the entire pattern lingers on the network for a time.

The parameter Γ plays an important part in the network operation. Γ is not a simple constant as in the case of the layered STN, but it varies directly according to the total amount of output across all nodes of the network. Γ is therefore a variable threshold that mediates the competition among nodes. If recent input vectors find near matches, Γ rises quickly to keep the threshold high. If recent input vectors do not find matches, Γ decreases. In this way, Γ acts as an indicator of how closely incoming patterns match something the network has learned to recognize. A low value of Γ indicates that recent input pattern sequences are unknown to the network.
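The node equations can be collected into a single update step. The Python sketch below applies Eqs. (9.5) through (9.8) and the attack function to a whole SCAF layer at once; Γ is passed in as a fixed value here, since its own dynamics are given by Eq. (9.9) below, and all numeric parameter values are illustrative assumptions.

import numpy as np

def scaf_step(x, Q, Z, W, gamma, a=1.0, b=2.0, c=0.1, d=0.5, dt=0.01):
    """One Euler step of Eqs. (9.5)-(9.8) for a SCAF layer.

    Z holds the input weight vectors z_i (as rows); W holds the node-to-node
    weights w_ij with a zero diagonal.  gamma is treated as an externally
    supplied threshold in this sketch.
    """
    I1 = Z @ Q                                        # Eq. (9.5): z_i . Q
    I2 = W @ x                                        # Eq. (9.6): w_i . x
    I = I1 + d * I2                                   # Eq. (9.8)
    drive = -a * x + b * np.maximum(I - gamma, 0.0)   # b[I_net]+, Eq. (9.7)
    xdot = np.where(drive > 0, drive, c * drive)      # attack function
    return np.clip(x + xdot * dt, 0.0, 1.0)

# Two-node illustration: orthonormal templates, unit 1 feeding unit 2.
Z = np.eye(2)
W = np.array([[0.0, 0.0], [1.0, 0.0]])                # w21 = 1, w12 = 0
x = np.zeros(2)
for _ in range(200):
    x = scaf_step(x, Q=np.array([1.0, 0.0]), Z=Z, W=W, gamma=0.0)
print(np.round(x, 3))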


Figure 9.13  The response of a SCAF to a series of input vectors is shown. The vertical bars represent the output value of the associated node. Nodes with no bars have zero output. The constellation of activity that results can be associated with the identity of the most recent STP input.

Γ is calculated from the differential equation

    Γ̇ = α(S - T) + βṠ                                       (9.9)

where T is a positive constant, called the power-level target, and

    S = Σ_i x_i

is the total activity in the network. In practice, the choices of the parameters T, α, and β have a significant effect on the performance of the SCAF. You must adjust these parameters experimentally to tune the network.

9.3.1 Two-Node SCAF Example

To see how the behavior of the SCAF permits the network to distinguish between different time-sequences of patterns, we consider the simple case of a two-node system illustrated in Figure 9.14. We desire the system to distinguish between the two STPs, Q_1 = {Q_11, Q_12} and Q_2 = {Q_12, Q_11}, where the position of the input vector in the list indicates the order of presentation.

Let us assume that the weights on unit 1, z_1, are an exact match to the input pattern Q_11, and that those on unit 2, z_2, match Q_12. This match can be arranged through training exactly like that done for the CPN, which this part of the SCAF resembles. Furthermore, we assume that z_1, z_2, Q_11, and Q_12 are normalized to unity, and that z_1 and Q_11 are orthogonal to z_2 and Q_12. We will also assume that the weight connecting the output of unit 1 to unit 2 is one, as indicated in the figure, and the weight connecting the output of unit 2 to unit 1 is zero.


Figure 9.14  A simple two-node SCAF is shown.

The mechanism for training these weights will be described later. The final assumption is that the initial value for Γ is zero.

Consider what happens when Q_11 is applied first. The net input to unit 1 is

    I_1,net = z_1 · Q_11 + w_12 x_2 - Γ(t)
            = 1 + 0 - Γ(t)

where we have explicitly shown gamma as a function of time. According to Eqs. (9.2) through (9.4), ẋ_1 = -a x_1 + b(1 - Γ), so x_1 begins to increase, since Γ and x_1 are initially zero.

The net input to unit 2 is

    I_2,net = z_2 · Q_11 + w_21 x_1 - Γ(t)
            = 0 + x_1 - Γ(t)

Thus, x_2 also begins to rise due to the positive contribution from x_1. Since both x_1 and x_2 are increasing from the start, the total activity in the network, x_1 + x_2, increases quickly. Under these conditions, Γ(t) will begin to increase, according to Eq. (9.9).


After a short time, we remove Q_11 and present Q_12. x_1 will begin to decay, but slowly with respect to its rise time. Now, we calculate I_1,net and I_2,net again:

    I_1,net = z_1 · Q_12 + w_12 x_2 - Γ(t)
            = 0 + 0 - Γ(t)

    I_2,net = z_2 · Q_12 + w_21 x_1 - Γ(t)
            = 1 + x_1 - Γ(t)

Using Eqs. (9.2) through (9.4) again, ẋ_1 = -c a x_1 and ẋ_2 = b(1 + x_1 - Γ), so x_1 continues to decay, but x_2 will continue to rise until 1 + x_1 < Γ(t). Figure 9.15(a) shows how x_1 and x_2 evolve as a function of time.

A similar analysis can be used to evaluate the network output for the opposite sequence of input vectors. When Q_12 is presented first, x_2 will increase. x_1 remains at zero since I_1,net = -Γ(t) and, thus, ẋ_1 = -c a x_1. The total activity in the system is not sufficient to cause Γ(t) to rise.

When Q_11 is presented, the input to unit 1 is I_1 = 1. Even though x_2 is nonzero, the connection weight is zero, so x_2 does not contribute to the input to unit 1. x_1 begins to rise and Γ(t) begins to rise in response to the increasing total activity. In this case, Γ does not increase as much as it did in the first example. Figure 9.15(b) shows the behavior of x_1 and x_2 for this example. The values of Γ(t) for both cases are shown in Figure 9.15(c). Since Γ(t) is the measure of recognition, we can conclude that Q_11 → Q_12 was recognized, but Q_12 → Q_11 was not.

9.3.2 Training the SCAF

As mentioned earlier, we accomplish training the weights on the connections from the inputs by methods already described for other networks. These weights encode the spatial part of the STP. We have drawn the analogy between the SOM and the spatial portion of the STN. In fact, a good method for training the spatial weights on the SCAF is with Kohonen's clustering algorithm (see Chapter 7). We shall not repeat the discussion of that training method here. We shall instead concentrate on the training of the temporal part of the SCAF.

Encoding the proper temporal order of the spatial patterns requires training the weights on the connections between the various nodes. This training uses the differential Hebbian learning law (also referred to as the Kosko-Klopf learning law):

    ẇ_ij = (-c w_ij + d x_i x_j) U(ẋ_i) U(-ẋ_j)              (9.10)

where c and d are positive constants, and

    U(s) = 1   if s > 0
           0   if s ≤ 0


Figure 9.15  These figures illustrate the output response of a 2-node SCAF. (a) This graph shows the results of a numerical simulation of the two output values during the presentation of the sequence Q_11 → Q_12. The input pattern changes at t = 17. (b) This graph shows the results for the presentation of the sequence Q_12 → Q_11. (c) This figure shows how the value of Γ evolves in each case. Γ_1 is for the case shown in (a), and Γ_2 is for the case shown in (b).

Without the U factors, Eq. (9.10) resembles the Grossberg outstar law. The U factors ensure that learning can occur (ẇ_ij is nonzero) only under certain conditions. These conditions are that x_i is increasing (ẋ_i > 0) at the same time that x_j is decreasing (-ẋ_j > 0). When these conditions are met, both U factors will be equal to one. Any other combination of ẋ_i and ẋ_j will cause one, or both, of the Us to be zero.

The effect of the differential Hebbian learning law is illustrated in Figure 9.16, which refers back to the two-node SCAF in Figure 9.14. We want to train the network to recognize that pattern Q_11 precedes pattern Q_12. In the example that we did, we saw that the proper response from the network was achieved if w_12 = 0 while w_21 = 1.


Figure 9.16  This figure shows the results of a sequential presentation of Q_11 followed by Q_12. The net-input values of the two units are shown, along with the activity of each unit. Notice that we still consider that ẋ_1 > 0 and ẋ_2 > 0 throughout the periods indicated, even though the activity value is hard-limited to a maximum value of one. The region R indicates the time for which ẋ_1 < 0 and ẋ_2 > 0 simultaneously. During this time period, the differential Hebbian learning law causes w_21 to increase.

Thus, our learning law must be able to increase w_21 without increasing w_12. Referring to Figure 9.16, you will see that the proper conditions will occur if we present the input vectors in their proper sequence during training. If we train the network by presenting Q_11 followed by Q_12, then x_2 will be increasing while x_1 is decreasing, as indicated by the region, R, in Figure 9.16. The weight, w_12, remains at zero since the conditions are never right for it to learn. The weight, w_21, does learn, resulting in the configuration shown in Figure 9.14.
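The gating behavior of Eq. (9.10) is easy to demonstrate directly. In the Python sketch below, a hypothetical pair of activity traces (x_1 falling while x_2 rises, as in region R of Figure 9.16) drives the differential Hebbian update; only w_21 grows, while w_12 stays at zero. The traces and the constants c, d, and the step size are assumptions made for illustration.

import numpy as np

def diff_hebb_update(w, x, xdot, c=0.1, d=1.0, dt=0.01):
    """One Euler step of Eq. (9.10) for a full weight matrix.

    w[i, j] connects the output of unit j to unit i; learning is gated so
    that w[i, j] changes only while x_i is rising and x_j is falling.
    """
    U_rise = (xdot > 0).astype(float)          # U(xdot_i)
    U_fall = (xdot < 0).astype(float)          # U(-xdot_j)
    gate = np.outer(U_rise, U_fall)
    wdot = (-c * w + d * np.outer(x, x)) * gate
    np.fill_diagonal(wdot, 0.0)                # no self-connections
    return w + wdot * dt

# Hypothetical activity traces: x1 decays while x2 rises (Q11 then Q12).
w = np.zeros((2, 2))
x1, x2 = 0.8, 0.1
for _ in range(100):
    x1_new, x2_new = x1 * 0.98, min(x2 + 0.02, 1.0)
    xdot = np.array([x1_new - x1, x2_new - x2]) / 0.01
    w = diff_hebb_update(w, np.array([x1_new, x2_new]), xdot)
    x1, x2 = x1_new, x2_new
print(np.round(w, 3))      # w[1, 0] (that is, w21) grows; w[0, 1] stays at zero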


9.3.3 Time-Dilation Effects

The output values of nodes in the SCAF network decay slowly in time with respect to the rate at which new patterns are presented to the network. Viewed as a whole, the pattern of output activity across all of the nodes varies on a time scale somewhat longer than the one for the input patterns. This is a time-dilation effect, which can be put to good use.

Figure 9.17  This representation of a SCAF layer shows the output values as vertical lines.

Figure 9.17 shows a representation of a SCAF with circles as the nodes. The vertical lines represent hypothetical output values for the nodes. As the input vectors change, the output of the SCAF will change: New units may saturate while others decay, although this decay will occur at a rate slightly slower than the rate at which new input vectors are presented. For STPs that are sampled frequently (say, every few milliseconds), the variation of the output values may still be too quick to be followed by a human observer. Suppose, however, that the output values from the SCAF were themselves used as input vectors to another SCAF. Since these outputs vary at a slower rate than the original input vectors, they can be sampled at a lower frequency. The output values of this second SCAF would decay even more slowly than those of the previous layer. Conceptually, this process can be continued until a layer is reached where the output patterns vary on a time scale that is equal to the total time necessary to present a complete sequence of patterns to the original network. The last output values would be essentially stationary. A single set of output values from the last slab would represent an entire series of patterns making up one complete STP. Figure 9.18 shows such a system based on a hierarchy of SCAF layers.

The stationary output vector can be used as the input vector to one of the spatial pattern-classification networks. The spatial network can learn to classify the stationary input vectors by the methods discussed previously. A complete spatiotemporal pattern-recognition and pattern-classification system can be constructed in this manner.
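A toy illustration of the layered subsampling in Figure 9.18 follows. In this Python sketch, each SCAF layer is replaced by a simple leaky trace (an assumption made purely for brevity, not the SCAF equations themselves) so that its output varies more slowly than its input and can be sampled at half the rate; stacking three such layers reduces an eight-step STP to a single, nearly stationary pattern.

import numpy as np

def dilate(sequence, hold=2, decay=0.7):
    """Stand-in for one SCAF layer: lingering activity is subsampled every
    `hold` input steps, yielding a shorter, slower-varying sequence."""
    trace, out = np.zeros_like(sequence[0], dtype=float), []
    for t, pattern in enumerate(sequence):
        trace = decay * trace + (1.0 - decay) * pattern   # lingering activity
        if (t + 1) % hold == 0:
            out.append(trace.copy())                      # subsample outputs
    return out

# A fast eight-step STP is reduced to four, then two, then one pattern:
stp = [np.eye(4)[i % 4] for i in range(8)]
layer1 = dilate(stp)
layer2 = dilate(layer1)
layer3 = dilate(layer2)
print(len(stp), len(layer1), len(layer2), len(layer3))    # 8 4 2 1
print(np.round(layer3[0], 3))   # nearly stationary summary of the whole STP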


Exercise 9.4: No matter how fast input vectors are presented to a SCAF, the outputs can be made to linger if the parameters of the attack function are adjusted such that, once saturated, a node output decays very slowly. Such an arrangement would appear to eliminate the need for the layered SCAF architecture proposed in the previous paragraphs. Analyze the response of a SCAF to an arbitrary STP in the limiting case where saturated nodes never decay.

Figure 9.18  This hierarchy of SCAF layers is used for spatiotemporal pattern classification. The outputs from each layer are sampled at a rate slower than the rate at which inputs to that layer change. The output from the top layer, essentially a spatial pattern, can be used as an input to an associative network that classifies the original STP. (From bottom to top: original input vectors, SCAF 1 through SCAF 4, and an associative memory.)

9.4 APPLICATIONS OF STNs

We suggested earlier in this chapter that STNs would be useful in areas such as speech recognition, radar analysis, and sonar-echo classification. To date, the dearth of literature indicates that little work has been done with this promising architecture.


A prototype sonar-echo classification system was built by General Dynamics Corporation using the layered STN architecture described in Section 9.2 [8]. In that study, time slices of the incoming sonar signals were converted to power spectra, which were then presented to the network in the proper time sequence. After being trained on seven civilian boats, the network was able to identify correctly each of these vessels from the latter's passive sonar signature.

The developers of the SCAF architecture experimented with a 30 by 30 SCAF, where outputs from individual units are connected randomly to other units. Apparently, the network performance was encouraging, as the developers are reportedly working on new applications. Details of those applications are not available at the time of this writing.

9.5 STN SIMULATION

In this section, we shall describe the design of the simulator for the spatiotemporal network. We shall focus on the implementation of a one-layer STN, and shall show how that STN can be extended to encompass multilayer (and multinetwork) STN architectures. The implementation of the SCAF architecture is left to you as an exercise.

We begin this section, as we have all previous simulation discussions, with a presentation of the data structures used to construct the STN simulator. From there, we proceed with the development of the algorithms used to perform signal processing within the simulator. We close this section with a discussion of how a multiple STN structure might be created to record a temporal sequence of related patterns.

9.5.1 STN Data Structures

The design of the STN simulator is reminiscent of the design we used for the CPN in Chapter 6. We therefore recommend that you review Section 6.4 prior to continuing here. The reason for the similarity between these two networks is that both networks fit precisely the processing structure we defined for performing competitive processing within a layer of units.³ The units in both the STN and the competitive layer of the CPN operate by processing normalized input vectors, and even though competition in the CPN suppresses the output from all but the winning unit(s), all network units generate an output signal that is distributed to other PEs.

The major difference between the competitive layer in the CPN and the STN structure is related to the fact that the output from each unit in the STN becomes an input to all subsequent network units on the layer, whereas the lateral connections in the CPN simulation were handled by the host computer system, and never were actually modeled.

³Although the STN is not competitive in the same sense that the hidden layer in the CPN is, we shall see that STN units respond actively to inputs in much the same way that CPN hidden-layer units respond.


Similarly, the interconnections between units on the layer in the STN can be accounted for by the processing algorithms performed in the host computer, so we do not need to account for those connections in the simulator design.

Let us now consider the top-level data structure needed to model an STN. As before, we will construct the network as a record containing pointers to the appropriate lower-level structures, and containing any network specific data parameters that are used globally within the network. Therefore, we can create an STN structure through the following record declaration:

record STN =
   begin
      UNITS : ^layer;        {pointer to network units}
      a, b, c, d : float;    {network parameters}
      gamma : float;         {constant value for gamma}
      upper : ^STN;          {pointer to next STN}
      lower : ^STN;          {pointer to previous STN}
      y : float;             {output of last STN element}
   end record;

Notice that, as illustrated in Figure 9.19, this record definition differs from all previous network record declarations in that we have included a means for stacking multiple networks through the use of a doubly linked list of network record pointers.

Figure 9.19  The data structure of the STN simulator is shown. Notice that, in this network structure, there are pointers to other network records above and below to accommodate multiple STNs. In this manner, the same input data can be propagated efficiently through multiple STN structures.


We include this capability for two reasons:

1. As described previously, a network that recognizes only one pattern is not of much use. We must therefore consider how to integrate multiple networks as part of our simulator design.

2. When multiple STNs are used to time dilate temporal patterns (as in the SCAF), the activity patterns of the network units can be used as input patterns to another network for further classification.

Finally, inspection of the STN record structure reveals that there is nothing about the STN that will require further modifications or extensions to the generic simulator structure we proposed in Chapter 1. We are therefore free to begin developing STN algorithms.

9.5.2 STN Algorithms

Let us begin by considering the sequence of operations that must be performed by the computer to simulate the STN. Using the speech-recognition example as described in Section 9.2.1 as the basis for the processing model, we can construct a list of the operations that must be performed by the STN simulator.

1. Construct the network, and initialize the input connections to the units such that the first unit in the layer has the first normalized input pattern contained in its connections, the second unit has the second pattern, and so on.

2. Begin processing the test pattern by zeroing the outputs from all units in the network (as well as the STN.y value, since it is a duplicate copy of the output value from the last network unit), and then applying the first normalized test vector to the input of the STN.

3. Calculate the inner product between the input test vector and the weight vector for the first unprocessed unit.

4. Compute the sum of the outputs from all units on the layer from the first to the previous units, and multiply the result by the network d term.

5. Add the result from step 3 to the result from step 4 to produce the input activation for the unit.

6. Subtract the threshold value (Γ) from the result of step 5. If the result is greater than zero, multiply it by the network b term; otherwise, substitute zero for the result.

7. Multiply the negative of the network a term by the previous output from the unit, and add the result to the value produced in step 6.

8. If the result of step 7 was less than or equal to zero, multiply it by the network c term to produce ẋ. Otherwise, use the result of step 7 without modification as the value for ẋ.


9. Compute the attack value for the unit by multiplying the ẋ value calculated in step 8 by a small value indicating the network update rate (δt) to produce the update value for the unit output. Update the unit output by adding the computed attack value to the current unit output value.

10. Repeat steps 3 through 9 for each unit in the network.

11. Repeat steps 3 through 10 for the duration of the time step, Δt. The number of repetitions that occur during this step will be a function of the sampling frequency for the specific application.

12. Apply the next time-sequential test vector to the network input, and repeat steps 3 through 11.

13. After all the time-sequential test vectors have been applied, use the output of the last unit on the layer as the output value for the network for the given STP.

Notice that we have assumed that the network units update at a rate much more rapid than the sampling rate of the input (i.e., the value for δt is much smaller than the value of Δt). Since the actual sampling frequency (given by 1/Δt) will always be application dependent, we shall assume that the network must update itself 100 times for each input pattern. Thus, the ratio of δt to Δt is 1:100.

The first routine we need computes the input activation for a unit, following steps 3 through 5 of the processing list:

function activation (net:STN; unumber:integer; invec:^float[])
   return float;
   {compute the input activation to the specified unit}

   var i : integer;                    {iteration counter}
       sum, others : float;            {accumulators}
       unit, connects : ^float[];      {local pointers}

   begin
      unit = net.UNITS^.OUTS;          {locate unit output array}
      connects = net.UNITS^.WEIGHTS[unumber];
                                       {locate unit weight vector}
      sum = 0;
      others = 0;

      for i = 1 to length(invec)       {inner product with input}
      do
         sum = sum + connects[i] * invec[i];
      end do;


      for i = 1 to (unumber - 1)       {sum other units' outputs}
      do
         others = others + unit[i];
      end do;

      return (sum + net.d * others);   {return activation}
   end function;

The activation routine will allow us to compute the input-activation value for any unit in the STN. What we now need is a routine that will convert a given input value to the appropriate output value for any network unit. This service will be performed by the Xdot function, which we shall now define. Note that this routine performs the functions specified in steps 6 through 8 in the processing list above for any STN unit.

function Xdot (net:STN; unumber:integer; inval:float)
   return float;
   {convert the input value for the specified unit to an output value}

   var outval : float;
       unit : ^float[];

   begin
      unit = net.UNITS^.OUTS;          {access unit output array}
      outval = inval - net.gamma;      {threshold unit input}

      if (outval > 0)                  {if unit is on}
      then outval = outval * net.b     {scale the unit output}
      else outval = 0;                 {else unit is off}

      outval = outval + unit[unumber] * -net.a;

      if (outval <= 0)                 {attack function}
      then outval = outval * net.c;

      return (outval);
   end function;


procedure propagate (net:STN; invec:^float[]; dt:float);
   {propagate one input vector through the STN}

   var unit : ^float[];
       inval, dx : float;
       i, how_many : integer;

   begin
      unit = net.UNITS^.OUTS;          {locate the output array}
      how_many = length(unit);         {save number of units}

      for i = 1 to how_many            {for all units in the STN}
      do                               {generate output from input}
         inval = activation (net, i, invec);
         dx = Xdot (net, i, inval);
         unit[i] = unit[i] + (dx * dt);
      end do;

      net.y = unit[how_many];          {save last unit output}
   end procedure;

The propagate procedure will perform a complete signal propagation of one input vector through the entire STN. For a true spatiotemporal pattern-classification operation, propagate would have to be performed many times⁴ for every pattern Q_i that composes the spatiotemporal pattern to be processed. If the network recognized the temporal pattern sequence, the value contained in the STN.y slot would be relatively high after all patterns had been propagated.

⁴It would have to be performed essentially Δt/δt times, where Δt is the inverse of the sampling frequency for the application, and δt is the time that it takes the host computer to perform the propagate routine one time.

9.5.3 STN Training

In the previous discussion, we considered an STN that was trained by initialization. Training the network in this manner is fine if we know all the training vectors prior to building the network simulator. But what about those cases where it is preferable to defer training until after the network is operational? Such occurrences are common when the training environment is rather large, or when training-data acquisition is cumbersome. In such cases, is it possible to train an STN to record (and eventually to replay) data patterns collected at run time?

The answer to this question is a qualified "yes." The reason it is qualified is that the STN is not undergoing training in the same sense that most of the other networks described in this text are trained. Rather, we shall take the approach that an STN can be constructed dynamically, thus simulating the effect of training. As we have seen, the standard STN is constructed and initialized to contain the normalized form of the pattern to be encoded at each timestep in the connections of the individual network units. To train an STN, we will simply cause our program to create a new STN whenever a new pattern to be learned is available. In this manner, we construct specialized STNs that can then be exercised using all of the algorithms developed previously.
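The construct-on-demand idea can also be sketched outside the simulator framework used in this text. In the Python fragment below, each new STP causes a new (greatly simplified) network object to be created and appended to a list standing in for the doubly linked list of STN records; the class name, the crude match score used in place of the full attack-function dynamics, and the sample data are all hypothetical.

import numpy as np

class SimpleSTN:
    """Minimal stand-in for the STN record: stores one word's templates."""
    def __init__(self, name, spectra):
        self.name = name
        self.z = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
        self.y = 0.0

    def score(self, stp):
        # Crude match measure (not the full attack-function dynamics):
        # average template/input agreement, taken in the stored order.
        self.y = float(np.mean([z @ q for z, q in zip(self.z, stp)]))
        return self.y

networks = []                       # doubly linked list in the text; a list here

def learn_new_pattern(name, spectra):
    """'Training' by construction: build a new STN for each new STP."""
    networks.append(SimpleSTN(name, spectra))

rng = np.random.default_rng(2)
for word in ("zero", "one"):
    learn_new_pattern(word, rng.random((5, 8)))

probe = networks[0].z.copy()        # replay the first word's own spectra
best = max(networks, key=lambda net: net.score(probe))
print(best.name)                    # the most active STN identifies the word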


The only special consideration is that, with multiple networks in the computer simultaneously, we must take care to ensure that the networks remain accessible and consistent. To accomplish this feat, we shall simply link together the network structures in a doubly linked list that a top-level routine can then access sequentially. A side benefit to this approach is that we have now created a means of collecting a number of related STPs, and have grouped them together sequentially. Thus, we can utilize this structure to encode (and recognize) a sequence of related patterns, such as the sonar signatures of different submarines, using the output from the most active STN as an indication of the type of submarine.

The disadvantage to the STN, as mentioned earlier, is that it will require many concurrent STN simulations to begin to tackle problems that can be considered nontrivial.⁵ There are two approaches to solving this dilemma, both of which we leave to you as exercises. The first alternative method is to eliminate redundant network elements whenever possible, as was illustrated in Figure 9.11 and described in the previous section. The second method is to implement the SCAF network, and to combine many SCAFs with an associative-memory network (such as a BPN or CPN, as described in Chapters 3 and 6, respectively) to decode the output of the final SCAF.

Programming Exercises

9.1. Code the STN simulator and verify its operation by constructing multiple STNs, each of which is coded to recognize a letter sequence as a word. For example, consider the sequence "N E U R A L" versus the sequence "N E U R O N." Assume that two STNs are constructed and initialized such that each can recognize one of these two sequences. At what point do the STNs begin to fail to respond when presented with the wrong letter sequence?

9.2. Create several STNs that recognize letter sequences corresponding to different words. Stack them to form simple sentences, and determine which (if any) STNs fail to respond when presented with word sequences that are similar to the encoded sequences.

9.3. Construct an STN simulator that removes the redundant nodes for the word-recognition application described in Programming Exercise 9.1. Show listings for any new (or modified) data structures, as well as for code. Draw a diagram indicating the structure of the network. Show how your new data structures lend themselves to performing this simulation.

9.4. Construct a simulator for the SCAF network. Show the data structures required, and a complete listing of code required to implement the network. Be sure to allow multiple SCAFs to feed one another, in order to stack networks. Also describe how the output from your SCAF simulator would tie into a BPN simulator to perform the associative-memory function at the output.

⁵That is not to say that the STN should be considered a trivial network.
There are many applications where the STN might provide an excellent solution, such as voiceprint classification for controlling access to protected environments.


9.5. Describe a method for training a BPN simulator to recognize the output of a SCAF. Remember that training in a BPN is typically completed before that network is first applied to a problem.

Suggested Readings

There is not a great deal of information available about Hecht-Nielsen's STN implementation. Aside from the papers cited in the text, you can refer to his book for additional information [4].

On the subject of STP recognition in general, and speech recognition in particular, there are a number of references to other approaches. For a general review of neural networks for speech recognition, see the papers by Lippmann [5, 6, 7]. For other methods see, for example, Grajski et al. [1] and Williams and Zipser [9].

Bibliography

[1] Kamil A. Grajski, Dan P. Witmer, and Carson Chen. A combined DSP and artificial neural network (ANN) approach to the classification of time series data. exponent: Ford Aerospace Technical Journal, pages 20-25, Winter 1989/1990.

[2] Stephen Grossberg. Learning by neural networks. In Stephen Grossberg, editor, Studies of Mind and Brain. D. Reidel Publishing, Boston, MA, pp. 65-156, 1982.

[3] Robert Hecht-Nielsen. Nearest matched filter classification of spatiotemporal patterns. Technical report, Hecht-Nielsen Neurocomputer Corporation, San Diego, CA, June 1986.

[4] Robert Hecht-Nielsen. Neurocomputing. Addison-Wesley, Reading, MA, 1990.

[5] Richard P. Lippmann and Ben Gold. Neural-net classifiers useful for speech recognition. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. IV-417-IV-426, June 1987. IEEE.

[6] Richard P. Lippmann. Neural network classifiers for speech recognition. The Lincoln Laboratory Journal, 1(1):107-124, 1988.

[7] Richard P. Lippmann. Review of neural networks for speech recognition. Neural Computation, 1(1):1-38, Spring 1989.

[8] Robert L. North. Neurocomputing: Its impact on the future of defense systems. Defense Computing, 1(1), January-February 1988.

[9] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.


C H A P T E R 10

The Neocognitron

ANS architectures such as backpropagation (see Chapter 3) tend to have general applicability. We can use a single network type in many different applications by changing the network's size, parameters, and training sets. In contrast, the developers of the neocognitron set out to tailor an architecture for a specific application: recognition of handwritten characters. Such a system has a great deal of practical application, although, judging from the introductions to some of their papers, Fukushima and his coworkers appear to be more interested in developing a model of the brain [4, 3].1 To that end, their design was based on the seminal work performed by Hubel and Wiesel elucidating some of the functional architecture of the visual cortex.

1 This statement is intended not as a negative criticism, but rather as justification for the ensuing, short discussion of biology.

We could not begin to provide a complete accounting of what is known about the anatomy and physiology of the mammalian visual system. Nevertheless, we shall present a brief and highly simplified description of some of that system's features as an aid to understanding the basis of the neocognitron design.

Figure 10.1 shows the main pathways for neurons leading from the retina back to the area of the brain known as the visual, or striate, cortex. This area is also known as area 17. The optic nerve is made up of axons from nerve cells called retinal ganglia. The ganglia receive stimulation indirectly from the light-receptive rods and cones through several intervening neurons.

Hubel and Wiesel used an amazing technique to discern the function of the various nerve cells in the visual system. They used microelectrodes to record the response of individual neurons in the cortex while stimulating the retina with light. By applying a variety of patterns and shapes, they were able to determine the particular stimulus to which a neuron was most sensitive.

The retinal ganglia and the cells of the lateral geniculate nucleus (LGN) appear to have circular receptive fields. They respond most strongly to circular


spots of light of a particular size on a particular part of the retina. The part of the retina responsible for stimulating a particular ganglion cell is called the receptive field of the ganglion. Some of these receptive fields give an excitatory response to a centrally located spot of light, and an inhibitory response to a larger, more diffuse spot of light. These fields have an on-center off-surround response characteristic (see Chapter 6, Section 6.1). Other receptive fields have the opposite characteristic, with an inhibitory response to the centrally located spot: an off-center on-surround response characteristic.

[Figure 10.1 labels the stages of this pathway for the left and right eyes: rod and cone cells, bipolar cells, and ganglion cells in the retina; the optic chiasma; the lateral geniculate nucleus; and center-surround, simple, and complex cortical cells in the visual cortex.]

Figure 10.1 Visual pathways from the eye to the primary visual cortex are shown. Some nerve fibers from each eye cross over into the opposite hemisphere of the brain, where they meet nerve fibers from the other eye at the LGN. From the LGN, neurons project back to area 17. From area 17, neurons project into other cortical areas, other areas deep in the brain, and also back to the LGN. Source: Reprinted with permission of Addison-Wesley Publishing Co., Reading, MA, from Martin A. Fischler and Oscar Firschein, Intelligence: The Eye, the Brain, and the Computer, © 1987 by Addison-Wesley Publishing Co.
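The on-center off-surround response just described is easy to state numerically. The small fragment below is purely illustrative (the 3-by-3 kernel values and the function name are our own assumptions, not taken from the text): a central excitatory weight surrounded by inhibitory weights responds strongly to a small, centered spot and hardly at all to uniform, diffuse illumination.

    /* Illustrative on-center off-surround response: a bright center pixel
       excites the cell, while bright surrounding pixels inhibit it.        */
    static const float kernel[3][3] = {
        { -0.125f, -0.125f, -0.125f },
        { -0.125f,  1.000f, -0.125f },
        { -0.125f, -0.125f, -0.125f }
    };

    float center_surround_response(const float patch[3][3])
    {
        float sum = 0.0f;
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
                sum += kernel[r][c] * patch[r][c];
        return sum;   /* 1.0 for a centered spot, 0.0 for uniform light */
    }

Reversing the signs of the weights gives the off-center on-surround characteristic mentioned above.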


The visual cortex itself is composed of six layers of neurons. Most of the neurons from the LGN terminate on cells in layer IV. These cells have circularly symmetric receptive fields like the retinal ganglia and the cells of the LGN. Further along the pathway, the response characteristic of the cells begins to increase in complexity. Cells in layer IV project to a group of cells directly above called simple cells. Simple cells respond to line segments having a particular orientation. Simple cells project to cells called complex cells. Complex cells respond to lines having the same orientation as their corresponding simple cells, although complex cells appear to integrate their response over a wider receptive field. In other words, complex cells are less sensitive to the position of the line on the retina than are the simple cells. Some complex cells are sensitive to line segments of a particular orientation that are moving in a particular direction.

Cells in different layers of area 17 project to different locations of the brain. For example, cells in layers II and III project to cells in areas 18 and 19. These areas contain cells called hypercomplex cells. Hypercomplex cells respond to lines that form angles or corners and that move in various directions across the receptive field.

The picture that emerges from these studies is that of a hierarchy of cells with increasingly complex response characteristics. It is not difficult to extrapolate this idea of a hierarchy into one where further data abstraction takes place at higher and higher levels. The neocognitron design adopts this hierarchical structure in a layered architecture, as illustrated schematically in Figure 10.2.

Figure 10.2 The neocognitron hierarchical structure is shown. Each box represents a level in the neocognitron comprising a simple-cell layer, U_Si, and a complex-cell layer, U_Ci, where i is the layer number. U_0 represents signals originating on the retina. There is also a suggested mapping to the hierarchical structure of the brain. The network concludes with single cells that respond to complex visual stimuli. These final cells are often called grandmother cells after the notion that there may be some cell in your brain that responds to complex visual stimuli, such as a picture of your grandmother.
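Section 10.1.1 describes this hierarchy as levels, each containing a simple-cell layer and a complex-cell layer that are themselves divided into planes. To keep that discussion concrete, the fragment below sketches one possible way a simulator might represent the organization; all type and field names are our own illustrative choices, not taken from the text.

    /* A minimal sketch of the neocognitron hierarchy: the network is a list
       of levels; each level holds an S-layer and a C-layer; each layer is a
       set of planes; each plane is a rectangular array of cell outputs.     */
    typedef struct {
        int    rows, cols;    /* plane dimensions                  */
        float *out;           /* rows*cols cell outputs, row-major */
    } plane;

    typedef struct {
        int    n_planes;
        plane *planes;
    } layer;

    typedef struct {
        layer s_layer;        /* simple-cell layer  U_Si */
        layer c_layer;        /* complex-cell layer U_Ci */
    } level;

    typedef struct {
        plane  retina;        /* input layer U_0      */
        int    n_levels;
        level *levels;        /* levels 1 .. n_levels */
    } neocognitron;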


We remind you that the description of the visual system that we have presented here is highly simplified. There is a great deal of detail that we have omitted. The visual system does not adhere to a strict hierarchical structure as presented here. Moreover, we do not subscribe to the notion that grandmother cells per se exist in the brain. We know from experience that strict adherence to biology often leads to a failed attempt to design a system to perform the same function as the biological prototype: Flight is probably the most significant example. Nevertheless, we do promote the use of neurobiological results if they prove to be appropriate. The neocognitron is an excellent example of how neurobiological results can be used to develop a new network architecture.

10.1 NEOCOGNITRON ARCHITECTURE

The neocognitron design evolved from an earlier model called the cognitron, and there are several versions of the neocognitron itself. The one that we shall describe has nine layers of PEs, including the retina layer. The system was designed to recognize the numerals 0 through 9, regardless of where they are placed in the field of view of the retina. Moreover, the network has a high degree of tolerance to distortion of the character and is fairly insensitive to the size of the character. This first architecture contains only feedforward connections. In Section 10.3.2, we shall describe a network that has feedback as well as feedforward connections.

10.1.1 Functional Description

The PEs of the neocognitron are organized into modules that we shall refer to as levels. A single level is shown in Figure 10.3. Each level consists of two layers: a layer of simple cells, or S-cells, followed by a layer of complex cells, or C-cells. Each layer, in turn, is divided into a number of planes, each of which consists of a rectangular array of PEs. On a given level, the S-layer and the C-layer may or may not have the same number of planes. All planes on a given layer will have the same number of PEs; however, the number of PEs on the S-planes can be different from the number of PEs on the C-planes at the same level. Moreover, the number of PEs per plane can vary from level to level. There are also PEs called V_S-cells and V_C-cells that are not shown in the figure. These elements play an important role in the processing, but we can describe the functionality of the system without reference to them.

We construct a complete network by combining an input layer, which we shall call the retina, with a number of levels in a hierarchical fashion, as shown in Figure 10.4. That figure shows the number of planes on each layer for the particular implementation that we shall describe here. We call attention to the


fact that there is nothing, in principle, that dictates a limit to the size of the network in terms of the number of levels.

Figure 10.3 A single level of a neocognitron is shown. Each level consists of two layers, and each layer consists of a number of planes. The planes contain the PEs in a rectangular array. Data pass from the S-layer to the C-layer through connections that are not shown here. In neocognitrons having feedback, there also will be connections from the C-layer to the S-layer.

[Figure 10.4: the layer sizes printed beneath the layers in the original include 19x19 (the retina), 19x19x12, 11x11x38, 7x7x32, 11x11x8, 7x7x22, 7x7x30, 3x3x6, and 1x10.]

Figure 10.4 The figure shows the basic organization of the neocognitron for the numeral-recognition problem. There are nine layers, each with a varying number of planes. The size of each layer, in terms of the number of processing elements, is given below each layer. For example, layer U_C2 has 22 planes of 7 x 7 processing elements arranged in a square matrix. The layer of C-cells on the final level is made up of 10 planes, each of which has a single element. Each element corresponds to one of the numerals from 0 to 9. The identification of the pattern appearing on the retina is made according to which C-cell on the final level has the strongest response.

The interconnection strategy is unlike that of networks that are fully interconnected between layers, such as the backpropagation network described in Chapter 3. Figure 10.5 shows a schematic illustration of the way units are connected in the neocognitron. Each layer of simple cells acts as a feature-extraction system that uses the layer preceding it as its input layer. On the first S-layer, the cells on each plane are sensitive to simple features on the retina, in this case, line segments at different orientation angles. Each S-cell on a single plane is sensitive to the same feature, but at different locations on the input layer. S-cells on different planes respond to different features. As we look deeper into the network, the S-cells respond to features at higher levels of abstraction; for example, corners with intersecting lines at various


angles and orientations. The C-cells integrate the responses of groups of S-cells. Because each S-cell is looking for the same feature in a different location, the C-cells' response is less sensitive to the exact location of the feature on the input layer. This behavior is what gives the neocognitron its ability to identify characters regardless of their exact position in the field of the retina. By the time we have reached the final layer of C-cells, the effective receptive field


of each cell is the entire retina. Figure 10.6 shows the character-identification process schematically.

Figure 10.5 This diagram is a schematic representation of the interconnection strategy of the neocognitron. (a) On the first level, each S unit receives input connections from a small region of the retina. Units in corresponding positions on all planes receive input connections from the same region of the retina. The region from which an S-cell receives input connections defines the receptive field of the cell. (b) On intermediate levels, each unit on an S-plane receives input connections from corresponding locations on all C-planes in the previous level. C-cells have connections from a region of S-cells on the S level preceding it. If the number of C-planes is the same as that of S-planes at that level, then each C-cell has connections from S-cells on a single S-plane. If there are fewer C-planes than S-planes, some C-cells may receive connections from more than one S-plane.

Figure 10.6 This figure illustrates how the neocognitron performs its character-recognition function. The neocognitron decomposes the input pattern into elemental parts consisting of line segments at various angles of rotation. The system then integrates these elements into higher-order structures at each successive level in the network. Cells in each level integrate the responses of cells in the previous level over a finite area. This behavior gives the neocognitron its ability to identify characters regardless of their exact position or size in the field of view of the retina. Source: Reprinted with permission from Kunihiko Fukushima, "A neural network for visual pattern recognition." IEEE Computer, March 1988. © 1988 IEEE.

Note the slight difference between the first S-layer and subsequent S-layers in Figure 10.5. Each cell in a plane on the first S-layer receives inputs from a single input layer, namely, the retina. On subsequent layers, each S-cell plane receives inputs from each of the C-cell planes immediately preceding it. The situation is slightly different for the C-cell planes. Typically, each cell on a C-cell plane examines a small region of S-cells on a single S-cell plane. For example, the first C-cell plane on layer 2 would have connections to only a region of S-cells on the first S-cell plane of the previous layer.


Reference back to Figure 10.4 reveals that there is not necessarily a one-to-one correspondence between C-cell planes and S-cell planes at each layer in the system. This discrepancy occurs because the system designers found it advantageous to combine the inputs from some S-planes to a single C-plane if the features that the S-planes were detecting were similar. This tuning process is evident in several areas of the network architecture and processing equations.

The weights on connections to S-cells are determined by a training process that we shall describe in Section 10.2.2. Unlike in many other network architectures (such as backpropagation), where each unit has a different weight vector, all S-cells on a single plane share the same weight vector. Sharing weights in this manner means that all S-cells on a given plane respond to the identical feature in their receptive fields, as we indicated. Moreover, we need to train


only one S-cell on each plane, then to distribute the resulting weights to the other cells.

The weights on connections to C-cells are not modifiable in the sense that they are not determined by a training process. All C-cell weights are usually determined by being tailored to the specific network architecture. As with S-planes, all cells on a single C-plane share the same weights. Moreover, in some implementations, all C-planes on a given layer share the same weights.

10.2 NEOCOGNITRON DATA PROCESSING

In this section we shall discuss the various processing algorithms of the neocognitron cells. First we shall look at the S-cell data processing, including the method used to train the network. Then, we shall describe processing on the C-layer.

10.2.1 S-Cell Processing

We shall first concentrate on the cells in a single plane of U_S1, as indicated in Figure 10.7. We shall assume that the retina, layer U_0, is an array of 19 by 19 pixels. Therefore, each U_S1 plane will have an array of 19 by 19 cells. Each plane scans the entire retina for a particular feature. As indicated in the figure, each cell on a plane is looking for the identical feature but in a different location on the retina. Each S-cell receives input connections from an array of 3 by 3 pixels on the retina. The receptive field of each S-cell corresponds to the 3 by 3 array centered on the pixel that corresponds to the cell's location on the plane.

When building or simulating this network, we must make allowances for edge effects. If we surround the active retina with inactive pixels (outputs always set to zero), then we can automatically account for cells whose fields of view are centered on edge pixels. Neighboring S-cells scan the retina array displaced by one pixel from each other. In this manner, the entire image is scanned from left to right and top to bottom by the cells in each S-plane.

A single plane of V_C-cells is associated with the S-layer, as indicated in Figure 10.7. The V_C-plane contains the same number of cells as does each S-plane. V_C-cells have the same receptive fields as the S-cells in corresponding locations in the plane. The output of a V_C-cell goes to a single S-cell in every plane in the layer. The S-cells that receive inputs from a particular V_C-cell are those that occupy a position in the plane corresponding to the position of the V_C-cell.
The output of the V_C-cell has an inhibitory effect on the S-cells. Figure 10.8 shows the details of a single S-cell along with its corresponding inhibitory cell.

Up to now, we have been discussing the first S-layer, in which cells receive input connections from a single plane (in this case the retina) in the previous layer. For what follows, we shall generalize our discussion to include the case


of layers deeper in the network where an S-cell will receive input connections from all the planes on the previous C-layer.

Figure 10.7 The retina, layer U_0, is a 19-by-19-pixel array, surrounded by inactive pixels to account for edge effects as described in the text. One of the S-planes is shown, along with an indication of the regions of the retina scanned by the individual cells. Associated with each S-layer in the system is a plane of V_C-cells. These cells receive input connections from the same receptive field as do the S-cells in corresponding locations in the plane. The processing done by these V_C-cells is described in the text.

Let the index k_l refer to the k-th plane on level l. We can label each cell on a plane with a two-dimensional vector, with n indicating its position on the plane; then, we let the vector v refer to the relative position of a cell in the previous layer lying in the receptive field of unit n. With these definitions, we


can write the following equation for the output of any S-cell:

    u_{S_l}(k_l, n) = r_l \, \phi\left[ \frac{1 + \sum_{k_{l-1}=1}^{K_{l-1}} \sum_{v \in A_l} a_l(k_{l-1}, v, k_l) \, u_{C_{l-1}}(k_{l-1}, n+v)}{1 + \frac{r_l}{1+r_l} \, b_l(k_l) \, v_{C_l}(n)} - 1 \right]    (10.1)

where the function \phi is a linear threshold function given by

    \phi(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases}    (10.2)

Figure 10.8 A single S-cell and corresponding inhibitory cell on the U_S1 layer are shown. Each unit receives the identical nine inputs from the retina layer. The weights, a_i, on the S-cell determine the feature for which the cell is sensitive. Both the a_i weights on connections from the retina and the weight, b, from the V_C-cell are modifiable, and are determined during a training process, as described in the text.


Let's dissect these equations in some detail. The inner summation of Eq. (10.1) is the usual sum-of-products calculation of inputs, u_{C_{l-1}}(k_{l-1}, n+v), and weights, a_l(k_{l-1}, v, k_l). The sum extends over all units in the previous C-layer that lie within the receptive field of unit n. Those units are designated by the vector n + v. Because we shall assume that all weights and cell output values are nonnegative, the sum-of-products calculation yields a measure of how closely the input pattern matches the weight vector on a unit.2 We have labeled the receptive field A_l, indicating that the geometry of the receptive field is the same for all units on a particular layer. The outer summation of Eq. (10.1) extends over all of the K_{l-1} planes of the previous C-layer. In the case of U_S1, there would be no need for this outer summation.

The product, b_l(k_l) \cdot v_{C_l}(n), in the denominator of Eq. (10.1), represents the inhibitory contribution of the V_C-cell. The parameter r_l, where 0 < r_l < \infty, determines the cell's selectivity for a specific pattern. The factor r_l/(1 + r_l) goes from zero to one as r_l goes from zero to infinity. Thus, for small values of r_l, the denominator of Eq. (10.1) could be relatively small compared to the numerator even if the input pattern did not exactly match the weight vector. This situation could result in a positive argument to the \phi function, and hence a nonzero output, even for an imperfect match. Thus, the quantity r_l determines the minimum relative strength of excitation versus inhibition that will result in a nonzero output of the unit. As r_l increases, r_l/(1 + r_l) \to 1. Therefore, a larger r_l requires a larger excitation relative to inhibition for a nonzero output.

2 Although we have not said anything about normalizing the input vector or the weights, we could add this condition to our system design. Then, we could talk about closeness in terms of the angle between the input and weight vectors, as we did in Chapter 6.
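To make Eq. (10.1) concrete, the fragment below sketches how a simulator might compute the output of one S-cell over a 3-by-3 receptive field on the first layer. It is a minimal sketch under our own assumptions about data layout (row-major storage, a single input plane, and a retina padded with inactive pixels so that edge cells are handled uniformly); the function and variable names are illustrative, not taken from the text.

    #include <stddef.h>

    /* Output of a single S-cell on the first S-layer, following Eq. (10.1).
       retina : padded input plane, row-major, (rows+2) x (cols+2)
       a      : 3x3 excitatory weights for this S-plane (shared by its cells)
       b      : inhibitory weight for this S-plane
       v_c    : output of the V_C-cell at the same position
       r      : selectivity parameter r_l                                     */
    float s_cell_output(const float *retina, size_t padded_cols,
                        size_t row, size_t col,          /* cell position */
                        const float a[3][3], float b, float v_c, float r)
    {
        float excite = 0.0f;
        for (int dr = 0; dr < 3; dr++)        /* scan the 3x3 receptive field */
            for (int dc = 0; dc < 3; dc++)
                excite += a[dr][dc] *
                          retina[(row + dr) * padded_cols + (col + dc)];

        float inhibit = (r / (1.0f + r)) * b * v_c;
        float x = (1.0f + excite) / (1.0f + inhibit) - 1.0f;

        return (x > 0.0f) ? r * x : 0.0f;     /* r_l * phi[.], Eq. (10.2) */
    }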


Notice that neither of the weight expressions, a_l(k_{l-1}, v, k_l) or b_l(k_l), depends explicitly on the position, n, of the cell. Remember that all cells on a plane share the same weights, even the b_l(k_l) weights, which we did not discuss previously.

We must now specify the output from the inhibitory nodes. The V_C-cell at position n has an output value of

    v_{C_l}(n) = \sqrt{ \sum_{k_{l-1}=1}^{K_{l-1}} \sum_{v \in A_l} c_l(v) \, u_{C_{l-1}}^{2}(k_{l-1}, n+v) }    (10.3)

where c_l(v) is the weight on the connection from a cell at position v of the V_C-cell's receptive field. These weights are not subject to training. They can take the form of any normalized function that decreases monotonically as the magnitude of v increases. One such function is [7]

    c_l(v) = \frac{\alpha_l^{\,r'(v)}}{C(l)}    (10.4)

where r'(v) is the normalized distance between the cell located at position v and the center of the receptive field, and \alpha_l is a constant less than 1 that determines the rate of falloff with increasing distance. The factor C(l) is a normalization constant:

    C(l) = \sum_{k_{l-1}=1}^{K_{l-1}} \sum_{v \in A_l} \alpha_l^{\,r'(v)}    (10.5)

The condition that the weights are normalized can be expressed as

    \sum_{k_{l-1}=1}^{K_{l-1}} \sum_{v \in A_l} c_l(v) = 1    (10.6)

which is satisfied by Eqs. (10.4) and (10.5).3 The form of the c_l(v) function also affects the S-cell's pattern selectivity by favoring patterns that are centrally located in the receptive field. We shall see in the next section that this same function modulates the weights, a_l(k_{l-1}, v, k_l), during learning. Thus, both excitatory inputs and inhibitory inputs will be stronger if the input pattern is centrally located in the cell's receptive field.

The particular form of Eq. (10.3) is that of a weighted root-mean-square of the inputs to the V_C-cell. Looking back at Eq. (10.1), we can see that, in the S-cells, the net excitatory input to the cell is being compared to a measure of the average input signal. If the ratio of the net excitatory input to the net inhibitory input is greater than 1, the cell will have a positive output.

3 If we impose the condition that the input vectors be normalized, as suggested in footnote 2, then we can relax the normalization condition on these weights.
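The following fragment sketches the corresponding V_C-cell computation of Eq. (10.3), together with one way to build the fixed c_l(v) weights in the spirit of Eqs. (10.4) and (10.5). As before, it is only an illustration: the 3-by-3 field, the distance measure, and all names are our own assumptions.

    #include <math.h>
    #include <stddef.h>

    /* Build normalized, monotonically decreasing weights c(v) over a 3x3
       field, following the form of Eqs. (10.4) and (10.5): c(v) is
       proportional to alpha raised to the normalized distance from center. */
    void build_c_weights(float c[3][3], float alpha, int n_input_planes)
    {
        float total = 0.0f;
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++) {
                float r = sqrtf((float)(dr * dr + dc * dc)) / sqrtf(2.0f);
                c[dr + 1][dc + 1] = powf(alpha, r);
                total += c[dr + 1][dc + 1];
            }
        /* Normalize so the sum over all input planes and positions is 1. */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                c[i][j] /= total * (float)n_input_planes;
    }

    /* Weighted root-mean-square of the receptive-field inputs, Eq. (10.3),
       for the single-input-plane case (the retina).                        */
    float v_c_cell_output(const float *retina, size_t padded_cols,
                          size_t row, size_t col, const float c[3][3])
    {
        float sum = 0.0f;
        for (int dr = 0; dr < 3; dr++)
            for (int dc = 0; dc < 3; dc++) {
                float u = retina[(row + dr) * padded_cols + (col + dc)];
                sum += c[dr][dc] * u * u;
            }
        return sqrtf(sum);
    }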


Exercise 10.1: We can rewrite Eq. (10.1) for a single S-cell in the succinct form

    u_S = r_l \, \phi\left[ \frac{1+e}{1+h} - 1 \right]    (10.7)

where e and h represent total excitatory and inhibitory inputs to the cell, respectively, and we have absorbed the factor r_l/(1+r_l) into h. Show that, when the inhibitory input is very small, Eq. (10.7) can be written as

    u_S \approx r_l \, \phi[e - h]

Exercise 10.2: Show that, when both e and h are very large in Eq. (10.7),

    u_S \approx r_l \left[ \frac{e}{h} - 1 \right]

10.2.2 Training Weights on the S-Layers

There are several different methods for training the weights on the neocognitron. The method that we shall detail here is an unsupervised-learning algorithm designed by the original neocognitron designers. At the end of this section, we will mention a few alternatives to this approach.

Unsupervised Learning. In principle, training proceeds as it does for many networks. First, an input pattern is presented at the input layer and the data are propagated through the network. Then, weights are allowed to make incremental adjustments according to the specified algorithm. After weight updates have occurred, a new pattern is presented at the input layer, and the process is repeated with all patterns in the training set until the network is classifying the input patterns properly.

In the neocognitron, sharing of weights on a given plane means that only a single cell on each plane needs to participate in the learning process. Once its weights have been updated, a copy of the new weight vector can be distributed to the other cells on the same plane. To understand how this works, we can think of the S-planes on a given layer as being stacked vertically on top of one another, aligned so that cells at corresponding locations are directly on top of one another. We can now imagine many overlapping columns running perpendicular to this stack. These columns define groups of S-cells, where all of the members in a group have receptive fields in approximately the same location of the input layer.

With this model in mind, we now apply an input pattern and examine the response of the S-cells in each column. To ensure that each S-cell provides a distinct response, we can initialize the a_l weights to small, positive random values. The b_l weights on the inhibitory connections can be initialized


to zero. We first note the plane and position of the S-cell whose response is the strongest in each column. Then we examine the individual planes so that, if one plane contains two or more of these S-cells, we disregard all but the cell responding the strongest. In this manner, we will locate the S-cell on each plane whose response is the strongest, subject to the condition that each of those cells is in a different column. Those S-cells become the prototypes, or representatives, of all the cells on their respective planes. Likewise, the strongest-responding V_C-cell is chosen as the representative for the other cells on the V_C-plane.

Once the representatives are chosen, weight updates are made according to the following equations:

    \Delta a_l(k_{l-1}, v, k_l) = q_l \, c_{l-1}(v) \, u_{C_{l-1}}(k_{l-1}, \hat{n} + v)    (10.8)

    \Delta b_l(k_l) = q_l \, v_{C_{l-1}}(\hat{n})    (10.9)

where q_l is the learning-rate parameter, c_{l-1}(v) is the monotonically decreasing function described in the previous section, and the location of the representative for plane k_l is \hat{n}.

Notice that the largest increases in the weights occur on those connections that have the largest input signal, u_{C_{l-1}}(k_{l-1}, \hat{n} + v). Because the S-cell whose weights are being modified was the one with the largest output, this learning algorithm implements a form of Hebbian learning. Notice also that weights can only increase, and that there is no upper bound on the weight value. The form of Eq. (10.1), for S-cell output, guarantees that the output value will remain finite, even for large weight values (see Exercise 10.2).

Once the cells on a given plane begin to respond to a certain feature, they tend to respond less to other features. After a short time, each plane will have developed a strong response to a particular feature. Moreover, as we look deeper into the network, planes will be responding to more complex features.

Other Learning Methods. The designers of the original neocognitron knew to what features they wanted each level, and each plane on a level, to respond. Under these circumstances, a set of training vectors can be developed for each layer, and the layers can be trained independently. Figure 10.9 shows the training patterns that were used to train the 38 planes on the second layer of the neocognitron illustrated previously in Figure 10.4.

It is also possible to select the representative cell for each plane in advance. Care must be taken, however, to ensure that the input pattern is presented in the proper location with respect to the representative's receptive field.
Here again, some foreknowledge of the desired features is required.

Provided that the weight vectors and input vectors are normalized, weight updates to representative cells can be made according to the method described in Chapter 6 for competitive layers. To implement this method, you would essentially rotate the existing weight vector a little in the direction of the input vector. You would need to multiply the input vector by the monotonically decreasing function and renormalize first [6].
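Returning to the unsupervised procedure of Eqs. (10.8) and (10.9), the sketch below updates the shared weights of one S-plane from its representative cell. It assumes the same 3-by-3, single-input-plane layout used in the earlier fragments; the names and layout are our own, not the book's.

    #include <stddef.h>

    /* Hebbian-style update for the representative S-cell of one plane,
       following Eqs. (10.8) and (10.9): weights only grow, in proportion
       to the input seen through the fixed c(v) window.                    */
    void update_s_plane_weights(float a[3][3], float *b,
                                const float c[3][3],         /* c_{l-1}(v) */
                                const float *retina, size_t padded_cols,
                                size_t rep_row, size_t rep_col, /* rep. cell */
                                float v_c_rep,    /* V_C output at that cell */
                                float q)          /* learning rate q_l       */
    {
        for (int dr = 0; dr < 3; dr++)
            for (int dc = 0; dc < 3; dc++)
                a[dr][dc] += q * c[dr][dc] *
                    retina[(rep_row + dr) * padded_cols + (rep_col + dc)];

        *b += q * v_c_rep;   /* Eq. (10.9) */
    }

After this call, the updated a and b arrays would simply be copied to (or shared by) every other cell on the same plane, as the weight-sharing scheme described earlier requires.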


Figure 10.9 This figure shows the four patterns used to train each of the 38 planes on layer U_S2 of the neocognitron designed to recognize the numerals 0 through 9. The square brackets indicate groupings of S-planes whose output connections converge on a single C-plane in the following layer. Source: Reprinted with permission from Kunihiko Fukushima, Sei Miyake, and Takayuki Ito, "Neocognitron: a neural network model for a mechanism of visual pattern recognition." IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), September/October 1983. © 1983 IEEE.

10.2.3 Processing on the C-Layer

The functions describing the C-cell processing are similar in form to those for the S-cells. Also like the S-layer, each C-layer has associated with it a single plane of inhibitory units that function in a manner similar to the V_C-cells on the S-layer. We label the output of these units v_{S_l}(n).

Generally, units on a given C-plane receive input connections from one, or at most a small number of, S-planes on the preceding layer. V_S-cells receive input connections from all S-planes on the preceding layer.


The output of a C-cell is given by

    u_{C_l}(k_l, n) = \psi\left[ \frac{1 + \sum_{\kappa_l=1}^{K_l} j_l(\kappa_l, k_l) \sum_{v \in D_l} d_l(v) \, u_{S_l}(\kappa_l, n+v)}{1 + v_{S_l}(n)} - 1 \right]    (10.10)

where K_l is the number of S-planes at level l, j_l(\kappa_l, k_l) is one or zero depending on whether S-plane \kappa_l is or is not connected to C-plane k_l, d_l(v) is the weight on the connection from the S-cell at position v in the receptive field of the C-cell, and D_l defines the receptive-field geometry of the C-cell. The function \psi is defined by

    \psi(x) = \begin{cases} \frac{x}{\beta + x} & x \ge 0 \\ 0 & x < 0 \end{cases}    (10.11)
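A sketch of the corresponding C-cell computation is shown below, again for a small receptive field pooled from a single connected S-plane. The saturating function and its parameter beta follow the form of Eq. (10.11); the field size, names, and data layout are our own illustrative assumptions, not the book's simulator.

    #include <stddef.h>

    /* Saturating output function in the form of Eq. (10.11): zero for
       negative arguments, x / (beta + x) otherwise, so large excitations
       approach (but never reach) 1.                                       */
    static float psi(float x, float beta)
    {
        return (x > 0.0f) ? x / (beta + x) : 0.0f;
    }

    /* Output of one C-cell that pools a 3x3 region of a single S-plane,
       following the form of Eq. (10.10).
       s_plane : padded S-plane outputs, row-major
       d       : fixed (untrained) pooling weights d_l(v)
       v_s     : output of the V_S inhibitory cell at the same position    */
    float c_cell_output(const float *s_plane, size_t padded_cols,
                        size_t row, size_t col,
                        const float d[3][3], float v_s, float beta)
    {
        float excite = 0.0f;
        for (int dr = 0; dr < 3; dr++)
            for (int dc = 0; dc < 3; dc++)
                excite += d[dr][dc] *
                          s_plane[(row + dr) * padded_cols + (col + dc)];

        return psi((1.0f + excite) / (1.0f + v_s) - 1.0f, beta);
    }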


10.3 PERFORMANCE OF THE NEOCOGNITRON

Figure 10.10 In this figure, the numeral 2 appears on the retina of the neocognitron. All the planes at each level are indicated by individual boxes. Cells within those planes that are responding to the input are indicated in black within the boxes. The planes on layer U_C4 have only one cell per plane, each cell corresponding to one of the numbers from 0 to 9. The strength of the response from each unit on that layer is indicated by the degree of darkening in the associated box. Source: Reprinted with permission of Addison-Wesley Publishing Co., Reading, MA, from Robert Hecht-Nielsen, Neurocomputing, © 1990 by Addison-Wesley Publishing Co.

Some examples of numerals that were recognized successfully appear in Figure 10.11. Finally, we show in Figure 10.12 an example of a pattern that results in a completely ambiguous response from the neocognitron. In the next section, we describe a method for resolving patterns such as the one in Figure 10.12.

10.4 ADDITION OF LATERAL INHIBITION AND FEEDBACK TO THE NEOCOGNITRON

There are two issues raised by the example of the superimposed patterns that we described at the end of the previous section. The first deals with resolving the ambiguity so that the network makes a clear choice. The second deals with getting the network to recognize and identify both patterns present on the retina.

Arranging for the network to make a choice between the two patterns can be accomplished by the addition of lateral inhibition between neighboring cells on a layer. If each cell is inhibiting other cells, then minor differences in response will be magnified over time, and one cell in the final layer usually will emerge as the clear winner.
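The text gives no equations for this lateral-inhibition stage, but the effect it describes, in which small differences are amplified until one unit dominates, can be illustrated with a simple iterative sketch. Everything below (the update rule, the inhibition coefficient, the names) is our own illustrative assumption, not the book's algorithm.

    /* Illustrative winner-take-all dynamics: each cell is repeatedly reduced
       by the summed activity of the other cells, so the cell that starts
       with the largest activity eventually suppresses the rest.             */
    void lateral_inhibition(float *act, int n, int iterations, float inhibit)
    {
        for (int t = 0; t < iterations; t++) {
            float total = 0.0f;
            for (int i = 0; i < n; i++)
                total += act[i];

            for (int i = 0; i < n; i++) {
                float others = total - act[i];
                float next = act[i] - inhibit * others; /* self minus neighbors */
                act[i] = (next > 0.0f) ? next : 0.0f;   /* activities stay >= 0 */
            }
        }
    }

Starting from activities such as {0.6, 0.5} with a modest inhibition coefficient, a few iterations drive the weaker unit to zero while the stronger one survives, which is the "clear winner" behavior described above.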


Figure 10.11 These are some examples of numerals that were successfully recognized by the neocognitron. The patterns vary according to size, location on the retina, amount of distortion, and addition of noise. Source: Reprinted with permission of Addison-Wesley Publishing Co., Reading, MA, from Robert Hecht-Nielsen, Neurocomputing, © 1990 by Addison-Wesley Publishing Co.

The second issue can be addressed by the addition of feedback paths in the network, along with other devices, such as gain controls on cells and variable threshold conditions. If a pattern such as the one in Figure 10.12 is presented, feedforward connections and lateral inhibition are allowed to function to choose a single responsive cell in the final layer of the network. This cell may represent either a 2 or a 4, and slightly different input patterns may alter the initial response.

Following this initial processing, signals are sent back toward the retina via other planes of cells, called w_S-cells and w_C-cells. During the feedforward process, only certain C-cells and S-cells remain active. These cells gate the


feedback pathways so that feedback signals retrace the same pathways through the network back to the retina. In turn, the w_S-cells and w_C-cells facilitate the strengthening of responses by the active C-cells and S-cells by effecting changes in the thresholds and gains of these cells. Thus, a self-sustaining resonance effect ensues, which is reminiscent of the similar effect in the ART networks of Chapter 8.

Figure 10.12 This pattern shows the numeral 4 superimposed on the numeral 2. This pattern cannot be identified unambiguously by the neocognitron that we have been describing.

To cause the network to recognize the second pattern present on the retina, all that is necessary is to interrupt the feedback signals momentarily. This action causes the gain of all active C-cells to be lowered, as though by fatigue. As a result, other previously inactive cells can respond, and a second resonance will be established where the second pattern is identified on the last layer of the network. Once again, we are reminded of the orienting subsystem of ART and the sustained inhibitory signals used there to facilitate search.

Although we have given only a cursory treatment to lateral inhibition and feedback here, we do not mean to give the impression that implementing these devices is easy. The architectural details and processing equations necessary are similar to those for the neocognitron, but are considerably more complex.


References at the end of this chapter will direct you to sources for the details, should you be interested in reading further.

Exercise 10.3: There are actually three numerals present in the pattern in Figure 10.12. What is the third numeral?

Suggested Readings

The original cognitron is described in an article by Fukushima [1]. Other articles by him and his colleagues are listed in the bibliography [4, 2, 3]. One of the clearest descriptions of the operation and training of the neocognitron appears in an article by Menon and Heinemann [7]. This paper also illustrates the use of the neocognitron architecture in a pattern-recognition application different from the numeral-recognition application.

As we mentioned in the Preface, we chose not to include the pseudocode for the neocognitron simulator. If you wish to see an example of how complicated such a simulator can become, see the HNC ANZA User's Manual (Hecht-Nielsen Neurocomputing Corporation, San Diego, CA) [5].

Bibliography

[1] Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20:121-136, 1975.

[2] Kunihiko Fukushima. Neural network model for selective attention in visual pattern recognition and associative recall. Applied Optics, 26(23):4985-4992, December 1987.

[3] Kunihiko Fukushima. A neural network for visual pattern recognition. Computer, 21(3):65-75, March 1988.

[4] Kunihiko Fukushima, Sei Miyake, and Takayuki Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):826-834, September-October 1983.

[5] Robert Hecht-Nielsen. ANZA User's Guide. Hecht-Nielsen Neurocomputer Corporation, San Diego, CA, Release 1.0, July 1987.

[6] Robert Hecht-Nielsen. Counterpropagation networks. Applied Optics, 26(23):4979-4984, December 1987.

[7] Murali M. Menon and Karl G. Heinemann. Classification of patterns using a self-organizing neural network. Neural Networks, 1(3):201-215, 1988.


395I N DAckley, David H., 189, 211action potential, 10, 144activation, 19, 26, 218, 225, 226function, 19requirement of differentiability, 98value, 19Ada, 35Adal<strong>in</strong>e, 45, 55-68, 131, 169adaptive l<strong>in</strong>ear comb<strong>in</strong>er, 55, 56adaptive l<strong>in</strong>ear element, 56adaptive l<strong>in</strong>ear neuron, 56data structures, 79simulator, 118adaptive beam form<strong>in</strong>g, 70adaptive resonance theory (ART),291-339, 341, 392attentional subsystem, 293, 294ga<strong>in</strong> control. 294, 295, 296, 297, 299susta<strong>in</strong>ed <strong>in</strong>hibition, 296top-down connection, 2932/3 rule, 296, 297ART1, 292, 298,algorithms, 329attentional subsystem, 298bottom-up connections, 293bottom-up LTM traces, 305bottom-up signals, 303data structures, 327fast learn<strong>in</strong>g, 305, 310, 313ga<strong>in</strong> control, 299, 302, 303orient<strong>in</strong>g subsystem, 294, 302, 303, 308,309, 310, 315process<strong>in</strong>g, 310on Fl, 299on F2, 302simulator, 327susta<strong>in</strong>ed <strong>in</strong>hibition, 3152/3 rule, 301, 302, 310vigilance parameter, 308, 309ART2, 292, 316analog <strong>in</strong>put vector, 316architecture, 316attentional subsystem, 316, 329, 331bottom-up LTM <strong>in</strong>itialization, 323fast learn<strong>in</strong>g, 322, 323ga<strong>in</strong> control, 316, 317, 319orient<strong>in</strong>g subsystem, 316, 318, 320, 321,392process<strong>in</strong>g, 324on Fl, 316on F2, 324simulator, 336vigilance parameter, 322, 323amplitude modulation, 45analog filters, 45Anderson, James A., 14, 41, 86anneal<strong>in</strong>g,process, 1/76, 177, 205schedule,/181, 189, 190, 197, 202simulated, 170, 177, 178, 182arcs, 4arrays :ANS data structures, 32dynamically allocated, 37<strong>in</strong>dex, 34l<strong>in</strong>early-sequential, 32structure, 37ART. See adaptive resonance theoryartificial neurons, 17ASCII, 4, 89, 91, 107, 166, 260associative memory, 127, 207, 258, 370auto, 130, 132, 136hetero, 130, 181<strong>in</strong>terpolative, 130asymmetric divergence, 173, 183automobile diagnostic application, 208avalanche networks, 262backpropagation, 89-125, 169, 214, 242, 273,341, 373, 380applications, 106data structures, 116equations, 95network (BPN), 89, 90, 92, 127, 131, 207,248,249,291, 370, 371, 377simulator, 114


396 Indexsummary, 101, 160tra<strong>in</strong><strong>in</strong>g data, conditions on, 103ballistic arm movement, 275, 283Barren, Robert J., 41bas<strong>in</strong>s of attraction, 139Beltrami, Edward J., 137bias.term, 56, 65, 104, 124weight, 65, 94bidirectional associative memory (BAM),127-168, 169, 190, 258, 327autoassociative, 140, 142data structures, 158energy function, 136, 137energy l<strong>and</strong>scape, 138noisy data, 132simulation, 156bidirectional connections, 132, 156b<strong>in</strong>ary digit, 171b<strong>in</strong>ary neurons, 12b<strong>in</strong>ary output units, 4bit, 171Boltzmann 179-212algorithms, 196completion network, 179, 180constant, 175data structures, 190distribution, 175, 178factor, 175<strong>in</strong>put-output network, 179, 181, 182, 189,208learn<strong>in</strong>g algorithms, 202mach<strong>in</strong>e, 179learn<strong>in</strong>g <strong>in</strong>, 182tra<strong>in</strong><strong>in</strong>g, 184signal propagation, 197simulator, 189, 204bra<strong>in</strong>, 263, 293, 376Butler, Charles, 41bytes, 171Carpenter, Gail A., 292, 299, 316, 319, 337,338Cauchydistribution, 189, 212mach<strong>in</strong>e, 189Caudill, Maureen, 41central process<strong>in</strong>g unit (CPU), 31characterdiscrim<strong>in</strong>ation, 4recognition, 5clamped units, 181C language, 40classical condition<strong>in</strong>g, 15, 16, 232, 233conditioned response (CR), 232conditioned stimulus (CS), 232unconditioned response (UCR), 232cluster, 222, 238, 263centroid of, 238<strong>in</strong>put vectors, 224Kohonen algorithm, 274, 359cognitron, 376Cohen, Michael A., 337competition, 226, 247between <strong>in</strong>stars, 225w<strong>in</strong>ner-take-all, 152, 227, 235competitive<strong>in</strong>stars, 232layer, 236, 237, 238, 249, 295network, 213, 215, 224units, 264connection, 4, 32excitatory, 19feedback, 132<strong>in</strong>hibitory, 19strength, 18weight, 6, 33, 36 , 37weighted, 5conscience, 242constellation of activity, 356constra<strong>in</strong>ts, 148strong, 148, 170weak, 148, 170contrast enhancement, 228, 295, 321convergence, 12proofs, 137convolution sum, 54co-occurrenceprobabilities, 187, 193, 202statistics, 204correlation matrix, 67<strong>in</strong>put, 58cortexarea 17, 18, 19, 373, 374, 375cerebral, 263visual (striate), 373layer IV, 375layer III, 375layer II, 375cost function, 148Cottrell, G. W., 124Counterpropagation network (CPN), 213-262,264, 265, 273, 286, 294, 331, 343, 370architecture of, 214as an associative memory, 258complete, 242, 243data process<strong>in</strong>g, 235forward mapp<strong>in</strong>g, 215, 235, 242hidden layer, 232outer layer, 230outstar tra<strong>in</strong><strong>in</strong>g, 239reverse mapp<strong>in</strong>g, 242simulator, 247data structures, 248learn<strong>in</strong>g algorithms, 254production algorithms, 251tra<strong>in</strong><strong>in</strong>g, 236crosstalk, 133, 135


Index 397data compression, 107, 210delta rule, 96, 101dendrites, 8Dickenson, Bradley W., 167differential equations, 217, 235differential Hebbian learn<strong>in</strong>g, 359, 361differentiated environment, 24digital signal process<strong>in</strong>g, 53digital signal processor, 45dimensional reduction, 270distribution function. See probability,179, 184divergence, 12dot product, 21dynamical system, 20dynamic memory allocation, 31echo-cancellation, 49, 69echo-suppression devices, 69edge effects, 381eigenvalue, 67electrocardiograph, 65energyfunction, 137l<strong>and</strong>scape, 177total, 184ensemble, 174canonical, 175entropy, 172, 175maximization of, 183maximum, 172, 184equilibrium, 217, 234thermal, 181error propagation, 120error surface, 169, 170Euclide<strong>and</strong>istance, 128, 269space, 128excitation, lateral, 279excitatory <strong>in</strong>puts, 9exemplars, 58, 130, 133, 291expectation value, 58, 297feature-map classifier, 271, 273feedback, 292, 390connections, 132negative, 12filtersadaptive, 50, 70, 71b<strong>and</strong>pass, 46, 50frequency response of, 48b<strong>and</strong>stop, 50frequency response of, 50highpass, 50lowpass, 50, 68transverse, 66float<strong>in</strong>g-po<strong>in</strong>thardware, 31operations, 31forgetfulness, 221formal avalanche, 342, 344, 345FORTRAN, 158Fourieranalysis, 51<strong>in</strong>tegral, 52series, 51free-runn<strong>in</strong>g system, 185Fukushima, Kunihiko, 373, 393functionactivation, 19canonical distribution, 179, 184cost, 148energy, 136Gaussian, 278l<strong>in</strong>ear threshold, 26, 27, 28, 383logistic, 98Lyapunov, 137, 147majority, 76Mexican-hat, 266, 267, 277, 278m<strong>in</strong>imization of, 169motor, 263output, 225partition, 175, 186probability density, 270quadratic output, 227, 229quench<strong>in</strong>g, 19sigmoid, 98, 99, 103, 114, 144, 145, 146,227, 230, 319ga<strong>in</strong>, 19, 392ga<strong>in</strong> parameter, 144, 145, 148Gaussi<strong>in</strong> function, 278Gelatt.lc. D., 212Gemani Donald, 188, 189Geman, Stuart, 188, 189generalization, 103, 242generalized delta rule, 93, 101, 170error m<strong>in</strong>imization by, 96global m<strong>in</strong>imum, 105, 106, 153, 169Glover, David E., 124German, R. Paul, 124graded potentials, 10gradient, 59, 62, 83, 96, 100gradient-descent method, 60, 95, 97, 99, 105,170, 182, 184Grajski, Kamil, 371gr<strong>and</strong>mother cell, 375Grossberg, Stephen, 228, 230, 232, 248, 262,292, 293, 297, 299, 316, 337, 342Hamm<strong>in</strong>gdistance, 128, 129, 133hypercube, 128, 129, 148, 169, 196space, 128, 129, 146Hamm<strong>in</strong>g, Richard W., 54heat reservoir, 174, 176, 178Hebb, Donald O., 15, 16Hebbian


  differential, 359, 361
  learning, 8, 15, 220, 232, 387
Hecht-Nielsen, Robert, 41, 124, 213, 262, 371
Heinemann, Karl G., 389, 393
hidden units, 5
hill climbing, 169
Hinton, Geoffrey E., 177, 189, 211, 212
Hirsch, Morris, 137
Hoff, Marcian E., 86
homunculus, 263, 264
Hopfield, John J., 41, 127, 155, 157, 167, 169, 190
Hopfield memory (network), 127, 141-157, 170, 178
  continuous, 141, 144, 146
  discrete, 141
  high-gain limit, 150
Huang, William Y., 41, 273
Hubel, David H., 373
hybrid circuit, 69
hyperparaboloid, 59
hyperplanes, 27, 28
hyperspace, 28, 78
hypersphere, unit, 238
image classification, 244
information
  gain, 173
  source, 171
  theory, 170, 171
inhibition, lateral, 152, 279, 390, 391, 392
inner product, 21, 32
input
  layer, 215, 216
  potentials, 10
  value, 19, 32
  vectors, 21
instar, 215, 218, 224, 226, 232, 266
  learning algorithm, 237
  output, 219
interactions, lateral, 265-267
interconnect. See connection
internal representation, 93
interneurons, 304
interpolation, 246
invariance
  rotation, 75
  scaling, 75
  translation, 75
ions
  calcium, 11
  chloride, 9
  negative organic, 9
  potassium, 9
  sodium, 9
Kaplan, Wilfred, 53
Kirkpatrick, S., 212
Kohonen, Teuvo, 248, 263, 265, 269, 276, 288, 289
Kolodzy, Paul, 338
Kosko, Bart, 167
Kosko-Klopf learning, 359
Kuh, Anthony, 167
lateral geniculate nucleus (LGN), 373, 374, 375
Lavine, Robert A., 41
layer, 20, 37
  hidden, 5, 28, 93, 100
  output, 4
learning
  by example (supervised), 5
  in a neural network, 95
  unsupervised, 264, 386
learning-rate parameter, 98, 101, 105, 116
least mean square (LMS)
  algorithm, 63
  rule, 57, 95, 96
  technique, 99
least squares approximation, 95
linear-associator, 131
linearly separable, 25
linked list, 32, 35, 36
  data structures, 35
  null pointers, 35
  pointers, 32
  record, 38
Lippman, Richard P., 41, 273, 371
LISP, 35
local-minima effects, 170
local minimum, 105, 169
long-term memory (LTM), 293
  equations describing, 321
  traces, 295, 296, 304
Lyapunov function, 137, 147
macroscopic processes
  modeling of, 293
Madaline, 45, 72-79, 110
  categorization, 72
  rule III (MRIII), 75
  rule II (MRII), 73, 75
mapping network, 94, 208
matrix
  Karhunen-Loeve (KL), 17
  key-weight, 77, 78
maximum-likelihood training, 273
McClelland, James, 41, 89
McCulloch-Pitts, 22
  model, 8
  network (exercise), 15
  predicate calculus, 25
  theory, 14
McCulloch, Warren S., 12
McGregor, Ronald J., 41
memory
  content addressable, 127
  human, 291
  long-term, 293, 321
  short-term, 227, 293
Menon, Murali M., 389, 393
minimum disturbance principle, 74
Minsky, Marvin, 24, 28, 29
modulation, 46
Modula-2, 35
momentum parameter, 105, 110, 115, 122
motor functions, 263
negative feedback, 265
neighborhood, 266, 278, 279, 283, 365
Neocognitron, 373-393
  architecture, 376
  C-layer processing, 376, 377, 378, 388
  complex cells, 375-376
  C-planes, 376
  data processing, 381
  hierarchical structure, 375
  hypercomplex cells, 375
  performance, 389
  receptive field, 379, 384
  S-cell, 376
  S-cell processing, 381
  simple cells, 375, 376, 377
  S-layer, 376, 377
  S-plane, 376
  Vc-cells, 376
  Vs-cells, 376
  Wc-cells, 391
  Ws-cells, 391
neural circuits, 13
neural network, 3, 41
  sizing, 104, 108
neural phonetic typewriter, 274, 283
neurobiology
  cell
    body, 8
    dendrites, 8
    membrane, 8, 9, 144
    nerve, 8
    postsynaptic, 11, 16
    presynaptic, 11, 16
    structure, 8
  ganglion
    receptive field of, 374
  membrane
    postsynaptic, 12
    presynaptic, 11
  neurons, 8, 218, 263
    axon, 8
    axon hillock, 9, 10, 144
    extracellular fluid, 9
    interstitial fluid, 8
    intracellular fluid, 8, 9
    myelin sheath, 8, 9
    nodes of Ranvier, 8, 9, 10
    synapse, 8, 10, 11, 16, 18, 144
    synaptic cleft, 11
  neurotransmitter, 11, 12, 16
  refractory period, 11
  resting potential, 9, 218
  sodium-potassium pump, 9
neuron, quadratic, 100, 101
neurophysiology, 8, 293
nodes, 4, 17
noise, 218
  patterns containing, 6
  suppression, 227, 230
noise-saturation dilemma, 216
nonspecific
  arousal, 19
  excitation, 301
  inhibitory signal, 317
  reset signal, 296
  signal, 294
normalization, 321
NP complete, 149
object-oriented program, 127
Oja rule, 17
on-center off-surround, 216, 226, 227, 229, 263, 265, 304, 319, 374
Oppenheim, Alan V., 54
optic nerve, 373
optimization, 170
  problem, 148
ordered feature map, 263
output, 26, 32, 33, 34, 37, 217
  function, 19, 20
  layer, 4
  learning procedure, 238
  scaled, 67
  vector, 21
outstar, 213, 215, 230, 231, 232, 235, 273, 343, 344, 360
Page, Ed, 167
paint-quality inspection, 110-114
Papert, Seymour, 25, 28, 29
paraboloid, 59, 60
parallel
  processing, 2, 30
  processors, 4
parity problem, 242
Parker, D. B., 89, 104
partition function, 175, 186
pattern
  activation, 294
  classification, 291, 292
  classification network, 362
  effects of a mismatch, 349
  matching, 293, 295, 296, 298
  recognition, 2, 75, 89, 345
Pavlov, 15
Pavlovian conditioning, 232
perceptron, 17, 21, 28
  association layer, 22, 23
  associator (A) units, 22, 23, 24
  convergence theorem, 24
  photo, 22
  response layer, 22
  response unit, 22, 23, 24
  sensory (S) points, 22
  source set, 22
phonotopic map, 275, 276
Pitts, Walter, 12
pixel, 4, 89, 107, 260
placeholder approach, 35
principal components, 17
probability, 171, 174, 201
  density function, 270
  distribution, 222, 237
  symbol, 172, 183
  theory, 22
processing
  cycles, 181, 187
  element (PE), 4, 17, 18
production mode, 6
propagate-adapt cycle, 92
propositional logic
  conjoined negation, 13
  conjunction, 13
  disjunction, 13
  divergence, 12
  precession, 13
pseudocode, 40
quadratic output function, 227
quenching threshold, 228
reflectance
  normalized pattern, 216
  patterns, 217
  variables, 226
Reif, F., 211
resonance, 292, 296, 310, 313, 325, 326, 392, 394
retina, 22, 23, 24, 25, 76, 374, 376, 379-383
retinal ganglia, 373, 375
Ritter, Helge J., 275, 276, 277
rods (and cones), 373
Rosenblatt, Frank, 21, 22, 24, 25
Rosenfeld, Edward, 14, 86
Rumelhart, David, 41, 89, 104
Ryan, Thomas W., 308
scale invariance, 75
scaling, 29
Schafer, Ronald W., 54
Schulten, Klaus J., 275, 276, 277
Sejnowski, Terrence J., 124, 177, 182, 189, 211
self-adaptation, 57
self-organization, 264
self-organizing map (SOM), 263-289
  algorithms, 281
  data processing, 265
  data structures, 279
  learning algorithms, 266, 283
  signal propagation, 281
  simulation, 279
  training, 286
self-scaling, 309
separability, 22, 23
sequential competitive avalanche field (SCAF), 355
  training, 359
Shannon, C. E., 211
short term memory (STM), 227, 293
sigmoid function. See function, sigmoid
signal
  carrier signal, 46
  encoding, 46
  overshoot, 52
  processing, 45, 46, 274
    digital, 53
    unit impulse function, 54
  undershoot, 52
simulated annealing. See annealing
slab, 76
Smale, Stephen, 137
somatotopic
  area, 264
  map, 263
sonar, 342, 363, 364, 370
spacecraft orientation system, 245, 256, 261
space shuttle, 244, 246, 256
spatiotemporal network (STN), 341-371
  algorithms, 366
  applications, 363
  data structures, 364
  for speech recognition, 345
  simulation, 364
  training, 369
spatiotemporal pattern, 341, 343
  classification, 362, 363
spectrum analyzer, 345, 346, 347
speech recognition, 274, 341, 345, 349, 353, 354, 363, 366, 371
spurious stable state, 133, 138
stability-plasticity dilemma, 292
statistical mechanics, 170, 171, 174
Stearns, Samuel D., 67, 70, 86
steepest descent. See gradient descent
stimulation
  subliminal, 297
  supraliminal, 297
STM. See short term memory
stuck vector, 240, 241
subset pattern, 306
superset pattern, 306
superset vector, 315
surface in weight space. See weight surface
symbolic logic, 22
symptom diagnosis application, 208
system
  activity threshold, 356
  discrete-time, linearly invariant, 53
  dynamical, 20
Szu, Harold, 189, 212
Tagliarini, Gene, 167
Tank, D. W., 41, 157, 167
telephone circuit, 69, 70
temperature, 176, 178, 181, 191
  absolute, 175
threshold
  condition, 240
  value, 25, 26
  variable, 356
time-dilation effect, 361, 362
tonotopic map, 263
topology-preserving map. See self-organizing map
training
  data, 103
  epoch, 113
  mode, 6
  passes, 191
  set, 58
translation invariance, 75
transverse filter, 66
traveling-salesperson problem, 148, 178
Turner, Charles J., 308
unit, 4
  Kohonen, 265
  linear threshold, 26-27, 98
unit hypersphere, 238
unlabeled data, 264, 271
unlearning, 234
van Allen, E. J., 338
Vecchi, M. P., 212
vector
  difference, 260
  formulation, 20
Venn diagram, 23
video image, 107
vigilance parameter. See adaptive resonance theory
visual cortex. See cortex
von Neumann architecture, 92
Wasserman, Philip D., 124
Weber function, 307
weight, 18, 20, 32, 33, 34, 35
  contour plot, 61
  initialization, 104
  matrices, 77
  search of, 60
  sharing, 380
  space, 136
  surface, 97
  vector, 21
Werbos, P., 89, 124
Widrow, Bernard, 67, 70, 78, 86
Wiesel, Torsten N., 373
Williams, Ronald J., 371
win region, 237, 238
Winter, Rodney, 78, 308
XOR, 25, 26, 27
zero-memory source, 171
Zipser, David, 371
