
4.6 REMARKS ON THE BACKPROPAGATION ALGORITHM

4.6.1 Convergence and Local Minima

As shown above, the BACKPROPAGATION algorithm implements a gradient descent search through the space of possible network weights, iteratively reducing the error E between the training example target values and the network outputs. Because the error surface for multilayer networks may contain many different local minima, gradient descent can become trapped in any of these. As a result, BACKPROPAGATION over multilayer networks is only guaranteed to converge toward some local minimum in E and not necessarily to the global minimum error.
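As a concrete illustration of this gradient descent view (a sketch, not the book's own code), the following Python fragment performs batch weight updates for a single sigmoid unit under squared error. The toy data, the learning rate eta, and all variable names are illustrative assumptions rather than the chapter's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_descent_step(w, X, t, eta=0.1):
    """One batch gradient-descent step for a single sigmoid unit,
    minimizing E = 1/2 * sum((t - o)^2) over all training examples."""
    o = sigmoid(X @ w)                      # network outputs
    error = t - o                           # target minus output
    # dE/dw_i = -sum((t - o) * o * (1 - o) * x_i), summed over examples
    grad = -(X.T @ (error * o * (1 - o)))
    return w - eta * grad                   # move downhill on the error surface

# toy data: 4 examples, 3 inputs (the first input acts as a constant bias)
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])
t = np.array([1.0, 1.0, 0.0, 0.0])
w = np.zeros(3)                             # weights initialized near zero
for _ in range(1000):
    w = gradient_descent_step(w, X, t)
```

Each step moves the weight vector in the direction of steepest descent on E; whether the sequence ends in the global minimum or some local minimum depends on the error surface, which is exactly the issue discussed next.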

Despite the lack of assured convergence to the global minimum error, BACKPROPAGATION is a highly effective function approximation method in practice. In many practical applications the problem of local minima has not been found to be as severe as one might fear. To develop some intuition here, consider that networks with large numbers of weights correspond to error surfaces in very high dimensional spaces (one dimension per weight). When gradient descent falls into a local minimum with respect to one of these weights, it will not necessarily be in a local minimum with respect to the other weights. In fact, the more weights in the network, the more dimensions that might provide "escape routes" for gradient descent to fall away from the local minimum with respect to this single weight.

A second perspective on local minima can be gained by considering the manner in which network weights evolve as the number of training iterations increases. Notice that if network weights are initialized to values near zero, then during early gradient descent steps the network will represent a very smooth function that is approximately linear in its inputs. This is because the sigmoid threshold function itself is approximately linear when the weights are close to zero (see the plot of the sigmoid function in Figure 4.6). Only after the weights have had time to grow will they reach a point where they can represent highly nonlinear network functions. One might expect more local minima to exist in the region of the weight space that represents these more complex functions. One hopes that by the time the weights reach this point they have already moved close enough to the global minimum that even local minima in this region are acceptable.
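A quick numerical check of this near-linearity (an illustrative aside, not taken from the text): the sigmoid's tangent line at zero is 1/2 + x/4, since its derivative at zero is 1/4, so a unit whose weighted input stays small behaves almost like a linear unit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Near x = 0 the sigmoid is well approximated by its tangent line
# sigmoid(x) ~= 1/2 + x/4; with small weights the unit's net input
# stays in this region, so the whole network is nearly linear.
for x in [-0.5, -0.1, 0.0, 0.1, 0.5, 2.0]:
    exact = sigmoid(x)
    linear = 0.5 + x / 4.0
    print(f"x={x:+.1f}  sigmoid={exact:.4f}  linear approx={linear:.4f}")
```

For inputs of small magnitude the two values agree to a few decimal places, while at x = 2.0 the approximation visibly breaks down, matching the claim that nonlinearity only appears once the weights have grown.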

Despite the above comments, gradient descent over the complex error surfaces represented by ANNs is still poorly understood, and no methods are known to predict with certainty when local minima will cause difficulties. Common heuristics to attempt to alleviate the problem of local minima include:

- Add a momentum term to the weight-update rule as described in Equation (4.18). Momentum can sometimes carry the gradient descent procedure through narrow local minima (though in principle it can also carry it through narrow global minima into other local minima!).
- Use stochastic gradient descent rather than true gradient descent. As discussed in Section 4.4.3.3, the stochastic approximation to gradient descent effectively descends a different error surface for each training example; because these per-example surfaces typically have somewhat different local minima, the search is less likely to become stuck in any one of them. A sketch of both heuristics follows this list.
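A minimal sketch of the two heuristics combined, assuming a single sigmoid unit and the standard momentum form delta_w(n) = eta * (gradient term) + alpha * delta_w(n-1) that Equation (4.18) describes. The learning rate eta, momentum constant alpha, and toy data below are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_with_momentum(X, t, eta=0.1, alpha=0.9, epochs=200):
    """Stochastic (per-example) gradient descent for a single sigmoid unit,
    with a momentum term: delta_w(n) = -eta * dE_d/dw + alpha * delta_w(n-1)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])      # small initial weights
    delta_w = np.zeros_like(w)                       # previous weight update
    for _ in range(epochs):
        for x, target in zip(X, t):                  # one example at a time
            o = sigmoid(x @ w)
            grad = -(target - o) * o * (1 - o) * x   # dE_d/dw for this example
            delta_w = -eta * grad + alpha * delta_w  # momentum carries the previous step
            w += delta_w
    return w

X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])
t = np.array([1.0, 1.0, 0.0, 0.0])
print(sgd_with_momentum(X, t))
```

Because each update is driven by a single example's error surface and retains a fraction of the previous step, the trajectory can roll through shallow dips in any one surface rather than settling into the first local minimum it meets.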
