Foundations of Data Science

Figure 6.8: A deep learning fully connected network (weights $W_1, \ldots, W_6$).

In fact the function
$$\mathrm{ReLU}(x) = \begin{cases} x & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
\qquad \text{where} \qquad
\frac{\partial\,\mathrm{ReLU}(x)}{\partial x} = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
seems to work well, even though its derivative at $x = 0$ is undefined. An advantage of ReLU over the sigmoid is that ReLU does not saturate far from the origin.
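
As an illustration, here is a minimal NumPy sketch of ReLU and the gradient convention above; the function names and test values are ours, not from the text.

```python
import numpy as np

def relu(x):
    # ReLU(x) = x for x >= 0 and 0 otherwise
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Convention from the text: gradient 1 for x >= 0, 0 otherwise
    # (the derivative at x = 0 is undefined; 1 is used there by convention).
    return (x >= 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0, 100.0])
print(relu(x))       # [  0.   0.   0.   2. 100.]  -- no saturation for large inputs
print(relu_grad(x))  # [0. 0. 1. 1. 1.]
```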

Training a deep learning network of 7 or 8 levels using gradient descent can be computationally expensive.$^{26}$ To address this issue one trains one level at a time on unlabeled data using an idea called autoencoding. There are three levels: the input, a middle level called the hidden level, and an output level, as shown in Figure 6.9a. There are two sets of weights: $W_1$ is the weights of the hidden-level gates and $W_2 = W_1^T$. Let $x$ be the input pattern and $y$ be the output. The error is $|x - y|^2$. One uses gradient descent to reduce the error. Once the weights $W_1$ are determined they are frozen and a second hidden level of gates is added as in Figure 6.9b. In this network $W_3 = W_2^T$ and stochastic gradient descent is again used, this time to determine $W_2$. In this way one level of weights is trained at a time.
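
The following is a minimal sketch of one such autoencoding step, assuming tied weights ($W_2 = W_1^T$), squared reconstruction error $|x - y|^2$, ReLU gates, and plain gradient descent; the function name train_layer, the learning rate, and the iteration count are illustrative choices, not taken from the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def train_layer(X, n_hidden, lr=0.01, steps=100, seed=0):
    """Train one autoencoding level: hidden h = ReLU(X W1), output y = h W1^T.

    X is an (n_examples, n_input) matrix of unlabeled inputs.
    Returns W1, which is then frozen before the next level is trained.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W1 = 0.01 * rng.standard_normal((d, n_hidden))    # random starting weights
    for _ in range(steps):
        h = relu(X @ W1)            # hidden level
        y = h @ W1.T                # output level, tied weights W2 = W1^T
        err = y - X                 # gradient of |x - y|^2 up to a constant factor
        grad_h = (err @ W1) * (h > 0)             # back through the ReLU gates
        grad_W1 = (X.T @ grad_h + err.T @ h) / n  # both appearances of W1 contribute
        W1 -= lr * grad_W1          # gradient descent step on reconstruction error
    return W1
```

Training the second level in Figure 6.9b follows the same pattern, with $W_1$ held fixed and gradient descent (or stochastic gradient descent) applied only to $W_2$ and its transpose $W_3$.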

The output of the hidden gates is an encoding of the input. An image might be a $10^8$-dimensional input and there may be only $10^5$ hidden gates. However, the number of images might be $10^7$, so even though the dimension of the hidden layer is smaller than the dimension of the input, the number of possible codes far exceeds the number of inputs, and thus the hidden layer is a compressed representation of the input. If the hidden layer were the same dimension as the input layer one might get the identity mapping. This does not happen for gradient descent starting with random weights.
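
Continuing the sketch above, the encoding of an input is simply the hidden-level output once $W_1$ is frozen; the sizes below are a scaled-down stand-in for the $10^8$-input, $10^5$-hidden-gate example in the text.

```python
rng = np.random.default_rng(1)
X = rng.random((500, 1000))        # 500 inputs, each 1000-dimensional
W1 = train_layer(X, n_hidden=100)  # train and freeze the first level

codes = relu(X @ W1)               # hidden-level output: the compressed encoding
print(codes.shape)                 # (500, 100), far fewer dimensions than the input
```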

$^{26}$In the image recognition community, researchers work with networks of 150 levels. The levels tend to be convolutional rather than fully connected.

