Deep Learning Based on Manhattan Update Rule - Helwan University

… where g(Λ) is the local gradient vector defined by

g_i(Λ) = ∂F_CML(Λ) / ∂λ_i |_Λ    (9)

and H(Λ) is the local Hessian matrix defined by

H_ij(Λ) ≡ ∂²F_CML(Λ) / (∂λ_i ∂λ_j) |_Λ    (10)

Newton's Method update rule is given by

λ^(τ) = λ^(τ−1) − η^(τ) H^(−1)(Λ) g(Λ)    (11)

Since CML is not a quadratic function, taking the full Newton step H^(−1)(Λ) g(Λ) may overshoot the maximum. Hence, η^(τ) ≠ 1 gives a damped Newton step. A line search algorithm is used to calculate η^(τ): the objective function is evaluated starting from the current model along the search direction, and η^(τ) is chosen so that the CML objective function increases.

The calculation, inversion, and storage of the Hessian matrix make Newton's Method useful only for small-scale problems. Quasi-Newton (variable metric) methods can be used when it is impractical to evaluate the Hessian matrix. Instead of obtaining an estimate of the Hessian matrix at a single point, these methods gradually build up an approximate Hessian matrix using gradient information from some or all of the previous iterates visited by the algorithm. Limited-memory quasi-Newton methods such as L-BFGS are particular realizations of quasi-Newton methods that cut down the storage for large problems (Nocedal & Wright, 1999).

The truncated-Newton method, known as the Hessian-Free approach (Nocedal & Wright, 1999; Martens, 2010; Kingsbury et al., 2012), is a second-order method for large-scale problems. It finds the search direction using an iterative solver, typically based on conjugate gradient, although other alternatives are possible. In this method, Hessian-vector products are computed without explicitly forming the Hessian. Hessian-free methods approximately invert the Hessian, while quasi-Newton methods invert an approximate Hessian.

By ignoring the second-order derivative, a first-order approximation of the CML objective leads to gradient ascent methods, and the update is given by

λ^(τ) = λ^(τ−1) + η g(Λ)    (12)

The step size η must be small enough to ensure a stable increase of the CML objective function. It can be shown that the algorithm is convergent provided that η satisfies the condition 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of the Hessian matrix H(Λ*) evaluated at the global maximum of the CML objective function (Haykin, 1998). In practice, second-order statistics are not accumulated, so λ_max is not known and η is chosen in an ad-hoc fashion by trial and error.

The training speed of gradient descent in batch mode is usually slow. The training process can be accelerated using an online variant known as stochastic gradient descent (SGD), which updates the learning system on the basis of the objective function measured for a single utterance or batch. Since the CML objective function is maximized in this work, stochastic gradient ascent is used to train DCRF models.

3.2. Manhattan Update

Many authors have described variants of the gradient ascent algorithm in which increased rates of convergence are obtained through learning rate adaptation (Jacobs, 1988; Sutton, 1992; Murata et al., 1997); a comparison between different algorithms is given in (Schiffmann et al., 1994). The Resilient Propagation (RProp) algorithm (Riedmiller & Braun, 1993) uses a Manhattan update rule to provide faster convergence. A Manhattan update rule does not involve the gradient magnitude. Our Manhattan (MH) update rule implementation is similar to RProp, where the updates are given by

η_i^(τ) = min(η_i^(τ−1) φ, η_max)   if g_i^(τ)(Λ) g_i^(τ−1)(Λ) > 0
η_i^(τ) = max(η_i^(τ−1) κ, η_min)   if g_i^(τ)(Λ) g_i^(τ−1)(Λ) < 0    (13)

where

λ_i^(τ) = λ_i^(τ−1) + η_i^(τ)   if g_i^(τ)(Λ) > 0
λ_i^(τ) = λ_i^(τ−1) − η_i^(τ)   if g_i^(τ)(Λ) < 0    (14)

The learning parameters were set as follows: κ = 0.5, φ = 1.2, and η_i^(0) = η.
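To make Eqs. (13)-(14) concrete, the following is a minimal sketch of one Manhattan (MH) step in Python. It illustrates a straightforward reading of the equations above rather than the authors' implementation; the step-size bounds eta_min and eta_max and the use of NumPy arrays are assumptions introduced here for the example.

import numpy as np

def manhattan_step(lam, eta, g, prev_g, phi=1.2, kappa=0.5,
                   eta_min=1e-6, eta_max=1.0):
    """One Manhattan update over all parameters, per Eqs. (13)-(14).

    lam    : parameter vector Lambda at iteration tau-1 (1-D array).
    eta    : per-parameter step sizes eta_i at iteration tau-1.
    g      : current gradient g^(tau)(Lambda).
    prev_g : previous gradient g^(tau-1)(Lambda).
    kappa = 0.5 and phi = 1.2 follow the text; eta_min/eta_max are assumed bounds.
    """
    agree = g * prev_g                                                # sign of g_i^(tau) * g_i^(tau-1)
    eta = np.where(agree > 0, np.minimum(eta * phi, eta_max), eta)    # same sign: grow the step (Eq. 13)
    eta = np.where(agree < 0, np.maximum(eta * kappa, eta_min), eta)  # sign flip: shrink the step (Eq. 13)
    # Eq. (14): move by +/- eta_i using only the sign of the gradient, never its magnitude.
    lam = lam + np.where(g > 0, eta, np.where(g < 0, -eta, 0.0))
    return lam, eta

Unlike the single fixed step size η of Eq. (12), each parameter here carries its own step size, which grows by a factor φ while successive gradient components keep the same sign and shrinks by a factor κ when the sign flips; the direction of each move is decided by the gradient sign alone.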
3.3. DCRFs Gradient Computation

For an exponential family activation function based on first-order sufficient statistics, the gradient of the CML objective function for the output layer parameters is given by

∇F_CML(O) = C_ji^num(O) − C_ji^den(O)    (15)

where the accumulators of the sufficient statistics, C_ji(O), for the j-th state and i-th constraint are calculated …
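The accumulators are not defined in the excerpt above, so the sketch below only illustrates the numerator-minus-denominator structure of Eq. (15). It assumes, purely for illustration, that numerator and denominator state posteriors (gamma_num, gamma_den) have already been computed for an utterance and that the first-order sufficient statistic of constraint i at time t is the i-th component of the observation vector; neither assumption is taken from the paper.

import numpy as np

def cml_output_gradient(obs, gamma_num, gamma_den):
    """Num-minus-den gradient for the output-layer parameters, in the shape of Eq. (15).

    obs       : (T, I) observation sequence O; column i stands in for the
                first-order sufficient statistic of constraint i (assumption).
    gamma_num : (T, J) numerator state posteriors (assumed precomputed).
    gamma_den : (T, J) denominator state posteriors (assumed precomputed).
    Returns a (J, I) array whose (j, i) entry is C_ji^num(O) - C_ji^den(O).
    """
    c_num = gamma_num.T @ obs    # (J, I): sum_t gamma_num[t, j] * obs[t, i]
    c_den = gamma_den.T @ obs    # (J, I): sum_t gamma_den[t, j] * obs[t, i]
    return c_num - c_den

A gradient of this shape can then be fed to either the stochastic gradient ascent update of Eq. (12) or the Manhattan update of Eqs. (13)-(14).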


References

Haykin, Simon. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998.

Hifny, Yasser. Conditional Random Fields for Continuous Speech Recognition. PhD thesis, University of Sheffield, 2006.

Hifny, Yasser. Acoustic modeling based on deep conditional random fields. Deep Learning for Audio, Speech and Language Processing, ICML, 2013.

Hifny, Yasser, Renals, Steve, and Lawrence, Neil. A hybrid MaxEnt/HMM based ASR system. In Proc. INTERSPEECH, pp. 3017–3020, Lisbon, Portugal, 2005.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep Neural Networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Jacobs, R. A. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988.

Kingsbury, Brian, Sainath, Tara N., and Soltau, Hagen. Scalable minimum Bayes risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization. In Proc. INTERSPEECH, 2012.

Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pp. 282–289, 2001.

Lee, K.-F. and Hon, H.-W. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, November 1989.

Martens, J. Deep learning via Hessian-free optimization. In Proc. ICML, 2010.

Mohamed, Abdel-rahman, Yu, Dong, and Deng, Li. Investigation of full-sequence training of Deep Belief Networks for speech recognition. In Proc. INTERSPEECH, 2010.

Mohamed, Abdel-rahman, Dahl, George, and Hinton, Geoffrey. Acoustic modeling using Deep Belief Networks. IEEE Transactions on Audio, Speech and Language Processing, 20:14–22, 2012.

Murata, Noboru, Müller, Klaus-Robert, Ziehe, Andreas, and Amari, Shun-ichi. Adaptive on-line learning in changing environments. In Proc. NIPS, volume 9, pp. 599, 1997. URL citeseer.ist.psu.edu/murata97adaptive.html.

Nocedal, Jorge and Wright, Stephen J. Numerical Optimization. Springer, 1999.

Prabhavalkar, R. and Fosler-Lussier, E. Backpropagation training for multilayer conditional random field based phone recognition. In Proc. IEEE ICASSP, pp. 5534–5537, France, March 2010.

Riedmiller, M. and Braun, H. A direct method for faster backpropagation learning: The RPROP algorithm. In Proc. IEEE International Conference on Neural Networks, pp. 586–591, 1993.

Schiffmann, W., Joost, M., and Werner, R. Optimization of the backpropagation algorithm for training multilayer perceptrons. Technical report, University of Koblenz, 1994.

Seide, F., Li, G., and Yu, D. Conversational speech transcription using context-dependent Deep Neural Networks. In Proc. INTERSPEECH, 2011.

Sutton, Richard S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proc. AAAI, pp. 171–176, 1992. URL citeseer.ist.psu.edu/158284.html.

Vinyals, Oriol and Povey, D. Krylov subspace descent for deep learning. In Proc. AISTATS, 2012.

Young, Steve, Kershaw, Dan, Odell, Julian, Ollason, Dave, Valtchev, Valtcho, and Woodland, Phil. The HTK Book, Version 3.1. 2001.

Yu, Dong and Deng, Li. Deep-structured hidden conditional random fields for phonetic recognition. In Proc. INTERSPEECH, 2010.
