Deep Learning Based on Manhattan Update Rule - Helwan University

… where g(Λ) is the local gradient vector defined by

g_i(Λ) = ∂F_CML(Λ) / ∂λ_i |_Λ    (9)

and H(Λ) is the local Hessian matrix defined by

H_ij(Λ) ≡ ∂²F_CML(Λ) / (∂λ_i ∂λ_j) |_Λ    (10)

Newton's Method update rule is given by

λ^(τ) = λ^(τ−1) − η^(τ) H^(−1)(Λ) g(Λ)    (11)

Since CML is not a quadratic function, taking the full Newton step H^(−1)(Λ) g(Λ) may overshoot the maximum. Hence, η^(τ) ≠ 1 gives a damped Newton step. A line search algorithm is used to calculate η^(τ): the objective function is evaluated starting from the current model along the search direction, and η^(τ) is chosen so that the CML objective function increases.

The calculation, inversion, and storage of the Hessian matrix make Newton's Method useful only for small-scale problems. Quasi-Newton (variable metric) methods can be used when it is impractical to evaluate the Hessian matrix. Instead of obtaining an estimate of the Hessian matrix at a single point, these methods gradually build up an approximate Hessian matrix using gradient information from some or all of the previous iterates visited by the algorithm. Limited-memory quasi-Newton methods such as L-BFGS are particular realizations of quasi-Newton methods that cut down the storage for large problems (Nocedal & Wright, 1999).

The truncated-Newton method, known as the Hessian-Free approach (Nocedal & Wright, 1999; Martens, 2010; Kingsbury et al., 2012), is a second-order method for large-scale problems. It finds the search direction using an iterative solver, typically based on conjugate gradient, although other alternatives are possible. In this method, Hessian-vector products are computed without explicitly forming the Hessian. Hessian-free methods approximately invert the Hessian, while quasi-Newton methods invert an approximate Hessian.

By ignoring the second-order derivative, a first-order approximation of the CML objective leads to gradient ascent methods, and the update is given by

λ^(τ) = λ^(τ−1) + η g(Λ)    (12)

The step size η must be small enough to ensure a stable increase of the CML objective function. It can be shown that the algorithm is convergent provided that η satisfies the condition 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of the Hessian matrix H(Λ*) evaluated at the global maximum of the CML objective function (Haykin, 1998). In practice, second-order statistics are not accumulated, so λ_max is not known and η is chosen in an ad-hoc fashion by trial and error.

The training speed of gradient descent in batch mode is usually slow. The training process can be accelerated using an online variant known as stochastic gradient descent (SGD), which updates the learning system on the basis of the objective function measured for a single utterance or batch. Since the CML objective function is maximized in this work, stochastic gradient ascent is used to train DCRF models.

3.2. Manhattan Update

Many authors have described variants of the gradient ascent algorithm in which increased rates of convergence are obtained through learning rate adaptation (Jacobs, 1988; Sutton, 1992; Murata et al., 1997); a comparison between different algorithms is given in (Schiffmann et al., 1994). The Resilient Propagation (RProp) algorithm (Riedmiller & Braun, 1993) uses a Manhattan update rule to provide faster convergence. A Manhattan update rule does not involve the gradient magnitude. Our Manhattan (MH) update rule implementation is similar to RProp, where the updates are given by

η_i^(τ) = min(η_i^(τ−1) φ, η_max)   if g_i^(τ)(Λ) g_i^(τ−1)(Λ) > 0
η_i^(τ) = max(η_i^(τ−1) κ, η_min)   if g_i^(τ)(Λ) g_i^(τ−1)(Λ) < 0    (13)

where

λ_i^(τ) = λ_i^(τ−1) + η_i^(τ)   if g_i^(τ)(Λ) > 0
λ_i^(τ) = λ_i^(τ−1) − η_i^(τ)   if g_i^(τ)(Λ) < 0    (14)

The learning parameters were set as follows: κ = 0.5, φ = 1.2, and η_i^(0) = η.
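To make Eqs. (13)-(14) concrete, the following is a minimal sketch of one Manhattan (MH) step in Python. It illustrates a straightforward reading of the equations above rather than the authors' implementation; the step-size bounds eta_min and eta_max and the use of NumPy arrays are assumptions introduced here for the example.

import numpy as np

def manhattan_step(lam, eta, g, prev_g, phi=1.2, kappa=0.5,
                   eta_min=1e-6, eta_max=1.0):
    """One Manhattan update over all parameters, per Eqs. (13)-(14).

    lam    : parameter vector Lambda at iteration tau-1 (1-D array).
    eta    : per-parameter step sizes eta_i at iteration tau-1.
    g      : current gradient g^(tau)(Lambda).
    prev_g : previous gradient g^(tau-1)(Lambda).
    kappa = 0.5 and phi = 1.2 follow the text; eta_min/eta_max are assumed bounds.
    """
    agree = g * prev_g                                                # sign of g_i^(tau) * g_i^(tau-1)
    eta = np.where(agree > 0, np.minimum(eta * phi, eta_max), eta)    # same sign: grow the step (Eq. 13)
    eta = np.where(agree < 0, np.maximum(eta * kappa, eta_min), eta)  # sign flip: shrink the step (Eq. 13)
    # Eq. (14): move by +/- eta_i using only the sign of the gradient, never its magnitude.
    lam = lam + np.where(g > 0, eta, np.where(g < 0, -eta, 0.0))
    return lam, eta

Unlike the single fixed step size η of Eq. (12), each parameter here carries its own step size, which grows by a factor φ while successive gradient components keep the same sign and shrinks by a factor κ when the sign flips; the direction of each move is decided by the gradient sign alone.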
3.3. DCRFs Gradient Computation

For an exponential family activation function based on first-order sufficient statistics, the gradient of the CML objective function for the output layer parameters is given by

∇F_CML(O) = C_ji^num(O) − C_ji^den(O)    (15)

where the accumulators of the sufficient statistics, C_ji(O), for the j-th state and i-th constraint are calculated …
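The accumulators are not defined in the excerpt above, so the sketch below only illustrates the numerator-minus-denominator structure of Eq. (15). It assumes, purely for illustration, that numerator and denominator state posteriors (gamma_num, gamma_den) have already been computed for an utterance and that the first-order sufficient statistic of constraint i at time t is the i-th component of the observation vector; neither assumption is taken from the paper.

import numpy as np

def cml_output_gradient(obs, gamma_num, gamma_den):
    """Num-minus-den gradient for the output-layer parameters, in the shape of Eq. (15).

    obs       : (T, I) observation sequence O; column i stands in for the
                first-order sufficient statistic of constraint i (assumption).
    gamma_num : (T, J) numerator state posteriors (assumed precomputed).
    gamma_den : (T, J) denominator state posteriors (assumed precomputed).
    Returns a (J, I) array whose (j, i) entry is C_ji^num(O) - C_ji^den(O).
    """
    c_num = gamma_num.T @ obs    # (J, I): sum_t gamma_num[t, j] * obs[t, i]
    c_den = gamma_den.T @ obs    # (J, I): sum_t gamma_den[t, j] * obs[t, i]
    return c_num - c_den

A gradient of this shape can then be fed to either the stochastic gradient ascent update of Eq. (12) or the Manhattan update of Eqs. (13)-(14).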


References

Haykin, Simon. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998.

Hifny, Yasser. Conditional Random Fields for Continuous Speech Recognition. PhD thesis, University of Sheffield, 2006.

Hifny, Yasser. Acoustic modeling based on deep conditional random fields. Deep Learning for Audio, Speech and Language Processing, ICML, 2013.

Hifny, Yasser, Renals, Steve, and Lawrence, Neil. A hybrid MaxEnt/HMM based ASR system. In Proc. INTERSPEECH, pp. 3017–3020, Lisbon, Portugal, 2005.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep Neural Networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Jacobs, R. A. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988.

Kingsbury, Brian, Sainath, Tara N., and Soltau, Hagen. Scalable minimum Bayes risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization. In Proc. INTERSPEECH, 2012.

Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pp. 282–289, 2001.

Lee, K.-F. and Hon, H.-W. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, November 1989.

Martens, J. Deep learning via Hessian-free optimization. In Proc. ICML, 2010.

Mohamed, Abdel-rahman, Yu, Dong, and Deng, Li. Investigation of full-sequence training of Deep Belief Networks for speech recognition. In Proc. INTERSPEECH, 2010.

Mohamed, Abdel-rahman, Dahl, George, and Hinton, Geoffrey. Acoustic modeling using Deep Belief Networks. IEEE Transactions on Audio, Speech and Language Processing, 20:14–22, 2012.

Murata, Noboru, Müller, Klaus-Robert, Ziehe, Andreas, and Amari, Shun-ichi. Adaptive on-line learning in changing environments. In Proc. NIPS, volume 9, pp. 599, 1997. URL citeseer.ist.psu.edu/murata97adaptive.html.

Nocedal, Jorge and Wright, Stephen J. Numerical Optimization. Springer, 1999.

Prabhavalkar, R. and Fosler-Lussier, E. Backpropagation training for multilayer conditional random field based phone recognition. In Proc. IEEE ICASSP, pp. 5534–5537, France, March 2010.

Riedmiller, M. and Braun, H. A direct method for faster backpropagation learning: The RPROP algorithm. In Proc. IEEE International Conference on Neural Networks, pp. 586–591, 1993.

Schiffmann, W., Joost, M., and Werner, R. Optimization of the backpropagation algorithm for training multilayer perceptrons. Technical report, University of Koblenz, 1994.

Seide, F., Li, G., and Yu, D. Conversational speech transcription using context-dependent Deep Neural Networks. In Proc. INTERSPEECH, 2011.

Sutton, Richard S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proc. AAAI, pp. 171–176, 1992. URL citeseer.ist.psu.edu/158284.html.

Vinyals, Oriol and Povey, D. Krylov subspace descent for deep learning. In Proc. AISTATS, 2012.

Young, Steve, Kershaw, Dan, Odell, Julian, Ollason, Dave, Valtchev, Valtcho, and Woodland, Phil. The HTK Book, Version 3.1. 2001.

Yu, Dong and Deng, Li. Deep-structured hidden conditional random fields for phonetic recognition. In Proc. INTERSPEECH, 2010.
