International Journal of Modern Engineering Sciences, 2015, 4(1): 14-21
International Journal of Modern Engineering Sciences
ISSN: 2167-1133, Florida, USA
Journal homepage: www.ModernScientificPress.com/Journals/IJMES.aspx
Article
Improving the Levenberg-Marquardt Training Algorithm for Feed Forward Neural Networks
Luma N. M. Tawfiq*
Department of Mathematics, College of Education Ibn Al-Haitham, Baghdad University
* Author to whom correspondence should be addressed; E-Mail: drluma_m@yahoo.com Article history: Received 12 April 2014, Received in revised form 10 March 2015, Accepted 25 March 2015, Published 17 May 2015.
Abstract: The aim of this paper is to design fast feed forward neural networks by developing an improved Levenberg-Marquardt training algorithm. The improvement speeds up solution times, reduces solver failures, and increases the possibility of obtaining the globally optimal solution, while overcoming the drawbacks of the standard algorithm. The proposed training algorithm for this type of network has a very fast convergence rate for reasonably sized networks.
Keywords: Artificial neural network, Feed forward neural network, Training algorithm.
1. Introduction
An artificial neural network (Ann) is a simplified mathematical model of the human brain; it can be implemented by both electronic elements and computer software. It is a parallel distributed processor with a large number of connections; it is an information processing system that has certain performance characteristics in common with biological neural networks. Nowadays many processes and mathematical procedures are automated, and there is a strong need for software that solves problems in science and engineering. The application of neural networks to solving complex real-life problems can be regarded as a mesh-free numerical method.
The problem of training neural networks, and in particular the Levenberg-Marquardt training algorithm, has been studied for several decades. For an overview of recent work see, e.g., [1-5] and the references therein.
2. What Are Artificial Neural Networks?
Artificial neural networks (Ann's) have been developed as generalizations of mathematical models of human cognition or neural biology, based on the following assumptions:
1) Information processing occurs at many simple elements called neurons, which are fundamental to the operation of Ann's.
2) Signals are passed between neurons over connection links.
3) Each connection link has an associated weight which, in a typical neural net, multiplies the signal transmitted.
4) Each neuron applies an activation function (usually nonlinear) to its net input (the sum of weighted input signals) to determine its output signal [6].
The units in a network are organized into a given topology by a set of connections, or weights, shown as lines in a diagram. An artificial neural network is characterized by [7]:
1) its architecture: the pattern of connections between the neurons;
2) its training algorithm: the method of determining the weights on the connections;
3) its activation function.
Artificial neural networks are often classified as single layer or multilayer. In determining the number of layers, the input units are not counted as a layer, because they perform no computation. Equivalently, the number of layers in the net can be defined as the number of layers of weighted interconnection links between the slabs of neurons. This view is motivated by the fact that the weights in a net contain extremely important information [8].
3. Multilayer Feed Forward Architecture
In a layered neural network the neurons are organized in the form of layers. There are at least two layers: an input layer and an output layer. The layers between the input and the output layer (if any) are called hidden layers, and their computation nodes are correspondingly called hidden neurons or hidden units. Extra hidden neurons raise the network's ability to extract higher-order statistics from the (input) data. The Ann is said to be fully connected if every node in each layer of the network is connected to every node in the adjacent forward layer; otherwise the network is called partially connected. Each layer consists of a certain number of neurons; each neuron is connected to the neurons of the previous layer through adaptable synaptic weights w and biases b [9].
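To make the layered computation concrete, the following minimal NumPy sketch shows a fully connected forward pass through an arbitrary number of layers; the function name, the tanh hidden activation and the linear output layer are illustrative assumptions, not a prescription from the paper.

```python
import numpy as np

def forward_pass(x, weights, biases):
    """Propagate an input vector through a fully connected layered network.

    weights: list of (n_out, n_in) matrices, one per layer.
    biases:  list of length n_out vectors, one per layer.
    A tanh activation is assumed for the hidden layers, with a linear output layer.
    """
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)                 # hidden layer: weighted sum + activation
    return weights[-1] @ a + biases[-1]        # linear output layer
```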
4. Training Feed Forward Neural Network
Training is the process of adjusting the connection weights w and biases b. In the first step, the network outputs and the difference between the actual (obtained) output and the desired (target) output (i.e., the error) are calculated for the initialized weights and biases (arbitrary values). In the second stage, the initialized weights in all links and the biases in all neurons are adjusted to minimize the error by propagating the error backwards (the back propagation algorithm). The network outputs and the error are then calculated again with the adapted weights and biases, and the process (the training of the Ann) is repeated at each epoch until a satisfactory output yk (corresponding to the values of the input variables x) is obtained and the error is acceptably small. In most training algorithms a learning rate is used to determine the length of the weight update (the step size) [10].
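As an illustration of this training cycle, the sketch below shows a generic gradient-based epoch loop; compute_error_grad is a hypothetical helper standing in for the forward pass plus back propagation over the whole training set, and the stopping test mirrors the "error acceptably small" criterion.

```python
import numpy as np

def train(weights, compute_error_grad, learning_rate=0.01, max_epochs=1000, tol=1e-6):
    """Generic gradient-based training loop (illustrative sketch).

    compute_error_grad(W) -> (error, gradient) is an assumed helper that runs
    the forward pass and back-propagates the error for the whole training set.
    """
    for epoch in range(max_epochs):
        error, grad = compute_error_grad(weights)
        if error < tol:                              # error acceptably small: stop
            break
        weights = weights - learning_rate * grad     # step length set by the learning rate
    return weights
```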
5. Levenberg-Marquardt Training Algorithm (trainlm)
The Levenberg-Marquardt algorithm was designed to approach second-order training speed (as in Newton's algorithm) without having to compute the Hessian matrix. When the performance function has the form of a sum of squares, the Hessian matrix can be approximated as H ≈ JᵀJ and the gradient can be computed as g = Jᵀe, where J is the Jacobian matrix, which contains the first derivatives of the network errors with respect to the weights and biases, and e is the vector of network errors. The Levenberg-Marquardt algorithm uses this approximation to the Hessian matrix in the following Newton-like update [9]:
Wk+1 = Wk − [JᵀJ + μI]⁻¹ Jᵀe.
When the scalar μ is zero, this is just Newton's method with the approximate Hessian matrix; when μ is large, this becomes gradient descent with a small step size.
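To make the update above concrete, here is a minimal NumPy sketch of a single Levenberg-Marquardt step; the function name and array shapes are assumptions for illustration, and J and e are assumed to have been obtained elsewhere, e.g., by back propagation.

```python
import numpy as np

def lm_step(weights, jacobian, errors, mu):
    """One Levenberg-Marquardt update: W <- W - (J^T J + mu*I)^(-1) J^T e.

    jacobian: (P, N) matrix of d(error_p)/d(w_n); errors: length-P vector e.
    """
    JtJ = jacobian.T @ jacobian              # Gauss-Newton approximation of the Hessian
    grad = jacobian.T @ errors               # gradient of the sum-of-squares error
    step = np.linalg.solve(JtJ + mu * np.eye(weights.size), grad)
    return weights - step
```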
6. Improved Implementation of the Levenberg-Marquardt Algorithm
Let ξp be the error given by the p-th training pattern vector, and let ξᵀ = (ξ1, …, ξP). The error function is then:
E = ½ ∑p=1P (ξp)² = ½ ║ξ║²,   (1)
Consider the following Jacobian matrix of the errors with respect to the N weights:
J = [ ∂ξ1/∂W1  …  ∂ξ1/∂WN ]
    [     ⋮     ⋱      ⋮   ]
    [ ∂ξP/∂W1  …  ∂ξP/∂WN ]
Then, considering a small variation in the weights W from step k to step k+1, the error vector ξ may be expanded in a Taylor series to first order:
ξ(k+1) ≈ ξ(k) + J(W(k+1) − W(k)),
and the error function at step k+1 is:
E(k+1) = ½ ║ξ(k+1)║² = ½ ║ξ(k) + J(W(k+1) − W(k))║²,   (2)
Minimizing (2) with respect to W(k+1) means:
∂E(k+1)/∂W(k+1) = [ξ(k) + J(W(k+1) − W(k))]ᵀ J = 0.
This is satisfied when the linearized error vanishes:
ξ(k) + J(W(k+1) − W(k)) = 0,   (3)
but J is not a square matrix, so we first multiply (3) on the left by Jᵀ and then multiply the result on the left by (JᵀJ)⁻¹, which finally gives:
W(k+1) = W(k) − (JᵀJ)⁻¹ Jᵀ ξ(k),   (4)
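As a quick numerical sanity check of equation (4), the snippet below verifies that the step −(JᵀJ)⁻¹Jᵀξ coincides with the least-squares solution of the overdetermined system J∆W = −ξ; the random J and ξ are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(10, 4))        # P = 10 patterns, N = 4 weights (illustrative)
xi = rng.normal(size=10)            # error vector at step k (illustrative)

step_normal_eq = -np.linalg.solve(J.T @ J, J.T @ xi)   # equation (4)
step_lstsq, *_ = np.linalg.lstsq(J, -xi, rcond=None)   # least-squares solution of J dW = -xi
assert np.allclose(step_normal_eq, step_lstsq)
```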
which represents the core of the Levenberg-Marquardt weight update formula. From (1) the Hessian matrix is:
Hij = ∂²E/(∂wi ∂wj) = ∑p=1P { (∂ξp/∂wi)(∂ξp/∂wj) + ξp (∂²ξp/(∂wi ∂wj)) },
and by neglecting the second-order derivatives the Hessian may be approximated as H ≈ JᵀJ. With this expression, equation (4) may be rewritten as W(k+1) = W(k) − H⁻¹ Jᵀ ξ(k); i.e., equation (4) essentially involves the inverse Hessian. However, it only requires the error gradient with respect to the weights, which may be computed efficiently by the back propagation algorithm. One problem should be taken care of: formula (4) may give values of ∆W so large that the first-order approximation (2) no longer applies. To avoid this situation, the following modified error function may be used instead:
E(k+1) = ½ ║ξ(k) + J(W(k+1) − W(k))║² + (μ/2) ║W(k+1) − W(k)║²,   (5)
where μ is a parameter governing the size of ∆W. By the same means as for (4), minimizing (5) with respect to W(k+1) means:
∂E(k+1)/∂W(k+1) = [ξ(k) + J(W(k+1) − W(k))]ᵀ J + μ[W(k+1) − W(k)]ᵀ = 0
Then:
Jᵀξ(k) + JᵀJ(W(k+1) − W(k)) + μ(W(k+1) − W(k)) = 0
Jᵀξ(k) + (JᵀJ + μI)(W(k+1) − W(k)) = 0
The new update formula becomes:
W(k+1) = W(k) − (JᵀJ + μI)⁻¹ Jᵀ ξ(k),   (6)
Now, for μ ≈ 0, (6) approaches the Newton formula; for μ ≫ 0, (6) approaches the gradient descent formula. Practical results show that, for sufficiently large values of μ, the error function is "guaranteed" to decrease, since the direction of change is opposite to the gradient and the step is proportional to 1/μ. A practical schedule for μ is: start with μ ≈ 0.1; then, if the error decreases, accept the new weights and divide μ by 10; if the error increases, go back (restore the old value of W, i.e., undo the changes), multiply μ by 10 and try again.
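A minimal sketch of the adaptive-μ schedule just described (accept the step and divide μ by 10 when the error decreases, otherwise restore the weights and multiply μ by 10); compute_errors and compute_jacobian are hypothetical helpers assumed to evaluate ξ(W) and J(W).

```python
import numpy as np

def train_lm(weights, compute_errors, compute_jacobian, mu=0.1, max_epochs=100, tol=1e-12):
    """Levenberg-Marquardt training with the adaptive damping schedule (sketch)."""
    errors = compute_errors(weights)
    energy = 0.5 * errors @ errors                      # E = 1/2 ||xi||^2
    for _ in range(max_epochs):
        J = compute_jacobian(weights)
        step = np.linalg.solve(J.T @ J + mu * np.eye(weights.size), J.T @ errors)
        trial_w = weights - step                        # damped update, equation (6)
        trial_errors = compute_errors(trial_w)
        trial_energy = 0.5 * trial_errors @ trial_errors
        if trial_energy < energy:                       # error decreased: accept, relax damping
            weights, errors, energy = trial_w, trial_errors, trial_energy
            mu /= 10.0
        else:                                           # error increased: undo, increase damping
            mu *= 10.0
        if energy < tol:
            break
    return weights
```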
7. Reducing the Computation of the Hessian Matrix
Consider the sum-of-squares error function for the pattern vector xp (input data):
Ep = ½ ∑i=1k [ (ya(xp))i − (ytp)i ]² = ½ [ ya(xp) − ytp ]ᵀ [ ya(xp) − ytp ],
then the Hessian is calculated immediately as:
∂²Ep/(∂wji ∂wkl) = ∑i=1k (∂yai/∂wji)(∂yai/∂wkl) + ∑i=1k (yai − yti)(∂²yai/(∂wji ∂wkl)).
For a well-trained network with a small amount of noise, the terms (yai − yti) are small and may be neglected; then:
∂²Ep/(∂wji ∂wkl) ≈ ∑i=1k (∂yai/∂wji)(∂yai/∂wkl),   (7)
and ∂yai/∂wji may be found by the back propagation procedure.
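Assuming, for simplicity, a network with a single output, so that equation (7) reduces to an outer product of the per-pattern gradient g = ∂ya/∂w (assumed to come from back propagation), the Hessian approximation can be accumulated as in this sketch.

```python
import numpy as np

def gauss_newton_hessian(output_grads):
    """Approximate Hessian per equation (7): H ~= sum_p g_p g_p^T.

    output_grads: (P, N) array, one gradient of the network output per pattern.
    The second-derivative terms are neglected, as in the text.
    """
    n = output_grads.shape[1]
    H = np.zeros((n, n))
    for g in output_grads:
        H += np.outer(g, g)
    return H
```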
8. Reducing the Computation of the Inverse Hessian Matrix
Consider the derivatives of the network output with respect to the weights, collected in the vector g = {∂ya/∂wji}ji; then, from equation (7), the Hessian matrix may be written as a square matrix of order W × W:
HP = ∑p=1P gp gpᵀ,
where P is the number of vectors in the training set and gp denotes g evaluated for the p-th training pattern. Now, by adding a new training vector:
HP+1 = HP + gP+1 gP+1ᵀ.
The calculation of the inverse Hessian matrix may then be reduced by using the following formula:
(A + BC)⁻¹ = A⁻¹ − A⁻¹B(I + CA⁻¹B)⁻¹CA⁻¹,
where A, B and C are any three conformable matrices and A is invertible. Putting A = HP, B = gP+1 and C = gP+1ᵀ, we have:
HP+1⁻¹ = (HP + gP+1 gP+1ᵀ)⁻¹ = HP⁻¹ − HP⁻¹ gP+1 (1 + gP+1ᵀ HP⁻¹ gP+1)⁻¹ gP+1ᵀ HP⁻¹
Using the above formula and starting with H0 = I, the inverse of the Hessian matrix may be computed in just one pass through the whole training set (a sketch of this incremental update is given after the notes below).
Note that:
1) Another way to improve network performance is to train multiple instances of the same network, each with a different set of initial weights, and to choose the one that gives the best results. This method is called a committee of networks.
2) Practical results show that the criterion for stopping the training process may be one of the following:
- Stop after a fixed number of steps.
- Stop when the error function becomes smaller than a specified amount.
- Stop when the change in the error function (∆E) becomes smaller than a specified amount.
- Stop when the error on an (independent) validation set begins to increase.
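Returning to the incremental inverse-Hessian update above, the following sketch applies it in one pass over the training set, starting from H0 = I as stated in the text; a single-output network is assumed so that each gp is a vector.

```python
import numpy as np

def update_inverse_hessian(H_inv, g):
    """Rank-one update: H_{P+1}^{-1} = H_P^{-1} - (H_P^{-1} g g^T H_P^{-1}) / (1 + g^T H_P^{-1} g)."""
    Hg = H_inv @ g
    return H_inv - np.outer(Hg, Hg) / (1.0 + g @ Hg)

def inverse_hessian_one_pass(output_grads):
    """Accumulate the inverse Hessian in a single pass, starting from H_0 = I."""
    H_inv = np.eye(output_grads.shape[1])
    for g in output_grads:
        H_inv = update_inverse_hessian(H_inv, g)
    return H_inv
```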
9. Numerical Experiment (Catalytic Reactions in a Flat Particle)
This example arises in a study of heat and mass transfer for a catalytic reaction within a porous catalyst flat particle. The differential equation is the direct result of a material and energy balance; assuming a flat geometry for the particle, and that conductive heat transfer is negligible compared to convective heat transfer, yields a second-order boundary value problem [11]:
where γ = 40 is a dimensionless energy of activation and β = 0.2 is a dimensionless parameter describing heat evolution. The boundary conditions are (mixed case):
The analytical solution of the equation obtained by the homotopy perturbation method is (see [11]):
where
Now, we solve this problem by a feed forward neural network with three layers, consisting of one input neuron, five hidden neurons and one output neuron, with a tansig transfer function in each hidden neuron; the trial neural form of the solution is taken to be yt(x):
The feed forward neural network was trained using a grid of ten equidistant points in [0, 1], and the following results were obtained.
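For illustration only, the sketch below sets up a 1-5-1 tansig network of the kind described above, together with the ten-point training grid; the weight arrays are placeholders, and this is not the MATLAB trainlm setup actually used in the experiment.

```python
import numpy as np

def ffnn_1_5_1(x, IW, b1, LW, b2=0.0):
    """Forward pass of a 1-5-1 network: tansig (tanh) hidden layer, linear output.

    IW: (5,) input weights, b1: (5,) hidden biases, LW: (5,) output-layer weights.
    """
    hidden = np.tanh(IW * x + b1)      # five tansig hidden neurons
    return LW @ hidden + b2            # single linear output neuron

# Grid of ten equidistant training points in [0, 1], as in the text
x_train = np.linspace(0.0, 1.0, 10)
```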
The neural results are presented in Table 1; Table 2 gives the performance of the training (epochs and time), and Table 3 gives the initial weights and biases of the designed network.

Table 1: Analytic and neural solutions
Input x    Analytic solution ya(x)    Neural solution yt(x)      The error E(x) = |yt(x) − ya(x)|
0          0.753803487349275          0.753803487349275          0
0.1        0.756338346219074          0.756338346219074          0
0.2        0.763930378314138          0.763930378314138          0
0.3        0.776539879049182          0.776541246523008          1.367473825397703e-06
0.4        0.794093657835504          0.794094585127485          9.272919814229397e-07
0.5        0.816474113434863          0.816474113434863          0
0.6        0.843503119286524          0.843503119286524          0
0.7        0.874919851699920          0.874920094724528          2.430246074380804e-07
0.8        0.910351376555380          0.910351376555380          0
0.9        0.949274436367628          0.949274436367628          1.110223024625157e-16
1          0.990966431527794          0.990972504907977          6.073380183213573e-06
Table 2: The performance of the training (epochs and time)
Train function    Performance of train    Epoch    Time       Msereg
trainlm           0.00                    445      0:00:03    3.2461e-012
Table 3: Initial weights and biases of the network (weights and biases for trainlm)
Net.IW{1,1}    Net.LW{2,1}    Net.B{1}
0.0933         0.8863         0.0263
0.2067         0.0519         0.4688
0.3028         0.0786         0.2434
0.3515         0.0145         0.9199
0.7267         0.1933         0.7158
10. Conclusion
Based on our numerical experiment, the improved Levenberg-Marquardt algorithm appears to be the fastest method for training moderate-sized FFNNs (up to several hundred weights). It also has a very efficient MATLAB implementation, since the solution of the matrix equation is a built-in function, so its advantages become even more pronounced in a MATLAB setting. Networks are also sensitive to the number of neurons in their hidden layers: too few neurons can lead to under-fitting, while too many neurons can contribute to over-fitting, in which all training points are fitted well but the fitting curve oscillates wildly between them.
References
[1] Pradeep, T., Srinivasu, P., Avadhani, P. S., and Murthy, Y. V. S., Comparison of variable learning rate and Levenberg-Marquardt back-propagation training algorithms for detecting attacks in Intrusion Detection Systems, International Journal on Computer Science and Engineering (IJCSE), 3(11) (2011): 104-121.
[2] Vikas, C., Analysis of back propagation algorithm, IJTEEE Journal, 2(8) (2013): 53-71.
[3] Yamashita, N., and Fukushima, M., On the rate of convergence of the Levenberg-Marquardt method, Computing Suppl. J., 15 (2001): 237-249.
[4] Tawfiq, L. N. M., and Oraibi, Y. A., Fast training algorithms for feed forward neural networks, Ibn Al-Haitham Jour. for Pure & Appl. Sci., 26(1) (2013): 275-280.
[5] Tawfiq, L. N. M., and Ali, M. H., Fast Feed Forward Neural Networks to Solve Boundary Value Problems, LAP Lambert Academic Publishing, 2012.
[6] Galushkin, I. A., Neural Networks Theory, Springer, Berlin Heidelberg, 2007.
[7] Hristev, R. M., The ANN Book, Edition 1, GNU Public License, 1998.
[8] Villmann, T., Seiffert, U., and Wismüller, A., Theory and applications of neural maps, ESANN 2004 Proceedings - European Symposium on Artificial Neural Networks, 2004: 25-38.
[9] Tawfiq, L. N. M., and Naoum, R. S., On training of artificial neural networks, AL-Fath Journal, 23 (2005): 73-88.
[10] Jabber, A. K., On Training Feed Forward Neural Networks for Approximation Problem, MSc Thesis, Baghdad University, College of Education Ibn Al-Haitham, 2009.
[11] Lin, Y., Enszer, J. A., and Stadtherr, M. A., Enclosing all solutions of two-point boundary value problems for ODEs, Journal of University of Notre Dame, USA, 11(3) (2007): 34-48.