Investigación Científica, Volumen 7, número 1, enero–julio 2013, ISSN 1870–8196
Solving Artificial Neural Networks (Perceptrons)
Using Genetic Algorithms
Luis Copertari, Computer Engineering Program, Autonomous University of Zacatecas
Santiago Esparza, Aldonso Becerra, Gustavo Zepeda, Software Engineering Program, Autonomous University of Zacatecas
copertari@yahoo.com
Introduction

Neural networks

Artificial Neural Networks are fundamentally a development of the twentieth century. Originally, there was some interdisciplinary work on physics, psychology and neurophysiology carried out by Hermann von Helmholtz, Ernst Mach and Ivan Pavlov. Such work did not involve mathematical modeling of any kind and emphasized general theories of learning, vision and conditioning, among others. McCulloch and Pitts (1943) were responsible for the first work on artificial neural networks during the 1940s. They showed that artificial neurons can, in principle, compute any logical or arithmetic function. Then, Hebb (1949) proposed a mechanism for learning in biological neurons based on Pavlov's classical conditioning. Rosenblatt (1958) invented during the 1950s the first practical application of artificial neural networks, the perceptron and its learning rule, showing that it can perform character pattern recognition. At approximately the same time, Widrow and Hoff (1960) introduced the linear neural network and the corresponding learning algorithm, both being similar to the perceptron in structure and capacity. Unfortunately, both algorithms had the same limitation, shown by Minsky and Papert (1969): they can solve only a very limited class of problems. Precisely due to the discovery of such limitations, the field of artificial neural networks suffered and was abandoned by researchers for approximately a decade. The lack of new ideas and of computational power also made research during those years difficult. Some work, however, continued. During the 1970s, Kohonen (1972) and Anderson (1972) independently developed new neural networks that could act as memories. Grossberg (1976) was also active during those years developing self-organizing networks. During the 1980s, the power of personal computers and workstations began to grow rapidly. Additionally, there were two conceptual developments responsible for the re-emergence of artificial neural networks. The first idea, presented
by Hopfield (1982), was the use of statistical mechanics to explain the operation of a certain kind of recurrent network that could be used as an associative memory. The second idea, independently discovered by several researchers, was the backpropagation algorithm, which allowed multi-layer perceptron networks to be trained for the first time, thereby overcoming the limitations pointed out by Minsky and Papert in 1969. Rumelhart and McClelland (1986) were responsible for the publication of the most influential formulation of the backpropagation algorithm. These two developments, combined with the accessibility of computational power, reinvigorated the field of neural networks during the last two or three decades, and there have been innumerable papers on new theoretical developments and applications.
Genetic algorithms

Nature uses powerful means to propel the satisfactory evolution of organisms. Those organisms that are not particularly adapted to a given environment die, whereas those that are well adapted live and reproduce. The children resemble their parents, so that each new generation has organisms similar to the well adapted members of the previous generation. If the modifications of the environment are minor, species will evolve gradually with it; however, it is likely that a sudden change in the environment will cause the disappearance of entire species. Sometimes random mutations occur, and even though they usually imply the sudden death of the mutated individual, some of these mutations result in new and satisfactory species. The publication of Darwin's masterpiece, «On the Origin of Species by Means of Natural Selection», represented a breakthrough in the history of science. Genetic algorithms were the result of the work of Friedberg (1958), who tried to produce learning by mutating small Fortran programs. Since most mutations applied to the programs produced inoperative code, little progress could be achieved. John Holland (1975) renewed this field using representations of agents as strings of bits, in such a way that any possible string represented an operational agent. John Koza (1992) has achieved impressive results using complex representations of agents along with mutation and reproduction techniques where special attention is given to the syntax of the representation language.
Methodology

The first step is to do research on the topic, especially considering the different solution approaches. It is important to highlight the differences and theorize about the advantages and disadvantages of each approach. Secondly, the working hypothesis is stated, which guides the research efforts. In light of the working hypothesis, a series of experiments is proposed, some of them numerical, others with hardware and software. Then, highlighting the results, the observed issues are considered in order to explain what is happening. Finally, conclusions are detailed. This is the fundamental approach of the scientific method, discussed by Gauch Jr. (2003) and Wilson (1990), which is detailed as follows:

Observation. The current knowledge on perceptrons (a kind of artificial neural network) and on genetic algorithms is discussed.

Hypothesis. A hypothesis considering what has been observed is stated; in this case, that it is possible to solve neural networks (perceptrons) using genetic algorithms.

Experiments. Numerical experiments were designed using the computer to see what results are obtained. Experiments with two approaches are carried out: first, reproducing the weight matrix horizontally; and second, reproducing the weight matrix vertically.

Theory. The results are discussed and explanations detailing such results are given within a coherent contextual framework.

Conclusion. Specific conclusions are reached considering which approach works best and whether a mixed approach would be ideal.
The problem

The input vector

The problem is how to train a perceptron using genetic algorithms. A series of digitized numbers from zero to nine (0 to 9) is used to build and test the program. Such digitized numbers are shown in figure 1. Notice that the numbers are drawn in a grid of 6 rows by 5 columns. These numbers are transformed into a vector of 6x5 = 30 rows called p. The procedure is simple. First, take the first row. If a square is black (full), a 1 corresponds to such position; if the square is white (empty), a -1 corresponds to such position. Then, take the second row, and so on until the final (sixth) row is reached. For example, for the number zero (0), p = [-1 1 1 1 -1 1 -1 -1 -1 1 1 -1 -1 -1 1 1 -1 -1 -1 1 1 -1 -1 -1 1 -1 1 1 1 -1]T, where the superscript T indicates the transpose.
Figure 1. Digitized numbers from zero to nine (0 to 9).

This vector (p) constitutes the input vector.
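To make the encoding concrete, the following minimal Python sketch (illustrative only; the programs described later in this article are written in Delphi) builds the 30-element input vector p from a 6 by 5 grid given as rows of '1' (black) and '0' (empty) characters.

# Minimal sketch: build the input vector p from a 6x5 grid of '1'/'0' characters.
# Rows are read top to bottom; black squares map to 1 and empty squares to -1.
def grid_to_input_vector(grid_rows):
    p = []
    for row in grid_rows:                    # six rows
        for square in row:                   # five squares per row
            p.append(1 if square == '1' else -1)
    return p                                 # 30 entries, read as a column vector

# The digit zero from figure 1:
zero = ["01110", "10001", "10001", "10001", "10001", "01110"]
print(grid_to_input_vector(zero))            # [-1, 1, 1, 1, -1, 1, -1, -1, -1, 1, ...]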
The output vector

The output vector is the number represented in grid form, transformed into a binary representation of four binary digits. For example, number zero (0) would be 0000, number one would be 0001, number two would be 0010, number three would be 0011, number four would be 0100, number five would be 0101, number six would be 0110, number seven would be 0111, number eight would be 1000 and number nine would be 1001. Generally speaking, the target number is 1·a + 2·b + 4·c + 8·d, as indicated in figure 2, where a, b, c and d are binary digits (either zero or one).
Figure 2. Target number in binary form.

This target number constitutes the output layer.
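The binary encoding of the target can be sketched as follows (a hypothetical helper; the bit names a, b, c and d follow figure 2, so that the digit equals 1·a + 2·b + 4·c + 8·d).

# Minimal sketch: encode a digit (0 to 9) as the four binary values a, b, c, d of figure 2
# and decode them back into the digit.
def digit_to_target(digit):
    a = (digit >> 0) & 1
    b = (digit >> 1) & 1
    c = (digit >> 2) & 1
    d = (digit >> 3) & 1
    return [a, b, c, d]

def target_to_digit(target):
    a, b, c, d = target
    return 1 * a + 2 * b + 4 * c + 8 * d

print(digit_to_target(7))                    # [1, 1, 1, 0], that is, 0111 written as dcba
print(target_to_digit([1, 1, 1, 0]))         # 7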
The weight matrix

The weight matrix connects the input vector with the output vector. Figure 3 shows the network structure in which all the weights (wij) are shown.

Figure 3. Network structure.

We can see that we have a weight matrix (W) of 4 rows and 30 columns. The input vector is a column vector of 30 inputs (thus, the multiplication Wp yields a four-row column vector). The bias vector corresponds to noise in the network and is equivalent to having an additional input with a value of p equal to one. The transfer function (f) transforms the operation Wp + b into binary form: if the result is greater than or equal to zero, the transfer function returns a one; if not, it returns a zero. Notice from figure 3 that a1 corresponds to a, a2 corresponds to b, a3 corresponds to c and a4 corresponds to d in figure 2. The network equation is shown in equation (1).

a = f(Wp + b)    (1)
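Equation (1) can be sketched in Python with NumPy as follows, assuming the hard-limit transfer function described above (the function and variable names are illustrative, not taken from the authors' Delphi code).

import numpy as np

def forward(W, p, b):
    # Equation (1): a = f(Wp + b), where f returns 1 when the net input is >= 0 and 0 otherwise.
    # W is 4x30, p has 30 entries of 1/-1, b has 4 entries.
    net = W @ p + b
    return (net >= 0).astype(int)

def decode(a):
    # Map the binary output a = [a1, a2, a3, a4] back to the digit of figure 2.
    return int(1 * a[0] + 2 * a[1] + 4 * a[2] + 8 * a[3])

# Example with a random (untrained) W and b:
rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(4, 30))
b = rng.uniform(-1, 1, size=4)
p = np.array([1 if ch == '1' else -1 for ch in "011101000110001100011000101110"])  # digit zero
print(decode(forward(W, p, b)))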
The input file

To train the network it is required to have a data set. In this research, a data set composed of n input vectors p is generated by using a random noise factor. For each data point (a 1 for a black square and a -1 for a white square), a random number r (where 0 ≤ r < 1) is drawn, and if such random number is less than or equal to a given value R, then the data point is flipped. In the case of the input file, for convenience, a zero rather than a -1 is used for empty squares. Figure 4 shows a sample text file (*.snu) in which the target number is zero, the noise percentage is five and the number of different data entries is five. Notice that the first row in the sample file is the number zero itself in grid form, the second row is the target number corresponding to such grid number, the third row is the number of different entries (5 in this case) and the fourth row is the noise percentage (5 per cent in this case). The fifth to ninth rows are the data set. Let n be the number of different data entries.

011101000110001100011000101110
0
5
5
011101000110000100011000100110
110101000110001101011000101010
111111000110001100011000101110
011101000100001100011000101110
011101000100001100101000101110

Figure 4. Sample file.

The numbers in grid form corresponding to this data set are shown in Figure 5.

Figure 5. Data set corresponding to the sample file.
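The generation of the noisy entries can be sketched as follows (hypothetical helper names; the actual *.snu files are produced by the Delphi program): every square of the grid is flipped whenever a uniform random number r, with 0 ≤ r < 1, is less than or equal to the threshold R.

import random

def noisy_sample(clean_bits, R):
    # Flip each square of the 30-character grid string with probability R.
    # In the input file a black square is '1' and an empty square is '0'.
    out = []
    for bit in clean_bits:
        r = random.random()                          # 0 <= r < 1
        if r <= R:
            out.append('0' if bit == '1' else '1')   # flip the square
        else:
            out.append(bit)
    return ''.join(out)

# Build a sample file like the one in figure 4: grid, target, n entries, noise percentage.
zero_grid = "011101000110001100011000101110"
target, n, noise_pct = 0, 5, 5
lines = [zero_grid, str(target), str(n), str(noise_pct)]
lines += [noisy_sample(zero_grid, noise_pct / 100.0) for _ in range(n)]
print("\n".join(lines))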
The output file

A population of 1024 individuals is generated (although the population size is one of the variables).
Each individual contains one occurrence of the weight matrix (W) and the bias vector (b). Such matrix and vector are initialized using small random values (between -1 and 1). Once all the calculations are completed, an output text file is generated containing, first, the number of the best individual; second, for such individual, the average over all data entries of the squared difference between the value given by the individual and the target number; third, the weight matrix obtained; and finally, the bias vector.
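The initialization of the population can be sketched as follows, assuming NumPy (an individual is simply the pair W, b; the names are illustrative).

import numpy as np

rng = np.random.default_rng()

def new_individual():
    # One individual: a 4x30 weight matrix W and a 4-entry bias vector b,
    # initialized with small random values between -1 and 1.
    W = rng.uniform(-1.0, 1.0, size=(4, 30))
    b = rng.uniform(-1.0, 1.0, size=4)
    return W, b

def new_population(m=1024):
    # Population of m individuals (1024 by default, although m is one of the variables).
    return [new_individual() for _ in range(m)]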
The solution

The procedure used in genetic algorithms is as follows: 1) coding, 2) mutation, 3) evaluation, 4) reproduction, 5) decoding. Steps two to four are iterated until at least a fraction g of the population coincides in having the same result. Also, let w be the mutation rate used in the algorithm. Notice that equation (2) must apply at all times during the calculations, because even if 100% of the individuals (a fraction of 1) have the same solution, a mutation of w×100% of the population leaves only 100% - w×100% (or 1 - w) of the individuals with the same solution; if such value is not greater than g×100%, the algorithm would never stop nor converge to a solution, which is expressed in equation (3). The symbol << means sufficiently smaller than.

w + g << 1    (2)

g << 1 - w    (3)

To solve the problem, the five steps must be followed. Coding is easy, since it only requires creating a population of m different W and b. This is done by assigning random values between -1 and 1 to the entries wij (for i = 1,…,4, and j = 1,…,30) of matrix W and bi (for all i = 1,…,4) of vector b. Step two simply requires deciding whether or not any given value wij must change. This is done by drawing a random number r; if such number is less than a given threshold value R, wij is changed by choosing another random number between -1 and 1.
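Step two (mutation) can be sketched as follows with NumPy. The text only mentions mutating the weights wij; mutating the bias entries in the same way is an assumption of this sketch.

import numpy as np

rng = np.random.default_rng()

def mutate(W, b, R):
    # Step two: for every entry, draw a random number r and, if r < R, replace the
    # entry with a new random value between -1 and 1.
    W, b = W.copy(), b.copy()
    mask_W = rng.random(W.shape) < R
    W[mask_W] = rng.uniform(-1.0, 1.0, size=int(mask_W.sum()))
    mask_b = rng.random(b.shape) < R              # assumption: biases mutate like weights
    b[mask_b] = rng.uniform(-1.0, 1.0, size=int(mask_b.sum()))
    return W, b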
Step three is carried out by performing the operation from equation (1), taking such value, subtracting the target value T and squaring the result, as shown in equation (4), which gives the error vector (e).

e = (a – T)²    (4)

For step four, there are two ways to proceed: the horizontal approach and the vertical approach. In the horizontal approach, a random breaking point between zero and four is obtained and two random individuals from the top x×100% percentile of the population are chosen for reproduction. All rows from the first row up to the breaking point are taken from the first individual, and all rows from the breaking point plus one up to row 4 are taken from the second individual. Clearly, if the number obtained is four, the first individual is reproduced as it is, whereas if the number obtained is zero, the second individual is reproduced as it is. In the vertical approach, the same is done, except that now the columns between 1 and 30 are split and the breaking point is a number between zero and thirty. The final step simply takes the best W and b, calculates and averages the error given the data sample, and generates an output file with the results.
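Step three and the two reproduction policies can be sketched as follows with NumPy. How the bias vector is recombined is not specified in the text, so the sketch splits it together with the rows in the horizontal policy and copies it from the first parent in the vertical policy (both are assumptions).

import numpy as np

rng = np.random.default_rng()

def error(a, T):
    # Equation (4): elementwise squared difference between the output a and the target T.
    return (a - T) ** 2

def crossover_horizontal(parent1, parent2):
    # Rows up to a random breaking point (0..4) come from the first parent,
    # the remaining rows from the second parent.
    (W1, b1), (W2, b2) = parent1, parent2
    k = int(rng.integers(0, 5))
    return np.vstack([W1[:k, :], W2[k:, :]]), np.concatenate([b1[:k], b2[k:]])

def crossover_vertical(parent1, parent2):
    # Columns up to a random breaking point (0..30) come from the first parent,
    # the remaining columns from the second parent.
    (W1, b1), (W2, b2) = parent1, parent2
    k = int(rng.integers(0, 31))
    return np.hstack([W1[:, :k], W2[:, k:]]), b1.copy()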
The computer interface

The problem is solved using Delphi. The computer interface for the sample file is shown in figure 6. Notice the space to draw the number in grid form and the space for choosing squares as a means of choosing a target number in binary form, where an X means a 1 and no X means a 0. The input file is shown below the target number in binary form. To the right, there is the number of entries (n) in the sample file, that is, the number of different input vectors p. Also, there is the noise percentage used to alter the original number in grid form in order to generate the n different input vectors. The first button is used to create the entry or sample file and the second button is used to load a sample file previously created. Below that is the mutation rate used during the genetic algorithm process. Also, there is the percentile used to choose the best individuals in the population. For example, 25 per cent means that only the best quarter of the individuals in the population will be reproduced. Finally, there is the generational percentage (g), which is the percentage of individuals with the same solution that has to be reached to finish the algorithm. The next button (Solve for Entire Population and Samples) triggers the algorithm and trains the network for all samples until g×100% of the individuals are the same. The last button (Evaluate and Find Best Weight Matrix) simply applies equation (1) to the solution, generates the output file (*.out) and tells the user the averaged error and the number of the individual in the population with such error.

Figure 6. Computer interface for the sample file.
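Putting the pieces together, the procedure triggered by the Solve for Entire Population and Samples button can be outlined roughly as below, reusing the helpers sketched in the previous sections (new_population, mutate, forward, error and crossover_horizontal). This is a simplified, hypothetical outline, not the authors' Delphi code; in particular, it treats the mutation rate w as the per-entry threshold R and detects «the same result» by comparing averaged errors.

import numpy as np

rng = np.random.default_rng()

def train(samples, targets, m=1024, w=0.05, x=0.25, g=0.70):
    population = new_population(m)                                   # step 1: coding
    while True:
        population = [mutate(W, b, w) for (W, b) in population]     # step 2: mutation
        # Step 3: evaluation, averaging the squared error over all sample/target pairs.
        scores = [np.mean([error(forward(W, p, b), T).sum()
                           for p, T in zip(samples, targets)])
                  for (W, b) in population]
        order = np.argsort(scores)
        best = scores[order[0]]
        # Stop once at least g x 100% of the population shares the best result.
        if int(np.isclose(scores, best).sum()) >= g * m:
            return population[order[0]], best                        # step 5: decoding
        # Step 4: reproduction from the top x x 100% percentile of the population.
        top = [population[i] for i in order[:max(2, int(x * m))]]
        population = [crossover_horizontal(top[rng.integers(len(top))],
                                           top[rng.integers(len(top))])
                      for _ in range(m)]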
Results

Modifying the noise percentage for the input file does not make much sense, since a high noise percentage results in images that are completely altered from the number being represented, thus becoming useless for training as a data set. Having a noise percentage of 5 per cent is reasonable, as can be seen in Figure 5. A noise percentage lower than that would result in numbers in grid form that are all pretty much the same, whereas a noise percentage higher than that would result in patterns that have no resemblance to the original number. Also, playing with the mutation rate (w) and the generational rate (g) is pointless, since the values used here are reasonable according to a study on genetic algorithms for the traveling salesman problem carried out by Copertari (2006). A value for g = 0.70 (70 per cent) and a value for w = 0.05 (5 per cent) are usually good and lead to very good solutions if not the optimal one. Also, taking the top twenty-fifth percentile for reproduction is a reasonable reproduction policy. The only parameters left to play with are the sample size and the population size. Sample sizes of 16, 256 and 65535 were used. Also, population sizes of 10, 50, 100, 500 and 1,000 were used. The number seven was used in grid and binary form.

Horizontal scanning

First, it is required to see the results for the reproduction policy of horizontal split. Table 1 shows the resulting averaged error for different sample sizes and different population sizes.

Table 1. Sample size versus population size table for horizontal split (averaged error).

Sample size   Population size:
                 10      50      100     500     1,000
16               6.69    5.75    3.13    2.69    1.00
256              5.84    5.82    5.59    3.95    3.24
65535            6.97    6.63    4.50

The data from Table 1 are plotted in Figure 7. The sample size is represented by the different lines, the population size is on the X axis and the resulting averaged error is on the Y axis.

Figure 7. Sample size versus population size chart for horizontal split.

Notice that the lower the sample size, the lower the averaged error. This is reasonable, since the fewer data values whose error needs to be accommodated and averaged, the lower the resulting average will be. Also notice that as the population size increases, the averaged error decreases, which means that larger population sizes are better when it comes to obtaining a better solution.

Vertical scanning

The reproduction policy of vertical split is different, but it seems to lead to similar results. Table 2 shows the sample size versus population size table for the vertical split strategy.

Table 2. Sample size versus population size table for vertical split (averaged error).

Sample size   Population size:
                 10      50      100     500     1,000
16               5.50    3.44    3.69    3.25    2.63
256              6.73    3.47    3.55    3.23    3.55
65535            5.00    6.81    5.95

These results are plotted in Figure 8.

Figure 8. Sample size versus population size chart for vertical split.

Notice that this time the trend is not as clear as before. However, generally speaking, larger population sizes still result in lower averaged errors. Also notice that, with some minor exceptions, smaller sample data files also result in lower averaged errors.
Discussion and conclusion

We can see from the experimental results obtained using our two Delphi programs (sAnnUGA_H and sAnnUGA_V for horizontal and vertical split, respectively) that larger population sizes tend to lead to better solutions; that is, the weight matrix and bias vector provide a number that is closer to the target number in binary form. Also, fitting larger data sets naturally leads to higher averaged error rates, due to the fact that more data has to be considered and such data contains more noise. However, the algorithm is limited by the fact that the weight matrix is only a matrix of four by thirty elements (plus the bias vector), which makes it difficult to learn a lot given the limited number of «synapses» available. An architecture with a hidden layer of several neurons would work much better, but training such a network using genetic algorithms would not be the proper solution to the problem, since it would take much longer to solve. The backpropagation algorithm would have to be used instead.
References

Anderson, J.A. 1972. «A simple neural network generating an interactive memory», Mathematical Biosciences, Vol. 14, 197-220.
Copertari, Luis. 2006. «Resolviendo el Problema del Vendedor Ambulante con Algoritmos Genéticos», Revista de Investigación Científica, Vol. 2, No. 3.
Friedberg, R.M. 1958. «A learning machine: Part I», IBM Journal, Vol. 2, 2-13.
Gauch Jr., Hugo G. 2003. Scientific Method in Practice, Cambridge, England: Cambridge University Press.
Grossberg, S. 1976. «Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors», Biological Cybernetics, Vol. 23, 121-134.
Hebb, D.O. 1949. The Organization of Behavior, New York, USA: Wiley.
Holland, J.H. 1975. Adaptation in Natural and Artificial Systems, Ann Arbor, MI: University of Michigan Press.
Hopfield, J.J. 1982. «Neural networks and physical systems with emergent collective computational abilities», Proceedings of the National Academy of Sciences, Vol. 79, 2554-2558.
Kohonen, T. 1972. «Correlation matrix memories», IEEE Transactions on Computers, Vol. 21, 353-359.
Koza, J.R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection, Cambridge, MA: MIT Press.
McCulloch, Warren & Walter Pitts. 1943. «A logical calculus of the ideas immanent in nervous activity», Bulletin of Mathematical Biophysics, Vol. 5, 115-133.
Minsky, M. & S. Papert. 1969. Perceptrons, Cambridge, MA: MIT Press.
Rosenblatt, F. 1958. «The perceptron: a probabilistic model for information storage and organization in the brain», Psychological Review, Vol. 65, 386-408.
Rumelhart, D.E. & J.L. McClelland (editors). 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Cambridge, MA: MIT Press.
Widrow, B. & M.E. Hoff. 1960. «Adaptive switching circuits», 1960 IRE WESCON Convention Record, New York: IRE, Part 4, 96-104.
Wilson, Edgar Bright. 1990. An Introduction to Scientific Research, New York, USA: Dover Publications, Inc.