Artificial Neural Networks
Corso di Apprendimento Automatico (Machine Learning Course)
Laurea Magistrale in Informatica
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
December 16, 2008
Outline
Multilayer networks
BACKPROPAGATION
Hidden layer representations
Example: Face Recognition
Advanced topics
Limitations of the Linear Models I
Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they cannot learn XOR
Linear models provide powerful gradient descent methods for reducing the error, even when the patterns are not linearly separable
Unfortunately, linear models are not general enough for applications in which linear discriminants are insufficient for minimum error
With a clever choice of nonlinear φ functions one can obtain arbitrary decision regions leading to minimum error:
  choosing a complete basis set (e.g. polynomials) would give a classifier with too many free parameters to be determined from a limited number of training patterns
  prior knowledge relevant to the classification problem can be exploited to guide the choice of nonlinearity
Limitations of the Linear Models II
Connectionist Models
Consider humans:
  Neuron switching time ~ 0.001 second
  Number of neurons ~ 10^10
  Connections per neuron ~ 10^4 to 10^5
  Scene recognition time ~ 0.1 second
  100 inference steps doesn't seem like enough → much parallel computation
Properties of artificial neural nets (ANNs):
  Many neuron-like threshold switching units
  Many weighted interconnections among units
  Highly parallel, distributed processing
  Emphasis on tuning weights automatically
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Examples:
  Speech phoneme recognition [Waibel]
  Image classification [Kanade, Baluja, Rowley]
  Financial prediction
Application
Non-linear decision surface: learning how to predict vowels in the context h.d
Input: numeric features, from spectral analysis of the sound
Multilayer ANN I
Can create a network of perceptron-like units to approximate arbitrary target concepts
The multilayer ANN (multilayer perceptron) is the classic example of an artificial neural network
Consists of: input layer, hidden layer(s), and output layer
Topological structure usually found by experimentation
Parameters can be found using BACKPROPAGATION
In analogy with neurobiology, weights or connections are sometimes called synapses and the values of the connections the synaptic weights
Multilayer ANN II
Multilayer ANN III
Multilayer Network Structure
Feed-forward Operation I
Input Layer: each input vector is presented to the input units, whose outputs equal the corresponding components
Hidden Layer: each hidden unit performs the weighted sum of its inputs to form its (scalar) net activation (inner product of the inputs with the weights at the hidden unit):

  net_j = \sum_{i=0}^{d} x_i w_{ji} = \vec{w}_j^t \vec{x}

Each hidden unit emits an output that is a nonlinear function (transfer function) of its activation: y_j = f(net_j)
Example: a simple threshold or sign function

  f(net) = sgn(net) = +1 if net ≥ 0, −1 if net < 0
Feed-forward Operation II
Output Layer: each output unit computes its net activation based on the hidden unit signals:

  net_k = \sum_{j=0}^{n_H} y_j w_{kj} = \vec{w}_k^t \vec{y}

Each output unit then computes the nonlinear function of its net, emitting z_k = f(net_k)
Typically there are c output units, and the classification is decided by the label corresponding to the maximum g_k(\vec{x}) = z_k
Feed-forward Operation III
General discriminant functions:

  g_k(\vec{x}) = z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)

Class of functions that can be implemented by a three-layer neural network
Broader generalizations:
  1. transfer functions at the output layer different from those at the hidden layer
  2. different functions at each individual unit
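A minimal sketch of this feed-forward operation in NumPy, assuming one hidden layer and a tanh transfer function standing in for f; the weight-matrix names and sizes are illustrative, not from the slides:

```python
import numpy as np

def forward(x, W_hidden, W_output, f=np.tanh):
    """Feed-forward pass of a three-layer network.
    W_hidden: (n_H, d+1) weights w_ji, including the bias w_j0.
    W_output: (c, n_H+1) weights w_kj, including the bias w_k0."""
    x_aug = np.concatenate(([1.0], x))      # x_0 = 1 feeds the bias weight w_j0
    net_j = W_hidden @ x_aug                # hidden net activations
    y = f(net_j)                            # hidden outputs y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))      # y_0 = 1 feeds the bias weight w_k0
    net_k = W_output @ y_aug                # output net activations
    z = f(net_k)                            # network outputs z_k = f(net_k)
    return z

# example: classify by the maximum discriminant g_k(x) = z_k
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(3, 3))    # n_H = 3 hidden units, d = 2 inputs
W_o = rng.normal(scale=0.1, size=(2, 4))    # c = 2 output units
z = forward(np.array([0.5, -1.0]), W_h, W_o)
print("predicted class:", int(np.argmax(z)))
```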
Expressive Capabilities of ANNs I
Boolean functions:
  Every boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs
Continuous functions:
  Kolmogorov: any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units, proper nonlinearities and weights
  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Expressive Capabilities of ANNs II
Sigmoid Unit I
How to learn weights given network structure?
Cannot simply use perceptron learning rule because we have hidden layer(s)
Function we are trying to minimize: error
Can use gradient descent
Need differentiable activation function: use sigmoid function instead of threshold function

  f(x) = \frac{1}{1 + \exp(-x)}

Need differentiable error function: can't use zero-one loss, but can use squared error

  E(x) = \frac{1}{2} (y - f(x))^2
Sigmoid Unit II
σ(x) is the sigmoid function:

  σ(x) = \frac{1}{1 + e^{-x}}
Sigmoid Unit III
Nice property:

  \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))
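A small snippet (not from the slides) that defines the sigmoid and checks the σ'(x) = σ(x)(1 − σ(x)) property numerically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # dσ/dx = σ(x)(1 − σ(x))

# numeric check of the derivative identity
x = np.linspace(-5, 5, 11)
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6
print(np.allclose(numeric, sigmoid_prime(x), atol=1e-6))   # True
```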
Multilayer Networks
We can derive gradient descent rules to train multilayer networks of (sigmoid) units → BACKPROPAGATION
Multiple outputs → new error expression:

  E[\vec{w}] = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2

(generalizing the single-output error \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2)
Criterion Function and Gradient Descent
Squared error:

  J[\vec{w}] = \frac{1}{2} \sum_{k=1}^{c} e_k^2 = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} (\vec{t} - \vec{z})^2

where \vec{t} and \vec{z} represent the target and the network output (length = c)
Gradient descent: weights initialized with random values, changed in a direction that will reduce the error:

  \Delta \vec{w} = -\eta \frac{\partial J}{\partial \vec{w}}, \quad \text{that is} \quad \Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}
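A sketch of gradient descent on the criterion J, using a finite-difference estimate of ∂J/∂w so it stays agnostic about the network; the toy linear output, step size and step count are illustrative assumptions:

```python
import numpy as np

def J(w, t, z_fn):
    """Squared-error criterion J[w] = 1/2 * sum_k (t_k - z_k)^2."""
    return 0.5 * np.sum((t - z_fn(w)) ** 2)

def gradient_descent_step(w, t, z_fn, eta=0.1, h=1e-5):
    """One update Δw = -η ∂J/∂w, with ∂J/∂w estimated by central differences."""
    grad = np.zeros_like(w)
    for q in range(w.size):
        e = np.zeros_like(w)
        e[q] = h
        grad[q] = (J(w + e, t, z_fn) - J(w - e, t, z_fn)) / (2 * h)
    return w - eta * grad

# toy example: a single linear output z = w · x
x, t = np.array([1.0, 2.0]), np.array([1.0])
z_fn = lambda w: np.array([w @ x])
w = np.zeros(2)
for _ in range(50):
    w = gradient_descent_step(w, t, z_fn)
print(w, J(w, t, z_fn))   # J should be close to 0
```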
Fitting the Weights I
Iterative update: \vec{w}(m+1) = \vec{w}(m) + \Delta\vec{w}(m), where m indexes the particular input example

  \Delta w_{qp}(m) [weight correction] = \eta [learning rate] \times \delta_q(m) [local gradient] \times x_p(m) [input signal]

Evaluate \Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}:
  for output units
  for hidden units
We can transform \frac{\partial J}{\partial w_{qp}} using the chain rule:

  \frac{\partial J}{\partial w_{qp}} = \frac{\partial J}{\partial e_q} \frac{\partial e_q}{\partial f(net_q)} \frac{\partial f(net_q)}{\partial net_q} \frac{\partial net_q}{\partial w_{qp}}
Fitting the Weights II

  \frac{\partial J}{\partial e_q} = \frac{\partial}{\partial e_q} \left( \frac{1}{2} \sum_{k=1}^{c} e_k^2 \right) = e_q

  \frac{\partial e_q}{\partial f(net_q)} = \frac{\partial (t_q - f(net_q))}{\partial f(net_q)} = -1

  \frac{\partial f(net_q)}{\partial net_q} = f'(net_q)

  \frac{\partial net_q}{\partial w_{qp}} = f(net_p) = x_p
Fitting the Weights III
Hence:

  \frac{\partial J}{\partial w_{qp}} = -e_q f'(net_q) x_p

Then the correction to be applied is defined by the delta rule:

  \Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}

If we consider the local gradient defined as

  \delta_q = -\frac{\partial J}{\partial net_q} = -\frac{\partial J}{\partial e_q} \frac{\partial e_q}{\partial f(net_q)} \frac{\partial f(net_q)}{\partial net_q} = e_q f'(net_q)

the delta rule becomes:

  \Delta w_{qp} = \eta \delta_q x_p
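The delta rule Δw_qp = η δ_q x_p written out for a single sigmoid unit (a sketch; names and constants are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_rule_update(w, x, t, eta=0.5):
    """One delta-rule step for a sigmoid unit: Δw_p = η δ x_p."""
    net = w @ x
    out = sigmoid(net)                 # f(net)
    e = t - out                        # error e = t − f(net)
    delta = e * out * (1.0 - out)      # δ = e f'(net), with f' = σ(1 − σ)
    return w + eta * delta * x

w = np.zeros(3)
x, t = np.array([1.0, 0.5, -0.3]), 1.0    # x[0] = 1 acts as the bias input
for _ in range(100):
    w = delta_rule_update(w, x, t)
print(sigmoid(w @ x))   # approaches the target 1.0
```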
Fitting the Weights IV
1. hidden-to-output weights: the error is not explicitly dependent upon w_{kj}, so use the chain rule for differentiation:

  \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}

First term: \frac{\partial J}{\partial net_k}, the local gradient (a.k.a. error or sensitivity) of unit k:

  \delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k)

Second term: \frac{\partial net_k}{\partial w_{kj}} = y_j

Summing up, the weight update is:

  \Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j
Fitting the Weights V
2. input-to-hidden weights: credit assignment problem

  \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}

First term:

  \frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right]
                                  = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j}
                                  = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial y_j}
                                  = -\sum_{k=1}^{c} (t_k - z_k) f'(net_k) w_{kj}

Second term: \frac{\partial y_j}{\partial net_j} = f'(net_j); let \delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k

Third term: \frac{\partial net_j}{\partial w_{ji}} = x_i

Summing up, the weight update is:

  \Delta w_{ji} = \eta \delta_j x_i = \eta x_i f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k
Learning Algorithm BACKPROPAGATION
Initialize weights w_{ji}, w_{kj}; criterion θ; learning rate η; m ← 0
do
  m ← m + 1
  Input the training example x(m) to the network and compute the outputs z_k
  for each output unit k: compute δ_k; w_{kj} ← w_{kj} + η δ_k y_j
  for each hidden unit j: compute δ_j; w_{ji} ← w_{ji} + η δ_j x_i
until ∇J < θ
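A compact sketch of the loop above for one hidden layer of sigmoid units with per-example (stochastic) updates; the gradient-norm test stands in for ∇J < θ, and all sizes and constants are illustrative:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_train(X, T, n_hidden=3, eta=0.5, theta=1e-3, max_epochs=20000, seed=0):
    rng = np.random.default_rng(seed)
    d, c = X.shape[1], T.shape[1]
    W_ji = rng.uniform(-0.05, 0.05, size=(n_hidden, d + 1))   # input-to-hidden (with bias)
    W_kj = rng.uniform(-0.05, 0.05, size=(c, n_hidden + 1))   # hidden-to-output (with bias)
    for epoch in range(max_epochs):
        grad_norm = 0.0
        for x, t in zip(X, T):
            x_aug = np.concatenate(([1.0], x))
            y = sigmoid(W_ji @ x_aug)                         # hidden outputs
            y_aug = np.concatenate(([1.0], y))
            z = sigmoid(W_kj @ y_aug)                         # network outputs
            delta_k = (t - z) * z * (1 - z)                   # output local gradients
            delta_j = y * (1 - y) * (W_kj[:, 1:].T @ delta_k) # hidden local gradients
            W_kj += eta * np.outer(delta_k, y_aug)            # Δw_kj = η δ_k y_j
            W_ji += eta * np.outer(delta_j, x_aug)            # Δw_ji = η δ_j x_i
            grad_norm += np.sum(delta_k**2) + np.sum(delta_j**2)
        if grad_norm < theta:                                 # crude stand-in for ∇J < θ
            break
    return W_ji, W_kj

# XOR: a target a single linear unit cannot learn
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)
W_ji, W_kj = backprop_train(X, T)
for x in X:
    y = sigmoid(W_ji @ np.concatenate(([1.0], x)))
    z = sigmoid(W_kj @ np.concatenate(([1.0], y)))
    print(x, np.round(z, 2))
```

Run on XOR, the sketch illustrates the earlier point: the linear model fails on this target, while the two-layer network can learn it.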
Stopping Condition in BACKPROPAGATION
Error over training examples falling below some threshold
Error over separate validation set meeting some criterion
...
Warning:
  too few iterations → fail to reduce error
  too many iterations → overfitting
More on BACKPROPAGATION
Gradient descent over entire network weight vector
Often include weight momentum α:

  \Delta w_{ij}(n) = \eta \delta_j x_{ij} + \alpha \Delta w_{ij}(n-1)

Easily generalized to arbitrary directed graphs:

  \delta_r = o_r (1 - o_r) \sum_{s \in downstream(r)} w_{sr} \delta_s

Will find a local, not necessarily global, error minimum
  In practice, often works well (can run multiple times)
Minimizes error over training examples
  Will it generalize well to subsequent examples?
Training can take thousands of iterations → slow!
  Using the network after training is very fast
Learning Hidden Layer Representations I
Given an ANN:
Learning Hidden Layer Representations II
A target function (identity):

  Input      →  Output
  10000000   →  10000000
  01000000   →  01000000
  00100000   →  00100000
  00010000   →  00010000
  00001000   →  00001000
  00000100   →  00000100
  00000010   →  00000010
  00000001   →  00000001

Can this be learned??
Learning Hidden Layer Representations III
Learned hidden layer representation (after 5000 epochs):

  Input      →  Hidden Values   →  Output
  10000000   →  .89 .04 .08     →  10000000
  01000000   →  .01 .11 .88     →  01000000
  00100000   →  .01 .97 .27     →  00100000
  00010000   →  .99 .97 .71     →  00010000
  00001000   →  .03 .05 .02     →  00001000
  00000100   →  .22 .99 .99     →  00000100
  00000010   →  .80 .01 .98     →  00000010
  00000001   →  .60 .94 .01     →  00000001

Rounding the hidden values to 0 or 1 gives an encoding of the eight distinct values
→ The network can learn/invent new features!
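The 8-3-8 identity experiment above can be reproduced with a few lines of the same backpropagation scheme; this is a sketch, and the learning rate and initialization are illustrative (the slides only state 5000 epochs):

```python
import numpy as np

sigmoid = lambda net: 1.0 / (1.0 + np.exp(-net))

X = np.eye(8)                                # eight one-hot inputs; target = input (identity)
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.1, 0.1, size=(3, 9))     # 8 inputs + bias -> 3 hidden units
W2 = rng.uniform(-0.1, 0.1, size=(8, 4))     # 3 hidden + bias -> 8 outputs

for epoch in range(5000):
    for x in X:
        xa = np.concatenate(([1.0], x))
        y = sigmoid(W1 @ xa)
        ya = np.concatenate(([1.0], y))
        z = sigmoid(W2 @ ya)
        dk = (x - z) * z * (1 - z)           # output deltas (target = x)
        dj = y * (1 - y) * (W2[:, 1:].T @ dk)
        W2 += 0.3 * np.outer(dk, ya)
        W1 += 0.3 * np.outer(dj, xa)

# inspect the learned hidden code: roughly one distinct 3-value pattern per input
for x in X:
    h = sigmoid(W1 @ np.concatenate(([1.0], x)))
    print(x.astype(int), np.round(h, 2))
```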
Training I
one line per network output
Training II
evolution of the hidden layer representation for output 01000000
Training III
evolution of the weights for one of the three hidden units
Convergence of BACKPROPAGATION
Gradient descent to some local minimum
  Perhaps not global minimum...
  Add momentum
  Stochastic gradient descent
  Train multiple nets with different initial weights
Nature of convergence
  Initialize weights near zero
  Therefore, initial networks near-linear
  Increasingly non-linear functions possible as training progresses
Remarks I
Can update weights after all training instances have been processed or incrementally: batch learning vs. stochastic backpropagation
Weights are initialized to small random values
How to avoid overfitting?
  Early stopping: use validation set to check when to stop (see the sketch below)
  Weight decay: add penalty term to error function
How to speed up learning?
  Momentum: re-use proportion of old weight change
  Use optimization method that employs 2nd derivative
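A sketch of early stopping; train_one_epoch and error are hypothetical helpers standing in for whatever training and evaluation routines are used:

```python
import copy

def train_with_early_stopping(net, train_set, val_set, train_one_epoch, error,
                              patience=20, max_epochs=10000):
    """Stop when the validation error has not improved for `patience` epochs
    and return the weights that were best on the validation set."""
    best_err, best_net, since_best = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)      # hypothetical: one pass of backpropagation
        err = error(net, val_set)            # hypothetical: error on the validation set
        if err < best_err:
            best_err, best_net, since_best = err, copy.deepcopy(net), 0
        else:
            since_best += 1
            if since_best >= patience:       # validation error no longer improving
                break
    return best_net, best_err
```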
Remarks II
Remarks III
Momentum:

  \vec{w}(m+1) \leftarrow \vec{w}(m) + \underbrace{\Delta\vec{w}(m)}_{\text{gradient descent}} + \underbrace{\alpha \Delta\vec{w}(m-1)}_{\text{momentum}}
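The momentum update in code, following the recursion Δw(n) = −η ∂J/∂w + α Δw(n−1) from the earlier slide; the values of η and α are common illustrative choices, not prescribed here:

```python
import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.3, alpha=0.9):
    """Δw(n) = -η ∂J/∂w + α Δw(n-1);  w ← w + Δw(n)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

w = np.zeros(3)
prev = np.zeros(3)
grad = np.array([0.2, -0.1, 0.05])        # illustrative gradient ∂J/∂w
for _ in range(5):
    w, prev = momentum_step(w, grad, prev)
print(w)                                   # momentum accelerates movement along a steady gradient
```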
Overfitting in ANNs I
better to stop after 9100 iterations
Overfitting in ANNs II
when to stop? Not always obvious: the error decreases, then increases, then decreases again ...
Neural Nets for Face Recognition I
Neural Nets for Face Recognition II
Typical input images
90% accuracy in learning head pose and in recognizing 1-of-20 faces
Learned Hidden Unit Weights
http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Weight decay: penalize large weights

  E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

  biases learning against complex decision surfaces

Train on target slopes as well as values:

  E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_{dj}} - \frac{\partial o_{kd}}{\partial x_{dj}} \right)^2 \right]

Tie together weights: e.g., in phoneme recognition network
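In code, the weight-decay penalty γ Σ w² just adds 2γw to the gradient of the error, so every step also shrinks the weights toward zero (a sketch; the values of γ and η are illustrative):

```python
import numpy as np

def weight_decay_update(W, grad_E, eta=0.3, gamma=1e-4):
    """Gradient step on E(w) + γ Σ w_ji²: the penalty contributes 2γW to the gradient."""
    return W - eta * (grad_E + 2.0 * gamma * W)

W = np.array([[1.0, -2.0], [0.5, 0.0]])
grad_E = np.zeros_like(W)                 # even with a zero error gradient...
print(weight_decay_update(W, grad_E))     # ...the weights are pulled slightly toward zero
```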
Recurrent Networks I
from acyclic graphs to...
Recurrent Network: an output (at time t) can be fed back as input to nodes at previous layers (at time t + 1)
  applies to time series
Learning algorithm: unfold in time + BACKPROPAGATION [Mozer, 1995]
Recurrent Networks II
Dynamically Modifying Network Structure
CASCADE-CORRELATION [Fahlman & Lebiere, 1990]
  start from an ANN without hidden layer nodes
  if residual error remains, add hidden layer nodes, maximizing the correlation between the new hidden unit's output and the error
  until a termination condition based on the error is met

"Optimal brain damage" [LeCun, 1990]: opposite strategy
  start with a complex ANN
  prune connections if they are unessential, e.g. weight close to 0
  study the effect of variations of each weight on the error
Radial Basis Function Networks
Radial Basis Function Networks (RBF Networks): another type of feedforward network with 3 layers
Hidden units represent points in instance space; activation depends on distance
  To this end, distance is converted into similarity: Gaussian activation function f
  Width may be different for each hidden unit
Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane
Output layer same as in multilayer feedforward networks
Learning Radial Basis Function Networks
Parameters: centers and widths of the RBFs + weights in the output layer
Can learn the two sets of parameters independently and still get accurate models
  E.g., clusters from k-means can be used to form the basis functions
  A linear model can then be fit on top of the fixed RBFs (see the sketch below)
  This makes learning RBFs very efficient
Disadvantage: no built-in attribute weighting based on relevance
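A sketch of the two-stage RBF training described here: fix the centres (random training points stand in for k-means below), fix the Gaussian widths, then fit the output layer as a linear model by least squares; every constant is an illustrative choice:

```python
import numpy as np

def train_rbf(X, T, k=10, width=1.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # stand-in for k-means centres
    def phi(Xq):
        d2 = ((Xq[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * width ** 2))                   # Gaussian activations
    H = np.hstack([np.ones((len(X), 1)), phi(X)])               # add a bias column
    W, *_ = np.linalg.lstsq(H, T, rcond=None)                   # linear output layer
    predict = lambda Xq: np.hstack([np.ones((len(Xq), 1)), phi(Xq)]) @ W
    return predict

# toy usage: fit a 1-D function
X = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
T = np.sin(X)
predict = train_rbf(X, T, k=8, width=0.7)
print(np.abs(predict(X) - T).max())    # residual error on the training points
```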
Credits
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann