Artificial Neural Networks
Corso di Apprendimento Automatico (Machine Learning Course)
Laurea Magistrale in Informatica
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
December 16, 2008
Outline
Multilayer networks
BACKPROPAGATION
Hidden layer representations
Example: Face Recognition
Advanced topics
Limitations of the Linear Models I
Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they cannot learn XOR
Linear models provide powerful gradient descent methods for reducing the error, even when the patterns are not linearly separable
Unfortunately, linear models are not general enough for applications in which linear discriminants are insufficient for minimum error
With a clever choice of nonlinear φ functions one can obtain arbitrary decision regions leading to minimum error:
  choosing a complete basis set (e.g. polynomials) would give a classifier with too many free parameters to be determined from a limited number of training patterns
  prior knowledge relevant to the classification problem can be exploited to guide the choice of nonlinearity
Limitations of the Linear Models II
Connectionist Models
Consider humans:
  Neuron switching time ~ 0.001 second
  Number of neurons ~ 10^10
  Connections per neuron ~ 10^4 to 10^5
  Scene recognition time ~ 0.1 second
  100 inference steps doesn't seem like enough → much parallel computation
Properties of artificial neural nets (ANNs):
  Many neuron-like threshold switching units
  Many weighted interconnections among units
  Highly parallel, distributed processing
  Emphasis on tuning weights automatically
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Examples:
  Speech phoneme recognition [Waibel]
  Image classification [Kanade, Baluja, Rowley]
  Financial prediction
Application
Non-linear decision surface: learning how to predict vowels in the context h.d
Input: numeric features, from spectral analysis of the sound
Multilayer ANN I
Can create a network of perceptron-like units to approximate arbitrary target concepts
The multilayer ANN (multilayer perceptron) is the classic example of an artificial neural network
Consists of: input layer, hidden layer(s), and output layer
Topological structure usually found by experimentation
Parameters can be found using BACKPROPAGATION
In analogy with neurobiology, weights or connections are sometimes called synapses and the values of the connections the synaptic weights
Multilayer ANN II
Multilayer ANN III
Multilayer Network Structure
Feed-forward Operation I
Input Layer: each input vector is presented to the input units, whose outputs equal the corresponding components
Hidden Layer: each hidden unit performs the weighted sum of its inputs to form its (scalar) net activation (inner product of the inputs with the weights at the hidden unit):

  net_j = \sum_{i=0}^{d} x_i w_{ji} = \vec{w}_j^t \vec{x}

Each hidden unit emits an output that is a nonlinear function (transfer function) of its activation: y_j = f(net_j)
Example: a simple threshold or sign function

  f(net) = sgn(net) = +1 if net ≥ 0, −1 if net < 0
Feed-forward Operation II
Output Layer: each output unit computes its net activation based on the hidden unit signals:

  net_k = \sum_{j=0}^{n_H} y_j w_{kj} = \vec{w}_k^t \vec{y}

Each output unit then computes the nonlinear function of its net, emitting z_k = f(net_k)
Typically there are c output units, and the classification is decided by the label corresponding to the maximum g_k(\vec{x}) = z_k
Feed-forward Operation III
General discriminant functions:

  g_k(\vec{x}) = z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)

Class of functions that can be implemented by a three-layer neural network
Broader generalizations:
  1. transfer functions at the output layer different from those at the hidden layer
  2. different functions at each individual unit
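A minimal sketch of this feed-forward operation in NumPy, assuming one hidden layer and a tanh transfer function standing in for f; the weight-matrix names and sizes are illustrative, not from the slides:

```python
import numpy as np

def forward(x, W_hidden, W_output, f=np.tanh):
    """Feed-forward pass of a three-layer network.
    W_hidden: (n_H, d+1) weights w_ji, including the bias w_j0.
    W_output: (c, n_H+1) weights w_kj, including the bias w_k0."""
    x_aug = np.concatenate(([1.0], x))      # x_0 = 1 feeds the bias weight w_j0
    net_j = W_hidden @ x_aug                # hidden net activations
    y = f(net_j)                            # hidden outputs y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))      # y_0 = 1 feeds the bias weight w_k0
    net_k = W_output @ y_aug                # output net activations
    z = f(net_k)                            # network outputs z_k = f(net_k)
    return z

# example: classify by the maximum discriminant g_k(x) = z_k
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(3, 3))    # n_H = 3 hidden units, d = 2 inputs
W_o = rng.normal(scale=0.1, size=(2, 4))    # c = 2 output units
z = forward(np.array([0.5, -1.0]), W_h, W_o)
print("predicted class:", int(np.argmax(z)))
```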
Expressive Capabilities of ANNs I
Boolean functions:
  Every boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs
Continuous functions:
  Kolmogorov: any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units, proper nonlinearities and weights
  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Expressive Capabilities of ANNs II
Sigmoid Unit I
How to learn weights given network structure?
Cannot simply use perceptron learning rule because we have hidden layer(s)
Function we are trying to minimize: error
Can use gradient descent
Need differentiable activation function: use sigmoid function instead of threshold function

  f(x) = \frac{1}{1 + \exp(-x)}

Need differentiable error function: can't use zero-one loss, but can use squared error

  E(x) = \frac{1}{2} (y - f(x))^2
Sigmoid Unit II
σ(x) is the sigmoid function:

  σ(x) = \frac{1}{1 + e^{-x}}
Sigmoid Unit III
Nice property:

  \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))
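A small snippet (not from the slides) that defines the sigmoid and checks the σ'(x) = σ(x)(1 − σ(x)) property numerically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # dσ/dx = σ(x)(1 − σ(x))

# numeric check of the derivative identity
x = np.linspace(-5, 5, 11)
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6
print(np.allclose(numeric, sigmoid_prime(x), atol=1e-6))   # True
```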
Multilayer Networks
We can derive gradient descent rules to train multilayer networks of (sigmoid) units → BACKPROPAGATION
Multiple outputs → new error expression:

  E[\vec{w}] = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2

(generalizing the single-output error \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2)
Criterion Function and Gradient Descent
Squared error:

  J[\vec{w}] = \frac{1}{2} \sum_{k=1}^{c} e_k^2 = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} (\vec{t} - \vec{z})^2

where \vec{t} and \vec{z} represent the target and the network output (length = c)
Gradient descent: weights initialized with random values, changed in a direction that will reduce the error:

  \Delta \vec{w} = -\eta \frac{\partial J}{\partial \vec{w}}, \quad \text{that is} \quad \Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}
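A sketch of gradient descent on the criterion J, using a finite-difference estimate of ∂J/∂w so it stays agnostic about the network; the toy linear output, step size and step count are illustrative assumptions:

```python
import numpy as np

def J(w, t, z_fn):
    """Squared-error criterion J[w] = 1/2 * sum_k (t_k - z_k)^2."""
    return 0.5 * np.sum((t - z_fn(w)) ** 2)

def gradient_descent_step(w, t, z_fn, eta=0.1, h=1e-5):
    """One update Δw = -η ∂J/∂w, with ∂J/∂w estimated by central differences."""
    grad = np.zeros_like(w)
    for q in range(w.size):
        e = np.zeros_like(w)
        e[q] = h
        grad[q] = (J(w + e, t, z_fn) - J(w - e, t, z_fn)) / (2 * h)
    return w - eta * grad

# toy example: a single linear output z = w · x
x, t = np.array([1.0, 2.0]), np.array([1.0])
z_fn = lambda w: np.array([w @ x])
w = np.zeros(2)
for _ in range(50):
    w = gradient_descent_step(w, t, z_fn)
print(w, J(w, t, z_fn))   # J should be close to 0
```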
Fitting the Weights I
Iterative update: \vec{w}(m+1) = \vec{w}(m) + \Delta\vec{w}(m), where m indexes the particular input example

  \Delta w_{qp}(m) [weight correction] = \eta [learning rate] \times \delta_q(m) [local gradient] \times x_p(m) [input signal]

Evaluate \Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}:
  for output units
  for hidden units
We can transform \frac{\partial J}{\partial w_{qp}} using the chain rule:

  \frac{\partial J}{\partial w_{qp}} = \frac{\partial J}{\partial e_q} \frac{\partial e_q}{\partial f(net_q)} \frac{\partial f(net_q)}{\partial net_q} \frac{\partial net_q}{\partial w_{qp}}
Fitting the Weights II

  \frac{\partial J}{\partial e_q} = \frac{\partial}{\partial e_q} \left( \frac{1}{2} \sum_{k=1}^{c} e_k^2 \right) = e_q

  \frac{\partial e_q}{\partial f(net_q)} = \frac{\partial (t_q - f(net_q))}{\partial f(net_q)} = -1

  \frac{\partial f(net_q)}{\partial net_q} = f'(net_q)

  \frac{\partial net_q}{\partial w_{qp}} = f(net_p) = x_p
Fitting the Weights III
Hence:

  \frac{\partial J}{\partial w_{qp}} = -e_q f'(net_q) x_p

Then the correction to be applied is defined by the delta rule:

  \Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}

If we consider the local gradient defined as

  \delta_q = -\frac{\partial J}{\partial net_q} = -\frac{\partial J}{\partial e_q} \frac{\partial e_q}{\partial f(net_q)} \frac{\partial f(net_q)}{\partial net_q} = e_q f'(net_q)

the delta rule becomes:

  \Delta w_{qp} = \eta \delta_q x_p
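The delta rule Δw_qp = η δ_q x_p written out for a single sigmoid unit (a sketch; names and constants are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_rule_update(w, x, t, eta=0.5):
    """One delta-rule step for a sigmoid unit: Δw_p = η δ x_p."""
    net = w @ x
    out = sigmoid(net)                 # f(net)
    e = t - out                        # error e = t − f(net)
    delta = e * out * (1.0 - out)      # δ = e f'(net), with f' = σ(1 − σ)
    return w + eta * delta * x

w = np.zeros(3)
x, t = np.array([1.0, 0.5, -0.3]), 1.0    # x[0] = 1 acts as the bias input
for _ in range(100):
    w = delta_rule_update(w, x, t)
print(sigmoid(w @ x))   # approaches the target 1.0
```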
Fitting the Weights IV
1. hidden-to-output weights: the error is not explicitly dependent upon w_{kj}, so use the chain rule for differentiation:

  \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}

First term: \frac{\partial J}{\partial net_k}, the local gradient (a.k.a. error or sensitivity) of unit k:

  \delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k)

Second term: \frac{\partial net_k}{\partial w_{kj}} = y_j

Summing up, the weight update is:

  \Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j
Fitting the Weights V
2. input-to-hidden weights: credit assignment problem

  \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}

First term:

  \frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right]
                                  = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j}
                                  = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial y_j}
                                  = -\sum_{k=1}^{c} (t_k - z_k) f'(net_k) w_{kj}

Second term: \frac{\partial y_j}{\partial net_j} = f'(net_j); let \delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k

Third term: \frac{\partial net_j}{\partial w_{ji}} = x_i

Summing up, the weight update is:

  \Delta w_{ji} = \eta \delta_j x_i = \eta x_i f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k
Learning Algorithm BACKPROPAGATION
Initialize weights w_{ji}, w_{kj}; criterion θ; learning rate η; m ← 0
do
  m ← m + 1
  Input the training example x(m) to the network and compute the outputs z_k
  for each output unit k: compute δ_k; w_{kj} ← w_{kj} + η δ_k y_j
  for each hidden unit j: compute δ_j; w_{ji} ← w_{ji} + η δ_j x_i
until ∇J < θ
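A compact sketch of the loop above for one hidden layer of sigmoid units with per-example (stochastic) updates; the gradient-norm test stands in for ∇J < θ, and all sizes and constants are illustrative:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_train(X, T, n_hidden=3, eta=0.5, theta=1e-3, max_epochs=20000, seed=0):
    rng = np.random.default_rng(seed)
    d, c = X.shape[1], T.shape[1]
    W_ji = rng.uniform(-0.05, 0.05, size=(n_hidden, d + 1))   # input-to-hidden (with bias)
    W_kj = rng.uniform(-0.05, 0.05, size=(c, n_hidden + 1))   # hidden-to-output (with bias)
    for epoch in range(max_epochs):
        grad_norm = 0.0
        for x, t in zip(X, T):
            x_aug = np.concatenate(([1.0], x))
            y = sigmoid(W_ji @ x_aug)                         # hidden outputs
            y_aug = np.concatenate(([1.0], y))
            z = sigmoid(W_kj @ y_aug)                         # network outputs
            delta_k = (t - z) * z * (1 - z)                   # output local gradients
            delta_j = y * (1 - y) * (W_kj[:, 1:].T @ delta_k) # hidden local gradients
            W_kj += eta * np.outer(delta_k, y_aug)            # Δw_kj = η δ_k y_j
            W_ji += eta * np.outer(delta_j, x_aug)            # Δw_ji = η δ_j x_i
            grad_norm += np.sum(delta_k**2) + np.sum(delta_j**2)
        if grad_norm < theta:                                 # crude stand-in for ∇J < θ
            break
    return W_ji, W_kj

# XOR: a target a single linear unit cannot learn
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)
W_ji, W_kj = backprop_train(X, T)
for x in X:
    y = sigmoid(W_ji @ np.concatenate(([1.0], x)))
    z = sigmoid(W_kj @ np.concatenate(([1.0], y)))
    print(x, np.round(z, 2))
```

Run on XOR, the sketch illustrates the earlier point: the linear model fails on this target, while the two-layer network can learn it.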
Stopping Condition in BACKPROPAGATION
Error over training examples falling below some threshold
Error over separate validation set meeting some criterion
...
Warning:
  too few iterations → fail to reduce error
  too many iterations → overfitting
More on BACKPROPAGATION
Gradient descent over entire network weight vector
Often include weight momentum α:

  \Delta w_{ij}(n) = \eta \delta_j x_{ij} + \alpha \Delta w_{ij}(n-1)

Easily generalized to arbitrary directed graphs:

  \delta_r = o_r (1 - o_r) \sum_{s \in downstream(r)} w_{sr} \delta_s

Will find a local, not necessarily global, error minimum
  In practice, often works well (can run multiple times)
Minimizes error over training examples
  Will it generalize well to subsequent examples?
Training can take thousands of iterations → slow!
  Using the network after training is very fast
Learning Hidden Layer Representations I
Given an ANN:
Learning Hidden Layer Representations II
A target function (identity):

  Input      →  Output
  10000000   →  10000000
  01000000   →  01000000
  00100000   →  00100000
  00010000   →  00010000
  00001000   →  00001000
  00000100   →  00000100
  00000010   →  00000010
  00000001   →  00000001

Can this be learned??
Learning Hidden Layer Representations III
Learned hidden layer representation (after 5000 epochs):

  Input      →  Hidden Values   →  Output
  10000000   →  .89 .04 .08     →  10000000
  01000000   →  .01 .11 .88     →  01000000
  00100000   →  .01 .97 .27     →  00100000
  00010000   →  .99 .97 .71     →  00010000
  00001000   →  .03 .05 .02     →  00001000
  00000100   →  .22 .99 .99     →  00000100
  00000010   →  .80 .01 .98     →  00000010
  00000001   →  .60 .94 .01     →  00000001

Rounding the hidden values to 0 or 1 gives an encoding of the eight distinct values
→ The network can learn/invent new features!
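The 8-3-8 identity experiment above can be reproduced with a few lines of the same backpropagation scheme; this is a sketch, and the learning rate and initialization are illustrative (the slides only state 5000 epochs):

```python
import numpy as np

sigmoid = lambda net: 1.0 / (1.0 + np.exp(-net))

X = np.eye(8)                                # eight one-hot inputs; target = input (identity)
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.1, 0.1, size=(3, 9))     # 8 inputs + bias -> 3 hidden units
W2 = rng.uniform(-0.1, 0.1, size=(8, 4))     # 3 hidden + bias -> 8 outputs

for epoch in range(5000):
    for x in X:
        xa = np.concatenate(([1.0], x))
        y = sigmoid(W1 @ xa)
        ya = np.concatenate(([1.0], y))
        z = sigmoid(W2 @ ya)
        dk = (x - z) * z * (1 - z)           # output deltas (target = x)
        dj = y * (1 - y) * (W2[:, 1:].T @ dk)
        W2 += 0.3 * np.outer(dk, ya)
        W1 += 0.3 * np.outer(dj, xa)

# inspect the learned hidden code: roughly one distinct 3-value pattern per input
for x in X:
    h = sigmoid(W1 @ np.concatenate(([1.0], x)))
    print(x.astype(int), np.round(h, 2))
```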
Training I
one line per network output
Training II
evolution of the hidden layer representation for output 01000000
Training III
evolution of the weights for one of the three hidden units
Convergence of BACKPROPAGATION
Gradient descent to some local minimum
  Perhaps not global minimum...
  Add momentum
  Stochastic gradient descent
  Train multiple nets with different initial weights
Nature of convergence
  Initialize weights near zero
  Therefore, initial networks near-linear
  Increasingly non-linear functions possible as training progresses
Remarks I
Can update weights after all training instances have been processed or incrementally: batch learning vs. stochastic backpropagation
Weights are initialized to small random values
How to avoid overfitting?
  Early stopping: use validation set to check when to stop (see the sketch below)
  Weight decay: add penalty term to error function
How to speed up learning?
  Momentum: re-use proportion of old weight change
  Use optimization method that employs 2nd derivative
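A sketch of early stopping; train_one_epoch and error are hypothetical helpers standing in for whatever training and evaluation routines are used:

```python
import copy

def train_with_early_stopping(net, train_set, val_set, train_one_epoch, error,
                              patience=20, max_epochs=10000):
    """Stop when the validation error has not improved for `patience` epochs
    and return the weights that were best on the validation set."""
    best_err, best_net, since_best = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)      # hypothetical: one pass of backpropagation
        err = error(net, val_set)            # hypothetical: error on the validation set
        if err < best_err:
            best_err, best_net, since_best = err, copy.deepcopy(net), 0
        else:
            since_best += 1
            if since_best >= patience:       # validation error no longer improving
                break
    return best_net, best_err
```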
Remarks II
Remarks III
Momentum:

  \vec{w}(m+1) \leftarrow \vec{w}(m) + \underbrace{\Delta\vec{w}(m)}_{\text{gradient descent}} + \underbrace{\alpha \Delta\vec{w}(m-1)}_{\text{momentum}}
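The momentum update in code, following the recursion Δw(n) = −η ∂J/∂w + α Δw(n−1) from the earlier slide; the values of η and α are common illustrative choices, not prescribed here:

```python
import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.3, alpha=0.9):
    """Δw(n) = -η ∂J/∂w + α Δw(n-1);  w ← w + Δw(n)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

w = np.zeros(3)
prev = np.zeros(3)
grad = np.array([0.2, -0.1, 0.05])        # illustrative gradient ∂J/∂w
for _ in range(5):
    w, prev = momentum_step(w, grad, prev)
print(w)                                   # momentum accelerates movement along a steady gradient
```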
Overfitting in ANNs I
better to stop after 9100 iterations
Overfitting in ANNs II
when to stop? Not always obvious: the error decreases, then increases, then decreases again ...
Neural Nets for Face Recognition I
Neural Nets for Face Recognition II
Typical input images
90% accuracy in learning head pose and in recognizing 1-of-20 faces
Learned Hidden Unit Weights
http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Weight decay: penalize large weights

  E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

  biases learning against complex decision surfaces

Train on target slopes as well as values:

  E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_{dj}} - \frac{\partial o_{kd}}{\partial x_{dj}} \right)^2 \right]

Tie together weights: e.g., in phoneme recognition network
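In code, the weight-decay penalty γ Σ w² just adds 2γw to the gradient of the error, so every step also shrinks the weights toward zero (a sketch; the values of γ and η are illustrative):

```python
import numpy as np

def weight_decay_update(W, grad_E, eta=0.3, gamma=1e-4):
    """Gradient step on E(w) + γ Σ w_ji²: the penalty contributes 2γW to the gradient."""
    return W - eta * (grad_E + 2.0 * gamma * W)

W = np.array([[1.0, -2.0], [0.5, 0.0]])
grad_E = np.zeros_like(W)                 # even with a zero error gradient...
print(weight_decay_update(W, grad_E))     # ...the weights are pulled slightly toward zero
```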
Recurrent Networks I
from acyclic graphs to...
Recurrent Network: an output (at time t) can be fed back as input to nodes at previous layers (at time t + 1)
  applies to time series
Learning algorithm: unfold in time + BACKPROPAGATION [Mozer, 1995]
Recurrent Networks II
Dynamically Modifying Network Structure
CASCADE-CORRELATION [Fahlman & Lebiere, 1990]
  start from an ANN without hidden layer nodes
  if residual error remains, add hidden layer nodes, maximizing the correlation between the new hidden unit's output and the error
  until a termination condition based on the error is met

"Optimal brain damage" [LeCun, 1990]: opposite strategy
  start with a complex ANN
  prune connections if they are unessential, e.g. weight close to 0
  study the effect of variations of each weight on the error
Radial Basis Function Networks
Radial Basis Function Networks (RBF Networks): another type of feedforward network with 3 layers
Hidden units represent points in instance space; activation depends on distance
  To this end, distance is converted into similarity: Gaussian activation function f
  Width may be different for each hidden unit
Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane
Output layer same as in multilayer feedforward networks
Learning Radial Basis Function Networks
Parameters: centers and widths of the RBFs + weights in the output layer
Can learn the two sets of parameters independently and still get accurate models
  E.g., clusters from k-means can be used to form the basis functions
  A linear model can then be fit on top of the fixed RBFs (see the sketch below)
  This makes learning RBFs very efficient
Disadvantage: no built-in attribute weighting based on relevance
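A sketch of the two-stage RBF training described here: fix the centres (random training points stand in for k-means below), fix the Gaussian widths, then fit the output layer as a linear model by least squares; every constant is an illustrative choice:

```python
import numpy as np

def train_rbf(X, T, k=10, width=1.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # stand-in for k-means centres
    def phi(Xq):
        d2 = ((Xq[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * width ** 2))                   # Gaussian activations
    H = np.hstack([np.ones((len(X), 1)), phi(X)])               # add a bias column
    W, *_ = np.linalg.lstsq(H, T, rcond=None)                   # linear output layer
    predict = lambda Xq: np.hstack([np.ones((len(Xq), 1)), phi(Xq)]) @ W
    return predict

# toy usage: fit a 1-D function
X = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
T = np.sin(X)
predict = train_rbf(X, T, k=8, width=0.7)
print(np.abs(predict(X) - T).max())    # residual error on the training points
```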
Credits
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann