Linear Discrimination Functions

Corso di Apprendimento Automatico (Machine Learning Course), Laurea Magistrale in Informatica (Master's Degree in Computer Science)
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari

November 30, 2008

Outline

Linear models
Gradient descent
Perceptron
Minimum squared error approach
Linear and logistic regression

Linear Discriminant Functions I

A linear discriminant function can be written as
$$g(\vec{x}) = w_1 x_1 + \cdots + w_d x_d + w_0 = \vec{w}^t \vec{x} + w_0$$
where $\vec{w}$ is the weight vector and $w_0$ is the bias (or threshold).

A 2-class linear classifier implements the following decision rule: Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
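
As a minimal sketch (not from the original slides), the rule can be written directly in NumPy; the weights below are arbitrary, and the last line anticipates the signed distance $r = g(\vec{x})/||\vec{w}||$ derived a few slides later.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0."""
    value = g(x, w, w0)
    if value > 0:
        return "omega_1"
    return "omega_2" if value < 0 else "undefined"

# Arbitrary weights: the hyperplane x1 + 2*x2 - 1 = 0
w, w0 = np.array([1.0, 2.0]), -1.0
x = np.array([0.5, 0.5])
print(classify(x, w, w0))               # omega_1
print(g(x, w, w0) / np.linalg.norm(w))  # signed distance r = g(x)/||w||
```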

Linear Discriminant Functions II

The equation g(x) = 0 defines the decision surface that separates points assigned to ω1 from points assigned to ω2 . When g(x) is linear, this decision surface is a hyperplane (H).

Linear Discriminant Functions III

H divides the feature space into 2 half-spaces: $R_1$ for ω1 and $R_2$ for ω2.
If $\vec{x}_1$ and $\vec{x}_2$ are both on the decision surface, then
$$\vec{w}^t \vec{x}_1 + w_0 = \vec{w}^t \vec{x}_2 + w_0 \;\Rightarrow\; \vec{w}^t(\vec{x}_1 - \vec{x}_2) = 0$$
so $\vec{w}$ is normal to any vector lying in the hyperplane.

Linear Discriminant Functions IV

If we express $\vec{x}$ as
$$\vec{x} = \vec{x}_p + r\,\frac{\vec{w}}{||\vec{w}||}$$
where $\vec{x}_p$ is the normal projection of $\vec{x}$ onto H, then r is the algebraic distance from $\vec{x}$ to the hyperplane.

Since $g(\vec{x}_p) = 0$, we have $g(\vec{x}) = \vec{w}^t\vec{x} + w_0 = r\,||\vec{w}||$, i.e.
$$r = \frac{g(\vec{x})}{||\vec{w}||}$$

r is a signed distance: r > 0 if $\vec{x}$ falls in $R_1$, r < 0 if $\vec{x}$ falls in $R_2$.
The distance from the origin to the hyperplane is $\frac{w_0}{||\vec{w}||}$.


Multicategory Case I

Two approaches to extend the LDF approach to the multicategory case:
- ωi / not ωi: reduce the problem to c − 1 two-class problems; problem #i: find the function that separates points assigned to ωi from those not assigned to ωi
- ωi / ωj: find the c(c − 1)/2 linear discriminants, one for every pair of classes
Both approaches can lead to regions in which the classification is undefined.

Pairwise Classification

Idea: build a model for each pair of classes, using only training data from those classes.
Problem: for a k-class problem, one has to solve k(k − 1)/2 classification problems.
This turns out not to be a problem in many cases, because the training sets become small:
- assume the data are evenly distributed, i.e. 2n/k instances per learning problem for n instances in total
- suppose the learning algorithm is linear in n
- then the runtime of pairwise classification is proportional to $\frac{k(k-1)}{2} \times \frac{2n}{k} = (k-1)\,n$

Linear Machine I

Define c linear discriminant functions:
$$g_i(\vec{x}) = \vec{w}_i^t \vec{x} + w_{i0} \qquad i = 1, \ldots, c$$

Linear Machine classifier: $\vec{x} \in \omega_i$ if $g_i(\vec{x}) > g_j(\vec{x})$ for all $j \neq i$.
In case of equal scores, the classification is undefined.

A LM divides the feature space into c decision regions, with $g_i(\vec{x})$ the largest discriminant if $\vec{x}$ is in $R_i$.
If $R_i$ and $R_j$ are contiguous, the boundary between them is a portion of the hyperplane $H_{ij}$ defined by:
$$g_i(\vec{x}) = g_j(\vec{x}) \quad\text{or}\quad (\vec{w}_i - \vec{w}_j)^t \vec{x} + (w_{i0} - w_{j0}) = 0$$
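
A linear machine is just an argmax over the c discriminants. Here is a small NumPy sketch, with weights invented for a 3-class toy problem:

```python
import numpy as np

def linear_machine_predict(X, W, w0):
    """Assign each row of X to the class with the largest discriminant
    g_i(x) = W[i] @ x + w0[i]; ties are simply resolved by argmax here."""
    scores = X @ W.T + w0          # shape (n_samples, c)
    return np.argmax(scores, axis=1)

# Toy example with c = 3 classes in d = 2 dimensions (made-up weights)
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
X = np.array([[2.0, 0.1], [0.1, 2.0], [-1.0, -1.0]])
print(linear_machine_predict(X, W, w0))   # [0 1 2]
```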

Linear Machine II

It follows that $\vec{w}_i - \vec{w}_j$ is normal to $H_{ij}$.
The signed distance from $\vec{x}$ to $H_{ij}$ is
$$\frac{g_i(\vec{x}) - g_j(\vec{x})}{||\vec{w}_i - \vec{w}_j||}$$
There are c(c − 1)/2 pairs of (convex) regions.
Not all regions are contiguous, and the total number of segments in the separating surfaces is often less than c(c − 1)/2.

(Figure: 3-class and 5-class problems)

Generalized LDF I

The LDF can be written $g(\vec{x}) = w_0 + \sum_{i=1}^{d} w_i x_i$.

By adding d(d + 1)/2 terms involving the products of pairs of components of $\vec{x}$, we obtain the quadratic discriminant function:
$$g(\vec{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j$$

The separating surface defined by $g(\vec{x}) = 0$ is a second-degree or hyperquadric surface.
By continuing to add terms such as $w_{ijk} x_i x_j x_k$ we obtain the class of polynomial discriminant functions.

Generalized LDF II

The generalized LDF is defined as
$$g(\vec{x}) = \sum_{i=1}^{\hat{d}} a_i\, y_i(\vec{x}) = \vec{a}^t \vec{y}$$
where $\vec{a}$ is a $\hat{d}$-dimensional weight vector and the $y_i(\vec{x})$ are arbitrary functions of $\vec{x}$.

The resulting discriminant function is not linear in $\vec{x}$, but it is linear in $\vec{y}$.
The functions $y_i(\vec{x})$ map points in d-dimensional $\vec{x}$-space to points in the $\hat{d}$-dimensional $\vec{y}$-space.

Generalized LDF III

Example: let the quadratic discriminant function be $g(x) = a_1 + a_2 x + a_3 x^2$.
The 3-dimensional vector is then $\vec{y} = (1,\ x,\ x^2)^t$.
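
A possible rendering of this example in Python (the coefficients are chosen arbitrarily):

```python
import numpy as np

def phi(x):
    """Map a scalar x to the augmented vector y = (1, x, x^2)."""
    return np.array([1.0, x, x * x])

def g(x, a):
    """Quadratic discriminant g(x) = a1 + a2*x + a3*x^2, written as a^t y."""
    return np.dot(a, phi(x))

a = np.array([-1.0, 0.0, 1.0])   # arbitrary coefficients: g(x) = x^2 - 1
print(g(0.0, a), g(2.0, a))      # -1.0 (negative side), 3.0 (positive side)
```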

2-class Linearly-Separable Case I

$$g(\vec{x}) = \sum_{i=0}^{d} w_i x_i = \vec{a}^t \vec{y}$$

where $x_0 = 1$, $\vec{y}^t = [1\ \vec{x}] = [1\ x_1 \cdots x_d]$ is an augmented feature vector, and $\vec{a}^t = [w_0\ \vec{w}] = [w_0\ w_1 \cdots w_d]$ is an augmented weight vector.

The hyperplane decision surface H defined by $\vec{a}^t \vec{y} = 0$ passes through the origin in $\vec{y}$-space.

The distance from any point $\vec{y}$ to H is given by $\frac{|\vec{a}^t \vec{y}|}{||\vec{a}||} = \frac{g(\vec{x})}{||\vec{a}||}$.
Because $||\vec{a}|| = \sqrt{1 + ||\vec{w}||^2}$, this distance is less than the distance from $\vec{x}$ to H.

2-class Linearly-Separable Case II

Problem: find $\vec{a} = [w_0\ \vec{w}]$.
Suppose that we have a set of n examples $\{\vec{y}_1, \ldots, \vec{y}_n\}$, each labeled either ω1 or ω2.
Look for a weight vector $\vec{a}$ that classifies all the examples correctly:
$\vec{a}^t \vec{y}_i > 0$ if $\vec{y}_i$ is labeled ω1, or $\vec{a}^t \vec{y}_i < 0$ if $\vec{y}_i$ is labeled ω2.

If such an $\vec{a}$ exists, the examples are linearly separable.

2-class Linearly-Separable Case III: Solutions

Replacing all the examples labeled ω2 by their negatives, one can look for a weight vector $\vec{a}$ such that $\vec{a}^t \vec{y}_i > 0$ for all the examples.
Such an $\vec{a}$ is a.k.a. a separating vector or solution vector.
Each example $\vec{y}_i$ places a constraint on the possible location of a solution vector:
$\vec{a}^t \vec{y}_i = 0$ defines a hyperplane through the origin having $\vec{y}_i$ as a normal vector.
The solution vector (if it exists) must be on the positive side of every such hyperplane.
Solution region = intersection of the n half-spaces.

2-class Linearly-Separable Case IV

Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique.
Additional requirements can be imposed to find a solution vector closer to the middle of the region (a solution that is more likely to classify new examples correctly), e.g.:
seek a unit-length weight vector that maximizes the minimum distance from the examples to the separating plane.

2-class Linearly-Separable Case V

Seek the minimum-length weight vector satisfying $\vec{a}^t \vec{y}_i \geq b > 0$.
The solution region shrinks by the margins $b/||\vec{y}_i||$.

Gradient Descent I

Define a criterion function $J(\vec{a})$ that is minimized if $\vec{a}$ is a solution vector ($\vec{a}^t \vec{y}_i \geq 0$, $\forall i = 1, \ldots, n$).
Start with some arbitrary vector $\vec{a}(1)$.
Compute the gradient vector $\nabla J(\vec{a}(1))$.
The next value $\vec{a}(2)$ is obtained by moving some distance from $\vec{a}(1)$ in the direction of steepest descent, i.e. along the negative of the gradient.

In general, $\vec{a}(k+1)$ is obtained from $\vec{a}(k)$ using
$$\vec{a}(k+1) \leftarrow \vec{a}(k) - \eta(k)\,\nabla J(\vec{a}(k))$$
where $\eta(k)$ is the learning rate.
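
A minimal sketch of this abstract update rule, using a toy quadratic criterion as a stand-in for J (not one of the criteria discussed later):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, n_steps=100):
    """Generic update a(k+1) = a(k) - eta * grad_J(a(k))."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_steps):
        a = a - eta * grad_J(a)
    return a

# Stand-in criterion J(a) = ||a - c||^2 with gradient 2*(a - c); minimum at c
c = np.array([1.0, -2.0])
print(gradient_descent(lambda a: 2 * (a - c), a0=[0.0, 0.0]))  # ~[1, -2]
```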

Gradient Descent & Delta Rule I

To understand this, consider a simpler linear machine (a.k.a. unit), where
$$o = w_0 + w_1 x_1 + \cdots + w_n x_n$$
Let's learn the $w_i$'s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$$
where D is the set of training examples $\langle \vec{x}, t \rangle$ and t is the target output value.

Gradient Descent & Delta Rule II

Gradient:
$$\nabla E[\vec{w}] \equiv \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n}\right]$$

Training rule:
$$\Delta \vec{w} = -\eta\,\nabla E[\vec{w}] \qquad \text{i.e.,} \quad \Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}$$

Note that η is constant.

Gradient Descent & Delta Rule III

$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i}\, \frac{1}{2}\sum_d (t_d - o_d)^2 \\
&= \frac{1}{2}\sum_d \frac{\partial}{\partial w_i}(t_d - o_d)^2 \\
&= \frac{1}{2}\sum_d 2(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d) \\
&= \sum_d (t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - \vec{w}\cdot\vec{x}_d) \\
\frac{\partial E}{\partial w_i} &= \sum_d (t_d - o_d)(-x_{id})
\end{aligned}$$

Basic GRADIENT-DESCENT Algorithm

GRADIENT-DESCENT(D, η)
  D: training set, η: learning rate (e.g. 0.5)
  Initialize each w_i to some small random value
  until the termination condition is met do
    Initialize each Δw_i to zero
    for each ⟨x, t⟩ ∈ D do
      Input the instance x to the unit and compute the output o
      for each w_i do
        Δw_i ← Δw_i + η(t − o)x_i
    for each weight w_i do
      w_i ← w_i + Δw_i
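
A Python sketch of this batch procedure for a linear unit; the toy data and the fixed epoch budget used as termination condition are assumptions made here for illustration:

```python
import numpy as np

def gradient_descent_train(X, t, eta=0.05, n_epochs=100, rng=None):
    """Batch delta rule for a linear unit o = w . x (X should include a bias
    column of ones). Accumulates Delta w_i = eta * (t - o) * x_i over the whole
    training set before updating, as in the pseudocode above."""
    rng = np.random.default_rng(rng)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random initial weights
    for _ in range(n_epochs):                     # termination: fixed epoch budget
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = np.dot(w, x)                      # unthresholded output
            delta_w += eta * (target - o) * x
        w += delta_w
    return w

# Tiny toy data: targets generated by t = 1 + 2*x1 (bias column first)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent_train(X, t))   # approximately [1, 2]
```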

Incremental (Stochastic) GRADIENT-DESCENT I

An approximation of the standard GRADIENT-DESCENT.

Batch GRADIENT-DESCENT: do until satisfied
  1. Compute the gradient $\nabla E_D[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta\,\nabla E_D[\vec{w}]$

Incremental GRADIENT-DESCENT: do until satisfied
  For each training example d in D
    1. Compute the gradient $\nabla E_d[\vec{w}]$
    2. $\vec{w} \leftarrow \vec{w} - \eta\,\nabla E_d[\vec{w}]$

Incremental (Stochastic) GRADIENT-DESCENT II

$$E_D[\vec{w}] \equiv \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \qquad\qquad E_d[\vec{w}] \equiv \frac{1}{2}(t_d - o_d)^2$$

Training rule (delta rule): $\Delta w_i \leftarrow \eta(t - o)x_i$
- similar to the perceptron training rule, yet unthresholded
- convergence is only asymptotically guaranteed
- linear separability is no longer needed!
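
The incremental variant differs from the batch sketch above only in that the weights are updated inside the inner loop; a short sketch on the same toy data:

```python
import numpy as np

def stochastic_delta_rule(X, t, eta=0.05, n_epochs=200):
    """Incremental (stochastic) version: update w after each example,
    Delta w_i = eta * (t - o) * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x, target in zip(X, t):
            o = np.dot(w, x)
            w += eta * (target - o) * x
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias + one attribute
t = np.array([1.0, 3.0, 5.0, 7.0])
print(stochastic_delta_rule(X, t))   # close to [1, 2]
```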

Standard vs. Stochastic GRADIENT-DESCENT

Incremental GD can approximate Batch GD arbitrarily closely if η is made small enough.
In standard GD the error is summed over all examples before updating; in stochastic GD the weights are updated upon each example.
Standard GD is more costly per update step and can employ a larger η.
Stochastic GD may avoid falling into local minima because it uses $E_d$ instead of $E_D$.

Newton’s Algorithm

$$J(\vec{a}) \simeq J(\vec{a}(k)) + \nabla J^t(\vec{a} - \vec{a}(k)) + \frac{1}{2}(\vec{a} - \vec{a}(k))^t H\, (\vec{a} - \vec{a}(k))$$
where $H = \left[\frac{\partial^2 J}{\partial a_i \partial a_j}\right]$ is the Hessian matrix.

Choose $\vec{a}(k+1)$ to minimize this function:
$$\vec{a}(k+1) \leftarrow \vec{a}(k) - H^{-1}\nabla J(\vec{a})$$

Greater improvement per step than GD, but not applicable when H is singular.
Time complexity: $O(d^3)$.
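
A toy illustration of one Newton step, using the sum-of-squared-error criterion introduced later in the lecture as J; for a quadratic criterion a single step reaches the minimum (the data here are invented):

```python
import numpy as np

# One Newton step a(k+1) = a(k) - H^{-1} grad J(a(k)) for the quadratic criterion
# J(a) = ||Y a - b||^2, whose gradient is 2 Y^t (Y a - b) and Hessian 2 Y^t Y.
Y = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 3.0, 5.0])
a = np.zeros(2)
grad = 2 * Y.T @ (Y @ a - b)
H = 2 * Y.T @ Y
a = a - np.linalg.solve(H, grad)      # Newton update; the solve costs O(d^3)
print(a)                              # least-squares solution, here [1, 2]
```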

Perceptron I

Assumption: the data is linearly separable.
Hyperplane: $\sum_{i=0}^{d} w_i x_i = 0$, assuming that there is a constant attribute $x_0 = 1$ (bias).
Algorithm for learning the separating hyperplane: the perceptron learning rule.
Classifier: if $\sum_{i=0}^{d} w_i x_i > 0$ then predict ω1 (or +1), otherwise predict ω2 (or −1).

Perceptron II

Thresholded output:
$$o(x_1, \ldots, x_n) = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

Simpler vector notation:
$$o(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x}) = \begin{cases} +1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$$

Space of the hypotheses: $\{\vec{w} \mid \vec{w} \in \mathbb{R}^{n}\}$

Decision Surface of a Perceptron

A perceptron can represent some useful functions: what weights represent $g(x_1, x_2) = AND(x_1, x_2)$?
But some functions are not representable, e.g. those that are not linearly separable (XOR).
Therefore, we'll want networks of these units...
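
One standard answer to the AND question (assuming inputs in {0, 1} and the ±1 thresholded output above, and noting the weights are not unique): $w_0 = -1.5$, $w_1 = w_2 = 1$.

```python
def perceptron_and(x1, x2):
    """AND as a thresholded linear unit with w0 = -1.5, w1 = w2 = 1."""
    return 1 if -1.5 + 1.0 * x1 + 1.0 * x2 > 0 else -1

print([perceptron_and(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [-1, -1, -1, 1]
```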

Perceptron Training Rule I

Perceptron criterion function: $J(\vec{a}) = \sum_{\vec{y} \in Y(\vec{a})} (-\vec{a}^t \vec{y})$, where $Y(\vec{a})$ is the set of examples misclassified by $\vec{a}$.
If no samples are misclassified, $Y(\vec{a})$ is empty and $J(\vec{a}) = 0$ (i.e. $\vec{a}$ is a solution vector).
$J(\vec{a}) \geq 0$, since $\vec{a}^t \vec{y}_i \leq 0$ if $\vec{y}_i$ is misclassified.
Geometrically, $J(\vec{a})$ is proportional to the sum of the distances from the misclassified samples to the decision boundary.

Since $\nabla J = \sum_{\vec{y} \in Y(\vec{a})} (-\vec{y})$, the update rule becomes
$$\vec{a}(k+1) \leftarrow \vec{a}(k) + \eta(k) \sum_{\vec{y} \in Y_k(\vec{a})} \vec{y}$$
where $Y_k(\vec{a})$ is the set of examples misclassified by $\vec{a}(k)$.

Perceptron Training Rule II

$$w_i \leftarrow w_i + \Delta w_i \quad\text{where}\quad \Delta w_i = \eta(t - o)x_i$$

where:
- $t = c(\vec{x})$ is the target value
- $o$ is the perceptron output
- $\eta$ is a small constant (e.g. 0.1), the learning rate


Perceptron Training Rule III

Perceptron Learning Rule
  Set all weights w_i to zero
  do
    for each instance x in the training data
      if x is classified incorrectly by the perceptron
        if x belongs to ω1, add it to $\vec{w}$
        else subtract it from $\vec{w}$
  until all instances in the training data are classified correctly
  return $\vec{w}$

One can prove it will converge if the training data is linearly separable and η is sufficiently small.

Perceptron Training Rule IV

(Figure: η = 1; sequence of misclassified samples: $\vec{y}_2, \vec{y}_3, \vec{y}_1, \vec{y}_3$)

Perceptron Training Rule V

Why does this work?
Consider the situation where an instance pertaining to the first class has been added to the weights:
$$(w_0 + x_0)x_0 + (w_1 + x_1)x_1 + (w_2 + x_2)x_2 + \cdots + (w_d + x_d)x_d$$
This means the output for this instance has increased by:
$$x_0 x_0 + x_1 x_1 + x_2 x_2 + \cdots + x_d x_d$$
This quantity is always positive, so the hyperplane has moved in the correct direction (and the output decreases for instances of the other class).

Fixed-Increment Single-Sample Perceptron

Perceptron({y^(k)}_{k=1}^n): returns a weight vector
  input: {y^(k)}_{k=1}^n training examples
  begin
    initialize a, k ← 0
    do
      k ← (k + 1) mod n
      if y^(k) is misclassified by the model based on a
        then a ← a + y^(k)
    until all examples are properly classified
    return a
  end
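
A sketch in Python following the spirit of this pseudocode (cycling through the samples and adding misclassified ones); the epoch cap and the toy data are additions for illustration:

```python
import numpy as np

def fixed_increment_perceptron(Y, max_epochs=1000):
    """Fixed-increment single-sample perceptron. Y holds augmented samples
    with those of class omega_2 already negated, so a solution satisfies
    a^t y > 0 for every row. Returns the weight vector (or the last iterate
    if the epoch cap is hit, e.g. for nonseparable data)."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                    # cycle through the samples
            if np.dot(a, y) <= 0:      # misclassified
                a = a + y              # fixed-increment update
                errors += 1
        if errors == 0:                # all samples properly classified
            return a
    return a

Y = np.array([[ 1.0,  2.0,  2.0],    # omega_1 point (2, 2), augmented
              [ 1.0,  1.0,  3.0],    # omega_1 point (1, 3), augmented
              [-1.0,  1.0,  1.0],    # omega_2 point (-1, -1), augmented and negated
              [-1.0,  2.0,  0.0]])   # omega_2 point (-2, 0), augmented and negated
print(fixed_increment_perceptron(Y))
```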

Comments

The perceptron algorithm adjusts the parameters only when it encounters an error, i.e. a misclassified training example.
Correctly classified examples can be ignored.
The learning rate η can be chosen arbitrarily; it only impacts the norm of the final $\vec{w}$ (and the corresponding magnitude of $w_0$).
The final weight vector $\vec{w}$ is a linear combination of training points.

Linear Models: WINNOW

Another mistake-driven algorithm for finding a separating hyperplane.
It assumes binary data (i.e. attribute values are either zero or one).
Difference: multiplicative updates instead of additive updates; weights are multiplied by a user-specified parameter α > 1 (or its inverse).
Another difference: a user-specified threshold parameter θ.
Predict the first class if $w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k > \theta$.

The Algorithm I

WINNOW
  while some instances are misclassified
    for each instance a in the training data
      classify a using the current weights
      if the predicted class is incorrect
        if a belongs to the first class
          for each x_i that is 1, multiply w_i by α
          (if x_i is 0, leave w_i unchanged)
        otherwise
          for each x_i that is 1, divide w_i by α
          (if x_i is 0, leave w_i unchanged)
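
A compact sketch of WINNOW on 0/1 data; the default threshold θ = k/2 is an arbitrary choice made here, and the bias weight w_0 is omitted for brevity:

```python
import numpy as np

def winnow(X, y, alpha=2.0, theta=None, n_epochs=20):
    """WINNOW sketch for binary 0/1 attributes. X has shape (n, k); y[i] is 1
    for the first class and 0 otherwise. Weights start at 1 and are updated
    multiplicatively only on mistakes."""
    n, k = X.shape
    if theta is None:
        theta = k / 2.0
    w = np.ones(k)
    for _ in range(n_epochs):
        for x, target in zip(X, y):
            predicted = 1 if np.dot(w, x) > theta else 0
            if predicted != target:
                if target == 1:
                    w[x == 1] *= alpha        # promote weights of active attributes
                else:
                    w[x == 1] /= alpha        # demote weights of active attributes
    return w

# Toy data: the first class is simply "attribute 0 is on" (other attributes are noise)
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
w = winnow(X, y)
print(w, (X @ w > X.shape[1] / 2.0).astype(int))   # learned weights and predictions
```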

The Algorithm II

WINNOW is very effective in homing in on relevant features (it is attribute-efficient).
It can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm).

Balanced WINNOW I

WINNOW doesn't allow negative weights, and this can be a drawback in some applications.
BALANCED WINNOW maintains two weight vectors, one for each class: $w^+$ and $w^-$.
An instance is classified as belonging to the first class (of two classes) if:
$$(w_0^+ - w_0^-) + (w_1^+ - w_1^-)x_1 + (w_2^+ - w_2^-)x_2 + \cdots + (w_k^+ - w_k^-)x_k > \theta$$

Balanced WINNOW II

BALANCED WINNOW
  while some instances are misclassified
    for each instance a in the training data
      classify a using the current weights
      if the predicted class is incorrect
        if a belongs to the first class
          for each x_i that is 1, multiply w_i^+ by α and divide w_i^- by α
          (if x_i is 0, leave w_i^+ and w_i^- unchanged)
        otherwise
          for each x_i that is 1, multiply w_i^- by α and divide w_i^+ by α
          (if x_i is 0, leave w_i^+ and w_i^- unchanged)

Nonseparable Case

The Perceptron is an error-correcting procedure that converges when the examples are linearly separable.
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data.
To ensure that the performance on training and test data will be similar, many training samples should be used.
Sufficiently large training sets are almost certainly not linearly separable.
No weight vector can correctly classify every example in a nonseparable set.
The corrections may never cease if the set is nonseparable.

Learning rate

If we choose $\eta(k) \to 0$ as $k \to \infty$, then performance can be acceptable on nonseparable problems while preserving the ability to find a solution on separable problems.
The rate at which $\eta(k)$ approaches zero is important:
- too slow: the result will be sensitive to those examples that render the set nonseparable
- too fast: it may converge prematurely with sub-optimal results

$\eta(k)$ can be considered as a function of recent performance, decreasing it as performance improves, e.g. $\eta(k) \leftarrow \eta/k$.

Minimum Squared Error Approach I

Minimum Squared Error (MSE): it trades the ability to obtain a separating vector for good performance on both separable and nonseparable problems.
Previously, we sought a weight vector $\vec{a}$ making all of the inner products $\vec{a}^t \vec{y}_i > 0$.
In the MSE procedure, one tries to make $\vec{a}^t \vec{y}_i = b_i$, where the $b_i$ are some arbitrarily specified positive constants.
Using matrix notation: $Y\vec{a} = \vec{b}$.
If Y were nonsingular, then $\vec{a} = Y^{-1}\vec{b}$.
Unfortunately Y is not a square matrix, usually having more rows than columns.
When there are more equations than unknowns, $\vec{a}$ is overdetermined, and ordinarily no exact solution exists.

Minimum Squared Error Approach II

We can seek a weight vector $\vec{a}$ that minimizes some function of the error vector $\vec{e} = Y\vec{a} - \vec{b}$.
Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function
$$J(\vec{a}) = ||Y\vec{a} - \vec{b}||^2 = \sum_{i=1}^{n} (\vec{a}^t \vec{y}_i - b_i)^2$$
whose gradient is
$$\nabla J = 2\sum_{i=1}^{n} (\vec{a}^t \vec{y}_i - b_i)\,\vec{y}_i = 2Y^t(Y\vec{a} - \vec{b})$$
Setting the gradient equal to zero, the following necessary condition holds: $Y^t Y \vec{a} = Y^t \vec{b}$.

Minimum Squared Error Approach III

$Y^t Y$ is a square matrix which is often nonsingular. Therefore, solving for $\vec{a}$:
$$\vec{a} = (Y^t Y)^{-1} Y^t \vec{b} = Y^+ \vec{b}$$
where $Y^+ = (Y^t Y)^{-1} Y^t$ is the pseudoinverse of Y.
$Y^+$ can also be written as $\lim_{\epsilon \to 0} (Y^t Y + \epsilon I)^{-1} Y^t$, and it can be shown that this limit always exists; hence $\vec{a} = Y^+ \vec{b}$ is the MSE solution to the problem $Y\vec{a} = \vec{b}$.
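
A NumPy sketch of the MSE solution on a small invented system (all margins b set to 1); np.linalg.pinv computes the pseudoinverse directly:

```python
import numpy as np

# MSE solution a = Y^+ b on a toy system of augmented samples (omega_2 rows negated)
Y = np.array([[ 1.0,  2.0,  2.0],
              [ 1.0,  1.0,  3.0],
              [-1.0,  1.0,  1.0],
              [-1.0,  2.0,  0.0]])
b = np.ones(Y.shape[0])

a_explicit = np.linalg.inv(Y.T @ Y) @ Y.T @ b   # (Y^t Y)^{-1} Y^t b
a_pinv = np.linalg.pinv(Y) @ b                  # numerically safer pseudoinverse
print(a_explicit, a_pinv)                       # the two agree when Y^t Y is nonsingular
```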

Widrow-Hoff Procedure (a.k.a. LMS)

The criterion function $J(\vec{a}) = ||Y\vec{a} - \vec{b}||^2$ could also be minimized by a gradient descent procedure.
Advantages:
- avoids the problems that arise when $Y^t Y$ is singular
- avoids the need for working with large matrices

Since $\nabla J = 2Y^t(Y\vec{a} - \vec{b})$, a simple update rule would be
$$\vec{a}(1)\ \text{arbitrary}, \qquad \vec{a}(k+1) = \vec{a}(k) - \eta(k)\, Y^t\big(Y\vec{a}(k) - \vec{b}\big)$$
or, if we consider the samples sequentially,
$$\vec{a}(1)\ \text{arbitrary}, \qquad \vec{a}(k+1) = \vec{a}(k) + \eta(k)\,\big(b_k - \vec{a}(k)^t \vec{y}^{(k)}\big)\,\vec{y}^{(k)}$$

Widrow-Hoff or LMS Algorithm

LMS({y_i}_{i=1}^n)
  input: {y_i}_{i=1}^n training examples
  begin
    initialize a, b, θ, η(·), k ← 0
    do
      k ← (k + 1) mod n
      a ← a + η(k)(b_k − a^t y^(k)) y^(k)
    until |η(k)(b_k − a^t y^(k)) y^(k)| < θ
    return a
  end
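
A sketch of the LMS loop with a decreasing learning rate η(k) = η0/k; the stopping threshold, iteration cap, and toy data are illustrative choices:

```python
import numpy as np

def lms(Y, b, eta0=0.1, theta=1e-4, max_iter=10000):
    """Widrow-Hoff / LMS sketch: sequential updates
    a <- a + eta(k) * (b_k - a^t y_k) * y_k with eta(k) = eta0 / k."""
    a = np.zeros(Y.shape[1])
    n = Y.shape[0]
    for k in range(1, max_iter + 1):
        i = (k - 1) % n                      # take the samples sequentially
        eta = eta0 / k                       # decreasing learning rate
        correction = eta * (b[i] - a @ Y[i]) * Y[i]
        a += correction
        if np.linalg.norm(correction) < theta:
            break
    return a

Y = np.array([[1.0, 2.0, 2.0], [1.0, 1.0, 3.0], [-1.0, 1.0, 1.0], [-1.0, 2.0, 0.0]])
b = np.ones(4)
print(lms(Y, b))     # tends toward the MSE solution computed above
```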

Linear Regression

Standard technique for numeric prediction.
The outcome is a linear combination of the attributes:
$$x = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$$
The weights $\vec{w}$ are calculated from the training data by standard math algorithms.

Predicted value for the first training instance $\vec{x}^{(1)}$:
$$w_0 + w_1 x_1^{(1)} + w_2 x_2^{(1)} + \cdots + w_d x_d^{(1)} = \sum_{j=0}^{d} w_j x_j^{(1)}$$
assuming extended vectors with $x_0 = 1$.
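
The weights can be obtained with a standard least-squares routine; a small sketch with made-up data:

```python
import numpy as np

# Least-squares weights for the extended (x0 = 1) representation.
# np.linalg.lstsq is one of the "standard math algorithms" alluded to above.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])   # bias + one attribute
t = np.array([2.1, 3.9, 6.2, 7.8])                               # noisy targets ~ 2x
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)                      # roughly [0.15, 1.94]: intercept and slope
print(X @ w)                  # predicted values for the training instances
```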


Probabilistic Classification

Any regression technique can be used for classification.
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't.
Prediction: predict the class corresponding to the model with the largest output value (membership value).

Problem: membership values are not in [0, 1] range, so aren’t proper probability estimates

Logistic Regression I

Logit transformation: builds a linear model for a transformed target variable.
Assume we have two classes.
Logistic regression replaces the target $\Pr(1 \mid \vec{x})$ by the target
$$\log \frac{\Pr(1 \mid \vec{x})}{1 - \Pr(1 \mid \vec{x})}$$
The transformation maps [0, 1] to $(-\infty, +\infty)$.

Example: Logistic Regression Model

Resulting model:
$$\Pr(1 \mid \vec{x}) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d)}}$$
Example: model with $w_0 = 0.5$ and $w_1 = 1$ (figure).

Parameters are induced from the data using maximum likelihood.
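
A direct rendering of this model; the example weights are those from the slide (w0 = 0.5, w1 = 1), and the probe points are invented:

```python
import numpy as np

def logistic_model(x, w):
    """Pr(1 | x) = 1 / (1 + exp(-(w0 + w1*x1 + ... + wd*xd))), with x extended by x0 = 1."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# The slide's example: w0 = 0.5, w1 = 1
w = np.array([0.5, 1.0])
for x1 in (-4.0, 0.0, 4.0):
    print(x1, logistic_model(np.array([1.0, x1]), w))
# outputs climb from ~0.03 through ~0.62 to ~0.99 as x1 increases
```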

Maximum Likelihood

Aim: maximize the probability of the training data with respect to the parameters.
One can use logarithms of probabilities and maximize the log-likelihood of the model:
$$\sum_{i=1}^{n} \left(1 - x^{(i)}\right)\log\big(1 - \Pr(1 \mid \vec{x}^{(i)})\big) + x^{(i)}\log \Pr(1 \mid \vec{x}^{(i)})$$
where the $x^{(i)}$ are either 0 or 1.
The weights $w_i$ need to be chosen to maximize the log-likelihood; a relatively simple method is iteratively re-weighted least squares.
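
A small sketch of the log-likelihood computation for given weights (the data and weight vectors are made up; the actual maximization over w, e.g. by iteratively re-weighted least squares, is not shown):

```python
import numpy as np

def log_likelihood(w, X, x_class):
    """Log-likelihood of the logistic model: sum over instances of
    (1 - x^(i)) * log(1 - p_i) + x^(i) * log(p_i), with p_i = Pr(1 | instance i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.sum((1 - x_class) * np.log(1 - p) + x_class * np.log(p))

# Tiny example (extended instances with x0 = 1, binary class labels)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
x_class = np.array([0, 0, 1, 1])
print(log_likelihood(np.array([0.0, 1.0]), X, x_class))   # better fit, closer to 0
print(log_likelihood(np.array([0.0, -1.0]), X, x_class))  # worse fit, more negative
```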

Summary

The perceptron training rule is guaranteed to succeed if:
- the training examples are linearly separable
- the learning rate η is sufficiently small

The linear unit training rule uses gradient descent:
- guaranteed to converge to the hypothesis with minimum squared error
- given a sufficiently small learning rate η
- even when the training data contains noise
- even when the training data is not separable by H

Credits

R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann
