Linear Discrimination Functions
Corso di Apprendimento Automatico, Laurea Magistrale in Informatica
Nicola Fanizzi, Dipartimento di Informatica, Università degli Studi di Bari
November 4, 2009
Outline
- Linear models
- Gradient descent
- Perceptron
- Minimum squared error approach
- Linear and logistic regression
Linear Discriminant Functions I
A linear discriminant function can be written as
$g(\vec{x}) = w_1 x_1 + \cdots + w_d x_d + w_0 = \vec{w}^t \vec{x} + w_0$
where
- $\vec{w}$ is the weight vector
- $w_0$ is the bias or threshold
A 2-class linear classifier implements the decision rule: Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
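The decision rule can be sketched in a few lines of Python. This is only an illustration: the function names `g` and `decide` and the weight values are hypothetical, not part of the original slides.

```python
def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def decide(x, w, w0):
    """Return 1 for omega_1, 2 for omega_2, None on the decision surface."""
    value = g(x, w, w0)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None  # g(x) = 0: the rule leaves the decision open


# Hypothetical weights, chosen only for illustration.
w, w0 = [1.0, -1.0], 0.5
print(decide([2.0, 1.0], w, w0))
print(decide([0.0, 2.0], w, w0))
```

Points with g(x) = 0 lie exactly on the decision surface, so the sketch returns `None` for them rather than forcing a class.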
Linear Discriminant Functions II
The equation g(x) = 0 defines the decision surface that separates points assigned to ω1 from points assigned to ω2. When g(x) is linear, this decision surface is a hyperplane (H).
Linear Discriminant Functions III
H divides the feature space into two half-spaces: R1 for ω1 and R2 for ω2
If $\vec{x}_1$ and $\vec{x}_2$ are both on the decision surface, then
$\vec{w}^t \vec{x}_1 + w_0 = \vec{w}^t \vec{x}_2 + w_0 \Rightarrow \vec{w}^t (\vec{x}_1 - \vec{x}_2) = 0$
so $\vec{w}$ is normal to any vector lying in the hyperplane
Linear Discriminant Functions IV
If we express $\vec{x}$ as
$\vec{x} = \vec{x}_p + r \frac{\vec{w}}{\|\vec{w}\|}$
where $\vec{x}_p$ is the normal projection of $\vec{x}$ onto H and r is the algebraic distance from $\vec{x}$ to the hyperplane, then, since $g(\vec{x}_p) = 0$, we have
$g(\vec{x}) = \vec{w}^t \vec{x} + w_0 = r \|\vec{w}\|$, i.e. $r = \frac{g(\vec{x})}{\|\vec{w}\|}$
r is a signed distance: r > 0 if $\vec{x}$ falls in R1, r < 0 if $\vec{x}$ falls in R2
The distance from the origin to the hyperplane is $\frac{w_0}{\|\vec{w}\|}$
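As a small numeric check of the signed distance r = g(x)/||w||, here is a minimal sketch. The helper name `signed_distance` and the example hyperplane are assumptions made for illustration only.

```python
import math


def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g / norm_w


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, w0 = [3.0, 4.0], -5.0
print(signed_distance([3.0, 4.0], w, w0))  # point on the positive side (R1)
print(signed_distance([0.0, 0.0], w, w0))  # origin: equals w0 / ||w||
```

The second call reproduces the special case of the origin, whose signed distance is w0/||w||.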
Linear Discriminant Functions V
Multi-category Case I
There are two approaches to extending LDFs to the multi-category case:
- ωi / not ωi: reduce the problem to c − 1 two-class problems. Problem #i: find the function that separates points assigned to ωi from those not assigned to ωi
- ωi / ωj: find the c(c − 1)/2 linear discriminants, one for every pair of classes
Both approaches can lead to regions in which the classification is undefined
Multi-category Case II
Pairwise Classification
Idea: build a model for each pair of classes, using only training data from those classes
Problem: we have to solve c(c − 1)/2 classification problems for c classes
This turns out not to be a problem in many cases, because the training sets become small:
- Assume the data are evenly distributed, i.e. 2n/c instances per learning problem for n instances in total
- Suppose the learning algorithm is linear in n
- Then the runtime of pairwise classification is proportional to $\frac{c(c-1)}{2} \times \frac{2n}{c} = (c-1)n$
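The counting argument can be checked by enumerating the pairs explicitly. The sketch below assumes that n is divisible by c and that every class has exactly n/c instances; `pairwise_cost` is a hypothetical helper name.

```python
from itertools import combinations


def pairwise_cost(c, n):
    """Return (number of pairwise problems, total instances processed),
    assuming n/c instances per class and a learner that is linear in n."""
    per_class = n // c
    pairs = list(combinations(range(c), 2))   # one problem per pair of classes
    total = sum(2 * per_class for _ in pairs)  # each problem sees two classes
    return len(pairs), total


problems, total = pairwise_cost(5, 1000)
print(problems, total, (5 - 1) * 1000)
```

For c = 5 and n = 1000 the total work matches (c − 1)n, which grows only linearly in the number of classes.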
Linear Machine I
Define c linear discriminant functions:
$g_i(\vec{x}) = \vec{w}_i^t \vec{x} + w_{i0}, \quad i = 1, \ldots, c$
Linear Machine classifier: $\vec{x} \in \omega_i$ if $g_i(\vec{x}) > g_j(\vec{x})$ for all $j \neq i$
In case of equal scores, the classification is undefined
A LM divides the feature space into c decision regions, with $g_i(\vec{x})$ the largest discriminant if $\vec{x}$ is in $R_i$
If $R_i$ and $R_j$ are contiguous, the boundary between them is a portion of the hyperplane $H_{ij}$ defined by:
$g_i(\vec{x}) = g_j(\vec{x})$ or $(\vec{w}_i - \vec{w}_j)^t \vec{x} + (w_{i0} - w_{j0}) = 0$
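A linear machine amounts to an argmax over the c discriminants. The following sketch uses 0-based class indices and made-up weights; both are assumptions for illustration, as is the function name `linear_machine`.

```python
def linear_machine(x, weights, biases):
    """Return the index i of the largest g_i(x), or None when scores tie."""
    scores = [
        sum(wi * xi for wi, xi in zip(w, x)) + w0
        for w, w0 in zip(weights, biases)
    ]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    return winners[0] if len(winners) == 1 else None  # tie: undefined


# Hypothetical 3-class machine in a 2-dimensional feature space.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
print(linear_machine([2.0, 1.0], weights, biases))
print(linear_machine([1.0, 1.0], weights, biases))  # on the boundary H_01
```

Returning `None` on ties mirrors the statement that the classification is undefined when scores are equal.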
Linear Machine II
It follows that $\vec{w}_i - \vec{w}_j$ is normal to $H_{ij}$
The signed distance from $\vec{x}$ to $H_{ij}$ is: $\frac{g_i(\vec{x}) - g_j(\vec{x})}{\|\vec{w}_i - \vec{w}_j\|}$
There are c(c − 1)/2 pairs of convex regions
Not all regions are contiguous, and the total number of segments in the surfaces is often less than c(c − 1)/2
[Figure: decision regions for 3- and 5-class problems]
Generalized LDF I
The LDF is $g(\vec{x}) = w_0 + \sum_{i=1}^{d} w_i x_i$
Adding d(d + 1)/2 terms involving the products of pairs of components of $\vec{x}$ gives the quadratic discriminant function:
$g(\vec{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j$
The separating surface defined by $g(\vec{x}) = 0$ is a second-degree or hyperquadric surface
Adding more terms, such as $w_{ijk} x_i x_j x_k$, we obtain polynomial discriminant functions
Generalized LDF II
The generalized LDF is defined as
$g(\vec{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\vec{x}) = \vec{a}^t \vec{y}$
where $\vec{a}$ is a $\hat{d}$-dimensional weight vector and the $y_i(\vec{x})$ are arbitrary functions of $\vec{x}$
The resulting discriminant function is not linear in $\vec{x}$, but it is linear in $\vec{y}$
The functions $y_i(\vec{x})$ map points in the d-dimensional $\vec{x}$-space to points in the $\hat{d}$-dimensional $\vec{y}$-space
Generalized LDF III
Example: let the QDF be $g(x) = a_1 + a_2 x + a_3 x^2$
The 3-dimensional vector is then $\vec{y} = (1, x, x^2)^t$
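The example can be sketched as follows: the mapping y(x) = (1, x, x²) is applied first, and the discriminant is then an ordinary dot product in y-space. The names `phi` and `g`, and the coefficient values, are hypothetical.

```python
def phi(x):
    """Map a scalar x to the 3-dimensional vector y = (1, x, x^2)."""
    return [1.0, x, x * x]


def g(x, a):
    """Generalized LDF: linear in y = phi(x), quadratic in x."""
    return sum(ai * yi for ai, yi in zip(a, phi(x)))


# Hypothetical coefficients giving g(x) = x^2 - 1.
a = [-1.0, 0.0, 1.0]
for x in (-2.0, 0.0, 2.0):
    print(x, g(x, a))
```

With these coefficients the region g(x) > 0 is the union of x < −1 and x > 1, which is not an interval in x-space even though the boundary is a single hyperplane in y-space.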
2-class Linearly-Separable Case I
$g(\vec{x}) = \sum_{i=0}^{d} w_i x_i = \vec{a}^t \vec{y}$
where $x_0 = 1$ and
- $\vec{y}^t = [1 \; \vec{x}] = [1 \; x_1 \cdots x_d]$ is an augmented feature vector
- $\vec{a}^t = [w_0 \; \vec{w}] = [w_0 \; w_1 \cdots w_d]$ is an augmented weight vector
The hyperplane decision surface H defined by $\vec{a}^t \vec{y} = 0$ passes through the origin in $\vec{y}$-space
The distance from any point $\vec{y}$ to H is given by $\frac{|\vec{a}^t \vec{y}|}{\|\vec{a}\|} = \frac{|g(\vec{x})|}{\|\vec{a}\|}$
Because $\|\vec{a}\| = \sqrt{w_0^2 + \|\vec{w}\|^2} \geq \|\vec{w}\|$, this distance is less than or equal to the distance from $\vec{x}$ to H
2-class Linearly-Separable Case II
Problem: find $\vec{a} = [w_0 \; \vec{w}]$
Suppose that we have a set of n examples $\{\vec{y}_1, \ldots, \vec{y}_n\}$ labeled ω1 or ω2
Look for a weight vector $\vec{a}$ that classifies all the examples correctly:
- $\vec{a}^t \vec{y}_i > 0$ whenever $\vec{y}_i$ is labeled ω1
- $\vec{a}^t \vec{y}_i < 0$ whenever $\vec{y}_i$ is labeled ω2
If such an $\vec{a}$ exists, the examples are linearly separable
2-class Linearly-Separable Case III
Solutions
Replacing all the examples labeled ω2 by their negatives, one can look for a weight vector $\vec{a}$ such that $\vec{a}^t \vec{y}_i > 0$ for all the examples
- $\vec{a}$ is a.k.a. separating vector or solution vector
Each example $\vec{y}_i$ places a constraint on the possible location of a solution vector
- $\vec{a}^t \vec{y}_i = 0$ defines a hyperplane through the origin having $\vec{y}_i$ as a normal vector
- The solution vector (if it exists) must be on the positive side of every hyperplane
- Solution region = intersection of the n half-spaces
2-class Linearly-Separable Case IV
Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique
Additional requirements can be imposed to find a solution vector closer to the middle of the region (i.e. more likely to classify new examples correctly):
- Seek a unit-length weight vector that maximizes the minimum distance from the examples to the hyperplane
2-class Linearly-Separable Case V
- Seek the minimum-length weight vector satisfying $\vec{a}^t \vec{y}_i \geq b \geq 0$ for all i
The solution region shrinks by the margin $b / \|\vec{y}_i\|$
Gradient Descent I
Define a criterion function $J(\vec{a})$ that is minimized when $\vec{a}$ is a solution vector: $\vec{a}^t \vec{y}_i \geq 0, \; \forall i = 1, \ldots, n$
- Start with some arbitrary vector $\vec{a}(1)$
- Compute the gradient vector $\nabla J(\vec{a}(1))$
- The next value $\vec{a}(2)$ is obtained by moving some distance from $\vec{a}(1)$ in the direction of steepest descent, i.e. along the negative of the gradient
In general, $\vec{a}(k+1)$ is obtained from $\vec{a}(k)$ using
$\vec{a}(k+1) \leftarrow \vec{a}(k) - \eta(k) \nabla J(\vec{a}(k))$
where $\eta(k)$ is the learning rate
Gradient Descent II
Gradient Descent & Delta Rule I
To understand, consider a simpler linear machine (a.k.a. unit), where
$o = w_0 + w_1 x_1 + \cdots + w_n x_n$
Let's learn the $w_i$'s that minimize the squared error, i.e. $J(\vec{w}) = E[\vec{w}]$:
$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
where:
- D is the set of training examples $\langle \vec{x}, t \rangle$
- t is the target output value
Gradient Descent & Delta Rule II
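Gradient:
$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]$
Training rule:
$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$, i.e. $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
Note that η may be a constant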
Gradient Descent & Delta Rule III
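$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2$
$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)$
$\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{id})$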
Basic GRADIENT-DESCENT Algorithm
GRADIENT-DESCENT(D, η)
D: training set, η: learning rate (e.g. 0.5)
Initialize each w_i to some small random value
until the termination condition is met do
    Initialize each Δw_i to zero
    for each ⟨x, t⟩ ∈ D do
        Input the instance x to the unit and compute the output o
        for each w_i do
            Δw_i ← Δw_i + η(t − o)x_i
    for each weight w_i do
        w_i ← w_i + Δw_i
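The pseudocode above can be sketched in Python as follows. Several choices here are assumptions rather than part of the algorithm as stated: weights start at zero instead of small random values (for reproducibility), the termination condition is a fixed number of epochs, and each instance is already extended with x_0 = 1.

```python
def gradient_descent(data, eta, epochs):
    """Batch gradient descent for a linear unit o = w . x.

    data: list of (x, t) pairs, where x already includes x_0 = 1.
    """
    n_weights = len(data[0][0])
    w = [0.0] * n_weights                      # assumption: zero initialization
    for _ in range(epochs):                    # assumption: fixed epoch budget
        delta = [0.0] * n_weights
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n_weights):
                delta[i] += eta * (t - o) * x[i]   # accumulate over all of D
        w = [wi + di for wi, di in zip(w, delta)]  # one update per epoch
    return w


# Hypothetical noise-free data generated by t = 1 + 2 * x_1.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0),
        ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
print(gradient_descent(data, eta=0.05, epochs=2000))
```

The learning rate 0.05 was picked to be small enough for this particular data set; larger values can make the batch updates diverge.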
Incremental (Stochastic) GRADIENT DESCENT I
Approximation of the standard GRADIENT-DESCENT
Batch GRADIENT-DESCENT: do until satisfied
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$
Incremental GRADIENT-DESCENT: do until satisfied
For each training example d in D
1. Compute the gradient $\nabla E_d[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$
Incremental (Stochastic) GRADIENT DESCENT II
$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$
Training rule (delta rule): $\Delta w_i \leftarrow \eta (t - o) x_i$
- similar to the perceptron training rule, yet unthresholded
- convergence is only asymptotically guaranteed
- linear separability is no longer needed!
Standard vs. Stochastic GRADIENT-DESCENT
- Incremental GD can approximate batch GD arbitrarily closely if η is made small enough
- In standard GD the error is summed over all examples before the weights are updated; in stochastic GD the weights are updated upon each example
- Standard GD is more costly per update step and can employ a larger η
- Stochastic GD may avoid falling into local minima because it uses $E_d$ instead of $E_D$
Newton’s Algorithm
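$J(\vec{a}) \simeq J(\vec{a}(k)) + \nabla J^t (\vec{a} - \vec{a}(k)) + \frac{1}{2} (\vec{a} - \vec{a}(k))^t H (\vec{a} - \vec{a}(k))$
where $H = \left[ \frac{\partial^2 J}{\partial a_i \partial a_j} \right]$ is the Hessian matrix
Choose $\vec{a}(k+1)$ to minimize this function:
$\vec{a}(k+1) \leftarrow \vec{a}(k) - H^{-1} \nabla J(\vec{a}(k))$
- Greater improvement per step than GD, but not applicable when H is singular
- Time complexity $O(d^3)$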
Perceptron I
Assumption: the data are linearly separable
Hyperplane: $\sum_{i=0}^{d} w_i x_i = 0$, assuming that there is a constant attribute $x_0 = 1$ (bias)
Algorithm for learning the separating hyperplane: perceptron learning rule
Classifier: if $\sum_{i=0}^{d} w_i x_i > 0$ then predict ω1 (or +1), otherwise predict ω2 (or −1)
Perceptron II
Thresholded output:
$o(x_1, \ldots, x_d) = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \cdots + w_d x_d > 0 \\ -1 & \text{otherwise} \end{cases}$
Simpler vector notation:
$o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x}) = \begin{cases} +1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$
Space of the hypotheses: $\{ \vec{w} \mid \vec{w} \in \mathbb{R}^{d+1} \}$
Decision Surface of a Perceptron
Can represent some useful functions
- What weights represent g(x1, x2) = AND(x1, x2)?
But some functions are not representable
- e.g. those that are not linearly separable (XOR)
- Therefore, we'll want networks of these...
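One possible answer to the AND question, as a minimal sketch: it assumes inputs encoded as 0/1 and an output of +1 for true (the slide does not fix an encoding), and the weights w0 = −1.5, w1 = w2 = 1 are just one choice among infinitely many.

```python
def perceptron_output(x, w):
    """Thresholded output: +1 if w0 + w1*x1 + ... + wd*xd > 0, else -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1


# Hypothetical weights realizing AND on 0/1 inputs.
w_and = [-1.5, 1.0, 1.0]
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_output((x1, x2), w_and))
```

Only the input (1, 1) pushes the weighted sum above zero, so the unit outputs +1 exactly when both inputs are true.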
Perceptron Training Rule I
Perceptron criterion function: $J(\vec{a}) = \sum_{\vec{y} \in Y(\vec{a})} (-\vec{a}^t \vec{y})$
where $Y(\vec{a})$ is the set of examples misclassified by $\vec{a}$
- If no examples are misclassified, $Y(\vec{a})$ is empty and $J(\vec{a}) = 0$ (i.e. $\vec{a}$ is a solution vector)
- $J(\vec{a}) \geq 0$, since $\vec{a}^t \vec{y}_i \leq 0$ if $\vec{y}_i$ is misclassified
- Geometrically, $J(\vec{a})$ is proportional to the sum of the distances from the misclassified examples to the decision boundary
Since $\nabla J = \sum_{\vec{y} \in Y(\vec{a})} (-\vec{y})$, the update rule becomes
$\vec{a}(k+1) \leftarrow \vec{a}(k) + \eta(k) \sum_{\vec{y} \in Y_k} \vec{y}$
where $Y_k$ is the set of examples misclassified by $\vec{a}(k)$
Perceptron Training I
Set all coefficients a_i to zero
do
    for each instance y in the training data
        if y is classified incorrectly by the perceptron
            if y belongs to ω1 add it to a
            else subtract it from a
until all instances in the training data are classified correctly
return a
Perceptron Training II
BATCH PERCEPTRON TRAINING
Initialize a, η(·), θ, k ← 0
do
    k ← k + 1
    $\vec{a} \leftarrow \vec{a} + \eta(k) \sum_{\vec{y} \in Y_k} \vec{y}$
until $\| \eta(k) \sum_{\vec{y} \in Y_k} \vec{y} \| < \theta$
return a
Can prove it will converge if the training data are linearly separable and η is sufficiently small
Perceptron Training III
Why does this work?
Consider the situation where an instance $\vec{y}$ pertaining to the first class has been added to $\vec{a}$. The output for $\vec{y}$ becomes:
$(a_0 + y_0) y_0 + (a_1 + y_1) y_1 + (a_2 + y_2) y_2 + \cdots + (a_d + y_d) y_d$
This means the output for $\vec{y}$ has increased by:
$y_0 y_0 + y_1 y_1 + y_2 y_2 + \cdots + y_d y_d$
which is always positive, thus the hyperplane has moved in the correct direction (and, likewise, the output decreases when an instance of the other class is subtracted)
Perceptron Training IV
η = 1 and $\vec{a}(1) = \vec{0}$. Sequence of misclassified instances: $\vec{y}_1 + \vec{y}_2 + \vec{y}_3$, $\vec{y}_2$, $\vec{y}_3$, $\vec{y}_1$, $\vec{y}_3$, stop
Perceptron
Simplification
FIXED-INCREMENT SINGLE-EXAMPLE PERCEPTRON
input: $\{\vec{y}^{(k)}\}_{k=1}^{n}$ training examples
begin
    initialize a, k = 0
    do
        k ← (k + 1) mod n
        if y^(k) is misclassified by the model based on a
            then a ← a + y^(k)
    until all examples properly classified
    return a
end
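A minimal Python sketch of the fixed-increment rule follows. It assumes the examples are augmented vectors with the ω2 examples already negated, so that a solution satisfies a·y > 0 for every y. It also loops over whole passes instead of using the mod-n counter, and gives up after `max_passes`, which is a safeguard added here and not part of the procedure above.

```python
def fixed_increment_perceptron(samples, max_passes=1000):
    """Fixed-increment single-example perceptron on normalized samples.

    samples: augmented vectors y, with omega_2 examples negated beforehand.
    """
    a = [0.0] * len(samples[0])
    for _ in range(max_passes):
        errors = 0
        for y in samples:
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:   # misclassified
                a = [ai + yi for ai, yi in zip(a, y)]        # a <- a + y
                errors += 1
        if errors == 0:          # a full pass without corrections: done
            return a
    raise RuntimeError("no separating vector found within max_passes")


# Hypothetical data: omega_1 = {(2,2), (3,1)}, omega_2 = {(0,0), (-1,1)}.
samples = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0],      # omega_1, augmented
           [-1.0, 0.0, 0.0], [-1.0, 1.0, -1.0]]   # omega_2, augmented and negated
print(fixed_increment_perceptron(samples))
```

The returned vector is one of many solution vectors; which one is found depends on the order in which the examples are visited.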
Generalizations I
VARIABLE-INCREMENT PERCEPTRON WITH MARGIN
begin
    initialize a, θ, margin b, η(·), k ← 0
    do
        k ← (k + 1) mod n
        if a^t y^(k) ≤ b then a ← a + η(k) y^(k)
    until a^t y^(k) > b for all k
    return a
end
Generalizations II
BATCH VARIABLE-INCREMENT PERCEPTRON
begin
    initialize a, η(·), k ← 0
    do
        k ← (k + 1) mod n
        Y_k ← ∅
        j ← 0
        do
            j ← j + 1
            if y_j misclassified then Y_k ← Y_k ∪ {y_j}
        until j = n
        $\vec{a} \leftarrow \vec{a} + \eta(k) \sum_{\vec{y} \in Y_k} \vec{y}$
    until Y_k = ∅
    return a
end
Comments
- The perceptron adjusts the parameters only when it encounters an error, i.e. a misclassified training example
- Correctly classified examples can be ignored
- The learning rate η can be chosen arbitrarily; it will only affect the norm of the final $\vec{a}$ (and the corresponding magnitude of $a_0$)
- The final weight vector $\vec{a}$ is a linear combination of training points
Nonseparable Case
- The Perceptron is an error-correcting procedure: it converges when the examples are linearly separable
- Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data
- To ensure that the performance on training and test data will be similar, many training examples should be used
- Sufficiently large training sets are almost certainly not linearly separable
- No weight vector can correctly classify every example in a nonseparable set
- The corrections may never cease if the set is nonseparable
Learning rate
If we choose η(k) → 0 as k → ∞, then performance can be acceptable on non-separable problems while preserving the ability to find a solution on separable problems
η(k) can be considered as a function of recent performance, decreasing it as performance improves: e.g. η(k) ← η/k
The rate at which η(k) approaches zero is important:
- Too slow: the result will be sensitive to those examples that render the set non-separable
- Too fast: the procedure may converge prematurely with sub-optimal results
Linear Models: WINNOW
Another mistake-driven algorithm for finding a separating hyperplane
- Assumes binary attributes (i.e. propositional variables)
Main difference: multiplicative instead of additive updates
- Weights are multiplied by a parameter α > 1 (or its inverse)
Another difference: user-specified threshold parameter θ
- Predict the first class if $w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k > \theta$
The Algorithm I
WINNOW
initialize a, α
while some instances are misclassified
    for each instance y in the training data
        classify y using the current model a
        if the predicted class is incorrect
            if y belongs to the target class
                for each attribute y_i = 1, multiply a_i by α
                (if y_i = 0, a_i is left unchanged)
            otherwise
                for each attribute y_i = 1, divide a_i by α
                (if y_i = 0, a_i is left unchanged)
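The procedure above can be sketched in Python as follows. Several details are assumptions made for this illustration: all weights start at 1, there is no separate bias weight, the default threshold is half the number of attributes, and a `max_passes` limit guards against non-termination.

```python
def winnow(data, alpha=2.0, theta=None, max_passes=100):
    """Winnow on binary attribute vectors.

    data: list of (y, label) pairs; label is True for the target class.
    """
    k = len(data[0][0])
    if theta is None:
        theta = k / 2                       # assumption: default threshold
    a = [1.0] * k                           # assumption: unit initial weights
    for _ in range(max_passes):
        mistakes = 0
        for y, label in data:
            predicted = sum(ai * yi for ai, yi in zip(a, y)) > theta
            if predicted != label:
                mistakes += 1
                factor = alpha if label else 1.0 / alpha
                # Only weights of active attributes (y_i = 1) are changed.
                a = [ai * factor if yi == 1 else ai for ai, yi in zip(a, y)]
        if mistakes == 0:
            return a
    raise RuntimeError("still misclassifying after max_passes")


# Hypothetical data in which the class is determined by the first attribute.
data = [([1, 0, 0, 0], True), ([0, 1, 1, 1], False),
        ([1, 1, 0, 0], True), ([0, 0, 1, 1], False)]
print(winnow(data))
```

On this toy data the weight of the first attribute grows while the others shrink, which illustrates how the multiplicative updates single out the relevant attribute.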
The Algorithm II
- WINNOW is very effective in homing in on relevant features (it is attribute efficient)
- It can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm)
Balanced WINNOW I
- WINNOW doesn't allow negative weights, and this can be a drawback in some applications
- BALANCED WINNOW maintains two weight vectors, one for each class: $\vec{a}^+$ and $\vec{a}^-$
- An instance is classified as belonging to the first class (of two classes) if:
$(a_0^+ - a_0^-) + (a_1^+ - a_1^-) y_1 + (a_2^+ - a_2^-) y_2 + \cdots + (a_k^+ - a_k^-) y_k > \theta$
Balanced WINNOW II
BALANCED WINNOW
while some instances are misclassified
    for each instance y in the training data
        classify y using the current weights
        if the predicted class is incorrect
            if y belongs to the first class
                for each attribute y_i = 1, multiply a_i^+ by α and divide a_i^- by α
                (if y_i = 0, leave a_i^+ and a_i^- unchanged)
            otherwise
                for each attribute y_i = 1, multiply a_i^- by α and divide a_i^+ by α
                (if y_i = 0, leave a_i^+ and a_i^- unchanged)
Minimum Squared Error Approach I
Minimum Squared Error (MSE)
- It trades the ability to obtain a separating vector for good performance on both separable and non-separable problems
- Previously, we sought a weight vector $\vec{a}$ making all of the inner products $\vec{a}^t \vec{y}_i \geq 0$
- In the MSE procedure, one tries to make $\vec{a}^t \vec{y}_i = b_i$, where the $b_i$ are some arbitrarily specified positive constants
Using matrix notation: $Y \vec{a} = \vec{b}$
- If Y is nonsingular, then $\vec{a} = Y^{-1} \vec{b}$
- Unfortunately Y is not a square matrix: it usually has more rows than columns
- When there are more equations than unknowns, $\vec{a}$ is overdetermined, and ordinarily no exact solution exists
Minimum Squared Error Approach II
We can seek a weight vector $\vec{a}$ that minimizes some function of an error vector $\vec{e} = Y \vec{a} - \vec{b}$
Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function
$J(\vec{a}) = \|Y \vec{a} - \vec{b}\|^2 = \sum_{i=1}^{n} (\vec{a}^t \vec{y}_i - b_i)^2$
whose gradient is
$\nabla J = 2 \sum_{i=1}^{n} (\vec{a}^t \vec{y}_i - b_i) \vec{y}_i = 2 Y^t (Y \vec{a} - \vec{b})$
Setting the gradient equal to zero, the following necessary condition holds:
$Y^t Y \vec{a} = Y^t \vec{b}$
Minimum Squared Error Approach III
$Y^t Y$ is a square matrix which is often nonsingular. Therefore, solving for $\vec{a}$:
$\vec{a} = (Y^t Y)^{-1} Y^t \vec{b} = Y^+ \vec{b}$
where $Y^+ = (Y^t Y)^{-1} Y^t$ is the pseudo-inverse of Y
$Y^+$ can also be written as $\lim_{\epsilon \to 0} (Y^t Y + \epsilon I)^{-1} Y^t$, and it can be shown that this limit always exists; hence $\vec{a} = Y^+ \vec{b}$ is the MSE solution to the problem $Y \vec{a} = \vec{b}$
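For a matrix Y with only two columns, the normal equations can be solved with the closed-form inverse of a 2×2 matrix. The sketch below is restricted to that case on purpose; `mse_solution_2d`, the data and the choice b = (1, 1, 1, 1) are assumptions for illustration.

```python
def mse_solution_2d(Y, b):
    """Solve Y^t Y a = Y^t b when Y has exactly two columns."""
    # Entries of the symmetric 2x2 matrix Y^t Y.
    s00 = sum(r[0] * r[0] for r in Y)
    s01 = sum(r[0] * r[1] for r in Y)
    s11 = sum(r[1] * r[1] for r in Y)
    # Entries of the vector Y^t b.
    t0 = sum(r[0] * bi for r, bi in zip(Y, b))
    t1 = sum(r[1] * bi for r, bi in zip(Y, b))
    det = s00 * s11 - s01 * s01
    if det == 0:
        raise ValueError("Y^t Y is singular")
    return [(s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det]


# Hypothetical 1-D problem: omega_1 = {2, 3}, omega_2 = {0, 1}.
# Rows are augmented vectors [1, x]; omega_2 rows are negated.
Y = [[1.0, 2.0], [1.0, 3.0], [-1.0, 0.0], [-1.0, -1.0]]
b = [1.0, 1.0, 1.0, 1.0]
a = mse_solution_2d(Y, b)
print(a, [a[0] * r[0] + a[1] * r[1] for r in Y])
```

On this separable toy set the MSE solution happens to give positive values for every row, although in general the MSE procedure does not guarantee a separating vector.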
WIDROW-HOFF procedure a.k.a. LMS I
The criterion function $J(\vec{a}) = \|Y \vec{a} - \vec{b}\|^2$ could be minimized by a gradient descent procedure
Advantages:
- Avoids the problems that arise when $Y^t Y$ is singular
- Avoids the need for working with large matrices
Since $\nabla J = 2 Y^t (Y \vec{a} - \vec{b})$, a simple update rule (with the factor 2 absorbed into η) would be
$\vec{a}(1)$ arbitrary
$\vec{a}(k+1) = \vec{a}(k) - \eta(k) Y^t (Y \vec{a}(k) - \vec{b})$
or, if we consider the examples sequentially,
$\vec{a}(1)$ arbitrary
$\vec{a}(k+1) = \vec{a}(k) + \eta(k) \left( b_k - \vec{a}(k)^t \vec{y}^{(k)} \right) \vec{y}^{(k)}$
WIDROW-HOFF procedure a.k.a. LMS II
LMS($\{\vec{y}_i\}_{i=1}^{n}$)
input $\{\vec{y}_i\}_{i=1}^{n}$: training examples
begin
    initialize a, b, θ, η(·), k ← 0
    do
        k ← (k + 1) mod n
        a ← a + η(k)(b_k − a^t y^(k)) y^(k)
    until ||η(k)(b_k − a^t y^(k)) y^(k)|| < θ
    return a
end
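A minimal Python sketch of the sequential LMS rule is given below. It departs from the pseudocode in two labeled ways: the learning rate is a constant η rather than a schedule η(k), and the stopping test requires every step in a full pass to be smaller than θ, which avoids stopping early on a single example that happens to have a small error.

```python
def lms(samples, targets, eta=0.05, theta=1e-9, max_passes=10000):
    """Widrow-Hoff (LMS) updates a <- a + eta * (b_k - a.y_k) * y_k."""
    a = [0.0] * len(samples[0])
    for _ in range(max_passes):
        largest_step = 0.0
        for y, b in zip(samples, targets):
            error = b - sum(ai * yi for ai, yi in zip(a, y))
            step = [eta * error * yi for yi in y]
            a = [ai + si for ai, si in zip(a, step)]
            largest_step = max(largest_step, sum(s * s for s in step) ** 0.5)
        if largest_step < theta:     # assumption: test over a whole pass
            break
    return a


# Hypothetical consistent system: b = 1 + 2 * x for x = 0, 1, 2, 3.
samples = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
targets = [1.0, 3.0, 5.0, 7.0]
print(lms(samples, targets))
```

Because this toy system has an exact solution, the iterates approach it closely; on inconsistent data a constant η would leave the weights fluctuating around the MSE solution.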
Summary
The perceptron training rule is guaranteed to succeed if
- the training examples are linearly separable
- the learning rate η is sufficiently small
The linear unit training rule uses gradient descent
- Guaranteed to converge to the hypothesis with minimum squared error
- Given a sufficiently small learning rate η
- Even when the training data contain noise
- Even when the training data are not separable by H
Linear Regression
Standard technique for numeric prediction
- The outcome is a linear combination of attributes: $x = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$
- The weights $\vec{w}$ are calculated from the training data using standard math algorithms
Predicted value for the first training instance $\vec{x}^{(1)}$:
$w_0 + w_1 x_1^{(1)} + w_2 x_2^{(1)} + \cdots + w_d x_d^{(1)} = \sum_{j=0}^{d} w_j x_j^{(1)}$
assuming extended vectors with $x_0 = 1$
Probabilistic Classification
Multiresponse Linear Regression (MLR)
Any regression technique can be used for classification
- Training: perform a regression for each class → $g_i$ linear; compute each linear expression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't
- Prediction: predict the class corresponding to the model with the largest output value (membership value)
Logistic Regression I
MLR drawbacks
1. membership values are not proper probabilities: they can fall outside [0, 1]
2. least squares regression assumes that the errors are not only statistically independent, but are also normally distributed with the same standard deviation
The logit transformation does not suffer from these problems
- Builds a linear model for a transformed target variable
- Assume we have two classes
Logistic Regression II
Logistic regression replaces the target $\Pr(1 \mid \vec{x})$, which cannot be approximated well using a linear function, with the target
$\log \frac{\Pr(1 \mid \vec{x})}{1 - \Pr(1 \mid \vec{x})}$
The transformation maps [0, 1] to (−∞, +∞)
Logistic Regression III
[Figure: logit transformation function]
Example: Logistic Regression Model
Resulting model: $\Pr(1 \mid \vec{y}) = 1 / \left( 1 + e^{-(a_0 + a_1 y_1 + a_2 y_2 + \cdots + a_d y_d)} \right)$
Example: model with $a_0 = 0.5$ and $a_1 = 1$
Parameters are induced from the data using maximum likelihood
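The example model with a0 = 0.5 and a1 = 1 can be evaluated at a few points with the sketch below. The function name `logistic_probability` and the sample inputs are hypothetical.

```python
import math


def logistic_probability(y, a):
    """Pr(1 | y) = 1 / (1 + exp(-(a0 + a1*y1 + ... + ad*yd)))."""
    z = a[0] + sum(ai * yi for ai, yi in zip(a[1:], y))
    return 1.0 / (1.0 + math.exp(-z))


a = [0.5, 1.0]   # the example model from the slide
for y1 in (-4.0, -0.5, 0.0, 4.0):
    print(y1, logistic_probability([y1], a))
```

The probability crosses 0.5 where a0 + a1·y1 = 0, i.e. at y1 = −0.5, and it stays strictly between 0 and 1 for every input.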
Maximum Likelihood
Aim: maximize the probability of the training data with respect to the parameters
Can use logarithms of probabilities and maximize the log-likelihood of the model instead of the MSE:
$\sum_{i=1}^{n} \left( 1 - x^{(i)} \right) \log \left( 1 - \Pr(1 \mid \vec{y}^{(i)}) \right) + x^{(i)} \log \Pr(1 \mid \vec{y}^{(i)})$
where the $x^{(i)}$'s are the responses (either 0 or 1)
The weights $a_i$ need to be chosen to maximize the log-likelihood
- a relatively simple method: iteratively re-weighted least squares
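The log-likelihood can be computed directly from its definition. The sketch below only evaluates it for two candidate weight vectors; it does not implement iteratively re-weighted least squares. The function name and the toy data are assumptions.

```python
import math


def log_likelihood(data, a):
    """Sum over i of (1 - x_i) * log(1 - p_i) + x_i * log(p_i),
    where p_i = Pr(1 | y_i) under the logistic model with weights a."""
    total = 0.0
    for y, x in data:
        z = a[0] + sum(ai * yi for ai, yi in zip(a[1:], y))
        p = 1.0 / (1.0 + math.exp(-z))
        total += (1 - x) * math.log(1 - p) + x * math.log(p)
    return total


# Hypothetical 1-D data: negative inputs have response 0, positive ones 1.
data = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
print(log_likelihood(data, [0.0, 0.0]))   # uninformative model, p = 0.5
print(log_likelihood(data, [0.0, 1.0]))   # model that follows the data
```

The second weight vector yields a higher (less negative) log-likelihood, which is the sense in which maximum likelihood prefers it.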
Credits
- R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
- T. M. Mitchell: Machine Learning, McGraw Hill
- I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann