Introduction to Support Vector Machines and Kernel Methods

Johar M. Ashfaque, Amer Iqbal

(Dated: April 12, 2019)

We explain the support vector machine algorithm, and its extension the kernel method, for machine learning using small datasets. We also briefly discuss the Vapnik-Chervonenkis theory which forms the theoretical foundation of machine learning. This review is based on lectures given by the second author.

I. MACHINE LEARNING AND VAPNIK-CHERVONENKIS THEORY

In its simplest form one can understand (supervised) machine learning as giving an approximation to a function f : X → Y given the values of the function (the training dataset) at a certain number of points in X,

D = {(x_i, f(x_i)) | i = 1, ···, N} ⊂ X × Y .

If there are no other assumptions about f and this is all the information that will ever be available, then this can be done perfectly with no error. The subtlety lies in the fact that once f is approximated using D (call its approximation f̂), new information can become available, (x_{N+1}, f(x_{N+1})), ···, and the question is then how well the function f̂ performs (or generalizes) on this new data. The learning aspect (beyond data fitting) arises precisely because one wants to minimize the potential error on any new data which can arise, i.e., not only minimize the error between f and f̂ on D (the in-sample error) but also minimize the error on any new dataset (the out-of-sample error). However, the approximation f̂ developed using D may not be any good if the new dataset on which it is being tested is not similar to the dataset D on which the approximation was trained. So as not to trick the approximation, it is required that D be sampled from X using some probability distribution and that the new data (test data) is also obtained from X using the same probability distribution. It does not matter which probability distribution is used, since all that is required is that the test dataset be distributed in the same way as the training dataset. With respect to that distribution it can be shown (Vapnik-Chervonenkis) that

P[ |E_in − E_out| > ε ] ≤ (···) e^{−ε² N / 32} .

Thus, probabilistically speaking, one can bound the difference between the in-sample and out-of-sample error if a large amount of data is available, i.e., in a probabilistic sense learning is possible. The procedure is:

1. Given the data D = {(x_i, y_i) | i = 1, ···, N} = D_0 ∪ D_1, choose a hypothesis set from which f̂ will be chosen (linear functions, polynomials of higher degree, etc.).
2. Define an error function using f̂ and D_0 which will be minimized by searching through the hypothesis set.
3. Check using D_1 whether the chosen function behaves well on unseen data.
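As an illustration of this recipe, the sketch below splits a dataset into D_0 and D_1, fits a simple hypothesis on D_0, and checks it on D_1. The dataset, the polynomial hypothesis set, and the squared-error function are all hypothetical choices made for this example; the text does not prescribe a particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: noisy samples of an unknown f on [0, 1].
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Step 1: split D into a training part D0 and a validation part D1.
idx = rng.permutation(x.size)
train, valid = idx[:150], idx[150:]

# Hypothesis set: polynomials of a fixed degree (here degree 3).
coeffs = np.polyfit(x[train], y[train], deg=3)   # Step 2: minimize the in-sample squared error
f_hat = np.poly1d(coeffs)

# Step 3: estimate the out-of-sample behaviour on the unseen data D1.
E_in = np.mean((f_hat(x[train]) - y[train]) ** 2)
E_out = np.mean((f_hat(x[valid]) - y[valid]) ** 2)
print(f"E_in = {E_in:.4f}, E_out = {E_out:.4f}")
```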

II. PERCEPTRON LEARNING ALGORITHM: A CLASSIFICATION ALGORITHM

This is a binary classification algorithm, so that Y = {−1, +1} and

f : X → Y ,   x = (x_0, ~x) ,   x_0 = 1 ,   x ∈ X .   (1)

The hypothesis set in which we try to find the best approximation (the one with the least error), assuming that the data is linearly separable, is

H = { h_w | h_w(x) = sign(w^T x) , x_0 = 1 , w ∈ R^{M+1} } .

The function sign(z) gives the sign (positive or negative) of the real number z.


[Figure: the perceptron as a network, with inputs x_0, x_1, x_2, ···, x_M, weights w_0, w_1, w_2, ···, w_M and output h_w(x) ∈ {−1, +1}.]

The error function, which (up to a factor of 4) just counts the fraction of points that are not correctly classified, is

E_in(h) = (1/N) Σ_{i=1}^N ( h_w(x_i) − y_i )² .   (2)

If the data is linearly separable then one can find w such that E_in(h) = 0. The weight vector correctly classifies a point x_j ∈ D with f(x_j) = y_j = +1 if the angle between w and x_j lies in [0, π/2), i.e., the cosine of the angle is positive. And it correctly classifies a point x_ℓ ∈ D with f(x_ℓ) = y_ℓ = −1 if the angle between w and x_ℓ lies in (π/2, π], i.e., the cosine of the angle is negative. The algorithm that achieves this starts by initializing w randomly; let us call this value w_0. Then we pick a point x_j randomly from {x_1, ···, x_N}. If this point is classified correctly then we pick another point from {x_1, ···, x_N}. However, if x_j is not classified correctly then we change w_0 as

w_0 → w_1 = w_0 + y_j x_j .   (3)

With the new weight w_1 we repeat the process. Thus after k iterations we have

w_k = w_{k−1} + y_ℓ x_ℓ   (4)

when the weight w_{k−1} did not classify x_ℓ correctly, i.e., sign(w_{k−1}^T x_ℓ) = −y_ℓ. If we denote by w_⋆ the weight that classifies all the points correctly (which exists since the data is linearly separable) then

w_⋆^T w_k = w_⋆^T w_{k−1} + y_ℓ w_⋆^T x_ℓ .   (5)

The quantity w_⋆^T x_ℓ has the same sign as y_ℓ and therefore y_ℓ w_⋆^T x_ℓ is positive. If we denote the minimum value in the set {y_i w_⋆^T x_i | i = 1, ···, N} by δ then

w_⋆^T w_k ≥ w_⋆^T w_{k−1} + δ   ⟹   w_⋆^T w_k ≥ w_⋆^T w_0 + k δ .   (6)

The length of the vector w_k is bounded,

w_k^T w_k = w_{k−1}^T w_{k−1} + x_ℓ^T x_ℓ + 2 y_ℓ w_{k−1}^T x_ℓ ,   with 2 y_ℓ w_{k−1}^T x_ℓ < 0 .   (7)

Since the data is normalized, x_i^T x_i = 1, therefore

w_k^T w_k ≤ w_{k−1}^T w_{k−1} + 1 ≤ w_0^T w_0 + k .   (8)

Thus we get

w_⋆^T w_k / ( ||w_⋆|| ||w_k|| ) ≥ ( w_⋆^T w_0 + k δ ) / ( ||w_⋆|| √( ||w_0||² + k ) ) .   (9)

The left-hand side is the cosine of the angle between w_⋆ and w_k, which cannot exceed 1, while the right-hand side grows without bound with k unless the iterations stop, and they stop only when all the points are correctly classified. This is the perceptron convergence theorem. If the data is not linearly separable then the above algorithm will not stop, since it only stops when all points are correctly classified.

FIG. 1. Randomly generated linearly separable data.
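A minimal sketch of the update rule in Eqs.(3)-(4) is given below. The random data generation, the stopping cap on the number of passes, and the function name are illustrative choices, not part of the algorithm as described above; the points are normalized to unit length as assumed in Eq.(8).

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    """Perceptron learning algorithm for linearly separable data.

    X : (N, M) array of inputs (without the x_0 = 1 coordinate)
    y : (N,) array of labels in {-1, +1}
    """
    # Augment with x_0 = 1 and normalize each point (Eq.(8) assumes x_i^T x_i = 1).
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    Xa = Xa / np.linalg.norm(Xa, axis=1, keepdims=True)

    w = np.random.randn(Xa.shape[1])          # random initial weight w_0
    for _ in range(max_passes):
        misclassified = np.sign(Xa @ w) != y
        if not misclassified.any():
            return w                          # all points correctly classified
        j = np.random.choice(np.flatnonzero(misclassified))
        w = w + y[j] * Xa[j]                  # update of Eq.(3)/(4)
    raise RuntimeError("did not converge; data may not be linearly separable")

# Hypothetical usage on randomly generated separable data (cf. FIG. 1):
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.5)          # labels from a known separating line
w = perceptron(X, y)
print("learned weights:", w)
```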

III. LINEAR REGRESSION

Given some data points, regression analysis tries to find a curve which fits the data and minimizes an error function. The function defining the curve depends on parameters which are optimized to minimize the error function. Models in which the function depends on the parameters linearly are called Linear Models. Consider the input vector

x = (x_0, ~x) ,   x_0 = 1 ,   x ∈ X   (10)
and the hypothesis function

h_w(x) = w^T x ,   where   w = (w_0, w_1, ···, w_M)^T   (11)

are the weights. This is an example of a linear model, as the modeling function h_w(x) depends linearly on the parameters w. The choice of the parameters w gives a curve (or a hyperplane in the case of more than one independent variable, but linear in them) which will be the estimate for the data. The error function usually taken is just the sum of squares,

Er(w) = ½ Σ_{i=1}^N ( h_w(x_i) − y_i )² = ½ (Xw − y)^T (Xw − y) = ½ ||Xw − y||²   (12)

where

X = | x_1^T |   | x_{0,1}  x_{1,1}  x_{2,1}  ···  x_{M,1} |
    | x_2^T | = | x_{0,2}  x_{1,2}  x_{2,2}  ···  x_{M,2} |
    |  ···  |   |                  ···                    |
    | x_N^T |   | x_{0,N}  x_{1,N}  x_{2,N}  ···  x_{M,N} |

and y = (y_1, y_2, ···, y_N)^T. Here {(x_i, y_i) | i = 1, ···, N} is the training data and x_{k,i} is the k-th coordinate of the i-th training point. The vector y has y_i (the i-th training value) as its i-th coordinate. The error function

Er(w) = ½ w^T X^T X w − w^T X^T y + ½ y^T y   (13)

is a positive definite convex function and therefore has a unique minimum. The critical point of Er(w) is given by

∂Er(w)/∂w_a = Σ_b (X^T X)_{ab} w_b − (X^T y)_a = 0   ⟹   (14)

w_⋆ = (X^T X)^{−1} X^T y   and   h_⋆(x) = h_{w_⋆}(x) = w_⋆^T x .

The data points might not lie on the curve, which can be incorporated using an error term. We say that

Y_i = y_w(x_i) + ε_i ,   i = 1, ···, N .   (15)

The error ε_i is taken to be a Gaussian random variable with zero mean, and different ε_i are independent of each other. This gives a set of independent identically distributed random variables. We have introduced the notation Y_i to distinguish the random variable from the i-th value of the data, which is y_i. Given that Y_i is now a Gaussian random variable with mean y_w(x_i), we can talk about the probability distribution of Y = {Y_1, ···, Y_N},

p(Y = v | w, σ²) = Π_{i=1}^N p( ε_i = v_i − y_w(x_i) | w, σ² ) = (2πσ²)^{−N/2} exp( − Σ_i (v_i − y_w(x_i))² / 2σ² ) .   (16)

Given this probability distribution we would like to choose w so that the probability of obtaining the actual data y_i is maximum. Since the probability of the variable being in some interval will be maximum if the interval lies close to the maximum of the probability distribution, we can maximize the probability distribution:

max_w  (2πσ²)^{−N/2} exp( − Σ_i (y_i − y_w(x_i))² / 2σ² ) .   (17)

The term in the exponential is proportional to the error function Er(w) that we defined earlier:

max_w  (2πσ²)^{−N/2} exp( − Er(w) / σ² ) .   (18)

Thus in this case maximizing the probability of obtaining the observed data y_i is the same as minimizing the error function:

max_w  (2πσ²)^{−N/2} exp( − Er(w) / σ² )   ↦   min_w Er(w) .
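The closed-form solution of Eq.(14) is straightforward to implement; a small sketch with synthetic data (the data itself is hypothetical) is given below. In practice one would use a least-squares routine rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: y = 2 + 3*x plus Gaussian noise, as in Eq.(15).
N = 100
x = rng.uniform(-1, 1, size=N)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=N)

# Design matrix X with x_0 = 1 in the first column, as in Eq.(12).
X = np.column_stack([np.ones(N), x])

# Normal equations, Eq.(14): w* = (X^T X)^{-1} X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated weights:", w_star)          # close to [2, 3]

# Equivalent, numerically preferable form:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```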

IV. LOGISTIC REGRESSION


V. ON CONSTRAINED OPTIMIZATION, LAGRANGE MULTIPLIERS AND KKT CONDITIONS

Suppose we would like to find the minimum of a function f(x, y) of two variables subject to the constraint g(x, y) = 0. The constraint g(x, y) = 0 defines a curve, and the problem is to determine the minimum value of the function f(x, y) on that curve. If we parametrize the curve by a parameter t then along the curve the function f(x, y) is a function of t,

f(x, y)|_curve = f(x(t), y(t))   (19)

where x = x(t), y = y(t) is a parametrization of the curve. The minimum value of the function along the curve will be a critical point (assuming a generic situation) of f(x(t), y(t)) with respect to t and therefore

d/dt f(x(t), y(t)) = 0   ⟹   t · ∇f = 0 ,   (20)

where t is the tangent vector to the curve. Thus at the critical point (and therefore at the minimum) the gradient of the function is orthogonal to the curve. Since the vector which is orthogonal to the curve is ∇g, therefore

∇f = λ ∇g   ⟹   ∇(f + λ g) = 0 .   (21)

Thus at the minimum we will have

∇(f + λ g) = 0   and   g = 0 .   (22)

Both these equations can be merged into a single equation:

∇_{x,λ} (f + λ g) = 0   (23)

where the gradient is now defined as ∇_{x,λ} = ( ∂/∂x , ∂/∂y , ∂/∂λ ), i.e., in the space R³. Consider now the case of a function of N variables f(x) subject to k constraints,

g_1(x) = 0 , ··· , g_k(x) = 0 .   (24)

The solution of the constraints defines an (N − k)-dimensional hypersurface in R^N,

S = { x ∈ R^N | g_1(x) = g_2(x) = ··· = g_k(x) = 0 } = S_1 ∩ S_2 ∩ ··· ∩ S_k ,   S_a = { x ∈ R^N | g_a(x) = 0 } ,

and we are interested in the minimum value of the function f(x) on S. If the minimum occurs at a point p ∈ S, then for any curve passing through p the minimum value of the function on that curve is at p. Therefore

t · ∇f |_p = 0   (25)

where t is the tangent vector to the curve passing through p. Thus we see that ∇f at p is orthogonal to all the curves passing through p and is hence orthogonal to the hypersurface S. The vectors orthogonal to the hypersurfaces S_a are also orthogonal to the hypersurface S, since S lies inside each of the S_a. Thus we can express the vector orthogonal to S as a linear combination of vectors orthogonal to the S_a, and therefore

∇f = λ_1 ∇g_1 + λ_2 ∇g_2 + ··· + λ_k ∇g_k   ⟹   ∇( f − Σ_{a=1}^k λ_a g_a ) = 0 .   (26)

We can extend the gradient to the λ-space to incorporate the constraints defined by the g_a(x) within the gradient equation,

∇_{x,λ} ( f − Σ_{a=1}^k λ_a g_a ) = 0 .   (27)

Thus effectively one can consider minimizing with respect to x the function f − Σ_a λ_a g_a. Notice that for any x for which one or more of the constraints is violated we can choose the corresponding λ_a in such a way that the maximum value of the function f − Σ_a λ_a g_a becomes ∞,

f̃(x) = max_λ [ f(x) − Σ_a λ_a g_a(x) ] = { ∞ if x ∉ S ;  f(x) if x ∈ S } .

Therefore

min_{x∈S} f(x) = min_x f̃(x) = min_x max_λ [ f(x) − Σ_a λ_a g_a(x) ] .   (28)

The dual problem is given by changing the order in which the minimum and the maximum are taken:

max_λ min_x [ f(x) − Σ_a λ_a g_a(x) ] .   (29)

The advantage of the dual problem is that the function

G(λ) = min_x [ f(x) − Σ_a λ_a g_a(x) ]   (30)

is a concave function. However, the solutions of the original problem and of the dual problem may not be the same, i.e.,

min_x max_λ [ f(x) − Σ_a λ_a g_a(x) ]   ≠   max_λ min_x [ f(x) − Σ_a λ_a g_a(x) ] .   (31)

Actually,

min_x max_λ [ f(x) − Σ_a λ_a g_a(x) ]   ≥   max_λ min_x [ f(x) − Σ_a λ_a g_a(x) ] .   (32)

Thus the solution of the dual problem provides a lower bound to the solution of the original problem. The equality holds if the conditions of the strong duality theorem hold. This will be the case for the application to support vector machines (SVM), in which we will have the optimization problem of a quadratic function. Thus we will replace the original problem with the dual problem

max_λ G(λ) .   (33)

1. Inequality Constraints

So far we have discussed constraints given by equalities, g_a(x) = 0. For applications to SVM it will be useful to consider inequality constraints, i.e.,

min_{x∈S} f(x)   where   S = { x ∈ R^n | g_a(x) ≤ 0 , a = 1, ···, k } .   (34)

In this case the Lagrangian is given by

L(x, λ) = f(x) − Σ_{a=1}^k λ_a g_a(x) .   (35)

However, unconstrained optimization with respect to the λ's will not yield the original problem in Eq.(34) as was the case in Eq.(28). We can see this using the case of a single λ. Consider the Lagrangian

L(x, λ) = f(x) − λ g(x) .   (36)

Notice that if x ∉ S then g(x) > 0 and therefore we can maximize the value of L(x, λ) by taking λ to be very large and negative. Similarly, for g(x) < 0 we can take λ to be very large and positive to maximize L. To achieve a result similar to Eq.(28) we have to constrain λ:

max_{λ≤0} [ f(x) − λ g(x) ] = f̃(x) = { ∞ if g(x) > 0 ;  f(x) if g(x) ≤ 0 } .   (37)

This gives

min_{x∈S} f(x) = min_x f̃(x) = min_x max_{λ≤0} [ f(x) − λ g(x) ] .   (38)

Thus for the case of inequality constraints the Lagrange multipliers are also constrained, in the same direction as the inequality constraints (given the way we have defined the Lagrangian).
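As a small illustration of Eqs.(21)-(23) and (26)-(27), the sketch below solves a hypothetical two-variable problem, minimizing f(x, y) = x² + y² subject to g(x, y) = x + y − 1 = 0, by setting the extended gradient of the Lagrangian f − λg to zero. The example and the use of sympy are choices made here, not part of the original text.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

f = x**2 + y**2        # hypothetical objective
g = x + y - 1          # hypothetical constraint g(x, y) = 0

L = f - lam * g        # Lagrangian with a single multiplier, cf. Eq.(26)
grad = [sp.diff(L, v) for v in (x, y, lam)]   # gradient in (x, y, lambda)-space, cf. Eq.(27)

sol = sp.solve(grad, (x, y, lam), dict=True)
print(sol)             # [{x: 1/2, y: 1/2, lambda: 1}] -> constrained minimum at (1/2, 1/2)
```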

VI. SUPPORT VECTOR MACHINES

The support vector machine (SVM) is a powerful machine learning algorithm that can be used for both multi-class classification and regression. In the case of data which is linearly separable, linear regression (LR) gives an algorithm that can separate the data using a hyperplane. However, there can in general be many hyperplanes which separate the data, and the SVM, in some sense, chooses the best of them. The hyperplane that the SVM chooses has the greatest margin from the data points on the two sides. An added advantage is that the algorithm only requires the knowledge of the support vectors, the points that lie on the margin. Suppose that the training examples are given by

D = { (x_1, y_1), (x_2, y_2), ···, (x_N, y_N) } = D_+ ∪ D_−   (39)

where points in D_+ have last coordinate +1 and points in D_− have last coordinate −1. We denote by E the subset of X such that

E = { x_1, x_2, ···, x_N } = E_+ ∪ E_− ,   (40)

where x ∈ E_+ if and only if f(x) = +1. The hypothesis function approximating f is

h_w(x) = w^T x + c   (41)

such that

h_w(x) > 0  for  x ∈ D_+ ,   h_w(x) < 0  for  x ∈ D_− .   (42)

Since the distance between a point x_0 ∈ X and the plane h_w(x) = 0 is given by

|h_w(x_0)| / ||w|| ,   (43)


the distance of the training examples from the plane is therefore given by

d_i = |h_w(x_i)| / ||w|| = y_i h_w(x_i) / ||w|| .   (44)

We would like to find w and c such that these distances are maximal while correctly classifying the training examples. Thus we would like to maximize the minimum distance that a training example has, i.e., define

d_min(w, c) = min_{i=1,···,N} [ y_i h_w(x_i) / ||w|| ] = J(w, c, D) / ||w||   (45)

and determine

max_{w,c} d_min(w, c)   (46)

subject to y_i h_w(x_i) ≥ J(w, c, D) for all i = 1, 2, ···, N. The function J(w, c, D) is affected by the scaling of (w, c), which does not affect the plane or the margin. We can use this scaling freedom to take

J(w, c, D) = 1 .

Thus we can find the weights for the maximum margin by solving the following problem:

max_{w,c} 1/||w||   subject to   y_i h_w(x_i) ≥ 1 ,

or equivalently

min_{w,c} ½ ||w||²   subject to   y_i h_w(x_i) ≥ 1 .

The corresponding Lagrangian is given by (with λ_i ≤ 0):

L(w, c, λ, D) = ½ ||w||² − Σ_{i=1}^N λ_i [ 1 − y_i h_w(x_i) ] .

The dual problem is in terms of the dual function, which is obtained from L by eliminating w and c using their equations of motion,

∇_w L(w, c, λ, D) = w + Σ_{i=1}^N λ_i y_i x_i = 0 ,   ∂_c L(w, c, λ, D) = Σ_{i=1}^N λ_i y_i = 0 ,   (47)

which gives

w = − Σ_{i=1}^N λ_i y_i x_i .   (48)

The dual function is then given by

G(λ) = ½ Σ_{i,j} [ λ_i (y_i x_i^T) ][ λ_j (y_j x_j) ] − Σ_{i=1}^N λ_i [ 1 − y_i c + y_i Σ_{j=1}^N λ_j y_j x_j^T x_i ]
     = − ½ Σ_{i,j=1}^N λ_i λ_j y_i y_j x_i^T x_j − Σ_{i=1}^N λ_i .   (49)

Thus the dual problem is (changing λ_i to −λ_i so that the constraints are λ_i ≥ 0)

max_λ  − ½ Σ_{i,j=1}^N λ_i λ_j y_i y_j x_i^T x_j + Σ_{i=1}^N λ_i   (50)

subject to the constraints λ_i ≥ 0, i = 1, 2, ···, N, and Σ_{i=1}^N λ_i y_i = 0, or equivalently in matrix notation,

max_λ  − ½ λ^T Q_K λ + λ^T 1   (51)

subject to the constraints λ_i ≥ 0, i = 1, 2, ···, N, and λ^T y = 0, where

(Q_K)_{ij} = y_i y_j K(x_i, x_j) ,   K(x_i, x_j) = x_i^T x_j .

If we denote the solution by λ_i^⋆ then

w^⋆ = Σ_{i=1}^N λ_i^⋆ y_i x_i   (52)

and c^⋆ is given by

J(w^⋆, c^⋆) = 1 ,   (53)

so that

h_{w,c}(x) = Σ_{i=1}^N λ_i^⋆ y_i K(x_i, x) + c^⋆ .   (54)

Thus we see that given the matrix of inner products K(x_i, x_j) defined on D ⊂ X, we can solve the maximization problem given in Eq.(51) and obtain the optimal weights (w^⋆, c^⋆) given by Eq.(48) and Eq.(54). From Eq.(52) we see that only those vectors x_i contribute for which λ_i^⋆ is non-zero. These vectors in X are called support vectors.
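A sketch of how the dual problem in Eq.(51) can be posed to a quadratic-programming solver is given below. CVXOPT is the solver used in the example that follows; the helper function, its name, and the tolerance used to identify support vectors are illustrative choices, and the data is assumed to be linearly separable.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """Solve the hard-margin dual of Eq.(51) with a QP solver.

    X : (N, M) array of training points, y : (N,) array of labels in {-1, +1}.
    Returns the multipliers lambda*, the weights w* and the offset c*.
    """
    N = X.shape[0]
    K = X @ X.T                                   # K(x_i, x_j) = x_i^T x_j
    Q = (y[:, None] * y[None, :]) * K             # (Q_K)_ij = y_i y_j K(x_i, x_j)

    # CVXOPT minimizes (1/2) l^T P l + q^T l  subject to  G l <= h,  A l = b.
    P = matrix(Q.astype(float))
    q = matrix(-np.ones(N))                       # maximizing lambda^T 1 -> minimizing -lambda^T 1
    G = matrix(-np.eye(N))                        # -lambda_i <= 0, i.e. lambda_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))    # lambda^T y = 0
    b = matrix(0.0)

    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    lam = np.array(sol["x"]).ravel()

    w = (lam * y) @ X                             # Eq.(52): w* = sum_i lambda*_i y_i x_i
    sv = lam > 1e-6                               # support vectors have lambda*_i > 0
    c = np.mean(y[sv] - X[sv] @ w)                # offset from y_i h_w(x_i) = 1 on the margin
    return lam, w, c
```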


A. Example

As an example of a linear SVM, let us consider the following eight points, with the associated value +1 or −1 indicated as a subscript:

D = { (4.64, −6)_{+1}, (3.02, −3.5)_{−1}, (2.5, −7.5)_{+1}, (4.46, −2.28)_{−1}, (2.78, −2.08)_{−1}, (2.98, −6.19)_{+1}, (1.18, −5.79)_{+1}, (1.16, −0.8)_{−1} } .   (55)

The 8 × 8 matrix of inner products (Q_K)_{ij} = y_i y_j K(x_i, x_j) is given by

Q_K = | 57.6225   −35.043    56.625    −34.419    −25.407    50.997     40.227     −10.194 |
      | −35.043   21.3704    −33.8     21.4492    15.6756    −30.6646   −23.8286   6.3032  |
      | 56.625    −33.8      62.5      −28.25     −22.55     53.875     46.375     −8.9    |
      | −34.419   21.4492    −28.25    25.09      17.1412    −27.404    −18.464    6.9976  |
      | −25.407   15.6756    −22.55    17.1412    12.0548    −21.1596   −15.3236   4.8888  |
      | 50.997    −30.6646   53.875    −27.404    −21.1596   47.1965    39.3565    −8.4088 |
      | 40.227    −23.8286   46.375    −18.464    −15.3236   39.3565    34.9165    −6.0008 |
      | −10.194   6.3032     −8.9      6.9976     4.8888     −8.4088    −6.0008    1.9856  |   (56)

Using the Python library for convex constrained optimization, CVXOPT, for Eq.(51) gives the following optimal values:

λ_1^⋆ = 0.17 ,  λ_2^⋆ = 0.348 ,  λ_7^⋆ = 0.178 ,   λ_3^⋆ = λ_4^⋆ = λ_5^⋆ = λ_6^⋆ = λ_8^⋆ = 0 .

Thus using Eq.(53) and Eq.(54) the optimal weights are given by

w^⋆ = (−0.0521, −0.8326) ,   c^⋆ = −3.7539   (57)

and the classifier function is

h_{w^⋆,c^⋆}(x) = w^⋆T x + c^⋆ = −0.0521 x − 0.8326 y − 3.7539 .   (58)

1. Non-separable Case and the Effect of Outliers

To take into account the marginally non-separable case, or the case in which there are a few outliers that can affect the optimal margin, one can introduce slack variables ξ_i which measure by how much the i-th point in X violates the optimal margin constraint, which now takes the form

y_i h_w(x_i) ≥ 1 − ξ_i ,   ξ_i ≥ 0 ,   i = 1, 2, ···, N .

The violation, however, is minimized by modifying the function to be minimized. The modified minimization problem is:

min_{w,c} ½ ||w||² + C Σ_{i=1}^N ξ_i   (59)

subject to

y_i h_w(x_i) ≥ 1 − ξ_i  (constraint #1) ,   ξ_i ≥ 0  (constraint #2) ,   i = 1, ···, N .

The Lagrangian is now given by (with λ_i ≥ 0 and β_i ≤ 0):

L(w, c, λ, ξ, D) = ½ ||w||² + C Σ_{i=1}^N ξ_i + Σ_{i=1}^N λ_i [ 1 − ξ_i − y_i h_w(x_i) ] + Σ_{i=1}^N β_i ξ_i .   (60)

Here the λ_i are the Lagrange multipliers for constraint #1 and the β_i are the Lagrange multipliers for constraint #2. The dual problem is that of maximizing the dual function, which is obtained from Eq.(60) by


eliminating (w, c, ξ) using their equations of motion:

∇_w L = w − Σ_{i=1}^N y_i λ_i x_i = 0 ,   ∂_c L = − Σ_{i=1}^N y_i λ_i = 0 ,   ∂_{ξ_i} L = C − λ_i + β_i = 0 .   (61)

The dual function is then given by

G(λ) = − ½ Σ_{i,j=1}^N λ_i λ_j y_i y_j K(x_i, x_j) + Σ_{i=1}^N λ_i   (62)

and since β_i ≤ 0, it follows from Eq.(61) that

0 ≤ λ_i ≤ C .   (63)

Thus the dual problem is

max_λ  − ½ Σ_{i,j=1}^N λ_i λ_j y_i y_j K(x_i, x_j) + Σ_{i=1}^N λ_i

subject to the constraints

0 ≤ λ_i ≤ C ,  i = 1, 2, ···, N ,   Σ_{i=1}^N λ_i y_i = 0 .

VII. KERNEL METHODS

It can be seen from the form of the dual function in Eq.(62) that it only depends on the matrix K(x_i, x_j), which is the matrix of inner products in X. This dependence of the dual function on only the inner product in the input space can be used in case the data in the X space is not linearly separable. In such a case, if one assumes a map ϕ : X → X̂ such that the image of the data, ϕ(x_i), is linearly separable in X̂, called the feature space, then one can apply the SVM directly to the data in X̂. In this case the dual function and the dual problem remain the same except that

K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) .   (64)

Since only the function K(x, x′) appears in the dual function, one can move a step further and just specify the kernel K(x, x′) without explicitly stating what the map ϕ is. All that is required is that the kernel K(x, x′) should be an inner product in some space X̂. The reason this works from the point of view of generalization is that the optimal linear function separating the data in X̂ also depends only on the kernel, as can be seen in Eq.(52).
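As an illustration, the sketch below builds the kernel matrix for a Gaussian (RBF) kernel, K(x, x′) = exp(−||x − x′||² / 2σ²), which is a standard choice of kernel; it can be used in place of the linear kernel x_i^T x_j in the dual solver sketched earlier. The choice of kernel and of σ here is illustrative.

```python
import numpy as np

def rbf_kernel_matrix(X, Z, sigma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

# The dual problem only needs K(x_i, x_j); the map phi is never constructed.
# K_train = rbf_kernel_matrix(X, X)                 # used in place of X @ X.T in Eq.(62)
# K_test = rbf_kernel_matrix(X, x_new[None, :])     # used to evaluate Eq.(54) at a new point
```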


VIII. CROSS-VALIDATION

For large data sets, the original sample of data can be partitioned into three sets: a training set on which to train the model, a validation set on which to validate the models, and a test set on which to evaluate the trained model. However, when we do not have large samples of data, cross-validation becomes increasingly useful. Cross-validation, or CV for short, allows us to select a model and estimate its error. When we select a model using CV, we do so by selecting one model from a range of other models which are trained on a particular data set, and then by selecting the hyper-parameters of the model.

CV involves partitioning the data set into k folds, consisting of k equal-sized samples of the original data set. From the k folds, k − 1 are used to train the model and the remaining one is used to validate the model. The process is repeated k times and an average error is computed across all k trials. The question then arises as to how we choose the right value of k. To find an answer, we need to recall that a lower value of k increases bias and a higher value of k increases variance. However, the rule of thumb dictates using k = 10. CV allows us to observe the variance in prediction: if the variance between the predicted values is high, then the model may be overfitting. After performing CV with various different models and different hyper-parameters, the model that has the best performance with respect to its error and its variance is chosen. In short, choosing the right CV object is a crucial part of fitting a model properly. There are many ways to split data into training and test sets in order to avoid model overfitting.

FIG. 2. Diagram of k-fold cross-validation with k = 10. Image from Karl Rosaen Log, http://karlrosaen.com/ml/learning-log/2016-06-20/

On the other hand, leave-one-out cross-validation (LOOCV) uses a single observation from the original set of data as the validation data, and the remaining observations are used as the model training data. We are essentially training the model with one less observation and validating the model on this very observation. The value of k here is the number of observations present in the data.
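A minimal sketch of k-fold cross-validation is given below; the model here is the least-squares fit from Section III and the data is hypothetical, but any train/predict pair can be substituted.

```python
import numpy as np

def k_fold_cv(X, y, k=10, seed=0):
    """Average validation error over k folds (squared error, linear least-squares model)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)

    errors = []
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # train on k - 1 folds
        errors.append(np.mean((X[valid] @ w - y[valid]) ** 2))    # validate on the held-out fold
    return np.mean(errors)

# Hypothetical usage; setting k = len(y) gives leave-one-out cross-validation (LOOCV).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.normal(size=50)
print("10-fold CV error:", k_fold_cv(X, y, k=10))
```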

