Linear Kernel: Example - Machine Learning

The Linear Kernel is the simplest kernel function in SVM.

It is used when the data can be separated using a straight line or hyperplane.

The Linear Kernel does not transform the data into higher dimensions. It works directly on the original input features.

Dataset

Consider the following dataset:

Point	x1	x2	Class
A	1	1	-1
B	2	2	-1
C	4	4	+1
D	5	5	+1

Visual representation:

(5,5)  D(+1)

(4,4)  C(+1)


(2,2)  B(-1)

(1,1)  A(-1)

The classes can be separated using a straight line.

So we can use Linear Kernel SVM.

Step 1: Linear Kernel Formula

The Linear Kernel formula is:

K(xi, xj) = xiᵀxj

This means:

K(xi, xj) = dot product of two data points

For two points:

xi = (a, b)
xj = (c, d)

The dot product is:

xiᵀxj = ac + bd
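The dot product above can be sketched in a few lines of Python. This is a minimal illustration that assumes NumPy is installed; the function name linear_kernel is chosen here for illustration and is not part of the original text.

```python
import numpy as np

def linear_kernel(xi, xj):
    """Linear kernel: the plain dot product of two feature vectors."""
    return float(np.dot(xi, xj))

# Points B and C from the dataset above
B = np.array([2, 2])
C = np.array([4, 4])

k_bc = linear_kernel(B, C)  # 2*4 + 2*4
print("K(B, C) =", k_bc)
```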

Step 2: Why Linear Kernel Works Here

Class -1 points:

A(1,1), B(2,2)

Class +1 points:

C(4,4), D(5,5)

The closest opposite-class points are:

B(2,2) and C(4,4)

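In this dataset, these two points decide the maximum-margin boundary. Picking the closest opposite-class pair is a shortcut that works here; in general, the support vectors come from solving the SVM optimization problem.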
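For a dataset this small, the closest opposite-class pair can be found by brute force. The sketch below assumes NumPy and uses illustrative names (points, labels, dist); it is a shortcut for this example, not how an SVM solver actually finds support vectors.

```python
import numpy as np
from itertools import product

points = {"A": (1, 1), "B": (2, 2), "C": (4, 4), "D": (5, 5)}
labels = {"A": -1, "B": -1, "C": +1, "D": +1}

neg = [name for name in points if labels[name] == -1]
pos = [name for name in points if labels[name] == +1]

def dist(p, q):
    """Euclidean distance between two named points."""
    return float(np.linalg.norm(np.subtract(points[p], points[q])))

# Try every (class -1, class +1) pair and keep the closest one
closest_pair = min(product(neg, pos), key=lambda pair: dist(*pair))
closest_distance = dist(*closest_pair)
print(closest_pair, round(closest_distance, 3))
```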

Step 3: Find the Midpoint Between Closest Points

Closest points:

B(2,2), C(4,4)

Midpoint formula:

For two points p = (p1, p2) and q = (q1, q2):

Midpoint = ((p1 + q1)/2, (p2 + q2)/2)

Substitute values:

Midpoint = ((2 + 4)/2, (2 + 4)/2)

Midpoint = (3,3)

So the decision boundary passes through:

(3,3)

Step 4: Find Direction Vector

Direction from B to C:

C - B = (4 - 2, 4 - 2)

= (2,2)

The decision boundary must be perpendicular to this direction.

So the normal vector of the hyperplane can be:

w = (2,2)

We simplify it as:

w = (1,1)

because both vectors point in the same direction.

So:

w1 = 1
w2 = 1

Step 5: Find Bias Value

General hyperplane equation:

w1x1 + w2x2 + b = 0

Substitute:

w1 = 1
w2 = 1

So:

x1 + x2 + b = 0

The decision boundary passes through (3,3).

Substitute:

3 + 3 + b = 0

6 + b = 0

b = -6

Therefore, the decision boundary is:

x1 + x2 - 6 = 0

x1 + x2 = 6
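Steps 3 to 5 can be reproduced numerically. This is a minimal sketch assuming NumPy; dividing the direction by its first component is just the simplification used in this example, not a general rule.

```python
import numpy as np

B = np.array([2.0, 2.0])   # closest class -1 point
C = np.array([4.0, 4.0])   # closest class +1 point

midpoint = (B + C) / 2            # point the boundary passes through
direction = C - B                 # normal direction of the boundary
w = direction / direction[0]      # rescale (2, 2) to (1, 1)

# The boundary passes through the midpoint, so w . midpoint + b = 0
b = -float(np.dot(w, midpoint))

print("midpoint:", midpoint, "w:", w, "b:", b)
```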

Step 6: Prediction Function

The prediction function is:

f(x) = x1 + x2 - 6

Decision rule:

If f(x) > 0 → Class +1
If f(x) < 0 → Class -1
If f(x) = 0 → Point lies on boundary

Step 7: Check Each Training Point

Point A(1,1), Class = -1

f(x) = 1 + 1 - 6

f(x) = -4

Since:

f(x) < 0

Prediction:

Class -1

Correct.

Point B(2,2), Class = -1

f(x) = 2 + 2 - 6

f(x) = -2

Since:

f(x) < 0

Prediction:

Class -1

Correct.

Point C(4,4), Class = +1

f(x) = 4 + 4 - 6

f(x) = 2

Since:

f(x) > 0

Prediction:

Class +1

Correct.

Point D(5,5), Class = +1

f(x) = 5 + 5 - 6

f(x) = 4

Since:

f(x) > 0

Prediction:

Class +1

Correct.
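The four checks above can be run in one pass. The sketch assumes NumPy and uses the hand-derived w = (1, 1) and b = -6 from Step 5.

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 4], [5, 5]])   # A, B, C, D
y = np.array([-1, -1, 1, 1])
w = np.array([1, 1])
b = -6

scores = X @ w + b                        # f(x) for every training point
predictions = np.sign(scores).astype(int) # sign of f(x) gives the class

for point, score, pred, actual in zip(X, scores, predictions, y):
    print(point, "f(x) =", score, "predicted:", pred, "actual:", actual)
```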

Step 8: Find Support Vectors

Support vectors are the closest points to the decision boundary.

Distance from point to line:

Distance = |w1x1 + w2x2 + b| / √(w1² + w2²)

Here:

w1 = 1
w2 = 1
b = -6

So:

Distance = |x1 + x2 - 6| / √(1² + 1²)

Distance = |x1 + x2 - 6| / √2

Distance of A(1,1)

Distance = |1 + 1 - 6| / √2

= 4 / √2

= 2.828

Distance of B(2,2)

Distance = |2 + 2 - 6| / √2

= 2 / √2

= 1.414

Distance of C(4,4)

Distance = |4 + 4 - 6| / √2

= 2 / √2

= 1.414

Distance of D(5,5)

Distance = |5 + 5 - 6| / √2

= 4 / √2

= 2.828

The minimum distance is:

1.414

So the closest points are:

B(2,2) and C(4,4)

Therefore:

Support Vectors = B(2,2), C(4,4)
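The distance calculations can be vectorized as follows. This is a sketch assuming NumPy; selecting the minimum-distance points mirrors the reasoning of this step and is not a general support-vector algorithm.

```python
import numpy as np

names = ["A", "B", "C", "D"]
X = np.array([[1, 1], [2, 2], [4, 4], [5, 5]], dtype=float)
w = np.array([1.0, 1.0])
b = -6.0

# Distance = |w . x + b| / ||w|| for every point
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# Points whose distance equals the minimum are the closest ones
closest = [n for n, d in zip(names, distances) if np.isclose(d, distances.min())]

for n, d in zip(names, distances):
    print(n, round(float(d), 3))
print("closest points:", closest)
```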

Step 9: SVM Margin Condition

SVM condition:

y × f(x) ≥ 1

But our current decision function gives:

Point	y	f(x)	y × f(x)
A	-1	-4	4
B	-1	-2	2
C	+1	+2	2
D	+1	+4	4

For support vectors, SVM usually scales the equation so that:

y × f(x) = 1

Currently support vectors give:

y × f(x) = 2

So we scale the function by dividing by 2.

Original function:

f(x) = x1 + x2 - 6

Scaled SVM function:

f(x) = 0.5x1 + 0.5x2 - 3

Now:

w = (0.5, 0.5)
b = -3
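The rescaling can be expressed as dividing w and b by the smallest value of y × f(x). The sketch below assumes NumPy; the variable names are illustrative.

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 4], [5, 5]], dtype=float)
y = np.array([-1, -1, 1, 1])
w = np.array([1.0, 1.0])
b = -6.0

functional_margins = y * (X @ w + b)   # y * f(x) for each point
scale = functional_margins.min()       # value attained by the support vectors

# Divide w and b by the same factor so the support vectors give y * f(x) = 1
w_canonical = w / scale
b_canonical = b / scale

print("scale:", scale, "w:", w_canonical, "b:", b_canonical)
print("y * f(x) after scaling:", y * (X @ w_canonical + b_canonical))
```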

Step 10: Verify Scaled SVM Function

New function:

f(x) = 0.5x1 + 0.5x2 - 3

Point B(2,2)

f(x) = 0.5(2) + 0.5(2) - 3

= 1 + 1 - 3

= -1

Since B has class -1:

y × f(x) = (-1)(-1) = 1

So B is a support vector.

Point C(4,4)

f(x) = 0.5(4) + 0.5(4) - 3

= 2 + 2 - 3

= 1

Since C has class +1:

y × f(x) = (+1)(+1) = 1

So C is a support vector.

Step 11: Margin Calculation

For the scaled SVM function:

w = (0.5, 0.5)

Norm of weight vector:

||w|| = √(0.5² + 0.5²)

= √(0.25 + 0.25)

= √0.5

= 0.707

Margin width:

Margin Width = 2 / ||w||

= 2 / 0.707

= 2.828

So the total margin width is:

2.828 units

This equals the distance between B and C, because the segment from B to C is perpendicular to the decision boundary.
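A quick numerical check of the margin width, assuming NumPy. It also compares the result with the straight-line distance between B and C.

```python
import numpy as np

w = np.array([0.5, 0.5])                 # canonical weight vector
margin_width = 2 / np.linalg.norm(w)     # 2 / ||w||

B = np.array([2.0, 2.0])
C = np.array([4.0, 4.0])
bc_distance = np.linalg.norm(C - B)      # straight-line distance from B to C

print(round(float(margin_width), 3), round(float(bc_distance), 3))
```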

Step 12: Predict New Data Points

Use the decision boundary:

x1 + x2 - 6 = 0

or scaled function:

f(x) = 0.5x1 + 0.5x2 - 3

Both give the same class prediction.

New Point P(6,3)

f(x) = 0.5(6) + 0.5(3) - 3

= 3 + 1.5 - 3

= 1.5

Since:

f(x) > 0

Prediction:

Class +1

New Point Q(1,2)

f(x) = 0.5(1) + 0.5(2) - 3

= 0.5 + 1 - 3

= -1.5

Since:

f(x) < 0

Prediction:

Class -1

New Point R(3,3)

f(x) = 0.5(3) + 0.5(3) - 3

= 1.5 + 1.5 - 3

= 0

So:

R lies exactly on the decision boundary.
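The three predictions can be reproduced with a small helper. This is a sketch assuming NumPy; the function name classify and its "boundary" return value are illustrative choices, not part of any library.

```python
import numpy as np

def classify(point, w=(0.5, 0.5), b=-3.0):
    """Return (f(x), label) using the scaled SVM function."""
    score = float(np.dot(w, point) + b)
    if score > 0:
        return score, "+1"
    if score < 0:
        return score, "-1"
    return score, "boundary"

new_points = {"P": (6, 3), "Q": (1, 2), "R": (3, 3)}
results = {name: classify(p) for name, p in new_points.items()}

for name, (score, label) in results.items():
    print(name, "f(x) =", score, "->", label)
```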

Step 13: Linear Kernel Matrix

Using:

K(xi, xj) = xiᵀxj

Calculate dot products.

Points:

A = (1,1)
B = (2,2)
C = (4,4)
D = (5,5)

Kernel	A	B	C	D
A	2	4	8	10
B	4	8	16	20
C	8	16	32	40
D	10	20	40	50

This matrix represents the similarity between all pairs of points using the Linear Kernel.
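With the points stacked as rows of a matrix X, the whole kernel matrix is the product of X with its transpose. A minimal sketch, assuming NumPy:

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 4], [5, 5]])   # rows: A, B, C, D

# Entry (i, j) is the dot product of point i with point j
K = X @ X.T
print(K)
```

The matrix is symmetric because the dot product does not depend on the order of its arguments.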

Python Implementation

from sklearn.svm import SVC
import numpy as np

# Dataset
X = np.array([
    [1, 1],
    [2, 2],
    [4, 4],
    [5, 5]
])

y = np.array([-1, -1, 1, 1])

# Linear Kernel SVM
model = SVC(kernel="linear", C=1000)

# Train model
model.fit(X, y)

# Predict training data
training_predictions = model.predict(X)

print("Training Predictions:")
for point, actual, pred in zip(X, y, training_predictions):
    print(f"Point: {point}, Actual: {actual}, Predicted: {pred}")

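# Predict new points
# Note: [3, 3] lies on the decision boundary, so f(x) is numerically close to 0
# and the class label that predict() returns for it is not meaningful.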
new_points = np.array([
    [6, 3],
    [1, 2],
    [3, 3]
])

new_predictions = model.predict(new_points)

print("\nNew Point Predictions:")
for point, pred in zip(new_points, new_predictions):
    print(f"Point: {point}, Predicted Class: {pred}")

print("\nSupport Vectors:")
print(model.support_vectors_)

print("\nWeight Vector:")
print(model.coef_)

print("\nBias:")
print(model.intercept_)
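To compare the fitted model with the hand calculation, the signed scores can be inspected with decision_function instead of predict. The sketch below assumes scikit-learn and NumPy are installed; the tolerance of 1e-2 is an arbitrary choice that allows for the solver's numerical precision.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [4, 4], [5, 5]])
y = np.array([-1, -1, 1, 1])

model = SVC(kernel="linear", C=1000).fit(X, y)

# Signed scores f(x) = w . x + b for the new points P, Q, R
new_points = np.array([[6, 3], [1, 2], [3, 3]])
scores = model.decision_function(new_points)

# Hand-derived canonical values from the steps above
w_hand, b_hand = np.array([0.5, 0.5]), -3.0
matches_hand = bool(
    np.allclose(model.coef_[0], w_hand, atol=1e-2)
    and np.isclose(model.intercept_[0], b_hand, atol=1e-2)
)

print("scores:", scores)
print("matches hand-derived w and b:", matches_hand)
```

The score for [3, 3] should come out close to zero, which is why its predicted label should not be trusted.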

Final Result

Dataset:
A(1,1) → -1
B(2,2) → -1
C(4,4) → +1
D(5,5) → +1

Kernel Used:
Linear Kernel

Decision Boundary:
x1 + x2 - 6 = 0

Scaled SVM Function:
f(x) = 0.5x1 + 0.5x2 - 3

Support Vectors:
B(2,2), C(4,4)

Weight Vector:
w = (0.5, 0.5)

Bias:
b = -3

Margin Width:
2.828 units

Prediction:
P(6,3) → +1
Q(1,2) → -1
R(3,3) → Boundary

Important Points

Linear Kernel is used when data is linearly separable, or nearly so.
It calculates similarity using dot product.
It does not transform data into higher dimensions.
The decision boundary is a straight line in two dimensions and a hyperplane in general.
Support vectors are the closest points to the boundary.
The SVM function is scaled so support vectors satisfy y × f(x) = 1.
The margin width is calculated as 2 / ||w||.
Linear Kernel is fast and works well for high-dimensional linearly separable data.

Question 1: Why did we calculate the direction vector between B and C?

Recall our dataset:

Point	Coordinates	Class
A	(1,1)	-1
B	(2,2)	-1
C	(4,4)	+1
D	(5,5)	+1

We found that:

B(2,2)

and

C(4,4)

are the closest points belonging to opposite classes.

These points become the support vectors.

Now draw an imaginary line joining B and C:

            D(+)

         C(+)
        *
       /
      /
     /
    *
 B(-)

A(-)

The line joining B and C tells us the direction in which the two classes are separated.

So we calculate:

C - B

(4-2, 4-2)

(2,2)

This is called the direction vector.

What does (2,2) mean?

It means:

Move 2 units in x direction
Move 2 units in y direction

Graphically:

The arrow points from B to C.

Why do we need this direction?

Because SVM wants a hyperplane that is:

Perpendicular

to the separation direction.

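Think of the separation direction as running from Class -1 to Class +1, with the boundary cutting across it at a right angle:

Class -1
     |
     |   (separation direction)
     |
-----+-----  Boundary
     |
     |
     |
Class +1

In this example, with one support vector on each side, the maximum margin occurs when the boundary is perpendicular to the line joining the two support vectors.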
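Perpendicularity can be checked with a dot product: a vector that runs along the boundary must give zero against the direction C - B. The sketch assumes NumPy; the two points p and q are simply two convenient points on the line x1 + x2 = 6.

```python
import numpy as np

B = np.array([2, 2])
C = np.array([4, 4])
direction_bc = C - B               # (2, 2)

# Two points on the boundary x1 + x2 = 6
p = np.array([0, 6])
q = np.array([6, 0])
along_boundary = q - p             # a vector lying along the boundary

# A zero dot product means the two vectors are perpendicular
dot = int(np.dot(along_boundary, direction_bc))
print("dot product:", dot)
```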

Question 2: Why did we simplify (2,2) to (1,1)?

This confuses almost everyone initially.

Let's understand.

We got:

Direction = (2,2)

Notice:

(2,2)
=
2 × (1,1)

Imagine three arrows:

(1,1)

(2,2)

(10,10)

Visual:

      ↗

      ↗↗

      ↗↗↗↗↗↗↗↗↗↗

All arrows point in exactly the same direction.

Only their length changes.

For locating the boundary in Linear SVM:

Direction of w matters
Magnitude of w does not, as long as b is scaled by the same factor

Therefore:

(2,2)

and

(1,1)

represent the same direction.

So we simplify to:

w = (1,1)

to make calculations easier.

Real Mathematical Reason

Suppose we keep:

w = (2,2)

Hyperplane:

2x1 + 2x2 + b = 0

Passing through midpoint (3,3):

2(3)+2(3)+b=0

12+b=0

b=-12

Equation:

2x1 + 2x2 -12 = 0

Divide entire equation by 2:

x1 + x2 -6 = 0

Same line.

Exactly the same boundary.

Nothing changes.

Why does dividing not change the line?

Example:

Equation 1:

2x + 2y -12 = 0

Equation 2:

x + y -6 = 0

Take point:

(4,2)

Check Equation 1:

2(4)+2(2)-12

= 8+4-12

= 0

Check Equation 2:

4+2-6

= 0

The same point satisfies both equations.

Therefore:

Same line

The Most Important Concept

In SVM:

The hyperplane is:

w·x + b = 0

If we multiply everything by 10:

10w·x + 10b = 0

it still represents the same hyperplane.

Example:

x+y-6=0

and

100x+100y-600=0

are identical lines.

Therefore:

w=(1,1)

w=(2,2)

w=(10,10)

all point in the same direction and can describe the same boundary after adjusting b.
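The effect of scaling can be checked numerically: multiplying w and b by the same positive factor leaves every predicted sign unchanged. A minimal sketch, assuming NumPy; note that a negative factor would keep the line but swap the two classes.

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 4], [5, 5]], dtype=float)
w = np.array([1.0, 1.0])
b = -6.0

signs_by_scale = {}
for k in (0.5, 1, 2, 10, 100):
    # Scale w and b together, then record the sign of f(x) for each point
    scores = X @ (k * w) + k * b
    signs_by_scale[k] = np.sign(scores).astype(int).tolist()

for k, signs in signs_by_scale.items():
    print("scale", k, "->", signs)
```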

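What Actually Matters?

For SVM:

Direction of w

determines:

Orientation of hyperplane

while

Magnitude of w

sets:

The scale of f(x)

Scaling w and b together does not move the boundary or change the geometric margin. Only after the scale is fixed so that support vectors satisfy y × f(x) = 1 does ||w|| give the margin width, through 2 / ||w||.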

That's why we simplified:

(2,2)

(1,1)

because both point in the same direction and generate the same separating line after scaling.

This scaling property is one of the reasons SVM equations often look different in textbooks even though they represent the same decision boundary.

Question 3: Why did we scale w to (0.5, 0.5)?

Nothing in the geometry forced us to use (0.5, 0.5) from the start. We rescaled later so that the function satisfies the canonical SVM margin condition.

Let's understand carefully.

Step 1: Our Original Boundary

We found:

w = (1,1)
b = -6

Therefore:

f(x) = x1 + x2 - 6

This boundary correctly classifies all points.

Check Support Vectors

Support vectors:

B(2,2) → Class -1
C(4,4) → Class +1

For B:

f(B) = 2 + 2 - 6

= -2

For C:

f(C) = 4 + 4 - 6

= 2

Now calculate:

y × f(x)

For B:

(-1)(-2)

= 2

For C:

(+1)(2)

= 2

So:

y × f(x) = 2

for support vectors.

Step 2: What Does SVM Want?

In SVM theory, we define the hyperplane in a special form:

y × f(x) ≥ 1

and support vectors must satisfy:

y × f(x) = 1

exactly.

This is called the canonical hyperplane, or canonical SVM form.

Our Current Hyperplane

Currently:

y × f(x) = 2

for support vectors.

But SVM prefers:

y × f(x) = 1

Step 3: How to Convert?

Current function:

f(x)=x1+x2−6

Support vectors give:

±2

We want:

±1

Therefore divide the whole equation by 2.

New function:

f(x)=0.5x1+0.5x2−3

Now check support vectors.

For B(2,2):

f(B)=0.5(2)+0.5(2)-3

=1+1-3

=-1

Now:

y×f(x)

=(-1)(-1)

=1

Perfect.

For C(4,4):

f(C)=0.5(4)+0.5(4)-3

=2+2-3

=1

Now:

(+1)(1)

=1

Perfect.

Why Not Scale to 0.25?

We could, and the line would stay the same. But the function would no longer be in canonical form.

Suppose:

f(x) = 0.25x1+0.25x2−1.5

Check B:

0.25(2)+0.25(2)-1.5

=0.5+0.5-1.5

=-0.5

Then:

y×f(x)=0.5

Not equal to 1.

So it is not the canonical SVM form.

Note:

All these equations represent the same line:

x+y−6=0

2x+2y−12=0

0.5x+0.5y−3=0

100x+100y−600=0

Same geometric boundary.

Then Why Does SVM Care?

Because margin calculation uses:

Margin = 2/||w||

This formula only works correctly when:

Support vectors satisfy

y×f(x)=1

That's why SVM rescales the weights.
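The dependence on the canonical scale can be seen by comparing 2 / ||w|| with the scale-independent geometric margin 2 × |f(support vector)| / ||w|| at several scales. The sketch assumes NumPy; the names naive and geometric are illustrative.

```python
import numpy as np

C = np.array([4.0, 4.0])            # support vector of class +1
base_w = np.array([1.0, 1.0])
base_b = -6.0

rows = {}
for k in (0.25, 0.5, 1.0, 2.0):
    w, b = k * base_w, k * base_b
    sv_value = abs(float(C @ w + b))              # |f(C)| at this scale
    naive = 2 / np.linalg.norm(w)                 # valid only if |f(C)| = 1
    geometric = 2 * sv_value / np.linalg.norm(w)  # same at every scale
    rows[k] = (float(naive), float(geometric))

for k, (naive, geometric) in rows.items():
    print("scale", k, "2/||w|| =", round(naive, 3), "geometric =", round(geometric, 3))
```

Only at scale 0.5, where the support vectors give y × f(x) = 1, do the two values agree.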

The Real Meaning

Think of it like measuring temperature.

Suppose:

Temperature = 100°F

Temperature = 37.8°C

Different scales.

Same temperature.

Similarly:

w=(1,1)

and

w=(0.5,0.5)

describe the same boundary after adjusting the bias.

SVM chooses the scale where:

Support Vectors → y×f(x)=1

because it makes optimization and margin computation mathematically convenient.