Linear Kernel: Example
The Linear Kernel is the simplest kernel function in SVM.
It is used when the data can be separated using a straight line or hyperplane.
The Linear Kernel does not transform the data into higher dimensions. It works directly on the original input features.
Dataset
Consider the following dataset:
| Point | x1 | x2 | Class |
|---|---|---|---|
| A | 1 | 1 | -1 |
| B | 2 | 2 | -1 |
| C | 4 | 4 | +1 |
| D | 5 | 5 | +1 |
Visual representation:
(5,5) D(+1)
(4,4) C(+1)
(2,2) B(-1)
(1,1) A(-1)
The classes can be separated using a straight line.
So we can use Linear Kernel SVM.
Step 1: Linear Kernel Formula
The Linear Kernel formula is:
K(xi, xj) = xiᵀxj
This means:
K(xi, xj) = dot product of two data points
For two points:
xi = (a, b)
xj = (c, d)
The dot product is:
xiᵀxj = ac + bd
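The formula above can be sketched in a few lines of plain Python. The function name `linear_kernel` is chosen here for illustration and is not part of any library.

```python
def linear_kernel(xi, xj):
    """Linear kernel: the dot product of two points of equal length."""
    return sum(a * b for a, b in zip(xi, xj))


# Two points from the dataset: B = (2, 2) and C = (4, 4)
k_bc = linear_kernel((2, 2), (4, 4))
print(k_bc)  # 2*4 + 2*4 = 16
```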
Step 2: Why Linear Kernel Works Here
Class -1 points:
A(1,1), B(2,2)
Class +1 points:
C(4,4), D(5,5)
The closest opposite-class points are:
B(2,2) and C(4,4)
These points will decide the maximum-margin boundary.
Step 3: Find the Midpoint Between Closest Points
Closest points:
B(2,2), C(4,4)
Midpoint formula for two points (a, b) and (c, d):
Midpoint = ((a + c)/2, (b + d)/2)
Substitute values:
Midpoint = ((2 + 4)/2, (2 + 4)/2)
Midpoint = (3,3)
So the decision boundary passes through:
(3,3)
Step 4: Find Direction Vector
Direction from B to C:
C - B = (4 - 2, 4 - 2)
= (2,2)
The decision boundary must be perpendicular to this direction.
So the normal vector of the hyperplane can be:
w = (2,2)
We simplify it as:
w = (1,1)
because both vectors point in the same direction.
So:
w1 = 1
w2 = 1
Step 5: Find Bias Value
General hyperplane equation:
w1x1 + w2x2 + b = 0
Substitute:
w1 = 1
w2 = 1
So:
x1 + x2 + b = 0
The decision boundary passes through (3,3).
Substitute:
3 + 3 + b = 0
6 + b = 0
b = -6
Therefore, the decision boundary is:
x1 + x2 - 6 = 0
or
x1 + x2 = 6
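Steps 3 to 5 can be reproduced numerically. The sketch below is a shortcut that relies on this dataset having a single closest pair of opposite-class points. It is not a general SVM solver, and the variable names are illustrative.

```python
B = (2, 2)  # closest point of class -1
C = (4, 4)  # closest point of class +1

# Step 3: midpoint of the segment from B to C
midpoint = tuple((p + q) / 2 for p, q in zip(B, C))

# Step 4: direction from B to C, used as the normal vector of the boundary
direction = tuple(q - p for p, q in zip(B, C))

# Simplify (2, 2) to (1, 1) by dividing by the first component.
# Assumption: the first component is non-zero, which holds for this example.
w = tuple(d / direction[0] for d in direction)

# Step 5: the boundary passes through the midpoint, so w . midpoint + b = 0
b = -sum(wi * mi for wi, mi in zip(w, midpoint))

print(midpoint, direction, w, b)
```

Running it should give the midpoint (3, 3), the direction (2, 2), w = (1, 1), and b = -6, matching the hand calculation.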
Step 6: Prediction Function
The prediction function is:
f(x) = x1 + x2 - 6
Decision rule:
If f(x) > 0 → Class +1
If f(x) < 0 → Class -1
If f(x) = 0 → Point lies on boundary
Step 7: Check Each Training Point
Point A(1,1), Class = -1
f(x) = 1 + 1 - 6
f(x) = -4
Since:
f(x) < 0
Prediction:
Class -1
Correct.
Point B(2,2), Class = -1
f(x) = 2 + 2 - 6
f(x) = -2
Since:
f(x) < 0
Prediction:
Class -1
Correct.
Point C(4,4), Class = +1
f(x) = 4 + 4 - 6
f(x) = 2
Since:
f(x) > 0
Prediction:
Class +1
Correct.
Point D(5,5), Class = +1
f(x) = 5 + 5 - 6
f(x) = 4
Since:
f(x) > 0
Prediction:
Class +1
Correct.
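The four checks in Step 7 can also be run in a loop. This is a minimal sketch; the helper names `f` and `classify`, and the convention of returning 0 for a point on the boundary, are assumptions made for this illustration.

```python
def f(x1, x2):
    """Decision function from Step 6: f(x) = x1 + x2 - 6."""
    return x1 + x2 - 6


def classify(x1, x2):
    """Return +1 or -1 by the sign of f(x); 0 marks a point on the boundary."""
    score = f(x1, x2)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


training = {"A": ((1, 1), -1), "B": ((2, 2), -1), "C": ((4, 4), 1), "D": ((5, 5), 1)}

results = {}
for name, (point, label) in training.items():
    # Store the score and whether the predicted class equals the true label
    results[name] = (f(*point), classify(*point) == label)
    print(name, point, "f(x) =", f(*point), "correct:", results[name][1])
```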
Step 8: Find Support Vectors
Support vectors are the training points closest to the decision boundary.
Distance from point to line:
Distance = |w1x1 + w2x2 + b| / √(w1² + w2²)
Here:
w1 = 1
w2 = 1
b = -6
So:
Distance = |x1 + x2 - 6| / √(1² + 1²)
Distance = |x1 + x2 - 6| / √2
Distance of A(1,1)
Distance = |1 + 1 - 6| / √2
= 4 / √2
≈ 2.828
Distance of B(2,2)
Distance = |2 + 2 - 6| / √2
= 2 / √2
≈ 1.414
Distance of C(4,4)
Distance = |4 + 4 - 6| / √2
= 2 / √2
≈ 1.414
Distance of D(5,5)
Distance = |5 + 5 - 6| / √2
= 4 / √2
≈ 2.828
The minimum distance is:
√2 ≈ 1.414
So the closest points are:
B(2,2) and C(4,4)
Therefore:
Support Vectors = B(2,2), C(4,4)
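The distance check can be sketched as follows. The helper name `distance_to_boundary` is illustrative, and picking the minimum-distance points as support vectors mirrors the reasoning of this step, not how a solver finds them.

```python
import math

points = {"A": (1, 1), "B": (2, 2), "C": (4, 4), "D": (5, 5)}
w = (1, 1)
b = -6


def distance_to_boundary(point):
    """Perpendicular distance |w . x + b| / ||w|| from a point to the line."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return abs(score) / math.hypot(w[0], w[1])


distances = {name: distance_to_boundary(p) for name, p in points.items()}
smallest = min(distances.values())

# Points whose distance matches the minimum (up to rounding) are the support vectors
support_vectors = [n for n, d in distances.items() if math.isclose(d, smallest)]

for name, d in distances.items():
    print(name, round(d, 3))
print("Support vectors:", support_vectors)
```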
Step 9: SVM Margin Condition
SVM condition:
y × f(x) ≥ 1
But our current decision function gives:
| Point | y | f(x) | y × f(x) |
|---|---|---|---|
| A | -1 | -4 | 4 |
| B | -1 | -2 | 2 |
| C | +1 | +2 | 2 |
| D | +1 | +4 | 4 |
For support vectors, SVM usually scales the equation so that:
y × f(x) = 1
Currently support vectors give:
y × f(x) = 2
So we scale the function by dividing by 2.
Original function:
f(x) = x1 + x2 - 6
Scaled SVM function:
f(x) = 0.5x1 + 0.5x2 - 3
Now:
w = (0.5, 0.5)
b = -3
Step 10: Verify Scaled SVM Function
New function:
f(x) = 0.5x1 + 0.5x2 - 3
Point B(2,2)
f(x) = 0.5(2) + 0.5(2) - 3
= 1 + 1 - 3
= -1
Since B has class -1:
y × f(x) = (-1)(-1) = 1
So B is a support vector.
Point C(4,4)
f(x) = 0.5(4) + 0.5(4) - 3
= 2 + 2 - 3
= 1
Since C has class +1:
y × f(x) = (+1)(+1) = 1
So C is a support vector.
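The rescaling in Steps 9 and 10 amounts to dividing w and b by the smallest value of y × f(x) over the training set. A minimal sketch, where the name `functional_margin` (a common term for y × f(x)) is introduced here for illustration:

```python
training = [((1, 1), -1), ((2, 2), -1), ((4, 4), 1), ((5, 5), 1)]
w = (1.0, 1.0)
b = -6.0


def functional_margin(point, label, w, b):
    """Return y * f(x) for one labelled point."""
    return label * (w[0] * point[0] + w[1] * point[1] + b)


# Smallest y * f(x) over the training set; the support vectors attain it
smallest = min(functional_margin(p, y, w, b) for p, y in training)

# Divide w and b by that value so the support vectors give y * f(x) = 1
w_scaled = tuple(wi / smallest for wi in w)
b_scaled = b / smallest
scaled_margins = [functional_margin(p, y, w_scaled, b_scaled) for p, y in training]

print(w_scaled, b_scaled)
print(scaled_margins)
```

After scaling, B and C should give exactly 1 while A and D give 2, so every point satisfies y × f(x) ≥ 1.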
Step 11: Margin Calculation
For the scaled SVM function:
w = (0.5, 0.5)
Norm of weight vector:
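||w|| = √(0.5² + 0.5²)
= √(0.25 + 0.25)
= √0.5
≈ 0.707
Margin width:
Margin Width = 2 / ||w||
≈ 2 / 0.707
≈ 2.828
So the total margin width is:
approximately 2.828 units
This is equal to the distance between B and C, √((4 - 2)² + (4 - 2)²) = √8 ≈ 2.828, because the segment from B to C is perpendicular to the boundary.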
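As a quick numerical check of this step, the sketch below computes 2 / ||w|| and compares it with the straight-line distance between B and C. It assumes Python 3.8 or newer for `math.dist`. The two values agree here only because the segment from B to C is perpendicular to the boundary.

```python
import math

w = (0.5, 0.5)            # canonical weight vector from Step 9
B, C = (2, 2), (4, 4)     # support vectors

norm_w = math.hypot(*w)          # ||w||
margin_width = 2 / norm_w        # 2 / ||w||
gap_bc = math.dist(B, C)         # Euclidean distance between B and C

print(round(norm_w, 3), round(margin_width, 3), round(gap_bc, 3))
```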
Step 12: Predict New Data Points
Use the unscaled function:
f(x) = x1 + x2 - 6
or the scaled function:
f(x) = 0.5x1 + 0.5x2 - 3
Both give the same class prediction, because dividing f(x) by a positive constant does not change its sign.
New Point P(6,3)
f(x) = 0.5(6) + 0.5(3) - 3
= 3 + 1.5 - 3
= 1.5
Since:
f(x) > 0
Prediction:
Class +1
New Point Q(1,2)
f(x) = 0.5(1) + 0.5(2) - 3
= 0.5 + 1 - 3
= -1.5
Since:
f(x) < 0
Prediction:
Class -1
New Point R(3,3)
f(x) = 0.5(3) + 0.5(3) - 3
= 1.5 + 1.5 - 3
= 0
So:
R lies exactly on the decision boundary.
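The three predictions can be reproduced with both forms of the function. In this sketch the helper `sign` and its convention of returning 0 for a boundary point are assumptions made for illustration.

```python
def f_unscaled(x1, x2):
    return x1 + x2 - 6


def f_scaled(x1, x2):
    return 0.5 * x1 + 0.5 * x2 - 3


def sign(value):
    """Return +1, -1, or 0 (0 means the point lies on the boundary)."""
    return (value > 0) - (value < 0)


new_points = {"P": (6, 3), "Q": (1, 2), "R": (3, 3)}

predictions = {}
for name, point in new_points.items():
    # Dividing f by a positive constant cannot change its sign
    assert sign(f_unscaled(*point)) == sign(f_scaled(*point))
    predictions[name] = sign(f_scaled(*point))
    print(name, point, f_scaled(*point), predictions[name])
```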
Step 13: Linear Kernel Matrix
Using:
K(xi, xj) = xiᵀxj
Calculate dot products.
Points:
A = (1,1)
B = (2,2)
C = (4,4)
D = (5,5)
| Kernel | A | B | C | D |
|---|---|---|---|---|
| A | 2 | 4 | 8 | 10 |
| B | 4 | 8 | 16 | 20 |
| C | 8 | 16 | 32 | 40 |
| D | 10 | 20 | 40 | 50 |
This matrix represents the similarity between all pairs of points using the Linear Kernel.
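The table can be generated directly from the points. The sketch below uses plain Python lists, and the variable name `K` is chosen for illustration.

```python
names = ["A", "B", "C", "D"]
points = [(1, 1), (2, 2), (4, 4), (5, 5)]

# Gram matrix: entry [i][j] is the dot product of point i and point j
K = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

for name, row in zip(names, K):
    print(name, row)
```

The matrix is symmetric because the dot product does not depend on the order of its arguments.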
Python Implementation
from sklearn.svm import SVC
import numpy as np

# Dataset: four 2-D points with labels -1 / +1
X = np.array([
    [1, 1],
    [2, 2],
    [4, 4],
    [5, 5]
])
y = np.array([-1, -1, 1, 1])

# Linear Kernel SVM (a large C approximates a hard margin)
model = SVC(kernel="linear", C=1000)

# Train model
model.fit(X, y)

# Predict training data
training_predictions = model.predict(X)
print("Training Predictions:")
for point, actual, pred in zip(X, y, training_predictions):
    print(f"Point: {point}, Actual: {actual}, Predicted: {pred}")

# Predict new points
# Note: (3, 3) lies on the boundary, so predict() still returns one of the
# two labels; model.decision_function(new_points) shows the score near 0.
new_points = np.array([
    [6, 3],
    [1, 2],
    [3, 3]
])
new_predictions = model.predict(new_points)
print("\nNew Point Predictions:")
for point, pred in zip(new_points, new_predictions):
    print(f"Point: {point}, Predicted Class: {pred}")

# Learned parameters, to compare with the hand calculation
print("\nSupport Vectors:")
print(model.support_vectors_)
print("\nWeight Vector:")
print(model.coef_)
print("\nBias:")
print(model.intercept_)
Final Result
Dataset:
A(1,1) → -1
B(2,2) → -1
C(4,4) → +1
D(5,5) → +1
Kernel Used:
Linear Kernel
Decision Boundary:
x1 + x2 - 6 = 0
Scaled SVM Function:
f(x) = 0.5x1 + 0.5x2 - 3
Support Vectors:
B(2,2), C(4,4)
Weight Vector:
w = (0.5, 0.5)
Bias:
b = -3
Margin Width:
2.828 units
Prediction:
P(6,3) → +1
Q(1,2) → -1
R(3,3) → Boundary
Important Points
- Linear Kernel is used when data is linearly separable.
- It calculates similarity using dot product.
- It does not transform data into higher dimensions.
- The decision boundary is a straight line.
- Support vectors are the closest points to the boundary.
- The SVM function is scaled so support vectors satisfy y × f(x) = 1.
- The margin width is calculated as 2 / ||w||.
- Linear Kernel is fast and works well for high-dimensional linearly separable data.
Question 1: Why did we calculate the direction vector between B and C?
Recall our dataset:
| Point | Coordinates | Class |
|---|---|---|
| A | (1,1) | -1 |
| B | (2,2) | -1 |
| C | (4,4) | +1 |
| D | (5,5) | +1 |
We found that:
B(2,2)
and
C(4,4)
are the closest points belonging to opposite classes.
These points become the support vectors.
Now picture a line joining B and C:
[Sketch: A(-), B(-), C(+), and D(+) lie along the same diagonal, with a segment drawn from B up to C.]
The line joining B and C tells us the direction in which the two classes are separated.
So we calculate:
C - B = (4 - 2, 4 - 2)
= (2,2)
This is called the direction vector.
What does (2,2) mean?
It means:
Move 2 units in the x1 direction
Move 2 units in the x2 direction
Graphically, this is an arrow pointing from B(2,2) up and to the right to C(4,4).
Why do we need this direction?
Because SVM wants a hyperplane that is perpendicular to the separation direction, with Class -1 on one side of the boundary and Class +1 on the other.
In this example, a single closest pair (B and C) determines the solution, so the margin is widest when the boundary is perpendicular to the line joining them and passes through their midpoint.
Question 2: Why did we simplify (2,2) to (1,1)?
This confuses almost everyone initially.
Let's understand.
We got:
Direction = (2,2)
Notice:
(2,2) = 2 × (1,1)
Imagine three arrows:
(1,1)
(2,2)
(10,10)
Visual:
↗
↗↗
↗↗↗↗↗↗↗↗↗↗
All arrows point in exactly the same direction.
Only their length changes.
In Linear SVM, for the position of the boundary:
Direction of w matters
Magnitude of w does not (as long as b is scaled along with it)
Therefore:
(2,2)
and
(1,1)
represent the same direction.
So we simplify to:
w = (1,1)
to make calculations easier.
Real Mathematical Reason
Suppose we keep:
w = (2,2)
Hyperplane:
2x1 + 2x2 + b = 0
Passing through midpoint (3,3):
2(3)+2(3)+b=0
12+b=0
b=-12
Equation:
2x1 + 2x2 -12 = 0
Divide entire equation by 2:
x1 + x2 -6 = 0
Same line.
Exactly the same boundary.
Nothing changes.
Why does dividing not change the line?
Example:
Equation 1:
2x + 2y -12 = 0
Equation 2:
x + y -6 = 0
Take point:
(4,2)
Check Equation 1:
2(4)+2(2)-12
8+4-12
0
Check Equation 2:
4+2-6
0
Same point satisfies both.
Therefore:
Same line
The Most Important Concept
In SVM:
The hyperplane is:
w·x + b = 0
If we multiply everything by 10:
10w·x + 10b = 0
it still represents the same hyperplane.
Example:
x+y-6=0
and
100x+100y-600=0
are identical lines.
Therefore:
w=(1,1)
w=(2,2)
w=(10,10)
all point in the same direction and can describe the same boundary after adjusting b.
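The scaling argument can be checked numerically. This is a minimal sketch with illustrative test points. It covers positive scale factors only, since a negative factor keeps the same line but swaps which side is labelled +1.

```python
def f(point, w, b):
    return w[0] * point[0] + w[1] * point[1] + b


base_w, base_b = (1, 1), -6
scales = [1, 2, 10, 100]

on_line = [(4, 2), (3, 3), (0, 6)]   # points with x1 + x2 = 6
off_line = [(1, 1), (5, 5)]          # one point on each side

same_line = True
for k in scales:
    w = (k * base_w[0], k * base_w[1])
    b = k * base_b
    # Points on the line give 0 for every scale
    same_line = same_line and all(f(p, w, b) == 0 for p in on_line)
    # Points off the line keep their side (sign); only the value of f changes
    same_line = same_line and f(off_line[0], w, b) < 0 < f(off_line[1], w, b)
    print(k, w, b, [f(p, w, b) for p in off_line])

print("Same boundary for every scale:", same_line)
```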
What Actually Matters?
For SVM, the direction of w determines the orientation of the hyperplane.
The magnitude of w, once the function is in canonical form, sets the margin width (2 / ||w||), but it does not change the orientation.
That's why we simplified:
(2,2)
to
(1,1)
because both point in the same direction and generate the same separating line after scaling.
This scaling property is one of the reasons SVM equations often look different in textbooks even though they represent the same decision boundary.
Question 3: Why did we later scale w to (0.5, 0.5)?
The boundary itself did not require w = (0.5, 0.5); w = (1,1) with b = -6 already describes it. We scaled later only to satisfy the canonical SVM margin condition.
Let's understand carefully.
Step 1: Our Original Boundary
We found:
w = (1,1)
b = -6
Therefore:
f(x) = x1 + x2 - 6
This boundary correctly classifies all points.
Check Support Vectors
Support vectors:
B(2,2) → Class -1
C(4,4) → Class +1
For B:
f(B) = 2 + 2 - 6
= -2
For C:
f(C) = 4 + 4 - 6
= 2
Now calculate:
y × f(x)
For B:
(-1)(-2)
= 2
For C:
(+1)(2)
= 2
So:
y × f(x) = 2
for support vectors.
Step 2: What Does SVM Want?
In SVM theory, we define the hyperplane in a special form:
y × f(x) ≥ 1
and support vectors must satisfy:
y × f(x) = 1
exactly.
This is called the Canonical Hyperplane, or Canonical SVM Form.
Our Current Hyperplane
Currently:
y × f(x) = 2
for support vectors.
But SVM prefers:
y × f(x) = 1
Step 3: How to Convert?
Current function:
f(x) = x1 + x2 - 6
Support vectors give:
f(x) = ±2
We want:
f(x) = ±1
Therefore divide the whole equation by 2.
New function:
f(x) = 0.5x1 + 0.5x2 - 3
Now check support vectors.
For B(2,2):
f(B)=0.5(2)+0.5(2)-3
=1+1-3
=-1
Now:
y×f(x)
=(-1)(-1)
=1
Perfect.
For C(4,4):
f(C)=0.5(4)+0.5(4)-3
=2+2-3
=1
Now:
(+1)(1)
=1
Perfect.
Why Not Scale to 0.25?
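We could, and the line itself would not move, but the function would no longer be in canonical form.
Suppose:
f(x) = 0.25x1 + 0.25x2 - 1.5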
Check B:
0.25(2)+0.25(2)-1.5
=0.5+0.5-1.5
=-0.5
Then:
y×f(x)=0.5
Not equal to 1.
So it is not the canonical SVM form.
Note:
All these equations represent the same line:
x+y−6=0
2x+2y−12=0
0.5x+0.5y−3=0
100x+100y−600=0
Same geometric boundary.
Then Why Does SVM Care?
Because margin calculation uses:
Margin = 2/||w||
This formula only works correctly when:
Support vectors satisfy
y×f(x)=1
That's why SVM rescales the weights.
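To see why the scale matters for the margin formula, the sketch below evaluates 2 / ||w|| for three scalings of the same line and compares each with the actual gap between the margin lines. In this example that gap equals the distance from B to C. The dictionary keys and variable names are illustrative, and `math.dist` assumes Python 3.8 or newer.

```python
import math

B, C = (2, 2), (4, 4)          # support vectors, labels -1 and +1
true_width = math.dist(B, C)   # geometric gap between the margin lines here

candidates = {
    "w=(1,1)": ((1.0, 1.0), -6.0),
    "w=(0.5,0.5)": ((0.5, 0.5), -3.0),
    "w=(0.25,0.25)": ((0.25, 0.25), -1.5),
}

matches = {}
for name, (w, b) in candidates.items():
    y_f = w[0] * C[0] + w[1] * C[1] + b      # y * f(x) at C, where y = +1
    formula = 2 / math.hypot(*w)             # 2 / ||w||
    matches[name] = math.isclose(formula, true_width)
    print(name, "y*f(x) =", y_f, "2/||w|| =", round(formula, 3), "matches:", matches[name])
```

Only the canonical scaling, where y × f(x) = 1 at the support vectors, should reproduce the true width of about 2.828.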
The Real Meaning
Think of it like measuring the same quantity in different units.
Suppose:
Temperature = 100°F
or
Temperature = 37.8°C
Different scales.
Same temperature.
Similarly:
w=(1,1)
and
w=(0.5,0.5)
describe the same boundary after adjusting the bias.
SVM chooses the scale where:
Support Vectors → y×f(x)=1
because it makes optimization and margin computation mathematically convenient.