Gradient Descent in LR
Gradient Descent is an optimization algorithm used to find the best values of the slope (m) and intercept (b) in Linear Regression (LR). It minimizes prediction error by updating the model parameters iteratively, one small step at a time.
Instead of calculating the best-fit line directly using formulas, Gradient Descent gradually learns the optimal line through iterations.
Why Gradient Descent is Needed
Suppose a regression model predicts values poorly.
Example:
| Actual Marks | Predicted Marks |
|---|---|
| 50 | 30 |
| 60 | 35 |
| 70 | 40 |
The prediction error is high.
Gradient Descent helps:
- Reduce prediction error
- Improve model accuracy
- Find optimal values of m and b
Main Idea of Gradient Descent
Gradient Descent works like this:
1. Start with initial values of m and b (random, or zero as in the worked example in this section)
2. Calculate prediction error
3. Update m and b
4. Repeat until error becomes very small
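The four steps above can be sketched as a loop that stops once the error is very small. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper name `fit_until_converged`, the tolerance `tol`, and the cap `max_iters` are hypothetical choices made for illustration, and the data is the small dataset from the worked example in this section.

```python
import numpy as np

def fit_until_converged(X, Y, lr=0.01, tol=1e-6, max_iters=10_000):
    """Run gradient descent until the MSE drops below tol (hypothetical helper)."""
    m, b = 0.0, 0.0                      # Step 1: initial values (zero here)
    n = len(X)
    for i in range(max_iters):
        error = Y - (m * X + b)          # Step 2: prediction error
        mse = np.mean(error ** 2)
        if mse < tol:                    # Step 4: stop when the error is very small
            return m, b, mse, i
        dm = (-2 / n) * np.sum(X * error)
        db = (-2 / n) * np.sum(error)
        m -= lr * dm                     # Step 3: update m and b
        b -= lr * db
    # Reached the iteration cap without meeting the tolerance
    return m, b, np.mean((Y - (m * X + b)) ** 2), max_iters

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])
m, b, mse, iters = fit_until_converged(X, Y)
print(f"m={m:.4f}, b={b:.4f}, mse={mse:.2e}, iterations={iters}")
```

A threshold on the error is only one possible stopping rule; a fixed number of iterations, as in the Python example later in this section, is equally common.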
Linear Regression Equation
y = mx + b
Where:
- y → Predicted output
- x → Input feature
- m → Slope
- b → Intercept
Cost Function
Gradient Descent minimizes the Cost Function.
The most common cost function is:
Mean Squared Error (MSE)
MSE Formula
MSE = (1/n) * Σ(actual_y - predicted_y)^2
Goal:
Minimize MSE
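As a small illustration of the formula, the sketch below computes the MSE for the marks table shown earlier (actual 50, 60, 70 against predicted 30, 35, 40). The function name `mse` is an assumption made for this example.

```python
import numpy as np

def mse(actual, predicted):
    """Mean Squared Error: the average of the squared differences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

actual_marks = [50, 60, 70]
predicted_marks = [30, 35, 40]

# Squared errors are 400, 625 and 900, so the MSE is 1925 / 3
print(f"MSE = {mse(actual_marks, predicted_marks):.2f}")  # MSE = 641.67
```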
Important Terms
1. Learning Rate
The learning rate controls how big each update step is.
- Small learning rate: slow learning, more iterations needed
- Large learning rate: may overshoot the minimum, unstable learning
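The effect of the step size can be seen by running the same number of updates with different learning rates. The sketch below is illustrative only: the helper `mse_after` and the three rates are assumptions chosen for the small dataset X = 1, 2, 3 and Y = 2, 4, 6 used in this section's worked example.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])

def mse_after(lr, steps):
    """Return the MSE after `steps` gradient descent updates from m = b = 0."""
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(steps):
        error = Y - (m * X + b)
        m -= lr * (-2 / n) * np.sum(X * error)
        b -= lr * (-2 / n) * np.sum(error)
    return np.mean((Y - (m * X + b)) ** 2)

for lr in (0.001, 0.01, 0.2):
    print(f"lr={lr}: MSE after 50 steps = {mse_after(lr, 50):.4g}")
```

On this dataset the smallest rate reduces the error slowly, 0.01 reduces it much faster, and 0.2 is large enough that every step overshoots and the error grows instead of shrinking.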
2. Iterations
The number of iterations is how many times the parameters are updated.
More iterations usually reduce the error, up to the point where the model has converged.
Mathematical Example
Dataset
| X | Y |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
Here:
- X is the input value
- Y is the actual output value
The goal is to find the best line for prediction.
Linear Regression Equation
predicted_y = mx + b
Where:
- m = slope
- b = intercept
- x = input value
- predicted_y = predicted output
Initial Values
Let us start with:
m = 0
b = 0
Learning Rate = 0.01
n = 3
Here:
- m and b are initialized with zero
- Learning Rate controls the size of each update step
- n is the number of data points
Step 1: Calculate Predictions
Using:
predicted_y = mx + b
Since:
m = 0
b = 0
For X = 1:
predicted_y = (0 * 1) + 0 = 0
For X = 2:
predicted_y = (0 * 2) + 0 = 0
For X = 3:
predicted_y = (0 * 3) + 0 = 0
Step 2: Calculate Errors
Error = Actual Y - Predicted Y
| X | Actual Y | Predicted Y | Error |
|---|---|---|---|
| 1 | 2 | 0 | 2 |
| 2 | 4 | 0 | 4 |
| 3 | 6 | 0 | 6 |
At the starting point, the errors are large because the model has not learned yet.
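Steps 1 and 2 can be reproduced in a few lines of NumPy. This is a small sketch of the arithmetic above, using the same variable names as the formulas.

```python
import numpy as np

X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])
m, b = 0, 0

predicted_y = m * X + b      # Step 1: every prediction is 0
error = Y - predicted_y      # Step 2: error = actual - predicted

print(predicted_y)  # [0 0 0]
print(error)        # [2 4 6]
```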
Step 3: Formula for Updating m and b
Gradient Descent updates m and b using these formulas:
m = m - LearningRate * dm
b = b - LearningRate * db
Where:
- dm = partial derivative of the cost (MSE) with respect to m
- db = partial derivative of the cost (MSE) with respect to b
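For readers who want to see where dm and db come from, the derivation below differentiates the MSE with respect to m and b. The symbols J (the cost), the index i, and the hat notation for predicted_y are notation introduced here for the derivation; the results match the formulas used in Steps 4 and 5.

```latex
J(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2,
\qquad \hat{y}_i = m x_i + b

dm = \frac{\partial J}{\partial m}
   = \frac{1}{n} \sum_{i=1}^{n} 2 \bigl( y_i - \hat{y}_i \bigr) (-x_i)
   = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - \hat{y}_i \bigr)

db = \frac{\partial J}{\partial b}
   = \frac{1}{n} \sum_{i=1}^{n} 2 \bigl( y_i - \hat{y}_i \bigr) (-1)
   = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)
```

Subtracting LearningRate times these derivatives moves m and b in the direction that lowers the cost.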
Step 4: Calculate dm
Formula:
dm = (-2/n) * Σ[X * (Y - predicted_y)]
Substitute values:
dm = (-2/3) * [(1 * 2) + (2 * 4) + (3 * 6)]
Calculate inside the bracket:
(1 * 2) = 2
(2 * 4) = 8
(3 * 6) = 18
Now add them:
2 + 8 + 18 = 28
So:
dm = (-2/3) * 28
dm = -18.67
Step 5: Calculate db
Formula:
db = (-2/n) * Σ(Y - predicted_y)
Substitute values:
db = (-2/3) * (2 + 4 + 6)
Add the values:
2 + 4 + 6 = 12
So:
db = (-2/3) * 12
db = -8
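The hand calculations in Steps 4 and 5 can be checked with a short script. This is only a verification sketch of the arithmetic above, starting from m = 0 and b = 0.

```python
import numpy as np

X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])
m, b = 0.0, 0.0
n = len(X)

predicted_y = m * X + b
dm = (-2 / n) * np.sum(X * (Y - predicted_y))  # (-2/3) * 28
db = (-2 / n) * np.sum(Y - predicted_y)        # (-2/3) * 12

print(f"dm = {dm:.2f}, db = {db:.2f}")  # dm = -18.67, db = -8.00
```

Both derivatives are negative, which is why the update step increases m and b.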
Step 6: Update m
Formula:
m = m - LearningRate * dm
Substitute values:
m = 0 - (0.01 * -18.67)
m = 0 + 0.1867
m = 0.1867
Step 7: Update b
Formula:
b = b - LearningRate * db
Substitute values:
b = 0 - (0.01 * -8)
b = 0 + 0.08
b = 0.08
Updated Values After First Iteration
After one iteration of Gradient Descent:
m = 0.1867
b = 0.08
So the new prediction equation becomes:
predicted_y = 0.1867x + 0.08
Step 8: Check New Predictions
For X = 1:
predicted_y = (0.1867 * 1) + 0.08
predicted_y = 0.2667
For X = 2:
predicted_y = (0.1867 * 2) + 0.08
predicted_y = 0.4534
For X = 3:
predicted_y = (0.1867 * 3) + 0.08
predicted_y = 0.6401
New Prediction Table
| X | Actual Y | Old Prediction | New Prediction |
|---|---|---|---|
| 1 | 2 | 0 | 0.2667 |
| 2 | 4 | 0 | 0.4534 |
| 3 | 6 | 0 | 0.6401 |
The predictions are still not perfect, but they have improved slightly from the initial prediction of 0.
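The whole first iteration (Steps 1 to 8) can be sketched as follows, together with the MSE before and after the update to confirm that the error went down. The name `learning_rate` and the two MSE variables are assumptions added for illustration. Because the code keeps m at full precision instead of rounding it to 0.1867, the last digit of a prediction can differ slightly from the table.

```python
import numpy as np

X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])
m, b, learning_rate = 0.0, 0.0, 0.01
n = len(X)

old_pred = m * X + b
dm = (-2 / n) * np.sum(X * (Y - old_pred))
db = (-2 / n) * np.sum(Y - old_pred)

# One gradient descent update
m = m - learning_rate * dm
b = b - learning_rate * db
new_pred = m * X + b

old_mse = np.mean((Y - old_pred) ** 2)
new_mse = np.mean((Y - new_pred) ** 2)
print(f"m = {m:.4f}, b = {b:.4f}")
print("new predictions:", np.round(new_pred, 4))
print(f"MSE before = {old_mse:.2f}, after = {new_mse:.2f}")
```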
What Happens Next?
Gradient Descent repeats the same process many times:
1. Calculate predictions
2. Calculate errors
3. Calculate dm and db
4. Update m and b
5. Repeat
After many iterations, m and b move closer to the best values.
For this dataset, the ideal line is:
predicted_y = 2x + 0
So after enough iterations:
m ≈ 2
b ≈ 0
In the first iteration, Gradient Descent changed:
m = 0 → 0.1867
b = 0 → 0.08
This means the model started learning from the data.
With repeated iterations, the model keeps improving until the prediction error reaches its minimum.
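The ideal line quoted above can be cross-checked against a direct least-squares fit, which is what Gradient Descent converges toward. The sketch below uses NumPy's `np.polyfit` with degree 1; the names `m_exact` and `b_exact` are assumptions for this example.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])

# For degree 1, np.polyfit returns the coefficients [slope, intercept]
# of the least-squares line
m_exact, b_exact = np.polyfit(X, Y, 1)
print(f"m = {m_exact:.4f}, b = {b_exact:.4f}")
```

The slope comes out as 2 and the intercept as 0, up to floating-point rounding.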
Visualization of Learning
Iteration 1 → High Error
Iteration 10 → Lower Error
Iteration 100 → Much Lower Error, approaching the minimum
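To see this trend in numbers, the sketch below records the MSE at every iteration for the dataset X = 1, 2, 3 and Y = 2, 4, 6 with a learning rate of 0.01, then prints it at a few checkpoints. The list name `history` and the choice of checkpoints are assumptions made for illustration.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])
m, b, lr = 0.0, 0.0, 0.01
n = len(X)

history = []  # history[k] is the MSE measured at the start of iteration k + 1
for _ in range(1000):
    error = Y - (m * X + b)
    history.append(np.mean(error ** 2))
    m -= lr * (-2 / n) * np.sum(X * error)
    b -= lr * (-2 / n) * np.sum(error)

for iteration in (1, 10, 100, 1000):
    print(f"Iteration {iteration:>4}: MSE = {history[iteration - 1]:.4f}")
```

The printed values decrease from one checkpoint to the next, with the largest drop happening in the early iterations.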
Python Example — Gradient Descent
import numpy as np

# Dataset
X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])

# Initial values
m = 0.0
b = 0.0

# Learning rate
L = 0.01

# Iterations
epochs = 1000
n = len(X)

# Gradient Descent
for i in range(epochs):
    # Predictions with the current m and b
    Y_pred = m * X + b

    # Derivatives of the MSE with respect to m and b
    dm = (-2 / n) * np.sum(X * (Y - Y_pred))
    db = (-2 / n) * np.sum(Y - Y_pred)

    # Update values
    m = m - L * dm
    b = b - L * db

print("Slope:", m)
print("Intercept:", b)
Expected Output
Slope ≈ 1.97
Intercept ≈ 0.07
With more iterations, these values move closer to 2 and 0.
Final Equation
y = 2x
What Gradient Descent Learned
The algorithm learned:
When X increases, Y increases proportionally.
Types of Gradient Descent
| Type | Description |
|---|---|
| Batch Gradient Descent | Uses the entire dataset for each update |
| Stochastic Gradient Descent | Uses one sample at a time for each update |
| Mini-Batch Gradient Descent | Uses a small batch of samples for each update |
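The three types differ only in how many samples feed each update, so one function with a batch size parameter can sketch all of them. This is a minimal illustration under stated assumptions: the function name `gradient_descent`, the parameters `batch_size` and `seed`, and the per-epoch shuffling are choices made for this example, not a reference implementation.

```python
import numpy as np

def gradient_descent(X, Y, batch_size, lr=0.01, epochs=1000, seed=0):
    """batch_size = len(X): batch; 1: stochastic; in between: mini-batch."""
    rng = np.random.default_rng(seed)    # fixed seed so runs are repeatable
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)       # visit the samples in a new order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], Y[idx]
            error = yb - (m * xb + b)
            # Same derivative formulas, averaged over the current batch only
            m -= lr * (-2 / len(idx)) * np.sum(xb * error)
            b -= lr * (-2 / len(idx)) * np.sum(error)
    return m, b

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])
for name, size in [("batch", 3), ("stochastic", 1), ("mini-batch", 2)]:
    m, b = gradient_descent(X, Y, batch_size=size)
    print(f"{name:>10}: m = {m:.3f}, b = {b:.3f}")
```

With only three data points the distinction is mostly conceptual; the smaller batch sizes simply perform more updates per pass over the data.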
Advantages of Gradient Descent
- Works for large datasets
- Efficient optimization
- Widely used in Deep Learning
- Helps minimize prediction error
Limitations
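- Requires a properly chosen learning rate
- Can be slow for complex problems
- May get stuck in local minima on non-convex cost functions (the MSE cost of Linear Regression is convex, so it has no such local minima)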
Important Points
1. Gradient Descent minimizes the cost function.
2. Learning Rate controls step size.
3. Gradient Descent updates slope and intercept iteratively.
4. MSE is commonly used as the cost function.
5. Gradient Descent is widely used in Machine Learning and Deep Learning.
Summary
Gradient Descent is an optimization algorithm used in Linear Regression to minimize prediction error by iteratively updating the slope and intercept. It lets the model learn the best-fit line step by step.