Riding the Learning Curve: How a Single Number Decides Whether Your Neural Network Succeeds or Crashes#
Date: November 19, 2025
Training a deep learning model is a lot like teaching a student. If you teach too slowly, the student never learns enough. If you teach too quickly, the student gets confused and makes wild mistakes.
In neural networks, this teaching speed is controlled by the learning rate — one simple number that can decide whether your model becomes smart… or completely fails.
This blog explores what a learning rate is, why it is so powerful, and how changing it affects training in real-life experiments. Using a simple CNN on the MNIST dataset, we’ll compare learning rates and visualize how training behaves at different speeds.
What Is Learning Rate?#
The learning rate is a key hyperparameter that controls how quickly a neural network learns during training. Each time the weights are updated, it sets how large a step the optimizer takes toward a minimum of the loss function in response to the error the model just made.
In short, it's the number that scales every weight update.
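Formally, it appears in the gradient descent update rule:
\[\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)\]
where:
\(\theta_t\) = current parameters
\(\theta_{t+1}\) = updated parameters
\(\nabla_\theta J(\theta_t)\) = gradient of the loss function with respect to the parameters
\(\eta\) (eta) = learning rate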
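To make the update rule concrete, here is a minimal sketch of a single gradient descent step in plain Python. The toy loss \(J(\theta) = \theta^2\) and the helper name `gradient_descent_step` are assumptions chosen for illustration; they are not part of the experiment later in this post.

```python
def gradient_descent_step(theta, grad, lr):
    """Apply one update: theta_next = theta - lr * grad."""
    return theta - lr * grad


# Toy loss J(theta) = theta**2, whose gradient is 2 * theta.
theta = 3.0
grad = 2 * theta  # gradient at the current parameter value

# Same gradient, two learning rates: only the step size changes.
small_step = gradient_descent_step(theta, grad, lr=0.01)
large_step = gradient_descent_step(theta, grad, lr=0.25)

print(f"lr=0.01 moves theta from {theta} to {small_step:.2f}")
print(f"lr=0.25 moves theta from {theta} to {large_step:.2f}")
```

The gradient is identical in both calls; the learning rate alone decides how far the parameter moves toward the minimum at zero.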
Why is this important?#
Because the learning rate determines how big each update step is.
| Learning Rate | What Happens |
|---|---|
| Too Low | Model learns extremely slowly, may get stuck |
| Ideal | Smooth, stable learning and fast convergence |
| Too High | Model becomes unstable → oscillates or diverges |
Visual analogy:
Small LR → baby steps
Medium LR → walking normally
Large LR → running downhill
Very large LR → falling off a cliff
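The analogy can be reproduced on a toy problem. The sketch below runs plain gradient descent on \(J(\theta) = \theta^2\) with four learning rates; the function name `run_gradient_descent`, the starting point, and the specific rates are assumptions picked to show each regime, not values from the MNIST experiment.

```python
def run_gradient_descent(lr, steps=20, theta=1.0):
    """Minimize J(theta) = theta**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * theta           # dJ/dtheta
        theta = theta - lr * grad  # gradient descent update
    return theta


# Baby steps, normal walking, running downhill, falling off the cliff.
for lr in (0.001, 0.1, 0.9, 1.1):
    final_theta = run_gradient_descent(lr)
    print(f"lr={lr}: |theta| after 20 steps = {abs(final_theta):.4f}")
```

On this particular loss each step multiplies \(\theta\) by \(1 - 2\eta\), so the tiny rate barely moves, the moderate rate shrinks \(\theta\) quickly, a rate of 0.9 overshoots the minimum on every step while still shrinking, and a rate above 1.0 makes \(\theta\) grow without bound. Real loss surfaces are far less regular, so treat this as intuition rather than a rule for choosing a rate.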
Why Learning Rate Matters#
Controls how fast the model learns: A higher learning rate speeds up learning at the cost of stability.
Affects the quality of the final solution: A poorly chosen learning rate can leave the model stuck in a poor local minimum or near a saddle point.
Determines training stability:
Too high → training “explodes”.
Too low → training drags on for hours.
Interacts with optimizers: Optimizers such as Adam, RMSProp, and SGD each have commonly used default learning rates that balance speed against reliability, but no single learning rate is ideal for every model.
Learning Rates in Action#
To understand learning rates in action, we trained a simple Convolutional Neural Network (CNN) on the MNIST handwritten digit dataset.
Experiment Setup#
Dataset: MNIST
Model: Simple CNN
Epochs: 5
Optimizer: Adam
Learning Rates Tested:
0.0001 (very low)
0.001 (recommended for Adam)
0.01 (high)
Goal#
Show how different learning rates affect:
Training loss
Training speed
Final test accuracy
Training stability
Import Dependencies
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
Load Dataset
# Convert each image to a tensor with pixel values scaled to [0, 1]
transform = transforms.ToTensor()
train_data = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
# Shuffle only the training set; the order of test samples does not affect evaluation
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)
Build CNN
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two conv blocks; each 2x2 max-pool halves the image: 28x28 -> 14x14 -> 7x7
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)
        )
        # Classifier head: 32 channels * 7 * 7 features -> 128 -> 10 digit classes
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32*7*7, 128), nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.fc(self.conv(x))
Training Function
def train_model(lr):
    # A fresh model per call, so every learning rate starts from scratch
    model = CNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    loss_list = []
    for epoch in range(5):
        for x, y in train_loader:
            pred = model(x)
            loss = criterion(pred, y)
            optimizer.zero_grad()  # clear gradients left over from the previous batch
            loss.backward()
            optimizer.step()
        # Record the loss of the last batch in this epoch
        loss_list.append(loss.item())
        print(f"LR={lr}, Epoch={epoch+1}, Loss={loss.item():.4f}")
    return model, loss_list
Run Experiments
learning_rates = [0.0001, 0.001, 0.01]
results = {}
for lr in learning_rates:
    model, losses = train_model(lr)
    results[lr] = losses
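The goal above lists final test accuracy, but the training function only records the loss. One way to measure accuracy is sketched below, under stated assumptions: the helper name `evaluate` is hypothetical, and the random tensors and single-layer model are stand-ins so the block runs on its own without downloading MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader):
    """Return the fraction of samples in `loader` that `model` classifies correctly."""
    model.eval()  # put layers such as dropout into inference mode
    correct, total = 0, 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for x, y in loader:
            predicted = model(x).argmax(dim=1)  # class with the highest logit
            correct += (predicted == y).sum().item()
            total += y.size(0)
    return correct / total


# Stand-in data and model (assumptions for illustration, not MNIST or the CNN).
torch.manual_seed(0)
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))
demo_loader = DataLoader(TensorDataset(images, labels), batch_size=32)
demo_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

accuracy = evaluate(demo_model, demo_loader)
print(f"Accuracy of an untrained model on random data: {accuracy:.2%}")
```

In the experiment loop, a call such as `evaluate(model, test_loader)` right after `train_model(lr)` would give one accuracy figure per learning rate to store alongside the losses.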
Plot Loss Curves
for lr, loss in results.items():
    plt.plot(loss, label=f"LR={lr}")
plt.legend()
plt.title("Loss Comparison Across Learning Rates")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
Key Takeaways#
Hyperparameter Sensitivity: The learning rate is a decisive factor in model training. Even with identical architectures, the choice of LR can mean the difference between rapid convergence and divergence.
The “Goldilocks” Principle:
Too Low: Can lead to slow training or, as seen here, getting stuck in suboptimal states.
Too High: Causes the loss to oscillate, or even diverge, so the model fails to settle into a good minimum.
Optimal: Facilitates smooth, efficient descent (e.g., \(\eta = 0.001\) in this experiment).
Visual Diagnostics: Plotting loss curves is essential. Numerical accuracy metrics alone often hide how the model is learning (or failing to learn).
References
Academic Sources
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6980
Technical Documentation & Tutorials
PyTorch. (2024). Optimizers - PyTorch Documentation. Retrieved from https://pytorch.org/docs/stable/optim.html
Li, F., Karpathy, A., & Johnson, J. (n.d.). CS231n: Convolutional Neural Networks for Visual Recognition (Optimization). Stanford University. https://cs231n.github.io/optimization-1/
Brownlee, J. (2022). A Gentle Introduction to Learning Rate in Deep Learning. Machine Learning Mastery.
The learning rate is often the single most important hyperparameter to tune in deep learning. As the loss curves show, an order-of-magnitude change in learning rate (e.g., from \(10^{-4}\) to \(10^{-3}\)) dramatically alters the training trajectory. For future iterations, learning rate schedulers, which decay the rate over time, are recommended to combine the speed of a high initial rate with the precision of a low final rate.
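As a rough sketch of what a decaying schedule could look like, the block below uses PyTorch's `StepLR` scheduler. The single linear layer, the dummy loss, and the values `step_size=2` and `gamma=0.1` are assumptions chosen to keep the example short; they were not part of the experiment above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# A single linear layer stands in for the CNN (assumption for brevity).
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Multiply the learning rate by gamma every `step_size` epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lr_history = []
for epoch in range(6):
    lr_history.append(optimizer.param_groups[0]["lr"])  # rate used this epoch

    # One dummy training step; a real loop would iterate over train_loader.
    loss = model(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    scheduler.step()  # advance the schedule once per epoch

for epoch, lr in enumerate(lr_history, start=1):
    print(f"Epoch {epoch}: lr = {lr:.5f}")
```

With these settings the rate starts at 0.01 and drops by a factor of ten after every second epoch. Other schedulers in `torch.optim.lr_scheduler` follow the same pattern of wrapping the optimizer and calling `step()` once per epoch.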