Riding the Learning Curve: How a Single Number Decides Whether Your Neural Network Succeeds or Crashes

Date: November 19, 2025

Training a deep learning model is a lot like teaching a student. If you teach too slowly, the student never learns enough. If you teach too quickly, the student gets confused and makes wild mistakes.

In neural networks, this teaching speed is controlled by the learning rate — one simple number that can decide whether your model becomes smart… or completely fails.

This blog explores what a learning rate is, why it is so powerful, and how changing it affects training in real-life experiments. Using a simple CNN on the MNIST dataset, we’ll compare learning rates and visualize how training behaves at different speeds.

What Is Learning Rate?#

The learning rate is a key hyperparameter that controls how quickly a neural network learns during training. Each time the weights are updated, it scales how much they change in response to the error, which sets the size of each step taken toward a minimum of the loss function.

Formally, it appears in the gradient descent update rule:

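\[\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta J(\theta_t)\]

where:

  • \(\theta_t\) = current parameters

  • \(\theta_{t+1}\) = updated parameters

  • \(\nabla_\theta J(\theta_t)\) = gradient of the loss function with respect to the parameters

  • \(\eta\) (eta) = learning rate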
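To make the update rule concrete, here is a minimal sketch of a single gradient descent step in plain Python. The toy loss \(J(\theta) = \theta^2\) and the helper names `grad_J` and `gd_step` are illustrative assumptions, not part of the experiment later in this post.

```python
def grad_J(theta):
    """Gradient of the toy loss J(theta) = theta**2 (an illustrative choice)."""
    return 2.0 * theta


def gd_step(theta, lr):
    """Apply one update: theta_next = theta - lr * grad_J(theta)."""
    return theta - lr * grad_J(theta)


theta = 1.0
for lr in (0.01, 0.1):
    print(f"lr={lr}: theta moves from {theta} to {gd_step(theta, lr):.2f}")
```

With \(\theta = 1\) the gradient is 2, so a learning rate of 0.01 moves \(\theta\) to 0.98, while 0.1 moves it to 0.8. The gradient is the same in both cases; only the step size differs.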

Why is this important?#

Because the learning rate determines how big each update step is.

| Learning Rate | What Happens |
|---------------|--------------|
| Too Low | Model learns extremely slowly, may get stuck |
| Ideal | Smooth, stable learning and fast convergence |
| Too High | Model becomes unstable → oscillates or diverges |

Visual analogy:
Small LR → baby steps
Medium LR → walking normally
Large LR → running downhill
Very large LR → falling off a cliff
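The regimes above can be reproduced on a toy problem. The sketch below runs plain gradient descent on \(J(\theta) = \theta^2\), an assumed stand-in loss rather than the CNN used later; the function name `run_gd` and the specific rates are illustrative.

```python
def run_gd(lr, steps=20, theta=1.0):
    """Run gradient descent on J(theta) = theta**2 and return the loss history."""
    history = []
    for _ in range(steps):
        history.append(theta ** 2)   # record the loss before updating
        theta -= lr * 2.0 * theta    # gradient of theta**2 is 2 * theta
    history.append(theta ** 2)       # loss after the final update
    return history


for lr in (0.001, 0.3, 1.1):
    losses = run_gd(lr)
    trend = "diverging" if losses[-1] > losses[0] else "converging"
    print(f"lr={lr}: final loss {losses[-1]:.3e} ({trend})")
```

On this loss each step multiplies \(\theta\) by \(1 - 2\eta\), so the smallest rate barely moves, the middle rate shrinks the loss quickly, and any rate above 1.0 makes \(\theta\) flip sign and grow at every step. That threshold is specific to this toy loss and does not carry over to real networks.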

Why Learning Rate Matters#

  1. Controls how fast the model learns: A higher learning rate speeds up learning at the cost of stability.

  2. Affects the quality of the final solution: A poorly chosen learning rate can leave the model stuck in poor local minima or near saddle points.

  3. Determines training stability:
    Too high → training “explodes”.
    Too low → training drags for hours.

  4. Interacts with optimizers: For optimizers like Adam, RMSProp, and SGD, the default LR is often set to a value that balances speed against reliability. Still, no single LR is right for every model.
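As a hedged illustration of how the learning rate is passed to different PyTorch optimizers, the sketch below builds three optimizers for a throwaway linear layer and reads the rate back from `param_groups`. The layer and the chosen rates are assumptions for demonstration only.

```python
import torch.nn as nn
import torch.optim as optim

layer = nn.Linear(4, 2)  # placeholder model, used only to supply parameters

optimizers = {
    "SGD": optim.SGD(layer.parameters(), lr=0.01),
    "RMSprop": optim.RMSprop(layer.parameters(), lr=0.001),
    "Adam": optim.Adam(layer.parameters(), lr=0.001),
}

for name, opt in optimizers.items():
    # Each optimizer stores its hyperparameters per parameter group.
    print(name, opt.param_groups[0]["lr"])

# The rate can also be changed in place, e.g. to lower it by hand mid-training.
for group in optimizers["Adam"].param_groups:
    group["lr"] = 0.0005
```

The same numeric rate does not produce the same behavior across optimizers, because Adam and RMSprop rescale each update using running gradient statistics while plain SGD does not.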

Learning Rates in Action#

To understand learning rates in action, we trained a simple Convolutional Neural Network (CNN) on the MNIST handwritten digit dataset.

Experiment Setup#

  • Dataset: MNIST

  • Model: Simple CNN

  • Epochs: 5

  • Optimizer: Adam

  • Learning Rates Tested:

    • 0.0001 (very low)

    • 0.001 (recommended for Adam)

    • 0.01 (high)

Goal#

Show how different learning rates affect:

  • Training loss

  • Training speed

  • Final test accuracy

  • Training stability

  1. Import Dependencies

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
  2. Load Dataset

# Convert PIL images to float tensors scaled to [0, 1]
transform = transforms.ToTensor()

train_data = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
# Evaluation order does not matter, so the test set is not shuffled
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)
  3. Build CNN

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two conv blocks; each 2x2 max-pool halves the spatial size: 28 -> 14 -> 7
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)
        )
        # Classifier head: flatten 32 feature maps of 7x7, then map to 10 digit classes
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32*7*7, 128), nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.fc(self.conv(x))
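  4. Training Function

def train_model(lr, epochs=5):
    model = CNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    loss_list = []  # mean training loss per epoch

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for x, y in train_loader:
            optimizer.zero_grad()
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            # Weight by batch size so a smaller final batch is averaged correctly
            running_loss += loss.item() * x.size(0)
        # Record the epoch average rather than the last batch's loss, which is noisy
        epoch_loss = running_loss / len(train_loader.dataset)
        loss_list.append(epoch_loss)
        print(f"LR={lr}, Epoch={epoch+1}, Loss={epoch_loss:.4f}")
    return model, loss_list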
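  5. Run Experiments

learning_rates = [0.0001, 0.001, 0.01]
results = {}  # maps each learning rate to its per-epoch loss list

for lr in learning_rates:
    # A fresh model is trained from scratch for every learning rate
    model, losses = train_model(lr)
    results[lr] = losses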
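  6. Plot Loss Curves

for lr, losses in results.items():
    epochs = range(1, len(losses) + 1)  # label epochs 1..N to match the printed logs
    plt.plot(epochs, losses, marker="o", label=f"LR={lr}")
plt.legend()
plt.title("Loss Comparison Across Learning Rates")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()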
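The goals above list final test accuracy, but the steps so far only record training loss. One way to fill that gap is an evaluation helper like the sketch below. The name `evaluate` is an assumption of this example, and the tiny linear model and random tensors at the bottom exist only so the snippet runs on its own; in the experiment you would pass the trained CNN and `test_loader` instead.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader):
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()                      # switch off training-only behavior
    correct, total = 0, 0
    with torch.no_grad():             # no gradients needed for evaluation
        for x, y in loader:
            pred = model(x).argmax(dim=1)   # index of the highest logit
            correct += (pred == y).sum().item()
            total += y.size(0)
    return correct / total


# Stand-in data and model so the sketch is self-contained (not MNIST).
torch.manual_seed(0)
dummy_x = torch.randn(64, 8)
dummy_y = torch.randint(0, 3, (64,))
dummy_loader = DataLoader(TensorDataset(dummy_x, dummy_y), batch_size=16)
dummy_model = nn.Linear(8, 3)

accuracy = evaluate(dummy_model, dummy_loader)
print(f"Accuracy on random data: {accuracy:.2%}")
```

In the experiment this could be called as `evaluate(model, test_loader)` right after `train_model(lr)` returns, storing one accuracy per learning rate alongside the loss list.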

Key Takeaways#

Hyperparameter Sensitivity: The learning rate is a decisive factor in model training. Even with identical architectures, variations in LR can determine the difference between rapid convergence and model divergence.

The “Goldilocks” Principle:

  • Too Low: Can lead to slow training or, as seen here, getting stuck in suboptimal states.

  • Too High: Causes oscillation and can prevent the model from settling into a good minimum.

  • Optimal: Facilitates smooth, efficient descent (e.g., \(\eta = 0.001\) in this experiment).

Visual Diagnostics: Plotting loss curves is essential. Numerical accuracy metrics alone often hide how the model is learning (or failing to learn).


The learning rate is often the single most significant hyperparameter to tune in deep learning. As shown in the visual analysis, an order-of-magnitude change in learning rate (e.g., from \(10^{-4}\) to \(10^{-3}\)) dramatically alters the training trajectory. For future iterations, using learning rate schedulers (which decay the rate over time) is recommended to combine the speed of high initial rates with the precision of low final rates.
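A minimal sketch of the scheduler idea, written in plain Python so it runs without a training loop. The function name `step_decay` and the values (start at 0.01, halve every 2 epochs) are illustrative assumptions, not settings from the experiment. PyTorch provides ready-made schedulers in `torch.optim.lr_scheduler` (for example `StepLR`) that apply the same kind of rule to an optimizer.

```python
def step_decay(initial_lr, epoch, drop=0.5, every=2):
    """Learning rate for a 0-based epoch: multiply by `drop` once per `every` epochs."""
    return initial_lr * (drop ** (epoch // every))


for epoch in range(6):
    print(f"epoch {epoch}: lr = {step_decay(0.01, epoch):.5f}")
```

Under these assumed settings the rate stays at 0.01 for epochs 0 and 1, drops to 0.005 for epochs 2 and 3, and to 0.0025 for epochs 4 and 5, so early epochs take large steps and later ones refine.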