Optimizers

Optimizers update neural network parameters using computed gradients. Understanding how optimizers work and how to tune them is essential for training models effectively.

Introduction

Training a neural network means finding parameter values that minimize a loss function. This is an optimization problem: start with random parameters, compute gradients that indicate how to adjust them, and iteratively update parameters to reduce loss.

Optimizers automate this process. They take computed gradients and apply update rules to parameters. Different optimizers use different strategies—some use only the current gradient (like SGD), while others accumulate information from previous steps (like momentum-based methods).

This guide explains optimizer fundamentals, shows you how to create and use optimizers, describes the algorithms available in Tofu, and provides practical guidance for tuning hyperparameters and troubleshooting training issues.

Optimizer Fundamentals

Understanding how optimizers work requires grasping two key concepts: gradient descent and the learning rate.

Gradient Descent: Following the Slope Downhill

Imagine you're standing on a mountain in fog, trying to reach the lowest point. You can't see far, but you can feel which direction slopes downward beneath your feet. Gradient descent works the same way: at each step, compute which direction reduces the loss function, then take a small step in that direction.

Mathematically, for a parameter theta and loss L:

theta_new = theta_old - learning_rate * gradient

Where gradient = dL/dtheta (the derivative of loss with respect to the parameter).

The gradient points in the direction of steepest ascent (uphill). By subtracting it, we move downhill toward lower loss.
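The update rule can be sketched in plain C on a one-dimensional toy problem. This is an illustration only and does not use Tofu: the loss L(theta) = (theta - 3)^2 and the names toy_gradient and gradient_descent_1d are assumptions made for the example.

```c
/* Toy loss L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3). */
static double toy_gradient(double theta) {
    return 2.0 * (theta - 3.0);
}

/* Apply theta_new = theta_old - learning_rate * gradient, `steps` times. */
double gradient_descent_1d(double theta, double learning_rate, int steps) {
    for (int i = 0; i < steps; i++) {
        theta -= learning_rate * toy_gradient(theta);
    }
    return theta;
}
```

Starting from theta = 0 with a learning rate of 0.1, every step shrinks the distance to the minimum at 3 by a factor of 0.8, so 100 steps end up within rounding distance of 3.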

Learning Rate: Step Size Matters

The learning rate controls how large a step to take. It is usually the most important hyperparameter to get right when training neural networks.

Too large: You'll overshoot the minimum, potentially making loss worse or causing training to diverge completely.

Loss landscape:  \    /
                  \__/
With large steps: --> X <-- (overshoot back and forth)

Too small: Training converges slowly. You'll make progress, but it might take 10x or 100x more iterations than necessary.

Loss landscape:  \    /
                  \__/
With tiny steps:  . . . . . (very slow progress)

Just right: Training converges efficiently without instability.

Loss landscape:  \    /
                  \__/
Good step size:   -> -> -> (steady progress to minimum)

Typical learning rates range from 0.0001 to 0.1. Start with 0.01 and adjust based on training behavior.
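The three regimes can be reproduced on a toy loss. The sketch below assumes the loss L(x) = x^2 and a start at x = 1; the function name final_distance is hypothetical, and the specific rates that diverge or crawl depend on this particular loss, not on Tofu.

```c
#include <math.h>

/* Run gradient descent on L(x) = x^2 (gradient 2x) starting at x = 1
 * and return the final distance from the minimum at 0. */
double final_distance(double learning_rate, int steps) {
    double x = 1.0;
    for (int i = 0; i < steps; i++) {
        x -= learning_rate * 2.0 * x;
    }
    return fabs(x);
}
```

On this loss each step multiplies x by (1 - 2 * learning_rate). A rate of 1.1 gives a factor of -1.2, so x flips sign and grows every step. A rate of 0.001 gives 0.998, leaving x above 0.9 after 50 steps. A rate of 0.1 gives 0.8, which brings x below 0.001 in the same 50 steps.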

Stochastic Gradient Descent (SGD)

Classical gradient descent computes gradients using the entire training dataset. This is expensive and slow. Stochastic Gradient Descent (SGD) uses small batches of data instead—typically 32, 64, or 128 examples at a time.

The "stochastic" (random) part means each batch gives a noisy estimate of the true gradient. Averaged over many batches, these estimates point in the correct direction, and each one is much cheaper to compute than a gradient over the entire dataset.

In practice, when people say "SGD," they usually mean "mini-batch SGD"—computing gradients on small batches rather than single examples or the full dataset.
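To make the "noisy estimate" idea concrete, here is a minimal sketch for a one-weight model pred = w * x with mean squared error. The model, the data layout, and the name batch_gradient are assumptions for illustration and are unrelated to Tofu's internals.

```c
#include <stddef.h>

/* Gradient of mean((w * x[i] - y[i])^2) with respect to w,
 * computed over the n examples starting at index `start`. */
double batch_gradient(const double *x, const double *y,
                      size_t start, size_t n, double w) {
    double sum = 0.0;
    for (size_t i = start; i < start + n; i++) {
        sum += 2.0 * (w * x[i] - y[i]) * x[i];
    }
    return sum / (double)n;
}
```

With x = {1, 2, 3, 4}, y = {2, 4, 6, 8}, and w = 0, the full-dataset gradient is -30. The two batches of size 2 give -10 and -50: each is off on its own, but their average is exactly -30.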

Why Multiple Optimizer Types?

If vanilla SGD works, why do we need other optimizers? Because SGD has limitations:

Slow convergence on complex loss landscapes
Oscillation in narrow valleys (moves back and forth rather than forward)
Sensitivity to learning rate choice

Advanced optimizers like SGD with momentum address these issues by accumulating information about previous gradients. This helps accelerate training and dampen oscillations.

Creating Optimizers

Optimizers in Tofu are tied to computation graphs. When you create an optimizer, it automatically collects all trainable parameters (nodes created with tofu_graph_param) from the graph.

Basic Setup

Creating an optimizer follows this pattern:

// 1. Create graph and add parameters
tofu_graph *g = tofu_graph_create();

tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// 2. Create optimizer (automatically finds W and b)
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// 3. Use in training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    tofu_optimizer_zero_grad(opt);
    // ... forward pass, compute loss ...
    tofu_graph_backward(g, loss);
    tofu_optimizer_step(opt);
}

// 4. Cleanup (optimizer before graph)
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(weights);
tofu_tensor_free_data_too(bias);

Key points:

Automatic parameter collection: The optimizer scans the graph and finds all PARAM nodes when created
One optimizer per graph: Each optimizer manages parameters from a single graph
Cleanup order matters: Always free the optimizer before the graph

Choosing a Learning Rate

Start with these defaults:

0.01 - Safe starting point for most problems
0.001 - Deep networks, complex problems
0.1 - Small networks, simple problems

After a few iterations, check how the loss is behaving. If it is increasing or has become NaN, reduce the learning rate by 10x. If it is decreasing very slowly, try increasing the learning rate by 2x-5x.
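One way to encode this kind of rule of thumb is a small helper that compares two consecutive loss values. Everything here is a hypothetical sketch: the name suggest_learning_rate, the 0.1% "too slow" threshold, and the choice of 2x for increases are assumptions, and prev_loss is assumed to be positive.

```c
#include <math.h>

/* Hypothetical heuristic:
 *  - loss rose or became NaN/inf   -> divide the rate by 10
 *  - loss fell by less than 0.1%   -> multiply the rate by 2
 *  - otherwise                     -> keep the rate as it is */
double suggest_learning_rate(double lr, double prev_loss, double curr_loss) {
    if (!isfinite(curr_loss) || curr_loss > prev_loss) {
        return lr / 10.0;
    }
    if (prev_loss - curr_loss < 0.001 * prev_loss) {
        return lr * 2.0;
    }
    return lr;
}
```

Treat the result as a suggestion to try on the next run rather than something to apply after every batch, since single-batch losses are noisy.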

Memory Considerations

Different optimizers have different memory requirements:

SGD: No extra optimizer state (just the parameters and their gradients)
SGD with momentum: One velocity buffer per parameter (an extra copy the size of the parameters)

For large networks on memory-constrained devices, vanilla SGD may be the only option. For everything else, momentum is usually worth the extra memory.
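A rough budget can be computed up front. The helper below is a sketch under stated assumptions: parameters are stored as float, momentum adds exactly one float of velocity per parameter, and gradient buffers and graph intermediates are not counted. The name param_and_state_bytes is hypothetical.

```c
#include <stddef.h>

/* Bytes needed for float parameters plus optimizer state.
 * Counts one buffer for the parameters and, with momentum,
 * one more buffer of the same size for the velocity. */
size_t param_and_state_bytes(size_t num_params, int use_momentum) {
    size_t buffers = use_momentum ? 2 : 1;
    return buffers * num_params * sizeof(float);
}
```

For the 784x10 weight matrix and 10-element bias from the Basic Setup example (7,850 parameters), this gives 31,400 bytes for vanilla SGD and 62,800 bytes with momentum on platforms with 4-byte floats.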

SGD: Stochastic Gradient Descent

Vanilla SGD is the simplest optimizer. It updates parameters by directly subtracting the scaled gradient.

The Algorithm

For each parameter theta:

theta = theta - learning_rate * gradient

That's it. Compute the gradient, scale it by the learning rate, subtract from the parameter.

In code:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

The optimizer applies this update rule to every parameter automatically when you call tofu_optimizer_step().
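Conceptually, a step over one parameter buffer looks like the loop below. This is a plain-C sketch of the rule on raw float arrays, not Tofu's actual implementation; the name sgd_step and its signature are assumptions.

```c
#include <stddef.h>

/* In-place vanilla SGD update: theta[i] -= learning_rate * grad[i]. */
void sgd_step(float *theta, const float *grad, size_t n, float learning_rate) {
    for (size_t i = 0; i < n; i++) {
        theta[i] -= learning_rate * grad[i];
    }
}
```

For example, parameters {1.0, 2.0} with gradients {0.5, -1.0} and a learning rate of 0.1 become {0.95, 2.1}: the first parameter moves down because its gradient is positive, the second moves up because its gradient is negative.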

Implementation Example

Here's a complete training loop using SGD:

// Setup
tofu_graph *g = tofu_graph_create();

// Network: linear layer (input_dim=4, output_dim=3)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);

tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);

// Create SGD optimizer with learning rate 0.01
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop
for (int epoch = 0; epoch < 100; epoch++) {
    double epoch_loss = 0.0;

    for (int batch = 0; batch < num_batches; batch++) {
        // Zero gradients before forward pass
        tofu_optimizer_zero_grad(opt);

        // Forward pass: pred = input @ W + b
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
        tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
        tofu_graph_node *pred = tofu_graph_add(g, h, b_node);

        // Compute loss
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

        // Get loss value for logging
        float loss_val;
        tofu_tensor *loss_tensor = tofu_graph_get_value(loss);
        TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);
        epoch_loss += loss_val;

        // Backward pass: compute gradients
        tofu_graph_backward(g, loss);

        // Update parameters using gradients
        tofu_optimizer_step(opt);

        // Clear operations for next batch
        tofu_graph_clear_ops(g);
    }

    printf("Epoch %d: Loss = %.6f\n", epoch, epoch_loss / num_batches);
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);

When to Use SGD

SGD works well when:

Memory is tight: SGD has no extra memory overhead
Loss landscape is smooth: Few local minima, well-conditioned gradients
You have time to tune: SGD is sensitive to learning rate, so you'll need to experiment

SGD struggles when:

Loss landscape is complex: Many local minima or saddle points
Gradients are noisy: High variance in gradient estimates
Convergence needs to be fast: SGD converges more slowly than momentum-based methods

Tuning SGD

The learning rate is the only hyperparameter for vanilla SGD. Here's how to tune it:

Start with 0.01:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

Watch the first few iterations:

Loss decreasing steadily: Good sign, continue training
Loss increasing or NaN: Learning rate too high, reduce by 10x
Loss barely changing: Learning rate too low, increase by 2x-5x

Common learning rate values:

0.1 - Aggressive, works for simple problems
0.01 - Conservative, good default
0.001 - Very conservative, deep networks
0.0001 - Fine-tuning pretrained models

Monitoring training:

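double best_loss = INFINITY;    // INFINITY comes from <math.h>
int no_improvement_count = 0;

for (int epoch = 0; epoch < max_epochs; epoch++) {
    // ... training loop that accumulates epoch_loss ...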

    if (epoch_loss < best_loss) {
        best_loss = epoch_loss;
        no_improvement_count = 0;
    } else {
        no_improvement_count++;
    }

    // Reduce learning rate if stuck
    if (no_improvement_count > 10) {
        opt->learning_rate *= 0.5;
        printf("Reducing learning rate to %.6f\n", opt->learning_rate);
        no_improvement_count = 0;
    }
}

SGD with Momentum

Momentum helps SGD converge faster and more smoothly by accumulating a velocity term that averages gradients over time. This dampens oscillations and accelerates progress in consistent directions.

The Algorithm

Instead of directly using the current gradient, momentum maintains a velocity vector, an exponentially decaying sum of past gradient steps:

v = momentum * v - learning_rate * gradient
theta = theta + v

Where:

v is the velocity (initialized to zero)
momentum is a coefficient (typically 0.9)
learning_rate scales the gradient contribution
gradient is the current parameter gradient

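Some references write the update as v = momentum * v + gradient followed by theta = theta - learning_rate * v. With a constant learning rate the two forms produce identical parameter updates (the velocities differ only by a factor of -learning_rate). They diverge once the learning rate changes mid-training, because the form used here bakes the learning rate into the stored velocity. In short: scale the old velocity by momentum (typically 0.9), subtract the scaled gradient, and add the result to the parameter.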
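The two-line update can be written as a loop over a parameter buffer. This is a plain-C sketch on raw arrays rather than Tofu's implementation; the name momentum_step and its signature are assumptions.

```c
#include <stddef.h>

/* One momentum update, applied in place to theta and velocity:
 *   v     = momentum * v - learning_rate * grad
 *   theta = theta + v                                         */
void momentum_step(double *theta, double *velocity, const double *grad,
                   size_t n, double learning_rate, double momentum) {
    for (size_t i = 0; i < n; i++) {
        velocity[i] = momentum * velocity[i] - learning_rate * grad[i];
        theta[i] += velocity[i];
    }
}
```

Because the velocity starts at zero, the very first call moves the parameters exactly as a vanilla SGD step would. The difference shows up from the second call on, when the previous step is carried forward.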

Why Momentum Works

Think of momentum as a ball rolling downhill. When the slope consistently points in one direction, the ball accelerates (velocity builds up). When the slope changes direction, the accumulated velocity smooths out oscillations.

Without momentum (vanilla SGD):

Narrow valley:  |        |
Path taken:     | -> <- ->| (oscillates back and forth)
                | -> <- ->|

With momentum:

Narrow valley:  |        |
Path taken:     |  -->   | (smooth progress forward)
                |   -->  |

Momentum provides two benefits:

Acceleration: Builds up speed in consistent directions
Dampening: Reduces oscillations in directions that change frequently

Implementation Example

Creating an SGD optimizer with momentum requires one additional parameter:

// Create optimizer with learning_rate=0.01, momentum=0.9
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

The rest of the training loop is identical to vanilla SGD:

// Setup
tofu_graph *g = tofu_graph_create();

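// Network: two-layer MLP
// NOTE: zeros keep the example short. With all-zero weights the hidden
// layer outputs zero and the gradients for W1 and W2 stay at zero, so
// fill W1 and W2 with small random values before training a real model.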
tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){784, 128}, TOFU_FLOAT);
tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){128}, TOFU_FLOAT);
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){128, 10}, TOFU_FLOAT);
tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

tofu_graph_node *W1_node = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);

// Create optimizer with momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// Training loop
for (int epoch = 0; epoch < 100; epoch++) {
    for (int batch = 0; batch < num_batches; batch++) {
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);

        // Layer 1: h1 = relu(x @ W1 + b1)
        tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
        h1 = tofu_graph_add(g, h1, b1_node);
        h1 = tofu_graph_relu(g, h1);

        // Layer 2: output = h1 @ W2 + b2
        tofu_graph_node *h2 = tofu_graph_matmul(g, h1, W2_node);
        h2 = tofu_graph_add(g, h2, b2_node);

        // Loss
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, h2, target);

        // Backward and update
        tofu_graph_backward(g, loss);
        tofu_optimizer_step(opt);

        tofu_graph_clear_ops(g);
    }
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W1);
tofu_tensor_free_data_too(b1);
tofu_tensor_free_data_too(W2);
tofu_tensor_free_data_too(b2);

Tuning Momentum

Momentum has two hyperparameters: learning rate and momentum coefficient.

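Learning Rate: Start with the same values as vanilla SGD (0.01 is a good default). Momentum dampens oscillations, but it also enlarges the step along consistent directions: under a steady gradient the velocity settles at learning_rate * gradient / (1 - momentum), which is 10x the plain SGD step at momentum 0.9. If training becomes unstable after adding momentum, lower the learning rate.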

Momentum Coefficient: Controls how much past gradients influence current updates.

Common values:

0.9 - Standard choice, works well for most problems
0.95 - High momentum, use for slow convergence
0.99 - Very high momentum, use for very deep networks
0.5-0.8 - Low momentum, use if training is unstable

The momentum coefficient is easier to tune than learning rate. Start with 0.9 and adjust if needed.

When to Use Momentum

Use momentum when:

Training is slow: Momentum accelerates convergence
Gradients are noisy: Momentum smooths out noise
Deep networks: Momentum helps the optimizer keep moving across the plateaus and narrow valleys common in deep-network loss landscapes
Memory is available: Momentum requires one velocity buffer per parameter

Stick with vanilla SGD when:

Memory is very tight: Momentum needs an extra buffer as large as the parameters
Loss landscape is simple: Vanilla SGD may be sufficient

In practice, momentum is the default choice for most problems. The memory cost is usually worth the faster convergence.

Using Optimizers in Training

Now that you understand optimizer algorithms, let's look at the mechanics of using them in training loops.

The Training Cycle

Every training iteration follows the same five-step pattern:

1. Zero gradients    (tofu_optimizer_zero_grad)
2. Forward pass      (build computation graph)
3. Backward pass     (tofu_graph_backward)
4. Update parameters (tofu_optimizer_step)
5. Clear operations  (tofu_graph_clear_ops)

This cycle repeats for every batch in every epoch.

Step-by-Step Breakdown

Step 1: Zero Gradients

Gradients accumulate by default in Tofu. If you don't zero them, they'll keep adding up across iterations, leading to incorrect updates.

tofu_optimizer_zero_grad(opt);

This clears all gradient buffers for parameters tracked by the optimizer. Always call this before the forward pass.
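The effect of accumulate-by-default can be seen in a toy model. The struct and functions below are hypothetical stand-ins written for this illustration; they mimic the semantics described above and are not Tofu's data structures.

```c
/* Toy parameter with an accumulate-by-default gradient buffer. */
typedef struct {
    double value;
    double grad;
} toy_param;

/* "Backward" for the loss (value - target)^2: ADDS to the buffer. */
void toy_backward(toy_param *p, double target) {
    p->grad += 2.0 * (p->value - target);
}

/* Reset the buffer, playing the role of a zero-grad call. */
void toy_zero_grad(toy_param *p) {
    p->grad = 0.0;
}
```

With value = 1 and target = 0 the true gradient is 2. Two backward calls without a reset leave 4 in the buffer, twice the correct value; calling toy_zero_grad between them leaves 2.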

Step 2: Forward Pass

Build the computation graph by adding operations. Each operation automatically computes its value:

tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *h = tofu_graph_matmul(g, x, W);
tofu_graph_node *pred = tofu_graph_add(g, h, b);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

At this point, loss contains the computed loss value, but gradients haven't been computed yet.

Step 3: Backward Pass

Compute gradients by calling backward on the loss node:

tofu_graph_backward(g, loss);

This triggers reverse-mode automatic differentiation. Tofu walks the graph backwards, computing gradients for every parameter using the chain rule. After this call, every parameter has its gradient stored in node->grad.
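For a single linear unit the chain rule can be worked by hand, which is a useful mental model for what the backward pass automates. The sketch assumes the model pred = w * x + b with squared-error loss; the function names are hypothetical and nothing here calls Tofu.

```c
/* Model: pred = w * x + b, loss = (pred - target)^2.
 * Chain rule:
 *   dloss/dw = dloss/dpred * dpred/dw = 2 * (pred - target) * x
 *   dloss/db = dloss/dpred * dpred/db = 2 * (pred - target) * 1 */
double toy_loss(double w, double b, double x, double target) {
    double pred = w * x + b;
    return (pred - target) * (pred - target);
}

void toy_gradients(double w, double b, double x, double target,
                   double *grad_w, double *grad_b) {
    double pred = w * x + b;                      /* forward pass */
    double dloss_dpred = 2.0 * (pred - target);   /* start at the loss */
    *grad_w = dloss_dpred * x;                    /* walk back to w */
    *grad_b = dloss_dpred;                        /* walk back to b */
}
```

With w = 0.5, b = 0.1, x = 2, and target = 3, the prediction is 1.1, so dloss/dpred = -3.8, giving a gradient of -7.6 for w and -3.8 for b. A finite-difference check on toy_loss is a quick way to confirm hand-derived gradients like these.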

Step 4: Update Parameters

Apply the optimizer's update rule to adjust parameters:

tofu_optimizer_step(opt);

This uses the computed gradients to update parameters. For SGD, it subtracts learning_rate * gradient from each parameter. For momentum, it updates velocity buffers and then parameters.

Step 5: Clear Operations

Before the next iteration, clear operation nodes from the graph while preserving parameters:

tofu_graph_clear_ops(g);

This frees memory used by intermediate computations (matmul results, activations, etc.) but keeps parameters and their gradients intact.

Complete Training Loop

Here's a full training loop with all the pieces:

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    double total_loss = 0.0;

    for (int batch = 0; batch < num_batches; batch++) {
        // 1. Zero gradients
        tofu_optimizer_zero_grad(opt);

        // 2. Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
        tofu_graph_node *pred = forward_pass(g, x);  // Your model
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

        // Track loss for logging
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
        total_loss += loss_val;

        // 3. Backward pass
        tofu_graph_backward(g, loss);

        // 4. Update parameters
        tofu_optimizer_step(opt);

        // 5. Clear operations
        tofu_graph_clear_ops(g);
    }

    // Log epoch statistics
    printf("Epoch %d: Avg Loss = %.6f\n", epoch, total_loss / num_batches);
}

Common Mistakes

Mistake 1: Forgetting to zero gradients

// WRONG: Gradients accumulate indefinitely
for (int i = 0; i < iterations; i++) {
    // No zero_grad call!
    // ... forward, backward, step ...
}

Each backward pass adds to the gradients left over from earlier iterations, so from the second iteration on the optimizer steps along a stale running sum instead of the current gradient.

Correct:

for (int i = 0; i < iterations; i++) {
    tofu_optimizer_zero_grad(opt);  // Clear old gradients
    // ... forward, backward, step ...
}

Mistake 2: Calling step before backward

// WRONG: No gradients computed yet!
tofu_optimizer_step(opt);
tofu_graph_backward(g, loss);

The optimizer needs gradients to update parameters. Always call backward before step.

Correct:

tofu_graph_backward(g, loss);   // Compute gradients
tofu_optimizer_step(opt);        // Use gradients to update

Mistake 3: Not clearing operations

// WRONG: Memory grows indefinitely
for (int batch = 0; batch < num_batches; batch++) {
    // ... training ...
    // No clear_ops call!
}

Each batch adds nodes to the graph. Without clearing, memory usage grows until the program crashes.

Correct:

for (int batch = 0; batch < num_batches; batch++) {
    // ... training ...
    tofu_graph_clear_ops(g);  // Clear after each batch
}

Monitoring Training

Track key metrics to understand training progress:

for (int epoch = 0; epoch < num_epochs; epoch++) {
    double epoch_loss = 0.0;
    int num_correct = 0;

    for (int batch = 0; batch < num_batches; batch++) {
        // ... training loop ...

        // Track loss
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
        epoch_loss += loss_val;

        // Track accuracy (for classification)
        num_correct += count_correct_predictions(pred, target);
    }

    double avg_loss = epoch_loss / num_batches;
    double accuracy = (double)num_correct / (num_batches * batch_size);

    printf("Epoch %d: Loss = %.6f, Accuracy = %.2f%%\n",
           epoch, avg_loss, accuracy * 100);
}
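The helper count_correct_predictions in the snippet above is not defined in this guide. A minimal sketch of the counting logic, assuming the predictions have already been copied into a row-major float array of per-class scores and the targets into an array of class indices, could look like this (the name count_correct and the array layout are assumptions):

```c
#include <stddef.h>

/* Count rows whose highest-scoring class matches the label.
 * scores: row-major [batch x classes]; labels: class index per row. */
int count_correct(const float *scores, const int *labels,
                  size_t batch, size_t classes) {
    int correct = 0;
    for (size_t i = 0; i < batch; i++) {
        const float *row = scores + i * classes;
        size_t best = 0;
        for (size_t c = 1; c < classes; c++) {
            if (row[c] > row[best]) {
                best = c;
            }
        }
        if ((int)best == labels[i]) {
            correct++;
        }
    }
    return correct;
}
```

Ties are resolved in favor of the lowest class index. How to extract the score array from a graph node is left out here because it depends on the tensor layout.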

Learning Rate Strategies

The learning rate often needs adjustment during training. Starting with a fixed rate works for simple problems, but complex models benefit from learning rate schedules.

Fixed Learning Rate

The simplest strategy: use the same learning rate throughout training.

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop uses 0.01 for all epochs
for (int epoch = 0; epoch < 100; epoch++) {
    // ... training ...
}

This works well when:

The problem is simple
You've found a good learning rate through experimentation
Training converges in a reasonable number of epochs

Step Decay

Reduce the learning rate by a fixed factor every N epochs:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1);

for (int epoch = 0; epoch < 100; epoch++) {
    // Reduce learning rate by 10x every 30 epochs
    if (epoch % 30 == 0 && epoch > 0) {
        opt->learning_rate *= 0.1;
        printf("Epoch %d: Learning rate reduced to %.6f\n",
               epoch, opt->learning_rate);
    }

    // ... training loop ...
}

Common schedules:

Divide by 10 every 30 epochs (0.1 -> 0.01 -> 0.001)
Divide by 2 every 10 epochs (0.1 -> 0.05 -> 0.025)

Step decay is simple and effective for many problems.

Exponential Decay

Gradually reduce the learning rate every epoch:

// pow() requires <math.h>
double initial_lr = 0.1;
double decay_rate = 0.95;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, initial_lr);

for (int epoch = 0; epoch < 100; epoch++) {
    // Update learning rate
    opt->learning_rate = initial_lr * pow(decay_rate, epoch);

    if (epoch % 10 == 0) {
        printf("Epoch %d: Learning rate = %.6f\n", epoch, opt->learning_rate);
    }

    // ... training loop ...
}

This provides smooth, gradual decay. The decay rate controls how quickly the learning rate decreases (0.95 is typical).

Cosine Annealing

Reduce the learning rate following a cosine curve:

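#include <math.h>
// M_PI is a POSIX extension rather than standard C; define it yourself
// if your compiler's <math.h> does not provide it.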

double initial_lr = 0.1;
double min_lr = 0.001;
int num_epochs = 100;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, initial_lr);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Cosine annealing formula
    double progress = (double)epoch / num_epochs;
    opt->learning_rate = min_lr + (initial_lr - min_lr) *
                         (1.0 + cos(M_PI * progress)) / 2.0;

    // ... training loop ...
}

Cosine annealing provides smooth decay that starts slowly, drops fastest around the midpoint of training, and slows down again near the end.

Learning Rate Warmup

For very high initial learning rates, gradually increase from a small value:

double target_lr = 0.1;
int warmup_epochs = 5;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, target_lr / warmup_epochs);

for (int epoch = 0; epoch < 100; epoch++) {
    // Warmup phase
    if (epoch < warmup_epochs) {
        opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
    }

    // ... training loop ...
}

Warmup prevents instability in the first few epochs when using aggressive learning rates.

Adaptive Scheduling

Reduce the learning rate when progress stalls:

double best_loss = INFINITY;  // INFINITY comes from <math.h>
int patience = 5;
int no_improvement_count = 0;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

for (int epoch = 0; epoch < 100; epoch++) {
    // ... training loop ...
    double epoch_loss = compute_epoch_loss();

    // Track progress
    if (epoch_loss < best_loss) {
        best_loss = epoch_loss;
        no_improvement_count = 0;
    } else {
        no_improvement_count++;
    }

    // Reduce learning rate if stuck
    if (no_improvement_count >= patience) {
        opt->learning_rate *= 0.5;
        printf("Reducing learning rate to %.6f\n", opt->learning_rate);
        no_improvement_count = 0;
    }
}

This adapts to training dynamics automatically, reducing the learning rate only when needed.
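The bookkeeping can be bundled into a small state struct so the training loop only has to feed in each epoch's loss. This is a sketch with hypothetical names (plateau_state, plateau_init, plateau_update); the caller would still copy the returned value into the optimizer's learning rate.

```c
#include <math.h>

/* State for a reduce-on-plateau schedule (names are illustrative). */
typedef struct {
    double lr;         /* current learning rate */
    double best_loss;  /* lowest epoch loss seen so far */
    double factor;     /* multiplier applied when stuck, e.g. 0.5 */
    int patience;      /* epochs without improvement before reducing */
    int stale_epochs;  /* epochs since best_loss last improved */
} plateau_state;

plateau_state plateau_init(double lr, double factor, int patience) {
    plateau_state s = { lr, INFINITY, factor, patience, 0 };
    return s;
}

/* Feed one epoch's loss; returns the learning rate to use next. */
double plateau_update(plateau_state *s, double epoch_loss) {
    if (epoch_loss < s->best_loss) {
        s->best_loss = epoch_loss;
        s->stale_epochs = 0;
    } else if (++s->stale_epochs >= s->patience) {
        s->lr *= s->factor;
        s->stale_epochs = 0;
    }
    return s->lr;
}
```

With patience 2 and factor 0.5, the losses 1.0, 1.0, 1.1 leave the rate untouched for two epochs and halve it on the third, after which the counter starts over.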

Choosing a Strategy

Start simple: Use a fixed learning rate first. Only add scheduling if training plateaus.

Step decay: Good default for most problems. Easy to understand and implement.

Exponential/Cosine: Use for long training runs (100+ epochs) where smooth decay is beneficial.

Adaptive: Best when you're not sure how many epochs you need or when progress is unpredictable.

Warmup: Use when starting with very high learning rates (0.1+) to prevent early instability.

Choosing an Optimizer

With multiple optimizers available, how do you choose? Here's a practical decision guide.

Start with SGD + Momentum

For most problems, SGD with momentum (0.01 learning rate, 0.9 momentum) is the best starting point:

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

This provides:

Good convergence speed
Reasonable memory overhead
Robustness to hyperparameter choices

Decision Tree

Is memory extremely tight? (< 2x parameter memory available)
├─ YES: Use vanilla SGD
└─ NO: Continue

Is the problem very simple? (linear model, small dataset)
├─ YES: Use vanilla SGD (momentum won't help much)
└─ NO: Continue

Is the network deep? (> 5 layers)
├─ YES: Use SGD with momentum 0.9 or higher
└─ NO: Use SGD with momentum 0.9

Comparison Table

Optimizer      Memory      Convergence   Tuning      Best For
SGD            Minimal     Slower        Difficult   Memory-constrained, simple problems
SGD+Momentum   2x params   Faster        Moderate    General purpose, deep networks

Network Depth Considerations

Shallow networks (1-3 layers):

Vanilla SGD often sufficient
Momentum helps but not essential

Medium networks (4-10 layers):

Momentum recommended
Use momentum 0.9

Deep networks (10+ layers):

Momentum essential
Use momentum 0.95-0.99

Problem Type Recommendations

Regression (MSE loss):

// Start here
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

Classification (cross-entropy loss):

// May need higher learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.05, 0.9);

Fine-tuning pretrained models:

// Very small learning rate to preserve learned features
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.0001);

When in Doubt

Default configuration for new problems:

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

This works well for most cases. Adjust based on training behavior:

Loss diverges: Reduce learning rate by 10x
Convergence too slow: Increase learning rate by 2-5x
Still slow: Increase momentum to 0.95

Troubleshooting

Training neural networks is often an iterative process of diagnosing and fixing issues. Here are common problems and their solutions.

Loss is NaN or Infinite

Symptoms: Loss becomes NaN or infinity after a few iterations.

Causes:

Learning rate too high
Gradient explosion (very large gradients)
Numerical instability in loss function

Solutions:

Reduce learning rate dramatically:

// If using 0.01, try 0.001
opt->learning_rate = 0.001;

Check gradients for extreme values:

tofu_graph_backward(g, loss);

// Before optimizer step, check gradient magnitudes
for (int i = 0; i < opt->num_params; i++) {
    tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
    if (!grad) continue;  // no gradient computed for this parameter
    double max_grad = find_max_abs_value(grad);

    if (max_grad > 1000.0) {
        printf("Warning: Large gradient detected: %.2f\n", max_grad);
    }
}

tofu_optimizer_step(opt);
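The helper find_max_abs_value used above is not defined in this guide. A minimal sketch of the scan, assuming the gradient values are available as a plain float buffer, might look like this (the name max_abs_value and the buffer-based signature are assumptions):

```c
#include <math.h>
#include <stddef.h>

/* Largest absolute value in a float buffer. NaN entries are reported
 * as infinity so that a corrupted gradient is never missed. */
double max_abs_value(const float *data, size_t n) {
    double max_abs = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (isnan(data[i])) {
            return INFINITY;
        }
        double a = fabs((double)data[i]);
        if (a > max_abs) {
            max_abs = a;
        }
    }
    return max_abs;
}
```

Handling NaN explicitly matters because comparisons against NaN are always false, so a naive maximum would silently skip the very values this check is meant to catch.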

Implement gradient clipping:

void clip_gradients(tofu_optimizer *opt, double max_norm) {
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
        if (!grad) continue;

        // L2 norm of this parameter's gradient (each tensor is clipped separately)
        double norm = 0.0;
        for (int j = 0; j < grad->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
            norm += val * val;
        }
        norm = sqrt(norm);

        // Clip if too large
        if (norm > max_norm) {
            double scale = max_norm / norm;
            for (int j = 0; j < grad->len; j++) {
                float val;
                TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
                val *= scale;
                TOFU_TENSOR_DATA_FROM(grad, j, val, TOFU_FLOAT);
            }
        }
    }
}

// Use before optimizer step
tofu_graph_backward(g, loss);
clip_gradients(opt, 1.0);  // Clip to max norm of 1.0
tofu_optimizer_step(opt);

Loss Not Decreasing

Symptoms: Loss stays constant or decreases very slowly.

Causes:

Learning rate too low
Model stuck in poor initialization
Gradient vanishing
Wrong loss function or labels

Solutions:

Increase learning rate:

opt->learning_rate *= 10.0;  // Try 10x higher

Check if gradients are flowing:

tofu_graph_backward(g, loss);

// Check if gradients are non-zero
for (int i = 0; i < opt->num_params; i++) {
    tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
    if (!grad) continue;  // no gradient computed for this parameter
    double sum_abs = 0.0;

    for (int j = 0; j < grad->len; j++) {
        float val;
        TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
        sum_abs += fabs(val);
    }

    double mean_abs = sum_abs / grad->len;

    if (mean_abs < 1e-7) {
        printf("Warning: Very small gradients (%.2e) for parameter %d\n",
               mean_abs, i);
    }
}

Try momentum if using vanilla SGD:

// Replace vanilla SGD with momentum
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

Loss Oscillates

Symptoms: Loss goes up and down rather than steadily decreasing.

Causes:

Learning rate too high
Batch size too small (noisy gradients)
Wrong momentum setting

Solutions:

Reduce learning rate:

opt->learning_rate *= 0.5;  // Try half the current rate

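Add or adjust momentum. Pick one of the options below; a newly created optimizer starts with zero velocity. If the oscillation began after raising momentum, try a lower value (0.5-0.8) instead:

// Option A: if using vanilla SGD, add momentum
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// Option B: if already using momentum, increase it
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.95);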

Training Slows Down Over Time

Symptoms: Loss decreases quickly at first, then stalls.

Causes:

Learning rate too high to settle into the minimum during the final, fine-grained phase of convergence
Converging to local minimum
Need learning rate schedule

Solutions:

Implement step decay:

for (int epoch = 0; epoch < 100; epoch++) {
    // Reduce learning rate when progress slows
    if (epoch == 30 || epoch == 60 || epoch == 90) {
        opt->learning_rate *= 0.1;
        printf("Reduced learning rate to %.6f\n", opt->learning_rate);
    }

    // ... training loop ...
}

Use adaptive scheduling:

double best_loss = INFINITY;
int no_improvement = 0;

for (int epoch = 0; epoch < 100; epoch++) {
    // ... training ...

    if (epoch_loss < best_loss) {
        best_loss = epoch_loss;
        no_improvement = 0;
    } else {
        no_improvement++;
    }

    if (no_improvement > 5) {
        opt->learning_rate *= 0.5;
        no_improvement = 0;
    }
}

Memory Issues

Symptoms: Program crashes with allocation errors or runs out of memory.

Causes:

Not clearing operations between batches
Momentum optimizer on large networks
Accumulating tensors unintentionally

Solutions:

Always clear operations:

for (int batch = 0; batch < num_batches; batch++) {
    // ... training ...
    tofu_graph_clear_ops(g);  // Essential for memory management
}

Use vanilla SGD if momentum exhausts memory:

// Switch from momentum to vanilla SGD
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_create(g, 0.01);

Best Practices

Here are guidelines to help you train models effectively and avoid common pitfalls.

Always Zero Gradients

Make this the first line of every training iteration:

for (int batch = 0; batch < num_batches; batch++) {
    tofu_optimizer_zero_grad(opt);  // Never forget this!
    // ... rest of training loop ...
}

Without this, gradients accumulate across batches, leading to incorrect updates.

Monitor Multiple Metrics

Don't rely on loss alone. Track additional metrics:

for (int epoch = 0; epoch < num_epochs; epoch++) {
    double total_loss = 0.0;
    double total_l2_norm = 0.0;
    double max_grad = 0.0;

    for (int batch = 0; batch < num_batches; batch++) {
        // ... training ...

        // Track loss
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
        total_loss += loss_val;

        // Track parameter norm
        total_l2_norm += compute_parameter_norm(opt);

        // Track max gradient
        max_grad = fmax(max_grad, compute_max_gradient(opt));
    }

    printf("Epoch %d: Loss=%.4f, Param_Norm=%.4f, Max_Grad=%.4f\n",
           epoch, total_loss / num_batches,
           total_l2_norm / num_batches, max_grad);
}

Save Checkpoints

Periodically save model parameters during training:

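// Writes every parameter tensor to filename as raw floats.
// Returns 0 on success, -1 if the file cannot be opened or a write fails.
int save_parameters(tofu_optimizer *opt, const char *filename) {
    FILE *f = fopen(filename, "wb");
    if (!f) return -1;

    int status = 0;
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
        size_t count = (size_t)param->len;
        // Assumes TOFU_FLOAT parameters stored contiguously
        if (fwrite(param->data, sizeof(float), count, f) != count) {
            status = -1;
            break;
        }
    }

    if (fclose(f) != 0) status = -1;
    return status;
}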

// Use during training
for (int epoch = 0; epoch < 100; epoch++) {
    // ... training loop ...

    // Save every 10 epochs
    if (epoch % 10 == 0) {
        char filename[256];
        snprintf(filename, sizeof(filename), "model_epoch_%d.bin", epoch);
        save_parameters(opt, filename);
    }
}
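
A checkpoint is only useful if it can be read back. Below is a sketch of a save/load pair on a raw float buffer that reports failures through its return value. `write_floats` and `read_floats` are hypothetical names; using them on each `param->data` assumes, as the code above does, that parameters are stored as contiguous floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes n floats to path. Returns 0 on success, -1 on any failure. */
int write_floats(const char *path, const float *data, size_t n) {
    assert(path != NULL && data != NULL);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(data, sizeof(float), n, f);
    int close_failed = (fclose(f) != 0);
    return (written == n && !close_failed) ? 0 : -1;
}

/* Reads exactly n floats from path into data. Returns 0 on success,
 * -1 if the file is missing or holds fewer than n floats. */
int read_floats(const char *path, float *data, size_t n) {
    assert(path != NULL && data != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t got = fread(data, sizeof(float), n, f);
    fclose(f);
    return got == n ? 0 : -1;
}
```

Raw float dumps like this are only portable between machines with the same float format and byte order, and the reader must know the parameter shapes in advance.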

Start Conservative

Begin with conservative hyperparameters and increase aggressiveness only if needed:

// Conservative defaults
double learning_rate = 0.01;  // Not too high
double momentum = 0.9;         // Standard momentum

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, learning_rate, momentum);

It is easier to raise the learning rate when training is too slow than to recover from the instability caused by a rate that is too high.

Test on Small Data First

Before training on the full dataset, verify your setup on a small subset:

// Test with 10 batches first
int test_batches = 10;

for (int epoch = 0; epoch < 5; epoch++) {
    for (int batch = 0; batch < test_batches; batch++) {
        // ... training loop ...
    }
}

// If loss decreases on small data, scale to full dataset

This quickly reveals issues with the model, loss function, or optimizer configuration.

Use Learning Rate Warmup for High Rates

When using aggressive learning rates (> 0.05), ramp the rate up linearly over the first few epochs instead of starting at the full value:

double target_lr = 0.1;
int warmup_epochs = 5;

for (int epoch = 0; epoch < 100; epoch++) {
    if (epoch < warmup_epochs) {
        opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
    } else {
        opt->learning_rate = target_lr;
    }

    // ... training loop ...
}
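
The same schedule can be factored into a pure function, which makes it easy to check the values before a long run. `warmup_lr` is a hypothetical helper, not a Tofu function.

```c
#include <assert.h>

/* Linear warmup: ramps from target_lr / warmup_epochs up to target_lr over
 * the first warmup_epochs epochs, then stays at target_lr. */
double warmup_lr(int epoch, int warmup_epochs, double target_lr) {
    assert(epoch >= 0 && warmup_epochs >= 0);
    if (epoch < warmup_epochs) {
        return target_lr * (epoch + 1) / warmup_epochs;
    }
    return target_lr;
}
```

Inside the epoch loop this becomes `opt->learning_rate = warmup_lr(epoch, warmup_epochs, target_lr);`. With `target_lr = 0.1` and `warmup_epochs = 5`, epochs 0 through 4 use approximately 0.02, 0.04, 0.06, 0.08, and 0.1.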

Document Your Configuration

Keep track of hyperparameters that work:

// Document successful configurations
printf("Configuration:\n");
printf("  Optimizer: SGD with Momentum\n");
printf("  Learning Rate: %.6f\n", opt->learning_rate);
printf("  Momentum: %.2f\n", momentum);  // log the value actually passed to the optimizer
printf("  Batch Size: %d\n", batch_size);
printf("  Schedule: Step decay by 0.1 every 30 epochs\n");

This helps when you need to replicate results or adjust for similar problems.

Complete Example

Here's a complete training example that demonstrates all the concepts from this guide.

Problem: Binary Classification

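We'll train a two-layer neural network to classify binary data. The example uses synthetic inputs with random labels and an MSE loss, so it exercises the training loop rather than learning a meaningful decision boundary.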

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu.h"

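// Network architecture: input(10) -> hidden(20) -> output(1)

// Helpers defined after main(); declared here so main() can call them
tofu_tensor *generate_batch_data(int batch_size, int input_dim);
tofu_tensor *generate_batch_labels(int batch_size, int output_dim);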

int main(void) {
    // Hyperparameters
    const int input_dim = 10;
    const int hidden_dim = 20;
    const int output_dim = 1;
    const int batch_size = 32;
    const int num_batches = 100;
    const int num_epochs = 50;
    const double learning_rate = 0.01;
    const double momentum = 0.9;

    // Create graph
    tofu_graph *g = tofu_graph_create();

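    // Initialize parameters (biases start at zero)
    tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){input_dim, hidden_dim}, TOFU_FLOAT);
    tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){hidden_dim}, TOFU_FLOAT);
    tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){hidden_dim, output_dim}, TOFU_FLOAT);
    tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){output_dim}, TOFU_FLOAT);

    // Give the weights small random values. All-zero weights make every hidden
    // unit identical and block gradient flow through W1 and W2.
    float *w1 = (float *)W1->data;
    float *w2 = (float *)W2->data;
    for (int i = 0; i < input_dim * hidden_dim; i++)
        w1[i] = ((float)rand() / RAND_MAX - 0.5f) * 0.2f;  // uniform in [-0.1, 0.1]
    for (int i = 0; i < hidden_dim * output_dim; i++)
        w2[i] = ((float)rand() / RAND_MAX - 0.5f) * 0.2f;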

    // Add parameters to graph
    tofu_graph_node *W1_node = tofu_graph_param(g, W1);
    tofu_graph_node *b1_node = tofu_graph_param(g, b1);
    tofu_graph_node *W2_node = tofu_graph_param(g, W2);
    tofu_graph_node *b2_node = tofu_graph_param(g, b2);

    // Create optimizer
    tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, learning_rate, momentum);

    printf("Training Configuration:\n");
    printf("  Architecture: %d -> %d -> %d\n", input_dim, hidden_dim, output_dim);
    printf("  Optimizer: SGD with Momentum\n");
    printf("  Learning Rate: %.4f\n", learning_rate);
    printf("  Momentum: %.2f\n", momentum);
    printf("  Batch Size: %d\n", batch_size);
    printf("  Epochs: %d\n\n", num_epochs);

    // Training loop
    for (int epoch = 0; epoch < num_epochs; epoch++) {
        double epoch_loss = 0.0;

        // Learning rate schedule: multiply by 0.1 every 20 epochs
        if (epoch % 20 == 0 && epoch > 0) {
            opt->learning_rate *= 0.1;
            printf("Reduced learning rate to %.6f\n", opt->learning_rate);
        }

        for (int batch = 0; batch < num_batches; batch++) {
            // 1. Zero gradients
            tofu_optimizer_zero_grad(opt);

            // 2. Generate synthetic batch data (normally loaded from dataset)
            tofu_tensor *batch_x = generate_batch_data(batch_size, input_dim);
            tofu_tensor *batch_y = generate_batch_labels(batch_size, output_dim);

            // 3. Forward pass
            tofu_graph_node *x = tofu_graph_input(g, batch_x);

            // Layer 1: h = relu(x @ W1 + b1)
            tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
            h1 = tofu_graph_add(g, h1, b1_node);
            h1 = tofu_graph_relu(g, h1);

            // Layer 2: pred = h @ W2 + b2
            tofu_graph_node *pred = tofu_graph_matmul(g, h1, W2_node);
            pred = tofu_graph_add(g, pred, b2_node);

            // Loss
            tofu_graph_node *target = tofu_graph_input(g, batch_y);
            tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

            // Track loss
            float loss_val;
            TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
            epoch_loss += loss_val;

            // 4. Backward pass
            tofu_graph_backward(g, loss);

            // 5. Update parameters
            tofu_optimizer_step(opt);

            // 6. Clear operations
            tofu_graph_clear_ops(g);

            // Free batch data
            tofu_tensor_free_data_too(batch_x);
            tofu_tensor_free_data_too(batch_y);
        }

        // Log progress
        double avg_loss = epoch_loss / num_batches;
        printf("Epoch %2d: Loss = %.6f\n", epoch, avg_loss);

        // Early stopping
        if (avg_loss < 0.001) {
            printf("Converged! Stopping early.\n");
            break;
        }
    }

    // Cleanup
    tofu_optimizer_free(opt);
    tofu_graph_free(g);
    tofu_tensor_free_data_too(W1);
    tofu_tensor_free_data_too(b1);
    tofu_tensor_free_data_too(W2);
    tofu_tensor_free_data_too(b2);

    printf("\nTraining complete!\n");

    return 0;
}

// Helper function to generate synthetic data (replace with real data loading)
tofu_tensor* generate_batch_data(int batch_size, int input_dim) {
    float *data = (float*)malloc(batch_size * input_dim * sizeof(float));
    for (int i = 0; i < batch_size * input_dim; i++) {
        data[i] = ((float)rand() / RAND_MAX) * 2.0f - 1.0f;  // Random in [-1, 1]
    }
    tofu_tensor *t = tofu_tensor_create_with_values(data, 2,
                                                     (int[]){batch_size, input_dim});
    free(data);
    return t;
}

tofu_tensor* generate_batch_labels(int batch_size, int output_dim) {
    float *data = (float*)malloc(batch_size * output_dim * sizeof(float));
    for (int i = 0; i < batch_size * output_dim; i++) {
        data[i] = ((float)rand() / RAND_MAX) > 0.5f ? 1.0f : 0.0f;
    }
    tofu_tensor *t = tofu_tensor_create_with_values(data, 2,
                                                     (int[]){batch_size, output_dim});
    free(data);
    return t;
}

This example demonstrates:

Creating a computation graph with parameters
Building a multi-layer network
Setting up an optimizer with momentum
Implementing a learning rate schedule
Proper training loop structure
Monitoring loss over time
Correct cleanup order

Adapt this template for your specific problem by:

Changing network architecture (layer sizes, activations)
Loading real data instead of synthetic batches
Adding validation/test evaluation
Saving model checkpoints
Implementing gradient clipping if needed
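
For the last item, one simple variant is clipping by value, where each gradient entry is clamped to a fixed range before the optimizer step. The sketch below works on a raw float buffer. `clip_by_value` is a hypothetical helper, and how gradient buffers are reached in Tofu is not shown here, so that wiring is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Clamps every entry of grad to [-limit, limit] in place.
 * Returns the number of entries that were changed. */
size_t clip_by_value(float *grad, size_t n, float limit) {
    assert(limit > 0.0f);
    assert(grad != NULL || n == 0);
    size_t clipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (grad[i] > limit) {
            grad[i] = limit;
            clipped++;
        } else if (grad[i] < -limit) {
            grad[i] = -limit;
            clipped++;
        }
    }
    return clipped;
}
```

The return value doubles as a diagnostic: if most entries are clipped on most batches, the learning rate or initialization is a more likely culprit than occasional gradient spikes.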

Summary

Optimizers are the engine of neural network training. They take computed gradients and update parameters to minimize loss. Here are the key takeaways:

Core Concepts:

Gradient descent follows gradients downhill toward lower loss
Learning rate controls step size (most important hyperparameter)
Momentum accumulates velocity to accelerate convergence and dampen oscillations

Choosing an Optimizer:

Start with SGD + momentum (learning_rate=0.01, momentum=0.9)
Use vanilla SGD only for memory-constrained or very simple problems
Deep networks often benefit from higher momentum (0.95-0.99)

Using Optimizers:

Always follow the pattern: zero_grad, forward, backward, step
Clear operations between batches to manage memory
Monitor training metrics (loss, gradients, parameter norms)

Tuning:

Start with learning_rate=0.01
Increase if training is too slow, decrease if unstable
Use learning rate schedules for long training runs
Implement gradient clipping for unstable gradients

Troubleshooting:

NaN loss: Reduce learning rate, clip gradients
No progress: Increase learning rate, add momentum
Oscillations: Reduce learning rate, increase momentum
Slow convergence: Use learning rate schedule, higher momentum

With these principles, you can train neural networks effectively and debug issues when they arise. Experiment with different configurations, monitor training carefully, and adjust based on what you observe. The best optimizer and hyperparameters depend on your specific problem, so be prepared to iterate.
