Loss Functions

Loss functions are the core mechanism that guides neural network training. They measure how far your model's predictions are from the true values, producing a single scalar number that quantifies the error. During training, the optimizer uses gradients of this loss to adjust model parameters and improve predictions.

This guide explains how loss functions work, when to use each type, and how to integrate them into your training loops. You'll learn to choose the right loss function for your task and interpret loss values during training.


Introduction

Every machine learning model needs a way to evaluate how well it's performing. A loss function (also called an objective function or cost function) provides this evaluation by computing a numerical score representing prediction error.

During training:

  1. The model makes predictions on input data
  2. The loss function compares predictions to true target values
  3. The result is a single number (scalar) representing total error
  4. Gradients of this loss tell us how to adjust weights
  5. The optimizer updates weights to reduce the loss

The choice of loss function depends on your task type (regression vs classification) and the structure of your data. Tofu provides two fundamental loss functions that cover most use cases:

  • Mean Squared Error (MSE): For regression tasks where you predict continuous values
  • Cross-Entropy Loss: For classification tasks where you predict discrete classes

Let's explore the fundamental properties all loss functions must have, then dive into each type.


Loss Function Fundamentals

To work correctly with gradient-based optimization, loss functions must satisfy three key requirements.

1. Objective Function

A loss function defines the optimization objective—the quantity we want to minimize during training. Lower loss means better predictions. The training process iteratively adjusts model parameters to find weights that minimize this function.

Think of it like hiking down a mountain in fog. The loss value tells you your current altitude, and the gradient tells you which direction is downhill. Your goal is to reach the lowest point (minimize loss).

2. Scalar Output

Loss functions must return a single number (scalar), not a vector or matrix. This scalar summarizes all prediction errors across all samples and features into one value.

Why scalar? Because optimization algorithms need a single objective to minimize. You can't simultaneously minimize multiple conflicting objectives without combining them into one number.

Example shapes:

Predictions:  [batch_size, features]  e.g., [32, 10]
Targets:      [batch_size, features]  e.g., [32, 10]
Loss:         [1]                     (scalar)

The loss computation typically:

  1. Computes per-element errors
  2. Sums or averages across all elements
  3. Returns a single scalar value
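
To make the reduction concrete, here is a minimal, framework-free sketch in plain C (no Tofu calls, purely illustrative) that collapses per-element squared errors into one scalar, mirroring what an MSE-style loss does internally:

#include <stdio.h>

/* Reduce per-element squared errors to a single scalar loss. */
static float mean_squared_error(const float *pred, const float *target, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float err = pred[i] - target[i];  /* per-element error */
        sum += err * err;                 /* accumulate squared errors */
    }
    return sum / (float)n;                /* average into one scalar */
}

int main(void) {
    float pred[]   = {1.0f, 2.5f, 0.0f};
    float target[] = {1.5f, 2.0f, 0.0f};
    printf("loss = %.4f\n", mean_squared_error(pred, target, 3));
    return 0;
}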

3. Differentiable

Loss functions must be differentiable (smooth, with computable gradients). The gradient tells us how loss changes when we adjust each parameter—it's the compass that guides optimization.

Non-differentiable functions (like step functions or absolute value at zero) create problems for gradient-based optimizers. They can't compute meaningful gradients, so training fails or converges slowly.

Mathematically, we need:

∂L/∂w  (gradient of loss L with respect to each weight w)

Tofu's automatic differentiation system computes these gradients automatically through backpropagation, so you don't need to derive formulas manually.

Loss Function Workflow

Here's how a loss function fits into one training iteration:

1. Forward Pass:
   Input → Model → Predictions

2. Loss Computation:
   Loss = loss_function(Predictions, Targets)

3. Backward Pass:
   Compute ∂Loss/∂weights for all parameters

4. Parameter Update:
   weights = weights - learning_rate * ∂Loss/∂weights

5. Repeat until loss is minimized

Now let's examine specific loss functions and when to use them.


Mean Squared Error

Mean Squared Error (MSE) is the most common loss function for regression tasks—problems where you predict continuous numerical values rather than discrete classes.

Mathematical Formula

MSE computes the average squared difference between predictions and targets:

MSE = (1/n) * Σ(prediction - target)²

Where:
- n = number of elements (batch_size × features)
- Σ = sum over all elements

The squaring operation ensures:

  • Errors are always positive (negative errors don't cancel positive ones)
  • Large errors are penalized more heavily than small errors
  • The function is smooth and differentiable everywhere

Implementation in Tofu

Use tofu_graph_mse_loss() to add MSE loss to your computation graph:

tofu_graph_node* tofu_graph_mse_loss(tofu_graph* g,
                                      tofu_graph_node* pred,
                                      tofu_graph_node* target);

Parameters:

  • g: Computation graph
  • pred: Model predictions (any shape)
  • target: True target values (must match pred shape)

Returns: Scalar loss node (shape [1])

Requirements:

  • pred and target must have identical shapes
  • Both must be non-NULL

When to Use MSE

MSE is ideal for regression problems:

Perfect use cases:

  • Predicting house prices (continuous dollar values)
  • Estimating temperature (continuous degrees)
  • Forecasting stock prices
  • Predicting ages, distances, or other continuous quantities
  • Image denoising (pixel value reconstruction)

Why it works:

  • Treats all dimensions equally
  • Penalizes large errors more than small ones (squared term)
  • Has nice mathematical properties (convex, smooth gradients)
  • Easy to interpret (units are squared target units)

When NOT to use:

  • Classification tasks (use cross-entropy instead)
  • When outliers are common (MSE heavily penalizes outliers)
  • When you care about percentage error rather than absolute error

Practical Example

Here's a complete regression example predicting house prices:

// Setup: Simple linear regression y = x @ W + b
tofu_graph *g = tofu_graph_create();

// Training data: 4 samples with 2 features each
float input_data[] = {
    1.0f, 2.0f,   // Sample 1: [sqft=1000, bedrooms=2]
    2.0f, 3.0f,   // Sample 2: [sqft=2000, bedrooms=3]
    3.0f, 4.0f,   // Sample 3: [sqft=3000, bedrooms=4]
    4.0f, 5.0f    // Sample 4: [sqft=4000, bedrooms=5]
};

// Target prices (in thousands of dollars)
float target_data[] = {
    150.0f,  // $150k
    250.0f,  // $250k
    350.0f,  // $350k
    450.0f   // $450k
};

// Create tensors
tofu_tensor *x_tensor = tofu_tensor_create(
    input_data, 2, (int[]){4, 2}, TOFU_FLOAT);
tofu_tensor *y_tensor = tofu_tensor_create(
    target_data, 2, (int[]){4, 1}, TOFU_FLOAT);

// Model parameters (weights and bias)
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){2, 1}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){1}, TOFU_FLOAT);

// Build computation graph
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// Forward pass: prediction = x @ W + b
tofu_graph_node *matmul_result = tofu_graph_matmul(g, x, W);
tofu_graph_node *prediction = tofu_graph_add(g, matmul_result, b);

// Target
tofu_graph_node *target = tofu_graph_input(g, y_tensor);

// Compute MSE loss
tofu_graph_node *loss = tofu_graph_mse_loss(g, prediction, target);

// Get loss value
tofu_tensor *loss_value = tofu_graph_get_value(loss);
float loss_scalar;
TOFU_TENSOR_DATA_TO(loss_value, 0, loss_scalar, TOFU_FLOAT);
printf("MSE Loss: %.6f\n", loss_scalar);

// Backward pass to compute gradients
tofu_graph_backward(g, loss);

// Now W->grad and b->grad contain gradients for parameter updates

Understanding MSE Values

MSE values depend on the scale of your target values:

Small targets (e.g., normalized to [0, 1]):

  • Good MSE: < 0.01
  • Acceptable: 0.01 - 0.1
  • Poor: > 0.1

Large targets (e.g., house prices in thousands of dollars):

  • MSE = 10,000 means typical error is √10,000 = 100, i.e., roughly $100k
  • MSE = 1,000 means typical error is √1,000 ≈ 32, i.e., roughly $32k
  • MSE = 100 means typical error is √100 = 10, i.e., roughly $10k

Tip: Take the square root of MSE to get Root Mean Squared Error (RMSE), which has the same units as your target variable and is easier to interpret.
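
For example, a tiny plain-C snippet (assuming you have already extracted the scalar MSE as shown in the example above):

#include <math.h>
#include <stdio.h>

int main(void) {
    float mse_value = 1000.0f;       /* example MSE, in squared target units */
    float rmse = sqrtf(mse_value);   /* back in the same units as the targets */
    printf("MSE = %.1f, RMSE = %.1f\n", mse_value, rmse);
    return 0;
}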

Gradient Behavior

The gradient of MSE with respect to predictions is:

∂MSE/∂pred = (2/n) * (pred - target)

Key properties:

  • Gradient magnitude is proportional to error size
  • Large errors produce large gradients (faster learning)
  • Small errors produce small gradients (slower learning)
  • Can cause exploding gradients if predictions are very wrong
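
A framework-free sketch of this gradient in plain C (not Tofu's internal implementation, just the formula above applied element-wise):

#include <stdio.h>

/* dMSE/dpred[i] = (2/n) * (pred[i] - target[i]) */
static void mse_grad(const float *pred, const float *target, float *grad, int n) {
    for (int i = 0; i < n; i++) {
        grad[i] = (2.0f / (float)n) * (pred[i] - target[i]);
    }
}

int main(void) {
    float pred[]   = {3.0f, -1.0f};
    float target[] = {1.0f,  0.0f};
    float grad[2];
    mse_grad(pred, target, grad, 2);
    /* Larger error on the first element produces the larger gradient. */
    printf("grad = [%.2f, %.2f]\n", grad[0], grad[1]);
    return 0;
}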

Cross-Entropy Loss

Cross-Entropy Loss (also called log loss) is the standard loss function for classification tasks—problems where you assign inputs to discrete categories.

Mathematical Formula

Cross-entropy measures the difference between predicted probability distributions and true labels:

CE = -(1/n) * Σ(target * log(prediction))

Where:
- n = batch_size × num_classes
- target = one-hot encoded true class (or class probabilities)
- prediction = softmax probabilities (sum to 1)
- log = natural logarithm

The formula rewards correct classifications (low loss) and heavily penalizes confident wrong predictions (high loss).
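
A framework-free sketch of this computation in plain C, following the formula above and adding a small epsilon to avoid log(0) (similar in spirit to what a library implementation does):

#include <math.h>
#include <stdio.h>

/* Cross-entropy between softmax probabilities and one-hot targets. */
static float cross_entropy(const float *probs, const float *target,
                           int batch, int classes) {
    const float eps = 1e-7f;   /* avoid log(0) */
    float sum = 0.0f;
    for (int i = 0; i < batch * classes; i++) {
        sum += target[i] * logf(probs[i] + eps);
    }
    return -sum / (float)(batch * classes);
}

int main(void) {
    /* 2 samples, 3 classes; each row of probs sums to 1 */
    float probs[]  = {0.7f, 0.2f, 0.1f,   0.1f, 0.8f, 0.1f};
    float target[] = {1.0f, 0.0f, 0.0f,   0.0f, 1.0f, 0.0f};
    printf("CE = %.4f\n", cross_entropy(probs, target, 2, 3));
    return 0;
}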

Why Cross-Entropy for Classification?

Cross-entropy has special properties that make it ideal for classification:

  1. Probabilistic interpretation: It measures the "surprise" of predictions given the true distribution
  2. Strong gradients: Even for small probability errors, gradients remain strong enough to drive learning
  3. Numerical stability: Works well with softmax activation (more on this below)
  4. Theoretical foundation: Derived from maximum likelihood estimation in statistics

MSE doesn't work well for classification because:

  • It treats class probabilities as arbitrary numbers, ignoring their sum-to-one constraint
  • Gradients vanish when the model is confident but wrong
  • No probabilistic interpretation

Softmax and Cross-Entropy Connection

Cross-entropy is almost always used with softmax activation on the final layer. Here's why:

Softmax converts raw scores (logits) into probabilities:

softmax(x_i) = exp(x_i) / Σ(exp(x_j))

Properties:
- All outputs are in range (0, 1)
- Outputs sum to 1 (valid probability distribution)
- Highlights the maximum value (turns scores into confident predictions)

Together, softmax + cross-entropy creates a powerful combination:

  • Softmax outputs represent class probabilities
  • Cross-entropy compares these probabilities to true labels
  • Gradients flow efficiently even when predictions are wrong
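
A minimal softmax sketch in plain C for one row of logits (subtracting the row maximum for numerical stability, an implementation detail not required by the formula above):

#include <math.h>
#include <stdio.h>

/* Softmax over a single row of logits. */
static void softmax(const float *logits, float *out, int n) {
    float max = logits[0];
    for (int i = 1; i < n; i++) if (logits[i] > max) max = logits[i];

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        out[i] = expf(logits[i] - max);   /* subtract max to avoid overflow */
        sum += out[i];
    }
    for (int i = 0; i < n; i++) out[i] /= sum;   /* outputs now sum to 1 */
}

int main(void) {
    float logits[] = {2.0f, 1.0f, 0.1f};
    float probs[3];
    softmax(logits, probs, 3);
    printf("probs = [%.3f, %.3f, %.3f]\n", probs[0], probs[1], probs[2]);
    return 0;
}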

Implementation in Tofu

Use tofu_graph_ce_loss() with softmax probabilities:

tofu_graph_node* tofu_graph_ce_loss(tofu_graph* g,
                                     tofu_graph_node* pred,
                                     tofu_graph_node* target);

Parameters:

  • g: Computation graph
  • pred: Predicted probabilities from softmax (shape: [batch, num_classes])
  • target: One-hot encoded true labels (shape: [batch, num_classes])

Returns: Scalar loss node (shape [1])

Requirements:

  • pred must be softmax probabilities (values in [0, 1], sum to 1 per sample)
  • target must be one-hot encoded (1 for true class, 0 for others)
  • Shapes must match

Numerical stability: Tofu's implementation adds a small epsilon (1e-7) inside the logarithm so it never evaluates log(0), which would blow up to negative infinity.

When to Use Cross-Entropy

Cross-entropy is ideal for classification:

Perfect use cases:

  • Image classification (cat vs dog vs bird)
  • Text classification (spam vs not spam)
  • Sentiment analysis (positive/negative/neutral)
  • Multi-class problems (10 digits, 1000 object categories, etc.)
  • Any problem with discrete categorical outputs

Why it works:

  • Designed for probability distributions
  • Strong gradients throughout training
  • Natural pairing with softmax
  • Well-studied theoretical properties

When NOT to use:

  • Regression problems (use MSE instead)
  • Multi-label classification where multiple classes can be true simultaneously (requires binary cross-entropy per class)

Practical Example: MNIST-style Digit Classification

Here's a complete classification example for 4 classes:

// Setup: Neural network for 4-class classification
tofu_graph *g = tofu_graph_create();

// Training data: 8 samples with 10 features each
float input_data[8 * 10] = { /* ...fill with data... */ };

// One-hot encoded labels (8 samples, 4 classes)
float label_data[8 * 4] = {
    1, 0, 0, 0,  // Sample 0: class 0
    0, 1, 0, 0,  // Sample 1: class 1
    0, 0, 1, 0,  // Sample 2: class 2
    0, 0, 0, 1,  // Sample 3: class 3
    1, 0, 0, 0,  // Sample 4: class 0
    0, 1, 0, 0,  // Sample 5: class 1
    0, 0, 1, 0,  // Sample 6: class 2
    0, 0, 0, 1   // Sample 7: class 3
};

// Create tensors
tofu_tensor *x_tensor = tofu_tensor_create(
    input_data, 2, (int[]){8, 10}, TOFU_FLOAT);
tofu_tensor *y_tensor = tofu_tensor_create(
    label_data, 2, (int[]){8, 4}, TOFU_FLOAT);

// Model parameters
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){10, 4}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);

// Build graph
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// Forward pass: logits = x @ W + b
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
logits = tofu_graph_add(g, logits, b);

// Softmax activation (converts logits to probabilities)
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);  // axis=1

// Target labels
tofu_graph_node *target = tofu_graph_input(g, y_tensor);

// Cross-entropy loss
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);

// Get loss value
tofu_tensor *loss_value = tofu_graph_get_value(loss);
float loss_scalar;
TOFU_TENSOR_DATA_TO(loss_value, 0, loss_scalar, TOFU_FLOAT);
printf("Cross-Entropy Loss: %.6f\n", loss_scalar);

// Backward pass
tofu_graph_backward(g, loss);

// Gradients are now available in W->grad and b->grad

Understanding Cross-Entropy Values

Cross-entropy values depend on the number of classes:

Binary classification (2 classes):

  • Random guessing: ~0.693 (log(2))
  • Good model: < 0.3
  • Excellent model: < 0.1

Multi-class (e.g., 10 classes):

  • Random guessing: ~2.303 (log(10))
  • Good model: < 1.0
  • Excellent model: < 0.5

Key insight: Cross-entropy can never be negative. Zero loss means perfect predictions (100% confidence in correct class). As predictions get worse, loss increases without bound.

Interpreting Loss During Training

Watch for these patterns:

Healthy training:

Epoch 0:  loss = 2.30  (random initialization)
Epoch 10: loss = 1.20  (learning started)
Epoch 50: loss = 0.50  (converging)
Epoch 100: loss = 0.15  (well-trained)

Problems:

  • Loss stays at log(num_classes): Model isn't learning (check learning rate)
  • Loss increases: Learning rate too high or numerical instability
  • Loss plateaus early: Model too simple or data too hard

Gradient Behavior

The gradient of cross-entropy with respect to predictions is:

∂CE/∂pred = -(1/n) * (target / pred)

Key properties:

  • When prediction is very wrong (pred ≈ 0 but target = 1), gradient is very large
  • When prediction is correct and confident (pred ≈ 1 and target = 1), gradient is small
  • This creates strong learning signals when needed most

Combined with softmax, the gradient simplifies beautifully to:

∂CE/∂logits = pred - target

This is why softmax + cross-entropy is the gold standard for classification.
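
A framework-free sketch of that simplified gradient in plain C (Tofu's automatic differentiation computes this for you, so this is purely illustrative):

#include <stdio.h>

/* Gradient of cross-entropy w.r.t. logits when paired with softmax:
 * grad[i] = probs[i] - target[i] (element-wise, up to batch averaging). */
static void softmax_ce_grad(const float *probs, const float *target,
                            float *grad, int n) {
    for (int i = 0; i < n; i++) {
        grad[i] = probs[i] - target[i];
    }
}

int main(void) {
    float probs[]  = {0.7f, 0.2f, 0.1f};   /* softmax output */
    float target[] = {0.0f, 1.0f, 0.0f};   /* true class is 1 */
    float grad[3];
    softmax_ce_grad(probs, target, grad, 3);
    printf("grad = [%.2f, %.2f, %.2f]\n", grad[0], grad[1], grad[2]);
    return 0;
}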


Choosing a Loss Function

Selecting the right loss function is critical—it defines what "good" means for your model. Here's a decision guide.

Decision Tree

Start: What type of problem are you solving?
│
├─ Predicting continuous values (numbers)?
│  └─ Use Mean Squared Error (MSE)
│     Examples: Regression, image denoising, forecasting
│
└─ Predicting discrete categories (classes)?
   └─ Use Cross-Entropy Loss
      Examples: Classification, object recognition, sentiment analysis

Quick Reference Table

Task Type                  | Loss Function | Output Activation   | Target Format
Regression                 | MSE           | None (linear)       | Continuous values
Binary Classification      | Cross-Entropy | Softmax (2 classes) | One-hot [0,1] or [1,0]
Multi-class Classification | Cross-Entropy | Softmax             | One-hot encoding
Image Reconstruction       | MSE           | None or Sigmoid     | Pixel values

Detailed Recommendations

Use MSE when:

  • Output is a continuous number (prices, temperatures, distances)
  • You care about absolute error magnitude
  • Your task is regression or reconstruction
  • Outliers are rare or acceptable

Use Cross-Entropy when:

  • Output is a discrete category (class label)
  • You need probability predictions
  • Your task is classification
  • You want strong gradients throughout training

Example scenarios:

Problem                | Input                     | Output           | Loss          | Why
House price prediction | Features (sqft, bedrooms) | Price ($)        | MSE           | Continuous value
Spam detection         | Email text                | Spam/Not Spam    | Cross-Entropy | Binary classification
Digit recognition      | Image pixels              | Digit (0-9)      | Cross-Entropy | Multi-class classification
Temperature forecast   | Historical data           | Temperature (°F) | MSE           | Continuous value
Sentiment analysis     | Review text               | Pos/Neg/Neutral  | Cross-Entropy | Multi-class classification

Common Mistakes

Mistake 1: Using MSE for classification

// WRONG: Using MSE to predict classes
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *loss = tofu_graph_mse_loss(g, logits, target);  // Bad!

Problem: MSE treats class probabilities as arbitrary numbers, leading to weak gradients and poor convergence.

Fix: Use softmax + cross-entropy:

// CORRECT: Classification with cross-entropy
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);  // Good!

Mistake 2: Forgetting softmax before cross-entropy

// WRONG: Cross-entropy without softmax
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *loss = tofu_graph_ce_loss(g, logits, target);  // Bad!

Problem: Cross-entropy expects probabilities (sum to 1), but logits are raw scores.

Fix: Always apply softmax first:

// CORRECT: Softmax before cross-entropy
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);  // Good!

Mistake 3: Wrong target format

// WRONG: Class indices instead of one-hot for cross-entropy
float targets[] = {0, 1, 2, 1};  // Class indices

Problem: Cross-entropy expects one-hot encoded targets, not class indices.

Fix: Convert to one-hot:

// CORRECT: One-hot encoded targets
float targets[] = {
    1, 0, 0,  // Class 0
    0, 1, 0,  // Class 1
    0, 0, 1,  // Class 2
    0, 1, 0   // Class 1
};
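
If your labels arrive as class indices, a small plain-C helper like the following (hypothetical, not part of Tofu's API) converts them to the one-hot layout the loss expects:

#include <stdio.h>
#include <string.h>

/* Expand class indices into a one-hot matrix of shape [n_samples, n_classes]. */
static void to_one_hot(const int *indices, float *one_hot,
                       int n_samples, int n_classes) {
    memset(one_hot, 0, sizeof(float) * n_samples * n_classes);
    for (int i = 0; i < n_samples; i++) {
        one_hot[i * n_classes + indices[i]] = 1.0f;
    }
}

int main(void) {
    int labels[] = {0, 1, 2, 1};     /* class indices */
    float one_hot[4 * 3];
    to_one_hot(labels, one_hot, 4, 3);
    for (int i = 0; i < 4; i++) {
        printf("%.0f %.0f %.0f\n",
               one_hot[i * 3], one_hot[i * 3 + 1], one_hot[i * 3 + 2]);
    }
    return 0;
}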

Loss in Training Loop

The loss function integrates into the training loop at a specific point in the forward-backward cycle. Understanding this workflow ensures correct implementation.

Training Loop Structure

A typical training iteration follows this pattern:

1. Zero gradients       (clear previous gradients)
2. Forward pass         (compute predictions)
3. Compute loss         (evaluate predictions)
4. Backward pass        (compute gradients via backpropagation)
5. Optimizer step       (update parameters)
6. Repeat

Loss computation happens after the forward pass and before the backward pass. It's the bridge connecting prediction to optimization.

Complete Training Loop Example

Here's a full training loop showing loss integration:

// Setup: Create graph, parameters, and optimizer
tofu_graph *g = tofu_graph_create();

tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);

tofu_graph_node *W_node = tofu_graph_param(g, weights);
tofu_graph_node *b_node = tofu_graph_param(g, bias);

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop
const int NUM_EPOCHS = 100;
const int BATCH_SIZE = 32;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float epoch_loss = 0.0f;
    int num_batches = 0;

    // num_batches_in_dataset, load_batch_input(), and load_batch_target() are
    // placeholders for your own dataset and batching code
    for (int batch = 0; batch < num_batches_in_dataset; batch++) {
        // 1. Zero gradients
        tofu_optimizer_zero_grad(opt);

        // 2. Forward pass
        // Load batch data
        tofu_tensor *batch_x = load_batch_input(batch);
        tofu_tensor *batch_y = load_batch_target(batch);

        tofu_graph_node *x = tofu_graph_input(g, batch_x);
        tofu_graph_node *y_true = tofu_graph_input(g, batch_y);

        // Model forward pass
        tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
        tofu_graph_node *pred = tofu_graph_add(g, h, b_node);

        // 3. Compute loss
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, y_true);

        // Extract loss value for monitoring
        tofu_tensor *loss_tensor = tofu_graph_get_value(loss);
        float loss_value;
        TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_value, TOFU_FLOAT);
        epoch_loss += loss_value;

        // 4. Backward pass
        tofu_graph_backward(g, loss);

        // 5. Optimizer step
        tofu_optimizer_step(opt);

        // Cleanup batch resources
        tofu_tensor_free(batch_x);
        tofu_tensor_free(batch_y);

        // Clear graph operations (keeps parameters)
        tofu_graph_clear_ops(g);

        num_batches++;
    }

    // Report progress
    float avg_loss = epoch_loss / num_batches;
    printf("Epoch %d: loss = %.6f\n", epoch, avg_loss);
}

Monitoring Loss During Training

Track loss values across epochs to monitor training progress:

// Loss tracking
float loss_history[NUM_EPOCHS];

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    // ... training code ...

    loss_history[epoch] = avg_loss;

    // Print every 10 epochs
    if (epoch % 10 == 0) {
        printf("Epoch %3d: loss = %.6f\n", epoch, avg_loss);
    }
}

Loss Curves

Visualizing loss over time reveals training behavior:

Healthy training curve:

Loss
│
2.0 ┤●
    │ ●
1.5 ┤  ●
    │   ●●
1.0 ┤     ●●
    │       ●●●
0.5 ┤          ●●●●●●●●●●●
    └──────────────────────────> Epoch

Characteristics:

  • Smooth decrease
  • Eventually plateaus
  • No wild fluctuations

Warning signs:

Loss increasing:
    │    ●●●
    │  ●●
    │●●
    └─────> Learning rate too high

Loss plateauing early:
    │●●●●●●●●●●●●●●
    │
    └─────> Model too simple or stuck

Loss oscillating:
    │ ● ● ● ● ●
    │● ● ● ● ● ●
    └─────> Batch size too small or LR too high

Understanding Loss Values

Interpreting loss values correctly helps diagnose training issues and assess model quality.

Absolute Loss Magnitude

Loss value interpretation depends heavily on context:

For MSE:

  • Scale depends on target value range
  • MSE = 100 is terrible for normalized data [0, 1]
  • MSE = 100 might be excellent for house prices in thousands
  • Always consider: What's the typical magnitude of your targets?

For Cross-Entropy:

  • Random guessing baseline: log(num_classes)
  • Binary classification random: 0.693
  • 10-class random: 2.303
  • Perfect predictions: 0.0

Rule of thumb: Compare loss to a baseline (random guessing or simple heuristic) to assess improvement.
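
A quick plain-C sketch for computing these baselines (the MSE baseline of "always predict the mean" equals the variance of the targets):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Cross-entropy baseline: random guessing over K classes. */
    int num_classes = 10;
    float ce_baseline = logf((float)num_classes);   /* ~2.303 for 10 classes */

    /* MSE baseline: always predicting the mean gives loss = variance. */
    float targets[] = {150.0f, 200.0f, 250.0f, 300.0f};
    int n = 4;
    float mean = 0.0f, var = 0.0f;
    for (int i = 0; i < n; i++) mean += targets[i];
    mean /= (float)n;
    for (int i = 0; i < n; i++) var += (targets[i] - mean) * (targets[i] - mean);
    var /= (float)n;

    printf("CE baseline: %.3f, MSE baseline: %.1f\n", ce_baseline, var);
    return 0;
}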

Relative Changes Matter More

Focus on loss trends rather than absolute values:

// Good trend (decreasing)
Epoch 0:   loss = 1.50
Epoch 10:  loss = 1.20  (20% reduction)
Epoch 20:  loss = 0.85  (29% reduction)
Epoch 50:  loss = 0.45  (47% reduction)

// Bad trend (increasing)
Epoch 0:   loss = 1.50
Epoch 10:  loss = 1.65  (increasing - problem!)

Common Loss Troubleshooting

Problem: Loss is NaN or infinite

Causes:

  • Learning rate too high (exploding gradients)
  • Numerical overflow in loss computation
  • Invalid data (NaN in input)

Fixes:

// 1. Reduce learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.001);  // Was 0.1

// 2. Check for NaN in data
for (int i = 0; i < tensor->len; i++) {
    float val;
    TOFU_TENSOR_DATA_TO(tensor, i, val, TOFU_FLOAT);
    if (isnan(val) || isinf(val)) {
        fprintf(stderr, "Invalid data at index %d\n", i);
    }
}

// 3. Add gradient clipping (manual)
tofu_tensor *grad = tofu_graph_get_grad(param_node);
// Clip gradients to [-max_grad, max_grad]
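
A manual clipping sketch continuing from the grad tensor above (this assumes grad->len and the TOFU_TENSOR_DATA_TO / TOFU_TENSOR_DATA_FROM macros behave as they do in the other examples in this guide; adapt it to your actual gradient access pattern):

const float max_grad = 1.0f;
for (int i = 0; i < grad->len; i++) {
    float gval;
    TOFU_TENSOR_DATA_TO(grad, i, gval, TOFU_FLOAT);    // read element i
    if (gval >  max_grad) gval =  max_grad;            // clip upper bound
    if (gval < -max_grad) gval = -max_grad;            // clip lower bound
    TOFU_TENSOR_DATA_FROM(grad, i, gval, TOFU_FLOAT);  // write it back
}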

Problem: Loss doesn't decrease

Causes:

  • Learning rate too low
  • Model too simple (can't fit data)
  • Weights initialized poorly
  • Wrong loss function for task

Fixes:

// 1. Increase learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1);  // Was 0.001

// 2. Add hidden layers (increase model capacity)
tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
    tofu_graph_matmul(g, x, W1), b1));
tofu_graph_node *h2 = tofu_graph_relu(g, tofu_graph_add(g,
    tofu_graph_matmul(g, h1, W2), b2));

// 3. Check you're using the right loss function
// Classification → Cross-Entropy, Regression → MSE

Problem: Loss plateaus too early

Causes:

  • Model capacity too small
  • Learning rate needs adjustment
  • Reached local minimum
  • Need more training time

Fixes:

// 1. Train longer
const int NUM_EPOCHS = 500;  // Was 100

// 2. Add capacity
// Increase hidden layer size or add more layers

// 3. Try learning rate schedule
float lr = (epoch < 50) ? 0.1 : 0.01;  // Reduce LR after 50 epochs

Problem: Loss oscillates wildly

Causes:

  • Learning rate too high
  • Batch size too small
  • Numerical instability

Fixes:

// 1. Reduce learning rate
lr = 0.001;  // Was 0.1

// 2. Increase batch size
BATCH_SIZE = 64;  // Was 16

// 3. Add momentum (helps smooth updates)
tofu_optimizer *opt = tofu_optimizer_adam_create(g, 0.001);

Comparing Train vs Validation Loss

Always monitor loss on held-out validation data:

// Training loop with validation
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    // Training
    float train_loss = train_one_epoch(g, opt, train_data);

    // Validation (no gradient updates)
    float val_loss = evaluate_loss(g, val_data);

    printf("Epoch %d: train_loss=%.4f, val_loss=%.4f\n",
           epoch, train_loss, val_loss);

    // Check for overfitting
    if (val_loss > train_loss * 1.5) {
        printf("Warning: Model may be overfitting\n");
    }
}

Healthy pattern:

Train loss: 0.50, Val loss: 0.55  (close - good generalization)

Overfitting:

Train loss: 0.10, Val loss: 0.80  (gap too large - overfitting)

Advanced Topics

Beyond basic loss functions, advanced techniques can improve training stability and model performance.

Loss Weighting

Sometimes you want to emphasize certain samples or classes. Loss weighting adjusts the contribution of individual samples.

Class weighting for imbalanced data:

If you have 90% negative samples and 10% positive samples in binary classification, the model may ignore the minority class. Weight the minority class higher:

// Manually weight loss by class
// Assume we have per-sample weights
float class_weights[2] = {1.0f, 9.0f};  // Weight minority class 9x

// Compute weighted loss manually
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss_unweighted = tofu_graph_ce_loss(g, probs, target);

// Get loss and multiply by weights
tofu_tensor *loss_tensor = tofu_graph_get_value(loss_unweighted);
float loss_val;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);

// Weight by class (simplified - production code would weight per-sample)
int predicted_class = /* determine class */;
float weighted_loss = loss_val * class_weights[predicted_class];

Note: Tofu doesn't have built-in weighted loss functions yet, so implement weighting manually or at the sample level.

Regularization Loss Terms

Regularization adds a penalty term to prevent overfitting:

Total Loss = Task Loss + λ * Regularization Term

Where λ controls regularization strength

L2 Regularization (Weight Decay):

Penalize large weights to prevent overfitting:

// Compute L2 regularization manually
float l2_penalty = 0.0f;
const float lambda = 0.01f;

tofu_tensor *W = param_tensor;
for (int i = 0; i < W->len; i++) {
    float w;
    TOFU_TENSOR_DATA_TO(W, i, w, TOFU_FLOAT);
    l2_penalty += w * w;
}
l2_penalty *= lambda;

// Add to loss
float total_loss = task_loss + l2_penalty;

Note: Most optimizers (like Adam) have built-in weight decay support, which is more efficient than manual regularization.

Custom Loss Functions

For specialized tasks, you may need custom losses. Implement them by:

  1. Computing the loss value using tensor operations
  2. Implementing the backward pass (gradient computation)

Example: Huber loss (robust to outliers)

// Huber loss: Combines MSE (small errors) with MAE (large errors)
// Loss = 0.5 * (pred - target)^2        if |error| < delta
//      = delta * (|error| - 0.5*delta)  otherwise

// This requires implementing a custom graph operation
// (beyond basic usage - see advanced tutorials)

For most use cases, MSE and cross-entropy are sufficient. Custom losses require deeper knowledge of Tofu's backward pass implementation.
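
For reference, a plain-C sketch of the Huber forward value only (the loss number, not the graph operation or its gradient) might look like this:

#include <math.h>
#include <stdio.h>

/* Huber loss: quadratic for small errors, linear for large ones. */
static float huber_loss(const float *pred, const float *target, int n, float delta) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float err = fabsf(pred[i] - target[i]);
        if (err < delta)
            sum += 0.5f * err * err;                /* MSE-like region */
        else
            sum += delta * (err - 0.5f * delta);    /* MAE-like region */
    }
    return sum / (float)n;
}

int main(void) {
    float pred[]   = {1.0f, 10.0f};
    float target[] = {1.2f,  2.0f};   /* second sample is an outlier */
    printf("Huber = %.3f\n", huber_loss(pred, target, 2, 1.0f));
    return 0;
}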

Multi-Task Learning

Train one model for multiple tasks by combining losses:

// Example: Predict both class and bounding box
tofu_graph_node *class_logits = /* classification head */;
tofu_graph_node *bbox_pred = /* regression head */;

// Classification loss
tofu_graph_node *class_probs = tofu_graph_softmax(g, class_logits, 1);
tofu_graph_node *class_loss = tofu_graph_ce_loss(g, class_probs, class_target);

// Bounding box regression loss
tofu_graph_node *bbox_loss = tofu_graph_mse_loss(g, bbox_pred, bbox_target);

// Combined loss (weighted sum)
// Note: Must be done manually as Tofu doesn't support loss addition yet
float class_loss_val = extract_scalar(class_loss);
float bbox_loss_val = extract_scalar(bbox_loss);
float total_loss = class_loss_val + 0.5 * bbox_loss_val;  // Weight bbox 50%

Complete Examples

Let's walk through two complete, practical examples: regression and classification.

Example 1: Regression - House Price Prediction

Goal: Predict house prices from square footage and number of bedrooms.

#include <stdio.h>
#include <stdlib.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"

int main() {
    // Dataset: 4 houses
    float features[] = {
        1000.0f, 2.0f,  // 1000 sqft, 2 bedrooms → $150k
        1500.0f, 3.0f,  // 1500 sqft, 3 bedrooms → $200k
        2000.0f, 3.0f,  // 2000 sqft, 3 bedrooms → $250k
        2500.0f, 4.0f   // 2500 sqft, 4 bedrooms → $300k
    };

    float prices[] = {150.0f, 200.0f, 250.0f, 300.0f};

    // Create tensors
    tofu_tensor *X = tofu_tensor_create(features, 2, (int[]){4, 2}, TOFU_FLOAT);
    tofu_tensor *y = tofu_tensor_create(prices, 2, (int[]){4, 1}, TOFU_FLOAT);

    // Model parameters (linear regression: y = X @ W + b)
    tofu_tensor *W = tofu_tensor_zeros(2, (int[]){2, 1}, TOFU_FLOAT);
    tofu_tensor *b = tofu_tensor_zeros(1, (int[]){1}, TOFU_FLOAT);

    // Create graph and optimizer
    tofu_graph *g = tofu_graph_create();
    tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.0001);  // Small LR

    // Training loop
    for (int epoch = 0; epoch < 1000; epoch++) {
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x_node = tofu_graph_input(g, X);
        tofu_graph_node *W_node = tofu_graph_param(g, W);
        tofu_graph_node *b_node = tofu_graph_param(g, b);
        tofu_graph_node *y_node = tofu_graph_input(g, y);

        tofu_graph_node *pred = tofu_graph_add(g,
            tofu_graph_matmul(g, x_node, W_node), b_node);

        // MSE loss
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, y_node);

        // Extract loss value
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);

        // Backward and optimize
        tofu_graph_backward(g, loss);
        tofu_optimizer_step(opt);

        if (epoch % 100 == 0) {
            printf("Epoch %4d: MSE = %.2f\n", epoch, loss_val);
        }

        tofu_graph_clear_ops(g);
    }

    // Final predictions
    printf("\nFinal predictions:\n");
    tofu_graph_node *x_node = tofu_graph_input(g, X);
    tofu_graph_node *W_node = tofu_graph_param(g, W);
    tofu_graph_node *b_node = tofu_graph_param(g, b);
    tofu_graph_node *pred = tofu_graph_add(g,
        tofu_graph_matmul(g, x_node, W_node), b_node);

    tofu_tensor *predictions = tofu_graph_get_value(pred);
    for (int i = 0; i < 4; i++) {
        float pred_price;
        TOFU_TENSOR_DATA_TO(predictions, i, pred_price, TOFU_FLOAT);
        printf("House %d: Predicted=%.1f, Actual=%.1f\n",
               i, pred_price, prices[i]);
    }

    // Cleanup
    tofu_optimizer_free(opt);
    tofu_graph_free(g);
    tofu_tensor_free(X);
    tofu_tensor_free(y);
    tofu_tensor_free_data_too(W);
    tofu_tensor_free_data_too(b);

    return 0;
}

Expected output:

Epoch    0: MSE = 42500.00
Epoch  100: MSE = 1250.50
Epoch  200: MSE = 523.75
Epoch  900: MSE = 12.30

Final predictions:
House 0: Predicted=148.5, Actual=150.0
House 1: Predicted=201.2, Actual=200.0
House 2: Predicted=251.8, Actual=250.0
House 3: Predicted=298.7, Actual=300.0

Example 2: Classification - XOR Problem

Goal: Learn the XOR function (classic non-linear classification).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"

// Xavier initialization
float xavier_init(int fan_in) {
    float limit = sqrtf(6.0f / fan_in);
    return limit * (2.0f * rand() / RAND_MAX - 1.0f);
}

int main() {
    // XOR dataset
    float inputs[] = {
        0, 0,  // → 0
        0, 1,  // → 1
        1, 0,  // → 1
        1, 1   // → 0
    };

    // One-hot targets (2 classes: [1,0] = class 0, [0,1] = class 1)
    float targets[] = {
        1, 0,  // XOR(0,0) = 0
        0, 1,  // XOR(0,1) = 1
        0, 1,  // XOR(1,0) = 1
        1, 0   // XOR(1,1) = 0
    };

    // Create tensors
    tofu_tensor *X = tofu_tensor_create(inputs, 2, (int[]){4, 2}, TOFU_FLOAT);
    tofu_tensor *y = tofu_tensor_create(targets, 2, (int[]){4, 2}, TOFU_FLOAT);

    // Model: 2 → 4 → 2 (need hidden layer for non-linearity)
    tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){2, 4}, TOFU_FLOAT);
    tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
    tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){4, 2}, TOFU_FLOAT);
    tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){2}, TOFU_FLOAT);

    // Xavier initialization for W1, W2
    for (int i = 0; i < W1->len; i++) {
        float val = xavier_init(2);
        TOFU_TENSOR_DATA_FROM(W1, i, val, TOFU_FLOAT);
    }
    for (int i = 0; i < W2->len; i++) {
        float val = xavier_init(4);
        TOFU_TENSOR_DATA_FROM(W2, i, val, TOFU_FLOAT);
    }

    // Create graph and optimizer
    tofu_graph *g = tofu_graph_create();
    tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.5);  // Higher LR

    // Training loop
    for (int epoch = 0; epoch < 2000; epoch++) {
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x = tofu_graph_input(g, X);
        tofu_graph_node *w1 = tofu_graph_param(g, W1);
        tofu_graph_node *b1_node = tofu_graph_param(g, b1);
        tofu_graph_node *w2 = tofu_graph_param(g, W2);
        tofu_graph_node *b2_node = tofu_graph_param(g, b2);
        tofu_graph_node *y_node = tofu_graph_input(g, y);

        // Layer 1: x @ W1 + b1 → ReLU
        tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
            tofu_graph_matmul(g, x, w1), b1_node));

        // Layer 2: h1 @ W2 + b2 → softmax
        tofu_graph_node *logits = tofu_graph_add(g,
            tofu_graph_matmul(g, h1, w2), b2_node);
        tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

        // Cross-entropy loss
        tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, y_node);

        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);

        // Backward and optimize
        tofu_graph_backward(g, loss);
        tofu_optimizer_step(opt);

        if (epoch % 200 == 0) {
            printf("Epoch %4d: CE Loss = %.4f\n", epoch, loss_val);
        }

        tofu_graph_clear_ops(g);
    }

    // Test predictions
    printf("\nFinal predictions:\n");
    tofu_graph_node *x = tofu_graph_input(g, X);
    tofu_graph_node *w1 = tofu_graph_param(g, W1);
    tofu_graph_node *b1_node = tofu_graph_param(g, b1);
    tofu_graph_node *w2 = tofu_graph_param(g, W2);
    tofu_graph_node *b2_node = tofu_graph_param(g, b2);

    tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
        tofu_graph_matmul(g, x, w1), b1_node));
    tofu_graph_node *logits = tofu_graph_add(g,
        tofu_graph_matmul(g, h1, w2), b2_node);
    tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

    tofu_tensor *predictions = tofu_graph_get_value(probs);

    for (int i = 0; i < 4; i++) {
        float prob0, prob1;
        TOFU_TENSOR_DATA_TO(predictions, i*2, prob0, TOFU_FLOAT);
        TOFU_TENSOR_DATA_TO(predictions, i*2+1, prob1, TOFU_FLOAT);
        int pred_class = (prob1 > prob0) ? 1 : 0;
        int true_class = (targets[i*2+1] > 0.5f) ? 1 : 0;

        printf("[%.0f, %.0f] → Pred=%d (%.3f, %.3f), True=%d\n",
               inputs[i*2], inputs[i*2+1],
               pred_class, prob0, prob1, true_class);
    }

    // Cleanup
    tofu_optimizer_free(opt);
    tofu_graph_free(g);
    tofu_tensor_free(X);
    tofu_tensor_free(y);
    tofu_tensor_free_data_too(W1);
    tofu_tensor_free_data_too(b1);
    tofu_tensor_free_data_too(W2);
    tofu_tensor_free_data_too(b2);

    return 0;
}

Expected output:

Epoch    0: CE Loss = 0.7120
Epoch  200: CE Loss = 0.4532
Epoch  400: CE Loss = 0.2145
Epoch 1800: CE Loss = 0.0523

Final predictions:
[0, 0] → Pred=0 (0.972, 0.028), True=0
[0, 1] → Pred=1 (0.045, 0.955), True=1
[1, 0] → Pred=1 (0.039, 0.961), True=1
[1, 1] → Pred=0 (0.968, 0.032), True=0

Best Practices

Follow these guidelines for effective loss function usage:

1. Match Loss to Task Type

Always use:

  • MSE for regression
  • Cross-Entropy for classification

Never mix them: Using the wrong loss leads to poor convergence and incorrect learning.

2. Monitor Loss During Training

// Log loss to file or console
FILE *log = fopen("training_log.txt", "w");
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float loss = train_epoch(...);
    fprintf(log, "%d,%.6f\n", epoch, loss);
    if (epoch % 10 == 0) {
        printf("Epoch %d: loss=%.6f\n", epoch, loss);
    }
}
fclose(log);

Track trends, not just final values. Loss should decrease smoothly over time.

3. Use Appropriate Learning Rates

Loss behavior reveals learning rate issues:

// Too high: Loss explodes or oscillates wildly
// Solution: Reduce by 10x
lr = 0.01;  // Was 0.1

// Too low: Loss barely decreases
// Solution: Increase by 10x
lr = 0.1;  // Was 0.01

4. Normalize Your Data

Large input/output ranges cause numerical instability:

// Bad: Raw house prices ($100k - $500k)
float price = 250000.0f;

// Good: Normalized to reasonable range
float price_normalized = (250000.0f - mean) / std_dev;
// or
float price_scaled = 250000.0f / 1000.0f;  // Scale to [100-500]

Normalization prevents exploding gradients and improves convergence.
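
A small plain-C helper computing the mean and standard deviation used above (a sketch; in practice, compute the statistics on the training set only and reuse them for validation and test data):

#include <math.h>
#include <stdio.h>

/* Standardize values in place: x <- (x - mean) / std_dev. */
static void standardize(float *x, int n) {
    float mean = 0.0f, var = 0.0f;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= (float)n;
    for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    float std_dev = sqrtf(var / (float)n) + 1e-8f;   /* avoid divide-by-zero */
    for (int i = 0; i < n; i++) x[i] = (x[i] - mean) / std_dev;
}

int main(void) {
    float prices[] = {150000.0f, 250000.0f, 350000.0f, 450000.0f};
    standardize(prices, 4);
    printf("%.2f %.2f %.2f %.2f\n", prices[0], prices[1], prices[2], prices[3]);
    return 0;
}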

5. Check for Numerical Issues

// Add checks during training
if (isnan(loss_val) || isinf(loss_val)) {
    fprintf(stderr, "ERROR: Loss is %f at epoch %d\n", loss_val, epoch);
    // Reduce learning rate or check data
    break;
}

6. Compare to Baselines

Always establish a baseline before training:

// Baseline 1: Random predictions
// For classification: loss ≈ log(num_classes)
// For regression: loss ≈ variance of targets

// Baseline 2: Simple heuristic
// Classification: Always predict most common class
// Regression: Always predict mean target value

printf("Random baseline loss: %.4f\n", baseline_loss);
printf("Trained model loss: %.4f\n", final_loss);
printf("Improvement: %.1f%%\n",
       100.0 * (baseline_loss - final_loss) / baseline_loss);

7. Use Validation Data

Never trust training loss alone:

// Split data: 80% train, 20% validation
float train_loss = evaluate_loss(g, train_data);
float val_loss = evaluate_loss(g, val_data);

if (val_loss > train_loss * 1.5) {
    printf("Warning: Possible overfitting\n");
}

8. Save Best Model Based on Validation Loss

float best_val_loss = INFINITY;
tofu_tensor *best_W = NULL;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    train_epoch(...);
    float val_loss = evaluate_validation(...);

    if (val_loss < best_val_loss) {
        best_val_loss = val_loss;
        // Save model parameters
        if (best_W) tofu_tensor_free_data_too(best_W);
        best_W = tofu_tensor_clone(W);
        printf("New best model at epoch %d: val_loss=%.4f\n",
               epoch, val_loss);
    }
}

9. Early Stopping

Stop training when validation loss stops improving:

int patience = 20;  // Wait 20 epochs for improvement
int no_improve_count = 0;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float val_loss = train_and_validate(...);

    if (val_loss < best_val_loss) {
        best_val_loss = val_loss;
        no_improve_count = 0;
    } else {
        no_improve_count++;
    }

    if (no_improve_count >= patience) {
        printf("Early stopping at epoch %d\n", epoch);
        break;
    }
}

10. Document Your Loss Function Choice

// At the top of your training code, document decisions:
/*
 * Model: Image classifier for 10 classes
 * Loss: Cross-Entropy (classification task)
 * Architecture: Input → 128 → 64 → 10 (softmax)
 * Optimizer: SGD with lr=0.01
 * Expected loss: Random ~2.3, Target <0.5
 */

This helps future debugging and maintains clear expectations.


Summary

Loss functions are the foundation of neural network training. Key takeaways:

  1. Match loss to task:

    • Regression → MSE
    • Classification → Cross-Entropy (with softmax)
  2. Loss must be:

    • Scalar (single number)
    • Differentiable (smooth gradients)
    • Representative of task objective
  3. Monitor loss trends:

    • Decreasing = learning
    • Plateauing = convergence or stuck
    • Increasing = problem (LR too high, numerical issues)
  4. Interpret loss in context:

    • Compare to baselines (random guessing)
    • Track validation loss (detect overfitting)
    • Understand scale (depends on data range)
  5. Debug with loss values:

    • NaN/Inf → Check learning rate, data validity
    • No decrease → Increase LR or model capacity
    • Oscillation → Reduce LR or increase batch size

With proper loss function selection and monitoring, you'll train neural networks that converge reliably and achieve strong performance on your task.

For more details on loss function implementation and gradients, see the Graph API Reference. For optimizer integration, see the Optimizer User Guide.