Introduction

Welcome to the Tofu User Guide.

Installation

Learn how to build and install Tofu on Linux, macOS, and ESP32 microcontrollers. This guide covers everything from downloading the source code to verifying your installation works correctly.

In this guide you'll learn:

  • Prerequisites and system requirements for each platform
  • How to configure and build Tofu from source
  • How to run tests to verify your build
  • How to install Tofu to your system
  • How to cross-compile for ESP32 embedded systems

Prerequisites:

  • A C compiler (GCC 7+ or Clang 10+)
  • GNU Make build system
  • Git version control
  • pkg-config build helper tool

Estimated time: 10 minutes for basic installation, 15 minutes including tests


Quick Installation

If you're already familiar with building C projects, here's the fastest path to get Tofu running:

# Clone the repository
cd ~/projects  # or your preferred workspace
git clone https://github.com/c2akula/tofu.git
cd tofu

# Configure and build
chmod +x configure
./configure
make

# Optional: Run tests to verify the build
make test

# Install to your system
sudo make install

# Verify installation
pkg-config --modversion tofu

After installation, you can include Tofu in your projects with:

gcc -o myprogram myprogram.c $(pkg-config --cflags --libs tofu)

Detailed Installation - Linux

This section covers step-by-step installation on Ubuntu 20.04 and similar Linux distributions.

System Requirements

Tofu requires the following packages on Linux:

  • Build tools: GCC (7+), GNU Make, GNU Autotools utilities
  • Development headers: Standard C library headers
  • Package management: pkg-config for dependency tracking
  • Testing (optional): Check unit testing framework

Installing Dependencies

On Ubuntu and Debian-based distributions:

sudo apt-get update
sudo apt-get install build-essential perl git pkg-config check

This installs:

  • gcc and g++: C/C++ compilers
  • make: Build automation
  • perl: Required by the configure script
  • git: Version control
  • pkg-config: Package configuration utility
  • check: Unit testing framework (optional but recommended)

Cloning the Repository

cd ~/projects  # Choose your workspace location
git clone https://github.com/c2akula/tofu.git
cd tofu

The repository is approximately 5 MB and includes:

  • Source code in src/ directory
  • Tests in test/ directory
  • Examples in examples/ directory
  • Documentation in docs/ directory

Configuring the Build

Before building, configure the build system:

chmod +x configure
./configure

You'll see output like:

configure tofu version 1.0.0
Checking autotools...
Checking pkg-config...
Checking C compiler...
Checking math library...
Generated config.mk

For custom installation directories, use:

./configure --install-dir=/opt/tofu
./configure --prefix=$HOME/.local

To see all available options:

./configure -h

Available options include:

  • --build-dir=DIR: Where to place build artifacts (default: build)
  • --install-dir=DIR: Where to install Tofu (default: /usr/local)
  • --debug=yes: Include debug symbols for development
  • --esp32=yes: Configure for ESP32 cross-compilation

Building the Library

After configuration, compile Tofu:

make lib

This builds both static and shared libraries. Expected output:

Compiling src/tofu_tensor.c
Compiling src/tofu_graph.c
Compiling src/tofu_optimizer.c
...
Linking build/lib/libtofu.so.1.0.0
Created: build/lib/libtofu.a
Created: build/lib/libtofu.so -> libtofu.so.1.0.0

Build artifacts are placed in the build/ directory:

  • build/lib/libtofu.a: Static library
  • build/lib/libtofu.so.1.0.0: Shared library (versioned)
  • build/include/tofu/: Public header files

Running Tests

Verify your build with the comprehensive test suite:

make test

This builds and runs all tests. Expected output:

Compiling test/test_tensor.c
Compiling test/test_graph.c
Compiling test/test_optimizer.c
...
Running tests...
[====] Completed: 13 test(s), passed: 13, failed: 0

All tests should pass (66 checks across 5 test suites). If any fail:

  1. Check that all dependencies are installed
  2. Run make clean and retry
  3. Report the issue with your system details

To run individual tests:

cd test
../build/test/test_tofu tensor_creation

Installing to Your System

Once tests pass, install Tofu to your system:

sudo make install

This installs:

  • Headers to /usr/local/include/tofu/
  • Static library to /usr/local/lib/libtofu.a
  • Shared library to /usr/local/lib/libtofu.so.1.0.0
  • Package config file to /usr/local/lib/pkgconfig/tofu.pc

If you configured with a custom prefix:

# Install to home directory (no sudo needed)
./configure --prefix=$HOME/.local
make
make install

Verifying Installation

Check that Tofu is installed correctly:

pkg-config --modversion tofu

Expected output:

1.0.0

Get compilation and linking flags:

pkg-config --cflags --libs tofu

Expected output:

-I/usr/local/include/tofu -L/usr/local/lib -ltofu

Detailed Installation - macOS

This section covers installation on macOS 13 and newer.

System Requirements

Tofu requires:

  • Xcode Command Line Tools: Apple's development toolchain with GCC/Clang
  • Homebrew (optional): Package manager for additional tools
  • pkg-config: For dependency management

Installing Xcode Command Line Tools

First, install the Xcode Command Line Tools:

xcode-select --install

A dialog will appear. Click Install and wait for completion (5-10 minutes).

Verify installation:

gcc --version
make --version

Expected output shows Apple Clang (which the gcc command invokes on macOS) version 14 or later.

Installing Additional Dependencies

Using Homebrew to install the check testing framework:

brew install pkg-config check

If Homebrew isn't installed, see https://brew.sh

Cloning and Building

The build process is identical to Linux:

cd ~/Projects  # macOS convention
git clone https://github.com/c2akula/tofu.git
cd tofu

chmod +x configure
./configure
make lib
make test

Installing to Your System

Install to /usr/local (default) or your home directory:

# System-wide installation
sudo make install

# Or to home directory (no sudo needed)
./configure --prefix=$HOME/.local
make
make install

macOS-Specific Notes

  • M1/M2 Arm chips: Tofu builds natively on Apple Silicon. No special configuration needed.
  • Intel Macs: Fully supported. Tofu uses standard C that compiles on both architectures.
  • Shared library permissions: You may need to adjust Gatekeeper policies if running compiled binaries. See Apple's code signing documentation.
  • Homebrew path: If using Homebrew, you may need export PATH="/opt/homebrew/bin:$PATH" on M1/M2 Macs.

Cross-Compilation - ESP32

Build Tofu for embedded deployment on ESP32 microcontrollers.

Why ESP32?

ESP32 is a popular microcontroller combining:

  • WiFi and Bluetooth connectivity
  • 240 MHz dual-core processor
  • 520 KB RAM
  • Capable of running lightweight neural networks for inference

Tofu's lightweight C implementation makes it ideal for ESP32 deployment.

Prerequisites

You need the ESP-IDF (Espressif IoT Development Framework) toolchain:

  1. Install ESP-IDF following the official guide: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/get-started/index.html

  2. Verify toolchain is available:

    which xtensa-esp32-elf-gcc
    

    Should show the path to the toolchain. If not, add to your shell profile:

    export PATH="$PATH:/path/to/esp-idf/tools/xtensa-esp32-elf/bin"
    
  3. Verify the toolchain works:

    xtensa-esp32-elf-gcc --version
    

Configuring for ESP32

Clone Tofu and configure for cross-compilation:

git clone https://github.com/c2akula/tofu.git
cd tofu

chmod +x configure
./configure --esp32=yes

If your ESP32 toolchain is not in your PATH:

./configure --esp32=yes --esp32-toolchain-dir=/opt/esp-idf/tools/xtensa-esp32-elf

The configure output will show:

configure tofu version 1.0.0
Configuring for ESP32 cross-compilation
Using toolchain: /path/to/xtensa-esp32-elf
Generated config.mk

Building for ESP32

Build the library for ESP32:

make lib

This creates:

  • build/lib/libtofu.a: Static library for ESP32

Note: Tests are skipped during ESP32 cross-compilation:

Skipping tests when cross-compiling for ESP32

This is expected. Tests require POSIX system calls not available on ESP32.

Using Tofu on ESP32

Once built, link the static library into your ESP32 project:

  1. Copy the built library:

    cp build/lib/libtofu.a /path/to/your/esp32-project/components/tofu/lib/
    cp -r build/include/tofu/* /path/to/your/esp32-project/components/tofu/include/
    
  2. In your ESP32 project's CMakeLists.txt:

    set(COMPONENT_LIBS tofu m c)  # Link tofu library
    
  3. Include headers in your code:

    #include "tofu_tensor.h"
    #include "tofu_graph.h"
    
  4. Build and deploy:

    idf.py build
    idf.py flash
    

Verification

After installation, verify everything works by running a simple test program.

Using pkg-config

Verify Tofu is discoverable by pkg-config:

pkg-config --list-all | grep tofu

Should show:

tofu - A light-weight neural network compiler for different software/hardware backends.

Using pkg-config in Your Build

When building applications that use Tofu:

gcc -o myapp myapp.c $(pkg-config --cflags --libs tofu)

This automatically includes:

  • Correct compiler flags: -I/usr/local/include/tofu
  • Linker flags: -L/usr/local/lib -ltofu

Example Verification Program

Create a simple C program to test your installation:

// test_tofu.c
#include "tofu_tensor.h"
#include <stdio.h>

int main() {
    // Create a simple tensor [2, 3]
    float data[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    int dims[] = {2, 3};
    tofu_tensor *t = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);

    if (t == NULL) {
        printf("Failed to create tensor\n");
        return 1;
    }

    printf("Tensor created successfully\n");
    printf("Shape: [%d, %d]\n", t->dims[0], t->dims[1]);
    printf("Size: %d elements\n", t->len);

    tofu_tensor_free(t);
    printf("Tensor freed successfully\n");
    return 0;
}

Compile and run:

gcc -o test_tofu test_tofu.c $(pkg-config --cflags --libs tofu)
./test_tofu

Expected output:

Tensor created successfully
Shape: [2, 3]
Size: 6 elements
Tensor freed successfully

Troubleshooting Common Issues

Issue: pkg-config: command not found

  • Solution: Install pkg-config (sudo apt-get install pkg-config on Linux, brew install pkg-config on macOS)

Issue: fatal error: tofu_tensor.h: No such file or directory

  • Solution: Run make install to install headers. Or use the pkg-config output in your compilation flags.

Issue: undefined reference to 'tofu_tensor_create'

  • Solution: Ensure you're using $(pkg-config --libs tofu) in your linker flags

Issue: Tests fail with "check not found"

  • Solution: Install check (sudo apt-get install check on Linux, brew install check on macOS)

Issue: Configure fails with unknown compiler

  • Solution: Ensure GCC/Clang is installed and in your PATH. Run gcc --version to verify.

Issue: Build fails on macOS with permission errors

  • Solution: Try sudo make install or configure with a home directory prefix: ./configure --prefix=$HOME/.local

Uninstallation

To remove Tofu from your system:

sudo make uninstall

This removes all installed files, headers, and pkg-config configuration.

If you installed to a home directory:

make uninstall

Next Steps

Now that Tofu is installed, continue with the Quick Start guide and the tutorials that follow.

Happy building!

Quick Start

Get started with Tofu in 5 minutes.

What You'll Build

In this quick-start guide, you'll create your first Tofu program that performs a basic tensor operation: matrix multiplication. This will introduce you to the fundamentals of working with tensors in Tofu.

Prerequisites

You should have Tofu installed and built on your system. If not, please see the Installation Guide first.

Your First Program

Create a file called first.c with the following code:

#include <stdio.h>
#include "tofu/tofu_tensor.h"

int main() {
    // Create two matrices: a (2x3) and b (3x2)
    int dims_a[] = {2, 3};
    tofu_tensor *a = tofu_tensor_zeros(2, dims_a, TOFU_FLOAT);

    int dims_b[] = {3, 2};
    tofu_tensor *b = tofu_tensor_zeros(2, dims_b, TOFU_FLOAT);

    // Initialize matrix a with values [1, 2, 3, 4, 5, 6]
    float vals_a[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
    for (int i = 0; i < 6; i++) {
        TOFU_TENSOR_DATA_FROM(a, i, vals_a[i], TOFU_FLOAT);
    }

    // Initialize matrix b with values [1, 2, 3, 4, 5, 6]
    float vals_b[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
    for (int i = 0; i < 6; i++) {
        TOFU_TENSOR_DATA_FROM(b, i, vals_b[i], TOFU_FLOAT);
    }

    // Perform matrix multiplication: result = a @ b
    tofu_tensor *result = tofu_tensor_matmul(a, b, NULL);

    // Print the result to stdout
    printf("Result of a @ b:\n");
    tofu_tensor_print(result, "%.1f");

    // Free all tensor memory
    tofu_tensor_free_data_too(a);
    tofu_tensor_free_data_too(b);
    tofu_tensor_free_data_too(result);

    return 0;
}

Compiling Your Program

To compile this program, you'll need to link against the Tofu library:

gcc -o first first.c -I/path/to/tofu/build/include \
    -L/path/to/tofu/build/src -ltofu -lm

Or if you've installed Tofu to your system:

gcc -o first first.c -ltofu -lm

The -lm flag is needed for the math library, and -ltofu links against Tofu.

Expected Output

When you run the program, you should see:

Result of a @ b:
[[22.0 28.0]
 [49.0 64.0]]

This is the result of multiplying:

  • Matrix a: [[1, 2, 3], [4, 5, 6]] (shape 2×3)
  • Matrix b: [[1, 2], [3, 4], [5, 6]] (shape 3×2)

What Just Happened?

Let's break down the key components of your first tensor program:

Tensor Creation

tofu_tensor *a = tofu_tensor_zeros(2, dims_a, TOFU_FLOAT);

This creates a tensor (multi-dimensional array) with:

  • 2 dimensions (a matrix)
  • Shape defined by dims_a (2 rows, 3 columns)
  • Data type TOFU_FLOAT (32-bit floating-point)
  • All elements initialized to zero

Tensor Data Access

TOFU_TENSOR_DATA_FROM(a, i, vals_a[i], TOFU_FLOAT);

This macro writes a floating-point value into the tensor at flat index i. Tensors store data in a flat array, but Tofu handles the indexing for multi-dimensional access.
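
For example, to target a specific (row, column) position you compute the flat index yourself. The helper below is purely illustrative and not part of Tofu; it assumes the row-major layout used throughout this guide:

// Hypothetical helper: flat index of (row, col) in a row-major matrix
int flat_index(int row, int col, int num_cols) {
    return row * num_cols + col;
}

// Write 9.0 into element (1, 2) of the 2x3 matrix a
TOFU_TENSOR_DATA_FROM(a, flat_index(1, 2, 3), 9.0f, TOFU_FLOAT);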

Operations

tofu_tensor *result = tofu_tensor_matmul(a, b, NULL);

Matrix multiplication is one of the fundamental operations in Tofu. Passing NULL for the destination tensor tells Tofu to allocate a new tensor for the result.
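
If you already have a tensor of the right shape, the third argument presumably lets you reuse it as the destination instead of allocating a new one. This is an assumption based on the description above rather than the API reference; for these operands the destination would need shape [2, 2]:

int dims_out[] = {2, 2};
tofu_tensor *dest = tofu_tensor_zeros(2, dims_out, TOFU_FLOAT);
tofu_tensor_matmul(a, b, dest);   // assumed: result is written into dest
tofu_tensor_free_data_too(dest);  // dest was allocated by tofu_tensor_zeros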

Memory Management

tofu_tensor_free_data_too(a);

Always clean up your tensors when done. Use tofu_tensor_free_data_too() when you created the tensor with tofu_tensor_zeros() (which allocates both the structure and data). This prevents memory leaks.
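
For contrast, a tensor created with tofu_tensor_create() around memory you already manage only needs its structure freed (the same ownership rule is covered in detail in the Tensors guide):

float buf[4] = {0.0f, 0.0f, 0.0f, 0.0f};
tofu_tensor *w = tofu_tensor_create(buf, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free(w);   // frees only the structure; buf itself is untouched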

For deeper explanations of tensors, operations, and data types, see Core Concepts.

Next Steps

Now that you've created your first tensor program, you're ready to explore more:

  • Build a neural network: Learn how to create layers and train models in Your First Network
  • Explore more operations: Check out the API reference for all available tensor operations
  • Try more examples: Look for example programs in the examples/ directory

Happy tensor computing!

Your First Neural Network

Welcome to Tofu! In this guide, you'll build and train your first neural network to solve the classic XOR problem. By the end, you'll understand how to construct computation graphs, perform forward and backward passes, and optimize model parameters.

What You'll Build

You'll build a neural network that learns the XOR (exclusive OR) function. XOR is a simple yet elegant problem that demonstrates why neural networks need hidden layers. Your final model will output correct predictions for all four XOR inputs.

Why XOR?

XOR is the perfect learning problem because:

  1. Non-linear: You can't solve XOR with a single linear layer. This teaches you why hidden layers matter.
  2. Small: The dataset has only 4 examples, so training is fast.
  3. Well-understood: We know exactly what "correct" looks like.
  4. Practical: The same patterns apply to larger, real-world problems.

What You'll Learn

  • How to create and structure computation graphs in Tofu
  • How to build a multi-layer neural network
  • How to execute forward passes (predictions)
  • How to perform backward passes (gradient computation)
  • How to run training loops with optimizers
  • How to manage memory ownership correctly
  • How to verify that your network actually learned

The XOR Problem

Understanding XOR

XOR returns 1 when inputs are different, 0 when they're the same:

[0, 0] → 0  (same, output 0)
[0, 1] → 1  (different, output 1)
[1, 0] → 1  (different, output 1)
[1, 1] → 0  (same, output 0)

Why It's Special

A single linear layer cannot learn XOR. Mathematically, XOR is not linearly separable—you cannot draw a single straight line to separate the 1s from the 0s on a 2D plane.

However, a network with a hidden layer can solve it by learning intermediate features. The hidden layer performs a non-linear transformation that makes XOR linearly separable in the higher-dimensional hidden space.

Network Architecture

To solve XOR, we'll use this architecture:

Input Layer (2 units)
    ↓
Hidden Layer (4 units with ReLU activation)
    ↓
Output Layer (1 unit)

The flow:

  1. Input layer: Takes [x1, x2] (the two binary inputs)
  2. Hidden layer: Learns 4 intermediate features via matrix multiplication and ReLU
  3. Output layer: Combines hidden features to produce final prediction

The ReLU (Rectified Linear Unit) activation in the hidden layer is crucial—it introduces non-linearity. Without it, stacking layers would be equivalent to a single linear layer.

Complete Code Walkthrough

Here's the full XOR training program. We'll break it down into sections and explain each part.

Section 1: Setup and Includes

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"

/* Xavier weight initialization for better convergence */
static float xor_xavier_init(int fan_in) {
    float limit = sqrtf(6.0f / (float)fan_in);
    return limit * (2.0f * (float)rand() / RAND_MAX - 1.0f);
}

We include Tofu's three core modules:

  • tofu_tensor.h: Tensor creation and manipulation
  • tofu_graph.h: Computation graph construction and differentiation
  • tofu_optimizer.h: Gradient-based parameter updates

The xor_xavier_init function initializes weights using Xavier initialization. This ensures weights start in a good range that helps training converge faster than random initialization.

Section 2: Main Function and Configuration

int main() {
    printf("============================================================\n");
    printf("XOR Neural Network Training Example\n");
    printf("============================================================\n\n");

    /* Configuration */
    const int INPUT_SIZE = 2;
    const int HIDDEN_SIZE = 4;
    const int OUTPUT_SIZE = 1;
    const int NUM_EPOCHS = 2000;
    const float LEARNING_RATE = 0.1f;
    const int REPORT_INTERVAL = 200;

These constants define our network shape and training hyperparameters:

  • INPUT_SIZE (2): Two binary inputs for XOR
  • HIDDEN_SIZE (4): Four hidden units (more than enough to solve XOR)
  • OUTPUT_SIZE (1): Single output for binary classification
  • NUM_EPOCHS (2000): Number of times we iterate through the dataset
  • LEARNING_RATE (0.1): Controls step size in parameter updates (higher = faster but riskier)
  • REPORT_INTERVAL (200): How often to print progress

Section 3: Data Preparation

    /* Prepare XOR dataset */
    float xor_inputs[4][2] = {
        {0.0f, 0.0f},
        {0.0f, 1.0f},
        {1.0f, 0.0f},
        {1.0f, 1.0f}
    };

    float xor_targets[4][1] = {
        {0.0f},
        {1.0f},
        {1.0f},
        {0.0f}
    };

    printf("XOR Dataset:\n");
    for (int i = 0; i < 4; i++) {
        printf("  [%.0f, %.0f] -> %.0f\n",
               xor_inputs[i][0], xor_inputs[i][1], xor_targets[i][0]);
    }
    printf("\n");

We hardcode the complete XOR dataset. These four examples are the entire input space, so there is no separate test set: the network simply has to fit all four cases correctly.

Section 4: Creating the Computation Graph

    /* Create computation graph */
    tofu_graph* g = tofu_graph_create();
    assert(g != NULL);

We create an empty computation graph. All our operations will be added to this graph. Think of it as a blueprint for computations—nodes represent operations, edges represent data flow.

Important concept: The graph doesn't own the tensor data. We allocate tensors separately and pass them to the graph. We're responsible for freeing them later.

Section 5: Initializing Network Parameters

    /* Initialize weights with Xavier initialization */
    float* w1_data = (float*)malloc(INPUT_SIZE * HIDDEN_SIZE * sizeof(float));
    for (int i = 0; i < INPUT_SIZE * HIDDEN_SIZE; i++) {
        w1_data[i] = xor_xavier_init(INPUT_SIZE);
    }
    tofu_tensor* t_w1 = tofu_tensor_create(w1_data, 2,
                                           (int[]){INPUT_SIZE, HIDDEN_SIZE}, TOFU_FLOAT);
    tofu_graph_node* w1 = tofu_graph_param(g, t_w1);

    /* Initialize first bias */
    float* b1_data = (float*)calloc(HIDDEN_SIZE, sizeof(float));
    tofu_tensor* t_b1 = tofu_tensor_create(b1_data, 1, (int[]){HIDDEN_SIZE}, TOFU_FLOAT);
    tofu_graph_node* b1 = tofu_graph_param(g, t_b1);

    /* Initialize weights for hidden to output */
    float* w2_data = (float*)malloc(HIDDEN_SIZE * OUTPUT_SIZE * sizeof(float));
    for (int i = 0; i < HIDDEN_SIZE * OUTPUT_SIZE; i++) {
        w2_data[i] = xor_xavier_init(HIDDEN_SIZE);
    }
    tofu_tensor* t_w2 = tofu_tensor_create(w2_data, 2,
                                           (int[]){HIDDEN_SIZE, OUTPUT_SIZE}, TOFU_FLOAT);
    tofu_graph_node* w2 = tofu_graph_param(g, t_w2);

    /* Initialize second bias */
    float* b2_data = (float*)calloc(OUTPUT_SIZE, sizeof(float));
    tofu_tensor* t_b2 = tofu_tensor_create(b2_data, 1, (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
    tofu_graph_node* b2 = tofu_graph_param(g, t_b2);

We initialize four parameter tensors:

  1. w1: Shape [2, 4]. Transforms input to hidden layer. Each column is one hidden unit's weights from both inputs.
  2. b1: Shape [4]. Bias for each hidden unit. We use calloc (zeros) for biases.
  3. w2: Shape [4, 1]. Transforms hidden to output. Weights from all 4 hidden units to the single output.
  4. b2: Shape [1]. Bias for the output unit.

Each tensor is converted to a graph node using tofu_graph_param. These are "parameter" nodes (trainable) as opposed to "input" nodes (non-trainable). The optimizer will update these during training.

Section 6: Creating the Optimizer

    /* Create optimizer */
    tofu_optimizer* optimizer = tofu_optimizer_sgd_create(g, LEARNING_RATE);
    assert(optimizer != NULL);

We create an SGD (Stochastic Gradient Descent) optimizer with our learning rate. The optimizer will:

  1. Collect all parameter nodes from the graph
  2. Read the gradients computed during backward passes
  3. Update parameters based on those gradients

SGD is simple and effective: param = param - learning_rate * gradient
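
Conceptually, one optimizer step applies that rule to every element of every parameter tensor. The loop below is only a sketch of the idea in plain C, not Tofu's actual implementation:

/* Sketch: what an SGD update amounts to for the w1 parameter */
tofu_tensor* w1_grad = tofu_graph_get_grad(w1);   /* gradient from backward pass */
float* w = (float*)t_w1->data;
float* grad = (float*)w1_grad->data;
for (int i = 0; i < t_w1->len; i++) {
    w[i] -= LEARNING_RATE * grad[i];
}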

Section 7: Training Loop Structure

    float best_loss = INFINITY;

    for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
        float epoch_loss = 0.0f;

        /* Process each training example */
        for (int sample = 0; sample < 4; sample++) {

We train for 2000 epochs (full passes through the dataset). Each epoch processes all 4 XOR examples. We track the best loss and accumulate epoch loss for reporting.

This is "online learning" (one example at a time) rather than "batch learning," which is appropriate for tiny datasets.

Section 8: Forward Pass

            /* Zero gradients */
            tofu_graph_zero_grad(g);

            /* Create input tensor for this sample */
            float* input_data = (float*)malloc(INPUT_SIZE * sizeof(float));
            input_data[0] = xor_inputs[sample][0];
            input_data[1] = xor_inputs[sample][1];
            tofu_tensor* t_input = tofu_tensor_create(input_data, 1,
                                                      (int[]){INPUT_SIZE}, TOFU_FLOAT);
            tofu_graph_node* x = tofu_graph_input(g, t_input);

            /* Forward pass: Layer 1 */
            /* h1 = x @ w1 + b1 */
            tofu_graph_node* h1_matmul = tofu_graph_matmul(g, x, w1);
            tofu_graph_node* h1_bias = tofu_graph_add(g, h1_matmul, b1);

            /* Apply ReLU activation */
            tofu_graph_node* h1_relu = tofu_graph_relu(g, h1_bias);

            /* Forward pass: Layer 2 (output) */
            /* y = h1 @ w2 + b2 */
            tofu_graph_node* y_matmul = tofu_graph_matmul(g, h1_relu, w2);
            tofu_graph_node* y_pred = tofu_graph_add(g, y_matmul, b2);

            /* Create target tensor */
            float* target_data = (float*)malloc(OUTPUT_SIZE * sizeof(float));
            target_data[0] = xor_targets[sample][0];
            tofu_tensor* t_target = tofu_tensor_create(target_data, 1,
                                                       (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
            tofu_graph_node* y_target = tofu_graph_input(g, t_target);

            /* Compute MSE loss */
            tofu_graph_node* loss_node = tofu_graph_mse_loss(g, y_pred, y_target);

Let's trace through the computation:

Zero gradients: Before computing new gradients, we clear old ones to prevent accumulation.

Create input node: Each example is a separate tensor created fresh, wrapped in a graph input node (non-trainable).

Hidden layer computation:

  1. h1_matmul = x @ w1: Matrix multiply. Input [1, 2] @ weights [2, 4] → [1, 4]
  2. h1_bias = h1_matmul + b1: Add bias [4] to each row (broadcasting)
  3. h1_relu = ReLU(h1_bias): Apply ReLU activation element-wise (max(0, x))

Output layer computation:

  1. y_matmul = h1_relu @ w2: Matrix multiply. Hidden [1, 4] @ weights [4, 1] → [1, 1]
  2. y_pred = y_matmul + b2: Add output bias [1]

Loss computation: loss = MSE(y_pred, y_target), the mean squared error. With a single output element this reduces to (y_pred - y_target)².

The computation graph now contains a chain: input → matmul → add → ReLU → matmul → add → loss

Section 9: Backward Pass and Parameter Update

            /* Extract loss value */
            tofu_tensor* loss_tensor = tofu_graph_get_value(loss_node);
            float sample_loss = 0.0f;
            if (loss_tensor && loss_tensor->len > 0) {
                TOFU_TENSOR_DATA_TO(loss_tensor, 0, sample_loss, TOFU_FLOAT);
            }
            epoch_loss += sample_loss;

            /* Backward pass: compute gradients */
            tofu_graph_backward(g, loss_node);

            /* Optimizer step: update weights and biases */
            tofu_optimizer_step(optimizer);

            /* Cleanup input/target tensors for this sample */
            tofu_tensor_free(t_input);
            tofu_tensor_free(t_target);
            free(input_data);
            free(target_data);

            /* Clear operations for next sample (keeps parameters) */
            tofu_graph_clear_ops(g);
        }

        /* Average loss over all 4 samples */
        epoch_loss /= 4;

Extract loss: We read the numerical loss value from the computed tensor using TOFU_TENSOR_DATA_TO.

Backward pass: tofu_graph_backward(g, loss_node) propagates gradients backward through the graph. Starting from the loss scalar, it computes:

  • ∂loss/∂y_pred (what change in prediction would reduce loss)
  • ∂loss/∂h1_relu (through the output layer)
  • ∂loss/∂h1_bias (through ReLU)
  • ∂loss/∂h1_matmul (through addition)
  • ∂loss/∂w1, ∂loss/∂b1 (through matmul and add)
  • ∂loss/∂w2, ∂loss/∂b2 (through second layer)

This is automatic differentiation—Tofu handles all the calculus!

Parameter update: tofu_optimizer_step(optimizer) updates all parameters using their gradients:

  • w1 ← w1 - learning_rate × ∂loss/∂w1
  • b1 ← b1 - learning_rate × ∂loss/∂b1
  • (and similarly for w2, b2)

Cleanup: We free the input and target tensors (we allocated them, so we own them). Importantly, we keep w1, b1, w2, b2 (the parameters) intact.

Clear operations: tofu_graph_clear_ops(g) removes all the intermediate computation nodes but keeps the parameters. This prepares for the next sample without recreating parameters.

Section 10: Training Progress Reporting

        /* Track the lowest loss seen so far (reported after training) */
        if (epoch_loss < best_loss) {
            best_loss = epoch_loss;
        }

        /* Report progress */
        if (epoch % REPORT_INTERVAL == 0 || epoch == NUM_EPOCHS - 1) {
            printf("Epoch %4d: loss = %.6f\n", epoch, epoch_loss);
        }
    }

    printf("\nTraining Complete!\n");
    printf("Final average loss: %.6f\n", best_loss);

We track the lowest epoch loss for the final report and print progress every 200 epochs and at the end. Watching loss decrease is satisfying and helps you spot problems (e.g., loss increasing means the learning rate is too high).

Section 11: Evaluation

    printf("\n");
    printf("Final Predictions:\n");
    printf("Input       Predicted  Target\n");
    printf("----        ---------  ------\n");

    for (int sample = 0; sample < 4; sample++) {
        /* Build inference graph for this sample */
        float* input_data = (float*)malloc(INPUT_SIZE * sizeof(float));
        input_data[0] = xor_inputs[sample][0];
        input_data[1] = xor_inputs[sample][1];
        tofu_tensor* t_input = tofu_tensor_create(input_data, 1,
                                                  (int[]){INPUT_SIZE}, TOFU_FLOAT);
        tofu_graph_node* x = tofu_graph_input(g, t_input);

        /* Forward pass (same as training) */
        tofu_graph_node* h1_matmul = tofu_graph_matmul(g, x, w1);
        tofu_graph_node* h1_bias = tofu_graph_add(g, h1_matmul, b1);
        tofu_graph_node* h1_relu = tofu_graph_relu(g, h1_bias);
        tofu_graph_node* y_matmul = tofu_graph_matmul(g, h1_relu, w2);
        tofu_graph_node* y_pred = tofu_graph_add(g, y_matmul, b2);

        /* Get prediction */
        tofu_tensor* pred_tensor = tofu_graph_get_value(y_pred);
        float prediction = 0.0f;
        if (pred_tensor && pred_tensor->len > 0) {
            TOFU_TENSOR_DATA_TO(pred_tensor, 0, prediction, TOFU_FLOAT);
        }

        printf("[%.0f, %.0f]    %.4f    %.0f\n",
               xor_inputs[sample][0], xor_inputs[sample][1],
               prediction, xor_targets[sample][0]);

        tofu_tensor_free(t_input);
        free(input_data);
        tofu_graph_clear_ops(g);
    }

After training, we perform inference (forward pass without backward) on all 4 examples. We print predicted values vs. targets. Since our output is a real number (not discrete), we'll see values close to 0 or 1.

Section 12: Accuracy Check and Cleanup

    /* Check accuracy (threshold at 0.5) */
    int correct = 0;
    for (int sample = 0; sample < 4; sample++) {
        /* ... build prediction graph ... */
        int pred_class = (prediction > 0.5f) ? 1 : 0;
        int true_class = (int)xor_targets[sample][0];

        if (pred_class == true_class) {
            correct++;
        }
        /* ... cleanup ... */
    }

    float accuracy = (float)correct / 4.0f;
    printf("Accuracy: %d/4 (%.1f%%)\n", correct, accuracy * 100.0f);

    if (accuracy == 1.0f) {
        printf("\nSuccess! Network learned XOR perfectly!\n");
    }

    /* Cleanup (IMPORTANT ORDER) */
    printf("\n");
    printf("Cleaning up resources...\n");

    tofu_optimizer_free(optimizer);
    tofu_graph_free(g);

    /* Free parameter tensors (caller owns them) */
    tofu_tensor_free_data_too(t_w1);
    tofu_tensor_free_data_too(t_b1);
    tofu_tensor_free_data_too(t_w2);
    tofu_tensor_free_data_too(t_b2);

    printf("Done!\n");
    printf("============================================================\n");

    return 0;
}

We convert predictions to binary (threshold at 0.5) and compute accuracy. Finally, cleanup is critical and in the right order:

  1. Free optimizer (it might hold references to the graph)
  2. Free graph (it owns the nodes but not the tensor data)
  3. Free parameter tensors (we allocated the data, so we free it)

This order prevents use-after-free errors.

Compiling and Running

Save the complete program as examples/xor_training.c (or copy from Tofu's examples directory).

Compile

# From the tofu directory
make lib                # Build the library
cc -I./src examples/xor_training.c build/src/libtofu.a -o xor_training -lm

Run

./xor_training

Expected Output

============================================================
XOR Neural Network Training Example
============================================================

Network Architecture: [2] -> [4] -> [1]
Training: 2000 epochs with SGD, learning_rate=0.100

XOR Dataset:
  [0, 0] -> 0
  [0, 1] -> 1
  [1, 0] -> 1
  [1, 1] -> 0

Epoch    0: loss = 0.488130
Epoch  200: loss = 0.000000
Epoch  400: loss = 0.000000
...
Epoch 1999: loss = 0.000000

Training Complete!
Final average loss: 0.000000

Final Predictions:
Input       Predicted  Target
----        ---------  ------
[0, 0]    0.0000    0
[0, 1]    1.0000    1
[1, 0]    1.0000    1
[1, 1]    0.0000    0

Accuracy: 4/4 (100.0%)

Success! Network learned XOR perfectly!

Cleaning up resources...
Done!

Key observation: Loss rapidly converges to ~0 within a few hundred epochs! The network quickly learns to solve XOR.

Understanding the Results

Why Loss Decreases

Initially, the network makes random predictions (loss ≈ 0.49, far from correct). During training:

  1. Gradients computed via backprop tell each parameter how it contributed to the error
  2. Parameters move (via optimizer) in the direction that reduces loss
  3. With each example, the network improves
  4. After sufficient epochs, predictions are nearly perfect (loss → 0)

This iterative improvement is the essence of machine learning.

What the Network Learned

The hidden layer developed 4 internal features (learned by the 4 hidden units). These features transform the input space so that XOR becomes linearly separable. Think of it as the network learning new coordinate axes in which the problem is easier.

The output layer learned to combine these 4 hidden features into a single decision.

How to Verify Learning

The predictions match targets perfectly:

  • [0, 0] → 0.0000 (should be 0) ✓
  • [0, 1] → 1.0000 (should be 1) ✓
  • [1, 0] → 1.0000 (should be 1) ✓
  • [1, 1] → 0.0000 (should be 0) ✓

100% accuracy is the highest possible.

Experimenting Further

Now that you've trained a network, try modifying parameters to understand their effects:

Try Different Learning Rates

Change LEARNING_RATE to 0.01 (slower) or 0.5 (faster, but risky). Watch how convergence speed changes.

Try Different Hidden Sizes

Change HIDDEN_SIZE to 2 (too small—might not converge) or 8 (overkill). Can the network still solve XOR?

Add More Hidden Layers

Modify the forward pass to add another hidden layer. You'll also need new parameter nodes with matching shapes (for example, a second hidden weight matrix and bias, plus a new output weight w3), created the same way as w1 and w2:

tofu_graph_node* h2 = tofu_graph_matmul(g, h1_relu, w2);
h2 = tofu_graph_add(g, h2, b2);
h2 = tofu_graph_relu(g, h2);
tofu_graph_node* y_matmul = tofu_graph_matmul(g, h2, w3);

Does a deeper network help? (For XOR, it shouldn't be necessary.)

Monitor Individual Gradients

After tofu_graph_backward(), print gradient values to understand what each parameter is learning:

tofu_tensor* w1_grad = tofu_graph_get_grad(w1);
printf("W1 gradient[0]: %.6f\n", ...);

Next Steps

You've mastered the fundamentals! Here's your learning path:

  1. Dive Deeper: Read the Concepts Guide to understand backpropagation and automatic differentiation in detail.

  2. Build Bigger: Study the CNN Training Example to see how to scale to realistic datasets and architectures.

  3. Real Datasets: Try training on real data:

    • MNIST for digit classification
    • Iris for flower classification
    • Your own custom dataset
  4. Advanced Optimizers: Experiment with SGD with momentum or Adam (if available in your version).

  5. API Reference: Consult the Graph API and Optimizer API for complete documentation of all functions.

Key Takeaways

  • Computation graphs let you define complex computations and differentiate them automatically
  • Forward pass computes predictions (operations evaluate top-to-bottom)
  • Backward pass computes gradients (automatic differentiation flows bottom-to-top)
  • Optimizers update parameters based on gradients to minimize loss
  • Memory ownership is crucial: you own input/parameter tensors, graph owns computed nodes
  • Iteration (epochs) matters: neural networks improve with repeated exposure to data

You now understand the full training pipeline. You're ready to tackle more complex problems!

Core Concepts

Understanding the fundamental concepts behind Tofu will help you build neural networks efficiently and write correct, safe code. This guide introduces the key ideas that make Tofu powerful.

Introduction

Tofu is built around a small number of core concepts: tensors, computation graphs, and automatic differentiation. These three ideas work together to enable rapid development of machine learning systems.

If you come from a NumPy background, tensors will feel familiar—they're just multi-dimensional arrays. The difference is that Tofu tensors can be organized into computation graphs that automatically compute gradients for training. You won't need to manually derive derivative formulas or write error-prone gradient code.

This guide builds intuition for each concept with concrete examples. By the end, you'll understand how they fit together in a complete training loop.

Tensors

A tensor is simply a multi-dimensional array of numbers. Tensors are the fundamental data structure in Tofu—all neural network operations work with tensors.

Understanding Tensor Dimensions

Think of tensors as a natural extension of scalar numbers:

  • Scalar: A single number. Shape: [] (0 dimensions). Example: 5.0
  • Vector: A 1-D list of numbers. Shape: [3] (1 dimension). Example: [1.0, 2.0, 3.0]
  • Matrix: A 2-D grid of numbers. Shape: [2, 3] (2 dimensions). Example: [[1, 2, 3], [4, 5, 6]]
  • Tensor (3-D+): Higher-dimensional arrays. Shape: [2, 3, 4] (3 dimensions), etc.

In practice, machine learning uses tensors extensively:

  • An image might be shape [height, width, channels] (e.g., [28, 28, 1] for 28x28 grayscale)
  • A batch of images might be shape [batch_size, height, width, channels] (e.g., [32, 28, 28, 1])
  • Neural network weights are often 2-D matrices: shape [input_dim, output_dim]

Tensor Shape and Size

Every tensor has a shape—a tuple of integer dimensions. The total number of elements is the product of all dimensions.

Tensor shape [2, 3, 4] contains 2 * 3 * 4 = 24 elements

Tofu's tensor structure stores both the shape and a flat data buffer:

tofu_tensor {
    int ndim;        // Number of dimensions (e.g., 3)
    int *dims;       // Array of dimension sizes (e.g., [2, 3, 4])
    void *data;      // Flat buffer of 24 floating point numbers
    tofu_dtype dtype; // Data type (TOFU_FLOAT, TOFU_INT32, etc.)
}
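
The total element count (stored as len in the full structure shown later in the Tensors guide) is just the product of the entries in dims:

int dims[] = {2, 3, 4};
int len = 1;
for (int i = 0; i < 3; i++) {
    len *= dims[i];   // len == 24, matching the shape [2, 3, 4] above
}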

Data Types

Tensors can hold different numeric types depending on your needs:

  • TOFU_FLOAT - 32-bit floating point (most common for neural networks)
  • TOFU_DOUBLE - 64-bit floating point (higher precision)
  • TOFU_INT32, TOFU_INT64 - Integer types
  • TOFU_BOOL - Boolean values

For machine learning, you'll typically use TOFU_FLOAT for efficiency and simplicity.

Creating Tensors

Tofu provides several ways to create tensors:

// Create tensor from existing data buffer (you manage the buffer)
float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
int dims[] = {2, 2};
tofu_tensor *t = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);

// Create tensor with values (library manages allocation)
float values[] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *t = tofu_tensor_create_with_values(values, 2, dims);

// Create zero-filled tensor
int dims[] = {3, 4};
tofu_tensor *t = tofu_tensor_zeros(2, dims, TOFU_FLOAT);

// Create tensor with sequential values (like NumPy arange)
tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT);  // [0, 1, ..., 9]

Tensor Operations

Tofu provides three main categories of tensor operations:

Element-wise operations apply an operation to each element independently:

Addition:  [1, 2] + [3, 4] = [4, 6]
Multiply:  [1, 2] * [3, 4] = [3, 8]
Power:     [2, 3] ^ 2 = [4, 9]

Matrix operations perform linear algebra calculations:

Matrix multiplication: [2,3] @ [3,4] = [2,4]

Reduction operations combine elements along an axis:

Sum reduction along axis 1:
[[1, 2, 3],     [6]        (1+2+3=6)
 [4, 5, 6]] --> [15]       (4+5+6=15)

These operations form the building blocks of neural networks. For example, a fully-connected layer performs: output = matmul(input, weights) + bias.
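
As a rough sketch of that layer at the tensor level (using tofu_tensor_matmul from the Quick Start, with the bias added by direct pointer access since no dedicated tensor-level add has been introduced here):

// input: [1, in_dim], weights: [in_dim, out_dim], bias: [out_dim]
tofu_tensor *out = tofu_tensor_matmul(input, weights, NULL);  // [1, out_dim]
float *o = (float *)out->data;
float *bv = (float *)bias->data;
for (int j = 0; j < out->len; j++) {
    o[j] += bv[j];   // add the bias to each output element
}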

Broadcasting: Implicit Dimension Expansion

A powerful feature of tensor operations is broadcasting—automatically expanding smaller tensors to match larger ones:

Matrix [3, 4] + Vector [4] broadcasts the vector across every row:

[[a, b, c, d],                      [[a+x, b+y, c+z, d+w],
 [e, f, g, h],  +  [x, y, z, w]  =   [e+x, f+y, g+z, h+w],
 [i, j, k, l]]                       [i+x, j+y, k+z, l+w]]

The vector [x, y, z, w] is implicitly treated as if it were repeated for each row.

This allows you to add biases to layer outputs without manually replicating them.

Computation Graphs

A computation graph is a way of representing how data flows through operations. Instead of computing results immediately, you describe the computation, then ask the graph to compute outputs and gradients.

Why Use Computation Graphs?

There are two key advantages:

  1. Memory efficiency: The graph knows all operations in advance, so it can optimize memory usage.
  2. Automatic differentiation: Once you have the graph, computing gradients is automatic—no manual derivative math needed.

Graph Structure: Directed Acyclic Graph (DAG)

A computation graph is a directed acyclic graph (DAG) where:

  • Nodes represent tensors or operations
  • Edges represent data flow between nodes
  • No cycles (data only flows forward)

Here's a simple example:

      x (INPUT)  W (PARAM)
           |        |
           v        v
        matmul ←────+
           |
           | y
           v
          add ← bias (PARAM)
           |
           v
         relu
           |
           v
       output

This graph represents: output = relu((x @ W) + bias)

Node Types

Nodes in a computation graph come in three flavors:

Leaf nodes have no inputs:

  • INPUT: Data that doesn't need gradients (e.g., batch data)
  • PARAM: Trainable parameters (e.g., weights, biases)

Operation nodes combine inputs:

  • MATMUL: Matrix multiplication
  • ADD: Element-wise addition
  • MUL: Element-wise multiplication
  • RELU: Activation function
  • SOFTMAX: Softmax activation
  • MSE_LOSS: Mean squared error loss
  • CE_LOSS: Cross-entropy loss
  • And many more...

Important: Graph nodes own their results (the tensors computed by operations), but the graph does NOT own INPUT or PARAM tensors. You create those tensors, add them to the graph, and you're responsible for freeing them later.

Building a Graph

Creating a graph follows this pattern:

// 1. Create tensors (you own these)
tofu_tensor *input = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 5}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){5}, TOFU_FLOAT);

// 2. Create graph and add leaf nodes
tofu_graph *g = tofu_graph_create();

tofu_graph_node *x = tofu_graph_input(g, input);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// 3. Build computation by adding operations
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
tofu_graph_node *z = tofu_graph_add(g, y, b);
tofu_graph_node *out = tofu_graph_relu(g, z);

The graph now contains all the computation. Each operation node automatically computes its value during construction.

Forward Pass: Computing Outputs

When you add the first operation node to a graph, forward pass computation happens automatically:

tofu_graph_node *y = tofu_graph_matmul(g, x, W);  // Forward pass computes matmul
tofu_tensor *result = tofu_graph_get_value(y);   // Get the computed result

Each node stores its result in node->value. You can inspect these at any time during or after building the graph.
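
For example, printing an intermediate result while debugging (tofu_tensor_print is shown in the Quick Start):

tofu_tensor_print(tofu_graph_get_value(y), "%.3f");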

Automatic Differentiation

Automatic differentiation is the magic that lets you compute gradients without writing a single derivative formula. It works by building on the chain rule from calculus.

The Chain Rule in Action

Recall from calculus: if z = f(g(x)), then dz/dx = (df/dg) * (dg/dx).

In neural networks, this chains together:

Forward:  x → square → * 2 → y

Backward: ∂y/∂x = ∂(*2)/∂(square) * ∂(square)/∂x
                 = 2 * (2*x)
                 = 4x

For x = 3: Forward gives y = (3^2) * 2 = 18. Backward gives dy/dx = 4*3 = 12.
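
You can sanity-check that arithmetic with a short standalone C program using a finite difference (ordinary C, independent of Tofu):

#include <stdio.h>

int main(void) {
    const double x = 3.0, eps = 1e-6;
    double y = 2.0 * x * x;                                  /* forward: y = 18 */
    double dydx = (2.0 * (x + eps) * (x + eps) - y) / eps;   /* approx 4x = 12  */
    printf("y = %.1f, dy/dx ~= %.4f\n", y, dydx);
    return 0;
}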

Two Phases: Forward and Backward

Training has two phases:

Forward pass: Compute outputs by executing the graph. Each node records its operation and inputs for later use.

Backward pass: Starting from a loss scalar, compute gradients by working backwards through the graph, applying the chain rule at each node.

Forward Pass:
x → op1 → op2 → loss
(compute intermediate values)

Backward Pass (reverse):
loss → ∂op2 → ∂op1 → gradients for x
(compute gradients using chain rule)

Reverse-Mode Autodiff: How Tofu Does It

Tofu uses reverse-mode automatic differentiation (also called backpropagation):

  1. Build the graph by adding nodes (forward pass happens automatically)
  2. Call tofu_graph_backward(g, loss) to compute gradients
  3. Gradients accumulate in node->grad for each node

During backward:

// Each node that requires gradients gets a ∂loss/∂node value in node->grad
tofu_graph_node *W = tofu_graph_param(g, weights);
// ... build graph and call backward ...
tofu_tensor *W_grad = tofu_graph_get_grad(W);  // Now contains ∂loss/∂W

The backward pass visits nodes in reverse topological order, so gradients flow correctly through the entire graph.

Gradient Accumulation

Gradients accumulate by default. If you call backward twice without zeroing, gradients add up:

tofu_graph_backward(g, loss1);  // node->grad = ∂loss1/∂node
tofu_graph_backward(g, loss2);  // node->grad += ∂loss2/∂node (accumulates!)

This is why you must call tofu_graph_zero_grad() before each training iteration:

for (int epoch = 0; epoch < 100; epoch++) {
    tofu_graph_zero_grad(g);    // Clear old gradients

    // Forward pass (build graph)
    tofu_graph_node *loss = build_forward_pass(g);

    // Backward pass (compute new gradients)
    tofu_graph_backward(g, loss);
}

Memory Management and Ownership

Proper memory management is critical in Tofu. Understanding ownership prevents memory leaks and use-after-free bugs.

Ownership Rules

Caller owns: INPUT and PARAM tensors

When you create a tensor and pass it to tofu_graph_input() or tofu_graph_param(), the graph does NOT take ownership. You must free it yourself:

tofu_tensor *input = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);
tofu_graph_node *x = tofu_graph_input(g, input);

// After you're done with the graph...
tofu_graph_free(g);      // Graph is freed
tofu_tensor_free(input); // YOU must free the input tensor

Graph owns: Operation results

When an operation creates a result (like matmul, add, relu), the graph owns that tensor. You never free operation results—the graph does:

tofu_graph_node *result = tofu_graph_matmul(g, a, b);
tofu_tensor *result_value = tofu_graph_get_value(result);

// When you're done:
tofu_graph_free(g);  // Frees result_value automatically

View Operations: Sharing Memory

Some operations create "views"—new tensor structures that share memory with originals. No data is copied:

tofu_tensor *t = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *reshaped = tofu_tensor_reshape(t, 2, (int[]){3, 4});

// reshaped shares data with t
// Both point to the same 12 floats, just interpreted differently

When using views, remember:

  • Don't free the view with free_data_too (that would free shared memory)
  • Use tofu_tensor_free() on views (just free the structure)
  • The original must outlive the view

Cleanup Order

Always clean up in this order:

// 1. Free optimizer (if used)
tofu_optimizer_free(opt);

// 2. Free graph (frees operation results and nodes)
tofu_graph_free(g);

// 3. Free parameter tensors (you created these)
tofu_tensor_free_data_too(weights);
tofu_tensor_free_data_too(bias);

// 4. Free input tensors (you created these)
tofu_tensor_free(input);

Common Mistakes

Mistake 1: Freeing a view with free_data_too

// WRONG!
tofu_tensor *view = tofu_tensor_reshape(t, 2, (int[]){3, 4});
tofu_tensor_free_data_too(view);  // Frees shared memory!

Correct:

tofu_tensor *view = tofu_tensor_reshape(t, 2, (int[]){3, 4});
tofu_tensor_free(view);            // Just free the structure

Mistake 2: Using graph node results after freeing the graph

// WRONG!
tofu_graph_free(g);
tofu_tensor *result = tofu_graph_get_value(node);  // Dangling pointer!

Correct:

tofu_tensor *result = tofu_graph_get_value(node);  // Get before freeing
tofu_graph_free(g);

Mistake 3: Forgetting to free parameter tensors

// WRONG!
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_free(g);  // Forgetting to free weights!

Correct:

tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_free(g);
tofu_tensor_free_data_too(weights);  // Free the tensor you created

Training Loop Pattern

A typical training loop follows a consistent pattern:

for each epoch:
    for each batch:
        1. Zero gradients
        2. Build forward graph and compute loss
        3. Backward pass (compute gradients)
        4. Optimizer step (update parameters using gradients)
        5. Clear operations (keep parameters, discard computation nodes)

Here's what this looks like in code:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);  // Learning rate

for (int epoch = 0; epoch < 100; epoch++) {
    for (int batch = 0; batch < num_batches; batch++) {
        // 1. Zero gradients
        tofu_optimizer_zero_grad(opt);

        // 2. Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
        tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
        pred = tofu_graph_add(g, pred, b);
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);

        // 3. Compute loss
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

        // 4. Backward pass (compute gradients)
        tofu_graph_backward(g, loss);

        // 5. Update parameters
        tofu_optimizer_step(opt);

        // 6. Clear operations for next batch (W and b are preserved)
        tofu_graph_clear_ops(g);
    }
}

The order is important:

  • Zero before forward: Clean slate for new gradients
  • Forward then backward: Must compute values before computing gradients
  • Backward before step: Optimizer needs gradients to update
  • Clear after step: Makes room for next batch while keeping parameters

How These Concepts Work Together

Now you can see how everything fits together:

  1. Tensors hold your data—inputs, parameters, and intermediate results
  2. Computation graphs describe the structure of your model and automatically compute results
  3. Automatic differentiation computes gradients by applying the chain rule through the graph
  4. Training loop repeats: zero gradients, forward, backward, update parameters

The power of this design: you describe your model once, and Tofu automatically computes all the gradients. No manual derivative formulas. No gradient bugs. Just correct, efficient training.

Summary

Understanding these core concepts will serve you well as you build neural networks with Tofu:

  • Tensors are multi-dimensional arrays—the fundamental data structure
  • Computation graphs organize operations in a way that enables automatic differentiation
  • Automatic differentiation computes gradients automatically using the chain rule
  • Memory ownership is explicit: you own inputs and parameters, graphs own operations
  • Training follows a pattern: zero gradients, forward, backward, update, repeat

Next, check out the tutorials to see these concepts in action. The first tutorial will walk you through building a complete neural network from scratch.

Tensors

Tensors are the fundamental data structure in Tofu, representing multi-dimensional arrays that flow through neural networks. This comprehensive guide covers everything you need to know about creating, manipulating, and operating on tensors.

Introduction

Tensors are multi-dimensional arrays that generalize scalars (0-D), vectors (1-D), and matrices (2-D) to arbitrary dimensions. In neural networks, tensors represent:

  • Input data (images, text, sensor readings)
  • Model parameters (weights, biases)
  • Intermediate activations
  • Gradients during backpropagation

Prerequisites

This guide assumes you've completed the Getting Started guide and understand:

  • Basic tensor concepts (shape, dimensions)
  • C programming and memory management
  • How to compile and link against Tofu

What This Guide Covers

  • Tensor fundamentals and memory layout
  • Creation methods for different use cases
  • Data access and modification patterns
  • Shape operations (reshape, transpose, slice)
  • Mathematical operations (matmul, element-wise, broadcasting)
  • Reduction operations (sum, mean, max)
  • Activation functions
  • Memory management and ownership

Tensor Fundamentals

Tensor Structure

A tofu_tensor represents a multi-dimensional array with the following key properties:

struct tofu_tensor {
    tofu_dtype dtype;          // Data type (FLOAT, INT32, etc.)
    int len;                   // Total number of elements
    int ndim;                  // Number of dimensions
    int *dims;                 // Array of dimension sizes
    void *data;                // Pointer to data buffer
    struct tofu_tensor *owner; // For view tensors, points to data owner
    void *backend_data;        // Backend-specific data
};

Example of a 2×3 matrix:

ndim = 2
dims = [2, 3]
len = 6
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

Visual representation:
[[1.0, 2.0, 3.0],
 [4.0, 5.0, 6.0]]

Data Types

Tofu supports multiple data types via the tofu_dtype enum:

Type          Description              Size      Use Cases
TOFU_FLOAT    32-bit floating point    4 bytes   Neural network weights, activations
TOFU_DOUBLE   64-bit floating point    8 bytes   High-precision computation
TOFU_INT32    32-bit signed integer    4 bytes   Labels, indices, counters
TOFU_INT64    64-bit signed integer    8 bytes   Large indices
TOFU_INT8     8-bit signed integer     1 byte    Quantized weights
TOFU_INT16    16-bit signed integer    2 bytes   Quantized activations
TOFU_BOOL     Boolean                  4 bytes   Masks, conditions

Most neural network operations use TOFU_FLOAT for weights and activations, while TOFU_INT32 is common for labels and class predictions.

Memory Layout

Tofu uses row-major (C-style) memory layout, where the last dimension varies fastest:

// 2×3 matrix
float data[] = {1.0, 2.0, 3.0,   // Row 0
                4.0, 5.0, 6.0};  // Row 1

// 2×3×4 tensor (2 matrices, each 3×4)
// data[0..11] = first matrix (row-major)
// data[12..23] = second matrix (row-major)

This affects how you iterate and index tensors:

// Iterating in memory order (efficient)
for (int i = 0; i < dims[0]; i++) {
    for (int j = 0; j < dims[1]; j++) {
        int index = i * dims[1] + j;
        // Access data[index]
    }
}

Creating Tensors

Wrapping Existing Data

Use tofu_tensor_create() to wrap existing data without copying:

tofu_tensor *tofu_tensor_create(void *data, int ndim, const int *dims,
                                tofu_dtype dtype);

This is efficient when you already have data in memory:

float weights[] = {0.1, 0.2, 0.3, 0.4};
tofu_tensor *W = tofu_tensor_create(weights, 2, (int[]){2, 2}, TOFU_FLOAT);

// Use W...

tofu_tensor_free(W);  // Frees structure, NOT data
// weights[] is still valid

Use when: Data is managed elsewhere (stack, static, or you handle malloc/free)

Zero Initialization

tofu_tensor_zeros() allocates and zero-initializes a tensor:

tofu_tensor *tofu_tensor_zeros(int ndim, const int *dims, tofu_dtype dtype);

Example:

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
// t is a 3×4 matrix filled with 0.0

tofu_tensor_free_data_too(t);  // Frees both structure and data

Use when: You need a new tensor and will populate it later (common for parameters)

Creating With Values

tofu_tensor_create_with_values() creates and initializes from an array:

float values[] = {1.0, 2.0, 3.0, 4.0};
tofu_tensor *t = tofu_tensor_create_with_values(values, 1, (int[]){4});

tofu_tensor_free_data_too(t);

Use when: You have initial values ready (common for biases, small constants)

Range of Values

tofu_tensor_arange() creates a tensor with evenly spaced values:

tofu_tensor *tofu_tensor_arange(double start, double stop, double step,
                                tofu_dtype dtype);

Examples:

// Forward slicing (positive step)
tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 2.0, TOFU_FLOAT);
// t = [0.0, 2.0, 4.0, 6.0, 8.0]

// Reverse slicing (negative step) - v1.1.0+
tofu_tensor *r = tofu_tensor_arange(10.0, 0.0, -2.0, TOFU_FLOAT);
// r = [10.0, 8.0, 6.0, 4.0, 2.0]

tofu_tensor_free_data_too(t);
tofu_tensor_free_data_too(r);

Use when: Generating test data, indices, or sequences

Note: Returns NULL for empty ranges (start == stop) or incompatible step directions (e.g., arange(0, 10, -1))
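
A defensive caller can check the return value before using the tensor (minimal sketch):

tofu_tensor *bad = tofu_tensor_arange(0.0, 10.0, -1.0, TOFU_FLOAT);
if (bad == NULL) {
    // Step points away from stop, so no elements were produced
    printf("arange returned NULL - empty or invalid range\n");
}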

Deep Copy

tofu_tensor_clone() creates an independent copy:

tofu_tensor *original = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *copy = tofu_tensor_clone(original);

// Modifying copy doesn't affect original

tofu_tensor_free_data_too(copy);
tofu_tensor_free_data_too(original);

Use when: You need to preserve original data while modifying a copy


Accessing and Modifying Data

Reading Values

Use the TOFU_TENSOR_DATA_FROM() macro to safely read values:

tofu_tensor *t = tofu_tensor_arange(0.0, 5.0, 1.0, TOFU_FLOAT);

for (int i = 0; i < t->len; i++) {
    float value;
    TOFU_TENSOR_DATA_FROM(t, i, value, TOFU_FLOAT);
    printf("t[%d] = %.1f\n", i, value);
}

tofu_tensor_free_data_too(t);

Writing Values

Use TOFU_TENSOR_DATA_TO() macro to safely write values:

tofu_tensor *t = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);

for (int i = 0; i < t->len; i++) {
    float value = i * 0.5;
    TOFU_TENSOR_DATA_TO(t, i, value, TOFU_FLOAT);
}

tofu_tensor_print(t, "%.1f");  // [0.0, 0.5, 1.0, 1.5]
tofu_tensor_free_data_too(t);

Direct Pointer Access

For performance-critical code, access data directly:

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){100, 100}, TOFU_FLOAT);
float *data = (float*)t->data;

// Fast iteration
for (int i = 0; i < t->len; i++) {
    data[i] = i * 0.1;
}

tofu_tensor_free_data_too(t);

Warning: Ensure type safety - casting to wrong type causes undefined behavior.
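
One way to guard against a mismatched cast is to check the dtype field before taking the pointer (a sketch using the struct fields shown earlier):

if (t->dtype == TOFU_FLOAT) {
    float *data = (float*)t->data;   // Safe: element type is known to be float
    data[0] = 1.0f;
} else {
    printf("unexpected dtype - refusing to cast\n");
}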

Iterating Multi-Dimensional Tensors

For 2-D tensors (matrices):

tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
float *data = (float*)matrix->data;

for (int i = 0; i < matrix->dims[0]; i++) {      // Rows
    for (int j = 0; j < matrix->dims[1]; j++) {  // Columns
        int index = i * matrix->dims[1] + j;
        data[index] = i + j;
    }
}

tofu_tensor_free_data_too(matrix);

For 3-D tensors:

tofu_tensor *tensor = tofu_tensor_zeros(3, (int[]){2, 3, 4}, TOFU_FLOAT);
float *data = (float*)tensor->data;

for (int i = 0; i < tensor->dims[0]; i++) {
    for (int j = 0; j < tensor->dims[1]; j++) {
        for (int k = 0; k < tensor->dims[2]; k++) {
            int index = (i * tensor->dims[1] + j) * tensor->dims[2] + k;
            data[index] = i + j + k;
        }
    }
}

tofu_tensor_free_data_too(tensor);

Shape Operations

Reshape

tofu_tensor_reshape() changes tensor shape without copying data:

tofu_tensor *tofu_tensor_reshape(const tofu_tensor *src, int ndim,
                                 const int *dims);

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
// t shape: [12]

tofu_tensor *matrix = tofu_tensor_reshape(t, 2, (int[]){3, 4});
// matrix shape: [3, 4], shares data with t

tofu_tensor_print(matrix, "%.1f");
// [[0.0, 1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0, 7.0],
//  [8.0, 9.0, 10.0, 11.0]]

tofu_tensor_free(matrix);  // View only
tofu_tensor_free_data_too(t);  // Original with data

Important: Product of new dimensions must equal original size.
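
If the target shape is computed at runtime, you can verify the element count before reshaping (a sketch using the len field from the tensor struct):

tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);

int new_dims[] = {3, 4};
int new_len = new_dims[0] * new_dims[1];

if (new_len == t->len) {
    tofu_tensor *view = tofu_tensor_reshape(t, 2, new_dims);
    // ... use view ...
    tofu_tensor_free(view);        // View only
}

tofu_tensor_free_data_too(t);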

In-Place Reshape

tofu_tensor_reshape_src() reshapes a tensor in place:

void tofu_tensor_reshape_src(tofu_tensor *t, int ndim, const int *new_dims);

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){2, 3});

tofu_tensor_print(t, "%.1f");
// [[0.0, 1.0, 2.0],
//  [3.0, 4.0, 5.0]]

tofu_tensor_free_data_too(t);

Transpose

tofu_tensor_transpose() swaps dimensions:

tofu_tensor *tofu_tensor_transpose(const tofu_tensor *src, tofu_tensor *dst,
                                   const int *axes);

Passing NULL for axes reverses the dimension order. For 2-D tensors, this transposes rows and columns:

float data[] = {1, 2, 3,
                4, 5, 6};
tofu_tensor *A = tofu_tensor_create(data, 2, (int[]){2, 3}, TOFU_FLOAT);
// [[1, 2, 3],
//  [4, 5, 6]]

tofu_tensor *AT = tofu_tensor_transpose(A, NULL, NULL);
// [[1, 4],
//  [2, 5],
//  [3, 6]]

tofu_tensor_free_data_too(AT);
tofu_tensor_free(A);

Use cases: Matrix operations, batch dimensions, image transformations

Slice

tofu_tensor_slice() extracts a subtensor:

tofu_tensor *tofu_tensor_slice(const tofu_tensor *src, tofu_tensor *dst,
                               int axis, int start, int len);

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT);
// [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

tofu_tensor *slice = tofu_tensor_slice(t, NULL, 0, 2, 5);
// [2, 3, 4, 5, 6]

tofu_tensor_free_data_too(slice);
tofu_tensor_free_data_too(t);

For matrices, slice rows:

tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){10, 5}, TOFU_FLOAT);

tofu_tensor *rows = tofu_tensor_slice(matrix, NULL, 0, 2, 3);
// Extracts rows 2, 3, 4 → shape [3, 5]

tofu_tensor_free_data_too(rows);
tofu_tensor_free_data_too(matrix);

Concatenate

tofu_tensor_concat() joins tensors along an axis:

tofu_tensor *tofu_tensor_concat(const tofu_tensor *src1, const tofu_tensor *src2,
                                tofu_tensor *dst, int axis);

Example:

tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(a, 2, (int[]){1, 3});  // [[0, 1, 2]]

tofu_tensor *b = tofu_tensor_arange(3.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(b, 2, (int[]){1, 3});  // [[3, 4, 5]]

tofu_tensor *c = tofu_tensor_concat(a, b, NULL, 0);  // Concatenate rows
// [[0, 1, 2],
//  [3, 4, 5]]

tofu_tensor_free_data_too(c);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);

Mathematical Operations

Matrix Multiplication

tofu_tensor_matmul() performs matrix multiplication with broadcasting:

tofu_tensor *tofu_tensor_matmul(const tofu_tensor *src1, const tofu_tensor *src2,
                                tofu_tensor *dst);

Basic matrix multiplication:

tofu_tensor *A = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);  // 2×3
tofu_tensor *B = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);  // 3×4

tofu_tensor *C = tofu_tensor_matmul(A, B, NULL);  // 2×4
// C[i,j] = Σ(A[i,k] * B[k,j])

tofu_tensor_free_data_too(C);
tofu_tensor_free_data_too(B);
tofu_tensor_free_data_too(A);

Dimension rules:

  • For 1-D @ 1-D: src1->dims[0] must equal src2->dims[0] (dot product)
  • For 2-D and higher: src1->dims[src1->ndim-1] must equal src2->dims[src2->ndim-2]

Batch matrix multiplication:

tofu_tensor *batches = tofu_tensor_zeros(3, (int[]){10, 2, 3}, TOFU_FLOAT);
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);

tofu_tensor *results = tofu_tensor_matmul(batches, weights, NULL);
// Shape: [10, 2, 4] - broadcasts weights across batch

tofu_tensor_free_data_too(results);
tofu_tensor_free_data_too(weights);
tofu_tensor_free_data_too(batches);

Inner Product

tofu_tensor_inner() computes dot product (sum of element-wise products):

tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT);  // [0, 1, 2]
tofu_tensor *b = tofu_tensor_arange(1.0, 4.0, 1.0, TOFU_FLOAT);  // [1, 2, 3]

tofu_tensor *result = tofu_tensor_inner(a, b, NULL);
// result = 0*1 + 1*2 + 2*3 = 8

tofu_tensor_free_data_too(result);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);

Outer Product

tofu_tensor_outer() computes outer product:

tofu_tensor *a = tofu_tensor_arange(1.0, 4.0, 1.0, TOFU_FLOAT);  // [1, 2, 3]
tofu_tensor *b = tofu_tensor_arange(1.0, 3.0, 1.0, TOFU_FLOAT);  // [1, 2]

tofu_tensor *result = tofu_tensor_outer(a, b, NULL);
// [[1, 2],
//  [2, 4],
//  [3, 6]]

tofu_tensor_free_data_too(result);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);

Element-Wise Operations

tofu_tensor_elew() performs element-wise operations:

tofu_tensor *tofu_tensor_elew(const tofu_tensor *src1, const tofu_tensor *src2,
                              tofu_tensor *dst, tofu_elew_op op);

Supported operations:

Operation | Description    | Example
TOFU_SUM  | Addition       | a + b
TOFU_SUB  | Subtraction    | a - b
TOFU_MUL  | Multiplication | a * b
TOFU_DIV  | Division       | a / b
TOFU_MAX  | Maximum        | max(a, b)
TOFU_MIN  | Minimum        | min(a, b)

Example:

tofu_tensor *a = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT);  // [1, 2, 3, 4]
tofu_tensor *b = tofu_tensor_arange(2.0, 6.0, 1.0, TOFU_FLOAT);  // [2, 3, 4, 5]

tofu_tensor *sum = tofu_tensor_elew(a, b, NULL, TOFU_SUM);  // [3, 5, 7, 9]
tofu_tensor *prod = tofu_tensor_elew(a, b, NULL, TOFU_MUL);  // [2, 6, 12, 20]

tofu_tensor_free_data_too(prod);
tofu_tensor_free_data_too(sum);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);

Element-Wise with Scalar

tofu_tensor_elew_param() applies operation with a scalar:

tofu_tensor *tofu_tensor_elew_param(const tofu_tensor *src, double param,
                                    tofu_tensor *dst, tofu_elew_op op);

Example:

tofu_tensor *t = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT);  // [1, 2, 3, 4]

tofu_tensor *scaled = tofu_tensor_elew_param(t, 2.0, NULL, TOFU_MUL);  // [2, 4, 6, 8]
tofu_tensor *shifted = tofu_tensor_elew_param(t, 10.0, NULL, TOFU_SUM);  // [11, 12, 13, 14]

tofu_tensor_free_data_too(shifted);
tofu_tensor_free_data_too(scaled);
tofu_tensor_free_data_too(t);

Broadcasting

Broadcasting allows operations on tensors with different but compatible shapes:

Rules:

  1. Start from trailing dimensions
  2. Dimensions are compatible if they're equal or one is 1
  3. Missing dimensions are treated as 1

Examples:

// Shape [3, 4] + Shape [4] → broadcasts to [3, 4]
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT);  // [4]

tofu_tensor *result = tofu_tensor_elew_broadcast(matrix, bias, NULL, TOFU_SUM);
// bias is added to each row of matrix

tofu_tensor_free_data_too(result);
tofu_tensor_free_data_too(bias);
tofu_tensor_free_data_too(matrix);

Reduction Operations

Sum Reduction

tofu_tensor_sumreduce() sums along an axis:

tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});

tofu_tensor *col_sums = tofu_tensor_sumreduce(t, NULL, 0);  // Sum along axis 0 (column sums)
// Shape: [1, 4], values: [[12, 15, 18, 21]]

tofu_tensor *row_sums = tofu_tensor_sumreduce(t, NULL, 1);  // Sum along axis 1 (row sums)
// Shape: [3, 1], values: [[6], [22], [38]]

tofu_tensor_free_data_too(row_sums);
tofu_tensor_free_data_too(col_sums);
tofu_tensor_free_data_too(t);

Mean Reduction

tofu_tensor_meanreduce() computes mean along an axis:

tofu_tensor *data = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(data, 2, (int[]){3, 4});

tofu_tensor *row_means = tofu_tensor_meanreduce(data, NULL, 1);
// Shape: [3, 1], values: [[1.5], [5.5], [9.5]]

tofu_tensor_free_data_too(row_means);
tofu_tensor_free_data_too(data);

Max Reduction

tofu_tensor_maxreduce() finds maximum values and optionally their indices:

tofu_tensor *tofu_tensor_maxreduce(const tofu_tensor *src, tofu_tensor *dst,
                                   tofu_tensor *arg, int axis);

Example:

float data[] = {3, 1, 4, 1, 5, 9, 2, 6, 5};
tofu_tensor *t = tofu_tensor_create(data, 2, (int[]){3, 3}, TOFU_FLOAT);

tofu_tensor *indices = tofu_tensor_zeros(2, (int[]){3, 1}, TOFU_INT32);
tofu_tensor *max_vals = tofu_tensor_maxreduce(t, NULL, indices, 1);

// max_vals shape: [3, 1], values: [[4], [9], [6]]
// indices shape: [3, 1], values: [[2], [2], [1]]

tofu_tensor_free_data_too(max_vals);
tofu_tensor_free_data_too(indices);
tofu_tensor_free(t);

Activation Functions

Leaky ReLU

tofu_tensor *tofu_tensor_lrelu(const tofu_tensor *src, tofu_tensor *dst,
                               float negslope);

Example:

float data[] = {-2, -1, 0, 1, 2};
tofu_tensor *x = tofu_tensor_create(data, 1, (int[]){5}, TOFU_FLOAT);

tofu_tensor *relu = tofu_tensor_lrelu(x, NULL, 0.0f);  // Standard ReLU
// [0, 0, 0, 1, 2]

tofu_tensor *leaky = tofu_tensor_lrelu(x, NULL, 0.01f);  // Leaky ReLU
// [-0.02, -0.01, 0, 1, 2]

tofu_tensor_free_data_too(leaky);
tofu_tensor_free_data_too(relu);
tofu_tensor_free(x);

Softmax

tofu_tensor *tofu_tensor_softmax(const tofu_tensor *src, tofu_tensor *dst,
                                 int axis);

Example:

float logits[] = {1, 2, 3};
tofu_tensor *t = tofu_tensor_create(logits, 1, (int[]){3}, TOFU_FLOAT);

tofu_tensor *probs = tofu_tensor_softmax(t, NULL, 0);
// Approximately [0.09, 0.24, 0.67]

tofu_tensor_free_data_too(probs);
tofu_tensor_free(t);

Layer Normalization

tofu_tensor *tofu_tensor_layer_norm(const tofu_tensor *src, tofu_tensor *dst,
                                    const tofu_tensor *gamma, const tofu_tensor *beta,
                                    int axis, double eps);
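
A usage sketch, assuming the tensor-level function behaves like its graph counterpart described later (a NULL gamma/beta skips the scale/shift, and a NULL dst allocates the result):

float data[] = {1, 2, 3, 4};
tofu_tensor *x = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);

// Normalize along axis 0 with no scale/shift parameters
tofu_tensor *normed = tofu_tensor_layer_norm(x, NULL, NULL, NULL, 0, 1e-5);
// normed has approximately zero mean and unit variance

tofu_tensor_free_data_too(normed);
tofu_tensor_free(x);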

Utility Functions

Printing

void tofu_tensor_print(const tofu_tensor *t, const char *fmt);

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){2, 3});

tofu_tensor_print(t, "%.1f");
// [[0.0, 1.0, 2.0],
//  [3.0, 4.0, 5.0]]

tofu_tensor_free_data_too(t);

Size Queries

size_t size = tofu_tensor_size(t);  // Total elements
int same_shape = tofu_tensor_issameshape(t1, t2);
int broadcastable = tofu_tensor_isbroadcastable(t1, t2);

Type Conversion

tofu_tensor *ints = tofu_tensor_convert(floats, NULL, TOFU_INT32);
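
For example (a sketch following the same dst/dtype conventions as the other operations in this chapter):

tofu_tensor *floats = tofu_tensor_arange(0.0, 4.0, 1.0, TOFU_FLOAT);  // [0.0, 1.0, 2.0, 3.0]
tofu_tensor *ints = tofu_tensor_convert(floats, NULL, TOFU_INT32);    // [0, 1, 2, 3]

tofu_tensor_free_data_too(ints);
tofu_tensor_free_data_too(floats);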

Memory Management

Ownership Rules

Rule 1: Tensors created with tofu_tensor_create() don't own their data

float data[4] = {1, 2, 3, 4};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free(t);  // Only frees tensor structure
// data is still valid

Rule 2: Tensors created with tofu_tensor_zeros(), tofu_tensor_clone(), etc. own their data

tofu_tensor *t = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free_data_too(t);  // Frees both structure and data

Rule 3: View operations (reshape, transpose) share data

tofu_tensor *original = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *view = tofu_tensor_reshape(original, 2, (int[]){3, 4});

// view shares data with original
tofu_tensor_free(view);  // Free view only
tofu_tensor_free_data_too(original);  // Free data with original

Common Mistakes

Mistake 1: Using free_data_too on user-owned data

// WRONG
float data[4];
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free_data_too(t);  // Tries to free stack memory!

Mistake 2: Memory leak from not freeing library-owned data

// WRONG
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free(t);  // Leaks data buffer!

Mistake 3: Freeing view data

// WRONG
tofu_tensor *original = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *view = tofu_tensor_reshape(original, 2, (int[]){3, 4});
tofu_tensor_free_data_too(view);  // Frees shared data!
tofu_tensor_free_data_too(original);  // Double free!

Best Practices

Efficiency

  1. Reuse destination tensors to avoid allocations (see the sketch after this list)
  2. Use views when possible (reshape, not clone)
  3. Choose appropriate data types (INT32 for labels, FLOAT for weights)
  4. Batch operations for efficiency
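
A minimal sketch of the destination-reuse idea from item 1, assuming that passing an existing tensor as dst writes the result into it instead of allocating a new one:

tofu_tensor *a = tofu_tensor_arange(0.0, 4.0, 1.0, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT);
tofu_tensor *out = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);

for (int step = 0; step < 100; step++) {
    // Reuses out's buffer every iteration instead of allocating a new tensor
    tofu_tensor_elew(a, b, out, TOFU_SUM);
    // ... use out ...
}

tofu_tensor_free_data_too(out);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);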

Debugging

  1. Validate shapes before operations
  2. Print intermediate results with tofu_tensor_print()
  3. Check for NaN/Inf in numerical operations
  4. Track allocations to find memory leaks

Common Pitfalls

  • Don't modify views expecting independence
  • Don't use free_data_too on tensor_create() tensors
  • Verify broadcast compatibility before operations
  • Ensure consistent data types in operations

Next Steps

Now that you understand tensors, continue to the Computation Graphs chapter below.

For practical examples, see the tutorials section for complete neural network implementations.

Computation Graphs

Computation graphs are the foundation of automatic differentiation and neural network training in Tofu. This guide explains how to build, manage, and use computation graphs for deep learning applications.

Introduction

A computation graph is a directed acyclic graph (DAG) that represents mathematical operations and their dependencies. Each node in the graph represents either:

  • A leaf node: input data or trainable parameter
  • An operation node: a mathematical operation (matmul, add, relu, etc.)

Tofu uses dynamic computation graphs (define-by-run), meaning the graph structure is built on-the-fly as operations execute. This provides flexibility for control flow and conditional computation.

Key Concepts

Forward Pass: Computes values by flowing data through the graph from inputs to outputs. Each operation node stores its result in node->value.

Backward Pass: Computes gradients by flowing derivatives backward from the loss to parameters. Uses reverse-mode automatic differentiation (backpropagation). Each node stores its gradient in node->grad.

Requires Gradient: A flag indicating whether a node needs gradient computation. Parameters always require gradients, while inputs do not.

Topological Order: The graph automatically sorts nodes in reverse topological order for efficient backward pass execution.

A Simple Example

// Create graph
tofu_graph *g = tofu_graph_create();

// Add input and parameters
float x_data[] = {1.0f, 2.0f};
float w_data[] = {0.5f, -0.3f};

tofu_tensor *x_tensor = tofu_tensor_create(x_data, 1, (int[]){2}, TOFU_FLOAT);
tofu_tensor *w_tensor = tofu_tensor_create(w_data, 2, (int[]){2, 1}, TOFU_FLOAT);

tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, w_tensor);

// Compute: y = x @ W
tofu_graph_node *y = tofu_graph_matmul(g, x, W);

// Backward pass
tofu_graph_backward(g, y);

// Access gradient
tofu_tensor *W_grad = tofu_graph_get_grad(W);  // Contains dL/dW

// Cleanup
tofu_graph_free(g);
tofu_tensor_free(x_tensor);
tofu_tensor_free(w_tensor);

When to Use Graphs

Use computation graphs when you need:

  • Automatic gradient computation for training
  • Complex neural network architectures
  • Efficient backpropagation through multiple layers
  • Dynamic control flow in models

For simple tensor operations without gradients, you can use the Tensor API directly without graphs.
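
For instance, the y = x @ W product from the example above can be computed with the Tensor API alone when no gradients are needed (sketch):

float x_data[] = {1.0f, 2.0f};
float w_data[] = {0.5f, -0.3f};

tofu_tensor *x = tofu_tensor_create(x_data, 2, (int[]){1, 2}, TOFU_FLOAT);
tofu_tensor *W = tofu_tensor_create(w_data, 2, (int[]){2, 1}, TOFU_FLOAT);

tofu_tensor *y = tofu_tensor_matmul(x, W, NULL);  // No graph, no gradients

tofu_tensor_free_data_too(y);
tofu_tensor_free(W);
tofu_tensor_free(x);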


Graph Fundamentals

Directed Acyclic Graphs (DAGs)

Computation graphs must be acyclic to ensure well-defined forward and backward passes. Each operation creates a new node that depends on its input nodes, forming a directed edge from inputs to outputs.

// This creates a DAG:
//   x ---\
//          matmul --> y --> relu --> z ---\
//   W ---/                                 add --> out
//   b --------------------------------------/

tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);

tofu_graph_node *y = tofu_graph_matmul(g, x, W);
tofu_graph_node *z = tofu_graph_relu(g, y);
tofu_graph_node *out = tofu_graph_add(g, z, b);

The graph automatically tracks dependencies. When you call tofu_graph_backward(g, out), it computes gradients for all parameters by traversing edges backward.

Nodes and Their Roles

Leaf Nodes are the sources of data in the graph:

  • TOFU_OP_INPUT: Non-trainable data (features, targets). Does not receive gradients.
  • TOFU_OP_PARAM: Trainable parameters (weights, biases). Receives and accumulates gradients.

Operation Nodes perform computations:

  • Binary operations: matmul, add, mul
  • Activations: relu, softmax, layer_norm
  • Shape operations: reshape, transpose
  • Reductions: mean, sum
  • Loss functions: mse_loss, ce_loss

Each operation node stores:

  • value: Result of forward computation
  • grad: Gradient from backward pass
  • inputs: Pointer to input nodes
  • backward_fn: Function to compute gradients for inputs
  • backward_ctx: Saved tensors needed for backward (e.g., input values for ReLU)

Forward Pass Execution

The forward pass computes outputs by executing operations in order:

tofu_graph *g = tofu_graph_create();

// Create computation: y = relu(x @ W + b)
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);

// Forward pass happens automatically as operations are added
tofu_graph_node *xW = tofu_graph_matmul(g, x, W);  // Computes x @ W immediately
tofu_graph_node *xWb = tofu_graph_add(g, xW, b);   // Computes xW + b immediately
tofu_graph_node *y = tofu_graph_relu(g, xWb);      // Computes relu(xWb) immediately

// At this point, y->value contains the final result
tofu_tensor *result = tofu_graph_get_value(y);

Each operation executes immediately when called, computing and storing the result in the new node's value field.

Backward Pass Execution

The backward pass computes gradients using the chain rule:

// Continuing from above...
tofu_graph_backward(g, y);

// Now all parameter nodes have gradients:
tofu_tensor *W_grad = tofu_graph_get_grad(W);  // dL/dW
tofu_tensor *b_grad = tofu_graph_get_grad(b);  // dL/db

The backward pass:

  1. Initializes loss->grad = 1.0 (derivative of loss w.r.t. itself)
  2. Sorts nodes in reverse topological order
  3. For each node (from loss back to inputs):
    • Calls its backward_fn to compute input gradients
    • Accumulates gradients for nodes that appear multiple times

Important: Gradients accumulate across backward passes. Always call tofu_graph_zero_grad() before each training iteration unless you explicitly want gradient accumulation.

Gradient Flow and Requires Grad

The requires_grad flag determines whether a node needs gradient computation:

tofu_graph_node *x = tofu_graph_input(g, x_tensor);     // requires_grad = 0
tofu_graph_node *W = tofu_graph_param(g, W_tensor);     // requires_grad = 1

tofu_graph_node *y = tofu_graph_matmul(g, x, W);        // requires_grad = 1 (because W does)
tofu_graph_node *z = tofu_graph_add(g, y, const_node);  // requires_grad = 1 (because y does)

An operation node requires gradients if ANY of its inputs require gradients. This propagates through the graph automatically.

Gradients only flow to nodes with requires_grad = 1:

  • PARAM nodes always receive gradients (these are your trainable weights)
  • INPUT nodes never receive gradients (they're just data)
  • Operation nodes receive gradients if they're on a path from a parameter to the loss

Creating and Managing Graphs

Creating a Graph

Use tofu_graph_create() to create a new empty graph:

tofu_graph *g = tofu_graph_create();

// Graph starts empty
// num_nodes = 0
// next_id = 0

// ... build your graph ...

The graph allocates memory dynamically and grows as you add nodes. It manages all nodes internally and frees them when tofu_graph_free() is called.

Freeing a Graph

Use tofu_graph_free() to clean up a graph and all its nodes:

tofu_graph *g = tofu_graph_create();

// Build graph...
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);

// Free graph (but NOT tensors!)
tofu_graph_free(g);

// You must still free tensors separately
tofu_tensor_free(x_tensor);
tofu_tensor_free(W_tensor);

Critical Memory Management Rule: The graph does NOT take ownership of tensors passed to tofu_graph_input() or tofu_graph_param(). You must:

  1. Free the graph first: tofu_graph_free(g)
  2. Then free tensors: tofu_tensor_free(tensor)

The graph DOES own and free:

  • All graph nodes
  • All gradients (node->grad)
  • All intermediate operation results (e.g., the result of matmul, add, etc.)

Correct Cleanup Order

// Create tensors
tofu_tensor *x_tensor = tofu_tensor_zeros(2, (int[]){1, 4}, TOFU_FLOAT);
tofu_tensor *W_tensor = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);

// Build graph
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);

// Training loop...

// CORRECT CLEANUP ORDER:
// 1. Free optimizer (if used)
tofu_optimizer_free(opt);

// 2. Free graph
tofu_graph_free(g);

// 3. Free tensors
tofu_tensor_free_data_too(x_tensor);
tofu_tensor_free_data_too(W_tensor);

Clearing Operations

Use tofu_graph_clear_ops() to remove operation nodes while keeping parameters:

tofu_graph *g = tofu_graph_create();

// Add parameters (persist across iterations)
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Build forward graph for this batch
    tofu_graph_node *x = tofu_graph_input(g, batch_data);
    tofu_graph_node *y = tofu_graph_matmul(g, x, W);
    tofu_graph_node *out = tofu_graph_add(g, y, b);

    // Backward pass and optimization...
    tofu_graph_backward(g, loss);
    tofu_optimizer_step(opt);

    // Clear operations (W and b are preserved!)
    tofu_graph_clear_ops(g);
}

This is essential for training loops to prevent node accumulation and memory growth. After clear_ops():

  • All operation nodes are freed
  • PARAM and INPUT nodes remain in the graph
  • Parameter values and gradients are preserved
  • The graph is ready for the next forward pass

When to Use Clear Ops:

  • Between training iterations in a loop
  • When reusing the same graph with different input data
  • To prevent unbounded memory growth during training

When NOT to Clear Ops:

  • If you need to access intermediate operation results after backward
  • During a single forward/backward pass
  • Before calling the optimizer (gradients would be lost!)

Adding Leaf Nodes

Leaf nodes are the starting points of computation. They represent either input data or trainable parameters.

Input Nodes

Input nodes represent non-trainable data like features or labels:

float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *x_tensor = tofu_tensor_create(data, 2, (int[]){1, 4}, TOFU_FLOAT);

tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, x_tensor);

// Input nodes do NOT compute gradients
// x->requires_grad == 0
// x->op == TOFU_OP_INPUT

Characteristics of input nodes:

  • requires_grad = 0 (no gradient computation)
  • Used for features, labels, or other non-trainable data
  • Graph does NOT own the tensor (caller must free it)
  • Typically created fresh for each training iteration

Parameter Nodes

Parameter nodes represent trainable weights or biases:

float weights[] = {0.5f, -0.3f, 0.2f, 0.1f};
tofu_tensor *W_tensor = tofu_tensor_create(weights, 2, (int[]){2, 2}, TOFU_FLOAT);

tofu_graph *g = tofu_graph_create();
tofu_graph_node *W = tofu_graph_param(g, W_tensor);

// Parameter nodes DO compute gradients
// W->requires_grad == 1
// W->op == TOFU_OP_PARAM

Characteristics of parameter nodes:

  • requires_grad = 1 (gradient computation enabled)
  • Used for weights, biases, or other learnable parameters
  • Graph does NOT own the tensor (caller must free it)
  • Typically created once and reused across iterations
  • Preserved by tofu_graph_clear_ops()

Ownership Rules

Critical: The graph does NOT take ownership of tensors passed to tofu_graph_input() or tofu_graph_param().

// Create tensor (you own this)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);

// Add to graph (graph does NOT take ownership)
tofu_graph_node *W_node = tofu_graph_param(g, W);

// Later: free graph first, then tensor
tofu_graph_free(g);          // Frees the node, but NOT the tensor
tofu_tensor_free_data_too(W); // You must free the tensor

Why this design?

  • Parameters persist across multiple training iterations
  • You may need to save/load parameters independently
  • Gives you full control over memory management

Typical Usage Pattern

// Setup phase (once)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

tofu_graph *g = tofu_graph_create();
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);

// Training loop (many iterations)
for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Create fresh input for each iteration
    float *batch_data = load_batch(epoch);
    tofu_tensor *x_tensor = tofu_tensor_create(batch_data, 2, (int[]){32, 784}, TOFU_FLOAT);

    // Add input node
    tofu_graph_node *x = tofu_graph_input(g, x_tensor);

    // Build forward graph using parameters
    tofu_graph_node *logits = tofu_graph_matmul(g, x, W_node);
    tofu_graph_node *out = tofu_graph_add(g, logits, b_node);

    // Training step...
    tofu_graph_backward(g, loss);
    tofu_optimizer_step(opt);

    // Clean up this iteration's input
    tofu_tensor_free(x_tensor);
    free(batch_data);

    // Clear operations (W_node and b_node are preserved)
    tofu_graph_clear_ops(g);
}

// Cleanup (once at end)
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);

Common Pitfalls

Pitfall 1: Freeing tensors before graph

// WRONG - this will crash!
tofu_tensor_free(W_tensor);  // Frees tensor
tofu_graph_free(g);          // Graph still references freed memory

// CORRECT
tofu_graph_free(g);          // Free graph first
tofu_tensor_free(W_tensor);  // Then free tensor

Pitfall 2: Not freeing tensors

// WRONG - memory leak!
tofu_graph_free(g);
// Forgot to free tensors!

// CORRECT
tofu_graph_free(g);
tofu_tensor_free_data_too(W_tensor);
tofu_tensor_free_data_too(b_tensor);

Pitfall 3: Clearing ops without re-adding params

// WRONG - params are gone after clear_ops if added in loop
for (int i = 0; i < 100; i++) {
    tofu_graph_node *W = tofu_graph_param(g, W_tensor);  // Re-adding param each time
    // ... training ...
    tofu_graph_clear_ops(g);  // Removes W!
}

// CORRECT - add params once before loop
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
for (int i = 0; i < 100; i++) {
    // ... use W ...
    tofu_graph_clear_ops(g);  // W is preserved
}

Mathematical Operations

Mathematical operations create new nodes that compute values during the forward pass and propagate gradients during the backward pass.

Matrix Multiplication

Matrix multiplication is the workhorse of neural networks:

// y = x @ W
// x: [batch, in_features]
// W: [in_features, out_features]
// y: [batch, out_features]

tofu_graph_node *x = tofu_graph_input(g, x_tensor);    // [32, 784]
tofu_graph_node *W = tofu_graph_param(g, W_tensor);    // [784, 10]
tofu_graph_node *y = tofu_graph_matmul(g, x, W);       // [32, 10]

The operation:

  • Computes standard matrix multiplication with broadcasting
  • Supports batched operations (3D, 4D tensors)
  • Implements backward pass: dL/dx = dL/dy @ W^T and dL/dW = x^T @ dL/dy

Precondition: Inner dimensions must match: a->dims[last] == b->dims[second-to-last]
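
A quick shape check before building the node (a sketch that reads the dims/ndim fields documented in the Tensor chapter):

int inner_a = x_tensor->dims[x_tensor->ndim - 1];
int inner_b = W_tensor->dims[W_tensor->ndim - 2];

if (inner_a == inner_b) {
    tofu_graph_node *y = tofu_graph_matmul(g, x, W);
    // ... continue building the graph ...
} else {
    printf("matmul shape mismatch: %d vs %d\n", inner_a, inner_b);
}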

Element-wise Addition

Addition is commonly used for adding biases:

// out = x + b
// x: [batch, features]
// b: [features]
// out: [batch, features]

tofu_graph_node *x = tofu_graph_matmul(g, input, W);   // [32, 10]
tofu_graph_node *b = tofu_graph_param(g, b_tensor);    // [10]
tofu_graph_node *out = tofu_graph_add(g, x, b);        // [32, 10]

The operation:

  • Performs element-wise addition with broadcasting
  • Follows NumPy broadcasting rules
  • Implements backward pass: gradients flow to both inputs

Broadcasting example:

// Broadcasting [2, 3] + [3] -> [2, 3]
float a_data[] = {1, 2, 3, 4, 5, 6};
float b_data[] = {10, 20, 30};

tofu_tensor *a = tofu_tensor_create(a_data, 2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_create(b_data, 1, (int[]){3}, TOFU_FLOAT);

tofu_graph_node *a_node = tofu_graph_input(g, a);
tofu_graph_node *b_node = tofu_graph_input(g, b);
tofu_graph_node *c = tofu_graph_add(g, a_node, b_node);

// Result: [[11, 22, 33], [14, 25, 36]]

Element-wise Multiplication

Multiplication is useful for attention mechanisms and gating:

// out = x * y
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *y = tofu_graph_input(g, y_tensor);
tofu_graph_node *out = tofu_graph_mul(g, x, y);

The operation:

  • Performs element-wise multiplication with broadcasting
  • Implements backward pass: dL/dx = dL/dout * y and dL/dy = dL/dout * x

Example: Attention scaling

// Attention: scale * (Q @ K^T)
tofu_graph_node *qk = tofu_graph_matmul(g, Q, K_T);
tofu_graph_node *scale_tensor = tofu_graph_param(g, scale);
tofu_graph_node *scaled = tofu_graph_mul(g, qk, scale_tensor);

Chaining Operations

Operations can be chained to build complex expressions:

// Build: y = ReLU(x @ W + b)
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);

// Chain operations
tofu_graph_node *xW = tofu_graph_matmul(g, x, W);     // Linear transformation
tofu_graph_node *xWb = tofu_graph_add(g, xW, b);      // Add bias
tofu_graph_node *y = tofu_graph_relu(g, xWb);         // Apply activation

// Each intermediate result is stored in the node's value
tofu_tensor *xW_value = tofu_graph_get_value(xW);     // Can inspect intermediates

Multi-Layer Networks

Chaining creates deeper networks:

// Two-layer network: h = ReLU(x @ W1 + b1), out = h @ W2 + b2
tofu_graph_node *x = tofu_graph_input(g, x_tensor);

// Layer 1: [batch, 784] -> [batch, 128]
tofu_graph_node *W1 = tofu_graph_param(g, W1_tensor);
tofu_graph_node *b1 = tofu_graph_param(g, b1_tensor);
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
tofu_graph_node *h1b = tofu_graph_add(g, h1, b1);
tofu_graph_node *h1a = tofu_graph_relu(g, h1b);

// Layer 2: [batch, 128] -> [batch, 10]
tofu_graph_node *W2 = tofu_graph_param(g, W2_tensor);
tofu_graph_node *b2 = tofu_graph_param(g, b2_tensor);
tofu_graph_node *h2 = tofu_graph_matmul(g, h1a, W2);
tofu_graph_node *out = tofu_graph_add(g, h2, b2);

The backward pass automatically computes gradients for all parameters (W1, b1, W2, b2) using the chain rule.

Operation Results

Every operation stores its result immediately:

tofu_graph_node *y = tofu_graph_matmul(g, x, W);

// Result is available immediately
tofu_tensor *result = tofu_graph_get_value(y);
tofu_tensor_print(result, "%.2f");

// The tensor is owned by the node - don't free it!
// It will be freed when you call tofu_graph_free(g)

Activation Functions

Activation functions introduce non-linearity, enabling neural networks to learn complex patterns.

ReLU (Rectified Linear Unit)

ReLU is the most common activation function:

// y = max(0, x)
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *y = tofu_graph_relu(g, x);

// Values: [-2, -1, 0, 1, 2] -> [0, 0, 0, 1, 2]

Properties:

  • Simple and efficient: y = (x > 0) ? x : 0
  • Gradient: 1 where x > 0, else 0
  • Helps avoid vanishing gradients in deep networks
  • Creates sparse activations (many zeros)

Usage pattern in networks:

// Hidden layer with ReLU
tofu_graph_node *h = tofu_graph_matmul(g, x, W);
tofu_graph_node *h_bias = tofu_graph_add(g, h, b);
tofu_graph_node *h_act = tofu_graph_relu(g, h_bias);  // Apply ReLU after bias

Softmax

Softmax converts logits to probabilities for classification:

// Apply softmax along axis 1 (last dimension)
// Input:  [[1, 2, 3], [4, 5, 6]]  (logits)
// Output: [[0.09, 0.24, 0.67], [0.09, 0.24, 0.67]]  (probabilities)

tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

Properties:

  • Outputs sum to 1.0 along the specified axis
  • Numerically stable (subtracts max before exp)
  • Used in the final layer for classification
  • Axis parameter specifies normalization dimension

Formula: softmax(x_i) = exp(x_i - max(x)) / sum(exp(x_j - max(x)))

Multi-class classification example:

// 10-class classifier
tofu_graph_node *x = tofu_graph_input(g, x_tensor);      // [batch, features]
tofu_graph_node *W = tofu_graph_param(g, W_tensor);      // [features, 10]
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);    // [batch, 10]
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);  // [batch, 10]

// probs now contains class probabilities for each sample

Layer Normalization

Layer normalization stabilizes training in deep networks:

// Normalize along axis 1
// out = gamma * (x - mean) / sqrt(var + eps) + beta

tofu_graph_node *x = tofu_graph_input(g, x_tensor);      // [batch, features]
tofu_graph_node *gamma = tofu_graph_param(g, gamma_tensor);  // [features]
tofu_graph_node *beta = tofu_graph_param(g, beta_tensor);    // [features]

tofu_graph_node *normalized = tofu_graph_layer_norm(g, x, gamma, beta, 1, 1e-5);

Parameters:

  • x: Input tensor
  • gamma: Scale parameter (can be NULL for no scaling)
  • beta: Shift parameter (can be NULL for no shift)
  • axis: Normalization axis (typically last dimension)
  • eps: Small constant for numerical stability (typically 1e-5)

Properties:

  • Normalizes activations to zero mean and unit variance
  • Helps stabilize training and enables higher learning rates
  • Common in transformers and deep networks

Typical usage in transformers:

// Layer norm after self-attention
tofu_graph_node *attn_out = tofu_graph_matmul(g, attn_weights, V);
tofu_graph_node *normed = tofu_graph_layer_norm(g, attn_out, gamma, beta, 1, 1e-5);

Combining Activations

Different activations serve different purposes:

// Multi-layer network with different activations
tofu_graph_node *x = tofu_graph_input(g, x_tensor);

// Hidden layer 1: ReLU for non-linearity
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
tofu_graph_node *h1_bias = tofu_graph_add(g, h1, b1);
tofu_graph_node *h1_act = tofu_graph_relu(g, h1_bias);

// Hidden layer 2: ReLU + Layer Norm
tofu_graph_node *h2 = tofu_graph_matmul(g, h1_act, W2);
tofu_graph_node *h2_bias = tofu_graph_add(g, h2, b2);
tofu_graph_node *h2_act = tofu_graph_relu(g, h2_bias);
tofu_graph_node *h2_norm = tofu_graph_layer_norm(g, h2_act, gamma, beta, 1, 1e-5);

// Output layer: Softmax for classification
tofu_graph_node *logits = tofu_graph_matmul(g, h2_norm, W_out);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

Shape Operations

Shape operations manipulate tensor dimensions without changing the underlying data.

Reshape

Reshape changes tensor dimensions while preserving total elements:

// Flatten: [batch, height, width, channels] -> [batch, height * width * channels]
int batch = 32;
int h = 28, w = 28, c = 1;
int flat_dim = h * w * c;  // 784

tofu_graph_node *img = tofu_graph_input(g, img_tensor);  // [32, 28, 28, 1]
tofu_graph_node *flat = tofu_graph_reshape(g, img, 2, (int[]){batch, flat_dim});  // [32, 784]

Properties:

  • View operation (no data copy)
  • Total number of elements must remain constant
  • Useful for transitioning between convolutional and fully-connected layers

Common patterns:

// Flatten for fully-connected layer
tofu_graph_node *flat = tofu_graph_reshape(g, x, 2, (int[]){batch, -1});

// Unflatten for visualization
tofu_graph_node *img = tofu_graph_reshape(g, flat, 4, (int[]){batch, 28, 28, 1});

// Prepare patches for Vision Transformer
tofu_graph_node *patches = tofu_graph_reshape(g, img, 3, (int[]){batch, num_patches, patch_dim});

Transpose

Transpose permutes tensor dimensions:

// Transpose matrix: [m, n] -> [n, m]
tofu_graph_node *W = tofu_graph_param(g, W_tensor);      // [784, 10]
tofu_graph_node *W_T = tofu_graph_transpose(g, W, NULL);  // [10, 784]

// NULL means reverse dimension order

With explicit axis permutation:

// Permute: [batch, seq, features] -> [batch, features, seq]
int axes[] = {0, 2, 1};
tofu_graph_node *x = tofu_graph_input(g, x_tensor);       // [32, 100, 64]
tofu_graph_node *x_T = tofu_graph_transpose(g, x, axes);  // [32, 64, 100]

Common usage in attention:

// Attention: Q @ K^T
tofu_graph_node *Q = tofu_graph_matmul(g, x, W_q);      // [batch, seq, dim]
tofu_graph_node *K = tofu_graph_matmul(g, x, W_k);      // [batch, seq, dim]
tofu_graph_node *K_T = tofu_graph_transpose(g, K, NULL);  // [batch, dim, seq]
tofu_graph_node *scores = tofu_graph_matmul(g, Q, K_T);   // [batch, seq, seq]

Mean Reduction

Compute mean along specified axes (coming soon - API being finalized).

Sum Reduction

Compute sum along specified axes (coming soon - API being finalized).

Combining Shape Operations

Shape operations often work together:

// Vision Transformer patch embedding
// Input: [batch, height, width, channels]
// Output: [batch, num_patches, embed_dim]

tofu_graph_node *img = tofu_graph_input(g, img_tensor);  // [32, 224, 224, 3]

// Step 1: Reshape to patches
int patch_size = 16;
int num_patches = (224 / patch_size) * (224 / patch_size);  // 196
int patch_dim = patch_size * patch_size * 3;  // 768

tofu_graph_node *patches = tofu_graph_reshape(g, img, 3,
    (int[]){32, num_patches, patch_dim});  // [32, 196, 768]

// Step 2: Project patches to embedding dimension
tofu_graph_node *W_proj = tofu_graph_param(g, W_proj_tensor);  // [768, 512]
tofu_graph_node *embeddings = tofu_graph_matmul(g, patches, W_proj);  // [32, 196, 512]


Loss Functions

Loss functions quantify how well your model performs by comparing predictions against ground truth. Tofu provides two essential loss functions optimized for different tasks.

Mean Squared Error (MSE)

MSE measures the average squared difference between predictions and targets. Use it for regression tasks where you predict continuous values.

tofu_graph_node* tofu_graph_mse_loss(tofu_graph* g,
                                      tofu_graph_node* pred,
                                      tofu_graph_node* target);

Mathematical definition:

MSE = mean((pred - target)²)
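
For example, with pred = [1.0, 3.0] and target = [0.0, 1.0]:

MSE = ((1.0 - 0.0)² + (3.0 - 1.0)²) / 2 = (1 + 4) / 2 = 2.5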

When to use MSE:

  • Regression problems (predicting house prices, temperatures, etc.)
  • When output values are continuous and unbounded
  • When you want to penalize larger errors more heavily (squared term)

Example: Linear regression

// Predict continuous values
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* b = tofu_graph_param(g, bias);

tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
tofu_graph_node* target = tofu_graph_input(g, target_tensor);

// MSE loss for regression
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
tofu_graph_backward(g, loss);

Key properties:

  • Output is a scalar (single value)
  • Gradients scale linearly with error magnitude
  • Sensitive to outliers due to squaring
  • Always non-negative

Cross-Entropy Loss

Cross-entropy measures the difference between predicted and true probability distributions. Use it for classification tasks.

tofu_graph_node* tofu_graph_ce_loss(tofu_graph* g,
                                     tofu_graph_node* pred,
                                     tofu_graph_node* target);

Mathematical definition:

CE = -sum(target * log(pred))
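
For example, with a one-hot target [0, 1, 0] and predicted probabilities [0.2, 0.7, 0.1]:

CE = -(0*log(0.2) + 1*log(0.7) + 0*log(0.1)) = -log(0.7) ≈ 0.357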

When to use cross-entropy:

  • Classification problems (image recognition, sentiment analysis)
  • When outputs represent class probabilities
  • Multi-class or binary classification tasks

Example: Multi-class classification

// Predict class probabilities
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* b = tofu_graph_param(g, bias);

// Forward pass: logits -> softmax -> probabilities
tofu_graph_node* logits = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1);  // axis=1 for batch

// Target should be one-hot encoded: [0, 1, 0, 0] for class 1
tofu_graph_node* target = tofu_graph_input(g, target_one_hot);

// Cross-entropy loss
tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, target);
tofu_graph_backward(g, loss);

Target format: Targets must be one-hot encoded vectors:

// For batch_size=2, num_classes=4
// If sample 0 is class 2 and sample 1 is class 0:
float target_data[] = {
    0.0f, 0.0f, 1.0f, 0.0f,  // Sample 0: class 2
    1.0f, 0.0f, 0.0f, 0.0f   // Sample 1: class 0
};
tofu_tensor* target = tofu_tensor_create(target_data, 2,
                                          (int[]){2, 4}, TOFU_FLOAT);

Key properties:

  • Numerically stable implementation (avoids log(0))
  • Works well with softmax activation
  • Penalizes confident wrong predictions heavily
  • Output is a scalar (averaged over batch)

Loss Function Comparison

Property            | MSE Loss          | Cross-Entropy Loss
Use case            | Regression        | Classification
Output type         | Continuous values | Probabilities (0-1)
Activation          | Linear or ReLU    | Softmax
Gradient behavior   | Linear with error | Exponential confidence penalty
Outlier sensitivity | High (squared)    | Moderate (logarithmic)

Forward Pass

The forward pass computes outputs by propagating data through your graph from inputs to loss. Results are automatically stored in each node.

Accessing Results with get_value

After building your graph, each node contains its computed result in the value field. Access it using:

tofu_tensor* tofu_graph_get_value(tofu_graph_node* node);

Important: The returned tensor is owned by the node. Never free it yourself.

Example: Inspecting intermediate activations

// Build network
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_node* W1 = tofu_graph_param(g, weights1);
tofu_graph_node* h = tofu_graph_relu(g, tofu_graph_matmul(g, x, W1));

// Access hidden layer activations
tofu_tensor* hidden_values = tofu_graph_get_value(h);
printf("Hidden layer statistics:\n");
tofu_tensor_print(hidden_values, "%.4f");

Typical Forward Pass Pattern

// 1. Create inputs
tofu_graph_node* x = tofu_graph_input(g, input_data);
tofu_graph_node* target = tofu_graph_input(g, target_data);

// 2. Add parameters
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* b = tofu_graph_param(g, bias);

// 3. Build computation graph
tofu_graph_node* logits = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1);

// 4. Compute loss
tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, target);

// 5. Access results
tofu_tensor* loss_value = tofu_graph_get_value(loss);
tofu_tensor* predictions = tofu_graph_get_value(probs);

Reading Loss Values

Loss is typically a scalar tensor (single element):

tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
tofu_tensor* loss_tensor = tofu_graph_get_value(loss);

// Extract scalar value
float loss_value;
TOFU_TENSOR_DATA_FROM(loss_tensor, 0, loss_value, TOFU_FLOAT);
printf("Training loss: %.6f\n", loss_value);

Making Predictions

For inference, forward pass without computing loss:

// Inference mode (no target needed)
tofu_graph_node* x = tofu_graph_input(g, test_input);
tofu_graph_node* W = tofu_graph_param(g, trained_weights);
tofu_graph_node* b = tofu_graph_param(g, trained_bias);

tofu_graph_node* pred = tofu_graph_softmax(g,
    tofu_graph_add(g, tofu_graph_matmul(g, x, W), b), 1);

// Get predictions
tofu_tensor* predictions = tofu_graph_get_value(pred);

// Find class with highest probability
int pred_class = 0;
float max_prob = -1.0f;
for (int i = 0; i < num_classes; i++) {
    float prob;
    TOFU_TENSOR_DATA_FROM(predictions, i, prob, TOFU_FLOAT);
    if (prob > max_prob) {
        max_prob = prob;
        pred_class = i;
    }
}

Backward Pass

The backward pass computes gradients using reverse-mode automatic differentiation (backpropagation). This enables training via gradient descent.

Understanding Backpropagation

When you call tofu_graph_backward(), the graph computes how changes to each parameter affect the loss using the chain rule:

∂loss/∂W = ∂loss/∂output × ∂output/∂W

The algorithm:

  1. Starts from the loss node (scalar)
  2. Propagates gradients backward through operations
  3. Accumulates gradients at parameter nodes
  4. Stores results in node->grad

Calling Backward

void tofu_graph_backward(tofu_graph* g, tofu_graph_node* loss);

Requirements:

  • loss must be a scalar (single element tensor)
  • Call after forward pass completes
  • Gradients accumulate with each call

Example: Training iteration

// Forward pass
tofu_graph_node* x = tofu_graph_input(g, batch_data);
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* pred = tofu_graph_matmul(g, x, W);
tofu_graph_node* target = tofu_graph_input(g, batch_targets);
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);

// Backward pass - computes all gradients
tofu_graph_backward(g, loss);

// Now W->grad contains ∂loss/∂W

Accessing Gradients with get_grad

After backward pass, retrieve gradients from parameter nodes:

tofu_tensor* tofu_graph_get_grad(tofu_graph_node* node);

Returns: Pointer to gradient tensor, or NULL if backward hasn't been called yet.

Important: The returned tensor is owned by the node. Never free it yourself.

Example: Manual parameter update

tofu_graph_backward(g, loss);

// Get gradient
tofu_tensor* W_grad = tofu_graph_get_grad(W);
tofu_tensor* W_value = tofu_graph_get_value(W);

// Manual SGD update: W = W - learning_rate * grad
float lr = 0.01f;
for (int i = 0; i < W_value->len; i++) {
    float w, grad;
    TOFU_TENSOR_DATA_FROM(W_value, i, w, TOFU_FLOAT);
    TOFU_TENSOR_DATA_FROM(W_grad, i, grad, TOFU_FLOAT);

    float updated = w - lr * grad;
    TOFU_TENSOR_DATA_TO(W_value, i, updated, TOFU_FLOAT);
}

Zeroing Gradients with zero_grad

Gradients accumulate by default. Always zero them before each training iteration:

void tofu_graph_zero_grad(tofu_graph* g);

Why this matters:

// WRONG: Gradients accumulate forever
for (int epoch = 0; epoch < 100; epoch++) {
    tofu_graph_backward(g, loss);  // Adds to existing gradients!
    tofu_optimizer_step(opt);
}

// CORRECT: Clear gradients each iteration
for (int epoch = 0; epoch < 100; epoch++) {
    tofu_graph_zero_grad(g);       // Start fresh
    tofu_graph_backward(g, loss);  // Compute gradients
    tofu_optimizer_step(opt);      // Update parameters
}

Gradient Accumulation (Advanced)

Sometimes you intentionally want gradients to accumulate across multiple batches:

// Simulate larger batch by accumulating gradients
int accumulation_steps = 4;

for (int step = 0; step < accumulation_steps; step++) {
    // Forward pass on mini-batch
    tofu_graph_node* loss = compute_loss(g, mini_batches[step]);

    // Accumulate gradients (don't zero between mini-batches)
    tofu_graph_backward(g, loss);

    if (step < accumulation_steps - 1) {
        tofu_graph_clear_ops(g);  // Clear graph but keep gradients
    }
}

// Update once with accumulated gradients
tofu_optimizer_step(opt);

// Now zero for next iteration
tofu_graph_zero_grad(g);

Complete Backward Pass Example

// Training loop with proper gradient handling
for (int epoch = 0; epoch < num_epochs; epoch++) {
    // 1. Zero gradients from previous iteration
    tofu_graph_zero_grad(g);

    // 2. Forward pass
    tofu_graph_node* x = tofu_graph_input(g, train_data);
    tofu_graph_node* W = tofu_graph_param(g, weights);
    tofu_graph_node* pred = tofu_graph_matmul(g, x, W);
    tofu_graph_node* target = tofu_graph_input(g, train_targets);
    tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);

    // 3. Backward pass
    tofu_graph_backward(g, loss);

    // 4. Check gradients (debugging)
    tofu_tensor* W_grad = tofu_graph_get_grad(W);
    if (W_grad) {
        float grad_norm = 0.0f;
        for (int i = 0; i < W_grad->len; i++) {
            float gi;  // Avoid shadowing the graph variable g
            TOFU_TENSOR_DATA_FROM(W_grad, i, gi, TOFU_FLOAT);
            grad_norm += gi * gi;
        }
        printf("Gradient norm: %.6f\n", sqrtf(grad_norm));
    }

    // 5. Update parameters (use optimizer in practice)
    tofu_optimizer_step(optimizer);

    // 6. Clear operations for next iteration
    tofu_graph_clear_ops(g);
}

Debugging Gradients

Common issues and solutions:

Vanishing gradients (gradients near zero):

tofu_tensor* grad = tofu_graph_get_grad(W);
float max_grad = 0.0f;
for (int i = 0; i < grad->len; i++) {
    float gi;
    TOFU_TENSOR_DATA_FROM(grad, i, gi, TOFU_FLOAT);
    if (fabsf(gi) > max_grad) max_grad = fabsf(gi);
}
if (max_grad < 1e-7f) {
    printf("WARNING: Vanishing gradients detected\n");
}

Exploding gradients (gradients too large):

if (max_grad > 100.0f) {
    printf("WARNING: Exploding gradients detected\n");
    // Consider gradient clipping or reducing learning rate
}

Memory and Ownership

Understanding memory management is critical for correct and leak-free code.

Ownership Rules (CRITICAL)

Rule 1: Graph does NOT own input/parameter tensors

tofu_tensor* W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);

// You still own W! Must free it after graph_free
tofu_graph_free(g);
tofu_tensor_free_data_too(W);  // Your responsibility

Rule 2: Graph OWNS intermediate operation results

tofu_graph_node* result = tofu_graph_matmul(g, x, W);
// result->value is owned by the graph
// Don't free it - tofu_graph_free() handles it

tofu_graph_free(g);  // Frees result->value automatically

Rule 3: Graph OWNS all nodes

tofu_graph_node* node = tofu_graph_relu(g, x);
// Don't free node - graph owns it

tofu_graph_free(g);  // Frees all nodes

Rule 4: Never free tensors returned by get_value/get_grad

tofu_tensor* value = tofu_graph_get_value(node);   // Node owns this
tofu_tensor* grad = tofu_graph_get_grad(node);     // Node owns this

// WRONG: tofu_tensor_free(value);  // CRASH!
// CORRECT: Just use the tensor, don't free it

Complete Cleanup Pattern

// 1. Allocate parameter tensors
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){64, 32}, TOFU_FLOAT);
tofu_tensor* b = tofu_tensor_zeros(1, (int[]){32}, TOFU_FLOAT);

// 2. Create graph
tofu_graph* g = tofu_graph_create();

// 3. Add parameters to graph
tofu_graph_node* W_node = tofu_graph_param(g, W);
tofu_graph_node* b_node = tofu_graph_param(g, b);

// 4. Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Allocate batch data (you manage this)
    float* batch_data = load_batch(epoch);
    tofu_tensor* x_tensor = tofu_tensor_create(batch_data, 2,
                                                (int[]){32, 64}, TOFU_FLOAT);

    // Build graph (operations owned by graph)
    tofu_graph_node* x = tofu_graph_input(g, x_tensor);
    tofu_graph_node* out = tofu_graph_add(g,
                                tofu_graph_matmul(g, x, W_node), b_node);
    // ... training ...

    // Free batch resources (you own these)
    tofu_tensor_free(x_tensor);
    free(batch_data);

    // Clear operations but keep parameters
    tofu_graph_clear_ops(g);
}

// 5. Cleanup (CRITICAL ORDER!)
tofu_graph_free(g);               // Graph owns: nodes, ops, gradients
tofu_tensor_free_data_too(W);     // You own: parameter tensors
tofu_tensor_free_data_too(b);

Memory Management with Optimizers

// Create optimizer (holds references to graph and parameters)
tofu_optimizer* opt = tofu_optimizer_adam_create(g, 0.001, 0.9, 0.999, 1e-8);

// Training...

// Cleanup order matters!
tofu_optimizer_free(opt);         // 1. Free optimizer first
tofu_graph_free(g);               // 2. Then graph
tofu_tensor_free_data_too(W);     // 3. Then parameter tensors
tofu_tensor_free_data_too(b);

Common Memory Pitfalls

Pitfall 1: Freeing parameter tensors too early

// WRONG
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);
tofu_tensor_free_data_too(W);  // DON'T DO THIS! Graph needs it

tofu_graph_backward(g, loss);  // CRASH: W is freed but graph uses it

Pitfall 2: Freeing operation results

// WRONG
tofu_graph_node* result = tofu_graph_matmul(g, x, W);
tofu_tensor* result_value = tofu_graph_get_value(result);
tofu_tensor_free(result_value);  // CRASH! Graph owns this

Pitfall 3: Forgetting to free parameter tensors

// Memory leak
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);

tofu_graph_free(g);  // Graph freed but W still allocated!
// Missing: tofu_tensor_free_data_too(W);

Pitfall 4: Double-free via clear_ops

// WRONG
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_clear_ops(g);  // Removes x node
tofu_tensor_free(input_tensor);  // OK

tofu_tensor_free(input_tensor);  // CRASH: Double free!

Batch Processing Memory Pattern

Efficient pattern for processing multiple batches:

tofu_graph* g = tofu_graph_create();

// Parameters persist across batches
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){64, 32}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);

for (int batch = 0; batch < num_batches; batch++) {
    // Allocate batch-specific data
    float* batch_data = malloc(batch_size * 64 * sizeof(float));
    load_batch_data(batch_data, batch);

    tofu_tensor* x_tensor = tofu_tensor_create(batch_data, 2,
                                                (int[]){batch_size, 64}, TOFU_FLOAT);

    // Build graph for this batch
    tofu_graph_node* x = tofu_graph_input(g, x_tensor);
    tofu_graph_node* out = tofu_graph_matmul(g, x, W_node);
    // ... compute loss, backward, optimize ...

    // Free batch resources
    tofu_tensor_free(x_tensor);  // Free tensor wrapper
    free(batch_data);            // Free data buffer

    // Clear operations (keeps W_node!)
    tofu_graph_clear_ops(g);
}

// Final cleanup
tofu_graph_free(g);
tofu_tensor_free_data_too(W);

12. Building Complex Networks

Move beyond single layers to build sophisticated architectures.

Multi-Layer Perceptron (MLP)

A complete MLP with multiple hidden layers:

typedef struct {
    tofu_tensor *W1, *b1;  // Input -> Hidden1
    tofu_tensor *W2, *b2;  // Hidden1 -> Hidden2
    tofu_tensor *W3, *b3;  // Hidden2 -> Output
} MLP;

// Initialize weights with Xavier/He initialization
MLP* mlp_create(int input_dim, int hidden1, int hidden2, int output_dim) {
    MLP* mlp = malloc(sizeof(MLP));

    // Layer 1: input_dim -> hidden1
    mlp->W1 = tofu_tensor_zeros(2, (int[]){input_dim, hidden1}, TOFU_FLOAT);
    mlp->b1 = tofu_tensor_zeros(1, (int[]){hidden1}, TOFU_FLOAT);

    // Initialize W1 with Xavier: uniform(-sqrt(6/n), sqrt(6/n))
    float limit1 = sqrtf(6.0f / input_dim);
    for (int i = 0; i < mlp->W1->len; i++) {
        float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit1;
        TOFU_TENSOR_DATA_FROM(&val, mlp->W1, i, TOFU_FLOAT);
    }

    // Layer 2: hidden1 -> hidden2
    mlp->W2 = tofu_tensor_zeros(2, (int[]){hidden1, hidden2}, TOFU_FLOAT);
    mlp->b2 = tofu_tensor_zeros(1, (int[]){hidden2}, TOFU_FLOAT);

    float limit2 = sqrtf(6.0f / hidden1);
    for (int i = 0; i < mlp->W2->len; i++) {
        float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit2;
        TOFU_TENSOR_DATA_FROM(&val, mlp->W2, i, TOFU_FLOAT);
    }

    // Layer 3: hidden2 -> output_dim
    mlp->W3 = tofu_tensor_zeros(2, (int[]){hidden2, output_dim}, TOFU_FLOAT);
    mlp->b3 = tofu_tensor_zeros(1, (int[]){output_dim}, TOFU_FLOAT);

    float limit3 = sqrtf(6.0f / hidden2);
    for (int i = 0; i < mlp->W3->len; i++) {
        float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit3;
        TOFU_TENSOR_DATA_FROM(&val, mlp->W3, i, TOFU_FLOAT);
    }

    return mlp;
}

// Forward pass
tofu_graph_node* mlp_forward(tofu_graph* g, tofu_graph_node* x, MLP* mlp) {
    // Add parameters to graph
    tofu_graph_node* W1 = tofu_graph_param(g, mlp->W1);
    tofu_graph_node* b1 = tofu_graph_param(g, mlp->b1);
    tofu_graph_node* W2 = tofu_graph_param(g, mlp->W2);
    tofu_graph_node* b2 = tofu_graph_param(g, mlp->b2);
    tofu_graph_node* W3 = tofu_graph_param(g, mlp->W3);
    tofu_graph_node* b3 = tofu_graph_param(g, mlp->b3);

    // Layer 1: x @ W1 + b1 -> ReLU
    tofu_graph_node* h1 = tofu_graph_matmul(g, x, W1);
    h1 = tofu_graph_add(g, h1, b1);
    h1 = tofu_graph_relu(g, h1);

    // Layer 2: h1 @ W2 + b2 -> ReLU
    tofu_graph_node* h2 = tofu_graph_matmul(g, h1, W2);
    h2 = tofu_graph_add(g, h2, b2);
    h2 = tofu_graph_relu(g, h2);

    // Layer 3: h2 @ W3 + b3 (logits)
    tofu_graph_node* out = tofu_graph_matmul(g, h2, W3);
    out = tofu_graph_add(g, out, b3);

    return out;
}

// Cleanup
void mlp_free(MLP* mlp) {
    tofu_tensor_free_data_too(mlp->W1);
    tofu_tensor_free_data_too(mlp->b1);
    tofu_tensor_free_data_too(mlp->W2);
    tofu_tensor_free_data_too(mlp->b2);
    tofu_tensor_free_data_too(mlp->W3);
    tofu_tensor_free_data_too(mlp->b3);
    free(mlp);
}

// Usage (batch_data, batch_targets, and the optimizer are assumed to be
// set up beforehand, e.g. with tofu_optimizer_sgd_create)
MLP* model = mlp_create(784, 256, 128, 10);  // MNIST-style
tofu_graph* g = tofu_graph_create();

for (int epoch = 0; epoch < 100; epoch++) {
    tofu_graph_zero_grad(g);

    tofu_graph_node* x = tofu_graph_input(g, batch_data);
    tofu_graph_node* logits = mlp_forward(g, x, model);
    tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1);
    tofu_graph_node* target = tofu_graph_input(g, batch_targets);
    tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, target);

    tofu_graph_backward(g, loss);
    tofu_optimizer_step(optimizer);
    tofu_graph_clear_ops(g);
}

mlp_free(model);
tofu_graph_free(g);

Residual Connections

Residual connections (skip connections) help training deep networks:

// Residual block: output = ReLU(x + F(x))
tofu_graph_node* residual_block(tofu_graph* g, tofu_graph_node* x,
                                 tofu_tensor* W1, tofu_tensor* b1,
                                 tofu_tensor* W2, tofu_tensor* b2) {
    // Main path F(x)
    tofu_graph_node* W1_node = tofu_graph_param(g, W1);
    tofu_graph_node* b1_node = tofu_graph_param(g, b1);
    tofu_graph_node* W2_node = tofu_graph_param(g, W2);
    tofu_graph_node* b2_node = tofu_graph_param(g, b2);

    // F(x) = ReLU(x @ W1 + b1) @ W2 + b2
    tofu_graph_node* h1 = tofu_graph_matmul(g, x, W1_node);
    h1 = tofu_graph_add(g, h1, b1_node);
    h1 = tofu_graph_relu(g, h1);

    tofu_graph_node* h2 = tofu_graph_matmul(g, h1, W2_node);
    h2 = tofu_graph_add(g, h2, b2_node);

    // Skip connection: x + F(x)
    tofu_graph_node* residual = tofu_graph_add(g, x, h2);

    // Final activation
    return tofu_graph_relu(g, residual);
}

// Stack multiple residual blocks
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
x = residual_block(g, x, W1a, b1a, W1b, b1b);
x = residual_block(g, x, W2a, b2a, W2b, b2b);
x = residual_block(g, x, W3a, b3a, W3b, b3b);
tofu_graph_node* out = tofu_graph_matmul(g, x, W_out);

Custom Layer Abstractions

Encapsulate common patterns:

// Linear layer: y = x @ W + b
typedef struct {
    tofu_tensor* W;
    tofu_tensor* b;
} LinearLayer;

LinearLayer* linear_create(int in_features, int out_features) {
    LinearLayer* layer = malloc(sizeof(LinearLayer));
    layer->W = tofu_tensor_zeros(2, (int[]){in_features, out_features}, TOFU_FLOAT);
    layer->b = tofu_tensor_zeros(1, (int[]){out_features}, TOFU_FLOAT);

    // Initialize weights
    float limit = sqrtf(6.0f / in_features);
    for (int i = 0; i < layer->W->len; i++) {
        float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit;
        TOFU_TENSOR_DATA_FROM(&val, layer->W, i, TOFU_FLOAT);
    }

    return layer;
}

tofu_graph_node* linear_forward(tofu_graph* g, tofu_graph_node* x, LinearLayer* layer) {
    tofu_graph_node* W = tofu_graph_param(g, layer->W);
    tofu_graph_node* b = tofu_graph_param(g, layer->b);
    tofu_graph_node* out = tofu_graph_matmul(g, x, W);
    return tofu_graph_add(g, out, b);
}

void linear_free(LinearLayer* layer) {
    tofu_tensor_free_data_too(layer->W);
    tofu_tensor_free_data_too(layer->b);
    free(layer);
}

// Build network with layer abstractions
LinearLayer* fc1 = linear_create(784, 256);
LinearLayer* fc2 = linear_create(256, 10);

tofu_graph_node* x = tofu_graph_input(g, input);
x = linear_forward(g, x, fc1);
x = tofu_graph_relu(g, x);
x = linear_forward(g, x, fc2);
tofu_graph_node* probs = tofu_graph_softmax(g, x, 1);

Transformer-Style Attention (Simplified)

Basic attention mechanism pattern:

// Simplified attention: softmax(Q @ K^T) @ V
tofu_graph_node* attention(tofu_graph* g,
                           tofu_graph_node* Q,  // Query
                           tofu_graph_node* K,  // Key
                           tofu_graph_node* V)  // Value
{
    // 1. Compute attention scores: Q @ K^T
    tofu_graph_node* K_T = tofu_graph_transpose(g, K, NULL);
    tofu_graph_node* scores = tofu_graph_matmul(g, Q, K_T);

    // 2. Softmax over last dimension
    tofu_graph_node* attn_weights = tofu_graph_softmax(g, scores, -1);

    // 3. Apply attention: attn_weights @ V
    tofu_graph_node* output = tofu_graph_matmul(g, attn_weights, V);

    return output;
}

13. Best Practices

Guidelines for robust and maintainable graph-based code.

Graph Design Principles

1. Separate model structure from training logic

// Good: Model is a reusable structure
typedef struct {
    tofu_tensor *W1, *b1, *W2, *b2;
} Model;

tofu_graph_node* model_forward(tofu_graph* g, tofu_graph_node* x, Model* m);

// Train the model
void train(Model* model, Dataset* data) {
    tofu_graph* g = tofu_graph_create();
    // Training loop uses model_forward()
    tofu_graph_free(g);
}

2. Use clear_ops between iterations

// Efficient: Reuse graph structure
for (int batch = 0; batch < num_batches; batch++) {
    // Build graph for this batch
    tofu_graph_node* loss = build_forward_graph(g, batch);
    tofu_graph_backward(g, loss);
    tofu_optimizer_step(opt);

    // Clear ops but keep parameters
    tofu_graph_clear_ops(g);
}

3. Always check tensor shapes during development

tofu_graph_node* result = tofu_graph_matmul(g, x, W);
tofu_tensor* result_tensor = tofu_graph_get_value(result);

printf("Result shape: [");
for (int i = 0; i < result_tensor->ndim; i++) {
    printf("%d%s", result_tensor->dims[i],
           i < result_tensor->ndim - 1 ? ", " : "");
}
printf("]\n");

Debugging Strategies

Monitor loss values

tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
float loss_val;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);

if (isnan(loss_val) || isinf(loss_val)) {
    printf("ERROR: Loss is NaN or Inf at epoch %d\n", epoch);
    // Check gradients, learning rate, or input data
}

if (loss_val > prev_loss * 2.0f) {
    printf("WARNING: Loss spiked at epoch %d\n", epoch);
    // Consider reducing learning rate
}

Check gradient magnitudes

void print_gradient_stats(tofu_graph_node* param, const char* name) {
    tofu_tensor* grad = tofu_graph_get_grad(param);
    if (!grad) return;

    float min = 1e9f, max = -1e9f, sum = 0.0f;
    for (int i = 0; i < grad->len; i++) {
        float g;
        TOFU_TENSOR_DATA_TO(grad, i, g, TOFU_FLOAT);
        if (g < min) min = g;
        if (g > max) max = g;
        sum += g * g;
    }

    printf("%s grad: min=%.6f, max=%.6f, norm=%.6f\n",
           name, min, max, sqrtf(sum));
}

Validate forward pass outputs

tofu_tensor* probs = tofu_graph_get_value(softmax_node);

// Check that the probabilities for the first sample sum to 1
float sum = 0.0f;
for (int i = 0; i < probs->dims[1]; i++) {
    float p;
    TOFU_TENSOR_DATA_TO(probs, i, p, TOFU_FLOAT);
    sum += p;
}

if (fabsf(sum - 1.0f) > 1e-5f) {
    printf("WARNING: Probabilities don't sum to 1: %.6f\n", sum);
}

Performance Tips

1. Batch your data

// Slow: Process one sample at a time
for (int i = 0; i < 1000; i++) {
    tofu_graph_node* x = tofu_graph_input(g, single_samples[i]);
    // ... forward, backward, update ...
}

// Fast: Process batches
int batch_size = 32;
for (int i = 0; i < 1000; i += batch_size) {
    tofu_graph_node* x = tofu_graph_input(g, batched_samples[i/batch_size]);
    // ... forward, backward, update ...
}

2. Reuse graph structure

// Less efficient: Create new graph each iteration
for (int epoch = 0; epoch < 100; epoch++) {
    tofu_graph* g = tofu_graph_create();
    // ... train ...
    tofu_graph_free(g);
}

// More efficient: Reuse graph
tofu_graph* g = tofu_graph_create();
for (int epoch = 0; epoch < 100; epoch++) {
    // ... train ...
    tofu_graph_clear_ops(g);
}
tofu_graph_free(g);

3. Profile your code

#include <time.h>

clock_t start = clock();
tofu_graph_backward(g, loss);
clock_t end = clock();

double time_ms = 1000.0 * (end - start) / CLOCKS_PER_SEC;
printf("Backward pass: %.2f ms\n", time_ms);

Common Pitfalls to Avoid

  1. Forgetting to zero gradients

    • Always call tofu_graph_zero_grad() before backward pass
  2. Freeing tensors too early

    • Don't free parameter tensors until after tofu_graph_free()
  3. Wrong loss node

    • Ensure loss is scalar before calling backward (see the sketch after this list)
  4. Shape mismatches

    • Use tofu_tensor_print() to debug shape issues
  5. Learning rate too high

    • Start with small values (0.001-0.01) and adjust
  6. No validation set

    • Always evaluate on separate data to detect overfitting
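
For pitfall 3, a cheap runtime check catches a non-scalar loss before the backward pass. This is a minimal sketch, assuming a scalar loss tensor has len == 1 in your build (verify how your build represents scalars):

// Sketch: verify the loss node is scalar before calling backward
tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
if (loss_tensor == NULL || loss_tensor->len != 1) {
    printf("ERROR: loss is not scalar (len=%d), skipping backward\n",
           loss_tensor ? loss_tensor->len : -1);
} else {
    tofu_graph_backward(g, loss);
}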

Complete Training Template

// Complete training example with best practices
void train_model(Dataset* train_data, Dataset* val_data) {
    // Initialize
    tofu_graph* g = tofu_graph_create();
    Model* model = model_create(input_dim, hidden_dim, output_dim);
    tofu_optimizer* opt = tofu_optimizer_adam_create(g, 0.001, 0.9, 0.999, 1e-8);

    float best_val_loss = 1e9f;

    for (int epoch = 0; epoch < num_epochs; epoch++) {
        // Training phase
        float train_loss = 0.0f;
        for (int batch = 0; batch < train_data->num_batches; batch++) {
            tofu_graph_zero_grad(g);

            tofu_graph_node* x = tofu_graph_input(g, train_data->batches[batch].x);
            tofu_graph_node* pred = model_forward(g, x, model);
            tofu_graph_node* target = tofu_graph_input(g, train_data->batches[batch].y);
            tofu_graph_node* loss = tofu_graph_ce_loss(g, pred, target);

            tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
            float batch_loss;
            TOFU_TENSOR_DATA_TO(loss_tensor, 0, batch_loss, TOFU_FLOAT);
            train_loss += batch_loss;

            tofu_graph_backward(g, loss);
            tofu_optimizer_step(opt);
            tofu_graph_clear_ops(g);
        }
        train_loss /= train_data->num_batches;

        // Validation phase (no gradient computation)
        float val_loss = 0.0f;
        for (int batch = 0; batch < val_data->num_batches; batch++) {
            tofu_graph_node* x = tofu_graph_input(g, val_data->batches[batch].x);
            tofu_graph_node* pred = model_forward(g, x, model);
            tofu_graph_node* target = tofu_graph_input(g, val_data->batches[batch].y);
            tofu_graph_node* loss = tofu_graph_ce_loss(g, pred, target);

            tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
            float batch_loss;
            TOFU_TENSOR_DATA_TO(loss_tensor, 0, batch_loss, TOFU_FLOAT);
            val_loss += batch_loss;

            tofu_graph_clear_ops(g);
        }
        val_loss /= val_data->num_batches;

        // Logging
        printf("Epoch %3d: train_loss=%.4f, val_loss=%.4f",
               epoch, train_loss, val_loss);

        // Save best model
        if (val_loss < best_val_loss) {
            best_val_loss = val_loss;
            printf(" (best)");
            // Save model weights here
        }
        printf("\n");

        // Early stopping
        if (train_loss < 0.01f && val_loss > train_loss * 2.0f) {
            printf("Early stopping: overfitting detected\n");
            break;
        }
    }

    // Cleanup
    tofu_optimizer_free(opt);
    model_free(model);
    tofu_graph_free(g);
}

This completes the computation graphs user guide. You now have the knowledge to build, train, and debug neural networks using Tofu's graph API.

Training Neural Networks

This guide covers how to train neural networks using TOFU's automatic differentiation and optimization capabilities. We'll walk through the complete training process with practical examples.

Introduction

Training a neural network in TOFU follows a standard pattern familiar to users of modern frameworks like PyTorch or TensorFlow. The key difference is that TOFU is designed for resource-constrained environments like microcontrollers, so we emphasize memory efficiency and explicit resource management.

What You'll Learn

In this guide you'll learn:

  • How to structure a complete training loop
  • Data preparation and batching strategies
  • Forward and backward pass mechanics
  • Loss computation and monitoring
  • Parameter optimization
  • Training strategies for embedded systems
  • Debugging and evaluation techniques

Prerequisites

Before starting, ensure you're familiar with tensors and the computation graph API covered in the preceding guides.

The Training Paradigm

Neural network training is an iterative process of:

  1. Making predictions (forward pass)
  2. Measuring error (loss computation)
  3. Computing gradients (backward pass)
  4. Updating weights (optimization step)

TOFU provides all the primitives needed for this cycle through its computation graph API and automatic differentiation engine.

Memory Considerations

Training on microcontrollers requires careful memory management. TOFU helps by:

  • Allowing graph reuse across iterations via tofu_graph_clear_ops()
  • Minimizing allocations during training
  • Providing explicit control over tensor lifetimes
  • Supporting in-place operations where possible

The Training Loop

Every training loop in TOFU follows a consistent five-step pattern. Understanding this pattern is essential for successful training.

The Five-Step Pattern

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    for (int batch = 0; batch < num_batches; batch++) {
        /* Step 1: Zero gradients */
        tofu_graph_zero_grad(g);

        /* Step 2: Forward pass */
        tofu_graph_node* prediction = forward_pass(g, input, params);

        /* Step 3: Compute loss */
        tofu_graph_node* loss = tofu_graph_mse_loss(g, prediction, target);

        /* Step 4: Backward pass */
        tofu_graph_backward(g, loss);

        /* Step 5: Update parameters */
        tofu_optimizer_step(optimizer);

        /* Cleanup: Clear operations but keep parameters */
        tofu_graph_clear_ops(g);
    }
}

Let's examine each step in detail.

Step 1: Zero Gradients

Before computing new gradients, clear any gradients from the previous iteration:

tofu_graph_zero_grad(g);

Why? Gradients accumulate by default. If you don't zero them, new gradients add to old ones, producing incorrect updates.

When to skip: Only when you explicitly want gradient accumulation (advanced technique).
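
For reference, gradient accumulation defers steps 1 and 5: zero once, run backward over several micro-batches so gradients add up, then apply a single update. A minimal sketch follows; micro_inputs, micro_target_tensors, and accum_steps are placeholders, and it assumes tofu_graph_clear_ops() leaves parameter gradients untouched (verify this for your build):

/* Sketch: zero once, backward several times, step once */
tofu_graph_zero_grad(g);                            /* zero once for the whole group */

for (int micro = 0; micro < accum_steps; micro++) {
    tofu_graph_node* pred   = forward_pass(g, micro_inputs[micro], params);
    tofu_graph_node* target = tofu_graph_input(g, micro_target_tensors[micro]);
    tofu_graph_node* loss   = tofu_graph_mse_loss(g, pred, target);

    tofu_graph_backward(g, loss);                   /* gradients accumulate */
    tofu_graph_clear_ops(g);                        /* free ops, keep parameters */
}

tofu_optimizer_step(optimizer);                     /* one update for the whole group */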

Step 2: Forward Pass

Build the computation graph and compute predictions:

/* Create input node */
tofu_graph_node* x = tofu_graph_input(g, input_tensor);

/* Build network */
tofu_graph_node* h1 = tofu_graph_matmul(g, x, w1);
tofu_graph_node* h1_bias = tofu_graph_add(g, h1, b1);
tofu_graph_node* h1_act = tofu_graph_relu(g, h1_bias);

/* Output layer */
tofu_graph_node* output = tofu_graph_matmul(g, h1_act, w2);
tofu_graph_node* pred = tofu_graph_add(g, output, b2);

Key principle: The forward pass constructs the computational graph that defines your model.

Step 3: Compute Loss

Compare predictions to targets:

tofu_graph_node* target = tofu_graph_input(g, target_tensor);
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);

The loss node becomes the starting point for backpropagation.

Step 4: Backward Pass

Compute gradients via automatic differentiation:

tofu_graph_backward(g, loss);

This populates the grad field of all parameter nodes with gradients.

Step 5: Update Parameters

Apply gradients to update trainable parameters:

tofu_optimizer_step(optimizer);

The optimizer uses computed gradients and its internal algorithm (SGD, momentum, etc.) to update parameter values.

Graph Cleanup

After each iteration, clear operations while preserving parameters:

tofu_graph_clear_ops(g);

This frees intermediate computation nodes but keeps parameter nodes, allowing the graph to be reused in the next iteration.

Data Preparation

Proper data preparation is crucial for successful training. This section covers batching, normalization, and memory-efficient data handling.

Dataset Structure

Organize your data to facilitate batch processing:

typedef struct {
    float* images;      /* [num_samples, feature_dims] */
    int* labels;        /* [num_samples] */
    int num_samples;
    int feature_dims;
} dataset;

For the XOR problem, data preparation is simple:

float xor_inputs[4][2] = {
    {0.0f, 0.0f},
    {0.0f, 1.0f},
    {1.0f, 0.0f},
    {1.0f, 1.0f}
};

float xor_targets[4][1] = {
    {0.0f},
    {1.0f},
    {1.0f},
    {0.0f}
};

Batching Strategies

For larger datasets, process data in batches:

const int BATCH_SIZE = 4;

for (int batch_start = 0; batch_start < num_samples; batch_start += BATCH_SIZE) {
    int batch_end = (batch_start + BATCH_SIZE < num_samples)
                    ? batch_start + BATCH_SIZE
                    : num_samples;
    int actual_batch_size = batch_end - batch_start;

    /* Prepare batch data */
    float* batch_data = (float*)malloc(actual_batch_size * feature_dims * sizeof(float));
    int* batch_labels = (int*)malloc(actual_batch_size * sizeof(int));

    for (int i = 0; i < actual_batch_size; i++) {
        memcpy(batch_data + i * feature_dims,
               data->images + (batch_start + i) * feature_dims,  /* data: pointer to a dataset */
               feature_dims * sizeof(float));
        batch_labels[i] = data->labels[batch_start + i];
    }

    /* Create batch tensor */
    tofu_tensor* t_batch = tofu_tensor_create(batch_data, 2,
                                              (int[]){actual_batch_size, feature_dims},
                                              TOFU_FLOAT);

    /* ... training step ... */

    /* Cleanup */
    tofu_tensor_free(t_batch);
    free(batch_data);
    free(batch_labels);
}

Batch size considerations:

  • Larger batches: More stable gradients, better hardware utilization
  • Smaller batches: Less memory usage, more frequent updates
  • For microcontrollers: Start with batch_size=1 or very small batches

Data Normalization

Normalize inputs for better training stability:

/* Compute mean and std from training data */
void compute_statistics(const float* data, int num_samples, int dims,
                       float* mean, float* std) {
    /* Zero initialize */
    for (int d = 0; d < dims; d++) {
        mean[d] = 0.0f;
        std[d] = 0.0f;
    }

    /* Compute mean */
    for (int i = 0; i < num_samples; i++) {
        for (int d = 0; d < dims; d++) {
            mean[d] += data[i * dims + d];
        }
    }
    for (int d = 0; d < dims; d++) {
        mean[d] /= num_samples;
    }

    /* Compute std */
    for (int i = 0; i < num_samples; i++) {
        for (int d = 0; d < dims; d++) {
            float diff = data[i * dims + d] - mean[d];
            std[d] += diff * diff;
        }
    }
    for (int d = 0; d < dims; d++) {
        std[d] = sqrtf(std[d] / num_samples);
    }
}

/* Normalize data */
void normalize_data(float* data, int num_samples, int dims,
                   const float* mean, const float* std) {
    for (int i = 0; i < num_samples; i++) {
        for (int d = 0; d < dims; d++) {
            data[i * dims + d] = (data[i * dims + d] - mean[d]) / (std[d] + 1e-8f);
        }
    }
}

Common normalization strategies:

  • Z-score normalization: (x - mean) / std (shown above)
  • Min-max scaling: (x - min) / (max - min) to [0, 1] (sketched below)
  • Simple scaling: Divide by 255 for image data
  • No normalization: For binary inputs like XOR
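
Here is a minimal sketch of the min-max variant, written in the same style as normalize_data above:

/* Min-max scaling to [0, 1], analogous to normalize_data above */
void minmax_scale(float* data, int num_samples, int dims) {
    for (int d = 0; d < dims; d++) {
        float min_v = data[d], max_v = data[d];

        /* Find per-feature min and max */
        for (int i = 1; i < num_samples; i++) {
            float v = data[i * dims + d];
            if (v < min_v) min_v = v;
            if (v > max_v) max_v = v;
        }

        /* Scale each value; guard against a constant feature */
        float range = max_v - min_v;
        for (int i = 0; i < num_samples; i++) {
            data[i * dims + d] = (data[i * dims + d] - min_v) / (range + 1e-8f);
        }
    }
}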

One-Hot Encoding

For classification, encode labels as one-hot vectors:

/* Convert integer labels to one-hot encoding */
void create_one_hot(int* labels, int batch_size, int num_classes, float* one_hot) {
    memset(one_hot, 0, batch_size * num_classes * sizeof(float));

    for (int i = 0; i < batch_size; i++) {
        one_hot[i * num_classes + labels[i]] = 1.0f;
    }
}

/* Usage in training loop */
float* target_data = (float*)calloc(batch_size * num_classes, sizeof(float));
create_one_hot(batch_labels, batch_size, num_classes, target_data);

tofu_tensor* t_target = tofu_tensor_create(target_data, 2,
                                           (int[]){batch_size, num_classes},
                                           TOFU_FLOAT);

Forward Pass

The forward pass computes predictions by propagating input through the network. Understanding graph construction and reuse is key to efficient training.

Building the Computation Graph

For a simple feedforward network:

tofu_graph_node* forward_pass(tofu_graph* g,
                              tofu_tensor* input_data,
                              tofu_graph_node* w1, tofu_graph_node* b1,
                              tofu_graph_node* w2, tofu_graph_node* b2) {
    /* Input layer */
    tofu_graph_node* x = tofu_graph_input(g, input_data);

    /* Hidden layer: x @ w1 + b1 */
    tofu_graph_node* h1_matmul = tofu_graph_matmul(g, x, w1);
    tofu_graph_node* h1_bias = tofu_graph_add(g, h1_matmul, b1);
    tofu_graph_node* h1 = tofu_graph_relu(g, h1_bias);

    /* Output layer: h1 @ w2 + b2 */
    tofu_graph_node* out_matmul = tofu_graph_matmul(g, h1, w2);
    tofu_graph_node* prediction = tofu_graph_add(g, out_matmul, b2);

    return prediction;
}

Reusing Graphs with clear_ops

Instead of creating a new graph each iteration, reuse it:

/* Initialize graph once */
tofu_graph* g = tofu_graph_create();

/* Create parameters once (persist across iterations) */
tofu_graph_node* w1 = tofu_graph_param(g, t_w1);
tofu_graph_node* b1 = tofu_graph_param(g, t_b1);
tofu_graph_node* w2 = tofu_graph_param(g, t_w2);
tofu_graph_node* b2 = tofu_graph_param(g, t_b2);

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    tofu_graph_zero_grad(g);

    /* Build forward pass (creates new operation nodes) */
    tofu_graph_node* pred = forward_pass(g, input, w1, b1, w2, b2);
    tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);

    tofu_graph_backward(g, loss);
    tofu_optimizer_step(optimizer);

    /* Clear operations but keep parameters */
    tofu_graph_clear_ops(g);  /* This is crucial! */
}

Why clear_ops?

  • Frees intermediate operation nodes
  • Preserves parameter nodes and their values
  • Allows graph reuse without memory leaks
  • Essential for embedded systems with limited memory

Activation Functions

TOFU supports several activation functions:

/* ReLU: max(0, x) */
tofu_graph_node* relu_out = tofu_graph_relu(g, input);

/* Softmax: exp(x) / sum(exp(x)) */
tofu_graph_node* softmax_out = tofu_graph_softmax(g, logits, 1);  /* axis=1 */

Choose activation based on your task:

  • ReLU: Hidden layers, default choice
  • Softmax: Output layer for multi-class classification
  • None: Output layer for regression

Computing Loss

The loss function measures prediction error and drives learning. Choosing the right loss function is critical for your task.

Mean Squared Error (MSE)

Use MSE for regression tasks:

tofu_graph_node* loss = tofu_graph_mse_loss(g, prediction, target);

Formula: L = mean((pred - target)^2)

When to use:

  • Regression problems (predicting continuous values)
  • Output layer without activation (raw values)
  • Examples: XOR (as regression), price prediction, temperature estimation

Example from XOR training:

/* Prediction is continuous output */
tofu_graph_node* y_pred = tofu_graph_add(g, y_matmul, b2);

/* Target is also continuous */
float* target_data = (float*)malloc(OUTPUT_SIZE * sizeof(float));
target_data[0] = xor_targets[sample][0];  /* 0.0 or 1.0 */
tofu_tensor* t_target = tofu_tensor_create(target_data, 1,
                                           (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* y_target = tofu_graph_input(g, t_target);

/* MSE loss */
tofu_graph_node* loss_node = tofu_graph_mse_loss(g, y_pred, y_target);

Cross-Entropy (CE) Loss

Use cross-entropy for classification:

/* Apply softmax first */
tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1);

/* Compute CE loss with one-hot targets */
tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, one_hot_target);

Formula: L = -mean(sum(target * log(pred)))

When to use:

  • Multi-class classification
  • Softmax output layer (probabilities)
  • One-hot encoded targets
  • Examples: MNIST digit classification, CNN pattern recognition

Example from CNN training:

/* Forward pass with softmax */
tofu_graph_node* probs = cnn_forward_probs(g, input, params);

/* One-hot encode targets */
float* target_data = (float*)calloc(batch_size * num_classes, sizeof(float));
for (int i = 0; i < batch_size; i++) {
    target_data[i * num_classes + labels[i]] = 1.0f;
}
tofu_tensor* t_target = tofu_tensor_create(target_data, 2,
                                           (int[]){batch_size, num_classes},
                                           TOFU_FLOAT);
tofu_graph_node* target = tofu_graph_input(g, t_target);

/* Cross-entropy loss */
tofu_graph_node* loss_node = tofu_graph_ce_loss(g, probs, target);

Extracting Loss Values

Get the scalar loss value for monitoring:

tofu_tensor* loss_tensor = tofu_graph_get_value(loss_node);
float loss_value = 0.0f;
if (loss_tensor && loss_tensor->len > 0) {
    TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_value, TOFU_FLOAT);
}

Monitoring Loss

Track loss to ensure training progresses:

float epoch_loss = 0.0f;
int num_batches = 0;

for (int batch_start = 0; batch_start < num_samples; batch_start += BATCH_SIZE) {
    /* ... forward pass and loss computation ... */

    float batch_loss = 0.0f;
    tofu_tensor* loss_tensor = tofu_graph_get_value(loss_node);
    if (loss_tensor && loss_tensor->len > 0) {
        TOFU_TENSOR_DATA_TO(loss_tensor, 0, batch_loss, TOFU_FLOAT);
    }

    epoch_loss += batch_loss;
    num_batches++;

    /* ... backward pass and update ... */
}

float avg_loss = epoch_loss / num_batches;
printf("Epoch %d: avg_loss = %.6f\n", epoch, avg_loss);

Loss Patterns

Healthy training:

  • Loss decreases steadily
  • Eventually plateaus at a low value
  • May have small oscillations

Problem signs:

  • Loss increases: Learning rate too high or gradient explosion
  • Loss stuck: Learning rate too low or poor initialization
  • Loss = NaN: Numerical instability (try lower learning rate)

Backward Pass

The backward pass computes gradients through automatic differentiation. TOFU handles the complexity; you just call one function.

Invoking Backpropagation

tofu_graph_backward(g, loss_node);

This single call:

  1. Traverses the computation graph in reverse topological order
  2. Applies the chain rule at each operation
  3. Accumulates gradients in each node's grad field
  4. Populates gradients for all parameter nodes

How Automatic Differentiation Works

TOFU implements reverse-mode automatic differentiation (backpropagation):

Forward pass:  Input → Op1 → Op2 → ... → Loss
Backward pass: Loss → ∂Op2 → ∂Op1 → ... → ∂Input

Each operation knows how to compute its local gradient:

  • Matmul: ∂L/∂A = ∂L/∂C @ B^T and ∂L/∂B = A^T @ ∂L/∂C
  • Add: ∂L/∂A = ∂L/∂C and ∂L/∂B = ∂L/∂C (with broadcasting)
  • ReLU: ∂L/∂x = ∂L/∂y * (x > 0)
  • MSE: ∂L/∂pred = 2 * (pred - target) / n

Gradient Flow Example

For a simple network y = relu(x @ w + b):

/* Forward pass builds graph */
tofu_graph_node* xw = tofu_graph_matmul(g, x, w);
tofu_graph_node* xw_b = tofu_graph_add(g, xw, b);
tofu_graph_node* y = tofu_graph_relu(g, xw_b);
tofu_graph_node* loss = tofu_graph_mse_loss(g, y, target);

/* Backward pass computes gradients */
tofu_graph_backward(g, loss);

/* Now gradients are available:
 * loss->grad: Always 1.0 (starting point)
 * y->grad: ∂L/∂y
 * xw_b->grad: ∂L/∂y * relu_grad
 * xw->grad: ∂L/∂(xw) (equals xw_b->grad: add passes the gradient through unchanged)
 * w->grad: x^T @ ∂L/∂(xw)  <- Used by optimizer
 * b->grad: sum(∂L/∂(xw+b))  <- Used by optimizer
 */

Accessing Gradients

Check gradient values (useful for debugging):

tofu_tensor* w_grad = tofu_graph_get_grad(w1);
if (w_grad) {
    printf("w1 gradient norm: ");
    float grad_sum = 0.0f;
    for (int i = 0; i < w_grad->len; i++) {
        float val;
        TOFU_TENSOR_DATA_TO(w_grad, i, val, TOFU_FLOAT);
        grad_sum += val * val;
    }
    printf("%.6f\n", sqrtf(grad_sum));
}

Gradient Checking (Debug Tool)

Verify backprop implementation with numerical gradients:

float numerical_gradient(tofu_graph* g, tofu_graph_node* param,
                        tofu_graph_node* loss, int param_idx) {
    const float epsilon = 1e-4f;

    /* Get parameter value */
    float original;
    TOFU_TENSOR_DATA_TO(param->value, param_idx, original, TOFU_FLOAT);

    /* Compute f(x + epsilon) */
    float perturbed = original + epsilon;
    TOFU_TENSOR_DATA_FROM(param->value, param_idx, perturbed, TOFU_FLOAT);
    tofu_graph_zero_grad(g);
    /* ... rerun forward pass ... */
    float loss_plus;
    TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_plus, TOFU_FLOAT);

    /* Compute f(x - epsilon) */
    perturbed = original - epsilon;
    TOFU_TENSOR_DATA_FROM(param->value, param_idx, perturbed, TOFU_FLOAT);
    tofu_graph_zero_grad(g);
    /* ... rerun forward pass ... */
    float loss_minus;
    TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_minus, TOFU_FLOAT);

    /* Restore original value */
    TOFU_TENSOR_DATA_FROM(param->value, param_idx, original, TOFU_FLOAT);

    /* Numerical gradient: (f(x+ε) - f(x-ε)) / (2ε) */
    return (loss_plus - loss_minus) / (2.0f * epsilon);
}

/* Compare with analytical gradient */
float analytical_grad;
TOFU_TENSOR_DATA_TO(tofu_graph_get_grad(param), param_idx, analytical_grad, TOFU_FLOAT);

float numerical_grad = numerical_gradient(g, param, loss, param_idx);
float relative_error = fabsf(analytical_grad - numerical_grad) /
                       fmaxf(fabsf(analytical_grad), fabsf(numerical_grad));

if (relative_error < 1e-5f) {
    printf("Gradient check PASSED (error: %.2e)\n", relative_error);
} else {
    printf("Gradient check FAILED (error: %.2e)\n", relative_error);
}

Use gradient checking sparingly - it's expensive (requires multiple forward passes per parameter).

Common Gradient Issues

Vanishing gradients:

  • Gradients become very small (near zero)
  • Common with deep networks or saturating activations
  • Solutions: Better initialization (Xavier), ReLU activations, batch normalization

Exploding gradients:

  • Gradients become very large
  • Loss becomes NaN
  • Solutions: Lower learning rate, gradient clipping (sketched below), better initialization
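
If exploding gradients show up, one common fix is clipping the global gradient norm between the backward pass and the optimizer step. The following is a hedged sketch: clip_gradients, params, and num_params are illustrative names, and the TOFU_TENSOR_DATA_FROM argument order follows the gradient-checking snippet above, so confirm it against your headers:

/* Sketch: clip the global gradient norm before the optimizer step */
void clip_gradients(tofu_graph_node** params, int num_params, float max_norm) {
    /* 1. Global norm over all parameter gradients */
    float total = 0.0f;
    for (int p = 0; p < num_params; p++) {
        tofu_tensor* grad = tofu_graph_get_grad(params[p]);
        if (!grad) continue;
        for (int i = 0; i < grad->len; i++) {
            float v;
            TOFU_TENSOR_DATA_TO(grad, i, v, TOFU_FLOAT);
            total += v * v;
        }
    }
    float norm = sqrtf(total);
    if (norm <= max_norm) return;

    /* 2. Rescale every gradient element by max_norm / norm */
    float scale = max_norm / (norm + 1e-8f);
    for (int p = 0; p < num_params; p++) {
        tofu_tensor* grad = tofu_graph_get_grad(params[p]);
        if (!grad) continue;
        for (int i = 0; i < grad->len; i++) {
            float v;
            TOFU_TENSOR_DATA_TO(grad, i, v, TOFU_FLOAT);
            v *= scale;
            TOFU_TENSOR_DATA_FROM(grad, i, v, TOFU_FLOAT);
        }
    }
}

/* Call between backward and the optimizer step */
tofu_graph_backward(g, loss);
clip_gradients(params, num_params, 1.0f);
tofu_optimizer_step(optimizer);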

Parameter Updates

After computing gradients, update parameters using the optimizer. This is where learning actually happens.

Optimizer Step

tofu_optimizer_step(optimizer);

This updates all parameters according to the optimizer's algorithm:

SGD: param = param - learning_rate * grad

SGD with momentum:

velocity = momentum * velocity + grad
param = param - learning_rate * velocity
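
To make these rules concrete, here is what a single update does to one parameter value in plain C, independent of the Tofu API:

/* Plain-C illustration of one update step for a single parameter */
float param = 0.5f, grad = 0.2f;
float learning_rate = 0.01f;

/* Vanilla SGD */
param = param - learning_rate * grad;            /* 0.5 - 0.01 * 0.2 = 0.498 */

/* SGD with momentum (velocity persists across steps in real code) */
float velocity = 0.0f;
float momentum = 0.9f;
velocity = momentum * velocity + grad;           /* 0.9 * 0.0 + 0.2 = 0.2    */
param    = param - learning_rate * velocity;     /* 0.498 - 0.002   = 0.496  */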

Choosing an Optimizer

TOFU provides two optimizers:

/* Vanilla SGD */
tofu_optimizer* sgd = tofu_optimizer_sgd_create(g, 0.01);

/* SGD with momentum (recommended for most tasks) */
tofu_optimizer* sgd_momentum = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

Vanilla SGD:

  • Simplest algorithm
  • Good for convex problems
  • Can oscillate in ravines
  • Use when: Memory is tight, simple problem

SGD with Momentum:

  • Accumulates velocity
  • Faster convergence
  • Less oscillation
  • Use when: Default choice for most problems

Learning Rate Selection

The learning rate is the most important hyperparameter:

/* Too high (0.5): May diverge or oscillate */
tofu_optimizer* opt_high = tofu_optimizer_sgd_create(g, 0.5);

/* Too low (0.0001): Very slow convergence */
tofu_optimizer* opt_low = tofu_optimizer_sgd_create(g, 0.0001);

/* Just right (0.01 - 0.1): Task-dependent */
tofu_optimizer* opt_good = tofu_optimizer_sgd_create(g, 0.01);

Guidelines:

  • Start with 0.01 or 0.1
  • If loss diverges: Reduce by 10x
  • If convergence is slow: Increase by 2-3x
  • Smaller networks often need larger learning rates
  • Batch size matters: Larger batches → higher learning rate

Parameter Update Timing

The order matters:

/* Correct order */
tofu_graph_zero_grad(g);          /* 1. Clear old gradients */
/* forward pass */                /* 2. Compute predictions */
/* loss computation */             /* 3. Measure error */
tofu_graph_backward(g, loss);     /* 4. Compute gradients */
tofu_optimizer_step(optimizer);   /* 5. Update parameters */

/* WRONG: Update before backward */
tofu_optimizer_step(optimizer);   /* Updates with old/zero gradients! */
tofu_graph_backward(g, loss);

Monitoring Parameter Changes

Track how much parameters change each step:

/* Before update */
float w_before;
TOFU_TENSOR_DATA_TO(w1->value, 0, w_before, TOFU_FLOAT);

/* Update */
tofu_optimizer_step(optimizer);

/* After update */
float w_after;
TOFU_TENSOR_DATA_TO(w1->value, 0, w_after, TOFU_FLOAT);

float change = fabsf(w_after - w_before);
printf("Parameter change: %.6f\n", change);

Healthy training:

  • Parameters change gradually
  • Change magnitude decreases over time
  • No sudden jumps

Training Strategies

Effective training requires more than just the basic loop. Here are strategies for better results.

Mini-Batch Training

Process data in batches instead of one sample at a time:

const int BATCH_SIZE = 16;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    /* Shuffle data (optional but recommended) */
    shuffle_dataset(dataset);

    for (int batch_start = 0; batch_start < num_samples; batch_start += BATCH_SIZE) {
        int batch_end = (batch_start + BATCH_SIZE < num_samples)
                        ? batch_start + BATCH_SIZE
                        : num_samples;
        int actual_batch_size = batch_end - batch_start;

        /* Prepare batch */
        float* batch_data = create_batch(dataset, batch_start, actual_batch_size);
        tofu_tensor* t_batch = tofu_tensor_create(batch_data, 2,
                                                  (int[]){actual_batch_size, feature_dims},
                                                  TOFU_FLOAT);

        /* Train on batch */
        tofu_graph_zero_grad(g);
        tofu_graph_node* input = tofu_graph_input(g, t_batch);
        /* ... rest of training step ... */

        /* Cleanup */
        tofu_tensor_free(t_batch);
        free(batch_data);
        tofu_graph_clear_ops(g);
    }
}

Batch size trade-offs:

  • Larger (32-128): More stable gradients, better GPU utilization, higher memory
  • Smaller (1-8): Less memory, more updates, noisier gradients
  • Microcontroller: Often limited to 1-4 due to memory constraints

Epoch Management

An epoch is one complete pass through the training data:

const int NUM_EPOCHS = 100;
float best_loss = INFINITY;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float epoch_loss = 0.0f;
    int num_batches = 0;

    /* Train on all batches */
    for (int batch = 0; batch < num_batches_total; batch++) {
        /* ... training step ... */
        epoch_loss += batch_loss;
        num_batches++;
    }

    /* Average loss over epoch */
    float avg_loss = epoch_loss / num_batches;

    /* Track best model */
    if (avg_loss < best_loss) {
        best_loss = avg_loss;
        /* Optionally save parameters */
    }

    /* Report progress */
    if (epoch % 10 == 0) {
        printf("Epoch %d: loss = %.6f\n", epoch, avg_loss);
    }
}

How many epochs?

  • Too few: Underfitting (model hasn't learned)
  • Too many: Overfitting (model memorizes training data)
  • Monitor validation loss to determine when to stop

Learning Rate Scheduling

Adjust learning rate during training:

float initial_lr = 0.1f;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    /* Step decay: Reduce by 10x every 50 epochs */
    float lr = initial_lr;
    if (epoch >= 50) lr *= 0.1f;
    if (epoch >= 100) lr *= 0.1f;

    /* Recreate optimizer with the new learning rate
       (the optimizer itself was created once before the loop) */
    tofu_optimizer_free(optimizer);
    optimizer = tofu_optimizer_sgd_create(g, lr);

    /* ... training for this epoch ... */
}

Common schedules:

  • Step decay: Reduce by constant factor at fixed intervals
  • Exponential decay: lr = lr0 * exp(-k * epoch) (sketched below)
  • Warmup: Start with low lr, gradually increase
  • Manual: Reduce when loss plateaus
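
A minimal sketch of the exponential variant, reusing the free-and-recreate pattern from the step-decay example above (the optimizer is created once before the loop; the decay rate k is an assumed value you tune):

#include <math.h>

/* Exponential decay: lr = lr0 * exp(-k * epoch) */
float lr0 = 0.1f;
float k = 0.01f;   /* decay rate; tune for your problem */

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float lr = lr0 * expf(-k * epoch);

    /* Same recreation pattern as the step-decay example above */
    tofu_optimizer_free(optimizer);
    optimizer = tofu_optimizer_sgd_create(g, lr);

    /* ... training for this epoch ... */
}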

Data Augmentation

Increase effective dataset size by transforming inputs:

void augment_image(float* image, int width, int height) {
    /* Random flip */
    if (rand() % 2) {
        horizontal_flip(image, width, height);
    }

    /* Random noise */
    for (int i = 0; i < width * height; i++) {
        image[i] += 0.01f * ((float)rand() / RAND_MAX - 0.5f);
    }
}

/* Apply during training */
augment_image(batch_data, 8, 8);
tofu_tensor* t_input = tofu_tensor_create(batch_data, 2, ...);

Augmentation techniques:

  • Rotation, flipping, cropping (images)
  • Noise injection (signals)
  • Time shifting (sequences)
  • Caution: Some augmentations may not make sense for your data

Early Stopping

Stop training when validation loss stops improving:

float best_val_loss = INFINITY;
int patience = 10;  /* Number of epochs to wait */
int wait = 0;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    /* Train */
    float train_loss = train_epoch(g, optimizer, train_data);

    /* Validate */
    float val_loss = evaluate(g, params, val_data);

    /* Check improvement */
    if (val_loss < best_val_loss) {
        best_val_loss = val_loss;
        wait = 0;
        /* Save best model */
    } else {
        wait++;
        if (wait >= patience) {
            printf("Early stopping at epoch %d\n", epoch);
            break;
        }
    }
}

Monitoring and Debugging

Track training progress and diagnose issues.

Loss Curves

Plot loss over time to understand training dynamics:

#define MAX_EPOCHS 200
float train_losses[MAX_EPOCHS];
float val_losses[MAX_EPOCHS];

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    train_losses[epoch] = train_epoch(...);
    val_losses[epoch] = evaluate(...);

    printf("Epoch %3d: train_loss=%.4f, val_loss=%.4f\n",
           epoch, train_losses[epoch], val_losses[epoch]);
}

/* Analyze curves */
save_losses("losses.txt", train_losses, val_losses, NUM_EPOCHS);

What to look for:

  • Both decreasing: Healthy training
  • Train decreases, val increases: Overfitting
  • Both plateau: Underfitting or need lower learning rate
  • Both increasing: Learning rate too high

Gradient Monitoring

Check gradient magnitudes:

void check_gradients(tofu_graph* g, tofu_graph_node** params, int num_params) {
    for (int i = 0; i < num_params; i++) {
        tofu_tensor* grad = tofu_graph_get_grad(params[i]);
        if (!grad) continue;

        float grad_norm = 0.0f;
        for (int j = 0; j < grad->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
            grad_norm += val * val;
        }
        grad_norm = sqrtf(grad_norm);

        printf("Param %d gradient norm: %.6f\n", i, grad_norm);

        /* Warning signs */
        if (grad_norm < 1e-7f) {
            printf("  WARNING: Vanishing gradient!\n");
        }
        if (grad_norm > 1e3f) {
            printf("  WARNING: Exploding gradient!\n");
        }
    }
}

/* Call after backward pass */
tofu_graph_backward(g, loss);
check_gradients(g, params, num_params);

Activation Statistics

Monitor activation distributions:

void check_activations(tofu_graph_node* node) {
    tofu_tensor* act = tofu_graph_get_value(node);
    if (!act) return;

    float min_val = INFINITY, max_val = -INFINITY, mean = 0.0f;

    for (int i = 0; i < act->len; i++) {
        float val;
        TOFU_TENSOR_DATA_TO(act, i, val, TOFU_FLOAT);

        if (val < min_val) min_val = val;
        if (val > max_val) max_val = val;
        mean += val;
    }
    mean /= act->len;

    printf("Activation stats: min=%.4f, max=%.4f, mean=%.4f\n",
           min_val, max_val, mean);

    /* Warning signs */
    if (max_val - min_val < 1e-6f) {
        printf("  WARNING: Dead activations (all same)!\n");
    }
}

Debugging Checklist

When training fails, check:

  1. Loss is NaN:

    • Reduce learning rate
    • Check for division by zero
    • Verify input data normalization
  2. Loss doesn't decrease:

    • Increase learning rate
    • Check gradient flow (print gradients)
    • Verify data/labels are correct
    • Try better initialization
  3. Training is slow:

    • Increase learning rate
    • Use momentum
    • Check batch size
    • Verify network is not too large
  4. Overfitting:

    • Add more training data
    • Reduce network size
    • Use validation set for early stopping

Evaluation

After training, evaluate model performance on test data.

Computing Accuracy

For classification tasks:

int argmax(tofu_tensor* tensor);  /* forward declaration; defined below */

float compute_accuracy(tofu_graph* g, cnn_params* params,
                      float* test_data, int* test_labels, int num_samples) {
    int correct = 0;

    for (int i = 0; i < num_samples; i++) {
        tofu_graph_zero_grad(g);

        /* Forward pass */
        float* input_data = &test_data[i * INPUT_SIZE];
        tofu_tensor* t_input = tofu_tensor_create(input_data, 1,
                                                  (int[]){INPUT_SIZE}, TOFU_FLOAT);
        tofu_graph_node* input = tofu_graph_input(g, t_input);
        tofu_graph_node* probs = cnn_forward_probs(g, input, params);

        /* Get prediction */
        tofu_tensor* probs_tensor = tofu_graph_get_value(probs);
        int pred_class = argmax(probs_tensor);

        if (pred_class == test_labels[i]) {
            correct++;
        }

        tofu_tensor_free(t_input);
        tofu_graph_clear_ops(g);
    }

    return (float)correct / num_samples;
}

/* Helper function */
int argmax(tofu_tensor* tensor) {
    int max_idx = 0;
    float max_val = -INFINITY;

    for (int i = 0; i < tensor->len; i++) {
        float val;
        TOFU_TENSOR_DATA_TO(tensor, i, val, TOFU_FLOAT);
        if (val > max_val) {
            max_val = val;
            max_idx = i;
        }
    }

    return max_idx;
}

Regression Metrics

For regression tasks:

float compute_mse(tofu_graph* g, tofu_graph_node* w1, tofu_graph_node* b1,
                 tofu_graph_node* w2, tofu_graph_node* b2,
                 float test_inputs[][2], float test_targets[][1], int num_samples) {
    float total_error = 0.0f;

    for (int i = 0; i < num_samples; i++) {
        /* Forward pass */
        float* input_data = (float*)malloc(2 * sizeof(float));
        input_data[0] = test_inputs[i][0];
        input_data[1] = test_inputs[i][1];

        tofu_tensor* t_input = tofu_tensor_create(input_data, 1, (int[]){2}, TOFU_FLOAT);
        tofu_graph_node* x = tofu_graph_input(g, t_input);

        /* Network computation */
        tofu_graph_node* h1 = tofu_graph_relu(g, tofu_graph_add(g,
                                               tofu_graph_matmul(g, x, w1), b1));
        tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, h1, w2), b2);

        /* Get prediction */
        float pred_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(pred), 0, pred_val, TOFU_FLOAT);

        /* Compute error */
        float error = pred_val - test_targets[i][0];
        total_error += error * error;

        tofu_tensor_free(t_input);
        free(input_data);
        tofu_graph_clear_ops(g);
    }

    return total_error / num_samples;
}

Confusion Matrix

For detailed classification analysis:

void compute_confusion_matrix(tofu_graph* g, cnn_params* params,
                             float* test_data, int* test_labels,
                             int num_samples, int num_classes,
                             int confusion[4][4]) {
    /* Initialize matrix */
    memset(confusion, 0, num_classes * num_classes * sizeof(int));

    for (int i = 0; i < num_samples; i++) {
        /* Get prediction */
        int pred_class = predict_sample(g, params, &test_data[i * INPUT_SIZE]);
        int true_class = test_labels[i];

        /* Update confusion matrix */
        confusion[true_class][pred_class]++;
    }

    /* Print matrix */
    printf("\nConfusion Matrix:\n");
    printf("      ");
    for (int i = 0; i < num_classes; i++) printf("%4d ", i);
    printf("\n");

    for (int i = 0; i < num_classes; i++) {
        printf("True %d: ", i);
        for (int j = 0; j < num_classes; j++) {
            printf("%4d ", confusion[i][j]);
        }
        printf("\n");
    }
}

Complete Example

Here's a complete XOR training example bringing everything together:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"

/* Xavier initialization */
float xavier_init(int fan_in) {
    float limit = sqrtf(6.0f / fan_in);
    return limit * (2.0f * (float)rand() / RAND_MAX - 1.0f);
}

int main() {
    /* Configuration */
    const int INPUT_SIZE = 2, HIDDEN_SIZE = 4, OUTPUT_SIZE = 1;
    const int NUM_EPOCHS = 2000;
    const float LEARNING_RATE = 0.1f;

    /* XOR dataset */
    float inputs[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    float targets[4][1] = {{0}, {1}, {1}, {0}};

    /* Create graph */
    tofu_graph* g = tofu_graph_create();

    /* Initialize parameters */
    float* w1_data = malloc(INPUT_SIZE * HIDDEN_SIZE * sizeof(float));
    for (int i = 0; i < INPUT_SIZE * HIDDEN_SIZE; i++)
        w1_data[i] = xavier_init(INPUT_SIZE);
    tofu_tensor* t_w1 = tofu_tensor_create(w1_data, 2,
                                           (int[]){INPUT_SIZE, HIDDEN_SIZE}, TOFU_FLOAT);
    tofu_graph_node* w1 = tofu_graph_param(g, t_w1);

    float* b1_data = calloc(HIDDEN_SIZE, sizeof(float));
    tofu_tensor* t_b1 = tofu_tensor_create(b1_data, 1, (int[]){HIDDEN_SIZE}, TOFU_FLOAT);
    tofu_graph_node* b1 = tofu_graph_param(g, t_b1);

    float* w2_data = malloc(HIDDEN_SIZE * OUTPUT_SIZE * sizeof(float));
    for (int i = 0; i < HIDDEN_SIZE * OUTPUT_SIZE; i++)
        w2_data[i] = xavier_init(HIDDEN_SIZE);
    tofu_tensor* t_w2 = tofu_tensor_create(w2_data, 2,
                                           (int[]){HIDDEN_SIZE, OUTPUT_SIZE}, TOFU_FLOAT);
    tofu_graph_node* w2 = tofu_graph_param(g, t_w2);

    float* b2_data = calloc(OUTPUT_SIZE, sizeof(float));
    tofu_tensor* t_b2 = tofu_tensor_create(b2_data, 1, (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
    tofu_graph_node* b2 = tofu_graph_param(g, t_b2);

    /* Create optimizer */
    tofu_optimizer* optimizer = tofu_optimizer_sgd_create(g, LEARNING_RATE);

    /* Training loop */
    for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
        float epoch_loss = 0.0f;

        for (int sample = 0; sample < 4; sample++) {
            /* Zero gradients */
            tofu_graph_zero_grad(g);

            /* Create input */
            float* in_data = malloc(INPUT_SIZE * sizeof(float));
            in_data[0] = inputs[sample][0];
            in_data[1] = inputs[sample][1];
            tofu_tensor* t_in = tofu_tensor_create(in_data, 1, (int[]){INPUT_SIZE}, TOFU_FLOAT);
            tofu_graph_node* x = tofu_graph_input(g, t_in);

            /* Forward pass */
            tofu_graph_node* h1 = tofu_graph_relu(g, tofu_graph_add(g,
                                                   tofu_graph_matmul(g, x, w1), b1));
            tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, h1, w2), b2);

            /* Create target */
            float* tgt_data = malloc(OUTPUT_SIZE * sizeof(float));
            tgt_data[0] = targets[sample][0];
            tofu_tensor* t_tgt = tofu_tensor_create(tgt_data, 1, (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
            tofu_graph_node* tgt = tofu_graph_input(g, t_tgt);

            /* Compute loss */
            tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, tgt);
            float loss_val;
            TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
            epoch_loss += loss_val;

            /* Backward pass */
            tofu_graph_backward(g, loss);

            /* Update parameters */
            tofu_optimizer_step(optimizer);

            /* Cleanup */
            tofu_tensor_free(t_in);
            tofu_tensor_free(t_tgt);
            free(in_data);
            free(tgt_data);
            tofu_graph_clear_ops(g);
        }

        /* Report progress */
        if (epoch % 200 == 0) {
            printf("Epoch %4d: loss = %.6f\n", epoch, epoch_loss / 4);
        }
    }

    /* Evaluate */
    printf("\nFinal predictions:\n");
    for (int i = 0; i < 4; i++) {
        float* in_data = malloc(INPUT_SIZE * sizeof(float));
        in_data[0] = inputs[i][0];
        in_data[1] = inputs[i][1];
        tofu_tensor* t_in = tofu_tensor_create(in_data, 1, (int[]){INPUT_SIZE}, TOFU_FLOAT);
        tofu_graph_node* x = tofu_graph_input(g, t_in);

        tofu_graph_node* h1 = tofu_graph_relu(g, tofu_graph_add(g,
                                               tofu_graph_matmul(g, x, w1), b1));
        tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, h1, w2), b2);

        float pred_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(pred), 0, pred_val, TOFU_FLOAT);

        printf("[%.0f, %.0f] -> %.4f (target: %.0f)\n",
               inputs[i][0], inputs[i][1], pred_val, targets[i][0]);

        tofu_tensor_free(t_in);
        free(in_data);
        tofu_graph_clear_ops(g);
    }

    /* Cleanup */
    tofu_optimizer_free(optimizer);
    tofu_graph_free(g);
    tofu_tensor_free_data_too(t_w1);
    tofu_tensor_free_data_too(t_b1);
    tofu_tensor_free_data_too(t_w2);
    tofu_tensor_free_data_too(t_b2);

    return 0;
}

This example demonstrates:

  • Parameter initialization with Xavier method
  • Complete training loop with all five steps
  • Proper memory management (malloc/free)
  • Graph reuse via clear_ops
  • Loss monitoring during training
  • Final evaluation on the dataset

Best Practices

Memory Management

Always free tensors in correct order:

/* Correct order */
tofu_optimizer_free(optimizer);  /* 1. Free optimizer first */
tofu_graph_free(g);              /* 2. Free graph second */
tofu_tensor_free_data_too(t_w1); /* 3. Free parameter tensors last */
tofu_tensor_free_data_too(t_b1);

Use clear_ops between iterations:

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    /* ... training step ... */
    tofu_graph_clear_ops(g);  /* Prevents memory leaks */
}

Initialization

Use Xavier/He initialization:

/* Xavier: Good for tanh/sigmoid */
float xavier = sqrtf(6.0f / fan_in) * (2.0f * rand() / RAND_MAX - 1.0f);

/* He: Better for ReLU */
float he = sqrtf(2.0f / fan_in) * (2.0f * rand() / RAND_MAX - 1.0f);

Never initialize weights to all zeros (zero biases are fine):

/* WRONG for weights: symmetry is never broken, so every unit learns the same thing */
float* w_data = calloc(size, sizeof(float));

/* CORRECT: Random initialization */
for (int i = 0; i < size; i++)
    w_data[i] = xavier_init(fan_in);

Hyperparameter Tuning

Start with these defaults:

  • Learning rate: 0.01 - 0.1
  • Batch size: 1 - 16 (for microcontrollers)
  • Hidden layer size: 2x - 4x input size
  • Epochs: 100 - 1000

Tune systematically:

  1. Get the model working at all (reduce problem size if needed)
  2. Tune learning rate (most important)
  3. Tune architecture (layer sizes)
  4. Tune batch size (memory permitting)

Debugging

Print everything during development:

printf("Loss: %.6f\n", loss_val);
printf("Grad norm: %.6f\n", grad_norm);
printf("Prediction: %.4f, Target: %.4f\n", pred, target);

Check intermediate values:

tofu_tensor* h1_val = tofu_graph_get_value(h1);
printf("Hidden layer stats: ");
print_tensor_stats(h1_val);

Start simple, scale up:

  1. Verify on tiny dataset (4 samples)
  2. Check on small network (few parameters)
  3. Scale to full problem once working

Resource Constraints

For microcontrollers, minimize memory usage:

  • Use batch_size=1 if memory is tight
  • Keep networks small (< 10k parameters)
  • Reuse graph with clear_ops
  • Consider quantization (future work)
  • Profile memory usage regularly (a parameter-count sketch follows)
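
One quick way to keep an eye on the parameter budget is to total the elements of the parameter tensors you created (here the XOR example's t_w1 ... t_b2) and convert to bytes. This is only a rough sketch: it ignores graph nodes, gradients, and optimizer state:

/* Sketch: estimate parameter memory from the tensors you created */
tofu_tensor* param_tensors[] = { t_w1, t_b1, t_w2, t_b2 };
size_t total_elems = 0;

for (int i = 0; i < 4; i++) {
    total_elems += (size_t)param_tensors[i]->len;
}

printf("Parameters: %zu elements, ~%zu bytes (float32)\n",
       total_elems, total_elems * sizeof(float));
/* Rough rule from above: keep this well under ~10k elements on small MCUs */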

With these best practices, you're ready to train neural networks on TOFU. See the examples directory for more complete training scripts.

Optimizers

Optimizers update neural network parameters using computed gradients. Understanding how optimizers work and how to tune them is essential for training models effectively.

Introduction

Training a neural network means finding parameter values that minimize a loss function. This is an optimization problem: start with random parameters, compute gradients that indicate how to adjust them, and iteratively update parameters to reduce loss.

Optimizers automate this process. They take computed gradients and apply update rules to parameters. Different optimizers use different strategies—some use only the current gradient (like SGD), while others accumulate information from previous steps (like momentum-based methods).

This guide explains optimizer fundamentals, shows you how to create and use optimizers, describes the algorithms available in Tofu, and provides practical guidance for tuning hyperparameters and troubleshooting training issues.

Optimizer Fundamentals

Understanding how optimizers work requires grasping two key concepts: gradient descent and the learning rate.

Gradient Descent: Following the Slope Downhill

Imagine you're standing on a mountain in fog, trying to reach the lowest point. You can't see far, but you can feel which direction slopes downward beneath your feet. Gradient descent works the same way: at each step, compute which direction reduces the loss function, then take a small step in that direction.

Mathematically, for a parameter theta and loss L:

theta_new = theta_old - learning_rate * gradient

Where gradient = dL/dtheta (the derivative of loss with respect to the parameter).

The gradient points in the direction of steepest ascent (uphill). By subtracting it, we move downhill toward lower loss.
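
As a concrete example, minimizing L(theta) = theta^2 (so the gradient is 2 * theta) from theta = 1.0 with learning_rate = 0.1 shrinks theta by 20% per step:

// Worked example: gradient descent on L(theta) = theta^2
float theta = 1.0f;
float learning_rate = 0.1f;

for (int step = 0; step < 5; step++) {
    float gradient = 2.0f * theta;             // dL/dtheta
    theta = theta - learning_rate * gradient;  // 1.0 -> 0.8 -> 0.64 -> 0.512 -> ...
    printf("step %d: theta = %.3f, loss = %.3f\n", step, theta, theta * theta);
}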

Learning Rate: Step Size Matters

The learning rate controls how large a step to take. This is the single most important hyperparameter in training neural networks.

Too large: You'll overshoot the minimum, potentially making loss worse or causing training to diverge completely.

Loss landscape:  \    /
                  \__/
With large steps: --> X <-- (overshoot back and forth)

Too small: Training converges slowly. You'll make progress, but it might take 10x or 100x more iterations than necessary.

Loss landscape:  \    /
                  \__/
With tiny steps:  . . . . . (very slow progress)

Just right: Training converges efficiently without instability.

Loss landscape:  \    /
                  \__/
Good step size:   -> -> -> (steady progress to minimum)

Typical learning rates range from 0.0001 to 0.1. Start with 0.01 and adjust based on training behavior.

Stochastic Gradient Descent (SGD)

Classical gradient descent computes gradients using the entire training dataset. This is expensive and slow. Stochastic Gradient Descent (SGD) uses small batches of data instead—typically 32, 64, or 128 examples at a time.

The "stochastic" (random) part means each batch gives a noisy estimate of the true gradient. But averaging over many batches gives the correct direction, and computing on small batches is much faster than using the entire dataset.

In practice, when people say "SGD," they usually mean "mini-batch SGD"—computing gradients on small batches rather than single examples or the full dataset.

Why Multiple Optimizer Types?

If vanilla SGD works, why do we need other optimizers? Because SGD has limitations:

  1. Slow convergence on complex loss landscapes
  2. Oscillation in narrow valleys (moves back and forth rather than forward)
  3. Sensitivity to learning rate choice

Advanced optimizers like SGD with momentum address these issues by accumulating information about previous gradients. This helps accelerate training and dampen oscillations.

Creating Optimizers

Optimizers in Tofu are tied to computation graphs. When you create an optimizer, it automatically collects all trainable parameters (nodes created with tofu_graph_param) from the graph.

Basic Setup

Creating an optimizer follows this pattern:

// 1. Create graph and add parameters
tofu_graph *g = tofu_graph_create();

tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// 2. Create optimizer (automatically finds W and b)
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// 3. Use in training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    tofu_optimizer_zero_grad(opt);
    // ... forward pass, compute loss ...
    tofu_graph_backward(g, loss);
    tofu_optimizer_step(opt);
}

// 4. Cleanup (optimizer before graph)
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(weights);
tofu_tensor_free_data_too(bias);

Key points:

  • Automatic parameter collection: The optimizer scans the graph and finds all PARAM nodes when created
  • One optimizer per graph: Each optimizer manages parameters from a single graph
  • Cleanup order matters: Always free the optimizer before the graph

Choosing a Learning Rate

Start with these defaults:

  • 0.01 - Safe starting point for most problems
  • 0.001 - Deep networks, complex problems
  • 0.1 - Small networks, simple problems

After a few iterations, check if loss is decreasing. If not, reduce the learning rate by 10x. If loss decreases very slowly, try increasing by 2x-5x.

Memory Considerations

Different optimizers have different memory requirements:

  • SGD: No extra memory (just the parameters themselves)
  • SGD with momentum: One velocity buffer per parameter (doubles memory)

For large networks on memory-constrained devices, vanilla SGD may be the only option. For everything else, momentum is usually worth the extra memory.
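
As a rough sizing example: a network with 10,000 float (4-byte) parameters needs about 40 KB for the weights themselves; adding momentum adds roughly another 40 KB of velocity buffers, for about 80 KB of optimizer state before counting gradients and activations.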

SGD: Stochastic Gradient Descent

Vanilla SGD is the simplest optimizer. It updates parameters by directly subtracting the scaled gradient.

The Algorithm

For each parameter theta:

theta = theta - learning_rate * gradient

That's it. Compute the gradient, scale it by the learning rate, subtract from the parameter.

In code:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

The optimizer applies this update rule to every parameter automatically when you call tofu_optimizer_step().

Implementation Example

Here's a complete training loop using SGD:

// Setup
tofu_graph *g = tofu_graph_create();

// Network: linear layer (input_dim=4, output_dim=3)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);

tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);

// Create SGD optimizer with learning rate 0.01
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop
for (int epoch = 0; epoch < 100; epoch++) {
    double epoch_loss = 0.0;

    for (int batch = 0; batch < num_batches; batch++) {
        // Zero gradients before forward pass
        tofu_optimizer_zero_grad(opt);

        // Forward pass: pred = input @ W + b
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
        tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
        tofu_graph_node *pred = tofu_graph_add(g, h, b_node);

        // Compute loss
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

        // Get loss value for logging
        float loss_val;
        tofu_tensor *loss_tensor = tofu_graph_get_value(loss);
        TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);
        epoch_loss += loss_val;

        // Backward pass: compute gradients
        tofu_graph_backward(g, loss);

        // Update parameters using gradients
        tofu_optimizer_step(opt);

        // Clear operations for next batch
        tofu_graph_clear_ops(g);
    }

    printf("Epoch %d: Loss = %.6f\n", epoch, epoch_loss / num_batches);
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);

When to Use SGD

SGD works well when:

  • Memory is tight: SGD has no extra memory overhead
  • Loss landscape is smooth: Few local minima, well-conditioned gradients
  • You have time to tune: SGD is sensitive to learning rate, so you'll need to experiment

SGD struggles when:

  • Loss landscape is complex: Many local minima or saddle points
  • Gradients are noisy: High variance in gradient estimates
  • Convergence needs to be fast: SGD converges slower than momentum-based methods

Tuning SGD

The learning rate is the only hyperparameter for vanilla SGD. Here's how to tune it:

Start with 0.01:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

Watch the first few iterations:

  • Loss decreasing steadily: Good sign, continue training
  • Loss increasing or NaN: Learning rate too high, reduce by 10x
  • Loss barely changing: Learning rate too low, increase by 2x-5x

Common learning rate values:

  • 0.1 - Aggressive, works for simple problems
  • 0.01 - Conservative, good default
  • 0.001 - Very conservative, deep networks
  • 0.0001 - Fine-tuning pretrained models

Monitoring training:

for (int epoch = 0; epoch < max_epochs; epoch++) {
    // ... training loop ...

    if (epoch_loss < best_loss) {
        best_loss = epoch_loss;
        no_improvement_count = 0;
    } else {
        no_improvement_count++;
    }

    // Reduce learning rate if stuck
    if (no_improvement_count > 10) {
        opt->learning_rate *= 0.5;
        printf("Reducing learning rate to %.6f\n", opt->learning_rate);
        no_improvement_count = 0;
    }
}

SGD with Momentum

Momentum helps SGD converge faster and more smoothly by accumulating a velocity term that averages gradients over time. This dampens oscillations and accelerates progress in consistent directions.

The Algorithm

Instead of directly using the current gradient, momentum maintains a velocity vector that accumulates gradients exponentially:

v = momentum * v - learning_rate * gradient
theta = theta + v

Where:

  • v is the velocity (initialized to zero)
  • momentum is a coefficient (typically 0.9)
  • learning_rate scales the gradient contribution
  • gradient is the current parameter gradient

Some libraries fold the learning rate into the velocity differently (for example, v = momentum * v + gradient followed by theta = theta - learning_rate * v); with a fixed learning rate the formulations are equivalent up to a rescaling of the velocity. The key steps: multiply the velocity by the momentum coefficient (typically 0.9), subtract the scaled gradient, then add the resulting velocity to the parameter.
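
As an element-wise illustration of these two lines (again, not Tofu's internal implementation):

/* One momentum step: update the velocity, then move the parameter by it */
void momentum_step(float *theta, float *velocity, const float *grad,
                   int n, float lr, float mu) {
    for (int i = 0; i < n; i++) {
        velocity[i] = mu * velocity[i] - lr * grad[i];
        theta[i] += velocity[i];
    }
}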

Why Momentum Works

Think of momentum as a ball rolling downhill. When the slope consistently points in one direction, the ball accelerates (velocity builds up). When the slope changes direction, the accumulated velocity smooths out oscillations.

Without momentum (vanilla SGD):

Narrow valley:  |        |
Path taken:     | -> <- ->| (oscillates back and forth)
                | -> <- ->|

With momentum:

Narrow valley:  |        |
Path taken:     |  -->   | (smooth progress forward)
                |   -->  |

Momentum provides two benefits:

  1. Acceleration: Builds up speed in consistent directions
  2. Dampening: Reduces oscillations in directions that change frequently

Implementation Example

Creating an SGD optimizer with momentum requires one additional parameter:

// Create optimizer with learning_rate=0.01, momentum=0.9
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

The rest of the training loop is identical to vanilla SGD:

// Setup
tofu_graph *g = tofu_graph_create();

// Network: two-layer MLP
tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){784, 128}, TOFU_FLOAT);
tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){128}, TOFU_FLOAT);
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){128, 10}, TOFU_FLOAT);
tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

tofu_graph_node *W1_node = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);

// Create optimizer with momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// Training loop
for (int epoch = 0; epoch < 100; epoch++) {
    for (int batch = 0; batch < num_batches; batch++) {
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);

        // Layer 1: h1 = relu(x @ W1 + b1)
        tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
        h1 = tofu_graph_add(g, h1, b1_node);
        h1 = tofu_graph_relu(g, h1);

        // Layer 2: output = h1 @ W2 + b2
        tofu_graph_node *h2 = tofu_graph_matmul(g, h1, W2_node);
        h2 = tofu_graph_add(g, h2, b2_node);

        // Loss
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, h2, target);

        // Backward and update
        tofu_graph_backward(g, loss);
        tofu_optimizer_step(opt);

        tofu_graph_clear_ops(g);
    }
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W1);
tofu_tensor_free_data_too(b1);
tofu_tensor_free_data_too(W2);
tofu_tensor_free_data_too(b2);

Tuning Momentum

Momentum has two hyperparameters: learning rate and momentum coefficient.

Learning Rate: Start with the same values as vanilla SGD (0.01 is a good default). Momentum often allows slightly higher learning rates because it dampens oscillations.

Momentum Coefficient: Controls how much past gradients influence current updates.

Common values:

  • 0.9 - Standard choice, works well for most problems
  • 0.95 - High momentum, use for slow convergence
  • 0.99 - Very high momentum, use for very deep networks
  • 0.5-0.8 - Low momentum, use if training is unstable

The momentum coefficient is easier to tune than learning rate. Start with 0.9 and adjust if needed.

When to Use Momentum

Use momentum when:

  • Training is slow: Momentum accelerates convergence
  • Gradients are noisy: Momentum smooths out noise
  • Deep networks: Momentum helps propagate gradients through many layers
  • Memory is available: Momentum requires one velocity buffer per parameter

Stick with vanilla SGD when:

  • Memory is very tight: Momentum doubles memory requirements
  • Loss landscape is simple: Vanilla SGD may be sufficient

In practice, momentum is the default choice for most problems. The memory cost is usually worth the faster convergence.

Using Optimizers in Training

Now that you understand optimizer algorithms, let's look at the mechanics of using them in training loops.

The Training Cycle

Every training iteration follows the same four-step pattern:

1. Zero gradients    (tofu_optimizer_zero_grad)
2. Forward pass      (build computation graph)
3. Backward pass     (tofu_graph_backward)
4. Update parameters (tofu_optimizer_step)

This cycle repeats for every batch in every epoch.

Step-by-Step Breakdown

Step 1: Zero Gradients

Gradients accumulate by default in Tofu. If you don't zero them, they'll keep adding up across iterations, leading to incorrect updates.

tofu_optimizer_zero_grad(opt);

This clears all gradient buffers for parameters tracked by the optimizer. Always call this before the forward pass.

Step 2: Forward Pass

Build the computation graph by adding operations. Each operation automatically computes its value:

tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *h = tofu_graph_matmul(g, x, W);
tofu_graph_node *pred = tofu_graph_add(g, h, b);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

At this point, loss contains the computed loss value, but gradients haven't been computed yet.

Step 3: Backward Pass

Compute gradients by calling backward on the loss node:

tofu_graph_backward(g, loss);

This triggers reverse-mode automatic differentiation. Tofu walks the graph backwards, computing gradients for every parameter using the chain rule. After this call, every parameter has its gradient stored in node->grad.

Step 4: Update Parameters

Apply the optimizer's update rule to adjust parameters:

tofu_optimizer_step(opt);

This uses the computed gradients to update parameters. For SGD, it subtracts learning_rate * gradient from each parameter. For momentum, it updates velocity buffers and then parameters.

Step 5: Clear Operations

Before the next iteration, clear operation nodes from the graph while preserving parameters:

tofu_graph_clear_ops(g);

This frees memory used by intermediate computations (matmul results, activations, etc.) but keeps parameters and their gradients intact.

Complete Training Loop

Here's a full training loop with all the pieces:

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    double total_loss = 0.0;

    for (int batch = 0; batch < num_batches; batch++) {
        // 1. Zero gradients
        tofu_optimizer_zero_grad(opt);

        // 2. Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
        tofu_graph_node *pred = forward_pass(g, x);  // Your model
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

        // Track loss for logging
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
        total_loss += loss_val;

        // 3. Backward pass
        tofu_graph_backward(g, loss);

        // 4. Update parameters
        tofu_optimizer_step(opt);

        // 5. Clear operations
        tofu_graph_clear_ops(g);
    }

    // Log epoch statistics
    printf("Epoch %d: Avg Loss = %.6f\n", epoch, total_loss / num_batches);
}

Common Mistakes

Mistake 1: Forgetting to zero gradients

// WRONG: Gradients accumulate indefinitely
for (int i = 0; i < iterations; i++) {
    // No zero_grad call!
    // ... forward, backward, step ...
}

This causes gradients to grow without bound. Updates become incorrect after the first iteration.

Correct:

for (int i = 0; i < iterations; i++) {
    tofu_optimizer_zero_grad(opt);  // Clear old gradients
    // ... forward, backward, step ...
}

Mistake 2: Calling step before backward

// WRONG: No gradients computed yet!
tofu_optimizer_step(opt);
tofu_graph_backward(g, loss);

The optimizer needs gradients to update parameters. Always call backward before step.

Correct:

tofu_graph_backward(g, loss);   // Compute gradients
tofu_optimizer_step(opt);        // Use gradients to update

Mistake 3: Not clearing operations

// WRONG: Memory grows indefinitely
for (int batch = 0; batch < num_batches; batch++) {
    // ... training ...
    // No clear_ops call!
}

Each batch adds nodes to the graph. Without clearing, memory usage grows until the program crashes.

Correct:

for (int batch = 0; batch < num_batches; batch++) {
    // ... training ...
    tofu_graph_clear_ops(g);  // Clear after each batch
}

Monitoring Training

Track key metrics to understand training progress:

for (int epoch = 0; epoch < num_epochs; epoch++) {
    double epoch_loss = 0.0;
    int num_correct = 0;

    for (int batch = 0; batch < num_batches; batch++) {
        // ... training loop ...

        // Track loss
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
        epoch_loss += loss_val;

        // Track accuracy (for classification)
        num_correct += count_correct_predictions(pred, target);
    }

    double avg_loss = epoch_loss / num_batches;
    double accuracy = (double)num_correct / (num_batches * batch_size);

    printf("Epoch %d: Loss = %.6f, Accuracy = %.2f%%\n",
           epoch, avg_loss, accuracy * 100);
}

Learning Rate Strategies

The learning rate often needs adjustment during training. Starting with a fixed rate works for simple problems, but complex models benefit from learning rate schedules.

Fixed Learning Rate

The simplest strategy: use the same learning rate throughout training.

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop uses 0.01 for all epochs
for (int epoch = 0; epoch < 100; epoch++) {
    // ... training ...
}

This works well when:

  • The problem is simple
  • You've found a good learning rate through experimentation
  • Training converges in a reasonable number of epochs

Step Decay

Reduce the learning rate by a fixed factor every N epochs:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1);

for (int epoch = 0; epoch < 100; epoch++) {
    // Reduce learning rate by 10x every 30 epochs
    if (epoch % 30 == 0 && epoch > 0) {
        opt->learning_rate *= 0.1;
        printf("Epoch %d: Learning rate reduced to %.6f\n",
               epoch, opt->learning_rate);
    }

    // ... training loop ...
}

Common schedules:

  • Divide by 10 every 30 epochs (0.1 -> 0.01 -> 0.001)
  • Divide by 2 every 10 epochs (0.1 -> 0.05 -> 0.025)

Step decay is simple and effective for many problems.

Exponential Decay

Gradually reduce the learning rate every epoch:

#include <math.h>  /* for pow() */

double initial_lr = 0.1;
double decay_rate = 0.95;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, initial_lr);

for (int epoch = 0; epoch < 100; epoch++) {
    // Update learning rate
    opt->learning_rate = initial_lr * pow(decay_rate, epoch);

    if (epoch % 10 == 0) {
        printf("Epoch %d: Learning rate = %.6f\n", epoch, opt->learning_rate);
    }

    // ... training loop ...
}

This provides smooth, gradual decay. The decay rate controls how quickly the learning rate decreases (0.95 is typical).

Cosine Annealing

Reduce the learning rate following a cosine curve:

#include <math.h>

double initial_lr = 0.1;
double min_lr = 0.001;
int num_epochs = 100;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, initial_lr);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Cosine annealing formula
    double progress = (double)epoch / num_epochs;
    opt->learning_rate = min_lr + (initial_lr - min_lr) *
                         (1.0 + cos(M_PI * progress)) / 2.0;

    // ... training loop ...
}

Cosine annealing provides smooth decay that starts fast and slows down near the end.

Learning Rate Warmup

For very high initial learning rates, gradually increase from a small value:

double target_lr = 0.1;
int warmup_epochs = 5;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, target_lr / warmup_epochs);

for (int epoch = 0; epoch < 100; epoch++) {
    // Warmup phase
    if (epoch < warmup_epochs) {
        opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
    }

    // ... training loop ...
}

Warmup prevents instability in the first few epochs when using aggressive learning rates.

Adaptive Scheduling

Reduce the learning rate when progress stalls:

double best_loss = INFINITY;
int patience = 5;
int no_improvement_count = 0;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

for (int epoch = 0; epoch < 100; epoch++) {
    // ... training loop ...
    double epoch_loss = compute_epoch_loss();

    // Track progress
    if (epoch_loss < best_loss) {
        best_loss = epoch_loss;
        no_improvement_count = 0;
    } else {
        no_improvement_count++;
    }

    // Reduce learning rate if stuck
    if (no_improvement_count >= patience) {
        opt->learning_rate *= 0.5;
        printf("Reducing learning rate to %.6f\n", opt->learning_rate);
        no_improvement_count = 0;
    }
}

This adapts to training dynamics automatically, reducing the learning rate only when needed.

Choosing a Strategy

Start simple: Use a fixed learning rate first. Only add scheduling if training plateaus.

Step decay: Good default for most problems. Easy to understand and implement.

Exponential/Cosine: Use for long training runs (100+ epochs) where smooth decay is beneficial.

Adaptive: Best when you're not sure how many epochs you need or when progress is unpredictable.

Warmup: Use when starting with very high learning rates (0.1+) to prevent early instability.

Choosing an Optimizer

With multiple optimizers available, how do you choose? Here's a practical decision guide.

Start with SGD + Momentum

For most problems, SGD with momentum (0.01 learning rate, 0.9 momentum) is the best starting point:

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

This provides:

  • Good convergence speed
  • Reasonable memory overhead
  • Robustness to hyperparameter choices

Decision Tree

Is memory extremely tight? (< 2x parameter memory available)
├─ YES: Use vanilla SGD
└─ NO: Continue

Is the problem very simple? (linear model, small dataset)
├─ YES: Use vanilla SGD (momentum won't help much)
└─ NO: Continue

Is the network deep? (> 5 layers)
├─ YES: Use SGD with momentum 0.9 or higher
└─ NO: Use SGD with momentum 0.9

Comparison Table

Optimizer      Memory      Convergence  Tuning     Best For
SGD            Minimal     Slower       Difficult  Memory-constrained, simple problems
SGD+Momentum   2x params   Faster       Moderate   General purpose, deep networks

Network Depth Considerations

Shallow networks (1-3 layers):

  • Vanilla SGD often sufficient
  • Momentum helps but not essential

Medium networks (4-10 layers):

  • Momentum recommended
  • Use momentum 0.9

Deep networks (10+ layers):

  • Momentum essential
  • Use momentum 0.95-0.99

Problem Type Recommendations

Regression (MSE loss):

// Start here
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

Classification (cross-entropy loss):

// May need higher learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.05, 0.9);

Fine-tuning pretrained models:

// Very small learning rate to preserve learned features
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.0001);

When in Doubt

Default configuration for new problems:

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

This works well for most cases. Adjust based on training behavior:

  • Loss diverges: Reduce learning rate by 10x
  • Convergence too slow: Increase learning rate by 2-5x
  • Still slow: Increase momentum to 0.95

Troubleshooting

Training neural networks is often an iterative process of diagnosing and fixing issues. Here are common problems and their solutions.

Loss is NaN or Infinite

Symptoms: Loss becomes NaN or infinity after a few iterations.

Causes:

  1. Learning rate too high
  2. Gradient explosion (very large gradients)
  3. Numerical instability in loss function

Solutions:

Reduce learning rate dramatically:

// If using 0.01, try 0.001
opt->learning_rate = 0.001;

Check gradients for extreme values:

tofu_graph_backward(g, loss);

// Before optimizer step, check gradient magnitudes
for (int i = 0; i < opt->num_params; i++) {
    tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
    double max_grad = find_max_abs_value(grad);

    if (max_grad > 1000.0) {
        printf("Warning: Large gradient detected: %.2f\n", max_grad);
    }
}

tofu_optimizer_step(opt);
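
find_max_abs_value is a hypothetical helper, not part of the Tofu API. A minimal sketch, assuming the TOFU_TENSOR_DATA_TO accessor and <math.h> for fabs:

double find_max_abs_value(tofu_tensor *grad) {
    double max_abs = 0.0;
    for (int j = 0; j < grad->len; j++) {
        float val;
        TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
        if (fabs(val) > max_abs)
            max_abs = fabs(val);
    }
    return max_abs;
}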

Implement gradient clipping:

void clip_gradients(tofu_optimizer *opt, double max_norm) {
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
        if (!grad) continue;

        // Compute L2 norm
        double norm = 0.0;
        for (int j = 0; j < grad->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
            norm += val * val;
        }
        norm = sqrt(norm);

        // Clip if too large
        if (norm > max_norm) {
            double scale = max_norm / norm;
            for (int j = 0; j < grad->len; j++) {
                float val;
                TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
                val *= scale;
                TOFU_TENSOR_DATA_FROM(grad, j, val, TOFU_FLOAT);
            }
        }
    }
}

// Use before optimizer step
tofu_graph_backward(g, loss);
clip_gradients(opt, 1.0);  // Clip to max norm of 1.0
tofu_optimizer_step(opt);

Loss Not Decreasing

Symptoms: Loss stays constant or decreases very slowly.

Causes:

  1. Learning rate too low
  2. Model stuck in poor initialization
  3. Gradient vanishing
  4. Wrong loss function or labels

Solutions:

Increase learning rate:

opt->learning_rate *= 10.0;  // Try 10x higher

Check if gradients are flowing:

tofu_graph_backward(g, loss);

// Check if gradients are non-zero
for (int i = 0; i < opt->num_params; i++) {
    tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
    if (!grad) continue;  /* parameter may have no gradient yet */
    double sum_abs = 0.0;

    for (int j = 0; j < grad->len; j++) {
        float val;
        TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
        sum_abs += fabs(val);
    }

    double mean_abs = sum_abs / grad->len;

    if (mean_abs < 1e-7) {
        printf("Warning: Very small gradients (%.2e) for parameter %d\n",
               mean_abs, i);
    }
}

Try momentum if using vanilla SGD:

// Replace vanilla SGD with momentum
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

Loss Oscillates

Symptoms: Loss goes up and down rather than steadily decreasing.

Causes:

  1. Learning rate too high
  2. Batch size too small (noisy gradients)
  3. Wrong momentum setting

Solutions:

Reduce learning rate:

opt->learning_rate *= 0.5;  // Try half the current rate

Use or increase momentum:

// If using vanilla SGD, add momentum
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// If already using momentum, increase it
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.95);

Training Slows Down Over Time

Symptoms: Loss decreases quickly at first, then stalls.

Causes:

  1. Learning rate too high for fine-tuning
  2. Converging to local minimum
  3. Need learning rate schedule

Solutions:

Implement step decay:

for (int epoch = 0; epoch < 100; epoch++) {
    // Reduce learning rate when progress slows
    if (epoch == 30 || epoch == 60 || epoch == 90) {
        opt->learning_rate *= 0.1;
        printf("Reduced learning rate to %.6f\n", opt->learning_rate);
    }

    // ... training loop ...
}

Use adaptive scheduling:

double best_loss = INFINITY;
int no_improvement = 0;

for (int epoch = 0; epoch < 100; epoch++) {
    // ... training ...

    if (epoch_loss < best_loss) {
        best_loss = epoch_loss;
        no_improvement = 0;
    } else {
        no_improvement++;
    }

    if (no_improvement > 5) {
        opt->learning_rate *= 0.5;
        no_improvement = 0;
    }
}

Memory Issues

Symptoms: Program crashes with allocation errors or runs out of memory.

Causes:

  1. Not clearing operations between batches
  2. Momentum optimizer on large networks
  3. Accumulating tensors unintentionally

Solutions:

Always clear operations:

for (int batch = 0; batch < num_batches; batch++) {
    // ... training ...
    tofu_graph_clear_ops(g);  // Essential for memory management
}

Use vanilla SGD if momentum exhausts memory:

// Switch from momentum to vanilla SGD
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_create(g, 0.01);

Best Practices

Here are guidelines to help you train models effectively and avoid common pitfalls.

Always Zero Gradients

Make this the first line of every training iteration:

for (int batch = 0; batch < num_batches; batch++) {
    tofu_optimizer_zero_grad(opt);  // Never forget this!
    // ... rest of training loop ...
}

Without this, gradients accumulate across batches, leading to incorrect updates.

Monitor Multiple Metrics

Don't rely on loss alone. Track additional metrics:

for (int epoch = 0; epoch < num_epochs; epoch++) {
    double total_loss = 0.0;
    double total_l2_norm = 0.0;
    double max_grad = 0.0;

    for (int batch = 0; batch < num_batches; batch++) {
        // ... training ...

        // Track loss
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
        total_loss += loss_val;

        // Track parameter norm
        total_l2_norm += compute_parameter_norm(opt);

        // Track max gradient
        max_grad = fmax(max_grad, compute_max_gradient(opt));
    }

    printf("Epoch %d: Loss=%.4f, Param_Norm=%.4f, Max_Grad=%.4f\n",
           epoch, total_loss / num_batches,
           total_l2_norm / num_batches, max_grad);
}
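
compute_parameter_norm and compute_max_gradient are illustrative helpers, not part of the Tofu API. Minimal sketches, assuming the optimizer exposes params and num_params as in the gradient-clipping example and that <math.h> is included:

double compute_parameter_norm(tofu_optimizer *opt) {
    double sum_sq = 0.0;
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
        for (int j = 0; j < param->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(param, j, val, TOFU_FLOAT);
            sum_sq += (double)val * val;
        }
    }
    return sqrt(sum_sq);  /* L2 norm over all parameters */
}

double compute_max_gradient(tofu_optimizer *opt) {
    double max_abs = 0.0;
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
        if (!grad) continue;
        for (int j = 0; j < grad->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
            if (fabs(val) > max_abs) max_abs = fabs(val);
        }
    }
    return max_abs;
}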

Save Checkpoints

Periodically save model parameters during training:

void save_parameters(tofu_optimizer *opt, const char *filename) {
    FILE *f = fopen(filename, "wb");
    if (!f) return;

    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
        fwrite(param->data, 1, param->len * sizeof(float), f);
    }

    fclose(f);
}

// Use during training
for (int epoch = 0; epoch < 100; epoch++) {
    // ... training loop ...

    // Save every 10 epochs
    if (epoch % 10 == 0) {
        char filename[256];
        snprintf(filename, sizeof(filename), "model_epoch_%d.bin", epoch);
        save_parameters(opt, filename);
    }
}
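
For completeness, a matching loader (a sketch mirroring save_parameters above, assuming the checkpoint was written by the same code with the same parameter order and float storage):

void load_parameters(tofu_optimizer *opt, const char *filename) {
    FILE *f = fopen(filename, "rb");
    if (!f) return;

    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
        size_t expected = param->len * sizeof(float);
        if (fread(param->data, 1, expected, f) != expected) {
            fprintf(stderr, "Checkpoint ended early at parameter %d\n", i);
            break;
        }
    }

    fclose(f);
}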

Start Conservative

Begin with conservative hyperparameters and increase aggressiveness only if needed:

// Conservative defaults
double learning_rate = 0.01;  // Not too high
double momentum = 0.9;         // Standard momentum

tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, learning_rate, momentum);

It's easier to increase the learning rate if training is too slow than to recover from instability caused by too-high rates.

Test on Small Data First

Before training on the full dataset, verify your setup on a small subset:

// Test with 10 batches first
int test_batches = 10;

for (int epoch = 0; epoch < 5; epoch++) {
    for (int batch = 0; batch < test_batches; batch++) {
        // ... training loop ...
    }
}

// If loss decreases on small data, scale to full dataset

This quickly reveals issues with the model, loss function, or optimizer configuration.

Use Learning Rate Warmup for High Rates

When using aggressive learning rates (> 0.05), warm up gradually:

double target_lr = 0.1;
int warmup_epochs = 5;

for (int epoch = 0; epoch < 100; epoch++) {
    if (epoch < warmup_epochs) {
        opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
    } else {
        opt->learning_rate = target_lr;
    }

    // ... training loop ...
}

Document Your Configuration

Keep track of hyperparameters that work:

// Document successful configurations
printf("Configuration:\n");
printf("  Optimizer: SGD with Momentum\n");
printf("  Learning Rate: %.6f\n", opt->learning_rate);
printf("  Momentum: %.2f\n", 0.9);
printf("  Batch Size: %d\n", batch_size);
printf("  Schedule: Step decay by 0.1 every 30 epochs\n");

This helps when you need to replicate results or adjust for similar problems.

Complete Example

Here's a complete training example that demonstrates all the concepts from this guide.

Problem: Binary Classification

We'll train a two-layer neural network to classify binary data.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu.h"

// Network architecture: input(10) -> hidden(20) -> output(1)

// Forward declarations for the synthetic-data helpers defined after main()
tofu_tensor* generate_batch_data(int batch_size, int input_dim);
tofu_tensor* generate_batch_labels(int batch_size, int output_dim);

int main(void) {
    // Hyperparameters
    const int input_dim = 10;
    const int hidden_dim = 20;
    const int output_dim = 1;
    const int batch_size = 32;
    const int num_batches = 100;
    const int num_epochs = 50;
    const double learning_rate = 0.01;
    const double momentum = 0.9;

    // Create graph
    tofu_graph *g = tofu_graph_create();

    // Initialize parameters
    tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){input_dim, hidden_dim}, TOFU_FLOAT);
    tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){hidden_dim}, TOFU_FLOAT);
    tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){hidden_dim, output_dim}, TOFU_FLOAT);
    tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){output_dim}, TOFU_FLOAT);

    // Add parameters to graph
    tofu_graph_node *W1_node = tofu_graph_param(g, W1);
    tofu_graph_node *b1_node = tofu_graph_param(g, b1);
    tofu_graph_node *W2_node = tofu_graph_param(g, W2);
    tofu_graph_node *b2_node = tofu_graph_param(g, b2);

    // Create optimizer
    tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, learning_rate, momentum);

    printf("Training Configuration:\n");
    printf("  Architecture: %d -> %d -> %d\n", input_dim, hidden_dim, output_dim);
    printf("  Optimizer: SGD with Momentum\n");
    printf("  Learning Rate: %.4f\n", learning_rate);
    printf("  Momentum: %.2f\n", momentum);
    printf("  Batch Size: %d\n", batch_size);
    printf("  Epochs: %d\n\n", num_epochs);

    // Training loop
    for (int epoch = 0; epoch < num_epochs; epoch++) {
        double epoch_loss = 0.0;

        // Learning rate schedule: reduce by 0.1 every 20 epochs
        if (epoch % 20 == 0 && epoch > 0) {
            opt->learning_rate *= 0.1;
            printf("Reduced learning rate to %.6f\n", opt->learning_rate);
        }

        for (int batch = 0; batch < num_batches; batch++) {
            // 1. Zero gradients
            tofu_optimizer_zero_grad(opt);

            // 2. Generate synthetic batch data (normally loaded from dataset)
            tofu_tensor *batch_x = generate_batch_data(batch_size, input_dim);
            tofu_tensor *batch_y = generate_batch_labels(batch_size, output_dim);

            // 3. Forward pass
            tofu_graph_node *x = tofu_graph_input(g, batch_x);

            // Layer 1: h = relu(x @ W1 + b1)
            tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
            h1 = tofu_graph_add(g, h1, b1_node);
            h1 = tofu_graph_relu(g, h1);

            // Layer 2: pred = h @ W2 + b2
            tofu_graph_node *pred = tofu_graph_matmul(g, h1, W2_node);
            pred = tofu_graph_add(g, pred, b2_node);

            // Loss
            tofu_graph_node *target = tofu_graph_input(g, batch_y);
            tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

            // Track loss
            float loss_val;
            TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
            epoch_loss += loss_val;

            // 4. Backward pass
            tofu_graph_backward(g, loss);

            // 5. Update parameters
            tofu_optimizer_step(opt);

            // 6. Clear operations
            tofu_graph_clear_ops(g);

            // Free batch data
            tofu_tensor_free_data_too(batch_x);
            tofu_tensor_free_data_too(batch_y);
        }

        // Log progress
        double avg_loss = epoch_loss / num_batches;
        printf("Epoch %2d: Loss = %.6f\n", epoch, avg_loss);

        // Early stopping
        if (avg_loss < 0.001) {
            printf("Converged! Stopping early.\n");
            break;
        }
    }

    // Cleanup
    tofu_optimizer_free(opt);
    tofu_graph_free(g);
    tofu_tensor_free_data_too(W1);
    tofu_tensor_free_data_too(b1);
    tofu_tensor_free_data_too(W2);
    tofu_tensor_free_data_too(b2);

    printf("\nTraining complete!\n");

    return 0;
}

// Helper function to generate synthetic data (replace with real data loading)
tofu_tensor* generate_batch_data(int batch_size, int input_dim) {
    float *data = (float*)malloc(batch_size * input_dim * sizeof(float));
    for (int i = 0; i < batch_size * input_dim; i++) {
        data[i] = ((float)rand() / RAND_MAX) * 2.0f - 1.0f;  // Random in [-1, 1]
    }
    tofu_tensor *t = tofu_tensor_create_with_values(data, 2,
                                                     (int[]){batch_size, input_dim});
    free(data);
    return t;
}

tofu_tensor* generate_batch_labels(int batch_size, int output_dim) {
    float *data = (float*)malloc(batch_size * output_dim * sizeof(float));
    for (int i = 0; i < batch_size * output_dim; i++) {
        data[i] = ((float)rand() / RAND_MAX) > 0.5f ? 1.0f : 0.0f;
    }
    tofu_tensor *t = tofu_tensor_create_with_values(data, 2,
                                                     (int[]){batch_size, output_dim});
    free(data);
    return t;
}

This example demonstrates:

  1. Creating a computation graph with parameters
  2. Building a multi-layer network
  3. Setting up an optimizer with momentum
  4. Implementing a learning rate schedule
  5. Proper training loop structure
  6. Monitoring loss over time
  7. Correct cleanup order

Adapt this template for your specific problem by:

  • Changing network architecture (layer sizes, activations)
  • Loading real data instead of synthetic batches
  • Adding validation/test evaluation
  • Saving model checkpoints
  • Implementing gradient clipping if needed

Summary

Optimizers are the engine of neural network training. They take computed gradients and update parameters to minimize loss. Here are the key takeaways:

Core Concepts:

  • Gradient descent follows gradients downhill toward lower loss
  • Learning rate controls step size (most important hyperparameter)
  • Momentum accumulates velocity to accelerate convergence and dampen oscillations

Choosing an Optimizer:

  • Start with SGD + momentum (learning_rate=0.01, momentum=0.9)
  • Use vanilla SGD only for memory-constrained or very simple problems
  • Deep networks benefit from higher momentum (0.95-0.99)

Using Optimizers:

  • Always follow the pattern: zero_grad, forward, backward, step
  • Clear operations between batches to manage memory
  • Monitor training metrics (loss, gradients, parameter norms)

Tuning:

  • Start with learning_rate=0.01
  • Increase if training is too slow, decrease if unstable
  • Use learning rate schedules for long training runs
  • Implement gradient clipping for unstable gradients

Troubleshooting:

  • NaN loss: Reduce learning rate, clip gradients
  • No progress: Increase learning rate, add momentum
  • Oscillations: Reduce learning rate, increase momentum
  • Slow convergence: Use learning rate schedule, higher momentum

With these principles, you can train neural networks effectively and debug issues when they arise. Experiment with different configurations, monitor training carefully, and adjust based on what you observe. The best optimizer and hyperparameters depend on your specific problem, so be prepared to iterate.

Loss Functions

Loss functions are the core mechanism that guides neural network training. They measure how far your model's predictions are from the true values, producing a single scalar number that quantifies the error. During training, the optimizer uses gradients of this loss to adjust model parameters and improve predictions.

This guide explains how loss functions work, when to use each type, and how to integrate them into your training loops. You'll learn to choose the right loss function for your task and interpret loss values during training.


Introduction

Every machine learning model needs a way to evaluate how well it's performing. A loss function (also called an objective function or cost function) provides this evaluation by computing a numerical score representing prediction error.

During training:

  1. The model makes predictions on input data
  2. The loss function compares predictions to true target values
  3. The result is a single number (scalar) representing total error
  4. Gradients of this loss tell us how to adjust weights
  5. The optimizer updates weights to reduce the loss

The choice of loss function depends on your task type (regression vs classification) and the structure of your data. Tofu provides two fundamental loss functions that cover most use cases:

  • Mean Squared Error (MSE): For regression tasks where you predict continuous values
  • Cross-Entropy Loss: For classification tasks where you predict discrete classes

Let's explore the fundamental properties all loss functions must have, then dive into each type.


Loss Function Fundamentals

To work correctly with gradient-based optimization, loss functions must satisfy three key requirements.

1. Objective Function

A loss function defines the optimization objective—the quantity we want to minimize during training. Lower loss means better predictions. The training process iteratively adjusts model parameters to find weights that minimize this function.

Think of it like hiking down a mountain in fog. The loss value tells you your current altitude, and the gradient tells you which direction is downhill. Your goal is to reach the lowest point (minimize loss).

2. Scalar Output

Loss functions must return a single number (scalar), not a vector or matrix. This scalar summarizes all prediction errors across all samples and features into one value.

Why scalar? Because optimization algorithms need a single objective to minimize. You can't simultaneously minimize multiple conflicting objectives without combining them into one number.

Example shapes:

Predictions:  [batch_size, features]  e.g., [32, 10]
Targets:      [batch_size, features]  e.g., [32, 10]
Loss:         [1]                     (scalar)

The loss computation typically:

  1. Computes per-element errors
  2. Sums or averages across all elements
  3. Returns a single scalar value

3. Differentiable

Loss functions must be differentiable (smooth, with computable gradients). The gradient tells us how loss changes when we adjust each parameter—it's the compass that guides optimization.

Non-differentiable functions (like step functions or absolute value at zero) create problems for gradient-based optimizers. They can't compute meaningful gradients, so training fails or converges slowly.

Mathematically, we need:

∂L/∂w  (gradient of loss L with respect to each weight w)

Tofu's automatic differentiation system computes these gradients automatically through backpropagation, so you don't need to derive formulas manually.

Loss Function Workflow

Here's how a loss function fits into one training iteration:

1. Forward Pass:
   Input → Model → Predictions

2. Loss Computation:
   Loss = loss_function(Predictions, Targets)

3. Backward Pass:
   Compute ∂Loss/∂weights for all parameters

4. Parameter Update:
   weights = weights - learning_rate * ∂Loss/∂weights

5. Repeat until loss is minimized

Now let's examine specific loss functions and when to use them.


Mean Squared Error

Mean Squared Error (MSE) is the most common loss function for regression tasks—problems where you predict continuous numerical values rather than discrete classes.

Mathematical Formula

MSE computes the average squared difference between predictions and targets:

MSE = (1/n) * Σ(prediction - target)²

Where:
- n = number of elements (batch_size × features)
- Σ = sum over all elements

The squaring operation ensures:

  • Errors are always positive (negative errors don't cancel positive ones)
  • Large errors are penalized more heavily than small errors
  • The function is smooth and differentiable everywhere
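
To make the formula concrete, here is the computation written out in plain C (an illustration only; in Tofu you call tofu_graph_mse_loss instead):

float mse(const float *pred, const float *target, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = pred[i] - target[i];  /* per-element error */
        sum += diff * diff;                /* squared, so always positive */
    }
    return sum / n;                        /* average over all elements */
}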

Implementation in Tofu

Use tofu_graph_mse_loss() to add MSE loss to your computation graph:

tofu_graph_node* tofu_graph_mse_loss(tofu_graph* g,
                                      tofu_graph_node* pred,
                                      tofu_graph_node* target);

Parameters:

  • g: Computation graph
  • pred: Model predictions (any shape)
  • target: True target values (must match pred shape)

Returns: Scalar loss node (shape [1])

Requirements:

  • pred and target must have identical shapes
  • Both must be non-NULL

When to Use MSE

MSE is ideal for regression problems:

Perfect use cases:

  • Predicting house prices (continuous dollar values)
  • Estimating temperature (continuous degrees)
  • Forecasting stock prices
  • Predicting ages, distances, or other continuous quantities
  • Image denoising (pixel value reconstruction)

Why it works:

  • Treats all dimensions equally
  • Penalizes large errors more than small ones (squared term)
  • Has nice mathematical properties (convex, smooth gradients)
  • Easy to interpret (units are squared target units)

When NOT to use:

  • Classification tasks (use cross-entropy instead)
  • When outliers are common (MSE heavily penalizes outliers)
  • When you care about percentage error rather than absolute error

Practical Example

Here's a complete regression example predicting house prices:

// Setup: Simple linear regression y = x @ W + b
tofu_graph *g = tofu_graph_create();

// Training data: 4 samples with 2 features each
float input_data[] = {
    1.0f, 2.0f,   // Sample 1: [sqft=1000, bedrooms=2]
    2.0f, 3.0f,   // Sample 2: [sqft=2000, bedrooms=3]
    3.0f, 4.0f,   // Sample 3: [sqft=3000, bedrooms=4]
    4.0f, 5.0f    // Sample 4: [sqft=4000, bedrooms=5]
};

// Target prices (in thousands of dollars)
float target_data[] = {
    150.0f,  // $150k
    250.0f,  // $250k
    350.0f,  // $350k
    450.0f   // $450k
};

// Create tensors
tofu_tensor *x_tensor = tofu_tensor_create(
    input_data, 2, (int[]){4, 2}, TOFU_FLOAT);
tofu_tensor *y_tensor = tofu_tensor_create(
    target_data, 2, (int[]){4, 1}, TOFU_FLOAT);

// Model parameters (weights and bias)
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){2, 1}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){1}, TOFU_FLOAT);

// Build computation graph
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// Forward pass: prediction = x @ W + b
tofu_graph_node *matmul_result = tofu_graph_matmul(g, x, W);
tofu_graph_node *prediction = tofu_graph_add(g, matmul_result, b);

// Target
tofu_graph_node *target = tofu_graph_input(g, y_tensor);

// Compute MSE loss
tofu_graph_node *loss = tofu_graph_mse_loss(g, prediction, target);

// Get loss value
tofu_tensor *loss_value = tofu_graph_get_value(loss);
float loss_scalar;
TOFU_TENSOR_DATA_TO(loss_value, 0, loss_scalar, TOFU_FLOAT);
printf("MSE Loss: %.6f\n", loss_scalar);

// Backward pass to compute gradients
tofu_graph_backward(g, loss);

// Now W->grad and b->grad contain gradients for parameter updates

Understanding MSE Values

MSE values depend on the scale of your target values:

Small targets (e.g., normalized to [0, 1]):

  • Good MSE: < 0.01
  • Acceptable: 0.01 - 0.1
  • Poor: > 0.1

Large targets (e.g., house prices in thousands of dollars):

  • MSE = 10,000 means the average error is √10,000 = 100, i.e. about $100k
  • MSE = 1,000 means the average error is √1,000 ≈ 32, i.e. about $32k
  • MSE = 100 means the average error is √100 = 10, i.e. about $10k

Tip: Take the square root of MSE to get Root Mean Squared Error (RMSE), which has the same units as your target variable and is easier to interpret.
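
For example, after reading the MSE into loss_scalar as in the example above (add <math.h> for sqrtf):

float rmse = sqrtf(loss_scalar);  /* same units as the target variable */
printf("RMSE: %.4f\n", rmse);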

Gradient Behavior

The gradient of MSE with respect to predictions is:

∂MSE/∂pred = (2/n) * (pred - target)

Key properties:

  • Gradient magnitude is proportional to error size
  • Large errors produce large gradients (faster learning)
  • Small errors produce small gradients (slower learning)
  • Can cause exploding gradients if predictions are very wrong

Cross-Entropy Loss

Cross-Entropy Loss (also called log loss) is the standard loss function for classification tasks—problems where you assign inputs to discrete categories.

Mathematical Formula

Cross-entropy measures the difference between predicted probability distributions and true labels:

CE = -(1/n) * Σ(target * log(prediction))

Where:
- n = batch_size × num_classes
- target = one-hot encoded true class (or class probabilities)
- prediction = softmax probabilities (sum to 1)
- log = natural logarithm

The formula rewards correct classifications (low loss) and heavily penalizes confident wrong predictions (high loss).
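
Written out in plain C (an illustration only; in Tofu you call tofu_graph_ce_loss), with n = batch_size × num_classes as defined above, <math.h> for logf, and a small epsilon to avoid log(0):

float cross_entropy(const float *probs, const float *targets, int n) {
    const float eps = 1e-7f;
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += targets[i] * logf(probs[i] + eps);  /* only true-class terms contribute for one-hot targets */
    return -sum / n;
}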

Why Cross-Entropy for Classification?

Cross-entropy has special properties that make it ideal for classification:

  1. Probabilistic interpretation: It measures the "surprise" of predictions given the true distribution
  2. Strong gradients: Even for small probability errors, gradients remain strong enough to drive learning
  3. Numerical stability: Works well with softmax activation (more on this below)
  4. Theoretical foundation: Derived from maximum likelihood estimation in statistics

MSE doesn't work well for classification because:

  • It treats class probabilities as arbitrary numbers, ignoring their sum-to-one constraint
  • Gradients vanish when the model is confident but wrong
  • No probabilistic interpretation

Softmax and Cross-Entropy Connection

Cross-entropy is almost always used with softmax activation on the final layer. Here's why:

Softmax converts raw scores (logits) into probabilities:

softmax(x_i) = exp(x_i) / Σ(exp(x_j))

Properties:
- All outputs are in range (0, 1)
- Outputs sum to 1 (valid probability distribution)
- Highlights the maximum value (turns scores into confident predictions)
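
To illustrate what softmax computes (in Tofu you call tofu_graph_softmax rather than writing this yourself), here is a numerically stable version for one row of logits, assuming <math.h> for expf:

void softmax_row(const float *logits, float *probs, int n) {
    /* Subtract the row maximum before exponentiating to avoid overflow */
    float max_val = logits[0];
    for (int i = 1; i < n; i++)
        if (logits[i] > max_val) max_val = logits[i];

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        probs[i] = expf(logits[i] - max_val);
        sum += probs[i];
    }
    for (int i = 0; i < n; i++)
        probs[i] /= sum;  /* outputs lie in (0, 1) and sum to 1 */
}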

Together, softmax + cross-entropy creates a powerful combination:

  • Softmax outputs represent class probabilities
  • Cross-entropy compares these probabilities to true labels
  • Gradients flow efficiently even when predictions are wrong

Implementation in Tofu

Use tofu_graph_ce_loss() with softmax probabilities:

tofu_graph_node* tofu_graph_ce_loss(tofu_graph* g,
                                     tofu_graph_node* pred,
                                     tofu_graph_node* target);

Parameters:

  • g: Computation graph
  • pred: Predicted probabilities from softmax (shape: [batch, num_classes])
  • target: One-hot encoded true labels (shape: [batch, num_classes])

Returns: Scalar loss node (shape [1])

Requirements:

  • pred must be softmax probabilities (values in [0, 1], sum to 1 per sample)
  • target must be one-hot encoded (1 for true class, 0 for others)
  • Shapes must match

Numerical stability: Tofu's implementation adds epsilon (1e-7) to avoid log(0), which would be undefined.

When to Use Cross-Entropy

Cross-entropy is ideal for classification:

Perfect use cases:

  • Image classification (cat vs dog vs bird)
  • Text classification (spam vs not spam)
  • Sentiment analysis (positive/negative/neutral)
  • Multi-class problems (10 digits, 1000 object categories, etc.)
  • Any problem with discrete categorical outputs

Why it works:

  • Designed for probability distributions
  • Strong gradients throughout training
  • Natural pairing with softmax
  • Well-studied theoretical properties

When NOT to use:

  • Regression problems (use MSE instead)
  • Multi-label classification where multiple classes can be true simultaneously (requires binary cross-entropy per class)

Practical Example: MNIST-style Digit Classification

Here's a complete classification example for 4 classes:

// Setup: Neural network for 4-class classification
tofu_graph *g = tofu_graph_create();

// Training data: 8 samples with 10 features each
float input_data[8 * 10] = { /* ...fill with data... */ };

// One-hot encoded labels (8 samples, 4 classes)
float label_data[8 * 4] = {
    1, 0, 0, 0,  // Sample 0: class 0
    0, 1, 0, 0,  // Sample 1: class 1
    0, 0, 1, 0,  // Sample 2: class 2
    0, 0, 0, 1,  // Sample 3: class 3
    1, 0, 0, 0,  // Sample 4: class 0
    0, 1, 0, 0,  // Sample 5: class 1
    0, 0, 1, 0,  // Sample 6: class 2
    0, 0, 0, 1   // Sample 7: class 3
};

// Create tensors
tofu_tensor *x_tensor = tofu_tensor_create(
    input_data, 2, (int[]){8, 10}, TOFU_FLOAT);
tofu_tensor *y_tensor = tofu_tensor_create(
    label_data, 2, (int[]){8, 4}, TOFU_FLOAT);

// Model parameters
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){10, 4}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);

// Build graph
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// Forward pass: logits = x @ W + b
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
logits = tofu_graph_add(g, logits, b);

// Softmax activation (converts logits to probabilities)
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);  // axis=1

// Target labels
tofu_graph_node *target = tofu_graph_input(g, y_tensor);

// Cross-entropy loss
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);

// Get loss value
tofu_tensor *loss_value = tofu_graph_get_value(loss);
float loss_scalar;
TOFU_TENSOR_DATA_TO(loss_value, 0, loss_scalar, TOFU_FLOAT);
printf("Cross-Entropy Loss: %.6f\n", loss_scalar);

// Backward pass
tofu_graph_backward(g, loss);

// Gradients are now available in W->grad and b->grad

Understanding Cross-Entropy Values

Cross-entropy values depend on the number of classes:

Binary classification (2 classes):

  • Random guessing: ~0.693 (log(2))
  • Good model: < 0.3
  • Excellent model: < 0.1

Multi-class (e.g., 10 classes):

  • Random guessing: ~2.303 (log(10))
  • Good model: < 1.0
  • Excellent model: < 0.5

Key insight: Cross-entropy can never be negative. Zero loss means perfect predictions (100% confidence in correct class). As predictions get worse, loss increases without bound.

Interpreting Loss During Training

Watch for these patterns:

Healthy training:

Epoch 0:  loss = 2.30  (random initialization)
Epoch 10: loss = 1.20  (learning started)
Epoch 50: loss = 0.50  (converging)
Epoch 100: loss = 0.15  (well-trained)

Problems:

  • Loss stays at log(num_classes): Model isn't learning (check learning rate)
  • Loss increases: Learning rate too high or numerical instability
  • Loss plateaus early: Model too simple or data too hard

Gradient Behavior

The gradient of cross-entropy with respect to predictions is:

∂CE/∂pred = -(1/n) * (target / pred)

Key properties:

  • When prediction is very wrong (pred ≈ 0 but target = 1), gradient is very large
  • When prediction is correct and confident (pred ≈ 1 and target = 1), gradient is small
  • This creates strong learning signals when needed most

Combined with softmax, the gradient with respect to the logits simplifies beautifully:

∂CE/∂logits = pred - target   (where pred = softmax(logits))

This is why softmax + cross-entropy is the gold standard for classification.


Choosing a Loss Function

Selecting the right loss function is critical—it defines what "good" means for your model. Here's a decision guide.

Decision Tree

Start: What type of problem are you solving?
│
├─ Predicting continuous values (numbers)?
│  └─ Use Mean Squared Error (MSE)
│     Examples: Regression, image denoising, forecasting
│
└─ Predicting discrete categories (classes)?
   └─ Use Cross-Entropy Loss
      Examples: Classification, object recognition, sentiment analysis

Quick Reference Table

Task Type                    Loss Function   Output Activation     Target Format
Regression                   MSE             None (linear)         Continuous values
Binary Classification        Cross-Entropy   Softmax (2 classes)   One-hot [0,1] or [1,0]
Multi-class Classification   Cross-Entropy   Softmax               One-hot encoding
Image Reconstruction         MSE             None or Sigmoid       Pixel values

Detailed Recommendations

Use MSE when:

  • Output is a continuous number (prices, temperatures, distances)
  • You care about absolute error magnitude
  • Your task is regression or reconstruction
  • Outliers are rare or acceptable

Use Cross-Entropy when:

  • Output is a discrete category (class label)
  • You need probability predictions
  • Your task is classification
  • You want strong gradients throughout training

Example scenarios:

Problem                  Input                       Output             Loss            Why
House price prediction   Features (sqft, bedrooms)   Price ($)          MSE             Continuous value
Spam detection           Email text                  Spam/Not Spam      Cross-Entropy   Binary classification
Digit recognition        Image pixels                Digit (0-9)        Cross-Entropy   Multi-class classification
Temperature forecast     Historical data             Temperature (°F)   MSE             Continuous value
Sentiment analysis       Review text                 Pos/Neg/Neutral    Cross-Entropy   Multi-class classification

Common Mistakes

Mistake 1: Using MSE for classification

// WRONG: Using MSE to predict classes
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *loss = tofu_graph_mse_loss(g, logits, target);  // Bad!

Problem: MSE treats class probabilities as arbitrary numbers, leading to weak gradients and poor convergence.

Fix: Use softmax + cross-entropy:

// CORRECT: Classification with cross-entropy
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);  // Good!

Mistake 2: Forgetting softmax before cross-entropy

// WRONG: Cross-entropy without softmax
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *loss = tofu_graph_ce_loss(g, logits, target);  // Bad!

Problem: Cross-entropy expects probabilities (sum to 1), but logits are raw scores.

Fix: Always apply softmax first:

// CORRECT: Softmax before cross-entropy
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);  // Good!

Mistake 3: Wrong target format

// WRONG: Class indices instead of one-hot for cross-entropy
float targets[] = {0, 1, 2, 1};  // Class indices

Problem: Cross-entropy expects one-hot encoded targets, not class indices.

Fix: Convert to one-hot:

// CORRECT: One-hot encoded targets
float targets[] = {
    1, 0, 0,  // Class 0
    0, 1, 0,  // Class 1
    0, 0, 1,  // Class 2
    0, 1, 0   // Class 1
};
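
If your labels start out as class indices, a small helper along these lines (hypothetical plain C, not part of Tofu) can build the one-hot array:

// Hypothetical helper: convert class indices to a flat one-hot array.
// labels[i] must be in [0, num_classes); out must hold num_samples * num_classes floats.
void to_one_hot(const int *labels, int num_samples, int num_classes, float *out) {
    for (int i = 0; i < num_samples * num_classes; i++) {
        out[i] = 0.0f;
    }
    for (int i = 0; i < num_samples; i++) {
        out[i * num_classes + labels[i]] = 1.0f;
    }
}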

Loss in Training Loop

The loss function integrates into the training loop at a specific point in the forward-backward cycle. Understanding this workflow ensures correct implementation.

Training Loop Structure

A typical training iteration follows this pattern:

1. Zero gradients       (clear previous gradients)
2. Forward pass         (compute predictions)
3. Compute loss         (evaluate predictions)
4. Backward pass        (compute gradients via backpropagation)
5. Optimizer step       (update parameters)
6. Repeat

Loss computation happens after the forward pass and before the backward pass. It's the bridge connecting prediction to optimization.

Complete Training Loop Example

Here's a full training loop showing loss integration:

// Setup: Create graph, parameters, and optimizer
tofu_graph *g = tofu_graph_create();

tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);

tofu_graph_node *W_node = tofu_graph_param(g, weights);
tofu_graph_node *b_node = tofu_graph_param(g, bias);

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop
const int NUM_EPOCHS = 100;
const int BATCH_SIZE = 32;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float epoch_loss = 0.0f;
    int num_batches = 0;

    // num_batches_in_dataset, load_batch_input, and load_batch_target are
    // placeholders for your own data-loading code
    for (int batch = 0; batch < num_batches_in_dataset; batch++) {
        // 1. Zero gradients
        tofu_optimizer_zero_grad(opt);

        // 2. Forward pass
        // Load batch data
        tofu_tensor *batch_x = load_batch_input(batch);
        tofu_tensor *batch_y = load_batch_target(batch);

        tofu_graph_node *x = tofu_graph_input(g, batch_x);
        tofu_graph_node *y_true = tofu_graph_input(g, batch_y);

        // Model forward pass
        tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
        tofu_graph_node *pred = tofu_graph_add(g, h, b_node);

        // 3. Compute loss
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, y_true);

        // Extract loss value for monitoring
        tofu_tensor *loss_tensor = tofu_graph_get_value(loss);
        float loss_value;
        TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_value, TOFU_FLOAT);
        epoch_loss += loss_value;

        // 4. Backward pass
        tofu_graph_backward(g, loss);

        // 5. Optimizer step
        tofu_optimizer_step(opt);

        // Cleanup batch resources
        tofu_tensor_free(batch_x);
        tofu_tensor_free(batch_y);

        // Clear graph operations (keeps parameters)
        tofu_graph_clear_ops(g);

        num_batches++;
    }

    // Report progress
    float avg_loss = epoch_loss / num_batches;
    printf("Epoch %d: loss = %.6f\n", epoch, avg_loss);
}

Monitoring Loss During Training

Track loss values across epochs to monitor training progress:

// Loss tracking
float loss_history[NUM_EPOCHS];

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    // ... training code ...

    loss_history[epoch] = avg_loss;

    // Print every 10 epochs
    if (epoch % 10 == 0) {
        printf("Epoch %3d: loss = %.6f\n", epoch, avg_loss);
    }
}

Loss Curves

Visualizing loss over time reveals training behavior:

Healthy training curve:

Loss
│
2.0 ┤●
    │ ●
1.5 ┤  ●
    │   ●●
1.0 ┤     ●●
    │       ●●●
0.5 ┤          ●●●●●●●●●●●
    └──────────────────────────> Epoch

Characteristics:

  • Smooth decrease
  • Eventually plateaus
  • No wild fluctuations

Warning signs:

Loss increasing:
    │    ●●●
    │  ●●
    │●●
    └─────> Learning rate too high

Loss plateauing early:
    │●●●●●●●●●●●●●●
    │
    └─────> Model too simple or stuck

Loss oscillating:
    │ ● ● ● ● ●
    │● ● ● ● ● ●
    └─────> Batch size too small or LR too high

Understanding Loss Values

Interpreting loss values correctly helps diagnose training issues and assess model quality.

Absolute Loss Magnitude

Loss value interpretation depends heavily on context:

For MSE:

  • Scale depends on target value range
  • MSE = 100 is terrible for normalized data [0, 1]
  • MSE = 100 might be excellent for house prices in thousands
  • Always consider: What's the typical magnitude of your targets?

For Cross-Entropy:

  • Random guessing baseline: log(num_classes)
  • Binary classification random: 0.693
  • 10-class random: 2.303
  • Perfect predictions: 0.0

Rule of thumb: Compare loss to a baseline (random guessing or simple heuristic) to assess improvement.
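
As a rough sketch in plain C (independent of Tofu), both baselines can be computed directly from your labels or targets:

#include <math.h>

// Classification baseline: random guessing gives cross-entropy ≈ log(num_classes)
float ce_random_baseline(int num_classes) {
    return logf((float)num_classes);
}

// Regression baseline: always predicting the mean gives MSE = variance of the targets
float mse_mean_baseline(const float *targets, int n) {
    float mean = 0.0f;
    for (int i = 0; i < n; i++) mean += targets[i];
    mean /= n;

    float var = 0.0f;
    for (int i = 0; i < n; i++) var += (targets[i] - mean) * (targets[i] - mean);
    return var / n;
}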

Relative Changes Matter More

Focus on loss trends rather than absolute values:

// Good trend (decreasing)
Epoch 0:   loss = 1.50
Epoch 10:  loss = 1.20  (20% reduction)
Epoch 20:  loss = 0.85  (29% reduction)
Epoch 50:  loss = 0.45  (47% reduction)

// Bad trend (increasing)
Epoch 0:   loss = 1.50
Epoch 10:  loss = 1.65  (increasing - problem!)

Common Loss Troubleshooting

Problem: Loss is NaN or infinite

Causes:

  • Learning rate too high (exploding gradients)
  • Numerical overflow in loss computation
  • Invalid data (NaN in input)

Fixes:

// 1. Reduce learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.001);  // Was 0.1

// 2. Check for NaN in data
for (int i = 0; i < tensor->len; i++) {
    float val;
    TOFU_TENSOR_DATA_TO(tensor, i, val, TOFU_FLOAT);
    if (isnan(val) || isinf(val)) {
        fprintf(stderr, "Invalid data at index %d\n", i);
    }
}

// 3. Add gradient clipping (manual)
tofu_tensor *grad = tofu_graph_get_grad(param_node);
// Clip gradients to [-max_grad, max_grad]
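
Expanding on step 3, here is a minimal clipping sketch, assuming tofu_graph_get_grad returns the parameter's gradient tensor as above and using the element-access macros from the Tensor API:

// Clamp every gradient element to [-max_grad, max_grad]
const float max_grad = 1.0f;
tofu_tensor *grad = tofu_graph_get_grad(param_node);
for (int i = 0; i < grad->len; i++) {
    float g_val;
    TOFU_TENSOR_DATA_TO(grad, i, g_val, TOFU_FLOAT);
    if (g_val > max_grad)  g_val = max_grad;
    if (g_val < -max_grad) g_val = -max_grad;
    TOFU_TENSOR_DATA_FROM(grad, i, g_val, TOFU_FLOAT);
}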

Problem: Loss doesn't decrease

Causes:

  • Learning rate too low
  • Model too simple (can't fit data)
  • Weights initialized poorly
  • Wrong loss function for task

Fixes:

// 1. Increase learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1);  // Was 0.001

// 2. Add hidden layers (increase model capacity)
tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
    tofu_graph_matmul(g, x, W1), b1));
tofu_graph_node *h2 = tofu_graph_relu(g, tofu_graph_add(g,
    tofu_graph_matmul(g, h1, W2), b2));

// 3. Check you're using the right loss function
// Classification → Cross-Entropy, Regression → MSE

Problem: Loss plateaus too early

Causes:

  • Model capacity too small
  • Learning rate needs adjustment
  • Reached local minimum
  • Need more training time

Fixes:

// 1. Train longer
const int NUM_EPOCHS = 500;  // Was 100

// 2. Add capacity
// Increase hidden layer size or add more layers

// 3. Try learning rate schedule
float lr = (epoch < 50) ? 0.1 : 0.01;  // Reduce LR after 50 epochs

Problem: Loss oscillates wildly

Causes:

  • Learning rate too high
  • Batch size too small
  • Numerical instability

Fixes:

// 1. Reduce learning rate
lr = 0.001;  // Was 0.1

// 2. Increase batch size
BATCH_SIZE = 64;  // Was 16

// 3. Switch to an adaptive optimizer such as Adam (its momentum-like updates smooth training)
tofu_optimizer *opt = tofu_optimizer_adam_create(g, 0.001);

Comparing Train vs Validation Loss

Always monitor loss on held-out validation data:

// Training loop with validation
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    // Training
    float train_loss = train_one_epoch(g, opt, train_data);

    // Validation (no gradient updates)
    float val_loss = evaluate_loss(g, val_data);

    printf("Epoch %d: train_loss=%.4f, val_loss=%.4f\n",
           epoch, train_loss, val_loss);

    // Check for overfitting
    if (val_loss > train_loss * 1.5) {
        printf("Warning: Model may be overfitting\n");
    }
}

Healthy pattern:

Train loss: 0.50, Val loss: 0.55  (close - good generalization)

Overfitting:

Train loss: 0.10, Val loss: 0.80  (gap too large - overfitting)

Advanced Topics

Beyond basic loss functions, advanced techniques can improve training stability and model performance.

Loss Weighting

Sometimes you want to emphasize certain samples or classes. Loss weighting adjusts the contribution of individual samples.

Class weighting for imbalanced data:

If you have 90% negative samples and 10% positive samples in binary classification, the model may ignore the minority class. Weight the minority class higher:

// Manually weight loss by class
// Assume we have per-sample weights
float class_weights[2] = {1.0f, 9.0f};  // Weight minority class 9x

// Compute weighted loss manually
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss_unweighted = tofu_graph_ce_loss(g, probs, target);

// Get loss and multiply by weights
tofu_tensor *loss_tensor = tofu_graph_get_value(loss_unweighted);
float loss_val;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);

// Weight by the sample's true class (simplified - production code would weight per-sample)
int true_class = /* class index of this sample's target */;
float weighted_loss = loss_val * class_weights[true_class];

Note: Tofu doesn't have built-in weighted loss functions yet, so implement weighting manually or at the sample level.

Regularization Loss Terms

Regularization adds a penalty term to prevent overfitting:

Total Loss = Task Loss + λ * Regularization Term

Where λ controls regularization strength

L2 Regularization (Weight Decay):

Penalize large weights to prevent overfitting:

// Compute L2 regularization manually
float l2_penalty = 0.0f;
const float lambda = 0.01f;

tofu_tensor *W = param_tensor;
for (int i = 0; i < W->len; i++) {
    float w;
    TOFU_TENSOR_DATA_TO(W, i, w, TOFU_FLOAT);
    l2_penalty += w * w;
}
l2_penalty *= lambda;

// Add to loss
float total_loss = task_loss + l2_penalty;

Note: Most optimizers (like Adam) have built-in weight decay support, which is more efficient than manual regularization.

Custom Loss Functions

For specialized tasks, you may need custom losses. Implement them by:

  1. Computing the loss value using tensor operations
  2. Implementing the backward pass (gradient computation)

Example: Huber loss (robust to outliers)

// Huber loss: Combines MSE (small errors) with MAE (large errors)
// Loss = 0.5 * (pred - target)^2        if |error| < delta
//      = delta * (|error| - 0.5*delta)  otherwise

// This requires implementing a custom graph operation
// (beyond basic usage - see advanced tutorials)

For most use cases, MSE and cross-entropy are sufficient. Custom losses require deeper knowledge of Tofu's backward pass implementation.
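
For illustration only, here is a forward-only sketch that evaluates the Huber loss from two same-shape TOFU_FLOAT tensors using the element-access macros. It provides no gradients, so it is suitable for monitoring rather than training:

#include <math.h>

// Forward-only Huber loss over two same-shape TOFU_FLOAT tensors (no gradients).
float huber_loss_value(tofu_tensor *pred, tofu_tensor *target, float delta) {
    float total = 0.0f;
    for (int i = 0; i < pred->len; i++) {
        float p, t;
        TOFU_TENSOR_DATA_TO(pred, i, p, TOFU_FLOAT);
        TOFU_TENSOR_DATA_TO(target, i, t, TOFU_FLOAT);
        float err = p - t;
        float abs_err = fabsf(err);
        if (abs_err < delta) {
            total += 0.5f * err * err;                  // quadratic region (small errors)
        } else {
            total += delta * (abs_err - 0.5f * delta);  // linear region (large errors)
        }
    }
    return total / pred->len;                           // mean over elements
}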

Multi-Task Learning

Train one model for multiple tasks by combining losses:

// Example: Predict both class and bounding box
tofu_graph_node *class_logits = /* classification head */;
tofu_graph_node *bbox_pred = /* regression head */;

// Classification loss
tofu_graph_node *class_probs = tofu_graph_softmax(g, class_logits, 1);
tofu_graph_node *class_loss = tofu_graph_ce_loss(g, class_probs, class_target);

// Bounding box regression loss
tofu_graph_node *bbox_loss = tofu_graph_mse_loss(g, bbox_pred, bbox_target);

// Combined loss (weighted sum)
// Note: Must be done manually, as Tofu doesn't support adding loss nodes yet.
// extract_scalar is a small user-defined helper (sketched below), not a Tofu API call.
float class_loss_val = extract_scalar(class_loss);
float bbox_loss_val = extract_scalar(bbox_loss);
float total_loss = class_loss_val + 0.5 * bbox_loss_val;  // Weight bbox 50%
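
The extract_scalar call above is not a Tofu API function. A minimal sketch using calls documented elsewhere in this guide could look like:

// Hypothetical helper: read the scalar value of a loss node.
static float extract_scalar(tofu_graph_node *node) {
    tofu_tensor *t = tofu_graph_get_value(node);
    float value;
    TOFU_TENSOR_DATA_TO(t, 0, value, TOFU_FLOAT);
    return value;
}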

Complete Examples

Let's walk through two complete, practical examples: regression and classification.

Example 1: Regression - House Price Prediction

Goal: Predict house prices from square footage and number of bedrooms.

#include <stdio.h>
#include <stdlib.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"

int main() {
    // Dataset: 4 houses
    float features[] = {
        1000.0f, 2.0f,  // 1000 sqft, 2 bedrooms → $150k
        1500.0f, 3.0f,  // 1500 sqft, 3 bedrooms → $200k
        2000.0f, 3.0f,  // 2000 sqft, 3 bedrooms → $250k
        2500.0f, 4.0f   // 2500 sqft, 4 bedrooms → $300k
    };

    float prices[] = {150.0f, 200.0f, 250.0f, 300.0f};

    // Create tensors
    tofu_tensor *X = tofu_tensor_create(features, 2, (int[]){4, 2}, TOFU_FLOAT);
    tofu_tensor *y = tofu_tensor_create(prices, 2, (int[]){4, 1}, TOFU_FLOAT);

    // Model parameters (linear regression: y = X @ W + b)
    tofu_tensor *W = tofu_tensor_zeros(2, (int[]){2, 1}, TOFU_FLOAT);
    tofu_tensor *b = tofu_tensor_zeros(1, (int[]){1}, TOFU_FLOAT);

    // Create graph and optimizer
    tofu_graph *g = tofu_graph_create();
    tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.0001);  // Small LR

    // Training loop
    for (int epoch = 0; epoch < 1000; epoch++) {
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x_node = tofu_graph_input(g, X);
        tofu_graph_node *W_node = tofu_graph_param(g, W);
        tofu_graph_node *b_node = tofu_graph_param(g, b);
        tofu_graph_node *y_node = tofu_graph_input(g, y);

        tofu_graph_node *pred = tofu_graph_add(g,
            tofu_graph_matmul(g, x_node, W_node), b_node);

        // MSE loss
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, y_node);

        // Extract loss value
        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);

        // Backward and optimize
        tofu_graph_backward(g, loss);
        tofu_optimizer_step(opt);

        if (epoch % 100 == 0) {
            printf("Epoch %4d: MSE = %.2f\n", epoch, loss_val);
        }

        tofu_graph_clear_ops(g);
    }

    // Final predictions
    printf("\nFinal predictions:\n");
    tofu_graph_node *x_node = tofu_graph_input(g, X);
    tofu_graph_node *W_node = tofu_graph_param(g, W);
    tofu_graph_node *b_node = tofu_graph_param(g, b);
    tofu_graph_node *pred = tofu_graph_add(g,
        tofu_graph_matmul(g, x_node, W_node), b_node);

    tofu_tensor *predictions = tofu_graph_get_value(pred);
    for (int i = 0; i < 4; i++) {
        float pred_price;
        TOFU_TENSOR_DATA_TO(predictions, i, pred_price, TOFU_FLOAT);
        printf("House %d: Predicted=%.1f, Actual=%.1f\n",
               i, pred_price, prices[i]);
    }

    // Cleanup
    tofu_optimizer_free(opt);
    tofu_graph_free(g);
    tofu_tensor_free(X);
    tofu_tensor_free(y);
    tofu_tensor_free_data_too(W);
    tofu_tensor_free_data_too(b);

    return 0;
}

Expected output:

Epoch    0: MSE = 42500.00
Epoch  100: MSE = 1250.50
Epoch  200: MSE = 523.75
Epoch  900: MSE = 12.30

Final predictions:
House 0: Predicted=148.5, Actual=150.0
House 1: Predicted=201.2, Actual=200.0
House 2: Predicted=251.8, Actual=250.0
House 3: Predicted=298.7, Actual=300.0

Example 2: Classification - XOR Problem

Goal: Learn the XOR function (classic non-linear classification).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"

// Simplified Xavier-style uniform initialization (uses fan_in only)
float xavier_init(int fan_in) {
    float limit = sqrtf(6.0f / fan_in);
    return limit * (2.0f * rand() / RAND_MAX - 1.0f);
}

int main() {
    // XOR dataset
    float inputs[] = {
        0, 0,  // → 0
        0, 1,  // → 1
        1, 0,  // → 1
        1, 1   // → 0
    };

    // One-hot targets (2 classes: [1,0] = class 0, [0,1] = class 1)
    float targets[] = {
        1, 0,  // XOR(0,0) = 0
        0, 1,  // XOR(0,1) = 1
        0, 1,  // XOR(1,0) = 1
        1, 0   // XOR(1,1) = 0
    };

    // Create tensors
    tofu_tensor *X = tofu_tensor_create(inputs, 2, (int[]){4, 2}, TOFU_FLOAT);
    tofu_tensor *y = tofu_tensor_create(targets, 2, (int[]){4, 2}, TOFU_FLOAT);

    // Model: 2 → 4 → 2 (need hidden layer for non-linearity)
    tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){2, 4}, TOFU_FLOAT);
    tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
    tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){4, 2}, TOFU_FLOAT);
    tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){2}, TOFU_FLOAT);

    // Xavier initialization for W1, W2
    for (int i = 0; i < W1->len; i++) {
        float val = xavier_init(2);
        TOFU_TENSOR_DATA_FROM(W1, i, val, TOFU_FLOAT);
    }
    for (int i = 0; i < W2->len; i++) {
        float val = xavier_init(4);
        TOFU_TENSOR_DATA_FROM(W2, i, val, TOFU_FLOAT);
    }

    // Create graph and optimizer
    tofu_graph *g = tofu_graph_create();
    tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.5);  // Higher LR

    // Training loop
    for (int epoch = 0; epoch < 2000; epoch++) {
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x = tofu_graph_input(g, X);
        tofu_graph_node *w1 = tofu_graph_param(g, W1);
        tofu_graph_node *b1_node = tofu_graph_param(g, b1);
        tofu_graph_node *w2 = tofu_graph_param(g, W2);
        tofu_graph_node *b2_node = tofu_graph_param(g, b2);
        tofu_graph_node *y_node = tofu_graph_input(g, y);

        // Layer 1: x @ W1 + b1 → ReLU
        tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
            tofu_graph_matmul(g, x, w1), b1_node));

        // Layer 2: h1 @ W2 + b2 → softmax
        tofu_graph_node *logits = tofu_graph_add(g,
            tofu_graph_matmul(g, h1, w2), b2_node);
        tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

        // Cross-entropy loss
        tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, y_node);

        float loss_val;
        TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);

        // Backward and optimize
        tofu_graph_backward(g, loss);
        tofu_optimizer_step(opt);

        if (epoch % 200 == 0) {
            printf("Epoch %4d: CE Loss = %.4f\n", epoch, loss_val);
        }

        tofu_graph_clear_ops(g);
    }

    // Test predictions
    printf("\nFinal predictions:\n");
    tofu_graph_node *x = tofu_graph_input(g, X);
    tofu_graph_node *w1 = tofu_graph_param(g, W1);
    tofu_graph_node *b1_node = tofu_graph_param(g, b1);
    tofu_graph_node *w2 = tofu_graph_param(g, W2);
    tofu_graph_node *b2_node = tofu_graph_param(g, b2);

    tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
        tofu_graph_matmul(g, x, w1), b1_node));
    tofu_graph_node *logits = tofu_graph_add(g,
        tofu_graph_matmul(g, h1, w2), b2_node);
    tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

    tofu_tensor *predictions = tofu_graph_get_value(probs);

    for (int i = 0; i < 4; i++) {
        float prob0, prob1;
        TOFU_TENSOR_DATA_TO(predictions, i*2, prob0, TOFU_FLOAT);
        TOFU_TENSOR_DATA_TO(predictions, i*2+1, prob1, TOFU_FLOAT);
        int pred_class = (prob1 > prob0) ? 1 : 0;
        int true_class = (targets[i*2+1] > 0.5f) ? 1 : 0;

        printf("[%.0f, %.0f] → Pred=%d (%.3f, %.3f), True=%d\n",
               inputs[i*2], inputs[i*2+1],
               pred_class, prob0, prob1, true_class);
    }

    // Cleanup
    tofu_optimizer_free(opt);
    tofu_graph_free(g);
    tofu_tensor_free(X);
    tofu_tensor_free(y);
    tofu_tensor_free_data_too(W1);
    tofu_tensor_free_data_too(b1);
    tofu_tensor_free_data_too(W2);
    tofu_tensor_free_data_too(b2);

    return 0;
}

Expected output:

Epoch    0: CE Loss = 0.7120
Epoch  200: CE Loss = 0.4532
Epoch  400: CE Loss = 0.2145
Epoch 1800: CE Loss = 0.0523

Final predictions:
[0, 0] → Pred=0 (0.972, 0.028), True=0
[0, 1] → Pred=1 (0.045, 0.955), True=1
[1, 0] → Pred=1 (0.039, 0.961), True=1
[1, 1] → Pred=0 (0.968, 0.032), True=0

Best Practices

Follow these guidelines for effective loss function usage:

1. Match Loss to Task Type

Always use:

  • MSE for regression
  • Cross-Entropy for classification

Never mix them: Using the wrong loss leads to poor convergence and incorrect learning.

2. Monitor Loss During Training

// Log loss to file or console
FILE *log = fopen("training_log.txt", "w");
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float loss = train_epoch(...);
    fprintf(log, "%d,%.6f\n", epoch, loss);
    if (epoch % 10 == 0) {
        printf("Epoch %d: loss=%.6f\n", epoch, loss);
    }
}
fclose(log);

Track trends, not just final values. Loss should decrease smoothly over time.

3. Use Appropriate Learning Rates

Loss behavior reveals learning rate issues:

// Too high: Loss explodes or oscillates wildly
// Solution: Reduce by 10x
lr = 0.01;  // Was 0.1

// Too low: Loss barely decreases
// Solution: Increase by 10x
lr = 0.1;  // Was 0.01

4. Normalize Your Data

Large input/output ranges cause numerical instability:

// Bad: Raw house prices ($100k - $500k)
float price = 250000.0f;

// Good: Normalized to reasonable range
float price_normalized = (250000.0f - mean) / std_dev;
// or
float price_scaled = 250000.0f / 1000.0f;  // Scale to [100-500]

Normalization prevents exploding gradients and improves convergence.
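
A minimal z-score normalization sketch over a plain float array (the mean and standard deviation would come from your training data):

#include <math.h>

// Normalize values in-place to zero mean and unit variance.
void normalize_inplace(float *values, int n) {
    float mean = 0.0f;
    for (int i = 0; i < n; i++) mean += values[i];
    mean /= n;

    float var = 0.0f;
    for (int i = 0; i < n; i++) var += (values[i] - mean) * (values[i] - mean);
    float std_dev = sqrtf(var / n + 1e-8f);  // epsilon avoids division by zero

    for (int i = 0; i < n; i++) values[i] = (values[i] - mean) / std_dev;
}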

5. Check for Numerical Issues

// Add checks during training
if (isnan(loss_val) || isinf(loss_val)) {
    fprintf(stderr, "ERROR: Loss is %f at epoch %d\n", loss_val, epoch);
    // Reduce learning rate or check data
    break;
}

6. Compare to Baselines

Always establish a baseline before training:

// Baseline 1: Random predictions
// For classification: loss ≈ log(num_classes)
// For regression: loss ≈ variance of targets

// Baseline 2: Simple heuristic
// Classification: Always predict most common class
// Regression: Always predict mean target value

printf("Random baseline loss: %.4f\n", baseline_loss);
printf("Trained model loss: %.4f\n", final_loss);
printf("Improvement: %.1f%%\n",
       100.0 * (baseline_loss - final_loss) / baseline_loss);

7. Use Validation Data

Never trust training loss alone:

// Split data: 80% train, 20% validation
float train_loss = evaluate_loss(g, train_data);
float val_loss = evaluate_loss(g, val_data);

if (val_loss > train_loss * 1.5) {
    printf("Warning: Possible overfitting\n");
}

8. Save Best Model Based on Validation Loss

float best_val_loss = INFINITY;
tofu_tensor *best_W = NULL;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    train_epoch(...);
    float val_loss = evaluate_validation(...);

    if (val_loss < best_val_loss) {
        best_val_loss = val_loss;
        // Save model parameters
        if (best_W) tofu_tensor_free_data_too(best_W);
        best_W = tofu_tensor_clone(W);
        printf("New best model at epoch %d: val_loss=%.4f\n",
               epoch, val_loss);
    }
}

9. Early Stopping

Stop training when validation loss stops improving:

int patience = 20;  // Wait 20 epochs for improvement
int no_improve_count = 0;

for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
    float val_loss = train_and_validate(...);

    if (val_loss < best_val_loss) {
        best_val_loss = val_loss;
        no_improve_count = 0;
    } else {
        no_improve_count++;
    }

    if (no_improve_count >= patience) {
        printf("Early stopping at epoch %d\n", epoch);
        break;
    }
}

10. Document Your Loss Function Choice

// At the top of your training code, document decisions:
/*
 * Model: Image classifier for 10 classes
 * Loss: Cross-Entropy (classification task)
 * Architecture: Input → 128 → 64 → 10 (softmax)
 * Optimizer: SGD with lr=0.01
 * Expected loss: Random ~2.3, Target <0.5
 */

This helps future debugging and maintains clear expectations.


Summary

Loss functions are the foundation of neural network training. Key takeaways:

  1. Match loss to task:

    • Regression → MSE
    • Classification → Cross-Entropy (with softmax)
  2. Loss must be:

    • Scalar (single number)
    • Differentiable (smooth gradients)
    • Representative of task objective
  3. Monitor loss trends:

    • Decreasing = learning
    • Plateauing = convergence or stuck
    • Increasing = problem (LR too high, numerical issues)
  4. Interpret loss in context:

    • Compare to baselines (random guessing)
    • Track validation loss (detect overfitting)
    • Understand scale (depends on data range)
  5. Debug with loss values:

    • NaN/Inf → Check learning rate, data validity
    • No decrease → Increase LR or model capacity
    • Oscillation → Reduce LR or increase batch size

With proper loss function selection and monitoring, you'll train neural networks that converge reliably and achieve strong performance on your task.

For more details on loss function implementation and gradients, see the Graph API Reference. For optimizer integration, see the Optimizer User Guide.

Linear Regression

Tutorial on implementing linear regression with Tofu.

Classification

Tutorial on building classification models.

CNN Training

Training Convolutional Neural Networks with Tofu.

Residual Networks

Building and training residual networks (ResNets).

Memory Management

Best practices for memory management in Tofu.

Error Handling

Best practices for error handling.

Debugging

Tips and techniques for debugging Tofu applications.

Performance

Performance optimization tips and techniques.

Tensor API Reference

The Tensor API provides the core data structures and operations for Tofu. Tensors are multi-dimensional arrays that support automatic differentiation when used with the Graph API.


Data Structures

tofu_tensor

The core tensor structure representing a multi-dimensional array.

struct tofu_tensor {
    tofu_dtype dtype;           // Data type (TOFU_FLOAT, TOFU_INT32, etc.)
    int len;                    // Total number of elements
    int ndim;                   // Number of dimensions
    int *dims;                  // Array of dimension sizes
    void *data;                 // Pointer to data buffer
    struct tofu_tensor *owner;  // Data owner (NULL if self-owned)
    void *backend_data;         // Backend-specific data
};

Data Types (tofu_dtype)

Supported tensor data types:

  • TOFU_FLOAT - 32-bit floating point (most common for neural networks)
  • TOFU_DOUBLE - 64-bit floating point
  • TOFU_INT32 - 32-bit signed integer
  • TOFU_INT64 - 64-bit signed integer
  • TOFU_INT16 - 16-bit signed integer
  • TOFU_INT8 - 8-bit signed integer
  • TOFU_UINT32 - 32-bit unsigned integer
  • TOFU_UINT64 - 64-bit unsigned integer
  • TOFU_UINT16 - 16-bit unsigned integer
  • TOFU_UINT8 - 8-bit unsigned integer
  • TOFU_BOOL - Boolean type

Element-wise Operations (tofu_elew_op)

  • TOFU_MUL - Multiplication (*)
  • TOFU_DIV - Division (/)
  • TOFU_SUM - Addition (+)
  • TOFU_SUB - Subtraction (-)
  • TOFU_MAX - Element-wise maximum
  • TOFU_MIN - Element-wise minimum
  • TOFU_POW - Power (^)

Creation Functions

tofu_tensor_create

Create a tensor with an existing data buffer.

tofu_tensor *tofu_tensor_create(void *data, int ndim, const int *dims, tofu_dtype dtype);

Parameters:

  • data - Pointer to data buffer (cannot be NULL)
  • ndim - Number of dimensions (must be > 0 and <= TOFU_MAXDIM = 8)
  • dims - Array of dimension sizes, length must be ndim
  • dtype - Data type (TOFU_FLOAT, TOFU_INT32, etc.)

Returns: Pointer to newly allocated tensor (caller owns, must call tofu_tensor_free)

Ownership:

  • The tensor does NOT take ownership of the data buffer
  • Caller must manage data lifetime and free both tensor and data separately
  • Even when passed to tofu_graph_param(), caller still owns tensor

Example:

float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
int dims[] = {2, 2};
tofu_tensor *t = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);

// Use tensor...

tofu_tensor_free(t);  // Free tensor structure
// data is still valid - free manually if needed

Notes:

  • Typical pattern: create tensor → use in graph → tofu_graph_free() → tofu_tensor_free() → free data
  • Violating preconditions triggers assert() and crashes

See also: tofu_tensor_zeros, tofu_tensor_create_with_values


tofu_tensor_create_with_values

Create a tensor with heap-allocated copy of provided values.

tofu_tensor *tofu_tensor_create_with_values(const float *values, int ndim, const int *dims);

Parameters:

  • values - Array of initial values (cannot be NULL)
  • ndim - Number of dimensions (must be > 0 and <= TOFU_MAXDIM)
  • dims - Array of dimension sizes, length must be ndim

Returns: Pointer to newly allocated tensor with copied data (caller owns, must call tofu_tensor_free_data_too)

Important:

  • Creates heap-allocated copy of values (safe for gradients)
  • DO NOT use compound literals like (float[]){1.0f} as they create stack memory
  • Number of values must match product of dims
  • Caller must call tofu_tensor_free_data_too to free both tensor and data

Example:

float values[] = {1.0f, 2.0f, 3.0f, 4.0f};
int dims[] = {2, 2};
tofu_tensor *t = tofu_tensor_create_with_values(values, 2, dims);

// Use tensor...

tofu_tensor_free_data_too(t);  // Free both tensor and data

tofu_tensor_zeros

Create a zero-initialized tensor with allocated data buffer.

tofu_tensor *tofu_tensor_zeros(int ndim, const int *dims, tofu_dtype dtype);

Parameters:

  • ndim - Number of dimensions (must be > 0 and <= TOFU_MAXDIM)
  • dims - Array of dimension sizes, length must be ndim
  • dtype - Data type (TOFU_FLOAT, TOFU_INT32, etc.)

Returns: Pointer to newly allocated zero-filled tensor (caller owns, must call tofu_tensor_free_data_too)

Ownership:

  • Allocates both tensor structure and data buffer
  • Caller must call tofu_tensor_free_data_too to free both
  • Even when passed to tofu_graph_param(), caller still owns tensor

Example:

int dims[] = {3, 4};
tofu_tensor *t = tofu_tensor_zeros(2, dims, TOFU_FLOAT);

// All elements are 0.0f
// Use tensor...

tofu_tensor_free_data_too(t);  // Free both tensor and data

See also: tofu_tensor_create, tofu_tensor_clone


tofu_tensor_clone

Create a deep copy of a tensor.

tofu_tensor *tofu_tensor_clone(const tofu_tensor *src);

Parameters:

  • src - Source tensor to clone (cannot be NULL)

Returns: Pointer to newly allocated tensor (caller owns, must call tofu_tensor_free_data_too)

Behavior:

  • Creates both new tensor structure and new data buffer
  • Copies all data from source to new tensor
  • Preserves shape and data type

Example:

tofu_tensor *original = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *copy = tofu_tensor_clone(original);

// copy is independent of original
tofu_tensor_free_data_too(copy);
tofu_tensor_free_data_too(original);

tofu_tensor_repeat

Create a tensor by repeating data multiple times.

tofu_tensor *tofu_tensor_repeat(const tofu_tensor *src, int times);

Parameters:

  • src - Source tensor to repeat (cannot be NULL)
  • times - Number of repetitions (must be > 0)

Returns: Pointer to newly allocated tensor (caller owns, must call tofu_tensor_free_data_too)

Behavior:

  • Creates new tensor with size = src->len * times
  • Repeats source data sequentially

Example:

float data[] = {1.0f, 2.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){2}, TOFU_FLOAT);
tofu_tensor *repeated = tofu_tensor_repeat(t, 3);
// repeated contains: [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]

tofu_tensor_free_data_too(repeated);
tofu_tensor_free(t);

tofu_tensor_arange

Create a 1-D tensor with evenly spaced values (similar to NumPy arange).

tofu_tensor *tofu_tensor_arange(double start, double stop, double step, tofu_dtype dtype);

Parameters:

  • start - Starting value (inclusive)
  • stop - Ending value (exclusive)
  • step - Step size between values
  • dtype - Data type for the resulting tensor

Returns: Pointer to newly allocated 1-D tensor (caller owns, must call tofu_tensor_free_data_too)

Behavior:

  • Creates values [start, start+step, start+2*step, ..., stop)
  • Number of elements = ceil((stop - start) / step)

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 5.0, 1.0, TOFU_FLOAT);
// t contains: [0.0, 1.0, 2.0, 3.0, 4.0]

tofu_tensor_free_data_too(t);

See also: tofu_tensor_rearange for in-place filling


tofu_tensor_rearange

Fill existing tensor with evenly spaced values (in-place arange).

void tofu_tensor_rearange(tofu_tensor *src, double start, double stop, double step);

Parameters:

  • src - Tensor to fill (cannot be NULL)
  • start - Starting value (inclusive)
  • stop - Ending value (exclusive)
  • step - Step size between values

Behavior:

  • Fills tensor with [start, start+step, start+2*step, ...]
  • Number of values written is min(tensor size, ceil((stop-start)/step))
  • Modifies tensor data in-place
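
Example (a minimal sketch based on the behavior above):

tofu_tensor *t = tofu_tensor_zeros(1, (int[]){5}, TOFU_FLOAT);
tofu_tensor_rearange(t, 0.0, 5.0, 1.0);
// t now contains: [0.0, 1.0, 2.0, 3.0, 4.0]

tofu_tensor_free_data_too(t);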

Cleanup Functions

tofu_tensor_free

Free tensor structure (does NOT free data buffer).

void tofu_tensor_free(tofu_tensor *t);

Parameters:

  • t - Tensor to free (can be NULL, no-op if NULL)

Behavior:

  • Frees only the tensor structure and dims array
  • Does NOT free the data buffer - caller must free data separately
  • Safe to call even if tensor was used with tofu_graph_param()
  • Call AFTER tofu_graph_free() if tensor was used with graph

Example:

float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);

tofu_tensor_free(t);  // Free tensor structure only
// data is still valid

See also: tofu_tensor_free_data_too


tofu_tensor_free_data_too

Free both tensor structure and data buffer.

void tofu_tensor_free_data_too(tofu_tensor *t);

Parameters:

  • t - Tensor to free (can be NULL, no-op if NULL)

Behavior:

  • Frees both the tensor and its associated data buffer
  • Only use if tensor owns its data (created with tofu_tensor_zeros, tofu_tensor_clone, etc.)
  • Do NOT use if tensor was created with tofu_tensor_create() (use tofu_tensor_free)
  • Safe to call if tensor was used with tofu_graph_param()
  • Call AFTER tofu_graph_free() if tensor was used with graph

Example:

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);

tofu_tensor_free_data_too(t);  // Free both tensor and data

Warning: Using this on tensors created with tofu_tensor_create() will cause undefined behavior!


Shape Operations

tofu_tensor_size

Get total number of elements in tensor.

size_t tofu_tensor_size(tofu_tensor *t);

Parameters:

  • t - Tensor (cannot be NULL)

Returns: Total element count (product of all dimensions)

Example:

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
size_t size = tofu_tensor_size(t);  // Returns 12

tofu_tensor_reshape

Reshape tensor to new dimensions (view operation, no data copy).

tofu_tensor *tofu_tensor_reshape(tofu_tensor *src, int ndim, const int *dims);

Parameters:

  • src - Source tensor (cannot be NULL)
  • ndim - Number of dimensions for reshaped tensor
  • dims - Array of new dimension sizes

Returns: New tensor structure sharing data with source (caller owns, must call tofu_tensor_free)

Behavior:

  • Does NOT copy data - result shares memory with source
  • Only changes shape metadata, not data layout
  • Source must outlive result tensor
  • Product of dims must equal tofu_tensor_size(src)

Warning: Do NOT call tofu_tensor_free_data_too on the reshaped view - this would free the shared data while the source tensor still references it! Only use tofu_tensor_free on views.

Example:

tofu_tensor *t = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *reshaped = tofu_tensor_reshape(t, 2, (int[]){3, 4});

// reshaped is a view of t with shape [3, 4]
tofu_tensor_free(reshaped);  // Free view
tofu_tensor_free_data_too(t);  // Free original

See also: tofu_tensor_reshape_src for in-place reshape


tofu_tensor_reshape_src

Reshape tensor in-place (modifies source tensor metadata).

void tofu_tensor_reshape_src(tofu_tensor *src, int ndim, const int *dims);

Parameters:

  • src - Tensor to reshape (cannot be NULL)
  • ndim - Number of dimensions for reshaped tensor
  • dims - Array of new dimension sizes

Behavior:

  • Modifies src tensor structure in-place
  • Does NOT copy or reallocate data
  • Only changes shape metadata
  • Product of dims must equal tofu_tensor_size(src)

Example:

tofu_tensor *t = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});
// t now has shape [3, 4]

tofu_tensor_transpose

Transpose tensor by permuting dimensions.

tofu_tensor *tofu_tensor_transpose(const tofu_tensor *src, tofu_tensor *dst, const int *axes);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axes - Permutation array (can be NULL for reverse order)

Returns: Result tensor (caller owns if dst was NULL)

Behavior:

  • If axes is NULL, reverses dimension order (e.g., [2,3,4] → [4,3,2])
  • If axes is non-NULL, permutes according to axes (e.g., axes=[1,0] swaps dims)
  • For 2-D matrix, axes=NULL transposes (rows ↔ columns)

Example:

// Matrix transpose
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *transposed = tofu_tensor_transpose(matrix, NULL, NULL);
// transposed has shape [4, 3]

// Custom permutation
int axes[] = {2, 0, 1};
tofu_tensor *t3d = tofu_tensor_zeros(3, (int[]){2, 3, 4}, TOFU_FLOAT);
tofu_tensor *permuted = tofu_tensor_transpose(t3d, NULL, axes);
// permuted has shape [4, 2, 3]

tofu_tensor_slice

Extract slice from tensor (copies data).

tofu_tensor *tofu_tensor_slice(const tofu_tensor *src, tofu_tensor *dst,
                               int axis, int start, int len);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axis - Axis along which to slice
  • start - Starting index along axis
  • len - Length of slice

Returns: Result tensor (caller owns if dst was NULL)

Preconditions:

  • axis < src->ndim
  • start >= 0 and start + len <= src->dims[axis]
  • If dst is non-NULL, it must have correct shape for slice

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT);
tofu_tensor *slice = tofu_tensor_slice(t, NULL, 0, 2, 5);
// slice contains: [2.0, 3.0, 4.0, 5.0, 6.0]

See also: tofu_tensor_slice_nocopy for view without copying


tofu_tensor_slice_nocopy

Create view of tensor slice (no data copy).

tofu_tensor *tofu_tensor_slice_nocopy(tofu_tensor *src, tofu_tensor *dst,
                                      int axis, int start, int len);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axis - Axis along which to slice
  • start - Starting index along axis
  • len - Length of slice

Returns: Result tensor sharing data with source (caller owns if dst was NULL)

Behavior:

  • Does NOT copy data - result shares memory with source
  • Modifying result will modify source tensor
  • Source must outlive result tensor

Warning: This is a view operation - changes affect the original tensor!
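
Example (a minimal sketch; as with reshape views, free the view with tofu_tensor_free only):

tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT);
tofu_tensor *view = tofu_tensor_slice_nocopy(t, NULL, 0, 2, 5);
// view shares data with t and covers the elements [2.0, 3.0, 4.0, 5.0, 6.0]

float val = 99.0f;
TOFU_TENSOR_DATA_FROM(view, 0, val, TOFU_FLOAT);
// The corresponding element of t (index 2) is now 99.0 as well

tofu_tensor_free(view);          // free the view structure only
tofu_tensor_free_data_too(t);    // free the original tensor and its data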


tofu_tensor_concat

Concatenate two tensors along specified axis.

tofu_tensor *tofu_tensor_concat(const tofu_tensor *src1, const tofu_tensor *src2,
                                tofu_tensor *dst, int axis);

Parameters:

  • src1 - First tensor (cannot be NULL)
  • src2 - Second tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axis - Axis along which to concatenate

Returns: Result tensor (caller owns if dst was NULL)

Preconditions:

  • All dimensions except axis must match between src1 and src2

Behavior:

  • Result dims[axis] = src1->dims[axis] + src2->dims[axis]

Example:

tofu_tensor *a = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *concat = tofu_tensor_concat(a, b, NULL, 0);
// concat has shape [4, 3]

Mathematical Operations

tofu_tensor_matmul

Compute matrix multiplication with broadcasting.

tofu_tensor *tofu_tensor_matmul(const tofu_tensor *src1, const tofu_tensor *src2,
                                tofu_tensor *dst);

Parameters:

  • src1 - Left operand tensor (cannot be NULL)
  • src2 - Right operand tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)

Returns: Result tensor (caller owns if dst was NULL)

Preconditions:

  • For 1-D @ 1-D: src1->dims[0] must equal src2->dims[0]
  • For 2-D and higher: src1->dims[src1->ndim-1] must equal src2->dims[src2->ndim-2]

Behavior:

  • 1-D @ 1-D: Dot product → scalar
  • 2-D @ 2-D: Standard matrix multiplication
  • N-D @ 1-D: Matrix-vector (drops last dim)
  • 1-D @ N-D: Vector-matrix (drops first dim)
  • N-D @ N-D: Batch matmul with broadcasting

Example:

// Matrix multiplication
tofu_tensor *A = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *B = tofu_tensor_zeros(2, (int[]){4, 5}, TOFU_FLOAT);
tofu_tensor *C = tofu_tensor_matmul(A, B, NULL);
// C has shape [3, 5]

// Batch matrix multiplication
tofu_tensor *batch_A = tofu_tensor_zeros(3, (int[]){2, 3, 4}, TOFU_FLOAT);
tofu_tensor *batch_B = tofu_tensor_zeros(3, (int[]){2, 4, 5}, TOFU_FLOAT);
tofu_tensor *batch_C = tofu_tensor_matmul(batch_A, batch_B, NULL);
// batch_C has shape [2, 3, 5]

Notes:

  • Most commonly used operation for neural networks
  • Broadcasts batch dimensions automatically

See also: tofu_tensor_inner for inner product


tofu_tensor_inner

Compute inner product (sum-product over last axes).

tofu_tensor *tofu_tensor_inner(const tofu_tensor *src1, const tofu_tensor *src2,
                               tofu_tensor *dst);

Parameters:

  • src1 - First tensor (cannot be NULL)
  • src2 - Second tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)

Returns: Result tensor (caller owns if dst was NULL)

Preconditions:

  • src1->dims[src1->ndim-1] must equal src2->dims[src2->ndim-1]

Behavior:

  • 1-D × 1-D: Dot product → scalar
  • 2-D × 2-D: result[i,j] = sum(a[i,:] * b[j,:])
  • N-D × N-D: Cartesian product of non-last dimensions
  • Output shape: (*a.shape[:-1], *b.shape[:-1])

Example:

tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT);  // [0, 1, 2]
tofu_tensor *b = tofu_tensor_arange(1.0, 4.0, 1.0, TOFU_FLOAT);  // [1, 2, 3]
tofu_tensor *result = tofu_tensor_inner(a, b, NULL);
// result = 0*1 + 1*2 + 2*3 = 8.0

See also: tofu_tensor_matmul, tofu_tensor_outer


tofu_tensor_outer

Compute outer product (cartesian product without summation).

tofu_tensor *tofu_tensor_outer(const tofu_tensor *src1, const tofu_tensor *src2,
                               tofu_tensor *dst);

Parameters:

  • src1 - First tensor (cannot be NULL)
  • src2 - Second tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)

Returns: Result tensor (caller owns if dst was NULL)

Behavior:

  • Flattens both input tensors
  • Computes: result[i,j] = a[i] * b[j]
  • Always produces 2-D output
  • Output shape: [a.size, b.size] where size is total element count

Example:

tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT);  // [0, 1, 2]
tofu_tensor *b = tofu_tensor_arange(1.0, 3.0, 1.0, TOFU_FLOAT);  // [1, 2]
tofu_tensor *result = tofu_tensor_outer(a, b, NULL);
// result shape [3, 2]:
// [[0, 0],
//  [1, 2],
//  [2, 4]]

Element-wise Operations

tofu_tensor_elew

Apply element-wise binary operation with broadcasting.

tofu_tensor *tofu_tensor_elew(const tofu_tensor *src1, const tofu_tensor *src2,
                              tofu_tensor *dst, tofu_elew_op elew_op);

Parameters:

  • src1 - First tensor (cannot be NULL)
  • src2 - Second tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • elew_op - Operation to apply (TOFU_MUL, TOFU_DIV, TOFU_SUM, TOFU_SUB, TOFU_POW, etc.)

Returns: Result tensor (caller owns if dst was NULL)

Preconditions:

  • src1 and src2 must be broadcastable (NumPy rules)

Operations:

  • TOFU_MUL - Element-wise multiplication (*)
  • TOFU_DIV - Element-wise division (/)
  • TOFU_SUM - Element-wise addition (+)
  • TOFU_SUB - Element-wise subtraction (-)
  • TOFU_POW - Element-wise power (^)
  • TOFU_MAX - Element-wise maximum
  • TOFU_MIN - Element-wise minimum

Example:

tofu_tensor *a = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT);  // [1, 2, 3, 4]
tofu_tensor *b = tofu_tensor_arange(2.0, 6.0, 1.0, TOFU_FLOAT);  // [2, 3, 4, 5]

tofu_tensor *sum = tofu_tensor_elew(a, b, NULL, TOFU_SUM);
// sum = [3, 5, 7, 9]

tofu_tensor *prod = tofu_tensor_elew(a, b, NULL, TOFU_MUL);
// prod = [2, 6, 12, 20]

// Broadcasting example
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
float scalar_data[] = {2.0f};
tofu_tensor *scalar = tofu_tensor_create(scalar_data, 1, (int[]){1}, TOFU_FLOAT);
tofu_tensor *scaled = tofu_tensor_elew(matrix, scalar, NULL, TOFU_MUL);
// All elements of matrix multiplied by 2.0

See also: tofu_tensor_elew_param, tofu_tensor_elew_broadcast


tofu_tensor_elew_param

Apply element-wise operation between tensor and scalar.

tofu_tensor *tofu_tensor_elew_param(const tofu_tensor *src, double param,
                                    tofu_tensor *dst, tofu_elew_op elew_op);

Parameters:

  • src - Source tensor (cannot be NULL)
  • param - Scalar parameter
  • dst - Destination tensor (can be NULL to allocate new)
  • elew_op - Operation to apply

Returns: Result tensor with same shape as src (caller owns if dst was NULL)

Behavior:

  • Applies operation element-wise: op(tensor_element, param)

Example:

tofu_tensor *t = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT);  // [1, 2, 3, 4]

tofu_tensor *scaled = tofu_tensor_elew_param(t, 2.0, NULL, TOFU_MUL);
// scaled = [2, 4, 6, 8]

tofu_tensor *shifted = tofu_tensor_elew_param(t, 10.0, NULL, TOFU_SUM);
// shifted = [11, 12, 13, 14]

tofu_tensor *squared = tofu_tensor_elew_param(t, 2.0, NULL, TOFU_POW);
// squared = [1, 4, 9, 16]

tofu_tensor_elew_broadcast

Apply element-wise operation with automatic broadcasting.

tofu_tensor *tofu_tensor_elew_broadcast(const tofu_tensor *src1, const tofu_tensor *src2,
                                        tofu_tensor *dst, tofu_elew_op elew_op);

Parameters:

  • src1 - First tensor (cannot be NULL)
  • src2 - Second tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • elew_op - Operation to apply

Returns: Result tensor with broadcast shape (caller owns if dst was NULL)

Notes:

  • Automatically broadcasts inputs to compatible shape
  • Equivalent to tofu_tensor_elew but with explicit broadcast handling
  • Follows NumPy broadcasting rules

Reductions

tofu_tensor_sumreduce

Reduce tensor along axis using sum operation.

tofu_tensor *tofu_tensor_sumreduce(const tofu_tensor *src, tofu_tensor *dst, int axis);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axis - Axis along which to reduce

Returns: Result tensor with dims[axis] removed (caller owns if dst was NULL)

Behavior:

  • Output shape: src->dims with dims[axis] removed
  • Computes sum of all elements along specified axis

Example:

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
// Fill with 1.0
for (int i = 0; i < 12; i++) {
    float val = 1.0f;
    TOFU_TENSOR_DATA_FROM(t, i, val, TOFU_FLOAT);
}

tofu_tensor *row_sum = tofu_tensor_sumreduce(t, NULL, 1);
// row_sum has shape [3], each element = 4.0

tofu_tensor *col_sum = tofu_tensor_sumreduce(t, NULL, 0);
// col_sum has shape [4], each element = 3.0

See also: tofu_tensor_meanreduce, tofu_tensor_maxreduce


tofu_tensor_meanreduce

Reduce tensor along axis using mean operation.

tofu_tensor *tofu_tensor_meanreduce(const tofu_tensor *src, tofu_tensor *dst, int axis);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axis - Axis along which to reduce

Returns: Result tensor with dims[axis] removed (caller owns if dst was NULL)

Behavior:

  • Output shape: src->dims with dims[axis] removed
  • Computes arithmetic mean of all elements along specified axis

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});

tofu_tensor *row_mean = tofu_tensor_meanreduce(t, NULL, 1);
// row_mean has shape [3]
// row_mean[0] = mean([0,1,2,3]) = 1.5
// row_mean[1] = mean([4,5,6,7]) = 5.5
// row_mean[2] = mean([8,9,10,11]) = 9.5

tofu_tensor_maxreduce

Reduce tensor along axis using max operation.

tofu_tensor *tofu_tensor_maxreduce(const tofu_tensor *src, tofu_tensor *dst,
                                   tofu_tensor *arg, int axis);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • arg - Argmax indices tensor (can be NULL if indices not needed)
  • axis - Axis along which to reduce

Returns: Result tensor with dims[axis] removed (caller owns if dst was NULL)

Behavior:

  • Output shape: src->dims with dims[axis] removed
  • If arg is non-NULL, fills it with indices of maximum values

Example:

float data[] = {3.0f, 1.0f, 4.0f, 1.0f, 5.0f, 9.0f};
tofu_tensor *t = tofu_tensor_create(data, 2, (int[]){2, 3}, TOFU_FLOAT);

tofu_tensor *max_vals = tofu_tensor_maxreduce(t, NULL, NULL, 1);
// max_vals = [4.0, 9.0]

tofu_tensor *indices = tofu_tensor_zeros(1, (int[]){2}, TOFU_INT32);
tofu_tensor *max_vals_with_idx = tofu_tensor_maxreduce(t, NULL, indices, 1);
// indices = [2, 2] (position of max in each row)

tofu_tensor_sub_broadcast

Subtract reduced tensor from source with broadcasting.

tofu_tensor *tofu_tensor_sub_broadcast(const tofu_tensor *src, const tofu_tensor *reduced,
                                       tofu_tensor *dst, int axis);

Parameters:

  • src - Source tensor (cannot be NULL)
  • reduced - Reduced tensor to subtract (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axis - Axis along which reduction was performed

Returns: Result tensor with same shape as src (caller owns if dst was NULL)

Preconditions:

  • reduced->ndim = src->ndim - 1 (one dimension removed)

Behavior:

  • Broadcasts reduced tensor back along axis and subtracts
  • Useful for normalization operations (subtract mean, etc.)
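
Example (mean-centering each row; a minimal sketch combining tofu_tensor_meanreduce with tofu_tensor_sub_broadcast):

tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});

tofu_tensor *row_mean = tofu_tensor_meanreduce(t, NULL, 1);        // shape [3]
tofu_tensor *centered = tofu_tensor_sub_broadcast(t, row_mean, NULL, 1);
// Each row of centered now has zero mean, e.g. row 0 = [-1.5, -0.5, 0.5, 1.5]

tofu_tensor_free_data_too(centered);
tofu_tensor_free_data_too(row_mean);
tofu_tensor_free_data_too(t);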

Activation Functions

tofu_tensor_lrelu

Apply Leaky ReLU activation function.

tofu_tensor *tofu_tensor_lrelu(const tofu_tensor *src, tofu_tensor *dst, float negslope);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • negslope - Slope for negative values (typically 0.01)

Returns: Result tensor with same shape as src (caller owns if dst was NULL)

Behavior:

  • Computes: x if x >= 0, else negslope * x
  • Standard ReLU equivalent when negslope = 0

Example:

float data[] = {-2.0f, -1.0f, 0.0f, 1.0f, 2.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){5}, TOFU_FLOAT);

tofu_tensor *relu = tofu_tensor_lrelu(t, NULL, 0.0f);
// relu = [0.0, 0.0, 0.0, 1.0, 2.0]

tofu_tensor *leaky = tofu_tensor_lrelu(t, NULL, 0.01f);
// leaky = [-0.02, -0.01, 0.0, 1.0, 2.0]

Note: For use in computation graphs with automatic differentiation, use tofu_graph_relu() instead.


tofu_tensor_softmax

Apply softmax activation along specified axis.

tofu_tensor *tofu_tensor_softmax(const tofu_tensor *src, tofu_tensor *dst, int axis);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • axis - Axis along which to apply softmax

Returns: Result tensor with same shape as src (caller owns if dst was NULL)

Behavior:

  • Computes: exp(x_i) / sum(exp(x_j)) along axis
  • Uses numerically stable implementation (subtracts max before exp)
  • Output values sum to 1.0 along specified axis

Example:

float logits[] = {1.0f, 2.0f, 3.0f};
tofu_tensor *t = tofu_tensor_create(logits, 1, (int[]){3}, TOFU_FLOAT);
tofu_tensor *probs = tofu_tensor_softmax(t, NULL, 0);
// probs ≈ [0.09, 0.24, 0.67] (sums to 1.0)

Note: For use in computation graphs with automatic differentiation, use tofu_graph_softmax() instead.


tofu_tensor_layer_norm

Apply layer normalization with learnable affine transform.

tofu_tensor *tofu_tensor_layer_norm(const tofu_tensor *src, tofu_tensor *dst,
                                    const tofu_tensor *gamma, const tofu_tensor *beta,
                                    int axis, double eps);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • gamma - Scale parameter tensor (can be NULL for no scaling)
  • beta - Shift parameter tensor (can be NULL for no shift)
  • axis - Axis along which to normalize
  • eps - Small constant for numerical stability (typically 1e-5)

Returns: Result tensor with same shape as src (caller owns if dst was NULL)

Behavior:

  • Normalizes: (x - mean) / sqrt(variance + eps)
  • Then applies: gamma * normalized + beta (if gamma/beta non-NULL)
  • If gamma/beta are NULL, only normalization is applied

Example:

tofu_tensor *x = tofu_tensor_zeros(2, (int[]){2, 4}, TOFU_FLOAT);
float gamma_data[] = {1.0f, 1.0f, 1.0f, 1.0f};
float beta_data[] = {0.0f, 0.0f, 0.0f, 0.0f};
tofu_tensor *gamma = tofu_tensor_create(gamma_data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor *beta = tofu_tensor_create(beta_data, 1, (int[]){4}, TOFU_FLOAT);

tofu_tensor *normalized = tofu_tensor_layer_norm(x, NULL, gamma, beta, 1, 1e-5);

Utilities

tofu_tensor_issameshape

Check if two tensors have the same shape.

int tofu_tensor_issameshape(const tofu_tensor *t1, const tofu_tensor *t2);

Parameters:

  • t1 - First tensor (cannot be NULL)
  • t2 - Second tensor (cannot be NULL)

Returns: 1 if same shape, 0 otherwise


tofu_tensor_isbroadcastable

Check if two tensors can be broadcast together (NumPy semantics).

int tofu_tensor_isbroadcastable(const tofu_tensor *t1, const tofu_tensor *t2);

Parameters:

  • t1 - First tensor (cannot be NULL)
  • t2 - Second tensor (cannot be NULL)

Returns: 1 if broadcastable, 0 otherwise

Broadcasting Rules:

  • Arrays with fewer dimensions are prepended with size-1 dimensions
  • Size-1 dimensions are stretched to match the other array
  • Dimensions must match or one must be 1

Example:

tofu_tensor *a = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
int can_broadcast = tofu_tensor_isbroadcastable(a, b);  // Returns 1

tofu_tensor *c = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);
can_broadcast = tofu_tensor_isbroadcastable(a, c);  // Returns 0

tofu_tensor_broadcast_to

Broadcast tensor to specified shape (NumPy semantics).

tofu_tensor *tofu_tensor_broadcast_to(const tofu_tensor *src, tofu_tensor *dst,
                                      int ndim, const int *dims);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • ndim - Number of dimensions for target shape
  • dims - Target dimension sizes

Returns: Result tensor with target shape (caller owns if dst was NULL)

Preconditions:

  • src must be broadcastable to target shape (NumPy rules)

Behavior:

  • Follows NumPy broadcasting rules
  • Size-1 dimensions are stretched to match target
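
Example (a minimal sketch):

tofu_tensor *row = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
tofu_tensor *expanded = tofu_tensor_broadcast_to(row, NULL, 2, (int[]){3, 4});
// expanded has shape [3, 4]; each of its rows repeats the values of row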

tofu_tensor_print

Print tensor to stdout with custom format.

void tofu_tensor_print(const tofu_tensor *t, const char *fmt);

Parameters:

  • t - Tensor to print (cannot be NULL)
  • fmt - Format string for each element (e.g., "%.6f", "%d")

Example:

tofu_tensor *t = tofu_tensor_arange(0.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){2, 3});

tofu_tensor_print(t, "%.1f");
// Output:
// [[0.0, 1.0, 2.0],
//  [3.0, 4.0, 5.0]]

See also: tofu_tensor_fprint for printing to arbitrary stream, tofu_tensor_save for saving to file


tofu_tensor_fprint

Print tensor to file stream with custom format.

void tofu_tensor_fprint(FILE *stream, const tofu_tensor *t, const char *fmt);

Parameters:

  • stream - File stream to write to (cannot be NULL)
  • t - Tensor to print (cannot be NULL)
  • fmt - Format string for each element
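
Example (a minimal sketch writing to stderr):

tofu_tensor *t = tofu_tensor_arange(0.0, 4.0, 1.0, TOFU_FLOAT);
tofu_tensor_fprint(stderr, t, "%.2f");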

tofu_tensor_save

Save tensor to file with custom format.

int tofu_tensor_save(const char *file_name, const tofu_tensor *t, const char *fmt);

Parameters:

  • file_name - Path to output file (cannot be NULL)
  • t - Tensor to save (cannot be NULL)
  • fmt - Format string for each element

Returns: 0 on success, non-zero on error
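
Example (file name is illustrative):

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){2, 2}, TOFU_FLOAT);
if (tofu_tensor_save("tensor.txt", t, "%.6f") != 0) {
    // Handle write failure (e.g., bad path or permissions)
}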


tofu_tensor_convert

Convert tensor to different data type.

tofu_tensor *tofu_tensor_convert(const tofu_tensor *src, tofu_tensor *dst,
                                 tofu_dtype dtype_d);

Parameters:

  • src - Source tensor (cannot be NULL)
  • dst - Destination tensor (can be NULL to allocate new)
  • dtype_d - Target data type

Returns: Result tensor with same shape as src but different dtype (caller owns if dst was NULL)

Behavior:

  • Converts each element to target type with appropriate casting
  • May lose precision (e.g., float to int truncates)

Example:

float data[] = {1.7f, 2.3f, 3.9f};
tofu_tensor *floats = tofu_tensor_create(data, 1, (int[]){3}, TOFU_FLOAT);
tofu_tensor *ints = tofu_tensor_convert(floats, NULL, TOFU_INT32);
// ints = [1, 2, 3]

tofu_tensor_index

Convert multi-dimensional coordinates to flat index.

int tofu_tensor_index(const tofu_tensor *t, int *coords);

Parameters:

  • t - Tensor (cannot be NULL)
  • coords - Array of coordinates, length must be t->ndim

Returns: Flat index into tensor data array


tofu_tensor_coords

Convert flat index to multi-dimensional coordinates.

void tofu_tensor_coords(const tofu_tensor *t, int index, int *coords);

Parameters:

  • t - Tensor (cannot be NULL)
  • index - Flat index into tensor data array
  • coords - Output array for coordinates, length must be t->ndim
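
Example (a round-trip sketch; the computed index assumes a contiguous row-major layout):

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);

int coords[2] = {1, 2};
int idx = tofu_tensor_index(t, coords);   // e.g. 6 for a contiguous row-major [3, 4] tensor

int back[2];
tofu_tensor_coords(t, idx, back);         // back = {1, 2}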

Common Patterns

Working with Tensor Memory

// Pattern 1: User manages data buffer
float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);
// Use tensor...
tofu_tensor_free(t);
// data is still valid

// Pattern 2: Library manages data buffer
tofu_tensor *t2 = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
// Use tensor...
tofu_tensor_free_data_too(t2);
// Both tensor and data are freed

Accessing Tensor Elements

tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);

// Read element at index i
float value;
TOFU_TENSOR_DATA_TO(t, i, value, TOFU_FLOAT);

// Write element at index i
value = 42.0f;
TOFU_TENSOR_DATA_FROM(t, i, value, TOFU_FLOAT);

// Copy element from src[si] to dst[di]
TOFU_TENSOR_DATA_ASSIGN(dst, di, src, si);

Broadcasting Example

// Add scalar to matrix (broadcasting)
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *result = tofu_tensor_elew_param(matrix, 5.0, NULL, TOFU_SUM);

// Add vector to matrix rows (broadcasting)
tofu_tensor *row_vec = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
result = tofu_tensor_elew_broadcast(matrix, row_vec, NULL, TOFU_SUM);

Graph API Reference

The Graph API provides computational graph construction and automatic differentiation for training neural networks. It implements reverse-mode automatic differentiation (backpropagation) for computing gradients.

Data Structures

tofu_graph

The computation graph structure that manages all nodes and their relationships.

struct tofu_graph {
    tofu_graph_node** nodes;         // All nodes in graph
    int num_nodes;                   // Number of nodes
    int capacity;                    // Allocated capacity

    tofu_graph_node** topo_order;    // Nodes in reverse topological order
    int topo_size;                   // Size of topo_order
    int topo_capacity;               // Allocated capacity

    int next_id;                     // Next available node ID
};

tofu_graph_node

A node in the computation graph representing an operation or leaf value.

struct tofu_graph_node {
    int id;                          // Unique node ID within graph
    tofu_op_type op;                 // Operation type

    tofu_tensor* value;              // Forward pass result
    tofu_tensor* grad;               // Gradient (∂L/∂value)

    tofu_graph_node** inputs;        // Input nodes
    int num_inputs;                  // Number of inputs
    int capacity_inputs;             // Allocated capacity for inputs

    tofu_backward_fn backward_fn;    // Backward pass function
    void* backward_ctx;              // Context for backward (saved tensors, etc.)

    int requires_grad;               // Does this need gradient computation?
    int visited;                     // For topological sort

    tofu_graph* graph;               // Parent graph
};

Operation Types (tofu_op_type)

Enumeration of all supported operations:

  • Leaf nodes:
    • TOFU_OP_INPUT - Input node (no gradient)
    • TOFU_OP_PARAM - Trainable parameter (requires gradient)
  • Binary operations:
    • TOFU_OP_MATMUL - Matrix multiplication
    • TOFU_OP_ADD - Element-wise addition
    • TOFU_OP_MUL - Element-wise multiplication
  • Activations:
    • TOFU_OP_RELU - ReLU activation
    • TOFU_OP_SOFTMAX - Softmax activation
    • TOFU_OP_LAYER_NORM - Layer normalization
  • Shape operations:
    • TOFU_OP_RESHAPE - Reshape operation
    • TOFU_OP_TRANSPOSE - Transpose operation
  • Reductions:
    • TOFU_OP_MEAN - Mean reduction
    • TOFU_OP_SUM - Sum reduction
  • Loss functions:
    • TOFU_OP_MSE_LOSS - Mean squared error loss
    • TOFU_OP_CE_LOSS - Cross-entropy loss
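
Example (a minimal sketch reading the public op field; g, x, and W are assumed to be built as in the examples below):

tofu_graph_node *y = tofu_graph_matmul(g, x, W);
if (y->op == TOFU_OP_MATMUL) {
    // y was produced by a matrix multiplication node
}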

Graph Lifecycle

tofu_graph_create

Create a new empty computation graph.

tofu_graph* tofu_graph_create(void);

Returns: Pointer to newly allocated graph (caller owns, must call tofu_graph_free)

Behavior:

  • Graph starts empty - add nodes via tofu_graph_input, tofu_graph_param, etc.
  • Graph does NOT take ownership of tensors passed to tofu_graph_param
  • Caller must call tofu_graph_free to free graph and all nodes

Example:

tofu_graph *g = tofu_graph_create();

// Build graph...

tofu_graph_free(g);

tofu_graph_free

Free computation graph and all nodes.

void tofu_graph_free(tofu_graph* g);

Parameters:

  • g - Graph to free (can be NULL, no-op if NULL)

Behavior:

  • Frees all graph nodes and their gradients
  • Frees intermediate operation results (matmul, add, etc.)
  • Does NOT free INPUT or PARAM tensors (caller owns them)
  • Caller must separately free tensors passed to input/param functions
  • Safe to call multiple times (idempotent)

Ownership Pattern:

// Create tensors
tofu_tensor *input = tofu_tensor_zeros(2, (int[]){1, 4}, TOFU_FLOAT);
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);

// Build graph
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, input);
tofu_graph_node *W = tofu_graph_param(g, weights);
// ... more operations ...

// Cleanup (important order!)
tofu_graph_free(g);               // 1. Free graph first
tofu_tensor_free_data_too(input);  // 2. Then free tensors
tofu_tensor_free_data_too(weights);

See also: tofu_graph_clear_ops


tofu_graph_clear_ops

Clear all operation nodes but keep parameter nodes.

void tofu_graph_clear_ops(tofu_graph* g);

Parameters:

  • g - Graph to clear (cannot be NULL)

Behavior:

  • Frees all nodes except PARAM and INPUT nodes
  • Preserves trainable parameters for next forward pass
  • Use between training iterations to reset computation graph

Use Case:

tofu_graph *g = tofu_graph_create();

// Add parameters (preserved across iterations)
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Build forward graph for this batch
    tofu_graph_node *x = tofu_graph_input(g, batch_data);
    tofu_graph_node *y = tofu_graph_matmul(g, x, W);
    tofu_graph_node *out = tofu_graph_add(g, y, b);

    // Backward and optimize...

    // Clear operations for next iteration (W and b are preserved)
    tofu_graph_clear_ops(g);
}

Notes:

  • More efficient than creating new graph each iteration
  • Parameters maintain their values and gradients
  • Violating preconditions triggers assert() and crashes

Leaf Nodes

Leaf nodes are the starting points of computation - they have no inputs and represent either data or learnable parameters.

tofu_graph_input

Create input node (non-trainable data source).

tofu_graph_node* tofu_graph_input(tofu_graph* g, tofu_tensor* data);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • data - Input tensor data (cannot be NULL)

Returns: Pointer to newly created graph node (graph owns node, caller owns tensor)

Behavior:

  • Input nodes do NOT compute gradients
  • IMPORTANT: Graph does NOT take ownership of data tensor
  • Caller must free data tensor separately after tofu_graph_free()
  • Use for input data that doesn't require backpropagation

Example:

float input_data[] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *x_tensor = tofu_tensor_create(input_data, 2, (int[]){1, 4}, TOFU_FLOAT);

tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, x_tensor);

// Use x in graph operations...

tofu_graph_free(g);
tofu_tensor_free(x_tensor);  // Caller must free tensor

Notes:

  • Typical pattern: create tensor → input node → use → graph_free → free tensor
  • Violating preconditions triggers assert() and crashes

See also: tofu_graph_param for trainable parameters


tofu_graph_param

Create parameter node (trainable weights/biases).

tofu_graph_node* tofu_graph_param(tofu_graph* g, tofu_tensor* data);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • data - Parameter tensor data (cannot be NULL)

Returns: Pointer to newly created graph node (graph owns node, caller owns tensor)

Behavior:

  • IMPORTANT: Graph does NOT take ownership of data tensor
  • Caller must free data tensor separately after tofu_graph_free()
  • Parameter nodes compute gradients during backward pass
  • Use for trainable weights, biases, etc.

Example:

// Create trainable weights
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);

tofu_graph *g = tofu_graph_create();
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);

// Build network...
// Training loop with backward pass computes W->grad and b->grad

// Cleanup
tofu_graph_free(g);
tofu_tensor_free_data_too(W);  // Caller must free tensors
tofu_tensor_free_data_too(b);

Notes:

  • Typical pattern: create tensor → param node → free tensor after graph_free
  • Gradients are stored in the node, accessible via tofu_graph_get_grad()
  • Violating preconditions triggers assert() and crashes

See also: tofu_graph_input for non-trainable inputs, tofu_graph_get_grad to access gradients


Operations

Operations create new nodes in the graph that compute values during forward pass and gradients during backward pass.

tofu_graph_matmul

Add matrix multiplication node to graph.

tofu_graph_node* tofu_graph_matmul(tofu_graph* g, tofu_graph_node* a, tofu_graph_node* b);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • a - Left operand node (cannot be NULL)
  • b - Right operand node (cannot be NULL)

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Preconditions:

  • a->value->dims[last] must equal b->value->dims[second-to-last]

Behavior:

  • Computes matrix multiplication with broadcasting
  • Implements backward pass for gradient computation
  • Result node requires gradient if any input requires gradient

Example:

// Neural network layer: y = x @ W
tofu_graph_node *x = tofu_graph_input(g, input_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights_tensor);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);

Notes:

  • Most commonly used operation for neural networks
  • Follows same semantics as tofu_tensor_matmul (see Tensor API)
  • Violating preconditions triggers assert() and crashes

tofu_graph_add

Add element-wise addition node to graph.

tofu_graph_node* tofu_graph_add(tofu_graph* g, tofu_graph_node* a, tofu_graph_node* b);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • a - First operand node (cannot be NULL)
  • b - Second operand node (cannot be NULL)

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Preconditions:

  • a and b must be broadcastable (NumPy rules)

Behavior:

  • Computes element-wise addition with broadcasting
  • Implements backward pass for gradient computation

Example:

// Add bias: y = x + b
tofu_graph_node *x = tofu_graph_matmul(g, input, weights);
tofu_graph_node *b = tofu_graph_param(g, bias_tensor);
tofu_graph_node *y = tofu_graph_add(g, x, b);

Notes:

  • Supports NumPy-style broadcasting
  • Common for adding biases to layer outputs

tofu_graph_mul

Add element-wise multiplication node to graph.

tofu_graph_node* tofu_graph_mul(tofu_graph* g, tofu_graph_node* a, tofu_graph_node* b);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • a - First operand node (cannot be NULL)
  • b - Second operand node (cannot be NULL)

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Preconditions:

  • a and b must be broadcastable (NumPy rules)

Behavior:

  • Computes element-wise multiplication with broadcasting
  • Implements backward pass for gradient computation

Example:

// Attention mechanism: scaled dot product
tofu_graph_node *qk = tofu_graph_matmul(g, q, k);
tofu_graph_node *scale = tofu_graph_param(g, scale_tensor);
tofu_graph_node *scaled = tofu_graph_mul(g, qk, scale);

tofu_graph_relu

Add ReLU activation node to graph.

tofu_graph_node* tofu_graph_relu(tofu_graph* g, tofu_graph_node* x);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • x - Input node (cannot be NULL)

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Behavior:

  • Computes ReLU: max(0, x)
  • Implements backward pass for gradient computation
  • Gradient is 1 where x > 0, else 0

Example:

// Hidden layer with ReLU
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
tofu_graph_node *h1_bias = tofu_graph_add(g, h1, b1);
tofu_graph_node *h1_relu = tofu_graph_relu(g, h1_bias);

tofu_graph_softmax

Add softmax activation node to graph.

tofu_graph_node* tofu_graph_softmax(tofu_graph* g, tofu_graph_node* x, int axis);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • x - Input node (cannot be NULL)
  • axis - Axis along which to apply softmax

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Preconditions:

  • axis < x->value->ndim

Behavior:

  • Computes softmax along specified axis (exp normalization)
  • Implements backward pass for gradient computation
  • Numerically stable (subtracts max before exp)

Example:

// Classification output layer
tofu_graph_node *logits = tofu_graph_matmul(g, h, W_out);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

Notes:

  • Typically used for classification tasks
  • Output values sum to 1.0 along specified axis

tofu_graph_layer_norm

Add layer normalization node to graph.

tofu_graph_node* tofu_graph_layer_norm(tofu_graph* g, tofu_graph_node* x,
                                       tofu_graph_node* gamma, tofu_graph_node* beta,
                                       int axis, double eps);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • x - Input node (cannot be NULL)
  • gamma - Scale parameter node (can be NULL for no scaling)
  • beta - Shift parameter node (can be NULL for no shift)
  • axis - Axis along which to normalize
  • eps - Small constant for numerical stability (typically 1e-5)

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Preconditions:

  • axis < x->value->ndim
  • eps > 0

Behavior:

  • Normalizes: (x - mean) / sqrt(variance + eps)
  • Then applies: gamma * normalized + beta (if gamma/beta non-NULL)
  • Implements backward pass for gradient computation

Example:

// Transformer-style layer norm
tofu_graph_node *gamma = tofu_graph_param(g, gamma_tensor);
tofu_graph_node *beta = tofu_graph_param(g, beta_tensor);
tofu_graph_node *normalized = tofu_graph_layer_norm(g, x, gamma, beta, 1, 1e-5);

Notes:

  • Common in transformer architectures
  • Helps stabilize training of deep networks

tofu_graph_reshape

Add reshape node to graph.

tofu_graph_node* tofu_graph_reshape(tofu_graph* g, tofu_graph_node* x, int ndim, const int* dims);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • x - Input node (cannot be NULL)
  • ndim - Number of dimensions for reshaped tensor
  • dims - Array of new dimension sizes

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Preconditions:

  • Product of dims must equal the total number of elements in x->value

Behavior:

  • View operation (no data copy) - reshaped tensor shares data with input
  • Implements backward pass for gradient computation

Example:

// Flatten for fully connected layer
// Input: [batch, channels, height, width]
// Output: [batch, channels * height * width]
int flat_dim = channels * height * width;
tofu_graph_node *flat = tofu_graph_reshape(g, x, 2, (int[]){batch_size, flat_dim});

tofu_graph_transpose

Add transpose node to graph.

tofu_graph_node* tofu_graph_transpose(tofu_graph* g, tofu_graph_node* x, const int* axes);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • x - Input node (cannot be NULL)
  • axes - Permutation array (can be NULL for reverse order)

Returns: Pointer to result node (graph owns, freed by tofu_graph_free)

Preconditions:

  • If axes is non-NULL, it must be valid permutation of [0, ..., ndim-1]

Behavior:

  • If axes is NULL, reverses dimension order
  • Implements backward pass for gradient computation

Example:

// Transpose matrix for weight matrix
tofu_graph_node *W_T = tofu_graph_transpose(g, W, NULL);

Loss Functions

Loss functions compute scalar values representing model error. They're typically the final nodes in a computation graph before calling tofu_graph_backward().

tofu_graph_mse_loss

Add mean squared error loss node to graph.

tofu_graph_node* tofu_graph_mse_loss(tofu_graph* g, tofu_graph_node* pred, tofu_graph_node* target);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • pred - Prediction node (cannot be NULL)
  • target - Target/ground truth node (cannot be NULL)

Returns: Pointer to scalar loss node (graph owns, freed by tofu_graph_free)

Preconditions:

  • pred and target must have same shape

Behavior:

  • Computes: mean((pred - target)^2)
  • Returns scalar (average over all elements)
  • Use for regression tasks
  • Implements backward pass for gradient computation

Example:

// Regression task
tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
tofu_graph_node *target = tofu_graph_input(g, target_tensor);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

// Compute gradients
tofu_graph_backward(g, loss);

See also: tofu_graph_ce_loss for classification


tofu_graph_ce_loss

Add cross-entropy loss node to graph.

tofu_graph_node* tofu_graph_ce_loss(tofu_graph* g, tofu_graph_node* pred, tofu_graph_node* target);

Parameters:

  • g - Graph to add node to (cannot be NULL)
  • pred - Prediction node (softmax probabilities) (cannot be NULL)
  • target - Target/ground truth node (class indices or one-hot) (cannot be NULL)

Returns: Pointer to scalar loss node (graph owns, freed by tofu_graph_free)

Behavior:

  • Computes: -sum(target * log(pred)) per sample, averaged over the batch
  • Returns scalar loss
  • Use for classification tasks
  • Numerically stable implementation
  • Implements backward pass for gradient computation

Example:

// Classification task
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *target = tofu_graph_input(g, target_tensor);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);

// Compute gradients
tofu_graph_backward(g, loss);

See also: tofu_graph_mse_loss for regression


Backward Pass

tofu_graph_backward

Perform backward pass (backpropagation) from loss node.

void tofu_graph_backward(tofu_graph* g, tofu_graph_node* loss);

Parameters:

  • g - Graph containing loss node (cannot be NULL)
  • loss - Loss node to backpropagate from (cannot be NULL)

Preconditions:

  • loss must be scalar (single element tensor)

Behavior:

  • Computes gradients for all nodes requiring gradient
  • Populates node->grad for all PARAM nodes
  • Uses reverse-mode automatic differentiation
  • Call after forward pass, before optimizer step
  • Gradients accumulate across multiple backward passes and from multiple computational paths
  • Always call tofu_graph_zero_grad before each training iteration unless you intentionally want gradients to accumulate (e.g., when accumulating over several mini-batches before an optimizer step)

Example:

// Training iteration
tofu_graph *g = tofu_graph_create();

// Forward pass
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
tofu_graph_node *target = tofu_graph_input(g, target_data);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

// Backward pass
tofu_graph_backward(g, loss);

// Now W->grad contains ∂loss/∂W
tofu_tensor *W_grad = tofu_graph_get_grad(W);

Notes:

  • Automatically builds topological sort for efficient gradient computation
  • Violating preconditions triggers assert() and crashes

See also: tofu_graph_zero_grad to clear gradients, tofu_graph_get_grad to access gradients


Utilities

tofu_graph_get_value

Get forward pass result from graph node.

tofu_tensor* tofu_graph_get_value(tofu_graph_node* node);

Parameters:

  • node - Graph node (cannot be NULL)

Returns: Pointer to result tensor (node owns, do NOT free)

Behavior:

  • Returns tensor computed during forward pass
  • Do NOT free returned tensor (node owns it)

Example:

tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
tofu_tensor *pred_value = tofu_graph_get_value(pred);

// Print predictions
tofu_tensor_print(pred_value, "%.6f");

Warning: Do not free the returned tensor!


tofu_graph_get_grad

Get gradient from graph node.

tofu_tensor* tofu_graph_get_grad(tofu_graph_node* node);

Parameters:

  • node - Graph node (cannot be NULL)

Returns: Pointer to gradient tensor (node owns, do NOT free), or NULL if no gradient

Behavior:

  • Returns gradient computed during backward pass
  • Returns NULL if backward hasn't been called yet
  • Do NOT free returned tensor (node owns it)

Example:

tofu_graph_node *W = tofu_graph_param(g, weights);
// ... build graph and backward pass ...

tofu_tensor *W_grad = tofu_graph_get_grad(W);
if (W_grad) {
    // Use gradient for parameter update
    // (or use optimizer which handles this automatically)
}

Warning: Do not free the returned tensor!


tofu_graph_zero_grad

Zero out all gradients in graph.

void tofu_graph_zero_grad(tofu_graph* g);

Parameters:

  • g - Graph to zero gradients for (cannot be NULL)

Behavior:

  • Sets all node->grad tensors to zero
  • Call before each training iteration to prevent gradient accumulation
  • Does NOT free gradient tensors, just zeros values

Example:

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Zero gradients before forward pass
    tofu_graph_zero_grad(g);

    // Forward pass
    // ... build graph ...

    // Backward pass
    tofu_graph_backward(g, loss);

    // Update parameters
    tofu_optimizer_step(optimizer);
}

Notes:

  • Essential for correct training - prevents gradient accumulation
  • Typically called by optimizer's zero_grad() function

See also: tofu_optimizer_zero_grad


Usage Patterns

Basic Training Loop

// Setup
tofu_graph *g = tofu_graph_create();
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);

tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    for (int batch = 0; batch < num_batches; batch++) {
        // Zero gradients
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
        tofu_graph_node *y = tofu_graph_matmul(g, x, W_node);
        tofu_graph_node *out = tofu_graph_add(g, y, b_node);
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, out, target);

        // Backward pass
        tofu_graph_backward(g, loss);

        // Update parameters
        tofu_optimizer_step(opt);

        // Clear operations for next batch (keeps parameters)
        tofu_graph_clear_ops(g);
    }
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);

Multi-Layer Neural Network

// Define network architecture
typedef struct {
    tofu_tensor *W1, *b1;
    tofu_tensor *W2, *b2;
    tofu_tensor *W3, *b3;
} Network;

// Forward pass function
tofu_graph_node* forward_pass(tofu_graph *g, tofu_graph_node *x, Network *net) {
    // Layer 1
    tofu_graph_node *W1 = tofu_graph_param(g, net->W1);
    tofu_graph_node *b1 = tofu_graph_param(g, net->b1);
    tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
    h1 = tofu_graph_add(g, h1, b1);
    h1 = tofu_graph_relu(g, h1);

    // Layer 2
    tofu_graph_node *W2 = tofu_graph_param(g, net->W2);
    tofu_graph_node *b2 = tofu_graph_param(g, net->b2);
    tofu_graph_node *h2 = tofu_graph_matmul(g, h1, W2);
    h2 = tofu_graph_add(g, h2, b2);
    h2 = tofu_graph_relu(g, h2);

    // Output layer
    tofu_graph_node *W3 = tofu_graph_param(g, net->W3);
    tofu_graph_node *b3 = tofu_graph_param(g, net->b3);
    tofu_graph_node *out = tofu_graph_matmul(g, h2, W3);
    out = tofu_graph_add(g, out, b3);

    return out;
}

// Usage
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *pred = forward_pass(g, x, &network);
tofu_graph_node *target = tofu_graph_input(g, target_data);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

tofu_graph_backward(g, loss);

Classification with Softmax and Cross-Entropy

// Classification network
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);

// Logits
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
logits = tofu_graph_add(g, logits, b);

// Softmax (probabilities)
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);

// Cross-entropy loss
tofu_graph_node *target = tofu_graph_input(g, target_data);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);

// Backward and optimize
tofu_graph_backward(g, loss);
tofu_optimizer_step(optimizer);

Memory Management Best Practices

// 1. Create tensors for parameters (library manages data)
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

// 2. Create graph, add parameters, and create optimizer
tofu_graph *g = tofu_graph_create();
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// 3. Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Create input tensors (user manages data)
    float *batch_data = load_batch(epoch);
    tofu_tensor *x_tensor = tofu_tensor_create(batch_data, 2, (int[]){32, 784}, TOFU_FLOAT);

    // Build graph
    tofu_graph_node *x = tofu_graph_input(g, x_tensor);
    // ... forward pass ...

    // Training step
    tofu_graph_backward(g, loss);
    tofu_optimizer_step(opt);

    // Clean up batch resources
    tofu_tensor_free(x_tensor);  // Free tensor structure
    free(batch_data);            // Free data buffer

    // Clear operations (keeps parameters)
    tofu_graph_clear_ops(g);
}

// 4. Cleanup (IMPORTANT ORDER!)
tofu_optimizer_free(opt);           // Free optimizer first
tofu_graph_free(g);                 // Then free graph
tofu_tensor_free_data_too(weights);  // Finally free parameter tensors
tofu_tensor_free_data_too(bias);

Efficient Batch Processing

tofu_graph *g = tofu_graph_create();

// Add parameters once (persists across batches)
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

for (int batch = 0; batch < num_batches; batch++) {
    tofu_optimizer_zero_grad(opt);

    // Add input for this batch
    tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);

    // Build forward graph
    tofu_graph_node *out = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
    tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
    tofu_graph_node *loss = tofu_graph_mse_loss(g, out, target);

    // Backward and update
    tofu_graph_backward(g, loss);
    tofu_optimizer_step(opt);

    // Clear operations for next batch (W and b are preserved)
    tofu_graph_clear_ops(g);
}

Notes

Gradient Accumulation

Gradients accumulate by default. Always call tofu_graph_zero_grad() or tofu_optimizer_zero_grad() before each training iteration:

// CORRECT: Zero gradients before each iteration
for (int i = 0; i < num_iterations; i++) {
    tofu_optimizer_zero_grad(opt);  // Clear previous gradients
    // ... forward and backward ...
}

// INCORRECT: Gradients accumulate indefinitely
for (int i = 0; i < num_iterations; i++) {
    // ... forward and backward ...
    // Gradients from all iterations accumulate!
}

Dynamic Graphs

Tofu uses dynamic computation graphs (define-by-run). The graph structure can change between iterations:

for (int epoch = 0; epoch < num_epochs; epoch++) {
    if (epoch < 10) {
        // Simple network
        out = tofu_graph_matmul(g, x, W1);
    } else {
        // More complex network
        h = tofu_graph_relu(g, tofu_graph_matmul(g, x, W1));
        out = tofu_graph_matmul(g, h, W2);
    }

    // ... compute loss from out ...
    tofu_graph_backward(g, loss);
    tofu_graph_clear_ops(g);  // Clear for next iteration
}

Error Checking

Most functions use assert() for precondition checking. In release builds with assertions disabled, violating preconditions leads to undefined behavior. Always ensure:

  • Pointers are non-NULL
  • Shapes are compatible
  • Tensors are broadcastable
  • Loss is scalar before calling backward
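
Example (a defensive sketch, assuming a and b are existing graph nodes built earlier):

if (tofu_tensor_isbroadcastable(a->value, b->value)) {
    tofu_graph_node *sum = tofu_graph_add(g, a, b);
    // ... continue building the graph ...
} else {
    fprintf(stderr, "shapes are not broadcastable\n");
}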

Optimizer API Reference

The Optimizer API provides algorithms for updating trainable parameters based on computed gradients. Optimizers automatically collect parameters from the computation graph and apply update rules during training.

Data Structures

tofu_optimizer

The optimizer structure that manages parameters and their update strategy.

struct tofu_optimizer {
    tofu_optim_type type;            // Optimizer type
    tofu_graph* graph;               // Associated computation graph

    tofu_graph_node** params;        // Array of parameter nodes
    int num_params;                  // Number of parameters
    int capacity_params;             // Allocated capacity

    double learning_rate;            // Learning rate

    void* state;                     // Optimizer state (momentum buffers, etc.)

    tofu_optim_step_fn step_fn;      // Parameter update function
};

Optimizer Types (tofu_optim_type)

Available optimization algorithms:

  • TOFU_OPTIM_SGD - Vanilla Stochastic Gradient Descent
  • TOFU_OPTIM_SGD_MOMENTUM - SGD with momentum
  • TOFU_OPTIM_ADAM - Adam optimizer (future)

Creating Optimizers

tofu_optimizer_sgd_create

Create SGD (Stochastic Gradient Descent) optimizer.

tofu_optimizer* tofu_optimizer_sgd_create(tofu_graph* g, double learning_rate);

Parameters:

  • g - Computation graph containing parameters (cannot be NULL)
  • learning_rate - Learning rate (step size) (must be > 0)

Returns: Pointer to newly allocated optimizer (caller owns, must call tofu_optimizer_free)

Preconditions:

  • g must not be NULL
  • learning_rate > 0

Behavior:

  • Implements vanilla SGD: param = param - learning_rate * grad
  • Automatically collects all PARAM nodes from graph
  • Caller must call tofu_optimizer_free to free optimizer

Algorithm:

for each parameter θ:
    θ ← θ - η * ∇θL
where:
    η = learning_rate
    ∇θL = gradient of loss w.r.t. parameter

Example:

tofu_graph *g = tofu_graph_create();

// Add parameters to graph
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_graph_node *W_node = tofu_graph_param(g, W);

// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    tofu_optimizer_zero_grad(opt);
    // ... forward and backward pass ...
    tofu_optimizer_step(opt);
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);

Notes:

  • Simple and robust, good baseline optimizer
  • No momentum or adaptive learning rates
  • May converge slowly on complex problems
  • Violating preconditions triggers assert() and crashes

See also: tofu_optimizer_sgd_momentum_create for SGD with momentum


tofu_optimizer_sgd_momentum_create

Create SGD optimizer with momentum.

tofu_optimizer* tofu_optimizer_sgd_momentum_create(tofu_graph* g, double learning_rate, double momentum);

Parameters:

  • g - Computation graph containing parameters (cannot be NULL)
  • learning_rate - Learning rate (step size) (must be > 0)
  • momentum - Momentum coefficient (typically 0.9) (must be >= 0 and < 1)

Returns: Pointer to newly allocated optimizer (caller owns, must call tofu_optimizer_free)

Preconditions:

  • g must not be NULL
  • learning_rate > 0
  • 0 <= momentum < 1

Behavior:

  • Implements SGD with momentum:
    • velocity = momentum * velocity - learning_rate * grad
    • param = param + velocity
  • Momentum helps accelerate training and reduces oscillations
  • Automatically collects all PARAM nodes from graph
  • Caller must call tofu_optimizer_free to free optimizer

Algorithm:

for each parameter θ:
    v ← μ * v - η * ∇θL
    θ ← θ + v
where:
    η = learning_rate
    μ = momentum
    v = velocity (accumulated gradients)
    ∇θL = gradient of loss w.r.t. parameter

Note: For a fixed learning rate this is mathematically equivalent to
classical momentum (v = μ*v + ∇θL, θ = θ - η*v); the learning rate is
folded into the velocity update rather than the parameter update.

Example:

tofu_graph *g = tofu_graph_create();

// Add parameters
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_graph_node *W_node = tofu_graph_param(g, W);

// Create optimizer with momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    tofu_optimizer_zero_grad(opt);
    // ... forward and backward pass ...
    tofu_optimizer_step(opt);
}

// Cleanup
tofu_optimizer_free(opt);

Notes:

  • Momentum helps escape local minima and speeds up convergence
  • Typical momentum values: 0.9 (standard), 0.99 (high momentum)
  • More effective than vanilla SGD for deep networks
  • Violating preconditions triggers assert() and crashes

See also: tofu_optimizer_sgd_create for vanilla SGD


Cleanup

tofu_optimizer_free

Free optimizer and its state.

void tofu_optimizer_free(tofu_optimizer* opt);

Parameters:

  • opt - Optimizer to free (can be NULL, no-op if NULL)

Behavior:

  • Frees optimizer structure and internal state (momentum buffers, etc.)
  • Does NOT free the graph or parameters (graph owns them)
  • Safe to call multiple times (idempotent)

Cleanup Order:

// CORRECT order:
tofu_optimizer_free(opt);           // 1. Free optimizer
tofu_graph_free(g);                 // 2. Free graph
tofu_tensor_free_data_too(weights);  // 3. Free tensors

// INCORRECT order (may crash):
tofu_graph_free(g);                 // DON'T free graph before optimizer!
tofu_optimizer_free(opt);           // Optimizer may access freed memory

Training Operations

tofu_optimizer_step

Perform one optimization step (update parameters).

void tofu_optimizer_step(tofu_optimizer* opt);

Parameters:

  • opt - Optimizer (cannot be NULL)

Preconditions:

  • opt must not be NULL
  • Gradients must be computed (call tofu_graph_backward first)

Behavior:

  • Updates all parameters using computed gradients
  • Algorithm depends on optimizer type (SGD, SGD+momentum, etc.)
  • Call after backward pass: forward → backward → step
  • Does NOT zero gradients - call tofu_optimizer_zero_grad if needed

Training Sequence:

for (int iteration = 0; iteration < num_iterations; iteration++) {
    // 1. Zero gradients
    tofu_optimizer_zero_grad(opt);

    // 2. Forward pass
    tofu_graph_node *x = tofu_graph_input(g, input_data);
    tofu_graph_node *pred = forward_pass(g, x);
    tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

    // 3. Backward pass
    tofu_graph_backward(g, loss);

    // 4. Update parameters
    tofu_optimizer_step(opt);

    // 5. Clear operations for next iteration
    tofu_graph_clear_ops(g);
}

Notes:

  • Must call tofu_graph_backward() before this function
  • Modifies parameter tensors in-place
  • Violating preconditions triggers assert() and crashes

See also: tofu_graph_backward, tofu_optimizer_zero_grad


tofu_optimizer_zero_grad

Zero out all parameter gradients.

void tofu_optimizer_zero_grad(tofu_optimizer* opt);

Parameters:

  • opt - Optimizer (cannot be NULL)

Preconditions:

  • opt must not be NULL

Behavior:

  • Sets gradients to zero for all tracked parameters
  • Call before each training iteration to prevent gradient accumulation
  • Equivalent to tofu_graph_zero_grad but works via optimizer

Example:

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Zero gradients before forward pass
    tofu_optimizer_zero_grad(opt);

    // Forward pass
    tofu_graph_node *pred = forward_pass(g, input);
    tofu_graph_node *loss = compute_loss(g, pred, target);

    // Backward pass
    tofu_graph_backward(g, loss);

    // Update parameters
    tofu_optimizer_step(opt);
}

Notes:

  • Essential for correct training - prevents gradient accumulation
  • Must call before each training iteration
  • Violating preconditions triggers assert() and crashes

See also: tofu_graph_zero_grad


Parameter Management

Most users won't need these functions - parameters are automatically collected during optimizer creation. These are useful for advanced use cases like dynamic network architectures.

tofu_optimizer_add_param

Manually add parameter node to optimizer.

int tofu_optimizer_add_param(tofu_optimizer* opt, tofu_graph_node* param);

Parameters:

  • opt - Optimizer (cannot be NULL)
  • param - Parameter node to track (cannot be NULL)

Returns: 0 on success, non-zero on error

Preconditions:

  • opt and param must not be NULL
  • param must be a PARAM node (requires gradient)

Behavior:

  • Usually not needed - optimizer auto-collects params at creation
  • Use if you need to add parameters dynamically

Example:

// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Add parameters dynamically (rare use case)
tofu_tensor *new_weight = tofu_tensor_zeros(2, (int[]){10, 5}, TOFU_FLOAT);
tofu_graph_node *W_new = tofu_graph_param(g, new_weight);
tofu_optimizer_add_param(opt, W_new);

Notes:

  • Rarely needed - use only for dynamic architectures
  • Violating preconditions triggers assert() and crashes

See also: tofu_optimizer_collect_params to scan graph for all params


tofu_optimizer_collect_params

Collect all parameter nodes from graph.

void tofu_optimizer_collect_params(tofu_optimizer* opt);

Parameters:

  • opt - Optimizer (cannot be NULL)

Preconditions:

  • opt must not be NULL

Behavior:

  • Scans graph and adds all PARAM nodes to optimizer
  • Called automatically during optimizer creation
  • Use if graph structure changes and you need to rescan
  • Clears existing parameter list before collecting

Example:

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Add more parameters to graph later
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){10, 5}, TOFU_FLOAT);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);

// Rescan graph to include new parameters
tofu_optimizer_collect_params(opt);

Notes:

  • Rarely needed - parameters auto-collected at creation
  • Use only if network structure changes dynamically
  • Violating preconditions triggers assert() and crashes

Usage Patterns

Basic Training Loop

// Setup
tofu_graph *g = tofu_graph_create();

// Create parameters
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

// Add to graph
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);

// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);

// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    for (int batch = 0; batch < num_batches; batch++) {
        // 1. Zero gradients
        tofu_optimizer_zero_grad(opt);

        // 2. Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
        tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
        tofu_graph_node *pred = tofu_graph_add(g, h, b_node);

        // 3. Compute loss
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);

        // 4. Backward pass
        tofu_graph_backward(g, loss);

        // 5. Update parameters
        tofu_optimizer_step(opt);

        // 6. Clear operations for next batch
        tofu_graph_clear_ops(g);
    }
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);

Training with Momentum

// Setup with momentum optimizer
tofu_graph *g = tofu_graph_create();

// Network parameters
tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){784, 128}, TOFU_FLOAT);
tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){128}, TOFU_FLOAT);
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){128, 10}, TOFU_FLOAT);
tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);

// Add to graph
tofu_graph_node *W1_node = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);

// Create optimizer with momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
    for (int batch = 0; batch < num_batches; batch++) {
        tofu_optimizer_zero_grad(opt);

        // Forward pass
        tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);

        // Layer 1
        tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
        h1 = tofu_graph_add(g, h1, b1_node);
        h1 = tofu_graph_relu(g, h1);

        // Layer 2
        tofu_graph_node *h2 = tofu_graph_matmul(g, h1, W2_node);
        h2 = tofu_graph_add(g, h2, b2_node);

        // Loss
        tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
        tofu_graph_node *loss = tofu_graph_mse_loss(g, h2, target);

        // Backward and update
        tofu_graph_backward(g, loss);
        tofu_optimizer_step(opt);

        tofu_graph_clear_ops(g);
    }
}

// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W1);
tofu_tensor_free_data_too(b1);
tofu_tensor_free_data_too(W2);
tofu_tensor_free_data_too(b2);

Learning Rate Scheduling

Manual learning rate adjustment during training:

// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Reduce learning rate every 10 epochs
    if (epoch % 10 == 0 && epoch > 0) {
        opt->learning_rate *= 0.5;
        printf("Epoch %d: Reduced learning rate to %.6f\n", epoch, opt->learning_rate);
    }

    // Training loop for this epoch
    for (int batch = 0; batch < num_batches; batch++) {
        tofu_optimizer_zero_grad(opt);
        // ... forward, backward, step ...
    }
}

Monitoring Gradients

Useful for debugging and understanding training dynamics:

// After backward pass, before optimizer step
tofu_tensor *W_grad = tofu_graph_get_grad(W_node);

// Compute gradient statistics
double grad_sum = 0.0;
double grad_max = -INFINITY;
for (int i = 0; i < W_grad->len; i++) {
    float val;
    TOFU_TENSOR_DATA_TO(W_grad, i, val, TOFU_FLOAT);
    grad_sum += fabs(val);
    if (fabs(val) > grad_max) grad_max = fabs(val);
}

printf("Gradient mean: %.6f, max: %.6f\n",
       grad_sum / W_grad->len, grad_max);

// Now update parameters
tofu_optimizer_step(opt);

Gradient Clipping (Manual)

Prevent exploding gradients:

void clip_gradients(tofu_optimizer *opt, double max_norm) {
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
        if (!grad) continue;

        // Compute gradient norm
        double norm = 0.0;
        for (int j = 0; j < grad->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
            norm += val * val;
        }
        norm = sqrt(norm);

        // Clip if necessary
        if (norm > max_norm) {
            double scale = max_norm / norm;
            for (int j = 0; j < grad->len; j++) {
                float val;
                TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
                val *= scale;
                TOFU_TENSOR_DATA_FROM(grad, j, val, TOFU_FLOAT);
            }
        }
    }
}

// Usage in training loop
tofu_graph_backward(g, loss);
clip_gradients(opt, 1.0);  // Clip to max norm of 1.0
tofu_optimizer_step(opt);

Hyperparameter Guidance

Learning Rate

The learning rate is the most important hyperparameter. It controls the step size of parameter updates.

Guidelines:

  Problem Type      Recommended Range   Notes
  Small networks    0.01 - 0.1          Can use larger learning rates
  Deep networks     0.001 - 0.01        Need smaller learning rates
  Fine-tuning       0.0001 - 0.001      Very small to preserve learned features

Common values:

  • 0.1 - Starting point for small networks
  • 0.01 - Default safe choice for most problems
  • 0.001 - Deep networks, complex problems
  • 0.0001 - Fine-tuning pre-trained models

Signs of incorrect learning rate:

  • Too high: Loss diverges (increases), NaN values, training unstable
  • Too low: Very slow convergence, loss decreases too slowly

Example - Finding good learning rate:

// Try multiple learning rates
double learning_rates[] = {0.001, 0.01, 0.1};

for (int lr_idx = 0; lr_idx < 3; lr_idx++) {
    printf("\n=== Testing LR: %.4f ===\n", learning_rates[lr_idx]);

    // Reset parameters
    reinitialize_parameters(W, b);

    // Create optimizer with this learning rate
    tofu_optimizer *opt = tofu_optimizer_sgd_create(g, learning_rates[lr_idx]);

    // Train for a few epochs
    for (int epoch = 0; epoch < 10; epoch++) {
        // ... training loop ...
        printf("Epoch %d, Loss: %.6f\n", epoch, loss_value);
    }

    tofu_optimizer_free(opt);
}

Momentum

Momentum helps accelerate convergence and dampen oscillations.

Guidelines:

  Scenario        Recommended Value   Effect
  Default         0.9                 Good balance for most problems
  High momentum   0.95 - 0.99         Faster convergence, may overshoot
  Low momentum    0.5 - 0.8           More stable, slower convergence
  No momentum     0.0                 Vanilla SGD, most stable but slowest

Common values:

  • 0.9 - Standard choice for most problems
  • 0.95 - Deep networks, when convergence is slow
  • 0.99 - Very deep networks (ResNet, Transformers)
  • 0.5 - Noisy gradients, unstable training

Example:

// Standard momentum for a deep network
tofu_optimizer *opt_std = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// Higher momentum for a very deep network
tofu_optimizer *opt_high = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.99);

// Low momentum for noisy gradients
tofu_optimizer *opt_low = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.5);

Batch Size Considerations

Batch size affects effective learning rate:

// Larger batches → more stable gradients → can use higher learning rate
int batch_size = 128;
double lr = 0.01;

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, lr);

// If you increase batch size, consider increasing learning rate proportionally
// batch_size = 256 → lr = 0.02
// batch_size = 512 → lr = 0.04

Learning Rate Schedules

Common strategies for adjusting learning rate during training:

Step Decay:

// Reduce learning rate every N epochs
if (epoch % 30 == 0 && epoch > 0) {
    opt->learning_rate *= 0.1;  // Reduce by 10x
}

Exponential Decay:

// Decay gradually every epoch
double initial_lr = 0.1;
double decay_rate = 0.96;
opt->learning_rate = initial_lr * pow(decay_rate, epoch);

Cosine Annealing:

// Smooth decay following cosine curve
double initial_lr = 0.1;
double min_lr = 0.001;
opt->learning_rate = min_lr + (initial_lr - min_lr) *
                     (1 + cos(M_PI * epoch / num_epochs)) / 2;

Training Tips

1. Start with a reasonable learning rate:

// Good defaults:
tofu_optimizer *opt_sgd = tofu_optimizer_sgd_create(g, 0.01);
tofu_optimizer *opt_momentum = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

2. Monitor loss and adjust:

double prev_loss = INFINITY;

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // ... training ...

    // Check if loss is improving
    if (loss_value > prev_loss * 1.1) {
        printf("Loss increased! Consider reducing learning rate.\n");
    }

    prev_loss = loss_value;
}

3. Use learning rate warmup for large learning rates:

double target_lr = 0.1;
int warmup_epochs = 5;

for (int epoch = 0; epoch < num_epochs; epoch++) {
    if (epoch < warmup_epochs) {
        // Gradually increase learning rate
        opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
    } else {
        opt->learning_rate = target_lr;
    }

    // ... training ...
}

4. Weight decay (L2 regularization) - manual implementation:

double weight_decay = 0.0001;

void apply_weight_decay(tofu_optimizer *opt, double weight_decay) {
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *param = tofu_graph_get_value(opt->params[i]);

        for (int j = 0; j < param->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(param, j, val, TOFU_FLOAT);
            val *= (1.0 - weight_decay * opt->learning_rate);
            TOFU_TENSOR_DATA_FROM(param, j, val, TOFU_FLOAT);
        }
    }
}

// Use before optimizer step
tofu_graph_backward(g, loss);
apply_weight_decay(opt, 0.0001);
tofu_optimizer_step(opt);

Common Pitfalls

Forgetting to Zero Gradients

Problem:

// WRONG: Gradients accumulate indefinitely
for (int i = 0; i < num_iterations; i++) {
    // forward, backward, step...
    // Gradients keep accumulating!
}

Solution:

// CORRECT: Zero gradients each iteration
for (int i = 0; i < num_iterations; i++) {
    tofu_optimizer_zero_grad(opt);  // Clear gradients
    // forward, backward, step...
}

Incorrect Cleanup Order

Problem:

// WRONG: Freeing graph before optimizer
tofu_graph_free(g);        // Graph freed
tofu_optimizer_free(opt);  // Optimizer tries to access freed graph!

Solution:

// CORRECT: Free optimizer before graph
tofu_optimizer_free(opt);  // Free optimizer first
tofu_graph_free(g);        // Then free graph

Learning Rate Too High

Symptoms:

  • Loss becomes NaN
  • Loss diverges (increases)
  • Training unstable

Solution:

// Reduce learning rate by 10x
double new_lr = opt->learning_rate * 0.1;
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_create(g, new_lr);

Learning Rate Too Low

Symptoms:

  • Loss decreases very slowly
  • Training takes many epochs
  • No progress after many iterations

Solution:

// Increase learning rate by 10x
double new_lr = opt->learning_rate * 10.0;
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_create(g, new_lr);

Notes

Optimizer State Persistence

Optimizer state (like momentum buffers) persists across training iterations:

// Momentum accumulates across iterations
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

for (int epoch = 0; epoch < num_epochs; epoch++) {
    // Momentum from previous epochs affects current updates
    // ... training ...
}

Parameter Collection

Optimizers automatically collect parameters when created:

// All PARAM nodes are collected automatically
tofu_graph_node *W1 = tofu_graph_param(g, weights1);
tofu_graph_node *W2 = tofu_graph_param(g, weights2);

tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// opt now tracks both W1 and W2

Memory Management

Optimizer owns its internal state but not the graph or parameters:

// Optimizer allocates momentum buffers (if using momentum)
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);

// When freed, optimizer releases momentum buffers
tofu_optimizer_free(opt);

// Graph and parameters remain valid
// (must be freed separately)

Changelog

Version history and changes.

API Stability

Information about API stability guarantees and versioning.