Introduction
Welcome to the Tofu User Guide.
Installation
Learn how to build and install Tofu on Linux, macOS, and ESP32 microcontrollers. This guide covers everything from downloading the source code to verifying your installation works correctly.
In this guide you'll learn:
- Prerequisites and system requirements for each platform
- How to configure and build Tofu from source
- How to run tests to verify your build
- How to install Tofu to your system
- How to cross-compile for ESP32 embedded systems
Prerequisites:
- A C compiler (GCC 7+ or Clang 10+)
- GNU Make build system
- Git version control
- pkg-config package manager
Estimated time: 10 minutes for basic installation, 15 minutes including tests
Quick Installation
If you're already familiar with building C projects, here's the fastest path to get Tofu running:
# Clone the repository
cd ~/projects # or your preferred workspace
git clone https://github.com/c2akula/tofu.git
cd tofu
# Configure and build
chmod +x configure
./configure
make
# Optional: Run tests to verify the build
make test
# Install to your system
sudo make install
# Verify installation
pkg-config --modversion tofu
After installation, you can include Tofu in your projects with:
gcc -o myprogram myprogram.c $(pkg-config --cflags --libs tofu)
Detailed Installation - Linux
This section covers step-by-step installation on Ubuntu 20.04 and similar Linux distributions.
System Requirements
Tofu requires the following packages on Linux:
- Build tools: GCC (7+), GNU Make, GNU Autotools utilities
- Development headers: Standard C library headers
- Package management: pkg-config for dependency tracking
- Testing (optional): Check unit testing framework
Installing Dependencies
On Ubuntu and Debian-based distributions:
sudo apt-get update
sudo apt-get install build-essential perl git pkg-config check
This installs:
- gcc and g++: C/C++ compilers
- make: Build automation
- perl: Required by the configure script
- git: Version control
- pkg-config: Package configuration utility
- check: Unit testing framework (optional but recommended)
Cloning the Repository
cd ~/projects # Choose your workspace location
git clone https://github.com/c2akula/tofu.git
cd tofu
The repository is approximately 5 MB and includes:
- Source code in the src/ directory
- Tests in the test/ directory
- Examples in the examples/ directory
- Documentation in the docs/ directory
Configuring the Build
Before building, configure the build system:
chmod +x configure
./configure
You'll see output like:
configure tofu version 1.0.0
Checking autotools...
Checking pkg-config...
Checking C compiler...
Checking math library...
Generated config.mk
For custom installation directories, use:
./configure --install-dir=/opt/tofu
./configure --prefix=$HOME/.local
To see all available options:
./configure -h
Available options include:
- --build-dir=DIR: Where to place build artifacts (default: build)
- --install-dir=DIR: Where to install Tofu (default: /usr/local)
- --debug=yes: Include debug symbols for development
- --esp32=yes: Configure for ESP32 cross-compilation
Building the Library
After configuration, compile Tofu:
make lib
This builds both static and shared libraries. Expected output:
Compiling src/tofu_tensor.c
Compiling src/tofu_graph.c
Compiling src/tofu_optimizer.c
...
Linking build/lib/libtofu.so.1.0.0
Created: build/lib/libtofu.a
Created: build/lib/libtofu.so -> libtofu.so.1.0.0
Build artifacts are placed in the build/ directory:
- build/lib/libtofu.a: Static library
- build/lib/libtofu.so.1.0.0: Shared library (versioned)
- build/include/tofu/: Public header files
Running Tests
Verify your build with the comprehensive test suite:
make test
This builds and runs all tests. Expected output:
Compiling test/test_tensor.c
Compiling test/test_graph.c
Compiling test/test_optimizer.c
...
Running tests...
[====] Completed: 13 test(s), passed: 13, failed: 0
All tests should pass (66 checks across 5 test suites). If any fail:
- Check that all dependencies are installed
- Run make clean and retry
- Report the issue with your system details
To run individual tests:
cd test
../build/test/test_tofu tensor_creation
Installing to Your System
Once tests pass, install Tofu to your system:
sudo make install
This installs:
- Headers to /usr/local/include/tofu/
- Static library to /usr/local/lib/libtofu.a
- Shared library to /usr/local/lib/libtofu.so.1.0.0
- Package config file to /usr/local/lib/pkgconfig/tofu.pc
If you configured with a custom prefix:
# Install to home directory (no sudo needed)
./configure --prefix=$HOME/.local
make
make install
Verifying Installation
Check that Tofu is installed correctly:
pkg-config --modversion tofu
Expected output:
1.0.0
Get compilation and linking flags:
pkg-config --cflags --libs tofu
Expected output:
-I/usr/local/include/tofu -L/usr/local/lib -ltofu
Detailed Installation - macOS
This section covers installation on macOS 13 and newer.
System Requirements
Tofu requires:
- Xcode Command Line Tools: Apple's development toolchain with GCC/Clang
- Homebrew (optional): Package manager for additional tools
- pkg-config: For dependency management
Installing Xcode Command Line Tools
First, install the Xcode Command Line Tools:
xcode-select --install
A dialog will appear. Click Install and wait for completion (5-10 minutes).
Verify installation:
gcc --version
make --version
The output should show Apple Clang version 14 or newer (on macOS, the gcc command is an alias for Clang).
Installing Additional Dependencies
Using Homebrew to install the check testing framework:
brew install pkg-config check
If Homebrew isn't installed, see https://brew.sh
Cloning and Building
The build process is identical to Linux:
cd ~/Projects # macOS convention
git clone https://github.com/c2akula/tofu.git
cd tofu
chmod +x configure
./configure
make lib
make test
Installing to Your System
Install to /usr/local (default) or your home directory:
# System-wide installation
sudo make install
# Or to home directory (no sudo needed)
./configure --prefix=$HOME/.local
make
make install
macOS-Specific Notes
- M1/M2 Arm chips: Tofu builds natively on Apple Silicon. No special configuration needed.
- Intel Macs: Fully supported. Tofu uses standard C that compiles on both architectures.
- Shared library permissions: You may need to adjust Gatekeeper policies if running compiled binaries. See Apple's code signing documentation.
- Homebrew path: If using Homebrew on M1/M2 Macs, you may need export PATH="/opt/homebrew/bin:$PATH".
Cross-Compilation - ESP32
Build Tofu for embedded deployment on ESP32 microcontrollers.
Why ESP32?
ESP32 is a popular microcontroller combining:
- WiFi and Bluetooth connectivity
- 240 MHz dual-core processor
- 520 KB RAM
- Capable of running lightweight neural networks for inference
Tofu's lightweight C implementation makes it ideal for ESP32 deployment.
Prerequisites
You need the ESP-IDF (Espressif IoT Development Framework) toolchain:
- Install ESP-IDF following the official guide: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/get-started/index.html
- Verify the toolchain is available:
which xtensa-esp32-elf-gcc
This should show the path to the toolchain. If not, add it to your shell profile:
export PATH="$PATH:/path/to/esp-idf/tools/xtensa-esp32-elf/bin"
- Verify the toolchain works:
xtensa-esp32-elf-gcc --version
Configuring for ESP32
Clone Tofu and configure for cross-compilation:
git clone https://github.com/c2akula/tofu.git
cd tofu
chmod +x configure
./configure --esp32=yes
If your ESP32 toolchain is not in your PATH:
./configure --esp32=yes --esp32-toolchain-dir=/opt/esp-idf/tools/xtensa-esp32-elf
The configure output will show:
configure tofu version 1.0.0
Configuring for ESP32 cross-compilation
Using toolchain: /path/to/xtensa-esp32-elf
Generated config.mk
Building for ESP32
Build the library for ESP32:
make lib
This creates:
build/lib/libtofu.a: Static library for ESP32
Note: Tests are skipped during ESP32 cross-compilation:
Skipping tests when cross-compiling for ESP32
This is expected. Tests require POSIX system calls not available on ESP32.
Using Tofu on ESP32
Once built, link the static library into your ESP32 project:
- Copy the built library and headers:
cp build/lib/libtofu.a /path/to/your/esp32-project/components/tofu/lib/
cp -r build/include/tofu/* /path/to/your/esp32-project/components/tofu/include/
- In your ESP32 project's CMakeLists.txt:
set(COMPONENT_LIBS tofu m c) # Link tofu library
- Include headers in your code:
#include "tofu_tensor.h"
#include "tofu_graph.h"
- Build and deploy:
idf.py build
idf.py flash
Verification
After installation, verify everything works by running a simple test program.
Using pkg-config
Verify Tofu is discoverable by pkg-config:
pkg-config --list-all | grep tofu
Should show:
tofu - A light-weight neural network compiler for different software/hardware backends.
Using pkg-config in Your Build
When building applications that use Tofu:
gcc -o myapp myapp.c $(pkg-config --cflags --libs tofu)
This automatically includes:
- Compiler flags: -I/usr/local/include/tofu
- Linker flags: -L/usr/local/lib -ltofu
Example Verification Program
Create a simple C program to test your installation:
// test_tofu.c
#include "tofu_tensor.h"
#include <stdio.h>
int main() {
// Create a simple tensor [2, 3]
float data[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
int dims[] = {2, 3};
tofu_tensor *t = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);
if (t == NULL) {
printf("Failed to create tensor\n");
return 1;
}
printf("Tensor created successfully\n");
printf("Shape: [%d, %d]\n", t->dims[0], t->dims[1]);
printf("Size: %d elements\n", t->len);
tofu_tensor_free(t);
printf("Tensor freed successfully\n");
return 0;
}
Compile and run:
gcc -o test_tofu test_tofu.c $(pkg-config --cflags --libs tofu)
./test_tofu
Expected output:
Tensor created successfully
Shape: [2, 3]
Size: 6 elements
Tensor freed successfully
Troubleshooting Common Issues
Issue: pkg-config: command not found
- Solution: Install pkg-config (sudo apt-get install pkg-config on Linux, brew install pkg-config on macOS)
Issue: fatal error: tofu_tensor.h: No such file or directory
- Solution: Run make install to install the headers, or use the pkg-config output in your compilation flags.
Issue: undefined reference to 'tofu_tensor_create'
- Solution: Ensure you're using $(pkg-config --libs tofu) in your linker flags
Issue: Tests fail with "check not found"
- Solution: Install check (sudo apt-get install check on Linux, brew install check on macOS)
Issue: Configure fails with unknown compiler
- Solution: Ensure GCC/Clang is installed and in your PATH. Run gcc --version to verify.
Issue: Build fails on macOS with permission errors
- Solution: Try sudo make install, or configure with a home directory prefix: ./configure --prefix=$HOME/.local
Uninstallation
To remove Tofu from your system:
sudo make uninstall
This removes all installed files, headers, and pkg-config configuration.
If you installed to a home directory:
make uninstall
Next Steps
Now that Tofu is installed, explore:
- Quick Start: Write your first neural network
- First Network: Build a complete training example
- API Reference: Full API documentation
Happy building!
Quick Start
Get started with Tofu in 5 minutes.
What You'll Build
In this quick-start guide, you'll create your first Tofu program that performs a basic tensor operation: matrix multiplication. This will introduce you to the fundamentals of working with tensors in Tofu.
Prerequisites
You should have Tofu installed and built on your system. If not, please see the Installation Guide first.
Your First Program
Create a file called first.c with the following code:
#include <stdio.h>
#include "tofu/tofu_tensor.h"
int main() {
// Create two matrices: a (2x3) and b (3x2)
int dims_a[] = {2, 3};
tofu_tensor *a = tofu_tensor_zeros(2, dims_a, TOFU_FLOAT);
int dims_b[] = {3, 2};
tofu_tensor *b = tofu_tensor_zeros(2, dims_b, TOFU_FLOAT);
// Initialize matrix a with values [1, 2, 3, 4, 5, 6]
float vals_a[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
for (int i = 0; i < 6; i++) {
TOFU_TENSOR_DATA_FROM(a, i, vals_a[i], TOFU_FLOAT);
}
// Initialize matrix b with values [1, 2, 3, 4, 5, 6]
float vals_b[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
for (int i = 0; i < 6; i++) {
TOFU_TENSOR_DATA_FROM(b, i, vals_b[i], TOFU_FLOAT);
}
// Perform matrix multiplication: result = a @ b
tofu_tensor *result = tofu_tensor_matmul(a, b, NULL);
// Print the result to stdout
printf("Result of a @ b:\n");
tofu_tensor_print(result, "%.1f");
// Free all tensor memory
tofu_tensor_free_data_too(a);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(result);
return 0;
}
Compiling Your Program
To compile this program, you'll need to link against the Tofu library:
gcc -o first first.c -I/path/to/tofu/build/include \
-L/path/to/tofu/build/src -ltofu -lm
Or if you've installed Tofu to your system:
gcc -o first first.c -ltofu -lm
The -lm flag is needed for the math library, and -ltofu links against Tofu.
Expected Output
When you run the program, you should see:
Result of a @ b:
[[22.0 28.0]
[49.0 64.0]]
This is the result of multiplying:
- Matrix a: [[1, 2, 3], [4, 5, 6]] (shape 2×3)
- Matrix b: [[1, 2], [3, 4], [5, 6]] (shape 3×2)
What Just Happened?
Let's break down the key components of your first tensor program:
Tensor Creation
tofu_tensor *a = tofu_tensor_zeros(2, dims_a, TOFU_FLOAT);
This creates a tensor (multi-dimensional array) with:
- 2 dimensions (a matrix)
- Shape defined by dims_a (2 rows, 3 columns)
- Data type TOFU_FLOAT (32-bit floating-point)
- All elements initialized to zero
Tensor Data Access
TOFU_TENSOR_DATA_FROM(a, i, vals_a[i], TOFU_FLOAT);
This macro writes a floating-point value into the tensor at flat index i. Tensors store data in a flat array, but Tofu handles the indexing for multi-dimensional access.
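If you prefer to think in rows and columns, the flat index for a 2×3 row-major matrix is row * cols + col. The loop below is a minimal sketch equivalent to the initialization loop in the program above (it assumes the same a and vals_a):
// Sketch: fill matrix a by (row, col) instead of a single flat counter.
int rows = 2, cols = 3;
for (int row = 0; row < rows; row++) {
    for (int col = 0; col < cols; col++) {
        int flat = row * cols + col;              // row-major flat index
        TOFU_TENSOR_DATA_FROM(a, flat, vals_a[flat], TOFU_FLOAT);
    }
}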
Operations
tofu_tensor *result = tofu_tensor_matmul(a, b, NULL);
Matrix multiplication is one of the fundamental operations in Tofu. Passing NULL for the destination tensor tells Tofu to allocate a new tensor for the result.
Memory Management
tofu_tensor_free_data_too(a);
Always clean up your tensors when done. Use tofu_tensor_free_data_too() when you created the tensor with tofu_tensor_zeros() (which allocates both the structure and data). This prevents memory leaks.
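As a rule of thumb, match the free call to how the tensor was created. A minimal sketch of the two cases, using only the creation functions shown in this guide:
// Sketch: two cleanup paths depending on who allocated the data buffer.
float stack_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *wrapped = tofu_tensor_create(stack_data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor *owned = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);

tofu_tensor_free(wrapped);        // frees the structure only; stack_data stays valid
tofu_tensor_free_data_too(owned); // frees the structure and the data Tofu allocated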
For deeper explanations of tensors, operations, and data types, see Core Concepts.
Next Steps
Now that you've created your first tensor program, you're ready to explore more:
- Build a neural network: Learn how to create layers and train models in Your First Network
- Explore more operations: Check out the API reference for all available tensor operations
- Try more examples: Look for example programs in the examples/ directory
Happy tensor computing!
Your First Neural Network
Welcome to Tofu! In this guide, you'll build and train your first neural network to solve the classic XOR problem. By the end, you'll understand how to construct computation graphs, perform forward and backward passes, and optimize model parameters.
What You'll Build
You'll build a neural network that learns the XOR (exclusive OR) function. XOR is a simple yet elegant problem that demonstrates why neural networks need hidden layers. Your final model will output correct predictions for all four XOR inputs.
Why XOR?
XOR is the perfect learning problem because:
- Non-linear: You can't solve XOR with a single linear layer. This teaches you why hidden layers matter.
- Small: The dataset has only 4 examples, so training is fast.
- Well-understood: We know exactly what "correct" looks like.
- Practical: The same patterns apply to larger, real-world problems.
What You'll Learn
- How to create and structure computation graphs in Tofu
- How to build a multi-layer neural network
- How to execute forward passes (predictions)
- How to perform backward passes (gradient computation)
- How to run training loops with optimizers
- How to manage memory ownership correctly
- How to verify that your network actually learned
The XOR Problem
Understanding XOR
XOR returns 1 when inputs are different, 0 when they're the same:
[0, 0] → 0 (same, output 0)
[0, 1] → 1 (different, output 1)
[1, 0] → 1 (different, output 1)
[1, 1] → 0 (same, output 0)
Why It's Special
A single linear layer cannot learn XOR. Mathematically, XOR is not linearly separable—you cannot draw a single straight line to separate the 1s from the 0s on a 2D plane.
However, a network with a hidden layer can solve it by learning intermediate features. The hidden layer performs a non-linear transformation that makes XOR linearly separable in the higher-dimensional hidden space.
Network Architecture
To solve XOR, we'll use this architecture:
Input Layer (2 units)
↓
Hidden Layer (4 units with ReLU activation)
↓
Output Layer (1 unit)
The flow:
- Input layer: Takes [x1, x2] (the two binary inputs)
- Hidden layer: Learns 4 intermediate features via matrix multiplication and ReLU
- Output layer: Combines hidden features to produce final prediction
The ReLU (Rectified Linear Unit) activation in the hidden layer is crucial—it introduces non-linearity. Without it, stacking layers would be equivalent to a single linear layer.
Complete Code Walkthrough
Here's the full XOR training program. We'll break it down into sections and explain each part.
Section 1: Setup and Includes
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"
/* Xavier weight initialization for better convergence */
static float xor_xavier_init(int fan_in) {
float limit = sqrtf(6.0f / (float)fan_in);
return limit * (2.0f * (float)rand() / RAND_MAX - 1.0f);
}
We include Tofu's three core modules:
- tofu_tensor.h: Tensor creation and manipulation
- tofu_graph.h: Computation graph construction and differentiation
- tofu_optimizer.h: Gradient-based parameter updates
The xor_xavier_init function initializes weights using Xavier initialization. This ensures weights start in a good range that helps training converge faster than random initialization.
Section 2: Main Function and Configuration
int main() {
printf("============================================================\n");
printf("XOR Neural Network Training Example\n");
printf("============================================================\n\n");
/* Configuration */
const int INPUT_SIZE = 2;
const int HIDDEN_SIZE = 4;
const int OUTPUT_SIZE = 1;
const int NUM_EPOCHS = 2000;
const float LEARNING_RATE = 0.1f;
const int REPORT_INTERVAL = 200;
These constants define our network shape and training hyperparameters:
- INPUT_SIZE (2): Two binary inputs for XOR
- HIDDEN_SIZE (4): Four hidden units (more than enough to solve XOR)
- OUTPUT_SIZE (1): Single output for binary classification
- NUM_EPOCHS (2000): Number of times we iterate through the dataset
- LEARNING_RATE (0.1): Controls step size in parameter updates (higher = faster but riskier)
- REPORT_INTERVAL (200): How often to print progress
Section 3: Data Preparation
/* Prepare XOR dataset */
float xor_inputs[4][2] = {
{0.0f, 0.0f},
{0.0f, 1.0f},
{1.0f, 0.0f},
{1.0f, 1.0f}
};
float xor_targets[4][1] = {
{0.0f},
{1.0f},
{1.0f},
{0.0f}
};
printf("XOR Dataset:\n");
for (int i = 0; i < 4; i++) {
printf(" [%.0f, %.0f] -> %.0f\n",
xor_inputs[i][0], xor_inputs[i][1], xor_targets[i][0]);
}
printf("\n");
We hardcode the complete XOR dataset. This is all the training data we need—the network must generalize from just 4 examples (and it can because of the structure of the XOR problem).
Section 4: Creating the Computation Graph
/* Create computation graph */
tofu_graph* g = tofu_graph_create();
assert(g != NULL);
We create an empty computation graph. All our operations will be added to this graph. Think of it as a blueprint for computations—nodes represent operations, edges represent data flow.
Important concept: The graph doesn't own the tensor data. We allocate tensors separately and pass them to the graph. We're responsible for freeing them later.
Section 5: Initializing Network Parameters
/* Initialize weights with Xavier initialization */
float* w1_data = (float*)malloc(INPUT_SIZE * HIDDEN_SIZE * sizeof(float));
for (int i = 0; i < INPUT_SIZE * HIDDEN_SIZE; i++) {
w1_data[i] = xor_xavier_init(INPUT_SIZE);
}
tofu_tensor* t_w1 = tofu_tensor_create(w1_data, 2,
(int[]){INPUT_SIZE, HIDDEN_SIZE}, TOFU_FLOAT);
tofu_graph_node* w1 = tofu_graph_param(g, t_w1);
/* Initialize first bias */
float* b1_data = (float*)calloc(HIDDEN_SIZE, sizeof(float));
tofu_tensor* t_b1 = tofu_tensor_create(b1_data, 1, (int[]){HIDDEN_SIZE}, TOFU_FLOAT);
tofu_graph_node* b1 = tofu_graph_param(g, t_b1);
/* Initialize weights for hidden to output */
float* w2_data = (float*)malloc(HIDDEN_SIZE * OUTPUT_SIZE * sizeof(float));
for (int i = 0; i < HIDDEN_SIZE * OUTPUT_SIZE; i++) {
w2_data[i] = xor_xavier_init(HIDDEN_SIZE);
}
tofu_tensor* t_w2 = tofu_tensor_create(w2_data, 2,
(int[]){HIDDEN_SIZE, OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* w2 = tofu_graph_param(g, t_w2);
/* Initialize second bias */
float* b2_data = (float*)calloc(OUTPUT_SIZE, sizeof(float));
tofu_tensor* t_b2 = tofu_tensor_create(b2_data, 1, (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* b2 = tofu_graph_param(g, t_b2);
We initialize four parameter tensors:
- w1: Shape [2, 4]. Transforms input to hidden layer. Each column is one hidden unit's weights from both inputs.
- b1: Shape [4]. Bias for each hidden unit. We use calloc (zeros) for biases.
- w2: Shape [4, 1]. Transforms hidden to output. Weights from all 4 hidden units to the single output.
- b2: Shape [1]. Bias for the output unit.
Each tensor is converted to a graph node using tofu_graph_param. These are "parameter" nodes (trainable) as opposed to "input" nodes (non-trainable). The optimizer will update these during training.
Section 6: Creating the Optimizer
/* Create optimizer */
tofu_optimizer* optimizer = tofu_optimizer_sgd_create(g, LEARNING_RATE);
assert(optimizer != NULL);
We create an SGD (Stochastic Gradient Descent) optimizer with our learning rate. The optimizer will:
- Collect all parameter nodes from the graph
- Compute gradients during backward passes
- Update parameters based on those gradients
SGD is simple and effective: param = param - learning_rate * gradient
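Conceptually, the update applied to every parameter element follows the rule above. The snippet below is an illustration in plain C, not Tofu's actual optimizer code:
/* Illustration only: SGD update over one flat parameter buffer. */
void sgd_update(float *param, const float *grad, int len, float learning_rate) {
    for (int i = 0; i < len; i++) {
        param[i] -= learning_rate * grad[i];  /* param = param - lr * gradient */
    }
}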
Section 7: Training Loop Structure
float best_loss = INFINITY;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
float epoch_loss = 0.0f;
/* Process each training example */
for (int sample = 0; sample < 4; sample++) {
We train for 2000 epochs (full passes through the dataset). Each epoch processes all 4 XOR examples. We track the best loss and accumulate epoch loss for reporting.
This is "online learning" (one example at a time) rather than "batch learning," which is appropriate for tiny datasets.
Section 8: Forward Pass
/* Zero gradients */
tofu_graph_zero_grad(g);
/* Create input tensor for this sample */
float* input_data = (float*)malloc(INPUT_SIZE * sizeof(float));
input_data[0] = xor_inputs[sample][0];
input_data[1] = xor_inputs[sample][1];
tofu_tensor* t_input = tofu_tensor_create(input_data, 1,
(int[]){INPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* x = tofu_graph_input(g, t_input);
/* Forward pass: Layer 1 */
/* h1 = x @ w1 + b1 */
tofu_graph_node* h1_matmul = tofu_graph_matmul(g, x, w1);
tofu_graph_node* h1_bias = tofu_graph_add(g, h1_matmul, b1);
/* Apply ReLU activation */
tofu_graph_node* h1_relu = tofu_graph_relu(g, h1_bias);
/* Forward pass: Layer 2 (output) */
/* y = h1 @ w2 + b2 */
tofu_graph_node* y_matmul = tofu_graph_matmul(g, h1_relu, w2);
tofu_graph_node* y_pred = tofu_graph_add(g, y_matmul, b2);
/* Create target tensor */
float* target_data = (float*)malloc(OUTPUT_SIZE * sizeof(float));
target_data[0] = xor_targets[sample][0];
tofu_tensor* t_target = tofu_tensor_create(target_data, 1,
(int[]){OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* y_target = tofu_graph_input(g, t_target);
/* Compute MSE loss */
tofu_graph_node* loss_node = tofu_graph_mse_loss(g, y_pred, y_target);
Let's trace through the computation:
Zero gradients: Before computing new gradients, we clear old ones to prevent accumulation.
Create input node: Each example is a separate tensor created fresh, wrapped in a graph input node (non-trainable).
Hidden layer computation:
- h1_matmul = x @ w1: Matrix multiply. Input [1, 2] @ weights [2, 4] → [1, 4]
- h1_bias = h1_matmul + b1: Add bias [4] to each row (broadcasting)
- h1_relu = ReLU(h1_bias): Apply ReLU activation element-wise (max(0, x))
Output layer computation:
- y_matmul = h1_relu @ w2: Matrix multiply. Hidden [1, 4] @ weights [4, 1] → [1, 1]
- y_pred = y_matmul + b2: Add output bias [1]
Loss computation: loss = MSE(y_pred, y_target) = mean squared error = (y_pred - y_target)²
The computation graph now contains a chain: input → matmul → add → ReLU → matmul → add → loss
Section 9: Backward Pass and Parameter Update
/* Extract loss value */
tofu_tensor* loss_tensor = tofu_graph_get_value(loss_node);
float sample_loss = 0.0f;
if (loss_tensor && loss_tensor->len > 0) {
TOFU_TENSOR_DATA_TO(loss_tensor, 0, sample_loss, TOFU_FLOAT);
}
epoch_loss += sample_loss;
/* Backward pass: compute gradients */
tofu_graph_backward(g, loss_node);
/* Optimizer step: update weights and biases */
tofu_optimizer_step(optimizer);
/* Cleanup input/target tensors for this sample */
tofu_tensor_free(t_input);
tofu_tensor_free(t_target);
free(input_data);
free(target_data);
/* Clear operations for next sample (keeps parameters) */
tofu_graph_clear_ops(g);
}
/* Average loss over all 4 samples */
epoch_loss /= 4;
if (epoch_loss < best_loss) best_loss = epoch_loss;  /* track best loss for the final report */
Extract loss: We read the numerical loss value from the computed tensor using TOFU_TENSOR_DATA_TO.
Backward pass: tofu_graph_backward(g, loss_node) propagates gradients backward through the graph. Starting from the loss scalar, it computes:
- ∂loss/∂y_pred (what change in prediction would reduce loss)
- ∂loss/∂h1_relu (through the output layer)
- ∂loss/∂h1_bias (through ReLU)
- ∂loss/∂h1_matmul (through addition)
- ∂loss/∂w1, ∂loss/∂b1 (through matmul and add)
- ∂loss/∂w2, ∂loss/∂b2 (through second layer)
This is automatic differentiation—Tofu handles all the calculus!
Parameter update: tofu_optimizer_step(optimizer) updates all parameters using their gradients:
- w1 ← w1 - learning_rate × ∂loss/∂w1
- b1 ← b1 - learning_rate × ∂loss/∂b1
- (and similarly for w2, b2)
Cleanup: We free the input and target tensors (we allocated them, so we own them). Importantly, we keep w1, b1, w2, b2 (the parameters) intact.
Clear operations: tofu_graph_clear_ops(g) removes all the intermediate computation nodes but keeps the parameters. This prepares for the next sample without recreating parameters.
Section 10: Training Progress Reporting
/* Report progress */
if (epoch % REPORT_INTERVAL == 0 || epoch == NUM_EPOCHS - 1) {
printf("Epoch %4d: loss = %.6f\n", epoch, epoch_loss);
}
}
printf("\nTraining Complete!\n");
printf("Final average loss: %.6f\n", best_loss);
We print loss every 200 epochs and at the end. Watching loss decrease is satisfying and helps you spot problems (e.g., loss increasing means learning rate is too high).
Section 11: Evaluation
printf("\n");
printf("Final Predictions:\n");
printf("Input Predicted Target\n");
printf("---- --------- ------\n");
for (int sample = 0; sample < 4; sample++) {
/* Build inference graph for this sample */
float* input_data = (float*)malloc(INPUT_SIZE * sizeof(float));
input_data[0] = xor_inputs[sample][0];
input_data[1] = xor_inputs[sample][1];
tofu_tensor* t_input = tofu_tensor_create(input_data, 1,
(int[]){INPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* x = tofu_graph_input(g, t_input);
/* Forward pass (same as training) */
tofu_graph_node* h1_matmul = tofu_graph_matmul(g, x, w1);
tofu_graph_node* h1_bias = tofu_graph_add(g, h1_matmul, b1);
tofu_graph_node* h1_relu = tofu_graph_relu(g, h1_bias);
tofu_graph_node* y_matmul = tofu_graph_matmul(g, h1_relu, w2);
tofu_graph_node* y_pred = tofu_graph_add(g, y_matmul, b2);
/* Get prediction */
tofu_tensor* pred_tensor = tofu_graph_get_value(y_pred);
float prediction = 0.0f;
if (pred_tensor && pred_tensor->len > 0) {
TOFU_TENSOR_DATA_TO(pred_tensor, 0, prediction, TOFU_FLOAT);
}
printf("[%.0f, %.0f] %.4f %.0f\n",
xor_inputs[sample][0], xor_inputs[sample][1],
prediction, xor_targets[sample][0]);
tofu_tensor_free(t_input);
free(input_data);
tofu_graph_clear_ops(g);
}
After training, we perform inference (forward pass without backward) on all 4 examples. We print predicted values vs. targets. Since our output is a real number (not discrete), we'll see values close to 0 or 1.
Section 12: Accuracy Check and Cleanup
/* Check accuracy (threshold at 0.5) */
int correct = 0;
for (int sample = 0; sample < 4; sample++) {
/* ... build prediction graph ... */
int pred_class = (prediction > 0.5f) ? 1 : 0;
int true_class = (int)xor_targets[sample][0];
if (pred_class == true_class) {
correct++;
}
/* ... cleanup ... */
}
float accuracy = (float)correct / 4.0f;
printf("Accuracy: %d/4 (%.1f%%)\n", correct, accuracy * 100.0f);
if (accuracy == 1.0f) {
printf("\nSuccess! Network learned XOR perfectly!\n");
}
/* Cleanup (IMPORTANT ORDER) */
printf("\n");
printf("Cleaning up resources...\n");
tofu_optimizer_free(optimizer);
tofu_graph_free(g);
/* Free parameter tensors (caller owns them) */
tofu_tensor_free_data_too(t_w1);
tofu_tensor_free_data_too(t_b1);
tofu_tensor_free_data_too(t_w2);
tofu_tensor_free_data_too(t_b2);
printf("Done!\n");
printf("============================================================\n");
return 0;
}
We convert predictions to binary (threshold at 0.5) and compute accuracy. Finally, cleanup is critical and in the right order:
- Free optimizer (it might hold references to the graph)
- Free graph (it owns the nodes but not the tensor data)
- Free parameter tensors (we allocated the data, so we free it)
This order prevents use-after-free errors.
Compiling and Running
Save the complete program as examples/xor_training.c (or copy from Tofu's examples directory).
Compile
# From the tofu directory
make lib # Build the library
cc -I./src examples/xor_training.c build/src/libtofu.a -o xor_training -lm
Run
./xor_training
Expected Output
============================================================
XOR Neural Network Training Example
============================================================
Network Architecture: [2] -> [4] -> [1]
Training: 2000 epochs with SGD, learning_rate=0.100
XOR Dataset:
[0, 0] -> 0
[0, 1] -> 1
[1, 0] -> 1
[1, 1] -> 0
Epoch 0: loss = 0.488130
Epoch 200: loss = 0.000000
Epoch 400: loss = 0.000000
...
Epoch 1999: loss = 0.000000
Training Complete!
Final average loss: 0.000000
Final Predictions:
Input Predicted Target
---- --------- ------
[0, 0] 0.0000 0
[0, 1] 1.0000 1
[1, 0] 1.0000 1
[1, 1] 0.0000 0
Accuracy: 4/4 (100.0%)
Success! Network learned XOR perfectly!
Cleaning up resources...
Done!
Key observation: Loss rapidly converges to ~0 within a few hundred epochs! The network quickly learns to solve XOR.
Understanding the Results
Why Loss Decreases
Initially, the network makes random predictions (loss ≈ 0.49, far from correct). During training:
- Gradients computed via backprop tell each parameter how it contributed to the error
- Parameters move (via optimizer) in the direction that reduces loss
- With each example, the network improves
- After sufficient epochs, predictions are nearly perfect (loss → 0)
This iterative improvement is the essence of machine learning.
What the Network Learned
The hidden layer developed 4 internal features (learned by the 4 hidden units). These features transform the input space so that XOR becomes linearly separable. Think of it as the network learning new coordinate axes in which the problem is easier.
The output layer learned to combine these 4 hidden features into a single decision.
How to Verify Learning
The predictions match targets perfectly:
- [0, 0] → 0.0000 (should be 0) ✓
- [0, 1] → 1.0000 (should be 1) ✓
- [1, 0] → 1.0000 (should be 1) ✓
- [1, 1] → 0.0000 (should be 0) ✓
100% accuracy is the highest possible.
Experimenting Further
Now that you've trained a network, try modifying parameters to understand their effects:
Try Different Learning Rates
Change LEARNING_RATE to 0.01 (slower) or 0.5 (faster, but risky). Watch how convergence speed changes.
Try Different Hidden Sizes
Change HIDDEN_SIZE to 2 (too small—might not converge) or 8 (overkill). Can the network still solve XOR?
Add More Hidden Layers
Modify the forward pass to add another hidden layer:
tofu_graph_node* h2 = tofu_graph_matmul(g, h1_relu, w2);
h2 = tofu_graph_add(g, h2, b2);
h2 = tofu_graph_relu(g, h2);
tofu_graph_node* y_matmul = tofu_graph_matmul(g, h2, w3);
Does a deeper network help? (For XOR, it shouldn't be necessary. Note that you'd also need to create w3 and b3 as new parameters and resize w2 and b2 for the extra hidden layer.)
Monitor Individual Gradients
After tofu_graph_backward(), print gradient values to understand what each parameter is learning:
tofu_tensor* w1_grad = tofu_graph_get_grad(w1);
float w1_g0 = 0.0f;
TOFU_TENSOR_DATA_TO(w1_grad, 0, w1_g0, TOFU_FLOAT);  /* read the first gradient element */
printf("W1 gradient[0]: %.6f\n", w1_g0);
Next Steps
You've mastered the fundamentals! Here's your learning path:
-
Dive Deeper: Read the Concepts Guide to understand backpropagation and automatic differentiation in detail.
-
Build Bigger: Study the CNN Training Example to see how to scale to realistic datasets and architectures.
-
Real Datasets: Try training on real data:
- MNIST for digit classification
- Iris for flower classification
- Your own custom dataset
-
Advanced Optimizers: Experiment with SGD with momentum or Adam (if available in your version).
-
API Reference: Consult the Graph API and Optimizer API for complete documentation of all functions.
Key Takeaways
- Computation graphs let you define complex computations and differentiate them automatically
- Forward pass computes predictions (operations evaluate top-to-bottom)
- Backward pass computes gradients (automatic differentiation flows bottom-to-top)
- Optimizers update parameters based on gradients to minimize loss
- Memory ownership is crucial: you own input/parameter tensors, graph owns computed nodes
- Iteration (epochs) matters: neural networks improve with repeated exposure to data
You now understand the full training pipeline. You're ready to tackle more complex problems!
Core Concepts
Understanding the fundamental concepts behind Tofu will help you build neural networks efficiently and write correct, safe code. This guide introduces the key ideas that make Tofu powerful.
Introduction
Tofu is built around a small number of core concepts: tensors, computation graphs, and automatic differentiation. These three ideas work together to enable rapid development of machine learning systems.
If you come from a NumPy background, tensors will feel familiar—they're just multi-dimensional arrays. The difference is that Tofu tensors can be organized into computation graphs that automatically compute gradients for training. You won't need to manually derive derivative formulas or write error-prone gradient code.
This guide builds intuition for each concept with concrete examples. By the end, you'll understand how they fit together in a complete training loop.
Tensors
A tensor is simply a multi-dimensional array of numbers. Tensors are the fundamental data structure in Tofu—all neural network operations work with tensors.
Understanding Tensor Dimensions
Think of tensors as a natural extension of scalar numbers:
- Scalar: A single number. Shape: [] (0 dimensions). Example: 5.0
- Vector: A 1-D list of numbers. Shape: [3] (1 dimension). Example: [1.0, 2.0, 3.0]
- Matrix: A 2-D grid of numbers. Shape: [2, 3] (2 dimensions). Example: [[1, 2, 3], [4, 5, 6]]
- Tensor (3-D+): Higher-dimensional arrays. Shape: [2, 3, 4] (3 dimensions), etc.
In practice, machine learning uses tensors extensively:
- An image might be shape [height, width, channels] (e.g., [28, 28, 1] for 28x28 grayscale)
- A batch of images might be shape [batch_size, height, width, channels] (e.g., [32, 28, 28, 1])
- Neural network weights are often 2-D matrices: shape [input_dim, output_dim]
Tensor Shape and Size
Every tensor has a shape—a tuple of integer dimensions. The total number of elements is the product of all dimensions.
Tensor shape [2, 3, 4] contains 2 * 3 * 4 = 24 elements
Tofu's tensor structure stores both the shape and a flat data buffer:
tofu_tensor {
int ndim; // Number of dimensions (e.g., 3)
int *dims; // Array of dimension sizes (e.g., [2, 3, 4])
void *data; // Flat buffer of 24 floating point numbers
tofu_dtype dtype; // Data type (TOFU_FLOAT, TOFU_INT32, etc.)
}
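For intuition, the total element count (the len field you'll see in later examples) is just the product of the entries in dims. A minimal helper, not part of the Tofu API:
// Hypothetical helper: number of elements implied by a shape.
int element_count(const int *dims, int ndim) {
    int count = 1;
    for (int i = 0; i < ndim; i++) {
        count *= dims[i];   // e.g. [2, 3, 4] -> 2 * 3 * 4 = 24
    }
    return count;
}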
Data Types
Tensors can hold different numeric types depending on your needs:
- TOFU_FLOAT - 32-bit floating point (most common for neural networks)
- TOFU_DOUBLE - 64-bit floating point (higher precision)
- TOFU_INT32, TOFU_INT64 - Integer types
- TOFU_BOOL - Boolean values
For machine learning, you'll typically use TOFU_FLOAT for efficiency and simplicity.
Creating Tensors
Tofu provides several ways to create tensors:
// Create tensor from existing data buffer (you manage the buffer)
float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
int dims[] = {2, 2};
tofu_tensor *t = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);
// Create tensor with values (library manages allocation)
float values[] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *t = tofu_tensor_create_with_values(values, 2, dims);
// Create zero-filled tensor
int dims[] = {3, 4};
tofu_tensor *t = tofu_tensor_zeros(2, dims, TOFU_FLOAT);
// Create tensor with sequential values (like NumPy arange)
tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT); // [0, 1, ..., 9]
Tensor Operations
Tofu provides three main categories of tensor operations:
Element-wise operations apply an operation to each element independently:
Addition: [1, 2] + [3, 4] = [4, 6]
Multiply: [1, 2] * [3, 4] = [3, 8]
Power: [2, 3] ^ 2 = [4, 9]
Matrix operations perform linear algebra calculations:
Matrix multiplication: [2,3] @ [3,4] = [2,4]
Reduction operations combine elements along an axis:
Sum reduction along axis 1:
[[1, 2, 3],       [6]    (1+2+3=6)
 [4, 5, 6]]  -->  [15]   (4+5+6=15)
These operations form the building blocks of neural networks. For example, a fully-connected layer performs: output = matmul(input, weights) + bias.
Broadcasting: Implicit Dimension Expansion
A powerful feature of tensor operations is broadcasting—automatically expanding smaller tensors to match larger ones:
Matrix [3, 4] + Vector [4] broadcasts the vector to [3, 4]:
[[a, b, c, d],                       [[a+x, b+y, c+z, d+w],
 [e, f, g, h],  +  [x, y, z, w]  =    [e+x, f+y, g+z, h+w],
 [i, j, k, l]]                        [i+x, j+y, k+z, l+w]]

where the vector [4] = [x, y, z, w] is repeated for each row
This allows you to add biases to layer outputs without manually replicating them.
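In plain C terms, broadcasting a [cols] bias over a [rows, cols] matrix is equivalent to the loop below. This illustrates the semantics only; it is not how Tofu implements broadcasting:
// out[i][j] = in[i][j] + bias[j]; the bias vector is reused for every row.
void broadcast_add(const float *in, const float *bias, float *out,
                   int rows, int cols) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            out[i * cols + j] = in[i * cols + j] + bias[j];
        }
    }
}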
Computation Graphs
A computation graph is a way of representing how data flows through operations. Instead of computing results immediately, you describe the computation, then ask the graph to compute outputs and gradients.
Why Use Computation Graphs?
There are two key advantages:
- Memory efficiency: The graph knows all operations in advance, so it can optimize memory usage.
- Automatic differentiation: Once you have the graph, computing gradients is automatic—no manual derivative math needed.
Graph Structure: Directed Acyclic Graph (DAG)
A computation graph is a directed acyclic graph (DAG) where:
- Nodes represent tensors or operations
- Edges represent data flow between nodes
- No cycles (data only flows forward)
Here's a simple example:
x (INPUT) W (PARAM)
| |
v v
matmul ←────+
|
| y
v
add ← bias (PARAM)
|
v
relu
|
v
output
This graph represents: output = relu((x @ W) + bias)
Node Types
Nodes in a computation graph come in three flavors:
Leaf nodes have no inputs:
- INPUT: Data that doesn't need gradients (e.g., batch data)
- PARAM: Trainable parameters (e.g., weights, biases)
Operation nodes combine inputs:
- MATMUL: Matrix multiplication
- ADD: Element-wise addition
- MUL: Element-wise multiplication
- RELU: Activation function
- SOFTMAX: Softmax activation
- MSE_LOSS: Mean squared error loss
- CE_LOSS: Cross-entropy loss
- And many more...
Important: Graph nodes own their results (the tensors computed by operations), but the graph does NOT own INPUT or PARAM tensors. You create those tensors, add them to the graph, and you're responsible for freeing them later.
Building a Graph
Creating a graph follows this pattern:
// 1. Create tensors (you own these)
tofu_tensor *input = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 5}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){5}, TOFU_FLOAT);
// 2. Create graph and add leaf nodes
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, input);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
// 3. Build computation by adding operations
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
tofu_graph_node *z = tofu_graph_add(g, y, b);
tofu_graph_node *out = tofu_graph_relu(g, z);
The graph now contains all the computation. Each operation node automatically computes its value during construction.
Forward Pass: Computing Outputs
When you add the first operation node to a graph, forward pass computation happens automatically:
tofu_graph_node *y = tofu_graph_matmul(g, x, W); // Forward pass computes matmul
tofu_tensor *result = tofu_graph_get_value(y); // Get the computed result
Each node stores its result in node->value. You can inspect these at any time during or after building the graph.
Automatic Differentiation
Automatic differentiation is the magic that lets you compute gradients without writing a single derivative formula. It works by building on the chain rule from calculus.
The Chain Rule in Action
Recall from calculus: if z = f(g(x)), then dz/dx = (df/dg) * (dg/dx).
In neural networks, this chains together:
Forward: x → square → * 2 → y
Backward: ∂y/∂x = ∂(*2)/∂(square) * ∂(square)/∂x
= 2 * (2*x)
= 4x
For x = 3: Forward gives y = (3^2) * 2 = 18. Backward gives dy/dx = 4*3 = 12.
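You can verify those numbers with a few lines of standalone C (no Tofu calls, just the chain rule applied by hand):
#include <stdio.h>

int main(void) {
    float x = 3.0f;
    float g = x * x;              /* g(x) = x^2            -> 9  */
    float y = 2.0f * g;           /* y = 2 * g(x)          -> 18 */
    float dy_dg = 2.0f;           /* derivative of 2*g with respect to g */
    float dg_dx = 2.0f * x;       /* derivative of x^2 with respect to x */
    float dy_dx = dy_dg * dg_dx;  /* chain rule: 2 * 2x = 4x -> 12 */
    printf("y = %.1f, dy/dx = %.1f\n", y, dy_dx);
    return 0;
}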
Two Phases: Forward and Backward
Training has two phases:
Forward pass: Compute outputs by executing the graph. Each node records its operation and inputs for later use.
Backward pass: Starting from a loss scalar, compute gradients by working backwards through the graph, applying the chain rule at each node.
Forward Pass:
x → op1 → op2 → loss
(compute intermediate values)
Backward Pass (reverse):
loss → ∂op2 → ∂op1 → gradients for x
(compute gradients using chain rule)
Reverse-Mode Autodiff: How Tofu Does It
Tofu uses reverse-mode automatic differentiation (also called backpropagation):
- Build the graph by adding nodes (forward pass happens automatically)
- Call tofu_graph_backward(g, loss) to compute gradients
- Gradients accumulate in node->grad for each node
During backward:
// Each node that requires gradients gets a ∂loss/∂node value in node->grad
tofu_graph_node *W = tofu_graph_param(g, weights);
// ... build graph and call backward ...
tofu_tensor *W_grad = tofu_graph_get_grad(W); // Now contains ∂loss/∂W
The backward pass visits nodes in reverse topological order, so gradients flow correctly through the entire graph.
Gradient Accumulation
Gradients accumulate by default. If you call backward twice without zeroing, gradients add up:
tofu_graph_backward(g, loss1); // node->grad = ∂loss1/∂node
tofu_graph_backward(g, loss2); // node->grad += ∂loss2/∂node (accumulates!)
This is why you must call tofu_graph_zero_grad() before each training iteration:
for (int epoch = 0; epoch < 100; epoch++) {
tofu_graph_zero_grad(g); // Clear old gradients
// Forward pass (build graph)
tofu_graph_node *loss = build_forward_pass(g);
// Backward pass (compute new gradients)
tofu_graph_backward(g, loss);
}
Memory Management and Ownership
Proper memory management is critical in Tofu. Understanding ownership prevents memory leaks and use-after-free bugs.
Ownership Rules
Caller owns: INPUT and PARAM tensors
When you create a tensor and pass it to tofu_graph_input() or tofu_graph_param(), the graph does NOT take ownership. You must free it yourself:
tofu_tensor *input = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);
tofu_graph_node *x = tofu_graph_input(g, input);
// After you're done with the graph...
tofu_graph_free(g); // Graph is freed
tofu_tensor_free(input); // YOU must free the input tensor
Graph owns: Operation results
When an operation creates a result (like matmul, add, relu), the graph owns that tensor. You never free operation results—the graph does:
tofu_graph_node *result = tofu_graph_matmul(g, a, b);
tofu_tensor *result_value = tofu_graph_get_value(result);
// When you're done:
tofu_graph_free(g); // Frees result_value automatically
View Operations: Sharing Memory
Some operations create "views"—new tensor structures that share memory with originals. No data is copied:
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *reshaped = tofu_tensor_reshape(t, 2, (int[]){3, 4});
// reshaped shares data with t
// Both point to the same 12 floats, just interpreted differently
When using views, remember:
- Don't free the view with free_data_too (that would free shared memory)
- Use tofu_tensor_free() on views (just free the structure)
- The original must outlive the view
Cleanup Order
Always clean up in this order:
// 1. Free optimizer (if used)
tofu_optimizer_free(opt);
// 2. Free graph (frees operation results and nodes)
tofu_graph_free(g);
// 3. Free parameter tensors (you created these)
tofu_tensor_free_data_too(weights);
tofu_tensor_free_data_too(bias);
// 4. Free input tensors (you created these)
tofu_tensor_free(input);
Common Mistakes
Mistake 1: Freeing a view with free_data_too
// WRONG!
tofu_tensor *view = tofu_tensor_reshape(t, 2, (int[]){3, 4});
tofu_tensor_free_data_too(view); // Frees shared memory!
Correct:
tofu_tensor *view = tofu_tensor_reshape(t, 2, (int[]){3, 4});
tofu_tensor_free(view); // Just free the structure
Mistake 2: Using graph node results after freeing the graph
// WRONG!
tofu_graph_free(g);
tofu_tensor *result = tofu_graph_get_value(node); // Dangling pointer!
Correct:
tofu_tensor *result = tofu_graph_get_value(node); // Get before freeing
tofu_graph_free(g);
Mistake 3: Forgetting to free parameter tensors
// WRONG!
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_free(g); // Forgetting to free weights!
Correct:
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_free(g);
tofu_tensor_free_data_too(weights); // Free the tensor you created
Training Loop Pattern
A typical training loop follows a consistent pattern:
for each epoch:
for each batch:
1. Zero gradients
2. Build forward graph and compute loss
3. Backward pass (compute gradients)
4. Optimizer step (update parameters using gradients)
5. Clear operations (keep parameters, discard computation nodes)
Here's what this looks like in code:
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01); // Learning rate
for (int epoch = 0; epoch < 100; epoch++) {
for (int batch = 0; batch < num_batches; batch++) {
// 1. Zero gradients
tofu_optimizer_zero_grad(opt);
// 2. Forward pass
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
pred = tofu_graph_add(g, pred, b);
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
// 3. Compute loss
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// 4. Backward pass (compute gradients)
tofu_graph_backward(g, loss);
// 5. Update parameters
tofu_optimizer_step(opt);
// 6. Clear operations for next batch (W and b are preserved)
tofu_graph_clear_ops(g);
}
}
The order is important:
- Zero before forward: Clean slate for new gradients
- Forward then backward: Must compute values before computing gradients
- Backward before step: Optimizer needs gradients to update
- Clear after step: Makes room for next batch while keeping parameters
How These Concepts Work Together
Now you can see how everything fits together:
- Tensors hold your data—inputs, parameters, and intermediate results
- Computation graphs describe the structure of your model and automatically compute results
- Automatic differentiation computes gradients by applying the chain rule through the graph
- Training loop repeats: zero gradients, forward, backward, update parameters
The power of this design: you describe your model once, and Tofu automatically computes all the gradients. No manual derivative formulas. No gradient bugs. Just correct, efficient training.
Summary
Understanding these core concepts will serve you well as you build neural networks with Tofu:
- Tensors are multi-dimensional arrays—the fundamental data structure
- Computation graphs organize operations in a way that enables automatic differentiation
- Automatic differentiation computes gradients automatically using the chain rule
- Memory ownership is explicit: you own inputs and parameters, graphs own operations
- Training follows a pattern: zero gradients, forward, backward, update, repeat
Next, check out the tutorials to see these concepts in action. The first tutorial will walk you through building a complete neural network from scratch.
Tensors
Tensors are the fundamental data structure in Tofu, representing multi-dimensional arrays that flow through neural networks. This comprehensive guide covers everything you need to know about creating, manipulating, and operating on tensors.
Introduction
Tensors are multi-dimensional arrays that generalize scalars (0-D), vectors (1-D), and matrices (2-D) to arbitrary dimensions. In neural networks, tensors represent:
- Input data (images, text, sensor readings)
- Model parameters (weights, biases)
- Intermediate activations
- Gradients during backpropagation
Prerequisites
This guide assumes you've completed the Getting Started guide and understand:
- Basic tensor concepts (shape, dimensions)
- C programming and memory management
- How to compile and link against Tofu
What This Guide Covers
- Tensor fundamentals and memory layout
- Creation methods for different use cases
- Data access and modification patterns
- Shape operations (reshape, transpose, slice)
- Mathematical operations (matmul, element-wise, broadcasting)
- Reduction operations (sum, mean, max)
- Activation functions
- Memory management and ownership
Tensor Fundamentals
Tensor Structure
A tofu_tensor represents a multi-dimensional array with the following key properties:
struct tofu_tensor {
tofu_dtype dtype; // Data type (FLOAT, INT32, etc.)
int len; // Total number of elements
int ndim; // Number of dimensions
int *dims; // Array of dimension sizes
void *data; // Pointer to data buffer
struct tofu_tensor *owner; // For view tensors, points to data owner
void *backend_data; // Backend-specific data
};
Example of a 2×3 matrix:
ndim = 2
dims = [2, 3]
len = 6
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Visual representation:
[[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]]
Data Types
Tofu supports multiple data types via the tofu_dtype enum:
| Type | Description | Size | Use Cases |
|---|---|---|---|
| TOFU_FLOAT | 32-bit floating point | 4 bytes | Neural network weights, activations |
| TOFU_DOUBLE | 64-bit floating point | 8 bytes | High-precision computation |
| TOFU_INT32 | 32-bit signed integer | 4 bytes | Labels, indices, counters |
| TOFU_INT64 | 64-bit signed integer | 8 bytes | Large indices |
| TOFU_INT8 | 8-bit signed integer | 1 byte | Quantized weights |
| TOFU_INT16 | 16-bit signed integer | 2 bytes | Quantized activations |
| TOFU_BOOL | Boolean | 4 bytes | Masks, conditions |
Most neural network operations use TOFU_FLOAT for weights and activations, while TOFU_INT32 is common for labels and class predictions.
Memory Layout
Tofu uses row-major (C-style) memory layout, where the last dimension varies fastest:
// 2×3 matrix
float data[] = {1.0, 2.0, 3.0, // Row 0
4.0, 5.0, 6.0}; // Row 1
// 2×3×4 tensor (2 matrices, each 3×4)
// data[0..11] = first matrix (row-major)
// data[12..23] = second matrix (row-major)
This affects how you iterate and index tensors:
// Iterating in memory order (efficient)
for (int i = 0; i < dims[0]; i++) {
for (int j = 0; j < dims[1]; j++) {
int index = i * dims[1] + j;
// Access data[index]
}
}
Creating Tensors
Wrapping Existing Data
Use tofu_tensor_create() to wrap existing data without copying:
tofu_tensor *tofu_tensor_create(void *data, int ndim, const int *dims,
tofu_dtype dtype);
This is efficient when you already have data in memory:
float weights[] = {0.1, 0.2, 0.3, 0.4};
tofu_tensor *W = tofu_tensor_create(weights, 2, (int[]){2, 2}, TOFU_FLOAT);
// Use W...
tofu_tensor_free(W); // Frees structure, NOT data
// weights[] is still valid
Use when: Data is managed elsewhere (stack, static, or you handle malloc/free)
Zero Initialization
tofu_tensor_zeros() allocates and zero-initializes a tensor:
tofu_tensor *tofu_tensor_zeros(int ndim, const int *dims, tofu_dtype dtype);
Example:
tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
// t is a 3×4 matrix filled with 0.0
tofu_tensor_free_data_too(t); // Frees both structure and data
Use when: You need a new tensor and will populate it later (common for parameters)
Creating With Values
tofu_tensor_create_with_values() creates and initializes from an array:
float values[] = {1.0, 2.0, 3.0, 4.0};
tofu_tensor *t = tofu_tensor_create_with_values(values, 1, (int[]){4});
tofu_tensor_free_data_too(t);
Use when: You have initial values ready (common for biases, small constants)
Range of Values
tofu_tensor_arange() creates a tensor with evenly spaced values:
tofu_tensor *tofu_tensor_arange(double start, double stop, double step,
tofu_dtype dtype);
Examples:
// Forward slicing (positive step)
tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 2.0, TOFU_FLOAT);
// t = [0.0, 2.0, 4.0, 6.0, 8.0]
// Reverse slicing (negative step) - v1.1.0+
tofu_tensor *r = tofu_tensor_arange(10.0, 0.0, -2.0, TOFU_FLOAT);
// r = [10.0, 8.0, 6.0, 4.0, 2.0]
tofu_tensor_free_data_too(t);
tofu_tensor_free_data_too(r);
Use when: Generating test data, indices, or sequences
Note: Returns NULL for empty ranges (start == stop) or incompatible step directions (e.g., arange(0, 10, -1))
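Given that behavior, it's worth guarding calls whose start/stop/step values come from elsewhere in your program. A minimal sketch:
// The negative step contradicts the increasing range, so NULL is expected here.
tofu_tensor *bad = tofu_tensor_arange(0.0, 10.0, -1.0, TOFU_FLOAT);
if (bad == NULL) {
    // Handle the error; don't dereference or free a NULL result.
}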
Deep Copy
tofu_tensor_clone() creates an independent copy:
tofu_tensor *original = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *copy = tofu_tensor_clone(original);
// Modifying copy doesn't affect original
tofu_tensor_free_data_too(copy);
tofu_tensor_free_data_too(original);
Use when: You need to preserve original data while modifying a copy
Accessing and Modifying Data
Reading Values
Use the TOFU_TENSOR_DATA_FROM() macro to safely read values:
tofu_tensor *t = tofu_tensor_arange(0.0, 5.0, 1.0, TOFU_FLOAT);
for (int i = 0; i < t->len; i++) {
float value;
TOFU_TENSOR_DATA_FROM(t, i, value, TOFU_FLOAT);
printf("t[%d] = %.1f\n", i, value);
}
tofu_tensor_free_data_too(t);
Writing Values
Use TOFU_TENSOR_DATA_TO() macro to safely write values:
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
for (int i = 0; i < t->len; i++) {
float value = i * 0.5;
TOFU_TENSOR_DATA_TO(t, i, value, TOFU_FLOAT);
}
tofu_tensor_print(t, "%.1f"); // [0.0, 0.5, 1.0, 1.5]
tofu_tensor_free_data_too(t);
Direct Pointer Access
For performance-critical code, access data directly:
tofu_tensor *t = tofu_tensor_zeros(2, (int[]){100, 100}, TOFU_FLOAT);
float *data = (float*)t->data;
// Fast iteration
for (int i = 0; i < t->len; i++) {
data[i] = i * 0.1;
}
tofu_tensor_free_data_too(t);
Warning: Ensure type safety - casting to the wrong type causes undefined behavior.
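One simple safeguard is to branch on the dtype field before casting, as sketched here:
// Only treat the buffer as float* when the tensor actually holds TOFU_FLOAT.
if (t->dtype == TOFU_FLOAT) {
    float *values = (float*)t->data;
    for (int i = 0; i < t->len; i++) {
        values[i] *= 2.0f;
    }
}
// For other dtypes, cast to the matching C type instead.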
Iterating Multi-Dimensional Tensors
For 2-D tensors (matrices):
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
float *data = (float*)matrix->data;
for (int i = 0; i < matrix->dims[0]; i++) { // Rows
for (int j = 0; j < matrix->dims[1]; j++) { // Columns
int index = i * matrix->dims[1] + j;
data[index] = i + j;
}
}
tofu_tensor_free_data_too(matrix);
For 3-D tensors:
tofu_tensor *tensor = tofu_tensor_zeros(3, (int[]){2, 3, 4}, TOFU_FLOAT);
float *data = (float*)tensor->data;
for (int i = 0; i < tensor->dims[0]; i++) {
for (int j = 0; j < tensor->dims[1]; j++) {
for (int k = 0; k < tensor->dims[2]; k++) {
int index = (i * tensor->dims[1] + j) * tensor->dims[2] + k;
data[index] = i + j + k;
}
}
}
tofu_tensor_free_data_too(tensor);
Shape Operations
Reshape
tofu_tensor_reshape() changes tensor shape without copying data:
tofu_tensor *tofu_tensor_reshape(const tofu_tensor *src, int ndim,
const int *dims);
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
// t shape: [12]
tofu_tensor *matrix = tofu_tensor_reshape(t, 2, (int[]){3, 4});
// matrix shape: [3, 4], shares data with t
tofu_tensor_print(matrix, "%.1f");
// [[0.0, 1.0, 2.0, 3.0],
// [4.0, 5.0, 6.0, 7.0],
// [8.0, 9.0, 10.0, 11.0]]
tofu_tensor_free(matrix); // View only
tofu_tensor_free_data_too(t); // Original with data
Important: Product of new dimensions must equal original size.
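A quick sanity check before reshaping, sketched against the t created above (its len is 12):
// Verify the new shape holds exactly as many elements as the original tensor.
int new_dims[] = {3, 4};
int new_len = new_dims[0] * new_dims[1];   // 12
if (new_len != t->len) {
    // Incompatible shape; skip the reshape instead of calling it anyway.
}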
In-Place Reshape
tofu_tensor_reshape_src() reshapes a tensor in place:
void tofu_tensor_reshape_src(tofu_tensor *t, int ndim, const int *new_dims);
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){2, 3});
tofu_tensor_print(t, "%.1f");
// [[0.0, 1.0, 2.0],
// [3.0, 4.0, 5.0]]
tofu_tensor_free_data_too(t);
Transpose
tofu_tensor_transpose() swaps dimensions:
tofu_tensor *tofu_tensor_transpose(const tofu_tensor *src, tofu_tensor *dst, const int *axes);
Passing NULL for axes reverses the dimension order, as with the graph-level transpose described later.
For 2-D tensors, transposes rows and columns:
float data[] = {1, 2, 3,
4, 5, 6};
tofu_tensor *A = tofu_tensor_create(data, 2, (int[]){2, 3}, TOFU_FLOAT);
// [[1, 2, 3],
// [4, 5, 6]]
tofu_tensor *AT = tofu_tensor_transpose(A, NULL, NULL);
// [[1, 4],
// [2, 5],
// [3, 6]]
tofu_tensor_free_data_too(AT);
tofu_tensor_free(A);
Use cases: Matrix operations, batch dimensions, image transformations
Slice
tofu_tensor_slice() extracts a subtensor:
tofu_tensor *tofu_tensor_slice(const tofu_tensor *src, tofu_tensor *dst,
int axis, int start, int len);
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT);
// [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
tofu_tensor *slice = tofu_tensor_slice(t, NULL, 0, 2, 5);
// [2, 3, 4, 5, 6]
tofu_tensor_free_data_too(slice);
tofu_tensor_free_data_too(t);
For matrices, slice rows:
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){10, 5}, TOFU_FLOAT);
tofu_tensor *rows = tofu_tensor_slice(matrix, NULL, 0, 2, 3);
// Extracts rows 2, 3, 4 → shape [3, 5]
tofu_tensor_free_data_too(rows);
tofu_tensor_free_data_too(matrix);
Concatenate
tofu_tensor_concat() joins tensors along an axis:
tofu_tensor *tofu_tensor_concat(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst, int axis);
Example:
tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(a, 2, (int[]){1, 3}); // [[0, 1, 2]]
tofu_tensor *b = tofu_tensor_arange(3.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(b, 2, (int[]){1, 3}); // [[3, 4, 5]]
tofu_tensor *c = tofu_tensor_concat(a, b, NULL, 0); // Concatenate rows
// [[0, 1, 2],
// [3, 4, 5]]
tofu_tensor_free_data_too(c);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);
Mathematical Operations
Matrix Multiplication
tofu_tensor_matmul() performs matrix multiplication with broadcasting:
tofu_tensor *tofu_tensor_matmul(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst);
Basic matrix multiplication:
tofu_tensor *A = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT); // 2×3
tofu_tensor *B = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT); // 3×4
tofu_tensor *C = tofu_tensor_matmul(A, B, NULL); // 2×4
// C[i,j] = Σ(A[i,k] * B[k,j])
tofu_tensor_free_data_too(C);
tofu_tensor_free_data_too(B);
tofu_tensor_free_data_too(A);
Dimension rules:
- For 1-D @ 1-D: src1->dims[0] must equal src2->dims[0] (dot product)
- For 2-D and higher: src1->dims[src1->ndim-1] must equal src2->dims[src2->ndim-2]
Batch matrix multiplication:
tofu_tensor *batches = tofu_tensor_zeros(3, (int[]){10, 2, 3}, TOFU_FLOAT);
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *results = tofu_tensor_matmul(batches, weights, NULL);
// Shape: [10, 2, 4] - broadcasts weights across batch
tofu_tensor_free_data_too(results);
tofu_tensor_free_data_too(weights);
tofu_tensor_free_data_too(batches);
Inner Product
tofu_tensor_inner() computes dot product (sum of element-wise products):
tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT); // [0, 1, 2]
tofu_tensor *b = tofu_tensor_arange(1.0, 4.0, 1.0, TOFU_FLOAT); // [1, 2, 3]
tofu_tensor *result = tofu_tensor_inner(a, b, NULL);
// result = 0*1 + 1*2 + 2*3 = 8
tofu_tensor_free_data_too(result);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);
Outer Product
tofu_tensor_outer() computes outer product:
tofu_tensor *a = tofu_tensor_arange(1.0, 4.0, 1.0, TOFU_FLOAT); // [1, 2, 3]
tofu_tensor *b = tofu_tensor_arange(1.0, 3.0, 1.0, TOFU_FLOAT); // [1, 2]
tofu_tensor *result = tofu_tensor_outer(a, b, NULL);
// [[1, 2],
// [2, 4],
// [3, 6]]
tofu_tensor_free_data_too(result);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);
Element-Wise Operations
tofu_tensor_elew() performs element-wise operations:
tofu_tensor *tofu_tensor_elew(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst, tofu_elew_op op);
Supported operations:
| Operation | Description | Example |
|---|---|---|
| TOFU_SUM | Addition | a + b |
| TOFU_SUB | Subtraction | a - b |
| TOFU_MUL | Multiplication | a * b |
| TOFU_DIV | Division | a / b |
| TOFU_MAX | Maximum | max(a, b) |
| TOFU_MIN | Minimum | min(a, b) |
Example:
tofu_tensor *a = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT); // [1, 2, 3, 4]
tofu_tensor *b = tofu_tensor_arange(2.0, 6.0, 1.0, TOFU_FLOAT); // [2, 3, 4, 5]
tofu_tensor *sum = tofu_tensor_elew(a, b, NULL, TOFU_SUM); // [3, 5, 7, 9]
tofu_tensor *prod = tofu_tensor_elew(a, b, NULL, TOFU_MUL); // [2, 6, 12, 20]
tofu_tensor_free_data_too(prod);
tofu_tensor_free_data_too(sum);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);
Element-Wise with Scalar
tofu_tensor_elew_param() applies operation with a scalar:
tofu_tensor *tofu_tensor_elew_param(const tofu_tensor *src, double param,
tofu_tensor *dst, tofu_elew_op op);
Example:
tofu_tensor *t = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT); // [1, 2, 3, 4]
tofu_tensor *scaled = tofu_tensor_elew_param(t, 2.0, NULL, TOFU_MUL); // [2, 4, 6, 8]
tofu_tensor *shifted = tofu_tensor_elew_param(t, 10.0, NULL, TOFU_SUM); // [11, 12, 13, 14]
tofu_tensor_free_data_too(shifted);
tofu_tensor_free_data_too(scaled);
tofu_tensor_free_data_too(t);
Broadcasting
Broadcasting allows operations on tensors with different but compatible shapes:
Rules:
- Start from trailing dimensions
- Dimensions are compatible if they're equal or one is 1
- Missing dimensions are treated as 1
Examples:
// Shape [3, 4] + Shape [4] → broadcasts to [3, 4]
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT); // [4]
tofu_tensor *result = tofu_tensor_elew_broadcast(matrix, bias, NULL, TOFU_SUM);
// bias is added to each row of matrix
tofu_tensor_free_data_too(result);
tofu_tensor_free_data_too(bias);
tofu_tensor_free_data_too(matrix);
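When shapes are computed dynamically, tofu_tensor_isbroadcastable() (listed under Size Queries below) can guard the call; a minimal sketch:
tofu_tensor *a = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
if (tofu_tensor_isbroadcastable(a, b)) {
    tofu_tensor *out = tofu_tensor_elew_broadcast(a, b, NULL, TOFU_SUM);
    tofu_tensor_free_data_too(out);
}
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);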
Reduction Operations
Sum Reduction
tofu_tensor_sumreduce() sums along an axis:
tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});
tofu_tensor *col_sums = tofu_tensor_sumreduce(t, NULL, 0); // Sum rows
// Shape: [1, 4], values: [[12, 15, 18, 21]]
tofu_tensor *row_sums = tofu_tensor_sumreduce(t, NULL, 1); // Sum columns
// Shape: [3, 1], values: [[6], [22], [38]]
tofu_tensor_free_data_too(row_sums);
tofu_tensor_free_data_too(col_sums);
tofu_tensor_free_data_too(t);
Mean Reduction
tofu_tensor_meanreduce() computes mean along an axis:
tofu_tensor *data = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(data, 2, (int[]){3, 4});
tofu_tensor *row_means = tofu_tensor_meanreduce(data, NULL, 1);
// Shape: [3, 1], values: [[1.5], [5.5], [9.5]]
tofu_tensor_free_data_too(row_means);
tofu_tensor_free_data_too(data);
Max Reduction
tofu_tensor_maxreduce() finds maximum values and optionally their indices:
tofu_tensor *tofu_tensor_maxreduce(const tofu_tensor *src, tofu_tensor *dst,
tofu_tensor *arg, int axis);
Example:
float data[] = {3, 1, 4, 1, 5, 9, 2, 6, 5};
tofu_tensor *t = tofu_tensor_create(data, 2, (int[]){3, 3}, TOFU_FLOAT);
tofu_tensor *indices = tofu_tensor_zeros(2, (int[]){3, 1}, TOFU_INT32);
tofu_tensor *max_vals = tofu_tensor_maxreduce(t, NULL, indices, 1);
// max_vals shape: [3, 1], values: [[4], [9], [6]]
// indices shape: [3, 1], values: [[2], [2], [1]]
tofu_tensor_free_data_too(max_vals);
tofu_tensor_free_data_too(indices);
tofu_tensor_free(t);
Activation Functions
Leaky ReLU
tofu_tensor *tofu_tensor_lrelu(const tofu_tensor *src, tofu_tensor *dst,
float negslope);
Example:
float data[] = {-2, -1, 0, 1, 2};
tofu_tensor *x = tofu_tensor_create(data, 1, (int[]){5}, TOFU_FLOAT);
tofu_tensor *relu = tofu_tensor_lrelu(x, NULL, 0.0f); // Standard ReLU
// [0, 0, 0, 1, 2]
tofu_tensor *leaky = tofu_tensor_lrelu(x, NULL, 0.01f); // Leaky ReLU
// [-0.02, -0.01, 0, 1, 2]
tofu_tensor_free_data_too(leaky);
tofu_tensor_free_data_too(relu);
tofu_tensor_free(x);
Softmax
tofu_tensor *tofu_tensor_softmax(const tofu_tensor *src, tofu_tensor *dst,
int axis);
Example:
float logits[] = {1, 2, 3};
tofu_tensor *t = tofu_tensor_create(logits, 1, (int[]){3}, TOFU_FLOAT);
tofu_tensor *probs = tofu_tensor_softmax(t, NULL, 0);
// Approximately [0.09, 0.24, 0.67]
tofu_tensor_free_data_too(probs);
tofu_tensor_free(t);
Layer Normalization
tofu_tensor *tofu_tensor_layer_norm(const tofu_tensor *src, tofu_tensor *dst,
const tofu_tensor *gamma, const tofu_tensor *beta,
int axis, double eps);
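A minimal usage sketch, normalizing each row of a [2, 4] matrix. Passing NULL for gamma and beta to skip scale and shift is documented for the graph-level version later in this guide; that it also applies here is an assumption:
tofu_tensor *x = tofu_tensor_arange(0.0, 8.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(x, 2, (int[]){2, 4});
// Zero mean, unit variance per row (assumes NULL gamma/beta is accepted)
tofu_tensor *y = tofu_tensor_layer_norm(x, NULL, NULL, NULL, 1, 1e-5);
tofu_tensor_print(y, "%.3f");
tofu_tensor_free_data_too(y);
tofu_tensor_free_data_too(x);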
Utility Functions
Printing
void tofu_tensor_print(const tofu_tensor *t, const char *fmt);
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){2, 3});
tofu_tensor_print(t, "%.1f");
// [[0.0, 1.0, 2.0],
// [3.0, 4.0, 5.0]]
tofu_tensor_free_data_too(t);
Size Queries
size_t size = tofu_tensor_size(t); // Total elements
int same_shape = tofu_tensor_issameshape(t1, t2);
int broadcastable = tofu_tensor_isbroadcastable(t1, t2);
Type Conversion
tofu_tensor *ints = tofu_tensor_convert(floats, NULL, TOFU_INT32);
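A fuller sketch (the exact rounding or truncation behavior of the conversion is not specified here):
tofu_tensor *floats = tofu_tensor_arange(0.0, 4.0, 1.0, TOFU_FLOAT); // [0.0, 1.0, 2.0, 3.0]
tofu_tensor *ints = tofu_tensor_convert(floats, NULL, TOFU_INT32);   // [0, 1, 2, 3]
tofu_tensor_print(ints, "%d"); // assumes "%d" is the right format for TOFU_INT32
tofu_tensor_free_data_too(ints);
tofu_tensor_free_data_too(floats);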
Memory Management
Ownership Rules
Rule 1: Tensors created with tofu_tensor_create() don't own their data
float data[4] = {1, 2, 3, 4};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free(t); // Only frees tensor structure
// data is still valid
Rule 2: Tensors created with tofu_tensor_zeros(), tofu_tensor_clone(), etc. own their data
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free_data_too(t); // Frees both structure and data
Rule 3: View operations (reshape, transpose) share data
tofu_tensor *original = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *view = tofu_tensor_reshape(original, 2, (int[]){3, 4});
// view shares data with original
tofu_tensor_free(view); // Free view only
tofu_tensor_free_data_too(original); // Free data with original
Common Mistakes
Mistake 1: Using free_data_too on user-owned data
// WRONG
float data[4];
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free_data_too(t); // Tries to free stack memory!
Mistake 2: Memory leak from not freeing library-owned data
// WRONG
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free(t); // Leaks data buffer!
Mistake 3: Freeing view data
// WRONG
tofu_tensor *original = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *view = tofu_tensor_reshape(original, 2, (int[]){3, 4});
tofu_tensor_free_data_too(view); // Frees shared data!
tofu_tensor_free_data_too(original); // Double free!
Best Practices
Efficiency
- Reuse destination tensors to avoid allocations (see the sketch after this list)
- Use views when possible (reshape, not clone)
- Choose appropriate data types (INT32 for labels, FLOAT for weights)
- Batch operations for efficiency
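For the first point, most tensor operations accept a preallocated dst in place of NULL; a minimal sketch (assuming dst already has the result's shape and dtype):
tofu_tensor *a = tofu_tensor_zeros(1, (int[]){1024}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){1024}, TOFU_FLOAT);
tofu_tensor *out = tofu_tensor_zeros(1, (int[]){1024}, TOFU_FLOAT); // Allocated once
for (int step = 0; step < 100; step++) {
    tofu_tensor_elew(a, b, out, TOFU_SUM); // Result written into out, no new allocation
}
tofu_tensor_free_data_too(out);
tofu_tensor_free_data_too(b);
tofu_tensor_free_data_too(a);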
Debugging
- Validate shapes before operations
- Print intermediate results with tofu_tensor_print()
- Check for NaN/Inf in numerical operations
- Track allocations to find memory leaks
Common Pitfalls
- Don't modify views expecting independence
- Don't use free_data_too on tensor_create() tensors
- Verify broadcast compatibility before operations
- Ensure consistent data types in operations
Next Steps
Now that you understand tensors, continue to:
- Computation Graphs - Build neural networks with automatic differentiation
- Training Guide - Implement complete training loops
- API Reference - Detailed function specifications
For practical examples, see the tutorials section for complete neural network implementations.
Computation Graphs
Computation graphs are the foundation of automatic differentiation and neural network training in Tofu. This guide explains how to build, manage, and use computation graphs for deep learning applications.
Introduction
A computation graph is a directed acyclic graph (DAG) that represents mathematical operations and their dependencies. Each node in the graph represents either:
- A leaf node: input data or trainable parameter
- An operation node: a mathematical operation (matmul, add, relu, etc.)
Tofu uses dynamic computation graphs (define-by-run), meaning the graph structure is built on-the-fly as operations execute. This provides flexibility for control flow and conditional computation.
Key Concepts
Forward Pass: Computes values by flowing data through the graph from inputs to outputs. Each operation node stores its result in node->value.
Backward Pass: Computes gradients by flowing derivatives backward from the loss to parameters. Uses reverse-mode automatic differentiation (backpropagation). Each node stores its gradient in node->grad.
Requires Gradient: A flag indicating whether a node needs gradient computation. Parameters always require gradients, while inputs do not.
Topological Order: The graph automatically sorts nodes in reverse topological order for efficient backward pass execution.
A Simple Example
// Create graph
tofu_graph *g = tofu_graph_create();
// Add input and parameters
float x_data[] = {1.0f, 2.0f};
float w_data[] = {0.5f, -0.3f};
tofu_tensor *x_tensor = tofu_tensor_create(x_data, 1, (int[]){2}, TOFU_FLOAT);
tofu_tensor *w_tensor = tofu_tensor_create(w_data, 2, (int[]){2, 1}, TOFU_FLOAT);
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, w_tensor);
// Compute: y = x @ W
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
// Backward pass
tofu_graph_backward(g, y);
// Access gradient
tofu_tensor *W_grad = tofu_graph_get_grad(W); // Contains dL/dW
// Cleanup
tofu_graph_free(g);
tofu_tensor_free(x_tensor);
tofu_tensor_free(w_tensor);
When to Use Graphs
Use computation graphs when you need:
- Automatic gradient computation for training
- Complex neural network architectures
- Efficient backpropagation through multiple layers
- Dynamic control flow in models
For simple tensor operations without gradients, you can use the Tensor API directly without graphs.
Graph Fundamentals
Directed Acyclic Graphs (DAGs)
Computation graphs must be acyclic to ensure well-defined forward and backward passes. Each operation creates a new node that depends on its input nodes, forming a directed edge from inputs to outputs.
// This creates a DAG:
// x ---\
// matmul --> y --> relu --> z
// W ---/
// b -----------------------------> add --> out
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
tofu_graph_node *z = tofu_graph_relu(g, y);
tofu_graph_node *out = tofu_graph_add(g, z, b);
The graph automatically tracks dependencies. When you call tofu_graph_backward(g, out), it computes gradients for all parameters by traversing edges backward.
Nodes and Their Roles
Leaf Nodes are the sources of data in the graph:
- TOFU_OP_INPUT: Non-trainable data (features, targets). Does not receive gradients.
- TOFU_OP_PARAM: Trainable parameters (weights, biases). Receives and accumulates gradients.
Operation Nodes perform computations:
- Binary operations: matmul, add, mul
- Activations: relu, softmax, layer_norm
- Shape operations: reshape, transpose
- Reductions: mean, sum
- Loss functions: mse_loss, ce_loss
Each operation node stores:
- value: Result of forward computation
- grad: Gradient from backward pass
- inputs: Pointer to input nodes
- backward_fn: Function to compute gradients for inputs
- backward_ctx: Saved tensors needed for backward (e.g., input values for ReLU)
Forward Pass Execution
The forward pass computes outputs by executing operations in order:
tofu_graph *g = tofu_graph_create();
// Create computation: y = relu(x @ W + b)
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);
// Forward pass happens automatically as operations are added
tofu_graph_node *xW = tofu_graph_matmul(g, x, W); // Computes x @ W immediately
tofu_graph_node *xWb = tofu_graph_add(g, xW, b); // Computes xW + b immediately
tofu_graph_node *y = tofu_graph_relu(g, xWb); // Computes relu(xWb) immediately
// At this point, y->value contains the final result
tofu_tensor *result = tofu_graph_get_value(y);
Each operation executes immediately when called, computing and storing the result in the new node's value field.
Backward Pass Execution
The backward pass computes gradients using the chain rule:
// Continuing from above...
tofu_graph_backward(g, y);
// Now all parameter nodes have gradients:
tofu_tensor *W_grad = tofu_graph_get_grad(W); // dL/dW
tofu_tensor *b_grad = tofu_graph_get_grad(b); // dL/db
The backward pass:
- Initializes loss->grad = 1.0 (derivative of loss w.r.t. itself)
- Sorts nodes in reverse topological order
- For each node (from loss back to inputs):
  - Calls its backward_fn to compute input gradients
  - Accumulates gradients for nodes that appear multiple times
Important: Gradients accumulate across backward passes. Always call tofu_graph_zero_grad() before each training iteration unless you explicitly want gradient accumulation.
Gradient Flow and Requires Grad
The requires_grad flag determines whether a node needs gradient computation:
tofu_graph_node *x = tofu_graph_input(g, x_tensor); // requires_grad = 0
tofu_graph_node *W = tofu_graph_param(g, W_tensor); // requires_grad = 1
tofu_graph_node *y = tofu_graph_matmul(g, x, W); // requires_grad = 1 (because W does)
tofu_graph_node *z = tofu_graph_add(g, y, const_node); // requires_grad = 1 (because y does)
An operation node requires gradients if ANY of its inputs require gradients. This propagates through the graph automatically.
Gradients only flow to nodes with requires_grad = 1:
PARAMnodes always receive gradients (these are your trainable weights)INPUTnodes never receive gradients (they're just data)- Operation nodes receive gradients if they're on a path from a parameter to the loss
Creating and Managing Graphs
Creating a Graph
Use tofu_graph_create() to create a new empty graph:
tofu_graph *g = tofu_graph_create();
// Graph starts empty
// num_nodes = 0
// next_id = 0
// ... build your graph ...
The graph allocates memory dynamically and grows as you add nodes. It manages all nodes internally and frees them when tofu_graph_free() is called.
Freeing a Graph
Use tofu_graph_free() to clean up a graph and all its nodes:
tofu_graph *g = tofu_graph_create();
// Build graph...
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
// Free graph (but NOT tensors!)
tofu_graph_free(g);
// You must still free tensors separately
tofu_tensor_free(x_tensor);
tofu_tensor_free(W_tensor);
Critical Memory Management Rule: The graph does NOT take ownership of tensors passed to tofu_graph_input() or tofu_graph_param(). You must:
- Free the graph first: tofu_graph_free(g)
- Then free tensors: tofu_tensor_free(tensor)
The graph DOES own and free:
- All graph nodes
- All gradients (node->grad)
- All intermediate operation results (e.g., the result of matmul, add, etc.)
Correct Cleanup Order
// Create tensors
tofu_tensor *x_tensor = tofu_tensor_zeros(2, (int[]){1, 4}, TOFU_FLOAT);
tofu_tensor *W_tensor = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
// Build graph
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
// Training loop...
// CORRECT CLEANUP ORDER:
// 1. Free optimizer (if used)
tofu_optimizer_free(opt);
// 2. Free graph
tofu_graph_free(g);
// 3. Free tensors
tofu_tensor_free_data_too(x_tensor);
tofu_tensor_free_data_too(W_tensor);
Clearing Operations
Use tofu_graph_clear_ops() to remove operation nodes while keeping parameters:
tofu_graph *g = tofu_graph_create();
// Add parameters (persist across iterations)
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Build forward graph for this batch
tofu_graph_node *x = tofu_graph_input(g, batch_data);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
tofu_graph_node *out = tofu_graph_add(g, y, b);
// Backward pass and optimization (loss node construction omitted here)...
tofu_graph_zero_grad(g);
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
// Clear operations (W and b are preserved!)
tofu_graph_clear_ops(g);
}
This is essential for training loops to prevent node accumulation and memory growth. After clear_ops():
- All operation nodes are freed
- PARAM nodes remain in the graph
- The graph is ready for the next forward pass
When to Use Clear Ops:
- Between training iterations in a loop
- When reusing the same graph with different input data
- To prevent unbounded memory growth during training
When NOT to Clear Ops:
- If you need to access intermediate operation results after backward
- During a single forward/backward pass
- Before calling the optimizer step (follow the order shown above: backward, then optimizer step, then clear ops)
Adding Leaf Nodes
Leaf nodes are the starting points of computation. They represent either input data or trainable parameters.
Input Nodes
Input nodes represent non-trainable data like features or labels:
float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *x_tensor = tofu_tensor_create(data, 2, (int[]){1, 4}, TOFU_FLOAT);
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
// Input nodes do NOT compute gradients
// x->requires_grad == 0
// x->op == TOFU_OP_INPUT
Characteristics of input nodes:
- requires_grad = 0 (no gradient computation)
- Graph does NOT own the tensor (caller must free it)
- Typically created fresh for each training iteration
Parameter Nodes
Parameter nodes represent trainable weights or biases:
float weights[] = {0.5f, -0.3f, 0.2f, 0.1f};
tofu_tensor *W_tensor = tofu_tensor_create(weights, 2, (int[]){2, 2}, TOFU_FLOAT);
tofu_graph *g = tofu_graph_create();
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
// Parameter nodes DO compute gradients
// W->requires_grad == 1
// W->op == TOFU_OP_PARAM
Characteristics of parameter nodes:
- requires_grad = 1 (gradient computation enabled)
- Graph does NOT own the tensor (caller must free it)
- Typically created once and reused across iterations
- Preserved by tofu_graph_clear_ops()
Ownership Rules
Critical: The graph does NOT take ownership of tensors passed to tofu_graph_input() or tofu_graph_param().
// Create tensor (you own this)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
// Add to graph (graph does NOT take ownership)
tofu_graph_node *W_node = tofu_graph_param(g, W);
// Later: free graph first, then tensor
tofu_graph_free(g); // Frees the node, but NOT the tensor
tofu_tensor_free_data_too(W); // You must free the tensor
Why this design?
- Parameters persist across multiple training iterations
- You may need to save/load parameters independently
- Gives you full control over memory management
Typical Usage Pattern
// Setup phase (once)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);
tofu_graph *g = tofu_graph_create();
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);
// Training loop (many iterations)
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Create fresh input for each iteration
float *batch_data = load_batch(epoch);
tofu_tensor *x_tensor = tofu_tensor_create(batch_data, 2, (int[]){32, 784}, TOFU_FLOAT);
// Add input node
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
// Build forward graph using parameters
tofu_graph_node *logits = tofu_graph_matmul(g, x, W_node);
tofu_graph_node *out = tofu_graph_add(g, logits, b_node);
// Training step...
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
// Clean up this iteration's input
tofu_tensor_free(x_tensor);
free(batch_data);
// Clear operations (W_node and b_node are preserved)
tofu_graph_clear_ops(g);
}
// Cleanup (once at end)
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);
Common Pitfalls
Pitfall 1: Freeing tensors before graph
// WRONG - this will crash!
tofu_tensor_free(W_tensor); // Frees tensor
tofu_graph_free(g); // Graph still references freed memory
// CORRECT
tofu_graph_free(g); // Free graph first
tofu_tensor_free(W_tensor); // Then free tensor
Pitfall 2: Not freeing tensors
// WRONG - memory leak!
tofu_graph_free(g);
// Forgot to free tensors!
// CORRECT
tofu_graph_free(g);
tofu_tensor_free_data_too(W_tensor);
tofu_tensor_free_data_too(b_tensor);
Pitfall 3: Re-adding params inside the training loop
// WRONG - a new param node is created every iteration
for (int i = 0; i < 100; i++) {
tofu_graph_node *W = tofu_graph_param(g, W_tensor); // Re-adding param each time
// ... training ...
tofu_graph_clear_ops(g); // Ops are cleared, but duplicate param nodes accumulate
}
// CORRECT - add params once before loop
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
for (int i = 0; i < 100; i++) {
// ... use W ...
tofu_graph_clear_ops(g); // W is preserved
}
Mathematical Operations
Mathematical operations create new nodes that compute values during the forward pass and propagate gradients during the backward pass.
Matrix Multiplication
Matrix multiplication is the workhorse of neural networks:
// y = x @ W
// x: [batch, in_features]
// W: [in_features, out_features]
// y: [batch, out_features]
tofu_graph_node *x = tofu_graph_input(g, x_tensor); // [32, 784]
tofu_graph_node *W = tofu_graph_param(g, W_tensor); // [784, 10]
tofu_graph_node *y = tofu_graph_matmul(g, x, W); // [32, 10]
The operation:
- Computes standard matrix multiplication with broadcasting
- Supports batched operations (3D, 4D tensors)
- Implements backward pass: dL/dx = dL/dy @ W^T and dL/dW = x^T @ dL/dy
Precondition: Inner dimensions must match: a->dims[last] == b->dims[second-to-last]
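Shape mismatches here are easy to hit, so a small helper that checks this precondition on the underlying tensors before building the node can save debugging time; a sketch using the ndim and dims fields used elsewhere in this guide:
// Sketch: returns 1 if a @ b is well-formed for 2-D (or higher) operands, 0 otherwise
static int matmul_shapes_ok(const tofu_tensor *a, const tofu_tensor *b)
{
    if (a->ndim < 2 || b->ndim < 2) return 0; // 1-D dot products are checked separately
    return a->dims[a->ndim - 1] == b->dims[b->ndim - 2];
}
// Usage: matmul_shapes_ok(x_tensor, W_tensor) before calling tofu_graph_matmul(g, x, W)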
Element-wise Addition
Addition is commonly used for adding biases:
// out = x + b
// x: [batch, features]
// b: [features]
// out: [batch, features]
tofu_graph_node *x = tofu_graph_matmul(g, input, W); // [32, 10]
tofu_graph_node *b = tofu_graph_param(g, b_tensor); // [10]
tofu_graph_node *out = tofu_graph_add(g, x, b); // [32, 10]
The operation:
- Performs element-wise addition with broadcasting
- Follows NumPy broadcasting rules
- Implements backward pass: gradients flow to both inputs
Broadcasting example:
// Broadcasting [2, 3] + [3] -> [2, 3]
float a_data[] = {1, 2, 3, 4, 5, 6};
float b_data[] = {10, 20, 30};
tofu_tensor *a = tofu_tensor_create(a_data, 2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_create(b_data, 1, (int[]){3}, TOFU_FLOAT);
tofu_graph_node *a_node = tofu_graph_input(g, a);
tofu_graph_node *b_node = tofu_graph_input(g, b);
tofu_graph_node *c = tofu_graph_add(g, a_node, b_node);
// Result: [[11, 22, 33], [14, 25, 36]]
Element-wise Multiplication
Multiplication is useful for attention mechanisms and gating:
// out = x * y
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *y = tofu_graph_input(g, y_tensor);
tofu_graph_node *out = tofu_graph_mul(g, x, y);
The operation:
- Performs element-wise multiplication with broadcasting
- Implements backward pass: dL/dx = dL/dout * y and dL/dy = dL/dout * x
Example: Attention scaling
// Attention: scale * (Q @ K^T)
tofu_graph_node *qk = tofu_graph_matmul(g, Q, K_T);
tofu_graph_node *scale_tensor = tofu_graph_param(g, scale);
tofu_graph_node *scaled = tofu_graph_mul(g, qk, scale_tensor);
Chaining Operations
Operations can be chained to build complex expressions:
// Build: y = ReLU(x @ W + b)
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, W_tensor);
tofu_graph_node *b = tofu_graph_param(g, b_tensor);
// Chain operations
tofu_graph_node *xW = tofu_graph_matmul(g, x, W); // Linear transformation
tofu_graph_node *xWb = tofu_graph_add(g, xW, b); // Add bias
tofu_graph_node *y = tofu_graph_relu(g, xWb); // Apply activation
// Each intermediate result is stored in the node's value
tofu_tensor *xW_value = tofu_graph_get_value(xW); // Can inspect intermediates
Multi-Layer Networks
Chaining creates deeper networks:
// Two-layer network: h = ReLU(x @ W1 + b1), out = h @ W2 + b2
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
// Layer 1: [batch, 784] -> [batch, 128]
tofu_graph_node *W1 = tofu_graph_param(g, W1_tensor);
tofu_graph_node *b1 = tofu_graph_param(g, b1_tensor);
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
tofu_graph_node *h1b = tofu_graph_add(g, h1, b1);
tofu_graph_node *h1a = tofu_graph_relu(g, h1b);
// Layer 2: [batch, 128] -> [batch, 10]
tofu_graph_node *W2 = tofu_graph_param(g, W2_tensor);
tofu_graph_node *b2 = tofu_graph_param(g, b2_tensor);
tofu_graph_node *h2 = tofu_graph_matmul(g, h1a, W2);
tofu_graph_node *out = tofu_graph_add(g, h2, b2);
The backward pass automatically computes gradients for all parameters (W1, b1, W2, b2) using the chain rule.
Operation Results
Every operation stores its result immediately:
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
// Result is available immediately
tofu_tensor *result = tofu_graph_get_value(y);
tofu_tensor_print(result, "%.2f");
// The tensor is owned by the node - don't free it!
// It will be freed when you call tofu_graph_free(g)
Activation Functions
Activation functions introduce non-linearity, enabling neural networks to learn complex patterns.
ReLU (Rectified Linear Unit)
ReLU is the most common activation function:
// y = max(0, x)
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *y = tofu_graph_relu(g, x);
// Values: [-2, -1, 0, 1, 2] -> [0, 0, 0, 1, 2]
Properties:
- Simple and efficient: y = (x > 0) ? x : 0
- Gradient: 1 where x > 0, else 0
- Helps avoid vanishing gradients in deep networks
- Creates sparse activations (many zeros)
Usage pattern in networks:
// Hidden layer with ReLU
tofu_graph_node *h = tofu_graph_matmul(g, x, W);
tofu_graph_node *h_bias = tofu_graph_add(g, h, b);
tofu_graph_node *h_act = tofu_graph_relu(g, h_bias); // Apply ReLU after bias
Softmax
Softmax converts logits to probabilities for classification:
// Apply softmax along axis 1 (last dimension)
// Input: [[1, 2, 3], [4, 5, 6]] (logits)
// Output: [[0.09, 0.24, 0.67], [0.09, 0.24, 0.67]] (probabilities)
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
Properties:
- Outputs sum to 1.0 along the specified axis
- Numerically stable (subtracts max before exp)
- Used in the final layer for classification
- Axis parameter specifies normalization dimension
Formula: softmax(x_i) = exp(x_i - max(x)) / sum(exp(x_j - max(x)))
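To see the formula in isolation, here is a stand-alone C sketch of the numerically stable computation on a plain float array (independent of the graph API):
#include <math.h>
// Stable softmax over n values: subtract the max before exponentiating
void softmax_1d(const float *x, float *out, int n)
{
    float m = x[0];
    for (int i = 1; i < n; i++) if (x[i] > m) m = x[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) { out[i] = expf(x[i] - m); sum += out[i]; }
    for (int i = 0; i < n; i++) out[i] /= sum;
}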
Multi-class classification example:
// 10-class classifier
tofu_graph_node *x = tofu_graph_input(g, x_tensor); // [batch, features]
tofu_graph_node *W = tofu_graph_param(g, W_tensor); // [features, 10]
tofu_graph_node *logits = tofu_graph_matmul(g, x, W); // [batch, 10]
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1); // [batch, 10]
// probs now contains class probabilities for each sample
Layer Normalization
Layer normalization stabilizes training in deep networks:
// Normalize along axis 1
// out = gamma * (x - mean) / sqrt(var + eps) + beta
tofu_graph_node *x = tofu_graph_input(g, x_tensor); // [batch, features]
tofu_graph_node *gamma = tofu_graph_param(g, gamma_tensor); // [features]
tofu_graph_node *beta = tofu_graph_param(g, beta_tensor); // [features]
tofu_graph_node *normalized = tofu_graph_layer_norm(g, x, gamma, beta, 1, 1e-5);
Parameters:
- x: Input tensor
- gamma: Scale parameter (can be NULL for no scaling)
- beta: Shift parameter (can be NULL for no shift)
- axis: Normalization axis (typically last dimension)
- eps: Small constant for numerical stability (typically 1e-5)
Properties:
- Normalizes activations to zero mean and unit variance
- Helps stabilize training and enables higher learning rates
- Common in transformers and deep networks
Typical usage in transformers:
// Layer norm after self-attention
tofu_graph_node *attn_out = tofu_graph_matmul(g, attn_weights, V);
tofu_graph_node *normed = tofu_graph_layer_norm(g, attn_out, gamma, beta, 1, 1e-5);
Combining Activations
Different activations serve different purposes:
// Multi-layer network with different activations
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
// Hidden layer 1: ReLU for non-linearity
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
tofu_graph_node *h1_bias = tofu_graph_add(g, h1, b1);
tofu_graph_node *h1_act = tofu_graph_relu(g, h1_bias);
// Hidden layer 2: ReLU + Layer Norm
tofu_graph_node *h2 = tofu_graph_matmul(g, h1_act, W2);
tofu_graph_node *h2_bias = tofu_graph_add(g, h2, b2);
tofu_graph_node *h2_act = tofu_graph_relu(g, h2_bias);
tofu_graph_node *h2_norm = tofu_graph_layer_norm(g, h2_act, gamma, beta, 1, 1e-5);
// Output layer: Softmax for classification
tofu_graph_node *logits = tofu_graph_matmul(g, h2_norm, W_out);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
Shape Operations
Shape operations manipulate tensor dimensions without changing the underlying data.
Reshape
Reshape changes tensor dimensions while preserving total elements:
// Flatten: [batch, height, width, channels] -> [batch, height * width * channels]
int batch = 32;
int h = 28, w = 28, c = 1;
int flat_dim = h * w * c; // 784
tofu_graph_node *img = tofu_graph_input(g, img_tensor); // [32, 28, 28, 1]
tofu_graph_node *flat = tofu_graph_reshape(g, img, 2, (int[]){batch, flat_dim}); // [32, 784]
Properties:
- View operation (no data copy)
- Total number of elements must remain constant
- Useful for transitioning between convolutional and fully-connected layers
Common patterns:
// Flatten for fully-connected layer
tofu_graph_node *flat = tofu_graph_reshape(g, x, 2, (int[]){batch, -1});
// Unflatten for visualization
tofu_graph_node *img = tofu_graph_reshape(g, flat, 4, (int[]){batch, 28, 28, 1});
// Prepare patches for Vision Transformer
tofu_graph_node *patches = tofu_graph_reshape(g, img, 3, (int[]){batch, num_patches, patch_dim});
Transpose
Transpose permutes tensor dimensions:
// Transpose matrix: [m, n] -> [n, m]
tofu_graph_node *W = tofu_graph_param(g, W_tensor); // [784, 10]
tofu_graph_node *W_T = tofu_graph_transpose(g, W, NULL); // [10, 784]
// NULL means reverse dimension order
With explicit axis permutation:
// Permute: [batch, seq, features] -> [batch, features, seq]
int axes[] = {0, 2, 1};
tofu_graph_node *x = tofu_graph_input(g, x_tensor); // [32, 100, 64]
tofu_graph_node *x_T = tofu_graph_transpose(g, x, axes); // [32, 64, 100]
Common usage in attention:
// Attention: Q @ K^T
tofu_graph_node *Q = tofu_graph_matmul(g, x, W_q); // [batch, seq, dim]
tofu_graph_node *K = tofu_graph_matmul(g, x, W_k); // [batch, seq, dim]
tofu_graph_node *K_T = tofu_graph_transpose(g, K, NULL); // [batch, dim, seq]
tofu_graph_node *scores = tofu_graph_matmul(g, Q, K_T); // [batch, seq, seq]
Mean Reduction
Compute mean along specified axes (coming soon - API being finalized).
Sum Reduction
Compute sum along specified axes (coming soon - API being finalized).
Combining Shape Operations
Shape operations often work together:
// Vision Transformer patch embedding
// Input: [batch, height, width, channels]
// Output: [batch, num_patches, embed_dim]
tofu_graph_node *img = tofu_graph_input(g, img_tensor); // [32, 224, 224, 3]
// Step 1: Reshape to patches
int patch_size = 16;
int num_patches = (224 / patch_size) * (224 / patch_size); // 196
int patch_dim = patch_size * patch_size * 3; // 768
tofu_graph_node *patches = tofu_graph_reshape(g, img, 3,
(int[]){32, num_patches, patch_dim}); // [32, 196, 768]
// Step 2: Project patches to embedding dimension
tofu_graph_node *W_proj = tofu_graph_param(g, W_proj_tensor); // [768, 512]
tofu_graph_node *embeddings = tofu_graph_matmul(g, patches, W_proj); // [32, 196, 512]
Loss Functions
Loss functions quantify how well your model performs by comparing predictions against ground truth. Tofu provides two essential loss functions optimized for different tasks.
Mean Squared Error (MSE)
MSE measures the average squared difference between predictions and targets. Use it for regression tasks where you predict continuous values.
tofu_graph_node* tofu_graph_mse_loss(tofu_graph* g,
tofu_graph_node* pred,
tofu_graph_node* target);
Mathematical definition:
MSE = mean((pred - target)²)
When to use MSE:
- Regression problems (predicting house prices, temperatures, etc.)
- When output values are continuous and unbounded
- When you want to penalize larger errors more heavily (squared term)
Example: Linear regression
// Predict continuous values
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* b = tofu_graph_param(g, bias);
tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
tofu_graph_node* target = tofu_graph_input(g, target_tensor);
// MSE loss for regression
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
tofu_graph_backward(g, loss);
Key properties:
- Output is a scalar (single value)
- Gradients scale linearly with error magnitude
- Sensitive to outliers due to squaring
- Always non-negative
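To connect the definition above to the tensor API from the previous chapter, here is a hand-rolled MSE over two small 1-D tensors (a sketch for intuition only; use the graph-level mse_loss for training, since it also provides the backward pass):
tofu_tensor *pred = tofu_tensor_arange(0.0, 4.0, 1.0, TOFU_FLOAT);   // [0, 1, 2, 3]
tofu_tensor *target = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT); // [1, 2, 3, 4]
tofu_tensor *diff = tofu_tensor_elew(pred, target, NULL, TOFU_SUB);  // [-1, -1, -1, -1]
tofu_tensor *sq = tofu_tensor_elew(diff, diff, NULL, TOFU_MUL);      // [1, 1, 1, 1]
tofu_tensor *mse = tofu_tensor_meanreduce(sq, NULL, 0);              // [1.0]
tofu_tensor_free_data_too(mse);
tofu_tensor_free_data_too(sq);
tofu_tensor_free_data_too(diff);
tofu_tensor_free_data_too(target);
tofu_tensor_free_data_too(pred);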
Cross-Entropy Loss
Cross-entropy measures the difference between predicted and true probability distributions. Use it for classification tasks.
tofu_graph_node* tofu_graph_ce_loss(tofu_graph* g,
tofu_graph_node* pred,
tofu_graph_node* target);
Mathematical definition:
CE = -sum(target * log(pred))
When to use cross-entropy:
- Classification problems (image recognition, sentiment analysis)
- When outputs represent class probabilities
- Multi-class or binary classification tasks
Example: Multi-class classification
// Predict class probabilities
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* b = tofu_graph_param(g, bias);
// Forward pass: logits -> softmax -> probabilities
tofu_graph_node* logits = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1); // softmax over the class axis (axis 1)
// Target should be one-hot encoded: [0, 1, 0, 0] for class 1
tofu_graph_node* target = tofu_graph_input(g, target_one_hot);
// Cross-entropy loss
tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, target);
tofu_graph_backward(g, loss);
Target format: Targets must be one-hot encoded vectors:
// For batch_size=2, num_classes=4
// If sample 0 is class 2 and sample 1 is class 0:
float target_data[] = {
0.0f, 0.0f, 1.0f, 0.0f, // Sample 0: class 2
1.0f, 0.0f, 0.0f, 0.0f // Sample 1: class 0
};
tofu_tensor* target = tofu_tensor_create(target_data, 2,
(int[]){2, 4}, TOFU_FLOAT);
Key properties:
- Numerically stable implementation (avoids log(0))
- Works well with softmax activation
- Penalizes confident wrong predictions heavily
- Output is a scalar (averaged over batch)
Loss Function Comparison
| Property | MSE Loss | Cross-Entropy Loss |
|---|---|---|
| Use case | Regression | Classification |
| Output type | Continuous values | Probabilities (0-1) |
| Activation | Linear or ReLU | Softmax |
| Gradient behavior | Linear with error | Exponential confidence penalty |
| Outlier sensitivity | High (squared) | Moderate (logarithmic) |
Forward Pass
The forward pass computes outputs by propagating data through your graph from inputs to loss. Results are automatically stored in each node.
Accessing Results with get_value
After building your graph, each node contains its computed result in the value field. Access it using:
tofu_tensor* tofu_graph_get_value(tofu_graph_node* node);
Important: The returned tensor is owned by the node. Never free it yourself.
Example: Inspecting intermediate activations
// Build network
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_node* W1 = tofu_graph_param(g, weights1);
tofu_graph_node* h = tofu_graph_relu(g, tofu_graph_matmul(g, x, W1));
// Access hidden layer activations
tofu_tensor* hidden_values = tofu_graph_get_value(h);
printf("Hidden layer statistics:\n");
tofu_tensor_print(hidden_values, "%.4f");
Typical Forward Pass Pattern
// 1. Create inputs
tofu_graph_node* x = tofu_graph_input(g, input_data);
tofu_graph_node* target = tofu_graph_input(g, target_data);
// 2. Add parameters
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* b = tofu_graph_param(g, bias);
// 3. Build computation graph
tofu_graph_node* logits = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1);
// 4. Compute loss
tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, target);
// 5. Access results
tofu_tensor* loss_value = tofu_graph_get_value(loss);
tofu_tensor* predictions = tofu_graph_get_value(probs);
Reading Loss Values
Loss is typically a scalar tensor (single element):
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
// Extract scalar value
float loss_value;
TOFU_TENSOR_DATA_FROM(loss_tensor, 0, loss_value, TOFU_FLOAT);
printf("Training loss: %.6f\n", loss_value);
Making Predictions
For inference, forward pass without computing loss:
// Inference mode (no target needed)
tofu_graph_node* x = tofu_graph_input(g, test_input);
tofu_graph_node* W = tofu_graph_param(g, trained_weights);
tofu_graph_node* b = tofu_graph_param(g, trained_bias);
tofu_graph_node* pred = tofu_graph_softmax(g,
tofu_graph_add(g, tofu_graph_matmul(g, x, W), b), 1);
// Get predictions
tofu_tensor* predictions = tofu_graph_get_value(pred);
// Find class with highest probability (assumes a single sample, i.e. batch size 1)
int pred_class = 0;
float max_prob = -1.0f;
for (int i = 0; i < num_classes; i++) {
float prob;
TOFU_TENSOR_DATA_FROM(predictions, i, prob, TOFU_FLOAT);
if (prob > max_prob) {
max_prob = prob;
pred_class = i;
}
}
Backward Pass
The backward pass computes gradients using reverse-mode automatic differentiation (backpropagation). This enables training via gradient descent.
Understanding Backpropagation
When you call tofu_graph_backward(), the graph computes how changes to each parameter affect the loss using the chain rule:
∂loss/∂W = ∂loss/∂output × ∂output/∂W
The algorithm:
- Starts from the loss node (scalar)
- Propagates gradients backward through operations
- Accumulates gradients at parameter nodes
- Stores results in node->grad
Calling Backward
void tofu_graph_backward(tofu_graph* g, tofu_graph_node* loss);
Requirements:
- loss must be a scalar (single element tensor)
- Gradients accumulate with each call
Example: Training iteration
// Forward pass
tofu_graph_node* x = tofu_graph_input(g, batch_data);
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* pred = tofu_graph_matmul(g, x, W);
tofu_graph_node* target = tofu_graph_input(g, batch_targets);
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
// Backward pass - computes all gradients
tofu_graph_backward(g, loss);
// Now W->grad contains ∂loss/∂W
Accessing Gradients with get_grad
After backward pass, retrieve gradients from parameter nodes:
tofu_tensor* tofu_graph_get_grad(tofu_graph_node* node);
Returns: Pointer to gradient tensor, or NULL if backward hasn't been called yet.
Important: The returned tensor is owned by the node. Never free it yourself.
Example: Manual parameter update
tofu_graph_backward(g, loss);
// Get gradient
tofu_tensor* W_grad = tofu_graph_get_grad(W);
tofu_tensor* W_value = tofu_graph_get_value(W);
// Manual SGD update: W = W - learning_rate * grad
float lr = 0.01f;
for (int i = 0; i < W_value->len; i++) {
float w, grad;
TOFU_TENSOR_DATA_FROM(W_value, i, w, TOFU_FLOAT);
TOFU_TENSOR_DATA_FROM(W_grad, i, grad, TOFU_FLOAT);
float updated = w - lr * grad;
TOFU_TENSOR_DATA_TO(W_value, i, updated, TOFU_FLOAT);
}
Zeroing Gradients with zero_grad
Gradients accumulate by default. Always zero them before each training iteration:
void tofu_graph_zero_grad(tofu_graph* g);
Why this matters:
// WRONG: Gradients accumulate forever
for (int epoch = 0; epoch < 100; epoch++) {
tofu_graph_backward(g, loss); // Adds to existing gradients!
tofu_optimizer_step(opt);
}
// CORRECT: Clear gradients each iteration
for (int epoch = 0; epoch < 100; epoch++) {
tofu_graph_zero_grad(g); // Start fresh
tofu_graph_backward(g, loss); // Compute gradients
tofu_optimizer_step(opt); // Update parameters
}
Gradient Accumulation (Advanced)
Sometimes you intentionally want gradients to accumulate across multiple batches:
// Simulate larger batch by accumulating gradients
int accumulation_steps = 4;
for (int step = 0; step < accumulation_steps; step++) {
// Forward pass on mini-batch
tofu_graph_node* loss = compute_loss(g, mini_batches[step]);
// Accumulate gradients (don't zero between mini-batches)
tofu_graph_backward(g, loss);
if (step < accumulation_steps - 1) {
tofu_graph_clear_ops(g); // Clear graph but keep gradients
}
}
// Update once with accumulated gradients
tofu_optimizer_step(opt);
// Now zero for next iteration
tofu_graph_zero_grad(g);
Complete Backward Pass Example
// Training loop with proper gradient handling
for (int epoch = 0; epoch < num_epochs; epoch++) {
// 1. Zero gradients from previous iteration
tofu_graph_zero_grad(g);
// 2. Forward pass
tofu_graph_node* x = tofu_graph_input(g, train_data);
tofu_graph_node* W = tofu_graph_param(g, weights);
tofu_graph_node* pred = tofu_graph_matmul(g, x, W);
tofu_graph_node* target = tofu_graph_input(g, train_targets);
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
// 3. Backward pass
tofu_graph_backward(g, loss);
// 4. Check gradients (debugging)
tofu_tensor* W_grad = tofu_graph_get_grad(W);
if (W_grad) {
float grad_norm = 0.0f;
for (int i = 0; i < W_grad->len; i++) {
float g;
TOFU_TENSOR_DATA_FROM(W_grad, i, g, TOFU_FLOAT);
grad_norm += g * g;
}
printf("Gradient norm: %.6f\n", sqrtf(grad_norm));
}
// 5. Update parameters (use optimizer in practice)
tofu_optimizer_step(optimizer);
// 6. Clear operations for next iteration
tofu_graph_clear_ops(g);
}
Debugging Gradients
Common issues and solutions:
Vanishing gradients (gradients near zero):
tofu_tensor* grad = tofu_graph_get_grad(W);
float max_grad = 0.0f;
for (int i = 0; i < grad->len; i++) {
float g;
TOFU_TENSOR_DATA_FROM(grad, i, g, TOFU_FLOAT);
if (fabsf(g) > max_grad) max_grad = fabsf(g);
}
if (max_grad < 1e-7f) {
printf("WARNING: Vanishing gradients detected\n");
}
Exploding gradients (gradients too large):
if (max_grad > 100.0f) {
printf("WARNING: Exploding gradients detected\n");
// Consider gradient clipping or reducing learning rate
}
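If you do opt for clipping, here is a minimal in-place sketch over the gradient tensor returned by get_grad() (the 5.0 threshold is illustrative, not a library default):
float clip = 5.0f;
tofu_tensor* W_grad = tofu_graph_get_grad(W);
for (int i = 0; i < W_grad->len; i++) {
    float gv;
    TOFU_TENSOR_DATA_FROM(W_grad, i, gv, TOFU_FLOAT); // Read gradient element
    if (gv > clip) gv = clip;
    if (gv < -clip) gv = -clip;
    TOFU_TENSOR_DATA_TO(W_grad, i, gv, TOFU_FLOAT);   // Write clipped value back
}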
Memory and Ownership
Understanding memory management is critical for correct and leak-free code.
Ownership Rules (CRITICAL)
Rule 1: Graph does NOT own input/parameter tensors
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);
// You still own W! Must free it after graph_free
tofu_graph_free(g);
tofu_tensor_free_data_too(W); // Your responsibility
Rule 2: Graph OWNS intermediate operation results
tofu_graph_node* result = tofu_graph_matmul(g, x, W);
// result->value is owned by the graph
// Don't free it - tofu_graph_free() handles it
tofu_graph_free(g); // Frees result->value automatically
Rule 3: Graph OWNS all nodes
tofu_graph_node* node = tofu_graph_relu(g, x);
// Don't free node - graph owns it
tofu_graph_free(g); // Frees all nodes
Rule 4: Never free tensors returned by get_value/get_grad
tofu_tensor* value = tofu_graph_get_value(node); // Node owns this
tofu_tensor* grad = tofu_graph_get_grad(node); // Node owns this
// WRONG: tofu_tensor_free(value); // CRASH!
// CORRECT: Just use the tensor, don't free it
Complete Cleanup Pattern
// 1. Allocate parameter tensors
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){64, 32}, TOFU_FLOAT);
tofu_tensor* b = tofu_tensor_zeros(1, (int[]){32}, TOFU_FLOAT);
// 2. Create graph
tofu_graph* g = tofu_graph_create();
// 3. Add parameters to graph
tofu_graph_node* W_node = tofu_graph_param(g, W);
tofu_graph_node* b_node = tofu_graph_param(g, b);
// 4. Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Allocate batch data (you manage this)
float* batch_data = load_batch(epoch);
tofu_tensor* x_tensor = tofu_tensor_create(batch_data, 2,
(int[]){32, 64}, TOFU_FLOAT);
// Build graph (operations owned by graph)
tofu_graph_node* x = tofu_graph_input(g, x_tensor);
tofu_graph_node* out = tofu_graph_add(g,
tofu_graph_matmul(g, x, W_node), b_node);
// ... training ...
// Free batch resources (you own these)
tofu_tensor_free(x_tensor);
free(batch_data);
// Clear operations but keep parameters
tofu_graph_clear_ops(g);
}
// 5. Cleanup (CRITICAL ORDER!)
tofu_graph_free(g); // Graph owns: nodes, ops, gradients
tofu_tensor_free_data_too(W); // You own: parameter tensors
tofu_tensor_free_data_too(b);
Memory Management with Optimizers
// Create optimizer (holds references to graph and parameters)
tofu_optimizer* opt = tofu_optimizer_adam_create(g, 0.001, 0.9, 0.999, 1e-8);
// Training...
// Cleanup order matters!
tofu_optimizer_free(opt); // 1. Free optimizer first
tofu_graph_free(g); // 2. Then graph
tofu_tensor_free_data_too(W); // 3. Then parameter tensors
tofu_tensor_free_data_too(b);
Common Memory Pitfalls
Pitfall 1: Freeing parameter tensors too early
// WRONG
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);
tofu_tensor_free_data_too(W); // DON'T DO THIS! Graph needs it
tofu_graph_backward(g, loss); // CRASH: W is freed but graph uses it
Pitfall 2: Freeing operation results
// WRONG
tofu_graph_node* result = tofu_graph_matmul(g, x, W);
tofu_tensor* result_value = tofu_graph_get_value(result);
tofu_tensor_free(result_value); // CRASH! Graph owns this
Pitfall 3: Forgetting to free parameter tensors
// Memory leak
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);
tofu_graph_free(g); // Graph freed but W still allocated!
// Missing: tofu_tensor_free_data_too(W);
Pitfall 4: Double-free via clear_ops
// WRONG
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
tofu_graph_clear_ops(g); // Removes x node
tofu_tensor_free(input_tensor); // OK
tofu_tensor_free(input_tensor); // CRASH: Double free!
Batch Processing Memory Pattern
Efficient pattern for processing multiple batches:
tofu_graph* g = tofu_graph_create();
// Parameters persist across batches
tofu_tensor* W = tofu_tensor_zeros(2, (int[]){64, 32}, TOFU_FLOAT);
tofu_graph_node* W_node = tofu_graph_param(g, W);
for (int batch = 0; batch < num_batches; batch++) {
// Allocate batch-specific data
float* batch_data = malloc(batch_size * 64 * sizeof(float));
load_batch_data(batch_data, batch);
tofu_tensor* x_tensor = tofu_tensor_create(batch_data, 2,
(int[]){batch_size, 64}, TOFU_FLOAT);
// Build graph for this batch
tofu_graph_node* x = tofu_graph_input(g, x_tensor);
tofu_graph_node* out = tofu_graph_matmul(g, x, W_node);
// ... compute loss, backward, optimize ...
// Free batch resources
tofu_tensor_free(x_tensor); // Free tensor wrapper
free(batch_data); // Free data buffer
// Clear operations (keeps W_node!)
tofu_graph_clear_ops(g);
}
// Final cleanup
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
Building Complex Networks
Move beyond single layers to build sophisticated architectures.
Multi-Layer Perceptron (MLP)
A complete MLP with multiple hidden layers:
typedef struct {
tofu_tensor *W1, *b1; // Input -> Hidden1
tofu_tensor *W2, *b2; // Hidden1 -> Hidden2
tofu_tensor *W3, *b3; // Hidden2 -> Output
} MLP;
// Initialize weights with Xavier/He initialization
MLP* mlp_create(int input_dim, int hidden1, int hidden2, int output_dim) {
MLP* mlp = malloc(sizeof(MLP));
// Layer 1: input_dim -> hidden1
mlp->W1 = tofu_tensor_zeros(2, (int[]){input_dim, hidden1}, TOFU_FLOAT);
mlp->b1 = tofu_tensor_zeros(1, (int[]){hidden1}, TOFU_FLOAT);
// Initialize W1 with Xavier: uniform(-sqrt(6/n), sqrt(6/n))
float limit1 = sqrtf(6.0f / input_dim);
for (int i = 0; i < mlp->W1->len; i++) {
float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit1;
TOFU_TENSOR_DATA_TO(mlp->W1, i, val, TOFU_FLOAT);
}
// Layer 2: hidden1 -> hidden2
mlp->W2 = tofu_tensor_zeros(2, (int[]){hidden1, hidden2}, TOFU_FLOAT);
mlp->b2 = tofu_tensor_zeros(1, (int[]){hidden2}, TOFU_FLOAT);
float limit2 = sqrtf(6.0f / hidden1);
for (int i = 0; i < mlp->W2->len; i++) {
float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit2;
TOFU_TENSOR_DATA_TO(mlp->W2, i, val, TOFU_FLOAT);
}
// Layer 3: hidden2 -> output_dim
mlp->W3 = tofu_tensor_zeros(2, (int[]){hidden2, output_dim}, TOFU_FLOAT);
mlp->b3 = tofu_tensor_zeros(1, (int[]){output_dim}, TOFU_FLOAT);
float limit3 = sqrtf(6.0f / hidden2);
for (int i = 0; i < mlp->W3->len; i++) {
float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit3;
TOFU_TENSOR_DATA_TO(mlp->W3, i, val, TOFU_FLOAT);
}
return mlp;
}
// Forward pass
tofu_graph_node* mlp_forward(tofu_graph* g, tofu_graph_node* x, MLP* mlp) {
// Add parameters to graph
tofu_graph_node* W1 = tofu_graph_param(g, mlp->W1);
tofu_graph_node* b1 = tofu_graph_param(g, mlp->b1);
tofu_graph_node* W2 = tofu_graph_param(g, mlp->W2);
tofu_graph_node* b2 = tofu_graph_param(g, mlp->b2);
tofu_graph_node* W3 = tofu_graph_param(g, mlp->W3);
tofu_graph_node* b3 = tofu_graph_param(g, mlp->b3);
// Layer 1: x @ W1 + b1 -> ReLU
tofu_graph_node* h1 = tofu_graph_matmul(g, x, W1);
h1 = tofu_graph_add(g, h1, b1);
h1 = tofu_graph_relu(g, h1);
// Layer 2: h1 @ W2 + b2 -> ReLU
tofu_graph_node* h2 = tofu_graph_matmul(g, h1, W2);
h2 = tofu_graph_add(g, h2, b2);
h2 = tofu_graph_relu(g, h2);
// Layer 3: h2 @ W3 + b3 (logits)
tofu_graph_node* out = tofu_graph_matmul(g, h2, W3);
out = tofu_graph_add(g, out, b3);
return out;
}
// Cleanup
void mlp_free(MLP* mlp) {
tofu_tensor_free_data_too(mlp->W1);
tofu_tensor_free_data_too(mlp->b1);
tofu_tensor_free_data_too(mlp->W2);
tofu_tensor_free_data_too(mlp->b2);
tofu_tensor_free_data_too(mlp->W3);
tofu_tensor_free_data_too(mlp->b3);
free(mlp);
}
// Usage
MLP* model = mlp_create(784, 256, 128, 10); // MNIST-style
tofu_graph* g = tofu_graph_create();
for (int epoch = 0; epoch < 100; epoch++) {
tofu_graph_zero_grad(g);
tofu_graph_node* x = tofu_graph_input(g, batch_data);
tofu_graph_node* logits = mlp_forward(g, x, model);
tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node* target = tofu_graph_input(g, batch_targets);
tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, target);
tofu_graph_backward(g, loss);
tofu_optimizer_step(optimizer);
tofu_graph_clear_ops(g);
}
mlp_free(model);
tofu_graph_free(g);
Residual Connections
Residual connections (skip connections) help training deep networks:
// Residual block: output = ReLU(x + F(x))
tofu_graph_node* residual_block(tofu_graph* g, tofu_graph_node* x,
tofu_tensor* W1, tofu_tensor* b1,
tofu_tensor* W2, tofu_tensor* b2) {
// Main path F(x)
tofu_graph_node* W1_node = tofu_graph_param(g, W1);
tofu_graph_node* b1_node = tofu_graph_param(g, b1);
tofu_graph_node* W2_node = tofu_graph_param(g, W2);
tofu_graph_node* b2_node = tofu_graph_param(g, b2);
// F(x) = W2 @ ReLU(W1 @ x + b1) + b2
tofu_graph_node* h1 = tofu_graph_matmul(g, x, W1_node);
h1 = tofu_graph_add(g, h1, b1_node);
h1 = tofu_graph_relu(g, h1);
tofu_graph_node* h2 = tofu_graph_matmul(g, h1, W2_node);
h2 = tofu_graph_add(g, h2, b2_node);
// Skip connection: x + F(x)
tofu_graph_node* residual = tofu_graph_add(g, x, h2);
// Final activation
return tofu_graph_relu(g, residual);
}
// Stack multiple residual blocks
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
x = residual_block(g, x, W1a, b1a, W1b, b1b);
x = residual_block(g, x, W2a, b2a, W2b, b2b);
x = residual_block(g, x, W3a, b3a, W3b, b3b);
tofu_graph_node* out = tofu_graph_matmul(g, x, W_out);
Custom Layer Abstractions
Encapsulate common patterns:
// Linear layer: y = x @ W + b
typedef struct {
tofu_tensor* W;
tofu_tensor* b;
} LinearLayer;
LinearLayer* linear_create(int in_features, int out_features) {
LinearLayer* layer = malloc(sizeof(LinearLayer));
layer->W = tofu_tensor_zeros(2, (int[]){in_features, out_features}, TOFU_FLOAT);
layer->b = tofu_tensor_zeros(1, (int[]){out_features}, TOFU_FLOAT);
// Initialize weights
float limit = sqrtf(6.0f / in_features);
for (int i = 0; i < layer->W->len; i++) {
float val = (2.0f * rand() / RAND_MAX - 1.0f) * limit;
TOFU_TENSOR_DATA_TO(layer->W, i, val, TOFU_FLOAT);
}
return layer;
}
tofu_graph_node* linear_forward(tofu_graph* g, tofu_graph_node* x, LinearLayer* layer) {
tofu_graph_node* W = tofu_graph_param(g, layer->W);
tofu_graph_node* b = tofu_graph_param(g, layer->b);
tofu_graph_node* out = tofu_graph_matmul(g, x, W);
return tofu_graph_add(g, out, b);
}
void linear_free(LinearLayer* layer) {
tofu_tensor_free_data_too(layer->W);
tofu_tensor_free_data_too(layer->b);
free(layer);
}
// Build network with layer abstractions
LinearLayer* fc1 = linear_create(784, 256);
LinearLayer* fc2 = linear_create(256, 10);
tofu_graph_node* x = tofu_graph_input(g, input);
x = linear_forward(g, x, fc1);
x = tofu_graph_relu(g, x);
x = linear_forward(g, x, fc2);
tofu_graph_node* probs = tofu_graph_softmax(g, x, 1);
Transformer-Style Attention (Simplified)
Basic attention mechanism pattern:
// Simplified attention: softmax(Q @ K^T) @ V
tofu_graph_node* attention(tofu_graph* g,
tofu_graph_node* Q, // Query
tofu_graph_node* K, // Key
tofu_graph_node* V) // Value
{
// 1. Compute attention scores: Q @ K^T
tofu_graph_node* K_T = tofu_graph_transpose(g, K, NULL);
tofu_graph_node* scores = tofu_graph_matmul(g, Q, K_T);
// 2. Softmax over last dimension
tofu_graph_node* attn_weights = tofu_graph_softmax(g, scores, -1);
// 3. Apply attention: attn_weights @ V
tofu_graph_node* output = tofu_graph_matmul(g, attn_weights, V);
return output;
}
13. Best Practices
Guidelines for robust and maintainable graph-based code.
Graph Design Principles
1. Separate model structure from training logic
// Good: Model is a reusable structure
typedef struct {
tofu_tensor *W1, *b1, *W2, *b2;
} Model;
tofu_graph_node* model_forward(tofu_graph* g, tofu_graph_node* x, Model* m);
// Train the model
void train(Model* model, Dataset* data) {
tofu_graph* g = tofu_graph_create();
// Training loop uses model_forward()
tofu_graph_free(g);
}
2. Use clear_ops between iterations
// Efficient: Reuse graph structure
for (int batch = 0; batch < num_batches; batch++) {
// Build graph for this batch
tofu_graph_node* loss = build_forward_graph(g, batch);
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
// Clear ops but keep parameters
tofu_graph_clear_ops(g);
}
3. Always check tensor shapes during development
tofu_graph_node* result = tofu_graph_matmul(g, x, W);
tofu_tensor* result_tensor = tofu_graph_get_value(result);
printf("Result shape: [");
for (int i = 0; i < result_tensor->ndim; i++) {
printf("%d%s", result_tensor->dims[i],
i < result_tensor->ndim - 1 ? ", " : "");
}
printf("]\n");
Debugging Strategies
Monitor loss values
tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
float loss_val;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);
if (isnan(loss_val) || isinf(loss_val)) {
printf("ERROR: Loss is NaN or Inf at epoch %d\n", epoch);
// Check gradients, learning rate, or input data
}
if (loss_val > prev_loss * 2.0f) {
printf("WARNING: Loss spiked at epoch %d\n", epoch);
// Consider reducing learning rate
}
Check gradient magnitudes
void print_gradient_stats(tofu_graph_node* param, const char* name) {
tofu_tensor* grad = tofu_graph_get_grad(param);
if (!grad) return;
float min = 1e9f, max = -1e9f, sum = 0.0f;
for (int i = 0; i < grad->len; i++) {
float g;
TOFU_TENSOR_DATA_TO(grad, i, g, TOFU_FLOAT);
if (g < min) min = g;
if (g > max) max = g;
sum += g * g;
}
printf("%s grad: min=%.6f, max=%.6f, norm=%.6f\n",
name, min, max, sqrtf(sum));
}
Validate forward pass outputs
tofu_tensor* probs = tofu_graph_get_value(softmax_node);
// Check that each row of probabilities sums to 1 (here: the first row)
float sum = 0.0f;
for (int i = 0; i < probs->dims[1]; i++) {
float p;
TOFU_TENSOR_DATA_TO(probs, i, p, TOFU_FLOAT);
sum += p;
}
if (fabsf(sum - 1.0f) > 1e-5f) {
printf("WARNING: Probabilities don't sum to 1: %.6f\n", sum);
}
Performance Tips
1. Batch your data
// Slow: Process one sample at a time
for (int i = 0; i < 1000; i++) {
tofu_graph_node* x = tofu_graph_input(g, single_samples[i]);
// ... forward, backward, update ...
}
// Fast: Process batches
int batch_size = 32;
for (int i = 0; i < 1000; i += batch_size) {
tofu_graph_node* x = tofu_graph_input(g, batched_samples[i/batch_size]);
// ... forward, backward, update ...
}
2. Reuse graph structure
// Less efficient: Create new graph each iteration
for (int epoch = 0; epoch < 100; epoch++) {
tofu_graph* g = tofu_graph_create();
// ... train ...
tofu_graph_free(g);
}
// More efficient: Reuse graph
tofu_graph* g = tofu_graph_create();
for (int epoch = 0; epoch < 100; epoch++) {
// ... train ...
tofu_graph_clear_ops(g);
}
tofu_graph_free(g);
3. Profile your code
#include <time.h>
clock_t start = clock();
tofu_graph_backward(g, loss);
clock_t end = clock();
double time_ms = 1000.0 * (end - start) / CLOCKS_PER_SEC;
printf("Backward pass: %.2f ms\n", time_ms);
Common Pitfalls to Avoid
- Forgetting to zero gradients: Always call tofu_graph_zero_grad() before the backward pass
- Freeing tensors too early: Don't free parameter tensors until after tofu_graph_free()
- Wrong loss node: Ensure the loss is a scalar before calling backward
- Shape mismatches: Use tofu_tensor_print() to debug shape issues
- Learning rate too high: Start with small values (0.001-0.01) and adjust
- No validation set: Always evaluate on separate data to detect overfitting
Complete Training Template
// Complete training example with best practices
void train_model(Dataset* train_data, Dataset* val_data) {
// Initialize
tofu_graph* g = tofu_graph_create();
Model* model = model_create(input_dim, hidden_dim, output_dim);
tofu_optimizer* opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
float best_val_loss = 1e9f;
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Training phase
float train_loss = 0.0f;
for (int batch = 0; batch < train_data->num_batches; batch++) {
tofu_graph_zero_grad(g);
tofu_graph_node* x = tofu_graph_input(g, train_data->batches[batch].x);
tofu_graph_node* pred = model_forward(g, x, model);
tofu_graph_node* target = tofu_graph_input(g, train_data->batches[batch].y);
tofu_graph_node* loss = tofu_graph_ce_loss(g, pred, target);
tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
float batch_loss;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, batch_loss, TOFU_FLOAT);
train_loss += batch_loss;
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
tofu_graph_clear_ops(g);
}
train_loss /= train_data->num_batches;
// Validation phase (no gradient computation)
float val_loss = 0.0f;
for (int batch = 0; batch < val_data->num_batches; batch++) {
tofu_graph_node* x = tofu_graph_input(g, val_data->batches[batch].x);
tofu_graph_node* pred = model_forward(g, x, model);
tofu_graph_node* target = tofu_graph_input(g, val_data->batches[batch].y);
tofu_graph_node* loss = tofu_graph_ce_loss(g, pred, target);
tofu_tensor* loss_tensor = tofu_graph_get_value(loss);
float batch_loss;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, batch_loss, TOFU_FLOAT);
val_loss += batch_loss;
tofu_graph_clear_ops(g);
}
val_loss /= val_data->num_batches;
// Logging
printf("Epoch %3d: train_loss=%.4f, val_loss=%.4f",
epoch, train_loss, val_loss);
// Save best model
if (val_loss < best_val_loss) {
best_val_loss = val_loss;
printf(" (best)");
// Save model weights here
}
printf("\n");
// Early stopping
if (train_loss < 0.01f && val_loss > train_loss * 2.0f) {
printf("Early stopping: overfitting detected\n");
break;
}
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
model_free(model); // parameter tensors freed last
}
This completes the computation graphs user guide. You now have the knowledge to build, train, and debug neural networks using Tofu's graph API.
Training Neural Networks
This guide covers how to train neural networks using TOFU's automatic differentiation and optimization capabilities. We'll walk through the complete training process with practical examples.
Introduction
Training a neural network in TOFU follows a standard pattern familiar to users of modern frameworks like PyTorch or TensorFlow. The key difference is that TOFU is designed for resource-constrained environments like microcontrollers, so we emphasize memory efficiency and explicit resource management.
What You'll Learn
In this guide you'll learn:
- How to structure a complete training loop
- Data preparation and batching strategies
- Forward and backward pass mechanics
- Loss computation and monitoring
- Parameter optimization
- Training strategies for embedded systems
- Debugging and evaluation techniques
Prerequisites
Before starting, ensure you're familiar with:
- Basic tensor operations (see Tensors guide)
- Computation graphs (see Graphs guide)
- Loss functions (see Loss Functions guide)
- Optimizers (see Optimizers guide)
The Training Paradigm
Neural network training is an iterative process of:
- Making predictions (forward pass)
- Measuring error (loss computation)
- Computing gradients (backward pass)
- Updating weights (optimization step)
TOFU provides all the primitives needed for this cycle through its computation graph API and automatic differentiation engine.
Memory Considerations
Training on microcontrollers requires careful memory management. TOFU helps by:
- Allowing graph reuse across iterations via tofu_graph_clear_ops()
- Providing explicit control over tensor lifetimes
- Supporting in-place operations where possible
The Training Loop
Every training loop in TOFU follows a consistent five-step pattern. Understanding this pattern is essential for successful training.
The Five-Step Pattern
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
for (int batch = 0; batch < num_batches; batch++) {
/* Step 1: Zero gradients */
tofu_graph_zero_grad(g);
/* Step 2: Forward pass */
tofu_graph_node* prediction = forward_pass(g, input, params);
/* Step 3: Compute loss */
tofu_graph_node* loss = tofu_graph_mse_loss(g, prediction, target);
/* Step 4: Backward pass */
tofu_graph_backward(g, loss);
/* Step 5: Update parameters */
tofu_optimizer_step(optimizer);
/* Cleanup: Clear operations but keep parameters */
tofu_graph_clear_ops(g);
}
}
Let's examine each step in detail.
Step 1: Zero Gradients
Before computing new gradients, clear any gradients from the previous iteration:
tofu_graph_zero_grad(g);
Why? Gradients accumulate by default. If you don't zero them, new gradients add to old ones, producing incorrect updates.
When to skip: Only when you explicitly want gradient accumulation (advanced technique).
Step 2: Forward Pass
Build the computation graph and compute predictions:
/* Create input node */
tofu_graph_node* x = tofu_graph_input(g, input_tensor);
/* Build network */
tofu_graph_node* h1 = tofu_graph_matmul(g, x, w1);
tofu_graph_node* h1_bias = tofu_graph_add(g, h1, b1);
tofu_graph_node* h1_act = tofu_graph_relu(g, h1_bias);
/* Output layer */
tofu_graph_node* output = tofu_graph_matmul(g, h1_act, w2);
tofu_graph_node* pred = tofu_graph_add(g, output, b2);
Key principle: The forward pass constructs the computational graph that defines your model.
Step 3: Compute Loss
Compare predictions to targets:
tofu_graph_node* target = tofu_graph_input(g, target_tensor);
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
The loss node becomes the starting point for backpropagation.
Step 4: Backward Pass
Compute gradients via automatic differentiation:
tofu_graph_backward(g, loss);
This populates the grad field of all parameter nodes with gradients.
Step 5: Update Parameters
Apply gradients to update trainable parameters:
tofu_optimizer_step(optimizer);
The optimizer uses computed gradients and its internal algorithm (SGD, momentum, etc.) to update parameter values.
Graph Cleanup
After each iteration, clear operations while preserving parameters:
tofu_graph_clear_ops(g);
This frees intermediate computation nodes but keeps parameter nodes, allowing the graph to be reused in the next iteration.
Data Preparation
Proper data preparation is crucial for successful training. This section covers batching, normalization, and memory-efficient data handling.
Dataset Structure
Organize your data to facilitate batch processing:
typedef struct {
float* images; /* [num_samples, feature_dims] */
int* labels; /* [num_samples] */
int num_samples;
int feature_dims;
} dataset;
For the XOR problem, data preparation is simple:
float xor_inputs[4][2] = {
{0.0f, 0.0f},
{0.0f, 1.0f},
{1.0f, 0.0f},
{1.0f, 1.0f}
};
float xor_targets[4][1] = {
{0.0f},
{1.0f},
{1.0f},
{0.0f}
};
Batching Strategies
For larger datasets, process data in batches:
const int BATCH_SIZE = 4;
for (int batch_start = 0; batch_start < num_samples; batch_start += BATCH_SIZE) {
int batch_end = (batch_start + BATCH_SIZE < num_samples)
? batch_start + BATCH_SIZE
: num_samples;
int actual_batch_size = batch_end - batch_start;
/* Prepare batch data */
float* batch_data = (float*)malloc(actual_batch_size * feature_dims * sizeof(float));
int* batch_labels = (int*)malloc(actual_batch_size * sizeof(int));
for (int i = 0; i < actual_batch_size; i++) {
memcpy(batch_data + i * feature_dims,
dataset->images + (batch_start + i) * feature_dims,
feature_dims * sizeof(float));
batch_labels[i] = dataset->labels[batch_start + i];
}
/* Create batch tensor */
tofu_tensor* t_batch = tofu_tensor_create(batch_data, 2,
(int[]){actual_batch_size, feature_dims},
TOFU_FLOAT);
/* ... training step ... */
/* Cleanup */
tofu_tensor_free(t_batch);
free(batch_data);
free(batch_labels);
}
Batch size considerations:
- Larger batches: More stable gradients, better hardware utilization
- Smaller batches: Less memory usage, more frequent updates
- For microcontrollers: Start with batch_size=1 or very small batches
Data Normalization
Normalize inputs for better training stability:
/* Compute mean and std from training data */
void compute_statistics(const float* data, int num_samples, int dims,
float* mean, float* std) {
/* Zero initialize */
for (int d = 0; d < dims; d++) {
mean[d] = 0.0f;
std[d] = 0.0f;
}
/* Compute mean */
for (int i = 0; i < num_samples; i++) {
for (int d = 0; d < dims; d++) {
mean[d] += data[i * dims + d];
}
}
for (int d = 0; d < dims; d++) {
mean[d] /= num_samples;
}
/* Compute std */
for (int i = 0; i < num_samples; i++) {
for (int d = 0; d < dims; d++) {
float diff = data[i * dims + d] - mean[d];
std[d] += diff * diff;
}
}
for (int d = 0; d < dims; d++) {
std[d] = sqrtf(std[d] / num_samples);
}
}
/* Normalize data */
void normalize_data(float* data, int num_samples, int dims,
const float* mean, const float* std) {
for (int i = 0; i < num_samples; i++) {
for (int d = 0; d < dims; d++) {
data[i * dims + d] = (data[i * dims + d] - mean[d]) / (std[d] + 1e-8f);
}
}
}
Common normalization strategies:
- Z-score normalization: (x - mean) / std (shown above)
- Min-max scaling: (x - min) / (max - min), mapping values to [0, 1] (see the sketch after this list)
- Simple scaling: Divide by 255 for image data
- No normalization: For binary inputs like XOR
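For comparison with the z-score code above, here is a minimal min-max scaling sketch. It assumes per-feature minimums and maximums have already been computed from the training data:
/* Min-max scaling sketch: maps each feature into [0, 1] using per-feature
 * minimums and maximums computed from the training data. */
void minmax_scale(float* data, int num_samples, int dims,
                  const float* min_vals, const float* max_vals) {
    for (int i = 0; i < num_samples; i++) {
        for (int d = 0; d < dims; d++) {
            float range = max_vals[d] - min_vals[d];
            data[i * dims + d] = (data[i * dims + d] - min_vals[d]) / (range + 1e-8f);
        }
    }
}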
One-Hot Encoding
For classification, encode labels as one-hot vectors:
/* Convert integer labels to one-hot encoding */
void create_one_hot(int* labels, int batch_size, int num_classes, float* one_hot) {
memset(one_hot, 0, batch_size * num_classes * sizeof(float));
for (int i = 0; i < batch_size; i++) {
one_hot[i * num_classes + labels[i]] = 1.0f;
}
}
/* Usage in training loop */
float* target_data = (float*)calloc(batch_size * num_classes, sizeof(float));
create_one_hot(batch_labels, batch_size, num_classes, target_data);
tofu_tensor* t_target = tofu_tensor_create(target_data, 2,
(int[]){batch_size, num_classes},
TOFU_FLOAT);
Forward Pass
The forward pass computes predictions by propagating input through the network. Understanding graph construction and reuse is key to efficient training.
Building the Computation Graph
For a simple feedforward network:
tofu_graph_node* forward_pass(tofu_graph* g,
tofu_tensor* input_data,
tofu_graph_node* w1, tofu_graph_node* b1,
tofu_graph_node* w2, tofu_graph_node* b2) {
/* Input layer */
tofu_graph_node* x = tofu_graph_input(g, input_data);
/* Hidden layer: x @ w1 + b1 */
tofu_graph_node* h1_matmul = tofu_graph_matmul(g, x, w1);
tofu_graph_node* h1_bias = tofu_graph_add(g, h1_matmul, b1);
tofu_graph_node* h1 = tofu_graph_relu(g, h1_bias);
/* Output layer: h1 @ w2 + b2 */
tofu_graph_node* out_matmul = tofu_graph_matmul(g, h1, w2);
tofu_graph_node* prediction = tofu_graph_add(g, out_matmul, b2);
return prediction;
}
Reusing Graphs with clear_ops
Instead of creating a new graph each iteration, reuse it:
/* Initialize graph once */
tofu_graph* g = tofu_graph_create();
/* Create parameters once (persist across iterations) */
tofu_graph_node* w1 = tofu_graph_param(g, t_w1);
tofu_graph_node* b1 = tofu_graph_param(g, t_b1);
tofu_graph_node* w2 = tofu_graph_param(g, t_w2);
tofu_graph_node* b2 = tofu_graph_param(g, t_b2);
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
tofu_graph_zero_grad(g);
/* Build forward pass (creates new operation nodes) */
tofu_graph_node* pred = forward_pass(g, input, w1, b1, w2, b2);
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, target);
tofu_graph_backward(g, loss);
tofu_optimizer_step(optimizer);
/* Clear operations but keep parameters */
tofu_graph_clear_ops(g); /* This is crucial! */
}
Why clear_ops?
- Frees intermediate operation nodes
- Preserves parameter nodes and their values
- Allows graph reuse without memory leaks
- Essential for embedded systems with limited memory
Activation Functions
TOFU supports several activation functions:
/* ReLU: max(0, x) */
tofu_graph_node* relu_out = tofu_graph_relu(g, input);
/* Softmax: exp(x) / sum(exp(x)) */
tofu_graph_node* softmax_out = tofu_graph_softmax(g, logits, 1); /* axis=1 */
Choose activation based on your task:
- ReLU: Hidden layers, default choice
- Softmax: Output layer for multi-class classification
- None: Output layer for regression
Computing Loss
The loss function measures prediction error and drives learning. Choosing the right loss function is critical for your task.
Mean Squared Error (MSE)
Use MSE for regression tasks:
tofu_graph_node* loss = tofu_graph_mse_loss(g, prediction, target);
Formula: L = mean((pred - target)^2)
When to use:
- Regression problems (predicting continuous values)
- Output layer without activation (raw values)
- Examples: XOR (as regression), price prediction, temperature estimation
Example from XOR training:
/* Prediction is continuous output */
tofu_graph_node* y_pred = tofu_graph_add(g, y_matmul, b2);
/* Target is also continuous */
float* target_data = (float*)malloc(OUTPUT_SIZE * sizeof(float));
target_data[0] = xor_targets[sample][0]; /* 0.0 or 1.0 */
tofu_tensor* t_target = tofu_tensor_create(target_data, 1,
(int[]){OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* y_target = tofu_graph_input(g, t_target);
/* MSE loss */
tofu_graph_node* loss_node = tofu_graph_mse_loss(g, y_pred, y_target);
Cross-Entropy (CE) Loss
Use cross-entropy for classification:
/* Apply softmax first */
tofu_graph_node* probs = tofu_graph_softmax(g, logits, 1);
/* Compute CE loss with one-hot targets */
tofu_graph_node* loss = tofu_graph_ce_loss(g, probs, one_hot_target);
Formula: L = -mean(sum(target * log(pred)))
When to use:
- Multi-class classification
- Softmax output layer (probabilities)
- One-hot encoded targets
- Examples: MNIST digit classification, CNN pattern recognition
Example from CNN training:
/* Forward pass with softmax */
tofu_graph_node* probs = cnn_forward_probs(g, input, params);
/* One-hot encode targets */
float* target_data = (float*)calloc(batch_size * num_classes, sizeof(float));
for (int i = 0; i < batch_size; i++) {
target_data[i * num_classes + labels[i]] = 1.0f;
}
tofu_tensor* t_target = tofu_tensor_create(target_data, 2,
(int[]){batch_size, num_classes},
TOFU_FLOAT);
tofu_graph_node* target = tofu_graph_input(g, t_target);
/* Cross-entropy loss */
tofu_graph_node* loss_node = tofu_graph_ce_loss(g, probs, target);
Extracting Loss Values
Get the scalar loss value for monitoring:
tofu_tensor* loss_tensor = tofu_graph_get_value(loss_node);
float loss_value = 0.0f;
if (loss_tensor && loss_tensor->len > 0) {
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_value, TOFU_FLOAT);
}
Monitoring Loss
Track loss to ensure training progresses:
float epoch_loss = 0.0f;
int num_batches = 0;
for (int batch_start = 0; batch_start < num_samples; batch_start += BATCH_SIZE) {
/* ... forward pass and loss computation ... */
float batch_loss = 0.0f;
tofu_tensor* loss_tensor = tofu_graph_get_value(loss_node);
if (loss_tensor && loss_tensor->len > 0) {
TOFU_TENSOR_DATA_TO(loss_tensor, 0, batch_loss, TOFU_FLOAT);
}
epoch_loss += batch_loss;
num_batches++;
/* ... backward pass and update ... */
}
float avg_loss = epoch_loss / num_batches;
printf("Epoch %d: avg_loss = %.6f\n", epoch, avg_loss);
Loss Patterns
Healthy training:
- Loss decreases steadily
- Eventually plateaus at a low value
- May have small oscillations
Problem signs:
- Loss increases: Learning rate too high or gradient explosion
- Loss stuck: Learning rate too low or poor initialization
- Loss = NaN: Numerical instability (try lower learning rate)
Backward Pass
The backward pass computes gradients through automatic differentiation. TOFU handles the complexity; you just call one function.
Invoking Backpropagation
tofu_graph_backward(g, loss_node);
This single call:
- Traverses the computation graph in reverse topological order
- Applies the chain rule at each operation
- Accumulates gradients in each node's grad field
How Automatic Differentiation Works
TOFU implements reverse-mode automatic differentiation (backpropagation):
Forward pass: Input → Op1 → Op2 → ... → Loss
Backward pass: Loss → ∂Op2 → ∂Op1 → ... → ∂Input
Each operation knows how to compute its local gradient:
- Matmul: ∂L/∂A = ∂L/∂C @ B^T and ∂L/∂B = A^T @ ∂L/∂C
- Add: ∂L/∂A = ∂L/∂C and ∂L/∂B = ∂L/∂C (with broadcasting)
- ReLU: ∂L/∂x = ∂L/∂y * (x > 0)
- MSE: ∂L/∂pred = 2 * (pred - target) / n
Gradient Flow Example
For a simple network y = relu(x @ w + b):
/* Forward pass builds graph */
tofu_graph_node* xw = tofu_graph_matmul(g, x, w);
tofu_graph_node* xw_b = tofu_graph_add(g, xw, b);
tofu_graph_node* y = tofu_graph_relu(g, xw_b);
tofu_graph_node* loss = tofu_graph_mse_loss(g, y, target);
/* Backward pass computes gradients */
tofu_graph_backward(g, loss);
/* Now gradients are available:
* loss->grad: Always 1.0 (starting point)
* y->grad: ∂L/∂y
* xw_b->grad: ∂L/∂y * relu_grad
* xw->grad: ∂L/∂(xw+b)
* w->grad: x^T @ ∂L/∂(xw) <- Used by optimizer
* b->grad: sum(∂L/∂(xw+b)) <- Used by optimizer
*/
Accessing Gradients
Check gradient values (useful for debugging):
tofu_tensor* w_grad = tofu_graph_get_grad(w1);
if (w_grad) {
printf("w1 gradient norm: ");
float grad_sum = 0.0f;
for (int i = 0; i < w_grad->len; i++) {
float val;
TOFU_TENSOR_DATA_TO(w_grad, i, val, TOFU_FLOAT);
grad_sum += val * val;
}
printf("%.6f\n", sqrtf(grad_sum));
}
Gradient Checking (Debug Tool)
Verify backprop implementation with numerical gradients:
float numerical_gradient(tofu_graph* g, tofu_graph_node* param,
tofu_graph_node* loss, int param_idx) {
const float epsilon = 1e-4f;
/* Get parameter value */
float original;
TOFU_TENSOR_DATA_TO(param->value, param_idx, original, TOFU_FLOAT);
/* Compute f(x + epsilon) */
float perturbed = original + epsilon;
TOFU_TENSOR_DATA_FROM(param->value, param_idx, perturbed, TOFU_FLOAT);
tofu_graph_zero_grad(g);
/* ... rerun forward pass ... */
float loss_plus;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_plus, TOFU_FLOAT);
/* Compute f(x - epsilon) */
perturbed = original - epsilon;
TOFU_TENSOR_DATA_FROM(param->value, param_idx, perturbed, TOFU_FLOAT);
tofu_graph_zero_grad(g);
/* ... rerun forward pass ... */
float loss_minus;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_minus, TOFU_FLOAT);
/* Restore original value */
TOFU_TENSOR_DATA_FROM(param->value, param_idx, original, TOFU_FLOAT);
/* Numerical gradient: (f(x+ε) - f(x-ε)) / (2ε) */
return (loss_plus - loss_minus) / (2.0f * epsilon);
}
/* Compare with analytical gradient */
float analytical_grad;
TOFU_TENSOR_DATA_TO(tofu_graph_get_grad(param), param_idx, analytical_grad, TOFU_FLOAT);
float numerical_grad = numerical_gradient(g, param, loss, param_idx);
float relative_error = fabsf(analytical_grad - numerical_grad) /
fmaxf(fabsf(analytical_grad), fabsf(numerical_grad));
if (relative_error < 1e-5f) {
printf("Gradient check PASSED (error: %.2e)\n", relative_error);
} else {
printf("Gradient check FAILED (error: %.2e)\n", relative_error);
}
Use gradient checking sparingly - it's expensive (requires multiple forward passes per parameter).
Common Gradient Issues
Vanishing gradients:
- Gradients become very small (near zero)
- Common with deep networks or saturating activations
- Solutions: Better initialization (Xavier), ReLU activations, batch normalization
Exploding gradients:
- Gradients become very large
- Loss becomes NaN
- Solutions: Lower learning rate, gradient clipping (see the sketch below), better initialization
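Tofu does not expose a clipping helper in this guide, so here is a minimal sketch of global-norm gradient clipping. It assumes gradients can be rescaled in place with TOFU_TENSOR_DATA_FROM, following the (tensor, index, value, type) convention used in the gradient-checking example:
#include <math.h>
/* Sketch: rescale all parameter gradients so their combined L2 norm
 * does not exceed max_norm. Call between tofu_graph_backward() and
 * tofu_optimizer_step(). */
void clip_gradients(tofu_graph_node** params, int num_params, float max_norm) {
    /* 1. Compute the global L2 norm over all parameter gradients */
    float total = 0.0f;
    for (int p = 0; p < num_params; p++) {
        tofu_tensor* grad = tofu_graph_get_grad(params[p]);
        if (!grad) continue;
        for (int i = 0; i < grad->len; i++) {
            float v;
            TOFU_TENSOR_DATA_TO(grad, i, v, TOFU_FLOAT);
            total += v * v;
        }
    }
    float norm = sqrtf(total);
    if (norm <= max_norm) return; /* Norm already small enough */
    /* 2. Rescale every gradient by max_norm / norm */
    float scale = max_norm / (norm + 1e-8f);
    for (int p = 0; p < num_params; p++) {
        tofu_tensor* grad = tofu_graph_get_grad(params[p]);
        if (!grad) continue;
        for (int i = 0; i < grad->len; i++) {
            float v;
            TOFU_TENSOR_DATA_TO(grad, i, v, TOFU_FLOAT);
            v *= scale;
            TOFU_TENSOR_DATA_FROM(grad, i, v, TOFU_FLOAT);
        }
    }
}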
Parameter Updates
After computing gradients, update parameters using the optimizer. This is where learning actually happens.
Optimizer Step
tofu_optimizer_step(optimizer);
This updates all parameters according to the optimizer's algorithm:
SGD: param = param - learning_rate * grad
SGD with momentum:
velocity = momentum * velocity + grad
param = param - learning_rate * velocity
Choosing an Optimizer
TOFU provides two optimizers:
/* Vanilla SGD */
tofu_optimizer* sgd = tofu_optimizer_sgd_create(g, 0.01);
/* SGD with momentum (recommended for most tasks) */
tofu_optimizer* sgd_momentum = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
Vanilla SGD:
- Simplest algorithm
- Good for convex problems
- Can oscillate in ravines
- Use when: Memory is tight, simple problem
SGD with Momentum:
- Accumulates velocity
- Faster convergence
- Less oscillation
- Use when: Default choice for most problems
Learning Rate Selection
The learning rate is the most important hyperparameter:
/* Too high (0.5): May diverge or oscillate */
tofu_optimizer* opt_high = tofu_optimizer_sgd_create(g, 0.5);
/* Too low (0.0001): Very slow convergence */
tofu_optimizer* opt_low = tofu_optimizer_sgd_create(g, 0.0001);
/* Just right (0.01 - 0.1): Task-dependent */
tofu_optimizer* opt_good = tofu_optimizer_sgd_create(g, 0.01);
Guidelines:
- Start with 0.01 or 0.1
- If loss diverges: Reduce by 10x
- If convergence is slow: Increase by 2-3x
- Smaller networks often need larger learning rates
- Batch size matters: Larger batches → higher learning rate
Parameter Update Timing
The order matters:
/* Correct order */
tofu_graph_zero_grad(g); /* 1. Clear old gradients */
/* forward pass */ /* 2. Compute predictions */
/* loss computation */ /* 3. Measure error */
tofu_graph_backward(g, loss); /* 4. Compute gradients */
tofu_optimizer_step(optimizer); /* 5. Update parameters */
/* WRONG: Update before backward */
tofu_optimizer_step(optimizer); /* Updates with old/zero gradients! */
tofu_graph_backward(g, loss);
Monitoring Parameter Changes
Track how much parameters change each step:
/* Before update */
float w_before;
TOFU_TENSOR_DATA_TO(w1->value, 0, w_before, TOFU_FLOAT);
/* Update */
tofu_optimizer_step(optimizer);
/* After update */
float w_after;
TOFU_TENSOR_DATA_TO(w1->value, 0, w_after, TOFU_FLOAT);
float change = fabsf(w_after - w_before);
printf("Parameter change: %.6f\n", change);
Healthy training:
- Parameters change gradually
- Change magnitude decreases over time
- No sudden jumps
Training Strategies
Effective training requires more than just the basic loop. Here are strategies for better results.
Mini-Batch Training
Process data in batches instead of one sample at a time:
const int BATCH_SIZE = 16;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
/* Shuffle data (optional but recommended) */
shuffle_dataset(dataset);
for (int batch_start = 0; batch_start < num_samples; batch_start += BATCH_SIZE) {
int batch_end = (batch_start + BATCH_SIZE < num_samples)
? batch_start + BATCH_SIZE
: num_samples;
int actual_batch_size = batch_end - batch_start;
/* Prepare batch */
float* batch_data = create_batch(dataset, batch_start, actual_batch_size);
tofu_tensor* t_batch = tofu_tensor_create(batch_data, 2,
(int[]){actual_batch_size, feature_dims},
TOFU_FLOAT);
/* Train on batch */
tofu_graph_zero_grad(g);
tofu_graph_node* input = tofu_graph_input(g, t_batch);
/* ... rest of training step ... */
/* Cleanup */
tofu_tensor_free(t_batch);
free(batch_data);
tofu_graph_clear_ops(g);
}
}
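The shuffle_dataset() call above is a placeholder. One possible sketch, using the dataset struct from the Data Preparation section and an in-place Fisher-Yates shuffle, looks like this:
#include <stdlib.h>
#include <string.h>
/* Sketch of shuffle_dataset(): shuffles samples in place so each epoch
 * sees the data in a different order. Assumes the dataset struct above. */
void shuffle_dataset(dataset* ds) {
    int dims = ds->feature_dims;
    float* tmp = malloc(dims * sizeof(float));
    if (!tmp) return;
    for (int i = ds->num_samples - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        /* Swap feature rows i and j */
        memcpy(tmp, ds->images + i * dims, dims * sizeof(float));
        memcpy(ds->images + i * dims, ds->images + j * dims, dims * sizeof(float));
        memcpy(ds->images + j * dims, tmp, dims * sizeof(float));
        /* Swap the corresponding labels */
        int t = ds->labels[i];
        ds->labels[i] = ds->labels[j];
        ds->labels[j] = t;
    }
    free(tmp);
}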
Batch size trade-offs:
- Larger (32-128): More stable gradients, better GPU utilization, higher memory
- Smaller (1-8): Less memory, more updates, noisier gradients
- Microcontroller: Often limited to 1-4 due to memory constraints
Epoch Management
An epoch is one complete pass through the training data:
const int NUM_EPOCHS = 100;
float best_loss = INFINITY;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
float epoch_loss = 0.0f;
int num_batches = 0;
/* Train on all batches */
for (int batch = 0; batch < num_batches_total; batch++) {
/* ... training step ... */
epoch_loss += batch_loss;
num_batches++;
}
/* Average loss over epoch */
float avg_loss = epoch_loss / num_batches;
/* Track best model */
if (avg_loss < best_loss) {
best_loss = avg_loss;
/* Optionally save parameters */
}
/* Report progress */
if (epoch % 10 == 0) {
printf("Epoch %d: loss = %.6f\n", epoch, avg_loss);
}
}
How many epochs?
- Too few: Underfitting (model hasn't learned)
- Too many: Overfitting (model memorizes training data)
- Monitor validation loss to determine when to stop
Learning Rate Scheduling
Adjust learning rate during training:
float initial_lr = 0.1f;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
/* Step decay: Reduce by 10x every 50 epochs */
float lr = initial_lr;
if (epoch >= 50) lr *= 0.1f;
if (epoch >= 100) lr *= 0.1f;
/* Recreate optimizer with new learning rate */
tofu_optimizer_free(optimizer);
optimizer = tofu_optimizer_sgd_create(g, lr);
/* ... training for this epoch ... */
}
Common schedules:
- Step decay: Reduce by constant factor at fixed intervals
- Exponential decay: lr = lr0 * exp(-k * epoch) (see the helper after this list)
- Manual: Reduce when loss plateaus
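As a small illustration of the exponential schedule (a sketch, not part of the Tofu API), the learning rate for a given epoch can be computed with a helper like this and then used when recreating the optimizer, as in the step-decay example above:
#include <math.h>
/* Exponential decay sketch: lr = lr0 * exp(-k * epoch).
 * lr0 and k are hyperparameters you choose; the values below are examples. */
float exp_decay_lr(float lr0, float k, int epoch) {
    return lr0 * expf(-k * (float)epoch);
}
/* Usage inside the epoch loop:
 *   float lr = exp_decay_lr(0.1f, 0.01f, epoch);
 */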
Data Augmentation
Increase effective dataset size by transforming inputs:
void augment_image(float* image, int width, int height) {
/* Random flip */
if (rand() % 2) {
horizontal_flip(image, width, height);
}
/* Random noise */
for (int i = 0; i < width * height; i++) {
image[i] += 0.01f * ((float)rand() / RAND_MAX - 0.5f);
}
}
/* Apply during training */
augment_image(batch_data, 8, 8);
tofu_tensor* t_input = tofu_tensor_create(batch_data, 2, ...);
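The horizontal_flip() helper above isn't defined in this guide; a minimal in-place sketch for a row-major grayscale image might look like:
/* Sketch of horizontal_flip(): mirrors each row of a [height x width]
 * row-major grayscale image in place. */
void horizontal_flip(float* image, int width, int height) {
    for (int y = 0; y < height; y++) {
        float* row = image + y * width;
        for (int x = 0; x < width / 2; x++) {
            float tmp = row[x];
            row[x] = row[width - 1 - x];
            row[width - 1 - x] = tmp;
        }
    }
}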
Augmentation techniques:
- Rotation, flipping, cropping (images)
- Noise injection (signals)
- Time shifting (sequences)
- Caution: Some augmentations may not make sense for your data
Early Stopping
Stop training when validation loss stops improving:
float best_val_loss = INFINITY;
int patience = 10; /* Number of epochs to wait */
int wait = 0;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
/* Train */
float train_loss = train_epoch(g, optimizer, train_data);
/* Validate */
float val_loss = evaluate(g, params, val_data);
/* Check improvement */
if (val_loss < best_val_loss) {
best_val_loss = val_loss;
wait = 0;
/* Save best model */
} else {
wait++;
if (wait >= patience) {
printf("Early stopping at epoch %d\n", epoch);
break;
}
}
}
Monitoring and Debugging
Track training progress and diagnose issues.
Loss Curves
Plot loss over time to understand training dynamics:
#define MAX_EPOCHS 200
float train_losses[MAX_EPOCHS];
float val_losses[MAX_EPOCHS];
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
train_losses[epoch] = train_epoch(...);
val_losses[epoch] = evaluate(...);
printf("Epoch %3d: train_loss=%.4f, val_loss=%.4f\n",
epoch, train_losses[epoch], val_losses[epoch]);
}
/* Analyze curves */
save_losses("losses.txt", train_losses, val_losses, NUM_EPOCHS);
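The save_losses() helper is not part of the Tofu API shown in this guide; a minimal sketch that writes one line per epoch for later plotting could be:
#include <stdio.h>
/* Sketch of save_losses(): writes "epoch train_loss val_loss" per line. */
void save_losses(const char* path, const float* train_losses,
                 const float* val_losses, int num_epochs) {
    FILE* f = fopen(path, "w");
    if (!f) return;
    for (int e = 0; e < num_epochs; e++) {
        fprintf(f, "%d %.6f %.6f\n", e, train_losses[e], val_losses[e]);
    }
    fclose(f);
}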
What to look for:
- Both decreasing: Healthy training
- Train decreases, val increases: Overfitting
- Both plateau: Underfitting or need lower learning rate
- Both increasing: Learning rate too high
Gradient Monitoring
Check gradient magnitudes:
void check_gradients(tofu_graph* g, tofu_graph_node** params, int num_params) {
for (int i = 0; i < num_params; i++) {
tofu_tensor* grad = tofu_graph_get_grad(params[i]);
if (!grad) continue;
float grad_norm = 0.0f;
for (int j = 0; j < grad->len; j++) {
float val;
TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
grad_norm += val * val;
}
grad_norm = sqrtf(grad_norm);
printf("Param %d gradient norm: %.6f\n", i, grad_norm);
/* Warning signs */
if (grad_norm < 1e-7f) {
printf(" WARNING: Vanishing gradient!\n");
}
if (grad_norm > 1e3f) {
printf(" WARNING: Exploding gradient!\n");
}
}
}
/* Call after backward pass */
tofu_graph_backward(g, loss);
check_gradients(g, params, num_params);
Activation Statistics
Monitor activation distributions:
void check_activations(tofu_graph_node* node) {
tofu_tensor* act = tofu_graph_get_value(node);
if (!act) return;
float min_val = INFINITY, max_val = -INFINITY, mean = 0.0f;
for (int i = 0; i < act->len; i++) {
float val;
TOFU_TENSOR_DATA_TO(act, i, val, TOFU_FLOAT);
if (val < min_val) min_val = val;
if (val > max_val) max_val = val;
mean += val;
}
mean /= act->len;
printf("Activation stats: min=%.4f, max=%.4f, mean=%.4f\n",
min_val, max_val, mean);
/* Warning signs */
if (max_val - min_val < 1e-6f) {
printf(" WARNING: Dead activations (all same)!\n");
}
}
Debugging Checklist
When training fails, check:
- Loss is NaN:
  - Reduce learning rate
  - Check for division by zero
  - Verify input data normalization
- Loss doesn't decrease:
  - Increase learning rate
  - Check gradient flow (print gradients)
  - Verify data/labels are correct
  - Try better initialization
- Training is slow:
  - Increase learning rate
  - Use momentum
  - Check batch size
  - Verify the network is not too large
- Overfitting:
  - Add more training data
  - Reduce network size
  - Use a validation set for early stopping
Evaluation
After training, evaluate model performance on test data.
Computing Accuracy
For classification tasks:
float compute_accuracy(tofu_graph* g, cnn_params* params,
float* test_data, int* test_labels, int num_samples) {
int correct = 0;
for (int i = 0; i < num_samples; i++) {
tofu_graph_zero_grad(g);
/* Forward pass */
float* input_data = &test_data[i * INPUT_SIZE];
tofu_tensor* t_input = tofu_tensor_create(input_data, 1,
(int[]){INPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* input = tofu_graph_input(g, t_input);
tofu_graph_node* probs = cnn_forward_probs(g, input, params);
/* Get prediction */
tofu_tensor* probs_tensor = tofu_graph_get_value(probs);
int pred_class = argmax(probs_tensor);
if (pred_class == test_labels[i]) {
correct++;
}
tofu_tensor_free(t_input);
tofu_graph_clear_ops(g);
}
return (float)correct / num_samples;
}
/* Helper function */
int argmax(tofu_tensor* tensor) {
int max_idx = 0;
float max_val = -INFINITY;
for (int i = 0; i < tensor->len; i++) {
float val;
TOFU_TENSOR_DATA_TO(tensor, i, val, TOFU_FLOAT);
if (val > max_val) {
max_val = val;
max_idx = i;
}
}
return max_idx;
}
Regression Metrics
For regression tasks:
float compute_mse(tofu_graph* g, tofu_graph_node* w1, tofu_graph_node* b1,
tofu_graph_node* w2, tofu_graph_node* b2,
float test_inputs[][2], float test_targets[][1], int num_samples) {
float total_error = 0.0f;
for (int i = 0; i < num_samples; i++) {
/* Forward pass */
float* input_data = (float*)malloc(2 * sizeof(float));
input_data[0] = test_inputs[i][0];
input_data[1] = test_inputs[i][1];
tofu_tensor* t_input = tofu_tensor_create(input_data, 1, (int[]){2}, TOFU_FLOAT);
tofu_graph_node* x = tofu_graph_input(g, t_input);
/* Network computation */
tofu_graph_node* h1 = tofu_graph_relu(g, tofu_graph_add(g,
tofu_graph_matmul(g, x, w1), b1));
tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, h1, w2), b2);
/* Get prediction */
float pred_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(pred), 0, pred_val, TOFU_FLOAT);
/* Compute error */
float error = pred_val - test_targets[i][0];
total_error += error * error;
tofu_tensor_free(t_input);
free(input_data);
tofu_graph_clear_ops(g);
}
return total_error / num_samples;
}
Confusion Matrix
For detailed classification analysis:
void compute_confusion_matrix(tofu_graph* g, cnn_params* params,
float* test_data, int* test_labels,
int num_samples, int num_classes,
int confusion[4][4]) {
/* Initialize matrix (assumes num_classes == 4 to match the fixed-size parameter) */
memset(confusion, 0, num_classes * num_classes * sizeof(int));
for (int i = 0; i < num_samples; i++) {
/* Get prediction */
int pred_class = predict_sample(g, params, &test_data[i * INPUT_SIZE]);
int true_class = test_labels[i];
/* Update confusion matrix */
confusion[true_class][pred_class]++;
}
/* Print matrix */
printf("\nConfusion Matrix:\n");
printf(" ");
for (int i = 0; i < num_classes; i++) printf("%4d ", i);
printf("\n");
for (int i = 0; i < num_classes; i++) {
printf("True %d: ", i);
for (int j = 0; j < num_classes; j++) {
printf("%4d ", confusion[i][j]);
}
printf("\n");
}
}
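The predict_sample() helper used above isn't defined in this guide. A minimal sketch, reusing cnn_forward_probs() and argmax() from the accuracy example and the same INPUT_SIZE constant, might be:
/* Sketch of predict_sample(): one forward pass, returns the argmax class. */
int predict_sample(tofu_graph* g, cnn_params* params, float* sample) {
    tofu_tensor* t_input = tofu_tensor_create(sample, 1,
                                              (int[]){INPUT_SIZE}, TOFU_FLOAT);
    tofu_graph_node* input = tofu_graph_input(g, t_input);
    tofu_graph_node* probs = cnn_forward_probs(g, input, params);
    int pred_class = argmax(tofu_graph_get_value(probs));
    tofu_tensor_free(t_input);
    tofu_graph_clear_ops(g);
    return pred_class;
}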
Complete Example
Here's a complete XOR training example bringing everything together:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"
/* Xavier initialization */
float xavier_init(int fan_in) {
float limit = sqrtf(6.0f / fan_in);
return limit * (2.0f * (float)rand() / RAND_MAX - 1.0f);
}
int main() {
/* Configuration */
const int INPUT_SIZE = 2, HIDDEN_SIZE = 4, OUTPUT_SIZE = 1;
const int NUM_EPOCHS = 2000;
const float LEARNING_RATE = 0.1f;
/* XOR dataset */
float inputs[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
float targets[4][1] = {{0}, {1}, {1}, {0}};
/* Create graph */
tofu_graph* g = tofu_graph_create();
/* Initialize parameters */
float* w1_data = malloc(INPUT_SIZE * HIDDEN_SIZE * sizeof(float));
for (int i = 0; i < INPUT_SIZE * HIDDEN_SIZE; i++)
w1_data[i] = xavier_init(INPUT_SIZE);
tofu_tensor* t_w1 = tofu_tensor_create(w1_data, 2,
(int[]){INPUT_SIZE, HIDDEN_SIZE}, TOFU_FLOAT);
tofu_graph_node* w1 = tofu_graph_param(g, t_w1);
float* b1_data = calloc(HIDDEN_SIZE, sizeof(float));
tofu_tensor* t_b1 = tofu_tensor_create(b1_data, 1, (int[]){HIDDEN_SIZE}, TOFU_FLOAT);
tofu_graph_node* b1 = tofu_graph_param(g, t_b1);
float* w2_data = malloc(HIDDEN_SIZE * OUTPUT_SIZE * sizeof(float));
for (int i = 0; i < HIDDEN_SIZE * OUTPUT_SIZE; i++)
w2_data[i] = xavier_init(HIDDEN_SIZE);
tofu_tensor* t_w2 = tofu_tensor_create(w2_data, 2,
(int[]){HIDDEN_SIZE, OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* w2 = tofu_graph_param(g, t_w2);
float* b2_data = calloc(OUTPUT_SIZE, sizeof(float));
tofu_tensor* t_b2 = tofu_tensor_create(b2_data, 1, (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* b2 = tofu_graph_param(g, t_b2);
/* Create optimizer */
tofu_optimizer* optimizer = tofu_optimizer_sgd_create(g, LEARNING_RATE);
/* Training loop */
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
float epoch_loss = 0.0f;
for (int sample = 0; sample < 4; sample++) {
/* Zero gradients */
tofu_graph_zero_grad(g);
/* Create input */
float* in_data = malloc(INPUT_SIZE * sizeof(float));
in_data[0] = inputs[sample][0];
in_data[1] = inputs[sample][1];
tofu_tensor* t_in = tofu_tensor_create(in_data, 1, (int[]){INPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* x = tofu_graph_input(g, t_in);
/* Forward pass */
tofu_graph_node* h1 = tofu_graph_relu(g, tofu_graph_add(g,
tofu_graph_matmul(g, x, w1), b1));
tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, h1, w2), b2);
/* Create target */
float* tgt_data = malloc(OUTPUT_SIZE * sizeof(float));
tgt_data[0] = targets[sample][0];
tofu_tensor* t_tgt = tofu_tensor_create(tgt_data, 1, (int[]){OUTPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* tgt = tofu_graph_input(g, t_tgt);
/* Compute loss */
tofu_graph_node* loss = tofu_graph_mse_loss(g, pred, tgt);
float loss_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
epoch_loss += loss_val;
/* Backward pass */
tofu_graph_backward(g, loss);
/* Update parameters */
tofu_optimizer_step(optimizer);
/* Cleanup */
tofu_tensor_free(t_in);
tofu_tensor_free(t_tgt);
free(in_data);
free(tgt_data);
tofu_graph_clear_ops(g);
}
/* Report progress */
if (epoch % 200 == 0) {
printf("Epoch %4d: loss = %.6f\n", epoch, epoch_loss / 4);
}
}
/* Evaluate */
printf("\nFinal predictions:\n");
for (int i = 0; i < 4; i++) {
float* in_data = malloc(INPUT_SIZE * sizeof(float));
in_data[0] = inputs[i][0];
in_data[1] = inputs[i][1];
tofu_tensor* t_in = tofu_tensor_create(in_data, 1, (int[]){INPUT_SIZE}, TOFU_FLOAT);
tofu_graph_node* x = tofu_graph_input(g, t_in);
tofu_graph_node* h1 = tofu_graph_relu(g, tofu_graph_add(g,
tofu_graph_matmul(g, x, w1), b1));
tofu_graph_node* pred = tofu_graph_add(g, tofu_graph_matmul(g, h1, w2), b2);
float pred_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(pred), 0, pred_val, TOFU_FLOAT);
printf("[%.0f, %.0f] -> %.4f (target: %.0f)\n",
inputs[i][0], inputs[i][1], pred_val, targets[i][0]);
tofu_tensor_free(t_in);
free(in_data);
tofu_graph_clear_ops(g);
}
/* Cleanup */
tofu_optimizer_free(optimizer);
tofu_graph_free(g);
tofu_tensor_free_data_too(t_w1);
tofu_tensor_free_data_too(t_b1);
tofu_tensor_free_data_too(t_w2);
tofu_tensor_free_data_too(t_b2);
return 0;
}
This example demonstrates:
- Parameter initialization with Xavier method
- Complete training loop with all five steps
- Proper memory management (malloc/free)
- Graph reuse via clear_ops
- Loss monitoring during training
- Final evaluation on the dataset
Best Practices
Memory Management
Always free tensors in correct order:
/* Correct order */
tofu_optimizer_free(optimizer); /* 1. Free optimizer first */
tofu_graph_free(g); /* 2. Free graph second */
tofu_tensor_free_data_too(t_w1); /* 3. Free parameter tensors last */
tofu_tensor_free_data_too(t_b1);
Use clear_ops between iterations:
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
/* ... training step ... */
tofu_graph_clear_ops(g); /* Prevents memory leaks */
}
Initialization
Use Xavier/He initialization:
/* Xavier: Good for tanh/sigmoid */
float xavier = sqrtf(6.0f / fan_in) * (2.0f * rand() / RAND_MAX - 1.0f);
/* He: Better for ReLU */
float he = sqrtf(2.0f / fan_in) * (2.0f * rand() / RAND_MAX - 1.0f);
Never initialize to all zeros:
/* WRONG: Breaks symmetry, prevents learning */
float* w_data = calloc(size, sizeof(float));
/* CORRECT: Random initialization */
for (int i = 0; i < size; i++)
w_data[i] = xavier_init(fan_in);
Hyperparameter Tuning
Start with these defaults:
- Learning rate: 0.01 - 0.1
- Batch size: 1 - 16 (for microcontrollers)
- Hidden layer size: 2x - 4x input size
- Epochs: 100 - 1000
Tune systematically:
- Get the model working at all (reduce problem size if needed)
- Tune learning rate (most important)
- Tune architecture (layer sizes)
- Tune batch size (memory permitting)
Debugging
Print everything during development:
printf("Loss: %.6f\n", loss_val);
printf("Grad norm: %.6f\n", grad_norm);
printf("Prediction: %.4f, Target: %.4f\n", pred, target);
Check intermediate values:
tofu_tensor* h1_val = tofu_graph_get_value(h1);
printf("Hidden layer stats: ");
print_tensor_stats(h1_val);
Start simple, scale up:
- Verify on tiny dataset (4 samples)
- Check on small network (few parameters)
- Scale to full problem once working
Resource Constraints
For microcontrollers, minimize memory usage:
- Use batch_size=1 if memory is tight
- Keep networks small (< 10k parameters)
- Reuse graph with clear_ops
- Consider quantization (future work)
- Profile memory usage regularly
With these best practices, you're ready to train neural networks on TOFU. See the examples directory for more complete training scripts.
Optimizers
Optimizers update neural network parameters using computed gradients. Understanding how optimizers work and how to tune them is essential for training models effectively.
Introduction
Training a neural network means finding parameter values that minimize a loss function. This is an optimization problem: start with random parameters, compute gradients that indicate how to adjust them, and iteratively update parameters to reduce loss.
Optimizers automate this process. They take computed gradients and apply update rules to parameters. Different optimizers use different strategies—some use only the current gradient (like SGD), while others accumulate information from previous steps (like momentum-based methods).
This guide explains optimizer fundamentals, shows you how to create and use optimizers, describes the algorithms available in Tofu, and provides practical guidance for tuning hyperparameters and troubleshooting training issues.
Optimizer Fundamentals
Understanding how optimizers work requires grasping two key concepts: gradient descent and the learning rate.
Gradient Descent: Following the Slope Downhill
Imagine you're standing on a mountain in fog, trying to reach the lowest point. You can't see far, but you can feel which direction slopes downward beneath your feet. Gradient descent works the same way: at each step, compute which direction reduces the loss function, then take a small step in that direction.
Mathematically, for a parameter theta and loss L:
theta_new = theta_old - learning_rate * gradient
Where gradient = dL/dtheta (the derivative of loss with respect to the parameter).
The gradient points in the direction of steepest ascent (uphill). By subtracting it, we move downhill toward lower loss.
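To make the update rule concrete, here is a tiny standalone C illustration (no Tofu calls) that minimizes f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3):
#include <stdio.h>
/* One-dimensional gradient descent toy example. */
int main(void) {
    float theta = 0.0f;           /* Start far from the minimum at theta = 3 */
    float learning_rate = 0.1f;
    for (int step = 0; step < 50; step++) {
        float gradient = 2.0f * (theta - 3.0f);
        theta = theta - learning_rate * gradient;  /* The update rule above */
    }
    printf("theta after 50 steps: %.4f\n", theta); /* Approaches 3.0 */
    return 0;
}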
Learning Rate: Step Size Matters
The learning rate controls how large a step to take. This is the single most important hyperparameter in training neural networks.
Too large: You'll overshoot the minimum, potentially making loss worse or causing training to diverge completely.
Loss landscape: \ /
\__/
With large steps: --> X <-- (overshoot back and forth)
Too small: Training converges slowly. You'll make progress, but it might take 10x or 100x more iterations than necessary.
Loss landscape: \ /
\__/
With tiny steps: . . . . . (very slow progress)
Just right: Training converges efficiently without instability.
Loss landscape: \ /
\__/
Good step size: -> -> -> (steady progress to minimum)
Typical learning rates range from 0.0001 to 0.1. Start with 0.01 and adjust based on training behavior.
Stochastic Gradient Descent (SGD)
Classical gradient descent computes gradients using the entire training dataset. This is expensive and slow. Stochastic Gradient Descent (SGD) uses small batches of data instead—typically 32, 64, or 128 examples at a time.
The "stochastic" (random) part means each batch gives a noisy estimate of the true gradient. But averaging over many batches gives the correct direction, and computing on small batches is much faster than using the entire dataset.
In practice, when people say "SGD," they usually mean "mini-batch SGD"—computing gradients on small batches rather than single examples or the full dataset.
Why Multiple Optimizer Types?
If vanilla SGD works, why do we need other optimizers? Because SGD has limitations:
- Slow convergence on complex loss landscapes
- Oscillation in narrow valleys (moves back and forth rather than forward)
- Sensitivity to learning rate choice
Advanced optimizers like SGD with momentum address these issues by accumulating information about previous gradients. This helps accelerate training and dampen oscillations.
Creating Optimizers
Optimizers in Tofu are tied to computation graphs. When you create an optimizer, it automatically collects all trainable parameters (nodes created with tofu_graph_param) from the graph.
Basic Setup
Creating an optimizer follows this pattern:
// 1. Create graph and add parameters
tofu_graph *g = tofu_graph_create();
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
// 2. Create optimizer (automatically finds W and b)
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// 3. Use in training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
tofu_optimizer_zero_grad(opt);
// ... forward pass, compute loss ...
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
}
// 4. Cleanup (optimizer before graph)
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(weights);
tofu_tensor_free_data_too(bias);
Key points:
- Automatic parameter collection: The optimizer scans the graph and finds all PARAM nodes when created
- One optimizer per graph: Each optimizer manages parameters from a single graph
- Cleanup order matters: Always free the optimizer before the graph
Choosing a Learning Rate
Start with these defaults:
- 0.01 - Safe starting point for most problems
- 0.001 - Deep networks, complex problems
- 0.1 - Small networks, simple problems
After a few iterations, check if loss is decreasing. If not, reduce the learning rate by 10x. If loss decreases very slowly, try increasing by 2x-5x.
Memory Considerations
Different optimizers have different memory requirements:
- SGD: No extra memory (just the parameters themselves)
- SGD with momentum: One velocity buffer per parameter (doubles memory)
For large networks on memory-constrained devices, vanilla SGD may be the only option. For everything else, momentum is usually worth the extra memory.
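As a rough illustration, assuming 32-bit floats: the 784-128-10 MLP used later in this guide has 784*128 + 128 + 128*10 + 10 = 101,770 parameters, roughly 400 KB of weights. Vanilla SGD adds no optimizer state on top of that, while momentum keeps a velocity buffer of the same size, so budget roughly another 400 KB.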
SGD: Stochastic Gradient Descent
Vanilla SGD is the simplest optimizer. It updates parameters by directly subtracting the scaled gradient.
The Algorithm
For each parameter theta:
theta = theta - learning_rate * gradient
That's it. Compute the gradient, scale it by the learning rate, subtract from the parameter.
In code:
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
The optimizer applies this update rule to every parameter automatically when you call tofu_optimizer_step().
Implementation Example
Here's a complete training loop using SGD:
// Setup
tofu_graph *g = tofu_graph_create();
// Network: linear layer (input_dim=4, output_dim=3)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);
// Create SGD optimizer with learning rate 0.01
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Training loop
for (int epoch = 0; epoch < 100; epoch++) {
double epoch_loss = 0.0;
for (int batch = 0; batch < num_batches; batch++) {
// Zero gradients before forward pass
tofu_optimizer_zero_grad(opt);
// Forward pass: pred = input @ W + b
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
tofu_graph_node *pred = tofu_graph_add(g, h, b_node);
// Compute loss
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// Get loss value for logging
float loss_val;
tofu_tensor *loss_tensor = tofu_graph_get_value(loss);
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);
epoch_loss += loss_val;
// Backward pass: compute gradients
tofu_graph_backward(g, loss);
// Update parameters using gradients
tofu_optimizer_step(opt);
// Clear operations for next batch
tofu_graph_clear_ops(g);
}
printf("Epoch %d: Loss = %.6f\n", epoch, epoch_loss / num_batches);
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);
When to Use SGD
SGD works well when:
- Memory is tight: SGD has no extra memory overhead
- Loss landscape is smooth: Few local minima, well-conditioned gradients
- You have time to tune: SGD is sensitive to learning rate, so you'll need to experiment
SGD struggles when:
- Loss landscape is complex: Many local minima or saddle points
- Gradients are noisy: High variance in gradient estimates
- Convergence needs to be fast: SGD converges slower than momentum-based methods
Tuning SGD
The learning rate is the only hyperparameter for vanilla SGD. Here's how to tune it:
Start with 0.01:
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
Watch the first few iterations:
- Loss decreasing steadily: Good sign, continue training
- Loss increasing or NaN: Learning rate too high, reduce by 10x
- Loss barely changing: Learning rate too low, increase by 2x-5x
Common learning rate values:
- 0.1 - Aggressive, works for simple problems
- 0.01 - Conservative, good default
- 0.001 - Very conservative, deep networks
- 0.0001 - Fine-tuning pretrained models
Monitoring training:
for (int epoch = 0; epoch < max_epochs; epoch++) {
// ... training loop ...
if (epoch_loss < best_loss) {
best_loss = epoch_loss;
no_improvement_count = 0;
} else {
no_improvement_count++;
}
// Reduce learning rate if stuck
if (no_improvement_count > 10) {
opt->learning_rate *= 0.5;
printf("Reducing learning rate to %.6f\n", opt->learning_rate);
no_improvement_count = 0;
}
}
SGD with Momentum
Momentum helps SGD converge faster and more smoothly by accumulating a velocity term that averages gradients over time. This dampens oscillations and accelerates progress in consistent directions.
The Algorithm
Instead of directly using the current gradient, momentum maintains a velocity vector that accumulates gradients exponentially:
v = momentum * v - learning_rate * gradient
theta = theta + v
Where:
- v is the velocity (initialized to zero)
- momentum is a coefficient (typically 0.9)
- learning_rate scales the gradient contribution
- gradient is the current parameter gradient
This differs from classical momentum formulations but is mathematically equivalent. The key insight: multiply the velocity by momentum (typically 0.9), then subtract the scaled gradient and add the result to the parameter.
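As an illustration only (the real update happens inside tofu_optimizer_step()), the rule applied element-wise to one parameter buffer looks like this:
/* Illustrative momentum update for one parameter buffer (plain C arrays).
 * Not a Tofu API; shown to make the two-line rule above concrete. */
void momentum_update(float* theta, float* velocity, const float* gradient,
                     int len, float learning_rate, float momentum) {
    for (int i = 0; i < len; i++) {
        velocity[i] = momentum * velocity[i] - learning_rate * gradient[i];
        theta[i] = theta[i] + velocity[i];
    }
}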
Why Momentum Works
Think of momentum as a ball rolling downhill. When the slope consistently points in one direction, the ball accelerates (velocity builds up). When the slope changes direction, the accumulated velocity smooths out oscillations.
Without momentum (vanilla SGD):
Narrow valley: | |
Path taken: | -> <- ->| (oscillates back and forth)
| -> <- ->|
With momentum:
Narrow valley: | |
Path taken: | --> | (smooth progress forward)
| --> |
Momentum provides two benefits:
- Acceleration: Builds up speed in consistent directions
- Dampening: Reduces oscillations in directions that change frequently
Implementation Example
Creating an SGD optimizer with momentum requires one additional parameter:
// Create optimizer with learning_rate=0.01, momentum=0.9
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
The rest of the training loop is identical to vanilla SGD:
// Setup
tofu_graph *g = tofu_graph_create();
// Network: two-layer MLP
tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){784, 128}, TOFU_FLOAT);
tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){128}, TOFU_FLOAT);
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){128, 10}, TOFU_FLOAT);
tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);
tofu_graph_node *W1_node = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);
// Create optimizer with momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
// Training loop
for (int epoch = 0; epoch < 100; epoch++) {
for (int batch = 0; batch < num_batches; batch++) {
tofu_optimizer_zero_grad(opt);
// Forward pass
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
// Layer 1: h1 = relu(x @ W1 + b1)
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
h1 = tofu_graph_add(g, h1, b1_node);
h1 = tofu_graph_relu(g, h1);
// Layer 2: output = h1 @ W2 + b2
tofu_graph_node *h2 = tofu_graph_matmul(g, h1, W2_node);
h2 = tofu_graph_add(g, h2, b2_node);
// Loss
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
tofu_graph_node *loss = tofu_graph_mse_loss(g, h2, target);
// Backward and update
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
tofu_graph_clear_ops(g);
}
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W1);
tofu_tensor_free_data_too(b1);
tofu_tensor_free_data_too(W2);
tofu_tensor_free_data_too(b2);
Tuning Momentum
Momentum has two hyperparameters: learning rate and momentum coefficient.
Learning Rate: Start with the same values as vanilla SGD (0.01 is a good default). Momentum often allows slightly higher learning rates because it dampens oscillations.
Momentum Coefficient: Controls how much past gradients influence current updates.
Common values:
- 0.9 - Standard choice, works well for most problems
- 0.95 - High momentum, use for slow convergence
- 0.99 - Very high momentum, use for very deep networks
- 0.5-0.8 - Low momentum, use if training is unstable
The momentum coefficient is easier to tune than learning rate. Start with 0.9 and adjust if needed.
When to Use Momentum
Use momentum when:
- Training is slow: Momentum accelerates convergence
- Gradients are noisy: Momentum smooths out noise
- Deep networks: Momentum helps propagate gradients through many layers
- Memory is available: Momentum requires one velocity buffer per parameter
Stick with vanilla SGD when:
- Memory is very tight: Momentum doubles memory requirements
- Loss landscape is simple: Vanilla SGD may be sufficient
In practice, momentum is the default choice for most problems. The memory cost is usually worth the faster convergence.
Using Optimizers in Training
Now that you understand optimizer algorithms, let's look at the mechanics of using them in training loops.
The Training Cycle
Every training iteration follows the same core pattern, plus a cleanup step:
1. Zero gradients (tofu_optimizer_zero_grad)
2. Forward pass (build computation graph)
3. Backward pass (tofu_graph_backward)
4. Update parameters (tofu_optimizer_step)
5. Clear operations (tofu_graph_clear_ops)
This cycle repeats for every batch in every epoch.
Step-by-Step Breakdown
Step 1: Zero Gradients
Gradients accumulate by default in Tofu. If you don't zero them, they'll keep adding up across iterations, leading to incorrect updates.
tofu_optimizer_zero_grad(opt);
This clears all gradient buffers for parameters tracked by the optimizer. Always call this before the forward pass.
Step 2: Forward Pass
Build the computation graph by adding operations. Each operation automatically computes its value:
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *h = tofu_graph_matmul(g, x, W);
tofu_graph_node *pred = tofu_graph_add(g, h, b);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
At this point, loss contains the computed loss value, but gradients haven't been computed yet.
Step 3: Backward Pass
Compute gradients by calling backward on the loss node:
tofu_graph_backward(g, loss);
This triggers reverse-mode automatic differentiation. Tofu walks the graph backwards, computing gradients for every parameter using the chain rule. After this call, every parameter has its gradient stored in node->grad.
Step 4: Update Parameters
Apply the optimizer's update rule to adjust parameters:
tofu_optimizer_step(opt);
This uses the computed gradients to update parameters. For SGD, it subtracts learning_rate * gradient from each parameter. For momentum, it updates velocity buffers and then parameters.
Step 5: Clear Operations
Before the next iteration, clear operation nodes from the graph while preserving parameters:
tofu_graph_clear_ops(g);
This frees memory used by intermediate computations (matmul results, activations, etc.) but keeps parameters and their gradients intact.
Complete Training Loop
Here's a full training loop with all the pieces:
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
for (int epoch = 0; epoch < num_epochs; epoch++) {
double total_loss = 0.0;
for (int batch = 0; batch < num_batches; batch++) {
// 1. Zero gradients
tofu_optimizer_zero_grad(opt);
// 2. Forward pass
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
tofu_graph_node *pred = forward_pass(g, x); // Your model
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// Track loss for logging
float loss_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
total_loss += loss_val;
// 3. Backward pass
tofu_graph_backward(g, loss);
// 4. Update parameters
tofu_optimizer_step(opt);
// 5. Clear operations
tofu_graph_clear_ops(g);
}
// Log epoch statistics
printf("Epoch %d: Avg Loss = %.6f\n", epoch, total_loss / num_batches);
}
Common Mistakes
Mistake 1: Forgetting to zero gradients
// WRONG: Gradients accumulate indefinitely
for (int i = 0; i < iterations; i++) {
// No zero_grad call!
// ... forward, backward, step ...
}
This causes gradients to grow without bound. Updates become incorrect after the first iteration.
Correct:
for (int i = 0; i < iterations; i++) {
tofu_optimizer_zero_grad(opt); // Clear old gradients
// ... forward, backward, step ...
}
Mistake 2: Calling step before backward
// WRONG: No gradients computed yet!
tofu_optimizer_step(opt);
tofu_graph_backward(g, loss);
The optimizer needs gradients to update parameters. Always call backward before step.
Correct:
tofu_graph_backward(g, loss); // Compute gradients
tofu_optimizer_step(opt); // Use gradients to update
Mistake 3: Not clearing operations
// WRONG: Memory grows indefinitely
for (int batch = 0; batch < num_batches; batch++) {
// ... training ...
// No clear_ops call!
}
Each batch adds nodes to the graph. Without clearing, memory usage grows until the program crashes.
Correct:
for (int batch = 0; batch < num_batches; batch++) {
// ... training ...
tofu_graph_clear_ops(g); // Clear after each batch
}
Monitoring Training
Track key metrics to understand training progress:
for (int epoch = 0; epoch < num_epochs; epoch++) {
double epoch_loss = 0.0;
int num_correct = 0;
for (int batch = 0; batch < num_batches; batch++) {
// ... training loop ...
// Track loss
float loss_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
epoch_loss += loss_val;
// Track accuracy (for classification)
num_correct += count_correct_predictions(pred, target);
}
double avg_loss = epoch_loss / num_batches;
double accuracy = (double)num_correct / (num_batches * batch_size);
printf("Epoch %d: Loss = %.6f, Accuracy = %.2f%%\n",
epoch, avg_loss, accuracy * 100);
}
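The snippet above calls count_correct_predictions, a helper this guide does not define. A minimal sketch for one-hot targets might look like the following; it assumes row-major [batch, num_classes] tensors and that you pass the value tensors (e.g., from tofu_graph_get_value) along with the batch and class counts.
// Hypothetical helper: counts samples whose argmax prediction matches the
// argmax of a one-hot target. Assumes row-major [batch, num_classes] layout.
int count_correct_predictions_sketch(tofu_tensor *pred, tofu_tensor *target,
                                     int batch, int num_classes) {
    int correct = 0;
    for (int i = 0; i < batch; i++) {
        int pred_class = 0, true_class = 0;
        float best_pred, best_true;
        TOFU_TENSOR_DATA_TO(pred, i * num_classes, best_pred, TOFU_FLOAT);
        TOFU_TENSOR_DATA_TO(target, i * num_classes, best_true, TOFU_FLOAT);
        for (int j = 1; j < num_classes; j++) {
            float p, t;
            TOFU_TENSOR_DATA_TO(pred, i * num_classes + j, p, TOFU_FLOAT);
            TOFU_TENSOR_DATA_TO(target, i * num_classes + j, t, TOFU_FLOAT);
            if (p > best_pred) { best_pred = p; pred_class = j; }
            if (t > best_true) { best_true = t; true_class = j; }
        }
        if (pred_class == true_class) correct++;
    }
    return correct;
}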
Learning Rate Strategies
The learning rate often needs adjustment during training. Starting with a fixed rate works for simple problems, but complex models benefit from learning rate schedules.
Fixed Learning Rate
The simplest strategy: use the same learning rate throughout training.
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Training loop uses 0.01 for all epochs
for (int epoch = 0; epoch < 100; epoch++) {
// ... training ...
}
This works well when:
- The problem is simple
- You've found a good learning rate through experimentation
- Training converges in a reasonable number of epochs
Step Decay
Reduce the learning rate by a fixed factor every N epochs:
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1);
for (int epoch = 0; epoch < 100; epoch++) {
// Reduce learning rate by 10x every 30 epochs
if (epoch % 30 == 0 && epoch > 0) {
opt->learning_rate *= 0.1;
printf("Epoch %d: Learning rate reduced to %.6f\n",
epoch, opt->learning_rate);
}
// ... training loop ...
}
Common schedules:
- Divide by 10 every 30 epochs (0.1 -> 0.01 -> 0.001)
- Divide by 2 every 10 epochs (0.1 -> 0.05 -> 0.025)
Step decay is simple and effective for many problems.
Exponential Decay
Gradually reduce the learning rate every epoch:
double initial_lr = 0.1;
double decay_rate = 0.95;
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, initial_lr);
for (int epoch = 0; epoch < 100; epoch++) {
// Update learning rate
opt->learning_rate = initial_lr * pow(decay_rate, epoch);
if (epoch % 10 == 0) {
printf("Epoch %d: Learning rate = %.6f\n", epoch, opt->learning_rate);
}
// ... training loop ...
}
This provides smooth, gradual decay. The decay rate controls how quickly the learning rate decreases (0.95 is typical).
Cosine Annealing
Reduce the learning rate following a cosine curve:
#include <math.h>
double initial_lr = 0.1;
double min_lr = 0.001;
int num_epochs = 100;
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, initial_lr);
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Cosine annealing formula
double progress = (double)epoch / num_epochs;
opt->learning_rate = min_lr + (initial_lr - min_lr) *
(1.0 + cos(M_PI * progress)) / 2.0;
// ... training loop ...
}
Cosine annealing provides smooth decay that starts fast and slows down near the end.
Learning Rate Warmup
For very high initial learning rates, gradually increase from a small value:
double target_lr = 0.1;
int warmup_epochs = 5;
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, target_lr / warmup_epochs);
for (int epoch = 0; epoch < 100; epoch++) {
// Warmup phase
if (epoch < warmup_epochs) {
opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
}
// ... training loop ...
}
Warmup prevents instability in the first few epochs when using aggressive learning rates.
Adaptive Scheduling
Reduce the learning rate when progress stalls:
double best_loss = INFINITY;
int patience = 5;
int no_improvement_count = 0;
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
for (int epoch = 0; epoch < 100; epoch++) {
// ... training loop ...
double epoch_loss = compute_epoch_loss();
// Track progress
if (epoch_loss < best_loss) {
best_loss = epoch_loss;
no_improvement_count = 0;
} else {
no_improvement_count++;
}
// Reduce learning rate if stuck
if (no_improvement_count >= patience) {
opt->learning_rate *= 0.5;
printf("Reducing learning rate to %.6f\n", opt->learning_rate);
no_improvement_count = 0;
}
}
This adapts to training dynamics automatically, reducing the learning rate only when needed.
Choosing a Strategy
Start simple: Use a fixed learning rate first. Only add scheduling if training plateaus.
Step decay: Good default for most problems. Easy to understand and implement.
Exponential/Cosine: Use for long training runs (100+ epochs) where smooth decay is beneficial.
Adaptive: Best when you're not sure how many epochs you need or when progress is unpredictable.
Warmup: Use when starting with very high learning rates (0.1+) to prevent early instability.
Choosing an Optimizer
With multiple optimizers available, how do you choose? Here's a practical decision guide.
Start with SGD + Momentum
For most problems, SGD with momentum (0.01 learning rate, 0.9 momentum) is the best starting point:
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
This provides:
- Good convergence speed
- Reasonable memory overhead
- Robustness to hyperparameter choices
Decision Tree
Is memory extremely tight? (< 2x parameter memory available)
├─ YES: Use vanilla SGD
└─ NO: Continue
Is the problem very simple? (linear model, small dataset)
├─ YES: Use vanilla SGD (momentum won't help much)
└─ NO: Continue
Is the network deep? (> 5 layers)
├─ YES: Use SGD with momentum 0.9 or higher
└─ NO: Use SGD with momentum 0.9
Comparison Table
| Optimizer | Memory | Convergence | Tuning | Best For |
|---|---|---|---|---|
| SGD | Minimal | Slower | Difficult | Memory-constrained, simple problems |
| SGD+Momentum | 2x params | Faster | Moderate | General purpose, deep networks |
Network Depth Considerations
Shallow networks (1-3 layers):
- Vanilla SGD often sufficient
- Momentum helps but not essential
Medium networks (4-10 layers):
- Momentum recommended
- Use momentum 0.9
Deep networks (10+ layers):
- Momentum essential
- Use momentum 0.95-0.99
Problem Type Recommendations
Regression (MSE loss):
// Start here
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
Classification (cross-entropy loss):
// May need higher learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.05, 0.9);
Fine-tuning pretrained models:
// Very small learning rate to preserve learned features
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.0001);
When in Doubt
Default configuration for new problems:
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
This works well for most cases. Adjust based on training behavior:
- Loss diverges: Reduce learning rate by 10x
- Convergence too slow: Increase learning rate by 2-5x
- Still slow: Increase momentum to 0.95
Troubleshooting
Training neural networks is often an iterative process of diagnosing and fixing issues. Here are common problems and their solutions.
Loss is NaN or Infinite
Symptoms: Loss becomes NaN or infinity after a few iterations.
Causes:
- Learning rate too high
- Gradient explosion (very large gradients)
- Numerical instability in loss function
Solutions:
Reduce learning rate dramatically:
// If using 0.01, try 0.001
opt->learning_rate = 0.001;
Check gradients for extreme values:
tofu_graph_backward(g, loss);
// Before optimizer step, check gradient magnitudes
for (int i = 0; i < opt->num_params; i++) {
tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
double max_grad = find_max_abs_value(grad);
if (max_grad > 1000.0) {
printf("Warning: Large gradient detected: %.2f\n", max_grad);
}
}
tofu_optimizer_step(opt);
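find_max_abs_value above is not part of Tofu's API; it stands in for a small helper you would write yourself. One possible sketch, using the same element-access macro as the rest of this guide (requires <math.h> for fabs):
// Hypothetical helper: largest absolute value in a tensor (e.g., a gradient).
double find_max_abs_value(tofu_tensor *t) {
    double max_abs = 0.0;
    for (int i = 0; i < t->len; i++) {
        float val;
        TOFU_TENSOR_DATA_TO(t, i, val, TOFU_FLOAT);
        if (fabs(val) > max_abs) max_abs = fabs(val);
    }
    return max_abs;
}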
Implement gradient clipping:
void clip_gradients(tofu_optimizer *opt, double max_norm) {
for (int i = 0; i < opt->num_params; i++) {
tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
if (!grad) continue;
// Compute L2 norm
double norm = 0.0;
for (int j = 0; j < grad->len; j++) {
float val;
TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
norm += val * val;
}
norm = sqrt(norm);
// Clip if too large
if (norm > max_norm) {
double scale = max_norm / norm;
for (int j = 0; j < grad->len; j++) {
float val;
TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
val *= scale;
TOFU_TENSOR_DATA_FROM(grad, j, val, TOFU_FLOAT);
}
}
}
}
// Use before optimizer step
tofu_graph_backward(g, loss);
clip_gradients(opt, 1.0); // Clip to max norm of 1.0
tofu_optimizer_step(opt);
Loss Not Decreasing
Symptoms: Loss stays constant or decreases very slowly.
Causes:
- Learning rate too low
- Model stuck in poor initialization
- Gradient vanishing
- Wrong loss function or labels
Solutions:
Increase learning rate:
opt->learning_rate *= 10.0; // Try 10x higher
Check if gradients are flowing:
tofu_graph_backward(g, loss);
// Check if gradients are non-zero
for (int i = 0; i < opt->num_params; i++) {
tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
double sum_abs = 0.0;
for (int j = 0; j < grad->len; j++) {
float val;
TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
sum_abs += fabs(val);
}
double mean_abs = sum_abs / grad->len;
if (mean_abs < 1e-7) {
printf("Warning: Very small gradients (%.2e) for parameter %d\n",
mean_abs, i);
}
}
Try momentum if using vanilla SGD:
// Replace vanilla SGD with momentum
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
Loss Oscillates
Symptoms: Loss goes up and down rather than steadily decreasing.
Causes:
- Learning rate too high
- Batch size too small (noisy gradients)
- Wrong momentum setting
Solutions:
Reduce learning rate:
opt->learning_rate *= 0.5; // Try half the current rate
Use or increase momentum:
// If using vanilla SGD, add momentum
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
// If already using momentum, increase it
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.95);
Training Slows Down Over Time
Symptoms: Loss decreases quickly at first, then stalls.
Causes:
- Learning rate too high for fine-tuning
- Converging to local minimum
- Need learning rate schedule
Solutions:
Implement step decay:
for (int epoch = 0; epoch < 100; epoch++) {
// Reduce learning rate when progress slows
if (epoch == 30 || epoch == 60 || epoch == 90) {
opt->learning_rate *= 0.1;
printf("Reduced learning rate to %.6f\n", opt->learning_rate);
}
// ... training loop ...
}
Use adaptive scheduling:
double best_loss = INFINITY;
int no_improvement = 0;
for (int epoch = 0; epoch < 100; epoch++) {
// ... training ...
if (epoch_loss < best_loss) {
best_loss = epoch_loss;
no_improvement = 0;
} else {
no_improvement++;
}
if (no_improvement > 5) {
opt->learning_rate *= 0.5;
no_improvement = 0;
}
}
Memory Issues
Symptoms: Program crashes with allocation errors or runs out of memory.
Causes:
- Not clearing operations between batches
- Momentum optimizer on large networks
- Accumulating tensors unintentionally
Solutions:
Always clear operations:
for (int batch = 0; batch < num_batches; batch++) {
// ... training ...
tofu_graph_clear_ops(g); // Essential for memory management
}
Use vanilla SGD if momentum exhausts memory:
// Switch from momentum to vanilla SGD
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_create(g, 0.01);
Best Practices
Here are guidelines to help you train models effectively and avoid common pitfalls.
Always Zero Gradients
Make this the first line of every training iteration:
for (int batch = 0; batch < num_batches; batch++) {
tofu_optimizer_zero_grad(opt); // Never forget this!
// ... rest of training loop ...
}
Without this, gradients accumulate across batches, leading to incorrect updates.
Monitor Multiple Metrics
Don't rely on loss alone. Track additional metrics:
for (int epoch = 0; epoch < num_epochs; epoch++) {
double total_loss = 0.0;
double total_l2_norm = 0.0;
double max_grad = 0.0;
for (int batch = 0; batch < num_batches; batch++) {
// ... training ...
// Track loss
float loss_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
total_loss += loss_val;
// Track parameter norm
total_l2_norm += compute_parameter_norm(opt);
// Track max gradient
max_grad = fmax(max_grad, compute_max_gradient(opt));
}
printf("Epoch %d: Loss=%.4f, Param_Norm=%.4f, Max_Grad=%.4f\n",
epoch, total_loss / num_batches,
total_l2_norm / num_batches, max_grad);
}
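compute_parameter_norm and compute_max_gradient are stand-ins for helpers you would define yourself, not Tofu API calls. Possible sketches, reusing the parameter and gradient accessors shown earlier in this guide (requires <math.h> for sqrt and fabs):
// Hypothetical helper: L2 norm over all parameters tracked by the optimizer.
double compute_parameter_norm(tofu_optimizer *opt) {
    double sum_sq = 0.0;
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
        for (int j = 0; j < param->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(param, j, val, TOFU_FLOAT);
            sum_sq += (double)val * val;
        }
    }
    return sqrt(sum_sq);
}
// Hypothetical helper: largest absolute gradient entry across all parameters.
double compute_max_gradient(tofu_optimizer *opt) {
    double max_abs = 0.0;
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
        if (!grad) continue;
        for (int j = 0; j < grad->len; j++) {
            float val;
            TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
            if (fabs(val) > max_abs) max_abs = fabs(val);
        }
    }
    return max_abs;
}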
Save Checkpoints
Periodically save model parameters during training:
void save_parameters(tofu_optimizer *opt, const char *filename) {
FILE *f = fopen(filename, "wb");
if (!f) return;
for (int i = 0; i < opt->num_params; i++) {
tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
fwrite(param->data, 1, param->len * sizeof(float), f);
}
fclose(f);
}
// Use during training
for (int epoch = 0; epoch < 100; epoch++) {
// ... training loop ...
// Save every 10 epochs
if (epoch % 10 == 0) {
char filename[256];
snprintf(filename, sizeof(filename), "model_epoch_%d.bin", epoch);
save_parameters(opt, filename);
}
}
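To resume from a checkpoint you also need a loader. Since the save format above is just raw floats written in parameter order, a matching loader reads them back in the same order. This is a sketch under the same assumptions, not a Tofu API function:
// Hypothetical counterpart to save_parameters: reads raw floats back into
// the optimizer's parameters, in the same order they were written.
void load_parameters(tofu_optimizer *opt, const char *filename) {
    FILE *f = fopen(filename, "rb");
    if (!f) return;
    for (int i = 0; i < opt->num_params; i++) {
        tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
        size_t expected = param->len * sizeof(float);
        if (fread(param->data, 1, expected, f) != expected) {
            fprintf(stderr, "Checkpoint %s is truncated\n", filename);
            break;
        }
    }
    fclose(f);
}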
Start Conservative
Begin with conservative hyperparameters and increase aggressiveness only if needed:
// Conservative defaults
double learning_rate = 0.01; // Not too high
double momentum = 0.9; // Standard momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, learning_rate, momentum);
It's easier to increase the learning rate if training is too slow than to recover from instability caused by too-high rates.
Test on Small Data First
Before training on the full dataset, verify your setup on a small subset:
// Test with 10 batches first
int test_batches = 10;
for (int epoch = 0; epoch < 5; epoch++) {
for (int batch = 0; batch < test_batches; batch++) {
// ... training loop ...
}
}
// If loss decreases on small data, scale to full dataset
This quickly reveals issues with the model, loss function, or optimizer configuration.
Use Learning Rate Warmup for High Rates
When using aggressive learning rates (> 0.05), warm up gradually:
double target_lr = 0.1;
int warmup_epochs = 5;
for (int epoch = 0; epoch < 100; epoch++) {
if (epoch < warmup_epochs) {
opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
} else {
opt->learning_rate = target_lr;
}
// ... training loop ...
}
Document Your Configuration
Keep track of hyperparameters that work:
// Document successful configurations
printf("Configuration:\n");
printf(" Optimizer: SGD with Momentum\n");
printf(" Learning Rate: %.6f\n", opt->learning_rate);
printf(" Momentum: %.2f\n", 0.9);
printf(" Batch Size: %d\n", batch_size);
printf(" Schedule: Step decay by 0.1 every 30 epochs\n");
This helps when you need to replicate results or adjust for similar problems.
Complete Example
Here's a complete training example that demonstrates all the concepts from this guide.
Problem: Binary Classification
We'll train a two-layer neural network to classify binary data.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu.h"
// Forward declarations for the synthetic-data helpers defined after main
tofu_tensor* generate_batch_data(int batch_size, int input_dim);
tofu_tensor* generate_batch_labels(int batch_size, int output_dim);
// Network architecture: input(10) -> hidden(20) -> output(1)
int main(void) {
// Hyperparameters
const int input_dim = 10;
const int hidden_dim = 20;
const int output_dim = 1;
const int batch_size = 32;
const int num_batches = 100;
const int num_epochs = 50;
const double learning_rate = 0.01;
const double momentum = 0.9;
// Create graph
tofu_graph *g = tofu_graph_create();
// Initialize parameters (zeros keep this example short; in practice use
// random initialization so the hidden units learn different features)
tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){input_dim, hidden_dim}, TOFU_FLOAT);
tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){hidden_dim}, TOFU_FLOAT);
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){hidden_dim, output_dim}, TOFU_FLOAT);
tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){output_dim}, TOFU_FLOAT);
// Add parameters to graph
tofu_graph_node *W1_node = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);
// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, learning_rate, momentum);
printf("Training Configuration:\n");
printf(" Architecture: %d -> %d -> %d\n", input_dim, hidden_dim, output_dim);
printf(" Optimizer: SGD with Momentum\n");
printf(" Learning Rate: %.4f\n", learning_rate);
printf(" Momentum: %.2f\n", momentum);
printf(" Batch Size: %d\n", batch_size);
printf(" Epochs: %d\n\n", num_epochs);
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
double epoch_loss = 0.0;
// Learning rate schedule: reduce by 0.1 every 20 epochs
if (epoch % 20 == 0 && epoch > 0) {
opt->learning_rate *= 0.1;
printf("Reduced learning rate to %.6f\n", opt->learning_rate);
}
for (int batch = 0; batch < num_batches; batch++) {
// 1. Zero gradients
tofu_optimizer_zero_grad(opt);
// 2. Generate synthetic batch data (normally loaded from dataset)
tofu_tensor *batch_x = generate_batch_data(batch_size, input_dim);
tofu_tensor *batch_y = generate_batch_labels(batch_size, output_dim);
// 3. Forward pass
tofu_graph_node *x = tofu_graph_input(g, batch_x);
// Layer 1: h = relu(x @ W1 + b1)
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
h1 = tofu_graph_add(g, h1, b1_node);
h1 = tofu_graph_relu(g, h1);
// Layer 2: pred = h @ W2 + b2
tofu_graph_node *pred = tofu_graph_matmul(g, h1, W2_node);
pred = tofu_graph_add(g, pred, b2_node);
// Loss
tofu_graph_node *target = tofu_graph_input(g, batch_y);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// Track loss
float loss_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
epoch_loss += loss_val;
// 4. Backward pass
tofu_graph_backward(g, loss);
// 5. Update parameters
tofu_optimizer_step(opt);
// 6. Clear operations
tofu_graph_clear_ops(g);
// Free batch data
tofu_tensor_free_data_too(batch_x);
tofu_tensor_free_data_too(batch_y);
}
// Log progress
double avg_loss = epoch_loss / num_batches;
printf("Epoch %2d: Loss = %.6f\n", epoch, avg_loss);
// Early stopping
if (avg_loss < 0.001) {
printf("Converged! Stopping early.\n");
break;
}
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W1);
tofu_tensor_free_data_too(b1);
tofu_tensor_free_data_too(W2);
tofu_tensor_free_data_too(b2);
printf("\nTraining complete!\n");
return 0;
}
// Helper function to generate synthetic data (replace with real data loading)
tofu_tensor* generate_batch_data(int batch_size, int input_dim) {
float *data = (float*)malloc(batch_size * input_dim * sizeof(float));
for (int i = 0; i < batch_size * input_dim; i++) {
data[i] = ((float)rand() / RAND_MAX) * 2.0f - 1.0f; // Random in [-1, 1]
}
tofu_tensor *t = tofu_tensor_create_with_values(data, 2,
(int[]){batch_size, input_dim});
free(data);
return t;
}
tofu_tensor* generate_batch_labels(int batch_size, int output_dim) {
float *data = (float*)malloc(batch_size * output_dim * sizeof(float));
for (int i = 0; i < batch_size * output_dim; i++) {
data[i] = ((float)rand() / RAND_MAX) > 0.5f ? 1.0f : 0.0f;
}
tofu_tensor *t = tofu_tensor_create_with_values(data, 2,
(int[]){batch_size, output_dim});
free(data);
return t;
}
This example demonstrates:
- Creating a computation graph with parameters
- Building a multi-layer network
- Setting up an optimizer with momentum
- Implementing a learning rate schedule
- Proper training loop structure
- Monitoring loss over time
- Correct cleanup order
Adapt this template for your specific problem by:
- Changing network architecture (layer sizes, activations)
- Loading real data instead of synthetic batches
- Adding validation/test evaluation
- Saving model checkpoints
- Implementing gradient clipping if needed
Summary
Optimizers are the engine of neural network training. They take computed gradients and update parameters to minimize loss. Here are the key takeaways:
Core Concepts:
- Gradient descent follows gradients downhill toward lower loss
- Learning rate controls step size (most important hyperparameter)
- Momentum accumulates velocity to accelerate convergence and dampen oscillations
Choosing an Optimizer:
- Start with SGD + momentum (learning_rate=0.01, momentum=0.9)
- Use vanilla SGD only for memory-constrained or very simple problems
- Deep networks benefit from higher momentum (0.95-0.99)
Using Optimizers:
- Always follow the pattern: zero_grad, forward, backward, step
- Clear operations between batches to manage memory
- Monitor training metrics (loss, gradients, parameter norms)
Tuning:
- Start with learning_rate=0.01
- Increase if training is too slow, decrease if unstable
- Use learning rate schedules for long training runs
- Implement gradient clipping for unstable gradients
Troubleshooting:
- NaN loss: Reduce learning rate, clip gradients
- No progress: Increase learning rate, add momentum
- Oscillations: Reduce learning rate, increase momentum
- Slow convergence: Use learning rate schedule, higher momentum
With these principles, you can train neural networks effectively and debug issues when they arise. Experiment with different configurations, monitor training carefully, and adjust based on what you observe. The best optimizer and hyperparameters depend on your specific problem, so be prepared to iterate.
Loss Functions
Loss functions are the core mechanism that guides neural network training. They measure how far your model's predictions are from the true values, producing a single scalar number that quantifies the error. During training, the optimizer uses gradients of this loss to adjust model parameters and improve predictions.
This guide explains how loss functions work, when to use each type, and how to integrate them into your training loops. You'll learn to choose the right loss function for your task and interpret loss values during training.
Table of Contents
- Introduction
- Loss Function Fundamentals
- Mean Squared Error
- Cross-Entropy Loss
- Choosing a Loss Function
- Loss in Training Loop
- Understanding Loss Values
- Advanced Topics
- Complete Examples
- Best Practices
Introduction
Every machine learning model needs a way to evaluate how well it's performing. A loss function (also called an objective function or cost function) provides this evaluation by computing a numerical score representing prediction error.
During training:
- The model makes predictions on input data
- The loss function compares predictions to true target values
- The result is a single number (scalar) representing total error
- Gradients of this loss tell us how to adjust weights
- The optimizer updates weights to reduce the loss
The choice of loss function depends on your task type (regression vs classification) and the structure of your data. Tofu provides two fundamental loss functions that cover most use cases:
- Mean Squared Error (MSE): For regression tasks where you predict continuous values
- Cross-Entropy Loss: For classification tasks where you predict discrete classes
Let's explore the fundamental properties all loss functions must have, then dive into each type.
Loss Function Fundamentals
To work correctly with gradient-based optimization, loss functions must satisfy three key requirements.
1. Objective Function
A loss function defines the optimization objective—the quantity we want to minimize during training. Lower loss means better predictions. The training process iteratively adjusts model parameters to find weights that minimize this function.
Think of it like hiking down a mountain in fog. The loss value tells you your current altitude, and the gradient tells you which direction is downhill. Your goal is to reach the lowest point (minimize loss).
2. Scalar Output
Loss functions must return a single number (scalar), not a vector or matrix. This scalar summarizes all prediction errors across all samples and features into one value.
Why scalar? Because optimization algorithms need a single objective to minimize. You can't simultaneously minimize multiple conflicting objectives without combining them into one number.
Example shapes:
Predictions: [batch_size, features] e.g., [32, 10]
Targets: [batch_size, features] e.g., [32, 10]
Loss: [1] (scalar)
The loss computation typically:
- Computes per-element errors
- Sums or averages across all elements
- Returns a single scalar value
3. Differentiable
Loss functions must be differentiable (smooth, with computable gradients). The gradient tells us how loss changes when we adjust each parameter—it's the compass that guides optimization.
Non-differentiable functions (like step functions or absolute value at zero) create problems for gradient-based optimizers. They can't compute meaningful gradients, so training fails or converges slowly.
Mathematically, we need:
∂L/∂w (gradient of loss L with respect to each weight w)
Tofu's automatic differentiation system computes these gradients automatically through backpropagation, so you don't need to derive formulas manually.
Loss Function Workflow
Here's how a loss function fits into one training iteration:
1. Forward Pass:
Input → Model → Predictions
2. Loss Computation:
Loss = loss_function(Predictions, Targets)
3. Backward Pass:
Compute ∂Loss/∂weights for all parameters
4. Parameter Update:
weights = weights - learning_rate * ∂Loss/∂weights
5. Repeat until loss is minimized
Now let's examine specific loss functions and when to use them.
Mean Squared Error
Mean Squared Error (MSE) is the most common loss function for regression tasks—problems where you predict continuous numerical values rather than discrete classes.
Mathematical Formula
MSE computes the average squared difference between predictions and targets:
MSE = (1/n) * Σ(prediction - target)²
Where:
- n = number of elements (batch_size × features)
- Σ = sum over all elements
The squaring operation ensures:
- Errors are always positive (negative errors don't cancel positive ones)
- Large errors are penalized more heavily than small errors
- The function is smooth and differentiable everywhere
Implementation in Tofu
Use tofu_graph_mse_loss() to add MSE loss to your computation graph:
tofu_graph_node* tofu_graph_mse_loss(tofu_graph* g,
tofu_graph_node* pred,
tofu_graph_node* target);
Parameters:
- g: Computation graph
- pred: Model predictions (any shape)
- target: True target values (must match pred shape)
Returns: Scalar loss node (shape [1])
Requirements:
- pred and target must have identical shapes
- Both must be non-NULL
When to Use MSE
MSE is ideal for regression problems:
Perfect use cases:
- Predicting house prices (continuous dollar values)
- Estimating temperature (continuous degrees)
- Forecasting stock prices
- Predicting ages, distances, or other continuous quantities
- Image denoising (pixel value reconstruction)
Why it works:
- Treats all dimensions equally
- Penalizes large errors more than small ones (squared term)
- Has nice mathematical properties (convex, smooth gradients)
- Easy to interpret (units are squared target units)
When NOT to use:
- Classification tasks (use cross-entropy instead)
- When outliers are common (MSE heavily penalizes outliers)
- When you care about percentage error rather than absolute error
Practical Example
Here's a complete regression example predicting house prices:
// Setup: Simple linear regression y = x @ W + b
tofu_graph *g = tofu_graph_create();
// Training data: 4 samples with 2 features each
float input_data[] = {
1.0f, 2.0f, // Sample 1: [sqft in 1000s = 1.0, bedrooms = 2]
2.0f, 3.0f, // Sample 2: [sqft in 1000s = 2.0, bedrooms = 3]
3.0f, 4.0f, // Sample 3: [sqft in 1000s = 3.0, bedrooms = 4]
4.0f, 5.0f // Sample 4: [sqft in 1000s = 4.0, bedrooms = 5]
};
// Target prices (in thousands of dollars)
float target_data[] = {
150.0f, // $150k
250.0f, // $250k
350.0f, // $350k
450.0f // $450k
};
// Create tensors
tofu_tensor *x_tensor = tofu_tensor_create(
input_data, 2, (int[]){4, 2}, TOFU_FLOAT);
tofu_tensor *y_tensor = tofu_tensor_create(
target_data, 2, (int[]){4, 1}, TOFU_FLOAT);
// Model parameters (weights and bias)
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){2, 1}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){1}, TOFU_FLOAT);
// Build computation graph
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
// Forward pass: prediction = x @ W + b
tofu_graph_node *matmul_result = tofu_graph_matmul(g, x, W);
tofu_graph_node *prediction = tofu_graph_add(g, matmul_result, b);
// Target
tofu_graph_node *target = tofu_graph_input(g, y_tensor);
// Compute MSE loss
tofu_graph_node *loss = tofu_graph_mse_loss(g, prediction, target);
// Get loss value
tofu_tensor *loss_value = tofu_graph_get_value(loss);
float loss_scalar;
TOFU_TENSOR_DATA_TO(loss_value, 0, loss_scalar, TOFU_FLOAT);
printf("MSE Loss: %.6f\n", loss_scalar);
// Backward pass to compute gradients
tofu_graph_backward(g, loss);
// Now W->grad and b->grad contain gradients for parameter updates
Understanding MSE Values
MSE values depend on the scale of your target values:
Small targets (e.g., normalized to [0, 1]):
- Good MSE: < 0.01
- Acceptable: 0.01 - 0.1
- Poor: > 0.1
Large targets (e.g., house prices in dollars):
- MSE = 10,000 means average error is √10,000 = $100
- MSE = 1,000 means average error is √1,000 ≈ $32
- MSE = 100 means average error is √100 = $10
Tip: Take the square root of MSE to get Root Mean Squared Error (RMSE), which has the same units as your target variable and is easier to interpret.
Gradient Behavior
The gradient of MSE with respect to predictions is:
∂MSE/∂pred = (2/n) * (pred - target)
Key properties:
- Gradient magnitude is proportional to error size
- Large errors produce large gradients (faster learning)
- Small errors produce small gradients (slower learning)
- Can cause exploding gradients if predictions are very wrong
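To make the formulas concrete, here is a tiny plain-C computation of MSE and its gradient for a handful of values. It mirrors the math above and is independent of Tofu's graph API.
// Plain-C illustration of MSE and its gradient (matches the formulas above).
#include <stdio.h>
int main(void) {
    float pred[]   = {2.5f, 0.0f, 2.1f};
    float target[] = {3.0f, -0.5f, 2.0f};
    int n = 3;
    float mse = 0.0f;
    for (int i = 0; i < n; i++) {
        float err = pred[i] - target[i];
        mse += err * err;
    }
    mse /= n;
    printf("MSE = %.4f\n", mse);  // (0.25 + 0.25 + 0.01) / 3 = 0.17
    // Gradient of MSE with respect to each prediction: (2/n) * (pred - target)
    for (int i = 0; i < n; i++) {
        float g = (2.0f / n) * (pred[i] - target[i]);
        printf("dMSE/dpred[%d] = %.4f\n", i, g);
    }
    return 0;
}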
Cross-Entropy Loss
Cross-Entropy Loss (also called log loss) is the standard loss function for classification tasks—problems where you assign inputs to discrete categories.
Mathematical Formula
Cross-entropy measures the difference between predicted probability distributions and true labels:
CE = -(1/n) * Σ(target * log(prediction))
Where:
- n = batch_size (number of samples; the sum runs over all samples and classes, and with one-hot targets only the true class contributes per sample)
- target = one-hot encoded true class (or class probabilities)
- prediction = softmax probabilities (sum to 1)
- log = natural logarithm
The formula rewards correct classifications (low loss) and heavily penalizes confident wrong predictions (high loss).
Why Cross-Entropy for Classification?
Cross-entropy has special properties that make it ideal for classification:
- Probabilistic interpretation: It measures the "surprise" of predictions given the true distribution
- Strong gradients: Even for small probability errors, gradients remain strong enough to drive learning
- Numerical stability: Works well with softmax activation (more on this below)
- Theoretical foundation: Derived from maximum likelihood estimation in statistics
MSE doesn't work well for classification because:
- It treats class probabilities as arbitrary numbers, ignoring their sum-to-one constraint
- Gradients vanish when the model is confident but wrong
- No probabilistic interpretation
Softmax and Cross-Entropy Connection
Cross-entropy is almost always used with softmax activation on the final layer. Here's why:
Softmax converts raw scores (logits) into probabilities:
softmax(x_i) = exp(x_i) / Σ(exp(x_j))
Properties:
- All outputs are in range (0, 1)
- Outputs sum to 1 (valid probability distribution)
- Highlights the maximum value (turns scores into confident predictions)
Together, softmax + cross-entropy creates a powerful combination:
- Softmax outputs represent class probabilities
- Cross-entropy compares these probabilities to true labels
- Gradients flow efficiently even when predictions are wrong
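For intuition, here is the softmax formula written out in plain C over one row of logits. The max-subtraction step is a standard numerical-stability trick added here as an assumption; in Tofu code, use tofu_graph_softmax rather than rolling your own.
// Plain-C softmax over a single row of logits (for intuition only; use
// tofu_graph_softmax in real Tofu code). Subtracting the row maximum before
// exp() avoids overflow without changing the result.
#include <math.h>
void softmax_row(const float *logits, float *probs, int num_classes) {
    float max_logit = logits[0];
    for (int j = 1; j < num_classes; j++)
        if (logits[j] > max_logit) max_logit = logits[j];
    float sum = 0.0f;
    for (int j = 0; j < num_classes; j++) {
        probs[j] = expf(logits[j] - max_logit);
        sum += probs[j];
    }
    for (int j = 0; j < num_classes; j++)
        probs[j] /= sum;  // outputs lie in (0, 1) and sum to 1
}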
Implementation in Tofu
Use tofu_graph_ce_loss() with softmax probabilities:
tofu_graph_node* tofu_graph_ce_loss(tofu_graph* g,
tofu_graph_node* pred,
tofu_graph_node* target);
Parameters:
- g: Computation graph
- pred: Predicted probabilities from softmax (shape: [batch, num_classes])
- target: One-hot encoded true labels (shape: [batch, num_classes])
Returns: Scalar loss node (shape [1])
Requirements:
- pred must be softmax probabilities (values in [0, 1], sum to 1 per sample)
- target must be one-hot encoded (1 for true class, 0 for others)
- Shapes must match
Numerical stability: Tofu's implementation adds epsilon (1e-7) to avoid log(0), which would be undefined.
When to Use Cross-Entropy
Cross-entropy is ideal for classification:
Perfect use cases:
- Image classification (cat vs dog vs bird)
- Text classification (spam vs not spam)
- Sentiment analysis (positive/negative/neutral)
- Multi-class problems (10 digits, 1000 object categories, etc.)
- Any problem with discrete categorical outputs
Why it works:
- Designed for probability distributions
- Strong gradients throughout training
- Natural pairing with softmax
- Well-studied theoretical properties
When NOT to use:
- Regression problems (use MSE instead)
- Multi-label classification where multiple classes can be true simultaneously (requires binary cross-entropy per class)
Practical Example: MNIST-style Digit Classification
Here's a complete classification example for 4 classes:
// Setup: Neural network for 4-class classification
tofu_graph *g = tofu_graph_create();
// Training data: 8 samples with 10 features each
float input_data[8 * 10] = { /* ...fill with data... */ };
// One-hot encoded labels (8 samples, 4 classes)
float label_data[8 * 4] = {
1, 0, 0, 0, // Sample 0: class 0
0, 1, 0, 0, // Sample 1: class 1
0, 0, 1, 0, // Sample 2: class 2
0, 0, 0, 1, // Sample 3: class 3
1, 0, 0, 0, // Sample 4: class 0
0, 1, 0, 0, // Sample 5: class 1
0, 0, 1, 0, // Sample 6: class 2
0, 0, 0, 1 // Sample 7: class 3
};
// Create tensors
tofu_tensor *x_tensor = tofu_tensor_create(
input_data, 2, (int[]){8, 10}, TOFU_FLOAT);
tofu_tensor *y_tensor = tofu_tensor_create(
label_data, 2, (int[]){8, 4}, TOFU_FLOAT);
// Model parameters
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){10, 4}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
// Build graph
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
// Forward pass: logits = x @ W + b
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
logits = tofu_graph_add(g, logits, b);
// Softmax activation (converts logits to probabilities)
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1); // axis=1
// Target labels
tofu_graph_node *target = tofu_graph_input(g, y_tensor);
// Cross-entropy loss
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);
// Get loss value
tofu_tensor *loss_value = tofu_graph_get_value(loss);
float loss_scalar;
TOFU_TENSOR_DATA_TO(loss_value, 0, loss_scalar, TOFU_FLOAT);
printf("Cross-Entropy Loss: %.6f\n", loss_scalar);
// Backward pass
tofu_graph_backward(g, loss);
// Gradients are now available in W->grad and b->grad
Understanding Cross-Entropy Values
Cross-entropy values depend on the number of classes:
Binary classification (2 classes):
- Random guessing: ~0.693 (log(2))
- Good model: < 0.3
- Excellent model: < 0.1
Multi-class (e.g., 10 classes):
- Random guessing: ~2.303 (log(10))
- Good model: < 1.0
- Excellent model: < 0.5
Key insight: Cross-entropy can never be negative. Zero loss means perfect predictions (100% confidence in correct class). As predictions get worse, loss increases without bound.
Interpreting Loss During Training
Watch for these patterns:
Healthy training:
Epoch 0: loss = 2.30 (random initialization)
Epoch 10: loss = 1.20 (learning started)
Epoch 50: loss = 0.50 (converging)
Epoch 100: loss = 0.15 (well-trained)
Problems:
- Loss stays at log(num_classes): Model isn't learning (check learning rate)
- Loss increases: Learning rate too high or numerical instability
- Loss plateaus early: Model too simple or data too hard
Gradient Behavior
The gradient of cross-entropy with respect to predictions is:
∂CE/∂pred = -(1/n) * (target / pred)
Key properties:
- When prediction is very wrong (pred ≈ 0 but target = 1), gradient is very large
- When prediction is correct and confident (pred ≈ 1 and target = 1), gradient is small
- This creates strong learning signals when needed most
When cross-entropy is combined with softmax, the gradient of the loss with respect to the logits simplifies beautifully to:
∂CE/∂logits = pred - target
This is why softmax + cross-entropy is the gold standard for classification.
Choosing a Loss Function
Selecting the right loss function is critical—it defines what "good" means for your model. Here's a decision guide.
Decision Tree
Start: What type of problem are you solving?
│
├─ Predicting continuous values (numbers)?
│ └─ Use Mean Squared Error (MSE)
│ Examples: Regression, image denoising, forecasting
│
└─ Predicting discrete categories (classes)?
└─ Use Cross-Entropy Loss
Examples: Classification, object recognition, sentiment analysis
Quick Reference Table
| Task Type | Loss Function | Output Activation | Target Format |
|---|---|---|---|
| Regression | MSE | None (linear) | Continuous values |
| Binary Classification | Cross-Entropy | Softmax (2 classes) | One-hot [0,1] or [1,0] |
| Multi-class Classification | Cross-Entropy | Softmax | One-hot encoding |
| Image Reconstruction | MSE | None or Sigmoid | Pixel values |
Detailed Recommendations
Use MSE when:
- Output is a continuous number (prices, temperatures, distances)
- You care about absolute error magnitude
- Your task is regression or reconstruction
- Outliers are rare or acceptable
Use Cross-Entropy when:
- Output is a discrete category (class label)
- You need probability predictions
- Your task is classification
- You want strong gradients throughout training
Example scenarios:
| Problem | Input | Output | Loss | Why |
|---|---|---|---|---|
| House price prediction | Features (sqft, bedrooms) | Price ($) | MSE | Continuous value |
| Spam detection | Email text | Spam/Not Spam | Cross-Entropy | Binary classification |
| Digit recognition | Image pixels | Digit (0-9) | Cross-Entropy | Multi-class classification |
| Temperature forecast | Historical data | Temperature (°F) | MSE | Continuous value |
| Sentiment analysis | Review text | Pos/Neg/Neutral | Cross-Entropy | Multi-class classification |
Common Mistakes
Mistake 1: Using MSE for classification
// WRONG: Using MSE to predict classes
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *loss = tofu_graph_mse_loss(g, logits, target); // Bad!
Problem: MSE treats class probabilities as arbitrary numbers, leading to weak gradients and poor convergence.
Fix: Use softmax + cross-entropy:
// CORRECT: Classification with cross-entropy
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target); // Good!
Mistake 2: Forgetting softmax before cross-entropy
// WRONG: Cross-entropy without softmax
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *loss = tofu_graph_ce_loss(g, logits, target); // Bad!
Problem: Cross-entropy expects probabilities (sum to 1), but logits are raw scores.
Fix: Always apply softmax first:
// CORRECT: Softmax before cross-entropy
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target); // Good!
Mistake 3: Wrong target format
// WRONG: Class indices instead of one-hot for cross-entropy
float targets[] = {0, 1, 2, 1}; // Class indices
Problem: Cross-entropy expects one-hot encoded targets, not class indices.
Fix: Convert to one-hot:
// CORRECT: One-hot encoded targets
float targets[] = {
1, 0, 0, // Class 0
0, 1, 0, // Class 1
0, 0, 1, // Class 2
0, 1, 0 // Class 1
};
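If your labels come as class indices, a small conversion helper (hypothetical, not part of Tofu) can fill the one-hot rows for you:
// Hypothetical helper: expand class indices into one-hot rows.
// indices has num_samples entries in [0, num_classes); one_hot must have room
// for num_samples * num_classes floats.
void indices_to_one_hot(const int *indices, float *one_hot,
                        int num_samples, int num_classes) {
    for (int i = 0; i < num_samples; i++) {
        for (int j = 0; j < num_classes; j++) {
            one_hot[i * num_classes + j] = (indices[i] == j) ? 1.0f : 0.0f;
        }
    }
}
// Example: the indices {0, 1, 2, 1} expand to the one-hot rows shown above.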
Loss in Training Loop
The loss function integrates into the training loop at a specific point in the forward-backward cycle. Understanding this workflow ensures correct implementation.
Training Loop Structure
A typical training iteration follows this pattern:
1. Zero gradients (clear previous gradients)
2. Forward pass (compute predictions)
3. Compute loss (evaluate predictions)
4. Backward pass (compute gradients via backpropagation)
5. Optimizer step (update parameters)
6. Repeat
Loss computation happens after the forward pass and before the backward pass. It's the bridge connecting prediction to optimization.
Complete Training Loop Example
Here's a full training loop showing loss integration:
// Setup: Create graph, parameters, and optimizer
tofu_graph *g = tofu_graph_create();
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);
tofu_graph_node *W_node = tofu_graph_param(g, weights);
tofu_graph_node *b_node = tofu_graph_param(g, bias);
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Training loop
const int NUM_EPOCHS = 100;
const int BATCH_SIZE = 32;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
float epoch_loss = 0.0f;
int num_batches = 0;
for (int batch = 0; batch < num_batches_in_dataset; batch++) {
// 1. Zero gradients
tofu_optimizer_zero_grad(opt);
// 2. Forward pass
// Load batch data
tofu_tensor *batch_x = load_batch_input(batch);
tofu_tensor *batch_y = load_batch_target(batch);
tofu_graph_node *x = tofu_graph_input(g, batch_x);
tofu_graph_node *y_true = tofu_graph_input(g, batch_y);
// Model forward pass
tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
tofu_graph_node *pred = tofu_graph_add(g, h, b_node);
// 3. Compute loss
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, y_true);
// Extract loss value for monitoring
tofu_tensor *loss_tensor = tofu_graph_get_value(loss);
float loss_value;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_value, TOFU_FLOAT);
epoch_loss += loss_value;
// 4. Backward pass
tofu_graph_backward(g, loss);
// 5. Optimizer step
tofu_optimizer_step(opt);
// Cleanup batch resources
tofu_tensor_free(batch_x);
tofu_tensor_free(batch_y);
// Clear graph operations (keeps parameters)
tofu_graph_clear_ops(g);
num_batches++;
}
// Report progress
float avg_loss = epoch_loss / num_batches;
printf("Epoch %d: loss = %.6f\n", epoch, avg_loss);
}
Monitoring Loss During Training
Track loss values across epochs to monitor training progress:
// Loss tracking
float loss_history[NUM_EPOCHS];
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
// ... training code ...
loss_history[epoch] = avg_loss;
// Print every 10 epochs
if (epoch % 10 == 0) {
printf("Epoch %3d: loss = %.6f\n", epoch, avg_loss);
}
}
Loss Curves
Visualizing loss over time reveals training behavior:
Healthy training curve:
Loss
│
2.0 ┤●
│ ●
1.5 ┤ ●
│ ●●
1.0 ┤ ●●
│ ●●●
0.5 ┤ ●●●●●●●●●●●
└──────────────────────────> Epoch
Characteristics:
- Smooth decrease
- Eventually plateaus
- No wild fluctuations
Warning signs:
Loss increasing:
│ ●●●
│ ●●
│●●
└─────> Learning rate too high
Loss plateauing early:
│●●●●●●●●●●●●●●
│
└─────> Model too simple or stuck
Loss oscillating:
│ ● ● ● ● ●
│● ● ● ● ● ●
└─────> Batch size too small or LR too high
Understanding Loss Values
Interpreting loss values correctly helps diagnose training issues and assess model quality.
Absolute Loss Magnitude
Loss value interpretation depends heavily on context:
For MSE:
- Scale depends on target value range
- MSE = 100 is terrible for normalized data [0, 1]
- MSE = 100 might be excellent for house prices in thousands
- Always consider: What's the typical magnitude of your targets?
For Cross-Entropy:
- Random guessing baseline: log(num_classes)
- Binary classification random: 0.693
- 10-class random: 2.303
- Perfect predictions: 0.0
Rule of thumb: Compare loss to a baseline (random guessing or simple heuristic) to assess improvement.
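A quick way to compute such baselines in code, following directly from the definitions above (a sketch, with made-up target values): for cross-entropy, uniform random guessing gives about log(num_classes); for MSE, always predicting the mean target gives the variance of the targets.
// Baseline loss values to compare your training loss against.
#include <math.h>
#include <stdio.h>
int main(void) {
    // Cross-entropy baseline: uniform random guessing over K classes.
    int num_classes = 10;
    printf("CE baseline (random guess): %.3f\n", log((double)num_classes)); // ~2.303
    // MSE baseline: always predict the mean target (loss = target variance).
    double targets[] = {150.0, 200.0, 250.0, 300.0};
    int n = 4;
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += targets[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (targets[i] - mean) * (targets[i] - mean);
    var /= n;
    printf("MSE baseline (predict mean): %.1f\n", var); // 3125.0
    return 0;
}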
Relative Changes Matter More
Focus on loss trends rather than absolute values:
// Good trend (decreasing)
Epoch 0: loss = 1.50
Epoch 10: loss = 1.20 (20% reduction)
Epoch 20: loss = 0.85 (29% reduction)
Epoch 50: loss = 0.45 (47% reduction)
// Bad trend (increasing)
Epoch 0: loss = 1.50
Epoch 10: loss = 1.65 (increasing - problem!)
Common Loss Troubleshooting
Problem: Loss is NaN or infinite
Causes:
- Learning rate too high (exploding gradients)
- Numerical overflow in loss computation
- Invalid data (NaN in input)
Fixes:
// 1. Reduce learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.001); // Was 0.1
// 2. Check for NaN in data
for (int i = 0; i < tensor->len; i++) {
float val;
TOFU_TENSOR_DATA_TO(tensor, i, val, TOFU_FLOAT);
if (isnan(val) || isinf(val)) {
fprintf(stderr, "Invalid data at index %d\n", i);
}
}
// 3. Add gradient clipping (manual)
tofu_tensor *grad = tofu_graph_get_grad(param_node);
// Clip gradients to [-max_grad, max_grad]
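The last two lines above only fetch the gradient and describe the clip in a comment. One way to finish it element-wise, using the same accessor macros as the rest of this guide (a sketch; see also the norm-based clip_gradients helper in the optimizer guide's Troubleshooting section):
// Clamp each gradient entry to [-max_grad_limit, max_grad_limit].
float max_grad_limit = 1.0f;
for (int i = 0; i < grad->len; i++) {
    float val;
    TOFU_TENSOR_DATA_TO(grad, i, val, TOFU_FLOAT);
    if (val > max_grad_limit)  val = max_grad_limit;
    if (val < -max_grad_limit) val = -max_grad_limit;
    TOFU_TENSOR_DATA_FROM(grad, i, val, TOFU_FLOAT);
}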
Problem: Loss doesn't decrease
Causes:
- Learning rate too low
- Model too simple (can't fit data)
- Weights initialized poorly
- Wrong loss function for task
Fixes:
// 1. Increase learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1); // Was 0.001
// 2. Add hidden layers (increase model capacity)
tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
tofu_graph_matmul(g, x, W1), b1));
tofu_graph_node *h2 = tofu_graph_relu(g, tofu_graph_add(g,
tofu_graph_matmul(g, h1, W2), b2));
// 3. Check you're using the right loss function
// Classification → Cross-Entropy, Regression → MSE
Problem: Loss plateaus too early
Causes:
- Model capacity too small
- Learning rate needs adjustment
- Reached local minimum
- Need more training time
Fixes:
// 1. Train longer
const int NUM_EPOCHS = 500; // Was 100
// 2. Add capacity
// Increase hidden layer size or add more layers
// 3. Try learning rate schedule
float lr = (epoch < 50) ? 0.1 : 0.01; // Reduce LR after 50 epochs
Problem: Loss oscillates wildly
Causes:
- Learning rate too high
- Batch size too small
- Numerical instability
Fixes:
// 1. Reduce learning rate
lr = 0.001; // Was 0.1
// 2. Increase batch size
BATCH_SIZE = 64; // Was 16
// 3. Add momentum (helps smooth updates)
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.001, 0.9);
Comparing Train vs Validation Loss
Always monitor loss on held-out validation data:
// Training loop with validation
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
// Training
float train_loss = train_one_epoch(g, opt, train_data);
// Validation (no gradient updates)
float val_loss = evaluate_loss(g, val_data);
printf("Epoch %d: train_loss=%.4f, val_loss=%.4f\n",
epoch, train_loss, val_loss);
// Check for overfitting
if (val_loss > train_loss * 1.5) {
printf("Warning: Model may be overfitting\n");
}
}
Healthy pattern:
Train loss: 0.50, Val loss: 0.55 (close - good generalization)
Overfitting:
Train loss: 0.10, Val loss: 0.80 (gap too large - overfitting)
Advanced Topics
Beyond basic loss functions, advanced techniques can improve training stability and model performance.
Loss Weighting
Sometimes you want to emphasize certain samples or classes. Loss weighting adjusts the contribution of individual samples.
Class weighting for imbalanced data:
If you have 90% negative samples and 10% positive samples in binary classification, the model may ignore the minority class. Weight the minority class higher:
// Manually weight loss by class
// Assume we have per-sample weights
float class_weights[2] = {1.0f, 9.0f}; // Weight minority class 9x
// Compute weighted loss manually
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *loss_unweighted = tofu_graph_ce_loss(g, probs, target);
// Get loss and multiply by weights
tofu_tensor *loss_tensor = tofu_graph_get_value(loss_unweighted);
float loss_val;
TOFU_TENSOR_DATA_TO(loss_tensor, 0, loss_val, TOFU_FLOAT);
// Weight by class (simplified - production code would weight per-sample)
int predicted_class = /* determine class */;
float weighted_loss = loss_val * class_weights[predicted_class];
Note: Tofu doesn't have built-in weighted loss functions yet, so implement weighting manually or at the sample level.
Regularization Loss Terms
Regularization adds a penalty term to prevent overfitting:
Total Loss = Task Loss + λ * Regularization Term
Where λ controls regularization strength
L2 Regularization (Weight Decay):
Penalize large weights to prevent overfitting:
// Compute L2 regularization manually
float l2_penalty = 0.0f;
const float lambda = 0.01f;
tofu_tensor *W = param_tensor;
for (int i = 0; i < W->len; i++) {
float w;
TOFU_TENSOR_DATA_TO(W, i, w, TOFU_FLOAT);
l2_penalty += w * w;
}
l2_penalty *= lambda;
// Add to loss
float total_loss = task_loss + l2_penalty;
Note: In many frameworks, optimizers such as Adam offer built-in weight decay, which is more efficient than manual regularization; in Tofu, apply the penalty manually as shown above.
Custom Loss Functions
For specialized tasks, you may need custom losses. Implement them by:
- Computing the loss value using tensor operations
- Implementing the backward pass (gradient computation)
Example: Huber loss (robust to outliers)
// Huber loss: Combines MSE (small errors) with MAE (large errors)
// Loss = 0.5 * (pred - target)^2 if |error| < delta
// = delta * (|error| - 0.5*delta) otherwise
// This requires implementing a custom graph operation
// (beyond basic usage - see advanced tutorials)
For most use cases, MSE and cross-entropy are sufficient. Custom losses require deeper knowledge of Tofu's backward pass implementation.
Multi-Task Learning
Train one model for multiple tasks by combining losses:
// Example: Predict both class and bounding box
tofu_graph_node *class_logits = /* classification head */;
tofu_graph_node *bbox_pred = /* regression head */;
// Classification loss
tofu_graph_node *class_probs = tofu_graph_softmax(g, class_logits, 1);
tofu_graph_node *class_loss = tofu_graph_ce_loss(g, class_probs, class_target);
// Bounding box regression loss
tofu_graph_node *bbox_loss = tofu_graph_mse_loss(g, bbox_pred, bbox_target);
// Combined loss (weighted sum)
// Note: Must be done manually as Tofu doesn't support loss addition yet
float class_loss_val = extract_scalar(class_loss);
float bbox_loss_val = extract_scalar(bbox_loss);
float total_loss = class_loss_val + 0.5 * bbox_loss_val; // Weight bbox 50%
Complete Examples
Let's walk through two complete, practical examples: regression and classification.
Example 1: Regression - House Price Prediction
Goal: Predict house prices from square footage and number of bedrooms.
#include <stdio.h>
#include <stdlib.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"
int main() {
// Dataset: 4 houses
float features[] = {
1000.0f, 2.0f, // 1000 sqft, 2 bedrooms → $150k
1500.0f, 3.0f, // 1500 sqft, 3 bedrooms → $200k
2000.0f, 3.0f, // 2000 sqft, 3 bedrooms → $250k
2500.0f, 4.0f // 2500 sqft, 4 bedrooms → $300k
};
float prices[] = {150.0f, 200.0f, 250.0f, 300.0f};
// Create tensors
tofu_tensor *X = tofu_tensor_create(features, 2, (int[]){4, 2}, TOFU_FLOAT);
tofu_tensor *y = tofu_tensor_create(prices, 2, (int[]){4, 1}, TOFU_FLOAT);
// Model parameters (linear regression: y = X @ W + b)
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){2, 1}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){1}, TOFU_FLOAT);
// Create graph and optimizer
tofu_graph *g = tofu_graph_create();
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.0001); // Small LR
// Training loop
for (int epoch = 0; epoch < 1000; epoch++) {
tofu_optimizer_zero_grad(opt);
// Forward pass
tofu_graph_node *x_node = tofu_graph_input(g, X);
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);
tofu_graph_node *y_node = tofu_graph_input(g, y);
tofu_graph_node *pred = tofu_graph_add(g,
tofu_graph_matmul(g, x_node, W_node), b_node);
// MSE loss
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, y_node);
// Extract loss value
float loss_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
// Backward and optimize
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
if (epoch % 100 == 0) {
printf("Epoch %4d: MSE = %.2f\n", epoch, loss_val);
}
tofu_graph_clear_ops(g);
}
// Final predictions
printf("\nFinal predictions:\n");
tofu_graph_node *x_node = tofu_graph_input(g, X);
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);
tofu_graph_node *pred = tofu_graph_add(g,
tofu_graph_matmul(g, x_node, W_node), b_node);
tofu_tensor *predictions = tofu_graph_get_value(pred);
for (int i = 0; i < 4; i++) {
float pred_price;
TOFU_TENSOR_DATA_TO(predictions, i, pred_price, TOFU_FLOAT);
printf("House %d: Predicted=%.1f, Actual=%.1f\n",
i, pred_price, prices[i]);
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free(X);
tofu_tensor_free(y);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);
return 0;
}
Expected output:
Epoch 0: MSE = 42500.00
Epoch 100: MSE = 1250.50
Epoch 200: MSE = 523.75
Epoch 900: MSE = 12.30
Final predictions:
House 0: Predicted=148.5, Actual=150.0
House 1: Predicted=201.2, Actual=200.0
House 2: Predicted=251.8, Actual=250.0
House 3: Predicted=298.7, Actual=300.0
Example 2: Classification - XOR Problem
Goal: Learn the XOR function (classic non-linear classification).
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "tofu_tensor.h"
#include "tofu_graph.h"
#include "tofu_optimizer.h"
// Xavier initialization
float xavier_init(int fan_in) {
float limit = sqrtf(6.0f / fan_in);
return limit * (2.0f * rand() / RAND_MAX - 1.0f);
}
int main() {
// XOR dataset
float inputs[] = {
0, 0, // → 0
0, 1, // → 1
1, 0, // → 1
1, 1 // → 0
};
// One-hot targets (2 classes: [1,0] = class 0, [0,1] = class 1)
float targets[] = {
1, 0, // XOR(0,0) = 0
0, 1, // XOR(0,1) = 1
0, 1, // XOR(1,0) = 1
1, 0 // XOR(1,1) = 0
};
// Create tensors
tofu_tensor *X = tofu_tensor_create(inputs, 2, (int[]){4, 2}, TOFU_FLOAT);
tofu_tensor *y = tofu_tensor_create(targets, 2, (int[]){4, 2}, TOFU_FLOAT);
// Model: 2 → 4 → 2 (need hidden layer for non-linearity)
tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){2, 4}, TOFU_FLOAT);
tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){4, 2}, TOFU_FLOAT);
tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){2}, TOFU_FLOAT);
// Xavier initialization for W1, W2
for (int i = 0; i < W1->len; i++) {
float val = xavier_init(2);
TOFU_TENSOR_DATA_FROM(W1, i, val, TOFU_FLOAT);
}
for (int i = 0; i < W2->len; i++) {
float val = xavier_init(4);
TOFU_TENSOR_DATA_FROM(W2, i, val, TOFU_FLOAT);
}
// Create graph and optimizer
tofu_graph *g = tofu_graph_create();
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.5); // Higher LR
// Training loop
for (int epoch = 0; epoch < 2000; epoch++) {
tofu_optimizer_zero_grad(opt);
// Forward pass
tofu_graph_node *x = tofu_graph_input(g, X);
tofu_graph_node *w1 = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *w2 = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);
tofu_graph_node *y_node = tofu_graph_input(g, y);
// Layer 1: x @ W1 + b1 → ReLU
tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
tofu_graph_matmul(g, x, w1), b1_node));
// Layer 2: h1 @ W2 + b2 → softmax
tofu_graph_node *logits = tofu_graph_add(g,
tofu_graph_matmul(g, h1, w2), b2_node);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
// Cross-entropy loss
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, y_node);
float loss_val;
TOFU_TENSOR_DATA_TO(tofu_graph_get_value(loss), 0, loss_val, TOFU_FLOAT);
// Backward and optimize
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
if (epoch % 200 == 0) {
printf("Epoch %4d: CE Loss = %.4f\n", epoch, loss_val);
}
tofu_graph_clear_ops(g);
}
// Test predictions
printf("\nFinal predictions:\n");
tofu_graph_node *x = tofu_graph_input(g, X);
tofu_graph_node *w1 = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *w2 = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);
tofu_graph_node *h1 = tofu_graph_relu(g, tofu_graph_add(g,
tofu_graph_matmul(g, x, w1), b1_node));
tofu_graph_node *logits = tofu_graph_add(g,
tofu_graph_matmul(g, h1, w2), b2_node);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_tensor *predictions = tofu_graph_get_value(probs);
for (int i = 0; i < 4; i++) {
float prob0, prob1;
TOFU_TENSOR_DATA_TO(predictions, i*2, prob0, TOFU_FLOAT);
TOFU_TENSOR_DATA_TO(predictions, i*2+1, prob1, TOFU_FLOAT);
int pred_class = (prob1 > prob0) ? 1 : 0;
int true_class = (targets[i*2+1] > 0.5f) ? 1 : 0;
printf("[%.0f, %.0f] → Pred=%d (%.3f, %.3f), True=%d\n",
inputs[i*2], inputs[i*2+1],
pred_class, prob0, prob1, true_class);
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free(X);
tofu_tensor_free(y);
tofu_tensor_free_data_too(W1);
tofu_tensor_free_data_too(b1);
tofu_tensor_free_data_too(W2);
tofu_tensor_free_data_too(b2);
return 0;
}
Expected output:
Epoch 0: CE Loss = 0.7120
Epoch 200: CE Loss = 0.4532
Epoch 400: CE Loss = 0.2145
Epoch 1800: CE Loss = 0.0523
Final predictions:
[0, 0] → Pred=0 (0.972, 0.028), True=0
[0, 1] → Pred=1 (0.045, 0.955), True=1
[1, 0] → Pred=1 (0.039, 0.961), True=1
[1, 1] → Pred=0 (0.968, 0.032), True=0
Best Practices
Follow these guidelines for effective loss function usage:
1. Match Loss to Task Type
Always use:
- MSE for regression
- Cross-Entropy for classification
Never mix them: Using the wrong loss leads to poor convergence and incorrect learning.
2. Monitor Loss During Training
// Log loss to file or console
FILE *log = fopen("training_log.txt", "w");
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
float loss = train_epoch(...);
fprintf(log, "%d,%.6f\n", epoch, loss);
if (epoch % 10 == 0) {
printf("Epoch %d: loss=%.6f\n", epoch, loss);
}
}
fclose(log);
Track trends, not just final values. Loss should decrease smoothly over time.
3. Use Appropriate Learning Rates
Loss behavior reveals learning rate issues:
// Too high: Loss explodes or oscillates wildly
// Solution: Reduce by 10x
lr = 0.01; // Was 0.1
// Too low: Loss barely decreases
// Solution: Increase by 10x
lr = 0.1; // Was 0.01
4. Normalize Your Data
Large input/output ranges cause numerical instability:
// Bad: Raw house prices ($100k - $500k)
float price = 250000.0f;
// Good: Normalized to reasonable range
float price_normalized = (250000.0f - mean) / std_dev;
// or
float price_scaled = 250000.0f / 1000.0f; // Scale to [100-500]
Normalization prevents exploding gradients and improves convergence.
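For reference, here is a minimal sketch of z-score normalization over a plain float array; the helper name normalize_inplace is ours, not part of the Tofu API:
#include <math.h>
// Shift values to zero mean and scale to unit variance, in place
void normalize_inplace(float *x, int n) {
    float mean = 0.0f, var = 0.0f;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    float std_dev = sqrtf(var / n + 1e-8f); // epsilon guards against zero variance
    for (int i = 0; i < n; i++) x[i] = (x[i] - mean) / std_dev;
}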
5. Check for Numerical Issues
// Add checks during training
if (isnan(loss_val) || isinf(loss_val)) {
fprintf(stderr, "ERROR: Loss is %f at epoch %d\n", loss_val, epoch);
// Reduce learning rate or check data
break;
}
6. Compare to Baselines
Always establish a baseline before training:
// Baseline 1: Random predictions
// For classification: loss ≈ log(num_classes)
// For regression: loss ≈ variance of targets
// Baseline 2: Simple heuristic
// Classification: Always predict most common class
// Regression: Always predict mean target value
printf("Random baseline loss: %.4f\n", baseline_loss);
printf("Trained model loss: %.4f\n", final_loss);
printf("Improvement: %.1f%%\n",
100.0 * (baseline_loss - final_loss) / baseline_loss);
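As a concrete illustration, the random-guess baselines described in the comments above can be computed directly; NUM_CLASSES, n, and targets are placeholders for your own setup:
// Classification: uniform random predictions give CE ≈ log(num_classes)
float random_ce_baseline = logf((float)NUM_CLASSES);
// Regression: always predicting the mean gives MSE ≈ variance of the targets
float target_mean = 0.0f, baseline_mse = 0.0f;
for (int i = 0; i < n; i++) target_mean += targets[i];
target_mean /= n;
for (int i = 0; i < n; i++) baseline_mse += (targets[i] - target_mean) * (targets[i] - target_mean);
baseline_mse /= n;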
7. Use Validation Data
Never trust training loss alone:
// Split data: 80% train, 20% validation
float train_loss = evaluate_loss(g, train_data);
float val_loss = evaluate_loss(g, val_data);
if (val_loss > train_loss * 1.5) {
printf("Warning: Possible overfitting\n");
}
8. Save Best Model Based on Validation Loss
float best_val_loss = INFINITY;
tofu_tensor *best_W = NULL;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
train_epoch(...);
float val_loss = evaluate_validation(...);
if (val_loss < best_val_loss) {
best_val_loss = val_loss;
// Save model parameters
if (best_W) tofu_tensor_free_data_too(best_W);
best_W = tofu_tensor_clone(W);
printf("New best model at epoch %d: val_loss=%.4f\n",
epoch, val_loss);
}
}
9. Early Stopping
Stop training when validation loss stops improving:
int patience = 20; // Wait 20 epochs for improvement
int no_improve_count = 0;
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
float val_loss = train_and_validate(...);
if (val_loss < best_val_loss) {
best_val_loss = val_loss;
no_improve_count = 0;
} else {
no_improve_count++;
}
if (no_improve_count >= patience) {
printf("Early stopping at epoch %d\n", epoch);
break;
}
}
10. Document Your Loss Function Choice
// At the top of your training code, document decisions:
/*
* Model: Image classifier for 10 classes
* Loss: Cross-Entropy (classification task)
* Architecture: Input → 128 → 64 → 10 (softmax)
* Optimizer: SGD with lr=0.01
* Expected loss: Random ~2.3, Target <0.5
*/
This helps future debugging and maintains clear expectations.
Summary
Loss functions are the foundation of neural network training. Key takeaways:
- Match loss to task:
  - Regression → MSE
  - Classification → Cross-Entropy (with softmax)
- Loss must be:
  - Scalar (single number)
  - Differentiable (smooth gradients)
  - Representative of task objective
- Monitor loss trends:
  - Decreasing = learning
  - Plateauing = convergence or stuck
  - Increasing = problem (LR too high, numerical issues)
- Interpret loss in context:
  - Compare to baselines (random guessing)
  - Track validation loss (detect overfitting)
  - Understand scale (depends on data range)
- Debug with loss values:
  - NaN/Inf → Check learning rate, data validity
  - No decrease → Increase LR or model capacity
  - Oscillation → Reduce LR or increase batch size
With proper loss function selection and monitoring, you'll train neural networks that converge reliably and achieve strong performance on your task.
For more details on loss function implementation and gradients, see the Graph API Reference. For optimizer integration, see the Optimizer User Guide.
Linear Regression
Tutorial on implementing linear regression with Tofu.
Classification
Tutorial on building classification models.
CNN Training
Training Convolutional Neural Networks with Tofu.
Residual Networks
Building and training residual networks (ResNets).
Memory Management
Best practices for memory management in Tofu.
Error Handling
Best practices for error handling.
Debugging
Tips and techniques for debugging Tofu applications.
Performance
Performance optimization tips and techniques.
Tensor API Reference
The Tensor API provides the core data structures and operations for Tofu. Tensors are multi-dimensional arrays that support automatic differentiation when used with the Graph API.
Table of Contents
- Data Structures
- Creation Functions
- Shape Operations
- Mathematical Operations
- Element-wise Operations
- Reductions
- Activation Functions
- Utilities
Data Structures
tofu_tensor
The core tensor structure representing a multi-dimensional array.
struct tofu_tensor {
tofu_dtype dtype; // Data type (TOFU_FLOAT, TOFU_INT32, etc.)
int len; // Total number of elements
int ndim; // Number of dimensions
int *dims; // Array of dimension sizes
void *data; // Pointer to data buffer
struct tofu_tensor *owner; // Data owner (NULL if self-owned)
void *backend_data; // Backend-specific data
};
Data Types (tofu_dtype)
Supported tensor data types:
- TOFU_FLOAT - 32-bit floating point (most common for neural networks)
- TOFU_DOUBLE - 64-bit floating point
- TOFU_INT32 - 32-bit signed integer
- TOFU_INT64 - 64-bit signed integer
- TOFU_INT16 - 16-bit signed integer
- TOFU_INT8 - 8-bit signed integer
- TOFU_UINT32 - 32-bit unsigned integer
- TOFU_UINT64 - 64-bit unsigned integer
- TOFU_UINT16 - 16-bit unsigned integer
- TOFU_UINT8 - 8-bit unsigned integer
- TOFU_BOOL - Boolean type
Element-wise Operations (tofu_elew_op)
- TOFU_MUL - Multiplication (*)
- TOFU_DIV - Division (/)
- TOFU_SUM - Addition (+)
- TOFU_SUB - Subtraction (-)
- TOFU_MAX - Element-wise maximum
- TOFU_MIN - Element-wise minimum
- TOFU_POW - Power (^)
Creation Functions
tofu_tensor_create
Create a tensor with an existing data buffer.
tofu_tensor *tofu_tensor_create(void *data, int ndim, const int *dims, tofu_dtype dtype);
Parameters:
- data - Pointer to data buffer (cannot be NULL)
- ndim - Number of dimensions (must be > 0 and <= TOFU_MAXDIM = 8)
- dims - Array of dimension sizes, length must be ndim
- dtype - Data type (TOFU_FLOAT, TOFU_INT32, etc.)
Returns: Pointer to newly allocated tensor (caller owns, must call tofu_tensor_free)
Ownership:
- The tensor does NOT take ownership of the data buffer
- Caller must manage data lifetime and free both tensor and data separately
- Even when passed to
tofu_graph_param(), caller still owns tensor
Example:
float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
int dims[] = {2, 2};
tofu_tensor *t = tofu_tensor_create(data, 2, dims, TOFU_FLOAT);
// Use tensor...
tofu_tensor_free(t); // Free tensor structure
// data is still valid - free manually if needed
Notes:
- Typical pattern: create tensor → use in graph →
tofu_graph_free() → tofu_tensor_free() → free data
- Violating preconditions triggers
assert() and crashes
See also: tofu_tensor_zeros, tofu_tensor_create_with_values
tofu_tensor_create_with_values
Create a tensor with heap-allocated copy of provided values.
tofu_tensor *tofu_tensor_create_with_values(const float *values, int ndim, const int *dims);
Parameters:
- values - Array of initial values (cannot be NULL)
- ndim - Number of dimensions (must be > 0 and <= TOFU_MAXDIM)
- dims - Array of dimension sizes, length must be ndim
Returns: Pointer to newly allocated tensor with copied data (caller owns, must call tofu_tensor_free_data_too)
Important:
- Creates heap-allocated copy of values (safe for gradients)
- DO NOT use compound literals like
(float[]){1.0f} as they create stack memory
- Number of values must match product of dims
- Caller must call
tofu_tensor_free_data_too to free both tensor and data
Example:
float values[] = {1.0f, 2.0f, 3.0f, 4.0f};
int dims[] = {2, 2};
tofu_tensor *t = tofu_tensor_create_with_values(values, 2, dims);
// Use tensor...
tofu_tensor_free_data_too(t); // Free both tensor and data
tofu_tensor_zeros
Create a zero-initialized tensor with allocated data buffer.
tofu_tensor *tofu_tensor_zeros(int ndim, const int *dims, tofu_dtype dtype);
Parameters:
- ndim - Number of dimensions (must be > 0 and <= TOFU_MAXDIM)
- dims - Array of dimension sizes, length must be ndim
- dtype - Data type (TOFU_FLOAT, TOFU_INT32, etc.)
Returns: Pointer to newly allocated zero-filled tensor (caller owns, must call tofu_tensor_free_data_too)
Ownership:
- Allocates both tensor structure and data buffer
- Caller must call
tofu_tensor_free_data_too to free both
- Even when passed to
tofu_graph_param(), caller still owns tensor
Example:
int dims[] = {3, 4};
tofu_tensor *t = tofu_tensor_zeros(2, dims, TOFU_FLOAT);
// All elements are 0.0f
// Use tensor...
tofu_tensor_free_data_too(t); // Free both tensor and data
See also: tofu_tensor_create, tofu_tensor_clone
tofu_tensor_clone
Create a deep copy of a tensor.
tofu_tensor *tofu_tensor_clone(const tofu_tensor *src);
Parameters:
- src - Source tensor to clone (cannot be NULL)
Returns: Pointer to newly allocated tensor (caller owns, must call tofu_tensor_free_data_too)
Behavior:
- Creates both new tensor structure and new data buffer
- Copies all data from source to new tensor
- Preserves shape and data type
Example:
tofu_tensor *original = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *copy = tofu_tensor_clone(original);
// copy is independent of original
tofu_tensor_free_data_too(copy);
tofu_tensor_free_data_too(original);
tofu_tensor_repeat
Create a tensor by repeating data multiple times.
tofu_tensor *tofu_tensor_repeat(const tofu_tensor *src, int times);
Parameters:
- src - Source tensor to repeat (cannot be NULL)
- times - Number of repetitions (must be > 0)
Returns: Pointer to newly allocated tensor (caller owns, must call tofu_tensor_free_data_too)
Behavior:
- Creates new tensor with size =
src->len * times
- Repeats source data sequentially
Example:
float data[] = {1.0f, 2.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){2}, TOFU_FLOAT);
tofu_tensor *repeated = tofu_tensor_repeat(t, 3);
// repeated contains: [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
tofu_tensor_free_data_too(repeated);
tofu_tensor_free(t);
tofu_tensor_arange
Create a 1-D tensor with evenly spaced values (similar to NumPy arange).
tofu_tensor *tofu_tensor_arange(double start, double stop, double step, tofu_dtype dtype);
Parameters:
- start - Starting value (inclusive)
- stop - Ending value (exclusive)
- step - Step size between values
- dtype - Data type for the resulting tensor
Returns: Pointer to newly allocated 1-D tensor (caller owns, must call tofu_tensor_free_data_too)
Behavior:
- Creates values
[start, start+step, start+2*step, ..., stop)
- Number of elements =
ceil((stop - start) / step)
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 5.0, 1.0, TOFU_FLOAT);
// t contains: [0.0, 1.0, 2.0, 3.0, 4.0]
tofu_tensor_free_data_too(t);
See also: tofu_tensor_rearange for in-place filling
tofu_tensor_rearange
Fill existing tensor with evenly spaced values (in-place arange).
void tofu_tensor_rearange(tofu_tensor *src, double start, double stop, double step);
Parameters:
- src - Tensor to fill (cannot be NULL)
- start - Starting value (inclusive)
- stop - Ending value (exclusive)
- step - Step size between values
Behavior:
- Fills tensor with
[start, start+step, start+2*step, ...]
- Number of values written is
min(tensor size, ceil((stop-start)/step))
- Modifies tensor data in-place
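Example (illustrative sketch based on the semantics above):
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){5}, TOFU_FLOAT);
tofu_tensor_rearange(t, 0.0, 5.0, 1.0);
// t now contains: [0.0, 1.0, 2.0, 3.0, 4.0]
tofu_tensor_free_data_too(t);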
Cleanup Functions
tofu_tensor_free
Free tensor structure (does NOT free data buffer).
void tofu_tensor_free(tofu_tensor *t);
Parameters:
- t - Tensor to free (can be NULL, no-op if NULL)
Behavior:
- Frees only the tensor structure and dims array
- Does NOT free the data buffer - caller must free data separately
- Safe to call even if tensor was used with
tofu_graph_param()
- Call AFTER
tofu_graph_free() if tensor was used with graph
Example:
float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor_free(t); // Free tensor structure only
// data is still valid
See also: tofu_tensor_free_data_too
tofu_tensor_free_data_too
Free both tensor structure and data buffer.
void tofu_tensor_free_data_too(tofu_tensor *t);
Parameters:
- t - Tensor to free (can be NULL, no-op if NULL)
Behavior:
- Frees both the tensor and its associated data buffer
- Only use if tensor owns its data (created with
tofu_tensor_zeros, tofu_tensor_clone, etc.)
- Do NOT use if tensor was created with
tofu_tensor_create() (use tofu_tensor_free)
- Safe to call if tensor was used with
tofu_graph_param()
- Call AFTER
tofu_graph_free() if tensor was used with graph
Example:
tofu_tensor *t = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor_free_data_too(t); // Free both tensor and data
Warning: Using this on tensors created with tofu_tensor_create() will cause undefined behavior!
Shape Operations
tofu_tensor_size
Get total number of elements in tensor.
size_t tofu_tensor_size(tofu_tensor *t);
Parameters:
- t - Tensor (cannot be NULL)
Returns: Total element count (product of all dimensions)
Example:
tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
size_t size = tofu_tensor_size(t); // Returns 12
tofu_tensor_reshape
Reshape tensor to new dimensions (view operation, no data copy).
tofu_tensor *tofu_tensor_reshape(tofu_tensor *src, int ndim, const int *dims);
Parameters:
- src - Source tensor (cannot be NULL)
- ndim - Number of dimensions for reshaped tensor
- dims - Array of new dimension sizes
Returns: New tensor structure sharing data with source (caller owns, must call tofu_tensor_free)
Behavior:
- Does NOT copy data - result shares memory with source
- Only changes shape metadata, not data layout
- Source must outlive result tensor
- Product of dims must equal
tofu_tensor_size(src)
Warning: Do NOT call tofu_tensor_free_data_too on the reshaped view - this would free the shared data while the source tensor still references it! Only use tofu_tensor_free on views.
Example:
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor *reshaped = tofu_tensor_reshape(t, 2, (int[]){3, 4});
// reshaped is a view of t with shape [3, 4]
tofu_tensor_free(reshaped); // Free view
tofu_tensor_free_data_too(t); // Free original
See also: tofu_tensor_reshape_src for in-place reshape
tofu_tensor_reshape_src
Reshape tensor in-place (modifies source tensor metadata).
void tofu_tensor_reshape_src(tofu_tensor *src, int ndim, const int *dims);
Parameters:
- src - Tensor to reshape (cannot be NULL)
- ndim - Number of dimensions for reshaped tensor
- dims - Array of new dimension sizes
Behavior:
- Modifies src tensor structure in-place
- Does NOT copy or reallocate data
- Only changes shape metadata
- Product of dims must equal
tofu_tensor_size(src)
Example:
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){12}, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});
// t now has shape [3, 4]
tofu_tensor_transpose
Transpose tensor by permuting dimensions.
tofu_tensor *tofu_tensor_transpose(const tofu_tensor *src, tofu_tensor *dst, const int *axes);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axes - Permutation array (can be NULL for reverse order)
Returns: Result tensor (caller owns if dst was NULL)
Behavior:
- If
axes is NULL, reverses dimension order (e.g., [2,3,4] → [4,3,2])
- If
axes is non-NULL, permutes according to axes (e.g., axes=[1,0] swaps dims)
- For 2-D matrix,
axes=NULL transposes (rows ↔ columns)
Example:
// Matrix transpose
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *transposed = tofu_tensor_transpose(matrix, NULL, NULL);
// transposed has shape [4, 3]
// Custom permutation
int axes[] = {2, 0, 1};
tofu_tensor *t3d = tofu_tensor_zeros(3, (int[]){2, 3, 4}, TOFU_FLOAT);
tofu_tensor *permuted = tofu_tensor_transpose(t3d, NULL, axes);
// permuted has shape [4, 2, 3]
tofu_tensor_slice
Extract slice from tensor (copies data).
tofu_tensor *tofu_tensor_slice(const tofu_tensor *src, tofu_tensor *dst,
int axis, int start, int len);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axis - Axis along which to slice
- start - Starting index along axis
- len - Length of slice
Returns: Result tensor (caller owns if dst was NULL)
Preconditions:
- axis < src->ndim
- start >= 0 and start + len <= src->dims[axis]
- If
dst is non-NULL, it must have correct shape for slice
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT);
tofu_tensor *slice = tofu_tensor_slice(t, NULL, 0, 2, 5);
// slice contains: [2.0, 3.0, 4.0, 5.0, 6.0]
See also: tofu_tensor_slice_nocopy for view without copying
tofu_tensor_slice_nocopy
Create view of tensor slice (no data copy).
tofu_tensor *tofu_tensor_slice_nocopy(tofu_tensor *src, tofu_tensor *dst,
int axis, int start, int len);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axis - Axis along which to slice
- start - Starting index along axis
- len - Length of slice
Returns: Result tensor sharing data with source (caller owns if dst was NULL)
Behavior:
- Does NOT copy data - result shares memory with source
- Modifying result will modify source tensor
- Source must outlive result tensor
Warning: This is a view operation - changes affect the original tensor!
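Example (illustrative sketch; note that only the view structure is freed):
tofu_tensor *t = tofu_tensor_arange(0.0, 10.0, 1.0, TOFU_FLOAT);
tofu_tensor *view = tofu_tensor_slice_nocopy(t, NULL, 0, 2, 5);
// view shares memory with t and contains: [2.0, 3.0, 4.0, 5.0, 6.0]
tofu_tensor_free(view);          // free the view structure only
tofu_tensor_free_data_too(t);    // then free the source tensor and its data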
tofu_tensor_concat
Concatenate two tensors along specified axis.
tofu_tensor *tofu_tensor_concat(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst, int axis);
Parameters:
- src1 - First tensor (cannot be NULL)
- src2 - Second tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axis - Axis along which to concatenate
Returns: Result tensor (caller owns if dst was NULL)
Preconditions:
- All dimensions except
axis must match between src1 and src2
Behavior:
- Result
dims[axis] = src1->dims[axis] + src2->dims[axis]
Example:
tofu_tensor *a = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *concat = tofu_tensor_concat(a, b, NULL, 0);
// concat has shape [4, 3]
Mathematical Operations
tofu_tensor_matmul
Compute matrix multiplication with broadcasting.
tofu_tensor *tofu_tensor_matmul(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst);
Parameters:
- src1 - Left operand tensor (cannot be NULL)
- src2 - Right operand tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
Returns: Result tensor (caller owns if dst was NULL)
Preconditions:
- For 1-D @ 1-D:
src1->dims[0] must equal src2->dims[0]
- For 2-D and higher:
src1->dims[src1->ndim-1] must equal src2->dims[src2->ndim-2]
Behavior:
- 1-D @ 1-D: Dot product → scalar
- 2-D @ 2-D: Standard matrix multiplication
- N-D @ 1-D: Matrix-vector (drops last dim)
- 1-D @ N-D: Vector-matrix (drops first dim)
- N-D @ N-D: Batch matmul with broadcasting
Example:
// Matrix multiplication
tofu_tensor *A = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *B = tofu_tensor_zeros(2, (int[]){4, 5}, TOFU_FLOAT);
tofu_tensor *C = tofu_tensor_matmul(A, B, NULL);
// C has shape [3, 5]
// Batch matrix multiplication
tofu_tensor *batch_A = tofu_tensor_zeros(3, (int[]){2, 3, 4}, TOFU_FLOAT);
tofu_tensor *batch_B = tofu_tensor_zeros(3, (int[]){2, 4, 5}, TOFU_FLOAT);
tofu_tensor *batch_C = tofu_tensor_matmul(batch_A, batch_B, NULL);
// batch_C has shape [2, 3, 5]
Notes:
- Most commonly used operation for neural networks
- Broadcasts batch dimensions automatically
See also: tofu_tensor_inner for inner product
tofu_tensor_inner
Compute inner product (sum-product over last axes).
tofu_tensor *tofu_tensor_inner(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst);
Parameters:
- src1 - First tensor (cannot be NULL)
- src2 - Second tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
Returns: Result tensor (caller owns if dst was NULL)
Preconditions:
- src1->dims[src1->ndim-1] must equal src2->dims[src2->ndim-1]
Behavior:
- 1-D × 1-D: Dot product → scalar
- 2-D × 2-D:
result[i,j] = sum(a[i,:] * b[j,:])
- N-D × N-D: Cartesian product of non-last dimensions
- Output shape:
(*a.shape[:-1], *b.shape[:-1])
Example:
tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT); // [0, 1, 2]
tofu_tensor *b = tofu_tensor_arange(1.0, 4.0, 1.0, TOFU_FLOAT); // [1, 2, 3]
tofu_tensor *result = tofu_tensor_inner(a, b, NULL);
// result = 0*1 + 1*2 + 2*3 = 8.0
See also: tofu_tensor_matmul, tofu_tensor_outer
tofu_tensor_outer
Compute outer product (cartesian product without summation).
tofu_tensor *tofu_tensor_outer(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst);
Parameters:
- src1 - First tensor (cannot be NULL)
- src2 - Second tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
Returns: Result tensor (caller owns if dst was NULL)
Behavior:
- Flattens both input tensors
- Computes:
result[i,j] = a[i] * b[j]
- Always produces 2-D output
- Output shape:
[a.size, b.size], where size is the total element count
Example:
tofu_tensor *a = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT); // [0, 1, 2]
tofu_tensor *b = tofu_tensor_arange(1.0, 3.0, 1.0, TOFU_FLOAT); // [1, 2]
tofu_tensor *result = tofu_tensor_outer(a, b, NULL);
// result shape [3, 2]:
// [[0, 0],
// [1, 2],
// [2, 4]]
Element-wise Operations
tofu_tensor_elew
Apply element-wise binary operation with broadcasting.
tofu_tensor *tofu_tensor_elew(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst, tofu_elew_op elew_op);
Parameters:
- src1 - First tensor (cannot be NULL)
- src2 - Second tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- elew_op - Operation to apply (TOFU_MUL, TOFU_DIV, TOFU_SUM, TOFU_SUB, TOFU_POW, etc.)
Returns: Result tensor (caller owns if dst was NULL)
Preconditions:
- src1 and src2 must be broadcastable (NumPy rules)
Operations:
- TOFU_MUL - Element-wise multiplication (*)
- TOFU_DIV - Element-wise division (/)
- TOFU_SUM - Element-wise addition (+)
- TOFU_SUB - Element-wise subtraction (-)
- TOFU_POW - Element-wise power (^)
- TOFU_MAX - Element-wise maximum
- TOFU_MIN - Element-wise minimum
Example:
tofu_tensor *a = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT); // [1, 2, 3, 4]
tofu_tensor *b = tofu_tensor_arange(2.0, 6.0, 1.0, TOFU_FLOAT); // [2, 3, 4, 5]
tofu_tensor *sum = tofu_tensor_elew(a, b, NULL, TOFU_SUM);
// sum = [3, 5, 7, 9]
tofu_tensor *prod = tofu_tensor_elew(a, b, NULL, TOFU_MUL);
// prod = [2, 6, 12, 20]
// Broadcasting example
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
float scalar_data[] = {2.0f};
tofu_tensor *scalar = tofu_tensor_create(scalar_data, 1, (int[]){1}, TOFU_FLOAT);
tofu_tensor *scaled = tofu_tensor_elew(matrix, scalar, NULL, TOFU_MUL);
// All elements of matrix multiplied by 2.0
See also: tofu_tensor_elew_param, tofu_tensor_elew_broadcast
tofu_tensor_elew_param
Apply element-wise operation between tensor and scalar.
tofu_tensor *tofu_tensor_elew_param(const tofu_tensor *src, double param,
tofu_tensor *dst, tofu_elew_op elew_op);
Parameters:
- src - Source tensor (cannot be NULL)
- param - Scalar parameter
- dst - Destination tensor (can be NULL to allocate new)
- elew_op - Operation to apply
Returns: Result tensor with same shape as src (caller owns if dst was NULL)
Behavior:
- Applies operation element-wise:
op(tensor_element, param)
Example:
tofu_tensor *t = tofu_tensor_arange(1.0, 5.0, 1.0, TOFU_FLOAT); // [1, 2, 3, 4]
tofu_tensor *scaled = tofu_tensor_elew_param(t, 2.0, NULL, TOFU_MUL);
// scaled = [2, 4, 6, 8]
tofu_tensor *shifted = tofu_tensor_elew_param(t, 10.0, NULL, TOFU_SUM);
// shifted = [11, 12, 13, 14]
tofu_tensor *squared = tofu_tensor_elew_param(t, 2.0, NULL, TOFU_POW);
// squared = [1, 4, 9, 16]
tofu_tensor_elew_broadcast
Apply element-wise operation with automatic broadcasting.
tofu_tensor *tofu_tensor_elew_broadcast(const tofu_tensor *src1, const tofu_tensor *src2,
tofu_tensor *dst, tofu_elew_op elew_op);
Parameters:
- src1 - First tensor (cannot be NULL)
- src2 - Second tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- elew_op - Operation to apply
Returns: Result tensor with broadcast shape (caller owns if dst was NULL)
Notes:
- Automatically broadcasts inputs to compatible shape
- Equivalent to
tofu_tensor_elew, but with explicit broadcast handling
- Follows NumPy broadcasting rules
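Example (illustrative sketch mirroring the broadcasting example for tofu_tensor_elew above):
// Add a row vector to every row of a matrix
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *row = tofu_tensor_arange(0.0, 4.0, 1.0, TOFU_FLOAT); // shape [4]
tofu_tensor *result = tofu_tensor_elew_broadcast(matrix, row, NULL, TOFU_SUM);
// result has shape [3, 4]; each row equals [0.0, 1.0, 2.0, 3.0]
tofu_tensor_free_data_too(result);
tofu_tensor_free_data_too(row);
tofu_tensor_free_data_too(matrix);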
Reductions
tofu_tensor_sumreduce
Reduce tensor along axis using sum operation.
tofu_tensor *tofu_tensor_sumreduce(const tofu_tensor *src, tofu_tensor *dst, int axis);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axis - Axis along which to reduce
Returns: Result tensor with dims[axis] removed (caller owns if dst was NULL)
Behavior:
- Output shape:
src->dims with dims[axis] removed
- Computes sum of all elements along specified axis
Example:
tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
// Fill with 1.0
for (int i = 0; i < 12; i++) {
float val = 1.0f;
TOFU_TENSOR_DATA_FROM(t, i, val, TOFU_FLOAT);
}
tofu_tensor *row_sum = tofu_tensor_sumreduce(t, NULL, 1);
// row_sum has shape [3], each element = 4.0
tofu_tensor *col_sum = tofu_tensor_sumreduce(t, NULL, 0);
// col_sum has shape [4], each element = 3.0
See also: tofu_tensor_meanreduce, tofu_tensor_maxreduce
tofu_tensor_meanreduce
Reduce tensor along axis using mean operation.
tofu_tensor *tofu_tensor_meanreduce(const tofu_tensor *src, tofu_tensor *dst, int axis);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axis - Axis along which to reduce
Returns: Result tensor with dims[axis] removed (caller owns if dst was NULL)
Behavior:
- Output shape:
src->dims with dims[axis] removed
- Computes arithmetic mean of all elements along specified axis
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});
tofu_tensor *row_mean = tofu_tensor_meanreduce(t, NULL, 1);
// row_mean has shape [3]
// row_mean[0] = mean([0,1,2,3]) = 1.5
// row_mean[1] = mean([4,5,6,7]) = 5.5
// row_mean[2] = mean([8,9,10,11]) = 9.5
tofu_tensor_maxreduce
Reduce tensor along axis using max operation.
tofu_tensor *tofu_tensor_maxreduce(const tofu_tensor *src, tofu_tensor *dst,
tofu_tensor *arg, int axis);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- arg - Argmax indices tensor (can be NULL if indices not needed)
- axis - Axis along which to reduce
Returns: Result tensor with dims[axis] removed (caller owns if dst was NULL)
Behavior:
- Output shape:
src->dims with dims[axis] removed
- If
arg is non-NULL, fills it with indices of maximum values
Example:
float data[] = {3.0f, 1.0f, 4.0f, 1.0f, 5.0f, 9.0f};
tofu_tensor *t = tofu_tensor_create(data, 2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *max_vals = tofu_tensor_maxreduce(t, NULL, NULL, 1);
// max_vals = [4.0, 9.0]
tofu_tensor *indices = tofu_tensor_zeros(1, (int[]){2}, TOFU_INT32);
max_vals = tofu_tensor_maxreduce(t, NULL, indices, 1);
// indices = [2, 2] (position of max in each row)
tofu_tensor_sub_broadcast
Subtract reduced tensor from source with broadcasting.
tofu_tensor *tofu_tensor_sub_broadcast(const tofu_tensor *src, const tofu_tensor *reduced,
tofu_tensor *dst, int axis);
Parameters:
- src - Source tensor (cannot be NULL)
- reduced - Reduced tensor to subtract (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axis - Axis along which reduction was performed
Returns: Result tensor with same shape as src (caller owns if dst was NULL)
Preconditions:
- reduced->ndim = src->ndim - 1 (one dimension removed)
Behavior:
- Broadcasts reduced tensor back along axis and subtracts
- Useful for normalization operations (subtract mean, etc.)
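Example (illustrative sketch: subtracting per-row means):
tofu_tensor *t = tofu_tensor_arange(0.0, 12.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){3, 4});
tofu_tensor *row_mean = tofu_tensor_meanreduce(t, NULL, 1);   // shape [3]
tofu_tensor *centered = tofu_tensor_sub_broadcast(t, row_mean, NULL, 1);
// Each row of centered now has zero mean, e.g. row 0 = [-1.5, -0.5, 0.5, 1.5]
tofu_tensor_free_data_too(centered);
tofu_tensor_free_data_too(row_mean);
tofu_tensor_free_data_too(t);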
Activation Functions
tofu_tensor_lrelu
Apply Leaky ReLU activation function.
tofu_tensor *tofu_tensor_lrelu(const tofu_tensor *src, tofu_tensor *dst, float negslope);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- negslope - Slope for negative values (typically 0.01)
Returns: Result tensor with same shape as src (caller owns if dst was NULL)
Behavior:
- Computes:
x if x >= 0, else negslope * x
- Standard ReLU equivalent when
negslope = 0
Example:
float data[] = {-2.0f, -1.0f, 0.0f, 1.0f, 2.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){5}, TOFU_FLOAT);
tofu_tensor *relu = tofu_tensor_lrelu(t, NULL, 0.0f);
// relu = [0.0, 0.0, 0.0, 1.0, 2.0]
tofu_tensor *leaky = tofu_tensor_lrelu(t, NULL, 0.01f);
// leaky = [-0.02, -0.01, 0.0, 1.0, 2.0]
Note: For use in computation graphs with automatic differentiation, use tofu_graph_relu() instead.
tofu_tensor_softmax
Apply softmax activation along specified axis.
tofu_tensor *tofu_tensor_softmax(const tofu_tensor *src, tofu_tensor *dst, int axis);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- axis - Axis along which to apply softmax
Returns: Result tensor with same shape as src (caller owns if dst was NULL)
Behavior:
- Computes:
exp(x_i) / sum(exp(x_j)) along axis
- Uses numerically stable implementation (subtracts max before exp)
- Output values sum to 1.0 along specified axis
Example:
float logits[] = {1.0f, 2.0f, 3.0f};
tofu_tensor *t = tofu_tensor_create(logits, 1, (int[]){3}, TOFU_FLOAT);
tofu_tensor *probs = tofu_tensor_softmax(t, NULL, 0);
// probs ≈ [0.09, 0.24, 0.67] (sums to 1.0)
Note: For use in computation graphs with automatic differentiation, use tofu_graph_softmax() instead.
tofu_tensor_layer_norm
Apply layer normalization with learnable affine transform.
tofu_tensor *tofu_tensor_layer_norm(const tofu_tensor *src, tofu_tensor *dst,
const tofu_tensor *gamma, const tofu_tensor *beta,
int axis, double eps);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- gamma - Scale parameter tensor (can be NULL for no scaling)
- beta - Shift parameter tensor (can be NULL for no shift)
- axis - Axis along which to normalize
- eps - Small constant for numerical stability (typically 1e-5)
Returns: Result tensor with same shape as src (caller owns if dst was NULL)
Behavior:
- Normalizes:
(x - mean) / sqrt(variance + eps)
- Then applies:
gamma * normalized + beta (if gamma/beta non-NULL)
- If gamma/beta are NULL, only normalization is applied
Example:
tofu_tensor *x = tofu_tensor_zeros(2, (int[]){2, 4}, TOFU_FLOAT);
float gamma_data[] = {1.0f, 1.0f, 1.0f, 1.0f};
float beta_data[] = {0.0f, 0.0f, 0.0f, 0.0f};
tofu_tensor *gamma = tofu_tensor_create(gamma_data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor *beta = tofu_tensor_create(beta_data, 1, (int[]){4}, TOFU_FLOAT);
tofu_tensor *normalized = tofu_tensor_layer_norm(x, NULL, gamma, beta, 1, 1e-5);
Utilities
tofu_tensor_issameshape
Check if two tensors have the same shape.
int tofu_tensor_issameshape(const tofu_tensor *t1, const tofu_tensor *t2);
Parameters:
- t1 - First tensor (cannot be NULL)
- t2 - Second tensor (cannot be NULL)
Returns: 1 if same shape, 0 otherwise
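Example (illustrative sketch):
tofu_tensor *a = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(2, (int[]){2, 3}, TOFU_FLOAT);
tofu_tensor *c = tofu_tensor_zeros(2, (int[]){3, 2}, TOFU_FLOAT);
int same = tofu_tensor_issameshape(a, b);  // Returns 1
same = tofu_tensor_issameshape(a, c);      // Returns 0 (dimension order differs)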
tofu_tensor_isbroadcastable
Check if two tensors can be broadcast together (NumPy semantics).
int tofu_tensor_isbroadcastable(const tofu_tensor *t1, const tofu_tensor *t2);
Parameters:
- t1 - First tensor (cannot be NULL)
- t2 - Second tensor (cannot be NULL)
Returns: 1 if broadcastable, 0 otherwise
Broadcasting Rules:
- Arrays with fewer dimensions are prepended with size-1 dimensions
- Size-1 dimensions are stretched to match the other array
- Dimensions must match or one must be 1
Example:
tofu_tensor *a = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
int can_broadcast = tofu_tensor_isbroadcastable(a, b); // Returns 1
tofu_tensor *c = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);
can_broadcast = tofu_tensor_isbroadcastable(a, c); // Returns 0
tofu_tensor_broadcast_to
Broadcast tensor to specified shape (NumPy semantics).
tofu_tensor *tofu_tensor_broadcast_to(const tofu_tensor *src, tofu_tensor *dst,
int ndim, const int *dims);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- ndim - Number of dimensions for target shape
- dims - Target dimension sizes
Returns: Result tensor with target shape (caller owns if dst was NULL)
Preconditions:
- src must be broadcastable to target shape (NumPy rules)
Behavior:
- Follows NumPy broadcasting rules
- Size-1 dimensions are stretched to match target
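Example (illustrative sketch):
tofu_tensor *v = tofu_tensor_arange(0.0, 3.0, 1.0, TOFU_FLOAT);   // shape [3]
tofu_tensor *m = tofu_tensor_broadcast_to(v, NULL, 2, (int[]){4, 3});
// m has shape [4, 3]; every row is [0.0, 1.0, 2.0]
// Free m according to whether it owns its data (see Cleanup Functions)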
tofu_tensor_print
Print tensor to stdout with custom format.
void tofu_tensor_print(const tofu_tensor *t, const char *fmt);
Parameters:
- t - Tensor to print (cannot be NULL)
- fmt - Format string for each element (e.g., "%.6f", "%d")
Example:
tofu_tensor *t = tofu_tensor_arange(0.0, 6.0, 1.0, TOFU_FLOAT);
tofu_tensor_reshape_src(t, 2, (int[]){2, 3});
tofu_tensor_print(t, "%.1f");
// Output:
// [[0.0, 1.0, 2.0],
// [3.0, 4.0, 5.0]]
See also: tofu_tensor_fprint for printing to arbitrary stream, tofu_tensor_save for saving to file
tofu_tensor_fprint
Print tensor to file stream with custom format.
void tofu_tensor_fprint(FILE *stream, const tofu_tensor *t, const char *fmt);
Parameters:
- stream - File stream to write to (cannot be NULL)
- t - Tensor to print (cannot be NULL)
- fmt - Format string for each element
tofu_tensor_save
Save tensor to file with custom format.
int tofu_tensor_save(const char *file_name, const tofu_tensor *t, const char *fmt);
Parameters:
- file_name - Path to output file (cannot be NULL)
- t - Tensor to save (cannot be NULL)
- fmt - Format string for each element
Returns: 0 on success, non-zero on error
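Example (illustrative sketch covering both tofu_tensor_fprint and tofu_tensor_save; the file name is a placeholder):
tofu_tensor *t = tofu_tensor_arange(0.0, 4.0, 1.0, TOFU_FLOAT);
tofu_tensor_fprint(stderr, t, "%.2f");              // print to an arbitrary stream
if (tofu_tensor_save("tensor_dump.txt", t, "%.2f") != 0) {
    fprintf(stderr, "Failed to save tensor\n");
}
tofu_tensor_free_data_too(t);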
tofu_tensor_convert
Convert tensor to different data type.
tofu_tensor *tofu_tensor_convert(const tofu_tensor *src, tofu_tensor *dst,
tofu_dtype dtype_d);
Parameters:
- src - Source tensor (cannot be NULL)
- dst - Destination tensor (can be NULL to allocate new)
- dtype_d - Target data type
Returns: Result tensor with same shape as src but different dtype (caller owns if dst was NULL)
Behavior:
- Converts each element to target type with appropriate casting
- May lose precision (e.g., float to int truncates)
Example:
float data[] = {1.7f, 2.3f, 3.9f};
tofu_tensor *floats = tofu_tensor_create(data, 1, (int[]){3}, TOFU_FLOAT);
tofu_tensor *ints = tofu_tensor_convert(floats, NULL, TOFU_INT32);
// ints = [1, 2, 3]
tofu_tensor_index
Convert multi-dimensional coordinates to flat index.
int tofu_tensor_index(const tofu_tensor *t, int *coords);
Parameters:
- t - Tensor (cannot be NULL)
- coords - Array of coordinates, length must be t->ndim
Returns: Flat index into tensor data array
tofu_tensor_coords
Convert flat index to multi-dimensional coordinates.
void tofu_tensor_coords(const tofu_tensor *t, int index, int *coords);
Parameters:
- t - Tensor (cannot be NULL)
- index - Flat index into tensor data array
- coords - Output array for coordinates, length must be t->ndim
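Example (illustrative sketch showing the round trip between coordinates and flat indices):
tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
int coords[2] = {1, 2};
int flat = tofu_tensor_index(t, coords);   // with row-major layout this is 1*4 + 2 = 6
int back[2];
tofu_tensor_coords(t, flat, back);         // back = {1, 2}
tofu_tensor_free_data_too(t);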
Common Patterns
Working with Tensor Memory
// Pattern 1: User manages data buffer
float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *t = tofu_tensor_create(data, 1, (int[]){4}, TOFU_FLOAT);
// Use tensor...
tofu_tensor_free(t);
// data is still valid
// Pattern 2: Library manages data buffer
tofu_tensor *t = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
// Use tensor...
tofu_tensor_free_data_too(t);
// Both tensor and data are freed
Accessing Tensor Elements
tofu_tensor *t = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
// Read element at index i
float value;
TOFU_TENSOR_DATA_TO(t, i, value, TOFU_FLOAT);
// Write element at index i
value = 42.0f;
TOFU_TENSOR_DATA_FROM(t, i, value, TOFU_FLOAT);
// Copy element from src[si] to dst[di]
TOFU_TENSOR_DATA_ASSIGN(dst, di, src, si);
Broadcasting Example
// Add scalar to matrix (broadcasting)
tofu_tensor *matrix = tofu_tensor_zeros(2, (int[]){3, 4}, TOFU_FLOAT);
tofu_tensor *result = tofu_tensor_elew_param(matrix, 5.0, NULL, TOFU_SUM);
// Add vector to matrix rows (broadcasting)
tofu_tensor *row_vec = tofu_tensor_zeros(1, (int[]){4}, TOFU_FLOAT);
result = tofu_tensor_elew_broadcast(matrix, row_vec, NULL, TOFU_SUM);
Graph API Reference
The Graph API provides computational graph construction and automatic differentiation for training neural networks. It implements reverse-mode automatic differentiation (backpropagation) for computing gradients.
Table of Contents
- Data Structures
- Graph Lifecycle
- Leaf Nodes
- Operations
- Loss Functions
- Backward Pass
- Utilities
- Usage Patterns
Data Structures
tofu_graph
The computation graph structure that manages all nodes and their relationships.
struct tofu_graph {
tofu_graph_node** nodes; // All nodes in graph
int num_nodes; // Number of nodes
int capacity; // Allocated capacity
tofu_graph_node** topo_order; // Nodes in reverse topological order
int topo_size; // Size of topo_order
int topo_capacity; // Allocated capacity
int next_id; // Next available node ID
};
tofu_graph_node
A node in the computation graph representing an operation or leaf value.
struct tofu_graph_node {
int id; // Unique node ID within graph
tofu_op_type op; // Operation type
tofu_tensor* value; // Forward pass result
tofu_tensor* grad; // Gradient (∂L/∂value)
tofu_graph_node** inputs; // Input nodes
int num_inputs; // Number of inputs
int capacity_inputs; // Allocated capacity for inputs
tofu_backward_fn backward_fn; // Backward pass function
void* backward_ctx; // Context for backward (saved tensors, etc.)
int requires_grad; // Does this need gradient computation?
int visited; // For topological sort
tofu_graph* graph; // Parent graph
};
Operation Types (tofu_op_type)
Enumeration of all supported operations:
-
Leaf nodes:
TOFU_OP_INPUT- Input node (no gradient)TOFU_OP_PARAM- Trainable parameter (requires gradient)
-
Binary operations:
TOFU_OP_MATMUL- Matrix multiplicationTOFU_OP_ADD- Element-wise additionTOFU_OP_MUL- Element-wise multiplication
-
Activations:
TOFU_OP_RELU- ReLU activationTOFU_OP_SOFTMAX- Softmax activationTOFU_OP_LAYER_NORM- Layer normalization
-
Shape operations:
TOFU_OP_RESHAPE- Reshape operationTOFU_OP_TRANSPOSE- Transpose operation
-
Reductions:
TOFU_OP_MEAN- Mean reductionTOFU_OP_SUM- Sum reduction
-
Loss functions:
TOFU_OP_MSE_LOSS- Mean squared error lossTOFU_OP_CE_LOSS- Cross-entropy loss
Graph Lifecycle
tofu_graph_create
Create a new empty computation graph.
tofu_graph* tofu_graph_create(void);
Returns: Pointer to newly allocated graph (caller owns, must call tofu_graph_free)
Behavior:
- Graph starts empty - add nodes via
tofu_graph_input,tofu_graph_param, etc. - Graph does NOT take ownership of tensors passed to
tofu_graph_param - Caller must call
tofu_graph_free to free graph and all nodes
Example:
tofu_graph *g = tofu_graph_create();
// Build graph...
tofu_graph_free(g);
tofu_graph_free
Free computation graph and all nodes.
void tofu_graph_free(tofu_graph* g);
Parameters:
- g - Graph to free (can be NULL, no-op if NULL)
Behavior:
- Frees all graph nodes and their gradients
- Frees intermediate operation results (matmul, add, etc.)
- Does NOT free INPUT or PARAM tensors (caller owns them)
- Caller must separately free tensors passed to input/param functions
- Safe to call multiple times (idempotent)
Ownership Pattern:
// Create tensors
tofu_tensor *input = tofu_tensor_zeros(2, (int[]){1, 4}, TOFU_FLOAT);
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
// Build graph
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, input);
tofu_graph_node *W = tofu_graph_param(g, weights);
// ... more operations ...
// Cleanup (important order!)
tofu_graph_free(g); // 1. Free graph first
tofu_tensor_free_data_too(input); // 2. Then free tensors
tofu_tensor_free_data_too(weights);
See also: tofu_graph_clear_ops
tofu_graph_clear_ops
Clear all operation nodes but keep parameter nodes.
void tofu_graph_clear_ops(tofu_graph* g);
Parameters:
- g - Graph to clear (cannot be NULL)
Behavior:
- Frees all nodes except PARAM and INPUT nodes
- Preserves trainable parameters for next forward pass
- Use between training iterations to reset computation graph
Use Case:
tofu_graph *g = tofu_graph_create();
// Add parameters (preserved across iterations)
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Build forward graph for this batch
tofu_graph_node *x = tofu_graph_input(g, batch_data);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
tofu_graph_node *out = tofu_graph_add(g, y, b);
// Backward and optimize...
// Clear operations for next iteration (W and b are preserved)
tofu_graph_clear_ops(g);
}
Notes:
- More efficient than creating new graph each iteration
- Parameters maintain their values and gradients
- Violating preconditions triggers
assert() and crashes
Leaf Nodes
Leaf nodes are the starting points of computation - they have no inputs and represent either data or learnable parameters.
tofu_graph_input
Create input node (non-trainable data source).
tofu_graph_node* tofu_graph_input(tofu_graph* g, tofu_tensor* data);
Parameters:
- g - Graph to add node to (cannot be NULL)
- data - Input tensor data (cannot be NULL)
Returns: Pointer to newly created graph node (graph owns node, caller owns tensor)
Behavior:
- Input nodes do NOT compute gradients
- IMPORTANT: Graph does NOT take ownership of data tensor
- Caller must free data tensor separately after
tofu_graph_free()
- Use for input data that doesn't require backpropagation
Example:
float input_data[] = {1.0f, 2.0f, 3.0f, 4.0f};
tofu_tensor *x_tensor = tofu_tensor_create(input_data, 2, (int[]){1, 4}, TOFU_FLOAT);
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
// Use x in graph operations...
tofu_graph_free(g);
tofu_tensor_free(x_tensor); // Caller must free tensor
Notes:
- Typical pattern: create tensor → input node → use → graph_free → free tensor
- Violating preconditions triggers
assert() and crashes
See also: tofu_graph_param for trainable parameters
tofu_graph_param
Create parameter node (trainable weights/biases).
tofu_graph_node* tofu_graph_param(tofu_graph* g, tofu_tensor* data);
Parameters:
- g - Graph to add node to (cannot be NULL)
- data - Parameter tensor data (cannot be NULL)
Returns: Pointer to newly created graph node (graph owns node, caller owns tensor)
Behavior:
- IMPORTANT: Graph does NOT take ownership of data tensor
- Caller must free data tensor separately after
tofu_graph_free()
- Parameter nodes compute gradients during backward pass
- Use for trainable weights, biases, etc.
Example:
// Create trainable weights
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);
tofu_graph *g = tofu_graph_create();
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);
// Build network...
// Training loop with backward pass computes W->grad and b->grad
// Cleanup
tofu_graph_free(g);
tofu_tensor_free_data_too(W); // Caller must free tensors
tofu_tensor_free_data_too(b);
Notes:
- Typical pattern: create tensor → param node → free tensor after graph_free
- Gradients are stored in the node, accessible via
tofu_graph_get_grad()
- Violating preconditions triggers
assert() and crashes
See also: tofu_graph_input for non-trainable inputs, tofu_graph_get_grad to access gradients
Operations
Operations create new nodes in the graph that compute values during forward pass and gradients during backward pass.
tofu_graph_matmul
Add matrix multiplication node to graph.
tofu_graph_node* tofu_graph_matmul(tofu_graph* g, tofu_graph_node* a, tofu_graph_node* b);
Parameters:
- g - Graph to add node to (cannot be NULL)
- a - Left operand node (cannot be NULL)
- b - Right operand node (cannot be NULL)
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Preconditions:
- a->value->dims[last] must equal b->value->dims[second-to-last]
Behavior:
- Computes matrix multiplication with broadcasting
- Implements backward pass for gradient computation
- Result node requires gradient if any input requires gradient
Example:
// Neural network layer: y = x @ W
tofu_graph_node *x = tofu_graph_input(g, input_tensor);
tofu_graph_node *W = tofu_graph_param(g, weights_tensor);
tofu_graph_node *y = tofu_graph_matmul(g, x, W);
Notes:
- Most commonly used operation for neural networks
- Follows same semantics as
tofu_tensor_matmul (see Tensor API)
- Violating preconditions triggers
assert() and crashes
tofu_graph_add
Add element-wise addition node to graph.
tofu_graph_node* tofu_graph_add(tofu_graph* g, tofu_graph_node* a, tofu_graph_node* b);
Parameters:
- g - Graph to add node to (cannot be NULL)
- a - First operand node (cannot be NULL)
- b - Second operand node (cannot be NULL)
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Preconditions:
- a and b must be broadcastable (NumPy rules)
Behavior:
- Computes element-wise addition with broadcasting
- Implements backward pass for gradient computation
Example:
// Add bias: y = x + b
tofu_graph_node *x = tofu_graph_matmul(g, input, weights);
tofu_graph_node *b = tofu_graph_param(g, bias_tensor);
tofu_graph_node *y = tofu_graph_add(g, x, b);
Notes:
- Supports NumPy-style broadcasting
- Common for adding biases to layer outputs
tofu_graph_mul
Add element-wise multiplication node to graph.
tofu_graph_node* tofu_graph_mul(tofu_graph* g, tofu_graph_node* a, tofu_graph_node* b);
Parameters:
- g - Graph to add node to (cannot be NULL)
- a - First operand node (cannot be NULL)
- b - Second operand node (cannot be NULL)
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Preconditions:
- a and b must be broadcastable (NumPy rules)
Behavior:
- Computes element-wise multiplication with broadcasting
- Implements backward pass for gradient computation
Example:
// Attention mechanism: scaled dot product
tofu_graph_node *qk = tofu_graph_matmul(g, q, k);
tofu_graph_node *scale = tofu_graph_param(g, scale_tensor);
tofu_graph_node *scaled = tofu_graph_mul(g, qk, scale);
tofu_graph_relu
Add ReLU activation node to graph.
tofu_graph_node* tofu_graph_relu(tofu_graph* g, tofu_graph_node* x);
Parameters:
- g - Graph to add node to (cannot be NULL)
- x - Input node (cannot be NULL)
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Behavior:
- Computes ReLU:
max(0, x) - Implements backward pass for gradient computation
- Gradient is 1 where
x > 0, else 0
Example:
// Hidden layer with ReLU
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
tofu_graph_node *h1_bias = tofu_graph_add(g, h1, b1);
tofu_graph_node *h1_relu = tofu_graph_relu(g, h1_bias);
tofu_graph_softmax
Add softmax activation node to graph.
tofu_graph_node* tofu_graph_softmax(tofu_graph* g, tofu_graph_node* x, int axis);
Parameters:
- g - Graph to add node to (cannot be NULL)
- x - Input node (cannot be NULL)
- axis - Axis along which to apply softmax
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Preconditions:
axis < x->value->ndim
Behavior:
- Computes softmax along specified axis (exp normalization)
- Implements backward pass for gradient computation
- Numerically stable (subtracts max before exp)
Example:
// Classification output layer
tofu_graph_node *logits = tofu_graph_matmul(g, h, W_out);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
Notes:
- Typically used for classification tasks
- Output values sum to 1.0 along specified axis
tofu_graph_layer_norm
Add layer normalization node to graph.
tofu_graph_node* tofu_graph_layer_norm(tofu_graph* g, tofu_graph_node* x,
tofu_graph_node* gamma, tofu_graph_node* beta,
int axis, double eps);
Parameters:
- g - Graph to add node to (cannot be NULL)
- x - Input node (cannot be NULL)
- gamma - Scale parameter node (can be NULL for no scaling)
- beta - Shift parameter node (can be NULL for no shift)
- axis - Axis along which to normalize
- eps - Small constant for numerical stability (typically 1e-5)
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Preconditions:
- axis < x->value->ndim
- eps > 0
Behavior:
- Normalizes:
(x - mean) / sqrt(variance + eps)
- Then applies:
gamma * normalized + beta (if gamma/beta non-NULL)
- Implements backward pass for gradient computation
Example:
// Transformer-style layer norm
tofu_graph_node *gamma = tofu_graph_param(g, gamma_tensor);
tofu_graph_node *beta = tofu_graph_param(g, beta_tensor);
tofu_graph_node *normalized = tofu_graph_layer_norm(g, x, gamma, beta, 1, 1e-5);
Notes:
- Common in transformer architectures
- Helps stabilize training of deep networks
tofu_graph_reshape
Add reshape node to graph.
tofu_graph_node* tofu_graph_reshape(tofu_graph* g, tofu_graph_node* x, int ndim, const int* dims);
Parameters:
- g - Graph to add node to (cannot be NULL)
- x - Input node (cannot be NULL)
- ndim - Number of dimensions for reshaped tensor
- dims - Array of new dimension sizes
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Preconditions:
- Product of dims must equal
x->value total elements
Behavior:
- View operation (no data copy) - reshaped tensor shares data with input
- Implements backward pass for gradient computation
Example:
// Flatten for fully connected layer
// Input: [batch, channels, height, width]
// Output: [batch, channels * height * width]
int flat_dim = channels * height * width;
tofu_graph_node *flat = tofu_graph_reshape(g, x, 2, (int[]){batch_size, flat_dim});
tofu_graph_transpose
Add transpose node to graph.
tofu_graph_node* tofu_graph_transpose(tofu_graph* g, tofu_graph_node* x, const int* axes);
Parameters:
- g - Graph to add node to (cannot be NULL)
- x - Input node (cannot be NULL)
- axes - Permutation array (can be NULL for reverse order)
Returns: Pointer to result node (graph owns, freed by tofu_graph_free)
Preconditions:
- If axes is non-NULL, it must be valid permutation of
[0, ..., ndim-1]
Behavior:
- If axes is NULL, reverses dimension order
- Implements backward pass for gradient computation
Example:
// Transpose matrix for weight matrix
tofu_graph_node *W_T = tofu_graph_transpose(g, W, NULL);
Loss Functions
Loss functions compute scalar values representing model error. They're typically the final nodes in a computation graph before calling tofu_graph_backward().
tofu_graph_mse_loss
Add mean squared error loss node to graph.
tofu_graph_node* tofu_graph_mse_loss(tofu_graph* g, tofu_graph_node* pred, tofu_graph_node* target);
Parameters:
- g - Graph to add node to (cannot be NULL)
- pred - Prediction node (cannot be NULL)
- target - Target/ground truth node (cannot be NULL)
Returns: Pointer to scalar loss node (graph owns, freed by tofu_graph_free)
Preconditions:
- pred and target must have same shape
Behavior:
- Computes:
mean((pred - target)^2)
- Returns scalar (average over all elements)
- Use for regression tasks
- Implements backward pass for gradient computation
Example:
// Regression task
tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
tofu_graph_node *target = tofu_graph_input(g, target_tensor);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// Compute gradients
tofu_graph_backward(g, loss);
See also: tofu_graph_ce_loss for classification
tofu_graph_ce_loss
Add cross-entropy loss node to graph.
tofu_graph_node* tofu_graph_ce_loss(tofu_graph* g, tofu_graph_node* pred, tofu_graph_node* target);
Parameters:
- g - Graph to add node to (cannot be NULL)
- pred - Prediction node (softmax probabilities) (cannot be NULL)
- target - Target/ground truth node (class indices or one-hot) (cannot be NULL)
Returns: Pointer to scalar loss node (graph owns, freed by tofu_graph_free)
Behavior:
- Computes:
-sum(target * log(pred))
- Returns scalar (average over batch)
- Use for classification tasks
- Numerically stable implementation
- Implements backward pass for gradient computation
Example:
// Classification task
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
tofu_graph_node *target = tofu_graph_input(g, target_tensor);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);
// Compute gradients
tofu_graph_backward(g, loss);
See also: tofu_graph_mse_loss for regression
Backward Pass
tofu_graph_backward
Perform backward pass (backpropagation) from loss node.
void tofu_graph_backward(tofu_graph* g, tofu_graph_node* loss);
Parameters:
- g - Graph containing loss node (cannot be NULL)
- loss - Loss node to backpropagate from (cannot be NULL)
Preconditions:
- loss must be scalar (single element tensor)
Behavior:
- Computes gradients for all nodes requiring gradient
- Populates node->grad for all PARAM nodes
- Uses reverse-mode automatic differentiation
- Call after forward pass, before optimizer step
- Gradients accumulate across multiple backward passes and from multiple computational paths
- Always call tofu_graph_zero_grad before each training iteration unless you intentionally want to accumulate gradients (e.g., across mini-batches)
Example:
// Training iteration
tofu_graph *g = tofu_graph_create();
// Forward pass
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
tofu_graph_node *target = tofu_graph_input(g, target_data);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// Backward pass
tofu_graph_backward(g, loss);
// Now W->grad contains ∂loss/∂W
tofu_tensor *W_grad = tofu_graph_get_grad(W);
Notes:
- Automatically builds topological sort for efficient gradient computation
- Violating preconditions triggers assert() and crashes
See also: tofu_graph_zero_grad to clear gradients, tofu_graph_get_grad to access gradients
Utilities
tofu_graph_get_value
Get forward pass result from graph node.
tofu_tensor* tofu_graph_get_value(tofu_graph_node* node);
Parameters:
- node - Graph node (cannot be NULL)
Returns: Pointer to result tensor (node owns, do NOT free)
Behavior:
- Returns tensor computed during forward pass
- Do NOT free returned tensor (node owns it)
Example:
tofu_graph_node *pred = tofu_graph_matmul(g, x, W);
tofu_tensor *pred_value = tofu_graph_get_value(pred);
// Print predictions
tofu_tensor_print(pred_value, "%.6f");
Warning: Do not free the returned tensor!
tofu_graph_get_grad
Get gradient from graph node.
tofu_tensor* tofu_graph_get_grad(tofu_graph_node* node);
Parameters:
- node - Graph node (cannot be NULL)
Returns: Pointer to gradient tensor (node owns, do NOT free), or NULL if no gradient
Behavior:
- Returns gradient computed during backward pass
- Returns NULL if backward hasn't been called yet
- Do NOT free returned tensor (node owns it)
Example:
tofu_graph_node *W = tofu_graph_param(g, weights);
// ... build graph and backward pass ...
tofu_tensor *W_grad = tofu_graph_get_grad(W);
if (W_grad) {
// Use gradient for parameter update
// (or use optimizer which handles this automatically)
}
Warning: Do not free the returned tensor!
tofu_graph_zero_grad
Zero out all gradients in graph.
void tofu_graph_zero_grad(tofu_graph* g);
Parameters:
- g - Graph to zero gradients for (cannot be NULL)
Behavior:
- Sets all node->grad tensors to zero
- Call before each training iteration to prevent gradient accumulation
- Does NOT free gradient tensors, just zeros values
Example:
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Zero gradients before forward pass
tofu_graph_zero_grad(g);
// Forward pass
// ... build graph ...
// Backward pass
tofu_graph_backward(g, loss);
// Update parameters
tofu_optimizer_step(optimizer);
}
Notes:
- Essential for correct training - prevents gradient accumulation
- Typically called via the optimizer's tofu_optimizer_zero_grad() function
See also: tofu_optimizer_zero_grad
Usage Patterns
Basic Training Loop
// Setup
tofu_graph *g = tofu_graph_create();
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){4, 3}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){3}, TOFU_FLOAT);
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
for (int batch = 0; batch < num_batches; batch++) {
// Zero gradients
tofu_optimizer_zero_grad(opt);
// Forward pass
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
tofu_graph_node *y = tofu_graph_matmul(g, x, W_node);
tofu_graph_node *out = tofu_graph_add(g, y, b_node);
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
tofu_graph_node *loss = tofu_graph_mse_loss(g, out, target);
// Backward pass
tofu_graph_backward(g, loss);
// Update parameters
tofu_optimizer_step(opt);
// Clear operations for next batch (keeps parameters)
tofu_graph_clear_ops(g);
}
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);
Multi-Layer Neural Network
// Define network architecture
typedef struct {
tofu_tensor *W1, *b1;
tofu_tensor *W2, *b2;
tofu_tensor *W3, *b3;
} Network;
// Forward pass function
tofu_graph_node* forward_pass(tofu_graph *g, tofu_graph_node *x, Network *net) {
// Layer 1
tofu_graph_node *W1 = tofu_graph_param(g, net->W1);
tofu_graph_node *b1 = tofu_graph_param(g, net->b1);
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1);
h1 = tofu_graph_add(g, h1, b1);
h1 = tofu_graph_relu(g, h1);
// Layer 2
tofu_graph_node *W2 = tofu_graph_param(g, net->W2);
tofu_graph_node *b2 = tofu_graph_param(g, net->b2);
tofu_graph_node *h2 = tofu_graph_matmul(g, h1, W2);
h2 = tofu_graph_add(g, h2, b2);
h2 = tofu_graph_relu(g, h2);
// Output layer
tofu_graph_node *W3 = tofu_graph_param(g, net->W3);
tofu_graph_node *b3 = tofu_graph_param(g, net->b3);
tofu_graph_node *out = tofu_graph_matmul(g, h2, W3);
out = tofu_graph_add(g, out, b3);
return out;
}
// Usage
tofu_graph *g = tofu_graph_create();
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *pred = forward_pass(g, x, &network);
tofu_graph_node *target = tofu_graph_input(g, target_data);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
tofu_graph_backward(g, loss);
Classification with Softmax and Cross-Entropy
// Classification network
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
// Logits
tofu_graph_node *logits = tofu_graph_matmul(g, x, W);
logits = tofu_graph_add(g, logits, b);
// Softmax (probabilities)
tofu_graph_node *probs = tofu_graph_softmax(g, logits, 1);
// Cross-entropy loss
tofu_graph_node *target = tofu_graph_input(g, target_data);
tofu_graph_node *loss = tofu_graph_ce_loss(g, probs, target);
// Backward and optimize
tofu_graph_backward(g, loss);
tofu_optimizer_step(optimizer);
Memory Management Best Practices
// 1. Create tensors for parameters (library manages data)
tofu_tensor *weights = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *bias = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);
// 2. Create graph and add parameters
tofu_graph *g = tofu_graph_create();
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
// 3. Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Create input tensors (user manages data)
float *batch_data = load_batch(epoch);
tofu_tensor *x_tensor = tofu_tensor_create(batch_data, 2, (int[]){32, 784}, TOFU_FLOAT);
// Build graph
tofu_graph_node *x = tofu_graph_input(g, x_tensor);
// ... forward pass ...
// Training step
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
// Clean up batch resources
tofu_tensor_free(x_tensor); // Free tensor structure
free(batch_data); // Free data buffer
// Clear operations (keeps parameters)
tofu_graph_clear_ops(g);
}
// 4. Cleanup (IMPORTANT ORDER!)
tofu_optimizer_free(opt); // Free optimizer first
tofu_graph_free(g); // Then free graph
tofu_tensor_free_data_too(weights); // Finally free parameter tensors
tofu_tensor_free_data_too(bias);
Efficient Batch Processing
tofu_graph *g = tofu_graph_create();
// Add parameters once (persists across batches)
tofu_graph_node *W = tofu_graph_param(g, weights);
tofu_graph_node *b = tofu_graph_param(g, bias);
for (int batch = 0; batch < num_batches; batch++) {
tofu_optimizer_zero_grad(opt);
// Add input for this batch
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
// Build forward graph
tofu_graph_node *out = tofu_graph_add(g, tofu_graph_matmul(g, x, W), b);
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
tofu_graph_node *loss = tofu_graph_mse_loss(g, out, target);
// Backward and update
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
// Clear operations for next batch (W and b are preserved)
tofu_graph_clear_ops(g);
}
Notes
Gradient Accumulation
Gradients accumulate by default. Always call tofu_graph_zero_grad() or tofu_optimizer_zero_grad() before each training iteration:
// CORRECT: Zero gradients before each iteration
for (int i = 0; i < num_iterations; i++) {
tofu_optimizer_zero_grad(opt); // Clear previous gradients
// ... forward and backward ...
}
// INCORRECT: Gradients accumulate indefinitely
for (int i = 0; i < num_iterations; i++) {
// ... forward and backward ...
// Gradients from all iterations accumulate!
}
Dynamic Graphs
Tofu uses dynamic computation graphs (define-by-run). The graph structure can change between iterations:
for (int epoch = 0; epoch < num_epochs; epoch++) {
if (epoch < 10) {
// Simple network
out = tofu_graph_matmul(g, x, W1);
} else {
// More complex network
h = tofu_graph_relu(g, tofu_graph_matmul(g, x, W1));
out = tofu_graph_matmul(g, h, W2);
}
tofu_graph_backward(g, loss);
tofu_graph_clear_ops(g); // Clear for next iteration
}
Error Checking
Most functions use assert() for precondition checking. In release builds with assertions disabled, violating preconditions leads to undefined behavior. Always ensure:
- Pointers are non-NULL
- Shapes are compatible
- Tensors are broadcastable
- Loss is scalar before calling backward
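In application code you can still verify the cheap invariants yourself before calling into the library. A minimal defensive sketch, assuming only functions and fields shown elsewhere in this reference (e.g. the tensor len field); the error message is illustrative:
// Confirm the loss node really is a scalar before backpropagating.
tofu_tensor *loss_value = tofu_graph_get_value(loss);
if (loss_value != NULL && loss_value->len == 1) {
    tofu_graph_backward(g, loss);
} else {
    fprintf(stderr, "loss node is not a scalar; skipping backward pass\n");
}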
Optimizer API Reference
The Optimizer API provides algorithms for updating trainable parameters based on computed gradients. Optimizers automatically collect parameters from the computation graph and apply update rules during training.
Table of Contents
- Data Structures
- Creating Optimizers
- Cleanup
- Training Operations
- Parameter Management
- Usage Patterns
- Hyperparameter Guidance
- Common Pitfalls
- Notes
Data Structures
tofu_optimizer
The optimizer structure that manages parameters and their update strategy.
struct tofu_optimizer {
tofu_optim_type type; // Optimizer type
tofu_graph* graph; // Associated computation graph
tofu_graph_node** params; // Array of parameter nodes
int num_params; // Number of parameters
int capacity_params; // Allocated capacity
double learning_rate; // Learning rate
void* state; // Optimizer state (momentum buffers, etc.)
tofu_optim_step_fn step_fn; // Parameter update function
};
Optimizer Types (tofu_optim_type)
Available optimization algorithms:
- TOFU_OPTIM_SGD - Vanilla Stochastic Gradient Descent
- TOFU_OPTIM_SGD_MOMENTUM - SGD with momentum
- TOFU_OPTIM_ADAM - Adam optimizer (future)
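The type field of an optimizer records which of these algorithms it applies. A small hedged sketch that only reads documented struct fields (the log messages are illustrative):
// Report which update rule this optimizer instance will use.
if (opt->type == TOFU_OPTIM_SGD_MOMENTUM) {
    printf("SGD with momentum, lr = %.4f\n", opt->learning_rate);
} else if (opt->type == TOFU_OPTIM_SGD) {
    printf("vanilla SGD, lr = %.4f\n", opt->learning_rate);
}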
Creating Optimizers
tofu_optimizer_sgd_create
Create SGD (Stochastic Gradient Descent) optimizer.
tofu_optimizer* tofu_optimizer_sgd_create(tofu_graph* g, double learning_rate);
Parameters:
- g - Computation graph containing parameters (cannot be NULL)
- learning_rate - Learning rate (step size) (must be > 0)
Returns: Pointer to newly allocated optimizer (caller owns, must call tofu_optimizer_free)
Preconditions:
- g must not be NULL
- learning_rate > 0
Behavior:
- Implements vanilla SGD: param = param - learning_rate * grad
- Automatically collects all PARAM nodes from graph
- Caller must call tofu_optimizer_free to free the optimizer
Algorithm:
for each parameter θ:
θ ← θ - η * ∇θL
where:
η = learning_rate
∇θL = gradient of loss w.r.t. parameter
Example:
tofu_graph *g = tofu_graph_create();
// Add parameters to graph
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_graph_node *W_node = tofu_graph_param(g, W);
// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
tofu_optimizer_zero_grad(opt);
// ... forward and backward pass ...
tofu_optimizer_step(opt);
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
Notes:
- Simple and robust, good baseline optimizer
- No momentum or adaptive learning rates
- May converge slowly on complex problems
- Violating preconditions triggers assert() and crashes
See also: tofu_optimizer_sgd_momentum_create for SGD with momentum
tofu_optimizer_sgd_momentum_create
Create SGD optimizer with momentum.
tofu_optimizer* tofu_optimizer_sgd_momentum_create(tofu_graph* g, double learning_rate, double momentum);
Parameters:
- g - Computation graph containing parameters (cannot be NULL)
- learning_rate - Learning rate (step size) (must be > 0)
- momentum - Momentum coefficient (typically 0.9) (must be >= 0 and < 1)
Returns: Pointer to newly allocated optimizer (caller owns, must call tofu_optimizer_free)
Preconditions:
- g must not be NULL
- learning_rate > 0
- 0 <= momentum < 1
Behavior:
- Implements SGD with momentum: velocity = momentum * velocity - learning_rate * grad, then param = param + velocity
- Momentum helps accelerate training and reduces oscillations
- Automatically collects all PARAM nodes from graph
- Caller must call tofu_optimizer_free to free the optimizer
Algorithm:
for each parameter θ:
v ← μ * v - η * ∇θL
θ ← θ + v
where:
η = learning_rate
μ = momentum
v = velocity (accumulated gradients)
∇θL = gradient of loss w.r.t. parameter
Note: This is mathematically equivalent to classical momentum
(v = μ*v + ∇θL, θ = θ - η*v) but incorporates the learning
rate into the velocity update rather than the parameter update.
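To make that equivalence concrete, here is a standalone arithmetic sketch (plain C, not library code) that traces one scalar parameter under both formulations with a constant learning rate:
// Illustration only: with constant eta, both formulations produce
// identical parameter values at every step (v equals -eta * vc).
double eta = 0.1, mu = 0.9, grad = 2.0;
double v  = 0.0, theta   = 1.0;  // Tofu form:      v = mu*v - eta*grad; theta += v
double vc = 0.0, theta_c = 1.0;  // classical form: vc = mu*vc + grad;   theta_c -= eta*vc
for (int t = 0; t < 3; t++) {
    v  = mu * v  - eta * grad;  theta   += v;
    vc = mu * vc + grad;        theta_c -= eta * vc;
}
// theta == theta_c after each iteration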
Example:
tofu_graph *g = tofu_graph_create();
// Add parameters
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_graph_node *W_node = tofu_graph_param(g, W);
// Create optimizer with momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
tofu_optimizer_zero_grad(opt);
// ... forward and backward pass ...
tofu_optimizer_step(opt);
}
// Cleanup
tofu_optimizer_free(opt);
Notes:
- Momentum helps escape local minima and speeds up convergence
- Typical momentum values: 0.9 (standard), 0.99 (high momentum)
- More effective than vanilla SGD for deep networks
- Violating preconditions triggers assert() and crashes
See also: tofu_optimizer_sgd_create for vanilla SGD
Cleanup
tofu_optimizer_free
Free optimizer and its state.
void tofu_optimizer_free(tofu_optimizer* opt);
Parameters:
- opt - Optimizer to free (can be NULL; no-op if NULL)
Behavior:
- Frees optimizer structure and internal state (momentum buffers, etc.)
- Does NOT free the graph or parameters (graph owns them)
- Safe to call multiple times (idempotent)
Cleanup Order:
// CORRECT order:
tofu_optimizer_free(opt); // 1. Free optimizer
tofu_graph_free(g); // 2. Free graph
tofu_tensor_free_data_too(weights); // 3. Free tensors
// INCORRECT order (may crash):
tofu_graph_free(g); // DON'T free graph before optimizer!
tofu_optimizer_free(opt); // Optimizer may access freed memory
Training Operations
tofu_optimizer_step
Perform one optimization step (update parameters).
void tofu_optimizer_step(tofu_optimizer* opt);
Parameters:
- opt - Optimizer (cannot be NULL)
Preconditions:
- opt must not be NULL
- Gradients must be computed (call tofu_graph_backward first)
Behavior:
- Updates all parameters using computed gradients
- Algorithm depends on optimizer type (SGD, SGD+momentum, etc.)
- Call after backward pass: forward → backward → step
- Does NOT zero gradients; call tofu_optimizer_zero_grad if needed
Training Sequence:
for (int iteration = 0; iteration < num_iterations; iteration++) {
// 1. Zero gradients
tofu_optimizer_zero_grad(opt);
// 2. Forward pass
tofu_graph_node *x = tofu_graph_input(g, input_data);
tofu_graph_node *pred = forward_pass(g, x);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// 3. Backward pass
tofu_graph_backward(g, loss);
// 4. Update parameters
tofu_optimizer_step(opt);
// 5. Clear operations for next iteration
tofu_graph_clear_ops(g);
}
Notes:
- Must call tofu_graph_backward() before this function
- Modifies parameter tensors in-place
- Violating preconditions triggers assert() and crashes
See also: tofu_graph_backward, tofu_optimizer_zero_grad
tofu_optimizer_zero_grad
Zero out all parameter gradients.
void tofu_optimizer_zero_grad(tofu_optimizer* opt);
Parameters:
- opt - Optimizer (cannot be NULL)
Preconditions:
- opt must not be NULL
Behavior:
- Sets gradients to zero for all tracked parameters
- Call before each training iteration to prevent gradient accumulation
- Equivalent to tofu_graph_zero_grad but works via the optimizer
Example:
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Zero gradients before forward pass
tofu_optimizer_zero_grad(opt);
// Forward pass
tofu_graph_node *pred = forward_pass(g, input);
tofu_graph_node *loss = compute_loss(g, pred, target);
// Backward pass
tofu_graph_backward(g, loss);
// Update parameters
tofu_optimizer_step(opt);
}
Notes:
- Essential for correct training - prevents gradient accumulation
- Must call before each training iteration
- Violating preconditions triggers assert() and crashes
See also: tofu_graph_zero_grad
Parameter Management
Most users won't need these functions - parameters are automatically collected during optimizer creation. These are useful for advanced use cases like dynamic network architectures.
tofu_optimizer_add_param
Manually add parameter node to optimizer.
int tofu_optimizer_add_param(tofu_optimizer* opt, tofu_graph_node* param);
Parameters:
- opt - Optimizer (cannot be NULL)
- param - Parameter node to track (cannot be NULL)
Returns: 0 on success, non-zero on error
Preconditions:
- opt and param must not be NULL
- param must be a PARAM node (requires gradient)
Behavior:
- Usually not needed - optimizer auto-collects params at creation
- Use if you need to add parameters dynamically
Example:
// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Add parameters dynamically (rare use case)
tofu_tensor *new_weight = tofu_tensor_zeros(2, (int[]){10, 5}, TOFU_FLOAT);
tofu_graph_node *W_new = tofu_graph_param(g, new_weight);
tofu_optimizer_add_param(opt, W_new);
Notes:
- Rarely needed - use only for dynamic architectures
- Violating preconditions triggers assert() and crashes
See also: tofu_optimizer_collect_params to scan graph for all params
tofu_optimizer_collect_params
Collect all parameter nodes from graph.
void tofu_optimizer_collect_params(tofu_optimizer* opt);
Parameters:
- opt - Optimizer (cannot be NULL)
Preconditions:
- opt must not be NULL
Behavior:
- Scans graph and adds all PARAM nodes to optimizer
- Called automatically during optimizer creation
- Use if graph structure changes and you need to rescan
- Clears existing parameter list before collecting
Example:
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Add more parameters to graph later
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){10, 5}, TOFU_FLOAT);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);
// Rescan graph to include new parameters
tofu_optimizer_collect_params(opt);
Notes:
- Rarely needed - parameters auto-collected at creation
- Use only if network structure changes dynamically
- Violating preconditions triggers assert() and crashes
Usage Patterns
Basic Training Loop
// Setup
tofu_graph *g = tofu_graph_create();
// Create parameters
tofu_tensor *W = tofu_tensor_zeros(2, (int[]){784, 10}, TOFU_FLOAT);
tofu_tensor *b = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);
// Add to graph
tofu_graph_node *W_node = tofu_graph_param(g, W);
tofu_graph_node *b_node = tofu_graph_param(g, b);
// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
for (int batch = 0; batch < num_batches; batch++) {
// 1. Zero gradients
tofu_optimizer_zero_grad(opt);
// 2. Forward pass
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
tofu_graph_node *h = tofu_graph_matmul(g, x, W_node);
tofu_graph_node *pred = tofu_graph_add(g, h, b_node);
// 3. Compute loss
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
tofu_graph_node *loss = tofu_graph_mse_loss(g, pred, target);
// 4. Backward pass
tofu_graph_backward(g, loss);
// 5. Update parameters
tofu_optimizer_step(opt);
// 6. Clear operations for next batch
tofu_graph_clear_ops(g);
}
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W);
tofu_tensor_free_data_too(b);
Training with Momentum
// Setup with momentum optimizer
tofu_graph *g = tofu_graph_create();
// Network parameters
tofu_tensor *W1 = tofu_tensor_zeros(2, (int[]){784, 128}, TOFU_FLOAT);
tofu_tensor *b1 = tofu_tensor_zeros(1, (int[]){128}, TOFU_FLOAT);
tofu_tensor *W2 = tofu_tensor_zeros(2, (int[]){128, 10}, TOFU_FLOAT);
tofu_tensor *b2 = tofu_tensor_zeros(1, (int[]){10}, TOFU_FLOAT);
// Add to graph
tofu_graph_node *W1_node = tofu_graph_param(g, W1);
tofu_graph_node *b1_node = tofu_graph_param(g, b1);
tofu_graph_node *W2_node = tofu_graph_param(g, W2);
tofu_graph_node *b2_node = tofu_graph_param(g, b2);
// Create optimizer with momentum
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
for (int batch = 0; batch < num_batches; batch++) {
tofu_optimizer_zero_grad(opt);
// Forward pass
tofu_graph_node *x = tofu_graph_input(g, batch_data[batch]);
// Layer 1
tofu_graph_node *h1 = tofu_graph_matmul(g, x, W1_node);
h1 = tofu_graph_add(g, h1, b1_node);
h1 = tofu_graph_relu(g, h1);
// Layer 2
tofu_graph_node *h2 = tofu_graph_matmul(g, h1, W2_node);
h2 = tofu_graph_add(g, h2, b2_node);
// Loss
tofu_graph_node *target = tofu_graph_input(g, batch_targets[batch]);
tofu_graph_node *loss = tofu_graph_mse_loss(g, h2, target);
// Backward and update
tofu_graph_backward(g, loss);
tofu_optimizer_step(opt);
tofu_graph_clear_ops(g);
}
}
// Cleanup
tofu_optimizer_free(opt);
tofu_graph_free(g);
tofu_tensor_free_data_too(W1);
tofu_tensor_free_data_too(b1);
tofu_tensor_free_data_too(W2);
tofu_tensor_free_data_too(b2);
Learning Rate Scheduling
Manual learning rate adjustment during training:
// Create optimizer
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.1);
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Reduce learning rate every 10 epochs
if (epoch % 10 == 0 && epoch > 0) {
opt->learning_rate *= 0.5;
printf("Epoch %d: Reduced learning rate to %.6f\n", epoch, opt->learning_rate);
}
// Training loop for this epoch
for (int batch = 0; batch < num_batches; batch++) {
tofu_optimizer_zero_grad(opt);
// ... forward, backward, step ...
}
}
Monitoring Gradients
Useful for debugging and understanding training dynamics:
// After backward pass, before optimizer step
tofu_tensor *W_grad = tofu_graph_get_grad(W_node);
// Compute gradient statistics
double grad_sum = 0.0;
double grad_max = -INFINITY;
for (int i = 0; i < W_grad->len; i++) {
float val;
TOFU_TENSOR_DATA_TO(W_grad, i, val, TOFU_FLOAT);
grad_sum += fabs(val);
if (fabs(val) > grad_max) grad_max = fabs(val);
}
printf("Gradient mean: %.6f, max: %.6f\n",
grad_sum / W_grad->len, grad_max);
// Now update parameters
tofu_optimizer_step(opt);
Gradient Clipping (Manual)
Prevent exploding gradients:
void clip_gradients(tofu_optimizer *opt, double max_norm) {
for (int i = 0; i < opt->num_params; i++) {
tofu_tensor *grad = tofu_graph_get_grad(opt->params[i]);
if (!grad) continue;
// Compute gradient norm
double norm = 0.0;
for (int j = 0; j < grad->len; j++) {
float val;
TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
norm += val * val;
}
norm = sqrt(norm);
// Clip if necessary
if (norm > max_norm) {
double scale = max_norm / norm;
for (int j = 0; j < grad->len; j++) {
float val;
TOFU_TENSOR_DATA_TO(grad, j, val, TOFU_FLOAT);
val *= scale;
TOFU_TENSOR_DATA_FROM(grad, j, val, TOFU_FLOAT);
}
}
}
}
// Usage in training loop
tofu_graph_backward(g, loss);
clip_gradients(opt, 1.0); // Clip to max norm of 1.0
tofu_optimizer_step(opt);
Hyperparameter Guidance
Learning Rate
The learning rate is the most important hyperparameter. It controls the step size of parameter updates.
Guidelines:
| Problem Type | Recommended Range | Notes |
|---|---|---|
| Small networks | 0.01 - 0.1 | Can use larger learning rates |
| Deep networks | 0.001 - 0.01 | Need smaller learning rates |
| Fine-tuning | 0.0001 - 0.001 | Very small to preserve learned features |
Common values:
- 0.1 - Starting point for small networks
- 0.01 - Default safe choice for most problems
- 0.001 - Deep networks, complex problems
- 0.0001 - Fine-tuning pre-trained models
Signs of incorrect learning rate:
- Too high: Loss diverges (increases), NaN values, training unstable
- Too low: Very slow convergence, loss decreases too slowly
Example - Finding good learning rate:
// Try multiple learning rates
double learning_rates[] = {0.001, 0.01, 0.1};
for (int lr_idx = 0; lr_idx < 3; lr_idx++) {
printf("\n=== Testing LR: %.4f ===\n", learning_rates[lr_idx]);
// Reset parameters
reinitialize_parameters(W, b);
// Create optimizer with this learning rate
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, learning_rates[lr_idx]);
// Train for a few epochs
for (int epoch = 0; epoch < 10; epoch++) {
// ... training loop ...
printf("Epoch %d, Loss: %.6f\n", epoch, loss_value);
}
tofu_optimizer_free(opt);
}
Momentum
Momentum helps accelerate convergence and dampen oscillations.
Guidelines:
| Scenario | Recommended Value | Effect |
|---|---|---|
| Default | 0.9 | Good balance for most problems |
| High momentum | 0.95 - 0.99 | Faster convergence, may overshoot |
| Low momentum | 0.5 - 0.8 | More stable, slower convergence |
| No momentum | 0.0 | Vanilla SGD, most stable but slowest |
Common values:
- 0.9 - Standard choice for most problems
- 0.95 - Deep networks, when convergence is slow
- 0.99 - Very deep networks (ResNet, Transformers)
- 0.5 - Noisy gradients, unstable training
Example:
// Standard momentum for deep network
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
// Higher momentum for very deep network
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.99);
// Low momentum for noisy gradients
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.5);
Batch Size Considerations
Batch size affects effective learning rate:
// Larger batches → more stable gradients → can use higher learning rate
int batch_size = 128;
double lr = 0.01;
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, lr);
// If you increase batch size, consider increasing learning rate proportionally
// batch_size = 256 → lr = 0.02
// batch_size = 512 → lr = 0.04
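If you want to codify that rule of thumb, a tiny helper is enough. This is a sketch of the linear-scaling heuristic described above, not a library function:
// Linear scaling heuristic: grow the learning rate with the batch size.
double scale_lr(double base_lr, int base_batch_size, int new_batch_size) {
    return base_lr * (double)new_batch_size / (double)base_batch_size;
}
// Example: scale_lr(0.01, 128, 512) == 0.04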
Learning Rate Schedules
Common strategies for adjusting learning rate during training:
Step Decay:
// Reduce learning rate every N epochs
if (epoch % 30 == 0 && epoch > 0) {
opt->learning_rate *= 0.1; // Reduce by 10x
}
Exponential Decay:
// Decay gradually every epoch
double initial_lr = 0.1;
double decay_rate = 0.96;
opt->learning_rate = initial_lr * pow(decay_rate, epoch);
Cosine Annealing:
// Smooth decay following cosine curve
double initial_lr = 0.1;
double min_lr = 0.001;
opt->learning_rate = min_lr + (initial_lr - min_lr) *
(1 + cos(M_PI * epoch / num_epochs)) / 2;
Training Tips
1. Start with a reasonable learning rate:
// Good defaults:
tofu_optimizer *opt_sgd = tofu_optimizer_sgd_create(g, 0.01);
tofu_optimizer *opt_momentum = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
2. Monitor loss and adjust:
double prev_loss = INFINITY;
for (int epoch = 0; epoch < num_epochs; epoch++) {
// ... training ...
// Check if loss is improving
if (loss_value > prev_loss * 1.1) {
printf("Loss increased! Consider reducing learning rate.\n");
}
prev_loss = loss_value;
}
3. Use learning rate warmup for large learning rates:
double target_lr = 0.1;
int warmup_epochs = 5;
for (int epoch = 0; epoch < num_epochs; epoch++) {
if (epoch < warmup_epochs) {
// Gradually increase learning rate
opt->learning_rate = target_lr * (epoch + 1) / warmup_epochs;
} else {
opt->learning_rate = target_lr;
}
// ... training ...
}
4. Weight decay (L2 regularization) - manual implementation:
double weight_decay = 0.0001;
void apply_weight_decay(tofu_optimizer *opt, double weight_decay) {
for (int i = 0; i < opt->num_params; i++) {
tofu_tensor *param = tofu_graph_get_value(opt->params[i]);
for (int j = 0; j < param->len; j++) {
float val;
TOFU_TENSOR_DATA_TO(param, j, val, TOFU_FLOAT);
val *= (1.0 - weight_decay * opt->learning_rate);
TOFU_TENSOR_DATA_FROM(param, j, val, TOFU_FLOAT);
}
}
}
// Use before optimizer step
tofu_graph_backward(g, loss);
apply_weight_decay(opt, 0.0001);
tofu_optimizer_step(opt);
Common Pitfalls
Forgetting to Zero Gradients
Problem:
// WRONG: Gradients accumulate indefinitely
for (int i = 0; i < num_iterations; i++) {
// forward, backward, step...
// Gradients keep accumulating!
}
Solution:
// CORRECT: Zero gradients each iteration
for (int i = 0; i < num_iterations; i++) {
tofu_optimizer_zero_grad(opt); // Clear gradients
// forward, backward, step...
}
Incorrect Cleanup Order
Problem:
// WRONG: Freeing graph before optimizer
tofu_graph_free(g); // Graph freed
tofu_optimizer_free(opt); // Optimizer tries to access freed graph!
Solution:
// CORRECT: Free optimizer before graph
tofu_optimizer_free(opt); // Free optimizer first
tofu_graph_free(g); // Then free graph
Learning Rate Too High
Symptoms:
- Loss becomes NaN
- Loss diverges (increases)
- Training unstable
Solution:
// Reduce learning rate by 10x
double new_lr = opt->learning_rate * 0.1;
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_create(g, new_lr);
Learning Rate Too Low
Symptoms:
- Loss decreases very slowly
- Training takes many epochs
- No progress after many iterations
Solution:
// Increase learning rate by 10x
double new_lr = opt->learning_rate * 10.0;
tofu_optimizer_free(opt);
opt = tofu_optimizer_sgd_create(g, new_lr);
Notes
Optimizer State Persistence
Optimizer state (like momentum buffers) persists across training iterations:
// Momentum accumulates across iterations
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
for (int epoch = 0; epoch < num_epochs; epoch++) {
// Momentum from previous epochs affects current updates
// ... training ...
}
Parameter Collection
Optimizers automatically collect parameters when created:
// All PARAM nodes are collected automatically
tofu_graph_node *W1 = tofu_graph_param(g, weights1);
tofu_graph_node *W2 = tofu_graph_param(g, weights2);
tofu_optimizer *opt = tofu_optimizer_sgd_create(g, 0.01);
// opt now tracks both W1 and W2
Memory Management
Optimizer owns its internal state but not the graph or parameters:
// Optimizer allocates momentum buffers (if using momentum)
tofu_optimizer *opt = tofu_optimizer_sgd_momentum_create(g, 0.01, 0.9);
// When freed, optimizer releases momentum buffers
tofu_optimizer_free(opt);
// Graph and parameters remain valid
// (must be freed separately)
Changelog
Version history and changes.
API Stability
Information about API stability guarantees and versioning.