
Rust Machine Learning Showdown: Candle vs. tch-rs in Production

Jeff Taakey
21+ Year CTO & Multi-Cloud Architect.

If you’ve been following the Rust ecosystem throughout 2025, you know that the “Rust for Machine Learning” narrative has shifted from “Is it possible?” to “Which tool should I use for production?”

While Python remains the undisputed king of training and research, Rust has carved out a massive niche in inference and deployment. The promise is simple: type safety, fearless concurrency, and massive performance gains without the overhead of the Python Global Interpreter Lock (GIL).

But here is the dilemma every mid-to-senior Rust developer faces today: Do you stick with the battle-tested bindings of tch-rs (Libtorch), or do you embrace the pure-Rust approach of Hugging Face’s Candle?

In this article, we aren’t just reading docs. We are going to build equivalent models in both, analyze the compilation/runtime differences, and look at the architectural trade-offs. By the end, you’ll know exactly which crate belongs in your Cargo.toml.


The Landscape in Late 2025

Before we write code, let’s understand the architectural differences. This is crucial because it dictates your build pipeline, Docker image sizes, and deployment strategy.

1. tch-rs (The Wrapper)

tch-rs provides Rust bindings for the C++ API of PyTorch (Libtorch).

  • Pros: Access to virtually every operation PyTorch supports. If it works in Python, it likely works here.
  • Cons: Heavy reliance on external C++ shared libraries. The build process can be painful (LNK2001 errors, anyone?).

2. Candle (The Native)

Candle is a minimalist ML framework written entirely in Rust by Hugging Face.

  • Pros: Lightweight, compiles to WASM, zero C++ dependencies (unless you enable CUDA), “rusty” API design.
  • Cons: Fewer implemented operators compared to PyTorch (though the gap is closing fast).

Here is a visual breakdown of how your code interacts with the hardware in both scenarios:

flowchart TD
    subgraph Candle ["Candle (Pure Rust)"]
        direction TB
        C_User["User Code"] --> C_Core["Candle Core"]
        C_Core --> C_Backends{Backends}
        C_Backends -- "feature: cuda" --> C_CUDA["CUDA Kernels"]
        C_Backends -- "feature: metal" --> C_Metal["Metal Kernels"]
        C_Backends -- "default" --> C_CPU["CPU / SIMD"]
    end
    subgraph Tch ["tch-rs (Bindings)"]
        direction TB
        T_User["User Code"] --> T_Rust["tch-rs Crate"]
        T_Rust -->|"FFI"| T_CPP["Libtorch (C++)"]
        T_CPP --> T_Impl["PyTorch Implementation"]
        T_Impl --> T_HW["CUDA / CPU / Metal"]
    end
    style C_Core fill:#dea,stroke:#333,stroke-width:2px
    style T_CPP fill:#add,stroke:#333,stroke-width:2px

Prerequisites and Setup

To follow along, ensure you have a modern Rust toolchain installed (1.80+ recommended).

Environment Setup

For tch-rs, you usually need to download Libtorch manually and point the LIBTORCH environment variable at it, although the crate's optional download-libtorch feature can fetch a matching build for you. For Candle, you just need Cargo.

Create a new project:

cargo new rust_ml_showdown
cd rust_ml_showdown

Update your Cargo.toml to include both for this comparison (in a real app, you’d pick one):

[package]
name = "rust_ml_showdown"
version = "0.1.0"
edition = "2021"

[dependencies]
# The Contender
tch = "0.18" # Verify latest version on crates.io
# The Challenger
candle-core = "0.8"
candle-nn = "0.8"
anyhow = "1.0"

Round 1: The “Hello World” of Tensors

Let’s look at the syntax. We will perform a simple matrix multiplication followed by a ReLU activation. This is the bread and butter of neural networks.

The tch-rs Approach

If you come from PyTorch, this will feel incredibly familiar.

// src/bin/tch_example.rs
use tch::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    // 1. Define device (CUDA if available, else CPU)
    let device = Device::cuda_if_available();
    println!("Running tch-rs on: {:?}", device);

    // 2. Create Tensors
    // A: 2x3 matrix
    let a = Tensor::from_slice(&[1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0])
        .view([2, 3])
        .to(device);
    
    // B: 3x2 matrix
    let b = Tensor::from_slice(&[0.1f32, 0.2, 0.3, 0.4, 0.5, 0.6])
        .view([3, 2])
        .to(device);

    // 3. Operations: Matmul -> ReLU
    // Note: tch uses method chaining heavily
    let result = a.matmul(&b).relu();

    // 4. Print result
    result.print();

    Ok(())
}

Observation: The API is imperative and eager. It feels like dynamic Python code but statically typed. Notice .view() and .to()—exact parallels to PyTorch.

The Candle Approach

Candle forces you to handle errors explicitly. There is no hidden panic if shapes mismatch; it returns a Result.

// src/bin/candle_example.rs
use candle_core::{Device, Tensor};
use anyhow::Result;

fn main() -> Result<()> {
    // 1. Define Device (explicit choice)
    let device = Device::Cpu; // Or Device::new_cuda(0)?;
    println!("Running Candle on: {:?}", device);

    // 2. Create Tensors
    // Tensor::new builds a 1-D tensor from the slice; reshape gives it its 2-D shape
    let a = Tensor::new(&[1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0], &device)?
        .reshape((2, 3))?;

    let b = Tensor::new(&[0.1f32, 0.2, 0.3, 0.4, 0.5, 0.6], &device)?
        .reshape((3, 2))?;

    // 3. Operations
    // Note: Explicit error handling with `?`
    let result = a.matmul(&b)?.relu()?;

    // 4. Print
    println!("Result:\n{}", result);

    Ok(())
}

Observation: Candle is more verbose regarding error handling (? everywhere). This is a good thing for production. In tch-rs, a shape mismatch typically surfaces as a panic, or in the worst case a crash inside the C++ runtime. In Candle, it’s a manageable Rust error.
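
To make this concrete, here is a minimal, hedged sketch (using the same candle-core dependency as above) of a deliberate shape mismatch surfacing as an ordinary Result instead of crashing the process:

// src/bin/candle_shape_error.rs (illustrative file name)
use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    let a = Tensor::new(&[1.0f32, 2.0, 3.0, 4.0], &device)?.reshape((2, 2))?;
    let b = Tensor::new(&[1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0], &device)?.reshape((3, 2))?;

    // (2, 2) x (3, 2) is invalid; matmul returns Err instead of panicking.
    match a.matmul(&b) {
        Ok(t) => println!("unexpected success: {t}"),
        Err(e) => eprintln!("recovered from shape error: {e}"),
    }
    Ok(())
}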


Round 2: Defining a Neural Network

Let’s step it up. How do we define a reusable model layer?

tch-rs: The nn::Module Style

tch-rs uses a “Path” pattern to register variables.

use tch::{nn, nn::Module, Device, Tensor};

struct SimpleNet {
    fc1: nn::Linear,
    fc2: nn::Linear,
}

impl SimpleNet {
    fn new(vs: &nn::Path, in_dim: i64, hidden_dim: i64, out_dim: i64) -> SimpleNet {
        SimpleNet {
            fc1: nn::linear(vs / "fc1", in_dim, hidden_dim, Default::default()),
            fc2: nn::linear(vs / "fc2", hidden_dim, out_dim, Default::default()),
        }
    }
}

// Forward pass trait
impl Module for SimpleNet {
    fn forward(&self, xs: &Tensor) -> Tensor {
        xs.apply(&self.fc1).relu().apply(&self.fc2)
    }
}
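
As a quick usage sketch (not from the original docs), here is roughly how you would instantiate and run this network with a VarStore. The 784/128/10 dimensions are illustrative placeholders, and SimpleNet refers to the struct defined above:

// Illustrative usage of the tch-rs SimpleNet defined above.
use tch::{nn, nn::Module, Device, Kind, Tensor};

fn main() {
    let device = Device::cuda_if_available();
    // The VarStore owns all trainable parameters; root() yields the Path
    // that SimpleNet::new uses to register its layers under "fc1"/"fc2".
    let vs = nn::VarStore::new(device);
    let net = SimpleNet::new(&vs.root(), 784, 128, 10);

    // A fake batch of one flattened 28x28 image.
    let input = Tensor::randn(&[1, 784], (Kind::Float, device));
    let logits = net.forward(&input);
    println!("output shape: {:?}", logits.size());
}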

Candle: The Struct-Based Style

Candle separates the variable builder (VarBuilder) from the model logic cleanly.

use candle_core::{Tensor, Result};
use candle_nn::{Linear, Module, VarBuilder, linear};

struct SimpleNet {
    fc1: Linear,
    fc2: Linear,
}

impl SimpleNet {
    fn new(vs: VarBuilder, in_dim: usize, hidden_dim: usize, out_dim: usize) -> Result<Self> {
        let fc1 = linear(in_dim, hidden_dim, vs.pp("fc1"))?;
        let fc2 = linear(hidden_dim, out_dim, vs.pp("fc2"))?;
        Ok(Self { fc1, fc2 })
    }

    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        let x = self.fc1.forward(xs)?;
        let x = x.relu()?;
        self.fc2.forward(&x)
    }
}
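
A comparable usage sketch for the Candle version, again with placeholder dimensions, using a VarMap-backed VarBuilder to supply (randomly initialised) parameters:

// Illustrative usage of the Candle SimpleNet defined above.
use candle_core::{DType, Device, Tensor};
use candle_nn::{VarBuilder, VarMap};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    // VarMap holds the parameters; VarBuilder hands them to the layers by name.
    let varmap = VarMap::new();
    let vb = VarBuilder::from_varmap(&varmap, DType::F32, &device);

    let net = SimpleNet::new(vb, 784, 128, 10)?;

    // A fake batch of one flattened 28x28 image.
    let input = Tensor::randn(0f32, 1f32, (1, 784), &device)?;
    let logits = net.forward(&input)?;
    println!("output shape: {:?}", logits.dims());
    Ok(())
}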

Verdict: The syntax is remarkably similar. However, Candle’s use of usize for dimensions (standard Rust) vs tch’s i64 (C++ heritage) makes Candle feel more native.


Round 3: Feature & Performance Comparison

This is where the decision is usually made. I’ve compiled a comparison based on current benchmarks and developer experience.

| Feature | tch-rs (Libtorch) | Candle (Hugging Face) |
| --- | --- | --- |
| Backend | C++ Libtorch bindings | Pure Rust (core) + CUDA/Metal kernels |
| Compilation Speed | 🐢 Slow (linker heavy) | 🐇 Fast |
| Binary Size | Huge (>100 MB with Libtorch) | Tiny (small static binary) |
| WASM Support | No (very difficult) | ✅ First-class citizen |
| Model Support | Excellent (anything PyTorch) | Good (LLMs, BERT, Whisper, SD) |
| Hugging Face Hub | Manual download logic | Integrated (hf-hub crate) |
| Developer Experience | Dynamic-ish, panics | Type-safe, Result-based |
| Deployment | Requires shared libs on host | Copy a single binary & run |

The “Build Time” Trap

When working with tch-rs, your CI/CD pipeline becomes complex. You must ensure the LIBTORCH environment variable points to a Libtorch build that matches the CUDA version on the runner. With Candle, you just run cargo build --release.
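
For comparison, here is a hypothetical Cargo.toml fragment showing the Candle side of GPU support; "cuda" and "metal" are the published candle-core feature names, and you would pick one rather than both:

# Hypothetical Cargo.toml fragment: GPU support in Candle is a cargo feature,
# not an external shared-library download.
[dependencies]
candle-core = { version = "0.8", features = ["cuda"] }    # NVIDIA GPUs
# candle-core = { version = "0.8", features = ["metal"] } # Apple Silicon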

Performance

  • Inference (CPU): Candle is often faster for specific workloads thanks to hand-tuned SIMD kernels and first-class support for quantized models (especially LLMs).
  • Inference (GPU): tch-rs still holds a slight edge for generic networks because Libtorch’s CUDA kernels have been tuned by NVIDIA and Meta for over a decade. However, for LLMs (Llama, Mistral), Candle’s custom kernels are competitive.

Best Practices and Common Pitfalls

1. Shape Checking

In ML, 90% of bugs are shape mismatches.

  • tch-rs: Use tensor.size() frequently to debug.
  • Candle: Use tensor.dims().
  • Tip: Create a macro to assert shapes in debug builds for both libraries (a minimal sketch follows below).
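
Here is a minimal sketch of such a macro. It is not from either library; it simply compares whatever dimension list you pass in (tensor.size() in tch-rs, tensor.dims() in Candle) against the expected shape, and compiles away in release builds:

// Debug-only shape assertion sketch. Pass `tensor.size()` (tch-rs, Vec<i64>)
// or `tensor.dims()` (Candle, &[usize]) as the first argument.
macro_rules! assert_shape {
    ($dims:expr, $($expected:expr),+ $(,)?) => {
        debug_assert_eq!(
            $dims.iter().map(|d| *d as usize).collect::<Vec<usize>>(),
            vec![$($expected as usize),+],
            "tensor shape mismatch"
        );
    };
}

fn main() {
    // Stand-in for tensor.dims() / tensor.size() so the sketch runs on its own.
    let dims: Vec<usize> = vec![2, 3];
    assert_shape!(dims, 2, 3); // no-op in release builds
    println!("shape ok");
}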

2. Loading Weights (The .safetensors Revolution)

Forget .bin (Pickle). Use .safetensors. Hugging Face’s safetensors format has a Rust reference implementation, and it is natively supported by Candle.

  • Candle can memory-map .safetensors files, leading to near-instant model loading (see the sketch below).
  • tch-rs can load them too, but it feels like a second-class citizen compared to native PyTorch checkpoints.
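
To illustrate the first point, a hedged sketch of memory-mapping a local weights file with Candle’s VarBuilder; "model.safetensors" is a placeholder path:

// Illustrative sketch: memory-mapping a local .safetensors file in Candle.
use candle_core::{DType, Device};
use candle_nn::VarBuilder;

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    // Unsafe because the mapped file must not be modified while it is in use.
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F32, &device)?
    };
    // `vb` would now be passed to a model constructor such as SimpleNet::new.
    let _ = vb;
    println!("weights memory-mapped");
    Ok(())
}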

3. Memory Management

  • tch-rs: Uses C++ shared_ptr under the hood. Rust’s Drop handles cleanup, but if you create cycles in graphs, you might leak.
  • Candle: Pure Rust ownership rules apply. It’s much harder to leak memory.

Example: Loading a Model from Hugging Face

This is the most common use case in 2025: downloading a pre-trained BERT or Llama model and running it.

Here is how clean it is in Candle:

// src/bin/hf_example.rs (requires adding the hf-hub crate to Cargo.toml)
use hf_hub::{api::sync::Api, Repo, RepoType};

fn main() -> anyhow::Result<()> {
    // 1. Get the file path from Hugging Face Hub
    let api = Api::new()?;
    let repo = api.repo(Repo::new("bert-base-uncased".to_string(), RepoType::Model));
    let weights_path = repo.get("model.safetensors")?;

    println!("Weights downloaded to: {:?}", weights_path);

    // 2. Load weights using Candle
    // Note: We need a VarBuilder to map names in safetensors to the model
    // let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[weights_path], DType::F32, &Device::Cpu)? };

    // From here, you would pass `vb` to your model struct's new() method.
    Ok(())
}

Doing this in tch-rs usually involves manually downloading the file using reqwest, ensuring it’s a format Libtorch understands, and then loading it.


Conclusion: Which One Should You Choose?

As we move into 2026, the recommendation is becoming clearer.

Choose tch-rs if:

  1. You are porting a complex, custom research model from Python code that uses obscure PyTorch operators.
  2. You need 100% numerical parity with PyTorch for validation purposes.
  3. Binary size and compile times are not a concern (e.g., on-premise dedicated servers).

Choose Candle if:

  1. You are deploying to production. This is the big one. Small binaries, no shared library headaches, Docker-friendly.
  2. You are working with LLMs (Llama, Mistral, Gemma) or standard Computer Vision models (ResNet, YOLO).
  3. You want to run on Edge devices or the Browser (WASM).
  4. You prefer idiomatic Rust (Results over Panics).

At Rust DevPro, we have largely migrated our inference microservices to Candle. The CI/CD simplification alone saved us hours of debugging linker errors.


Have you switched your ML pipeline to Rust yet? Let us know in the comments below!