Sunday, May 10, 2026

Step 01/09 · Prerequisites

What you need before you begin

cuda-oxide has specific version requirements for each dependency. Please make sure your system meets all of them before installing anything. The project is currently Linux-only (tested on Ubuntu 24.04).

Linux (Ubuntu 24.04)
Rust nightly
CUDA Toolkit 12.x+
LLVM 21+
Clang 21/libclang-common-21-dev
Git

ⓘ Why LLVM 21?
Simple kernels may work with LLVM 20, but kernels targeting Hopper or Blackwell (TMA, tcgen05, WGMMA) require llc from LLVM 21 or later. This is a hard requirement, not a recommendation.

Check your current CUDA version to ensure compatibility.

nvcc --version

Step 02/09 · Install Rust nightly

Set up the Rust nightly toolchain

cuda-oxide requires Rust nightly plus two extra components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository and is installed automatically the first time you run a build there.
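The pinned toolchain file likely looks something like the sketch below; check the repository's actual rust-toolchain.toml for the authoritative contents.

```toml
# Sketch of a pinned rust-toolchain.toml (assumed layout)
[toolchain]
channel = "nightly-2026-04-03"
components = ["rust-src", "rustc-dev"]
```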

If you need to install it manually:

# Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03

# Add required components
rustup component add rust-src rustc-dev \
  --toolchain nightly-2026-04-03

# Confirm the toolchain is active
rustup show

ⓘ Why these components?
rustc-dev exposes the internal compiler APIs that custom codegen backends hook into. rust-src is required so the compiler can find and compile its own standard library sources for device targets.

Step 03/09 · Install LLVM 21

Install LLVM 21 with the NVPTX backend

The cuda-oxide pipeline emits textual LLVM IR (.ll files) and hands it to an external llc binary, which generates PTX. This requires LLVM 21 or later with the NVPTX backend enabled.

# Ubuntu/Debian
sudo apt install llvm-21

# Verify the NVPTX backend is present
llc-21 --version | grep nvptx

The pipeline auto-detects llc-22 and llc-21 on your PATH, in that order. Set an environment variable to pin a specific binary.

# Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21
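The selection logic can be modeled in plain Rust; this is a sketch of the behavior described above, not the pipeline's actual code:

```rust
use std::env;

// Sketch of the assumed llc selection: an explicit CUDA_OXIDE_LLC wins;
// otherwise the first of llc-22, llc-21 found on PATH is used.
fn pick_llc(on_path: &[&str]) -> Option<String> {
    if let Ok(pinned) = env::var("CUDA_OXIDE_LLC") {
        return Some(pinned);
    }
    ["llc-22", "llc-21"]
        .into_iter()
        .find(|cand| on_path.contains(cand))
        .map(|s| s.to_string())
}

fn main() {
    // With both versions installed, llc-22 takes precedence.
    println!("{:?}", pick_llc(&["llc-21", "llc-22"]));
}
```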

⚠ Common errors
If nvptx does not appear in the output of llc-21 --version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository instead of the distribution's default package, which may omit the GPU backend.

Step 04/09 · Install Clang

Installing Clang 21 for the cuda-bindings crate

The cuda-bindings crate uses bindgen to generate FFI bindings for cuda.h at build time. bindgen needs libclang, and in particular requires Clang's own resource directory (which contains stddef.h). A bare libclang1-* runtime package is not enough.

# Install the full clang-21 package (includes resource headers)
sudo apt install clang-21

# Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev

⚠ Symptoms of a missing libclang
If you install only the runtime and no headers, the host build fails during bindgen with a cryptic 'stddef.h' file not found error. Running cargo oxide doctor (next step) catches this before you attempt the build.

Step 05/09 · Install cargo-oxide

Clone the repository and install cargo-oxide

cargo-oxide is the Cargo subcommand that drives the entire build pipeline: cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.

Inside the repository (to try the examples):

git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide

# cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd

Outside the repository (in your own project):

# Install globally from the git source
cargo install \
  --git https://github.com/NVlabs/cuda-oxide.git \
  cargo-oxide

# On first run, cargo-oxide fetches and builds the codegen backend

Then run the built-in health check to confirm all prerequisites are met.

cargo oxide doctor

ⓘ What the doctor checks
It validates the Rust toolchain (nightly, rust-src, rustc-dev), the CUDA toolkit, the LLVM version and NVPTX support, the Clang/libclang headers, and the codegen backend binaries. Fix any red items before continuing.

Step 06/09 · Run your first kernel

Building and running the vecadd sample

The canonical first example is vecadd: a vector-addition kernel that adds two arrays of 1,024 f32 values on the GPU and validates the results on the host.

# Build and run end-to-end
cargo oxide run vecadd

If everything is configured correctly, you should see output like:

✓ SUCCESS: All 1024 elements correct!

To see the full compilation pipeline from Rust MIR through each Pliron dialect to PTX, run:

# Print the full Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX trace
cargo oxide pipeline vecadd

To debug with cuda-gdb:

cargo oxide debug vecadd --tui

ⓘ Output artifacts
A successful build generates two files: target/debug/vecadd (the host binary) and target/debug/vecadd.ptx (the device code). The host binary loads the PTX file through the CUDA driver at runtime.

Step 07/09 · Write the kernel

Write your own #[kernel] function

Kernel functions are annotated with #[kernel]. Use DisjointSlice<T> for mutable output and &[T] for read-only input. To get a thread's unique hardware index, call thread::index_1d().

use cuda_device::{kernel, thread, DisjointSlice};

// Tier 1 safety: race-free by construction, no `unsafe` needed.
// DisjointSlice::get_mut() only accepts a ThreadIndex,
// a hardware-derived opaque type guaranteeing unique writes per thread.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}

ⓘ Tier 1 safety: how it works
ThreadIndex is an opaque newtype over usize that can only be created from hardware intrinsic registers (threadIdx, blockIdx, blockDim). Every thread gets a unique value, and because DisjointSlice::get_mut() only accepts a ThreadIndex, writes are structurally conflict-free: no unsafe anywhere in the kernel.
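The guarantee can be illustrated with a host-side Rust sketch of the pattern. This is simplified and hypothetical; the real types live in cuda_device and derive the index from GPU registers.

```rust
// Host-side sketch of the Tier 1 pattern: an index type that callers
// cannot forge, so get_mut() hands out at most one &mut per index.
pub struct ThreadIndex(usize); // private field: no arbitrary construction

impl ThreadIndex {
    // On the GPU this would come from threadIdx/blockIdx/blockDim;
    // here we simulate one unique index per "thread".
    pub fn from_hardware(i: usize) -> Self { ThreadIndex(i) }
    pub fn get(&self) -> usize { self.0 }
}

pub struct DisjointSlice<'a, T>(pub &'a mut [T]);

impl<'a, T> DisjointSlice<'a, T> {
    pub fn get_mut(&mut self, idx: ThreadIndex) -> Option<&mut T> {
        self.0.get_mut(idx.get())
    }
}

fn main() {
    let mut data = [0.0f32; 4];
    let mut out = DisjointSlice(&mut data);
    // Each simulated thread writes only the element at its own index.
    for i in 0..4 {
        let idx = ThreadIndex::from_hardware(i);
        if let Some(elem) = out.get_mut(idx) {
            *elem = i as f32 * 2.5;
        }
    }
    assert_eq!(data, [0.0, 2.5, 5.0, 7.5]);
}
```

Because the field of ThreadIndex is private, safe code cannot conjure an index pointing at another thread's element, which is the whole basis of the Tier 1 claim.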

Step 08/09 · Launch from the host

Launching the kernel from host code

Host and device code live in the same .rs file. The host side uses CudaContext, DeviceBuffer, and the cuda_launch! macro to manage GPU memory and launch the kernel.

use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};

fn main() {
    // Initialize GPU context on device 0
    let ctx    = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, "scale_example").unwrap();

    // Upload input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input  = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Dispatch the kernel; LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Copy the result back to the host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
    println!("✓ Kernel ran successfully!");
}

ⓘ How cuda_launch! works
The macro flattens the argument list (slices, scalars, captured closures) into PTX kernel parameters and dispatches the kernel on the specified stream. No manual argument marshaling is required.
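A toy model of that flattening step is sketched below. The names and layout are entirely hypothetical (the real macro targets CUDA driver launch parameters); it only illustrates the idea that each logical argument lowers to one or more raw kernel parameters.

```rust
// Toy model: each launch argument is lowered to raw kernel parameters,
// e.g. a slice becoming a (pointer, length) pair. Hypothetical layout.
enum LaunchArg {
    Scalar(f32),
    Slice { ptr: usize, len: usize },
}

fn flatten(args: &[LaunchArg]) -> Vec<u64> {
    let mut params = Vec::new();
    for arg in args {
        match arg {
            LaunchArg::Scalar(x) => params.push(x.to_bits() as u64),
            LaunchArg::Slice { ptr, len } => {
                params.push(*ptr as u64);
                params.push(*len as u64);
            }
        }
    }
    params
}

fn main() {
    // One slice (two raw params) plus one scalar (one raw param)
    let params = flatten(&[
        LaunchArg::Slice { ptr: 0x1000, len: 1024 },
        LaunchArg::Scalar(2.5),
    ]);
    assert_eq!(params.len(), 3);
}
```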

Step 09/09 · Next steps

What to explore next

Your cuda-oxide setup is now complete. Here are the highest-value paths to explore next, ordered by complexity.

  • Generic kernels with monomorphization: try the generic example (cargo oxide run generic) to see how fn scale<T: Copy> compiles each type into a separate PTX kernel.
  • Closure capture: the host_closure example shows how a move |x: f32| x * factor closure is scalarized and passed automatically as PTX kernel parameters.
  • Asynchronous GPU execution: cuda_launch_async! returns a lazy DeviceOperation that runs on .sync() or .await. See the async_mlp and async_vecadd examples.
  • Shared memory and warp intrinsics: these require scoped unsafe blocks with documented safety contracts. See Tier 2 of the Safety Model documentation.
  • GEMM at the speed of light: the gemm_sol example reaches 868 TFLOPS (58% of cuBLAS SoL) on a B200 using cta_group::2, CLC, and a four-stage pipeline.
  • Blackwell tensor cores: the tcgen05 example targets SM_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.

ⓘ Known limitations of v0.1.0
index_2d(stride) currently has a documented soundness hole: if threads in the same kernel call it with different stride values, two threads can obtain &mut T to the same element without any unsafe. Until the fix lands (lifting stride to a type parameter), bind the stride once with let and reuse that binding at all call sites.

Full documentation: nvlabs.github.io/cuda-oxyde · Source: github.com/NVlabs/cuda-oxyde
