
This post, the ninth in our series on performance profiling and optimization in PyTorch, aims to highlight the critical role of performance analysis and optimization in machine learning development. Throughout the series, we have reviewed a wide variety of practical tools and techniques for analyzing and boosting the runtime performance of PyTorch-based AI/ML models. Our goals have been twofold:

  1. To emphasize the importance of routine analysis and optimization of AI/ML workloads.
  2. To demonstrate the accessibility of a wide variety of tools and techniques for analyzing and optimizing AI/ML runtime performance. You do not need to be a CUDA expert to meaningfully improve model performance and reduce compute costs.

In this post, we consider the use of CUDA Streams, a powerful feature of NVIDIA's CUDA programming model that offers a sophisticated way of overlapping and concurrently running GPU operations. Typically, we associate the training workload of an AI/ML model with a single monolithic (a.k.a. "unbreakable") computation graph G. When running on a GPU, however, there are scenarios where the graph can be broken down into two distinct subgraphs G1 and G2, where G = G2∘G1. In such cases, CUDA streams allow us to "pipeline" the computation graph, i.e., to program the training step so that G1 (on input batch n+1) runs in parallel with G2 (on the nth output of G1), as sketched in the minimal example following the list below. This technique is particularly useful when:

  • Neither subgraph fully utilizes the GPU when run on its own.
  • The two subgraphs have similar computational costs (i.e., neither one dominates the runtime).
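
The snippet below is a minimal, self-contained sketch of this pattern, using two stand-in nn.Linear subgraphs and random input batches (the module sizes, batch shape, and step count are all illustrative assumptions); the examples later in the post flesh out the real use cases.

import torch

# stand-in subgraphs: G1 is frozen, G2 is the trained portion
g1 = torch.nn.Linear(64, 64).cuda()
g2 = torch.nn.Linear(64, 64).cuda()

g1_stream = torch.cuda.Stream()
g2_stream = torch.cuda.Stream()

g1_out = None
for step in range(100):
    if g1_out is not None:
        with torch.cuda.stream(g2_stream):
            # G2 consumes G1's output from the previous step
            g2_stream.wait_stream(g1_stream)
            out = g2(g1_out)
    with torch.cuda.stream(g1_stream):
        # a stand-in input batch, created directly on the GPU
        batch = torch.rand(32, 64, device='cuda')
        with torch.no_grad():
            g1_out = g1(batch)
        # mark g1_out as produced on g1_stream
        g1_out.record_stream(g1_stream)

torch.cuda.synchronize()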

We examine two common scenarios where such pipelining can be applied:

  1. Training or fine-tuning partial models:
    It is common to freeze a pre-trained backbone (e.g., a feature extractor or encoder) and train only the model head (e.g., a decoder). Since the frozen backbone does not depend on gradients from the head, the two can be run concurrently.
  2. Offloading data preprocessing to the GPU:
    Moving data preprocessing to the GPU is a common way of addressing bottlenecks in the data input pipeline (also known as GPU starvation). Prepending the preprocessing operations to the model graph improves performance, but additional gains can be achieved by running the preprocessing on a separate CUDA stream, in parallel with the model execution. This is especially useful when the preprocessing is non-trivial relative to the model computation.

To facilitate the discussion, we define two toy training scripts and measure their training performance under a number of scenarios. The experiments were run on an Amazon EC2 g5.2xlarge instance (with an NVIDIA A10G GPU and 8 vCPUs) using a PyTorch (2.6) Deep Learning AMI (DLAMI).

Note: The code snippets we share are for demonstration purposes only. Do not rely on their correctness or optimality. The impact of using CUDA streams depends heavily on the model architecture and system configuration; we recommend that you run your own profiling and experimentation before integrating CUDA streams (or any other tool or technique mentioned here) into your workflow.

Part 1: Pipelining an Encoder-Decoder Model

The first use case we explore involves a CNN-based image segmentation model consisting of a fixed (pre-trained) encoder and a trainable decoder. In this scenario, the encoder weights are frozen and unaffected by backpropagation, so the encoder can be run independently of the decoder training. In this section, we assess the impact of using CUDA streams to pipeline the training step.

A Toy Image Segmentation Training Experiment

We begin by defining a simple CNN-based image encoder and its corresponding decoder.

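Below is a minimal sketch of such an encoder-decoder pair; the channel widths and the img_size and num_classes values are assumptions chosen for illustration.

import torch
import torch.nn as nn

img_size = 256
num_classes = 10

# a simple convolutional feature extractor (to be frozen during training)
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# a trainable decoder mapping features to per-pixel class scores
decoder = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, num_classes, kernel_size=1),
)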

Next, we create a synthetic dataset of random images and segmentation maps.

from torch.utils.data import DataLoader
from torchvision.datasets.vision import VisionDataset

# A dataset with random images and per-pixel labels
class FakeDataset(VisionDataset):
    def __init__(self):
        super().__init__(root=None)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                            dtype=torch.uint8)

        # create a random label map
        target = torch.randint(0, num_classes, (img_size, img_size))

        return img, target

    def __len__(self):
        return self.size

train_set = FakeDataset()

train_loader = DataLoader(
    dataset=train_set,
    batch_size=8,
    num_workers=8
)

Finally, we define the loss function, optimizer, and training loop. Note that we freeze the encoder weights and train only the decoder.

import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(decoder.parameters())

# Freeze the encoder weights
encoder.requires_grad_(False)
encoder.eval().to(device)

decoder.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True).float()
    labels = data[1].to(device=device, non_blocking=True)
    optimizer.zero_grad()
    with torch.no_grad():
        features = encoder(inputs)
    output = decoder(features)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

The baseline training script achieves an average throughput of 83 steps per second, with an average GPU utilization of 85%.

Pipelining model execution using CUDA streams

The revised version of the training loop, shown below, introduces two CUDA streams: one that runs the encoder and one that trains the decoder. Each iteration performs two operations concurrently:

  1. Train the decoder on the image features and labels of batch n.
  2. Run the encoder on input batch n+1 to generate its image features.

encoder_stream = torch.cuda.Stream()
decoder_stream = torch.cuda.Stream()

# initialize the features to None
features = None

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device, non_blocking=True).float()
    labels_next = data[1].to(device, non_blocking=True)

    if features is not None:
        with torch.cuda.stream(decoder_stream):
            # wait for the encoder stream to produce the features
            decoder_stream.wait_stream(encoder_stream)

            optimizer.zero_grad()
            output = decoder(features)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

    with torch.cuda.stream(encoder_stream):
        with torch.no_grad():
            features = encoder(inputs)
        # record that features was produced on encoder_stream
        features.record_stream(encoder_stream)

    labels = labels_next

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This modification delivers a 9.6% speedup, with an average throughput of 91 steps per second. This is a significant improvement, especially considering that the baseline already exhibits high (85%) GPU utilization.

Pipeline sensitivity to workload properties

The effectiveness of pipelining with CUDA streams depends heavily on the details of the training workload and the runtime environment. If the encoder is significantly larger than the decoder (or vice versa), pipelining may yield little benefit or even hurt performance. Conversely, when the GPU is underutilized, pipelining tends to provide greater gains.

To illustrate this dependency, we re-ran the experiment with a variety of batch sizes. The results are summarized below.
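
A sketch of such a sweep is shown below; run_training is a hypothetical helper wrapping one of the training loops above, and the batch-size values are assumptions.

# run_training is a hypothetical helper that wraps the (baseline or
# pipelined) training loop above and returns steps per second
for batch_size in [2, 4, 8, 16, 32]:
    train_loader = DataLoader(train_set, batch_size=batch_size,
                              num_workers=8)
    throughput = run_training(train_loader)
    print(f'batch size {batch_size}: {throughput:.2f} steps/sec')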

The effect of pipelining with CUDA streams on throughput (by author)

Larger batch sizes reduce the benefit of pipelining. This is likely because larger batch sizes naturally lead to higher (more efficient) GPU utilization, leaving less room for improvement through concurrent execution.

Part 2: Offloading Data Augmentations

In this section, we apply the use of CUDA streams to the acceleration of data augmentation. In previous blog posts (e.g., here and here), we studied the problem of bottlenecks in the data input pipeline from various perspectives and reviewed several techniques for diagnosing and addressing them. A common cause of these bottlenecks is CPU resource exhaustion: the CPU is unable to keep up with the computational demands of the preprocessing pipeline. The result is GPU starvation, a situation in which the expensive GPU sits idle, waiting for data to arrive.

One effective solution is to offload heavy data preprocessing to the GPU. Here we take this technique a step further, running the augmentations on a dedicated CUDA stream so that they execute concurrently with model training.

A Toy Image Classification Training Experiment

We begin by defining a simple CNN-based image classification model.

import torch
import torch.nn as nn

img_size = 256
num_classes = 10
model = nn.Sequential(
    # start with a 256x256 input image
    nn.Conv2d(3, 16, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=2, stride=2),  # 2x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=2, stride=2),  # 4x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=2, stride=2),  # 8x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, kernel_size=2, stride=2),  # 16x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, kernel_size=2, stride=2),  # 32x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 1024, kernel_size=2, stride=2),  # 64x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(1024, 2048, kernel_size=2, stride=2),  # 128X downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(2048, 4096, kernel_size=2, stride=2),  # 256X
    nn.Flatten(),
    nn.Linear(4096, num_classes)
)
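
The stacked stride-2 convolutions downsample the 256x256 input to a 1x1 spatial map with 4096 channels, which is why the final Linear layer takes 4096 input features. A quick shape check (the batch size of 2 is an arbitrary assumption):

x = torch.rand(2, 3, img_size, img_size)
print(model(x).shape)  # torch.Size([2, 10])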

Next, we create a synthetic dataset with an augmentation pipeline that is deliberately designed to cause a severe performance bottleneck.

import random
from torch.utils.data import DataLoader
import torchvision.transforms.v2 as T
from torchvision.datasets.vision import VisionDataset
import torchvision.transforms.v2.functional as F
import torchvision.ops as ops

# A dataset with random images and labels
class FakeDataset(VisionDataset):
    def __init__(self, transform=None):
        super().__init__(root=None, transform=transform)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                           dtype=torch.uint8)
        # create a random label
        target = torch.randint(0, num_classes, (1, ))

        if self.transform:
            # apply transformations
            img = self.transform(img)

        return img, target

    def __len__(self):
        return self.size

augmentations = T.Compose([
    T.ToDtype(torch.float32),
    T.RandomCrop(img_size//2),
    T.Resize(img_size),
    T.RandomRotation(degrees=45.0),
    T.GaussianBlur(kernel_size=7),
    T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
])

train_set = FakeDataset(transform=augmentations)

train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8
)

Finally, we define the loss function, optimizer, and training loop.

import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())

model.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

Running this baseline script results in an average throughput of 20.41 steps per second, with a GPU utilization of only 42%. The heavy data augmentations are choking the CPU, leading to GPU starvation. For more on detecting bottlenecks in the data input pipeline, see my earlier post.

Offloading the data augmentations to the GPU

To address the performance bottleneck in the data input pipeline, we move the augmentations onto the GPU.

The first step is to define custom data transforms that apply the random crops and rotations per sample within the batch. This is necessary because the built-in torchvision transforms apply the same augmentation across the entire batch, losing the per-sample randomness we had on the CPU.

We implement a batched RandomCrop transform using the roi_align operator.

class BatchRandomCrop(T.Transform):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size

    def transform(self, img: torch.Tensor, params: dict):
        batch_size, _, original_height, original_width = img.shape
        device = img.device
        max_top = original_height - self.output_size
        max_left = original_width - self.output_size

        # generate random top and left coords for each image in the batch
        random_top = torch.randint(0, max_top + 1, (batch_size,),
                                   device=device, dtype=torch.float32)
        random_left = torch.randint(0, max_left + 1, (batch_size,),
                                    device=device, dtype=torch.float32)

        image_indices = torch.arange(batch_size, device=device,
                                     dtype=torch.float32)

        boxes = torch.stack([
            image_indices,
            random_left,
            random_top,
            random_left + self.output_size,
            random_top + self.output_size
        ], dim=1)

        cropped_batch = ops.roi_align(
            img,
            boxes,
            output_size=self.output_size
        )
        return cropped_batch
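
As a quick sanity check, we can invoke the transform method directly on a random batch (the batch shape below is an illustrative assumption):

imgs = torch.rand(4, 3, img_size, img_size)
crop = BatchRandomCrop(output_size=img_size//2)
out = crop.transform(imgs, params={})
print(out.shape)  # torch.Size([4, 3, 128, 128])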

We implement a batched RandomRotation transform by iterating over all of the images in the batch and applying a random rotation to each one. Note that this version is not vectorized; a fully vectorized implementation would require more effort.

class BatchRandomRotation(T.Transform):
    def __init__(self, degrees):
        super().__init__()
        self.degrees = degrees

    def transform(self, inpt: torch.Tensor, params: dict):
        # split the batch into a list of individual images
        images = list(torch.unbind(inpt, dim=0))

        augmented_images = []
        for img_tensor in images:
            # generate a random angle
            angle = random.uniform(-self.degrees, self.degrees)

            # apply the rotation to the single image
            transformed_img = F.rotate(
                img_tensor,
                angle=angle
            )
            augmented_images.append(transformed_img)

        # stack the transformed images
        return torch.stack(augmented_images, dim=0)

Next, we define a batch_transform that mimics the CPU-based augmentation pipeline defined above:

batch_transform = T.Compose([
    T.ToDtype(torch.float32),
    BatchRandomCrop(img_size//2),
    T.Resize(img_size),
    BatchRandomRotation(degrees=45.0),
    T.GaussianBlur(kernel_size=7),
    T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
]) 

Finally, we reset the dataset's transform and update the training loop to apply the new batch_transform:

train_set = FakeDataset(transform=None)

train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8
)

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()

    # apply augmentations
    inputs = batch_transform(inputs)

    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

The updated training script improves the throughput to 35.22 steps per second, a 72.57% speedup over the baseline result.

Pipelining the augmentations using CUDA streams

Next, we pipeline the augmentation and training steps using two separate CUDA streams: one for the data transformation and one for the model training. Each iteration of the loop performs two concurrent operations:

  1. Train the model on augmented batch n.
  2. Perform the GPU-based data augmentations on batch n+1.

transform_stream = torch.cuda.Stream()
model_stream = torch.cuda.Stream()

# initialize the transformed batch to None
transformed = None

for idx, data in enumerate(train_loader):
    inputs = data[0]
    labels_next = data[1]

    if transformed is not None:
        with torch.cuda.stream(model_stream):
            labels = labels.to(device, non_blocking=True).squeeze()
            # wait for the augmented batch to be ready
            model_stream.wait_stream(transform_stream)
            optimizer.zero_grad()
            output = model(transformed)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

    with torch.cuda.stream(transform_stream):
        inputs = inputs.to(device, non_blocking=True)
        transformed = batch_transform(inputs)
        # record that the tensor was produced on transform_stream
        transformed.record_stream(transform_stream)

    labels = labels_next

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This further improves the throughput to 38.82 steps per second, a 10.2% improvement over the serialized solution and a 90.20% improvement over the original baseline.

Pipeline sensitivity to workload properties

As we saw in Part 1, the benefit of pipelining with CUDA streams varies with the details of the workload. The table below captures the results for several different batch sizes.

The effect of pipelining with CUDA streams on throughput (by author)

Larger batch sizes make the GPU offloading more effective, significantly improving performance. At the same time, the gains from pipelining decrease. This is likely because larger batch sizes increase GPU efficiency, reducing the opportunity for overlapping execution.

Summary

When it comes to running AI/ML workloads, every millisecond counts. In this post we investigated the impact of pipelining an AI/ML training step using CUDA streams in two common scenarios: partial-model training and GPU offloading of data augmentations. In both cases, the pipelined solution outperformed the serialized implementation, though the degree of improvement varied considerably with the batch size.

As emphasized throughout the post, the expected impact of using CUDA streams can vary widely based on the AI/ML workload. For example, if the GPU is already being utilized efficiently, the overhead of using CUDA streams could actually degrade the runtime performance. We strongly recommend testing this technique on your own workload before adopting it.

I hope the techniques described in this post prove useful. Be sure to check out the other posts in this series for more tips, tricks, and techniques for profiling and optimizing AI/ML workflows.
