Is it the future of Python numerical computing?
Late last year, Nvidia made an important announcement about the future of Python-based numerical computing. You'd be forgiven for missing it. After all, then and now, every other announcement from every AI company seems huge.
The announcement introduced the cuNumeric library, a drop-in replacement for the ubiquitous NumPy library, built on top of the Legate framework.
Who is Nvidia?
Most people probably know Nvidia from the ultra-fast chips that power computers and data centres around the world. You may also be familiar with Nvidia's charismatic, leather-jacket-loving CEO, Jensen Huang.
What many people don't know is that Nvidia also designs and creates innovative chip architectures and the associated software. One of its most highly regarded products is the Compute Unified Device Architecture (CUDA). CUDA is NVIDIA's proprietary parallel computing platform and programming model. Since its launch in 2007, it has evolved into a comprehensive ecosystem that includes drivers, runtimes, compilers, math libraries, debugging and profiling tools, and container images. The result is a well-tuned hardware and software loop that keeps Nvidia GPUs at the heart of modern high-performance and AI workloads.
What is Legate?
Legate is an NVIDIA-led open source runtime layer that lets you run familiar Python data science libraries (such as NumPy, cuNumeric, Pandas-style APIs, and sparse linear algebra kernels) on multi-core CPUs, single or multi-GPU nodes, and even multi-node clusters without changing your Python code. It converts high-level array operations into a graph of fine-grained tasks and hands that graph to the C++ Legion runtime, which schedules tasks, partitions data, and moves tiles between CPUs, GPUs and network links.
In short, Legate lets familiar single-node Python libraries scale transparently to multi-GPU, multi-node machines.
What is cuNumeric?
cuNumeric is a drop-in replacement for NumPy in which array operations are executed by the Legate task engine and accelerated on one or many Nvidia GPUs (or, if there is no GPU, on all CPU cores). In practice, you need to install it, and then you only have to change one import line to start using it in place of your regular NumPy code. For example …
# old
import numpy as np
...
...
# new
import cupynumeric as np # everything else stays the same
...
...
…and then run the script from the terminal using the legate command.
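If you want one script that runs on machines with and without cuNumeric installed, a guarded import is a common pattern. The sketch below is my own assumption rather than something taken from Nvidia's documentation: it tries cupynumeric first and falls back to plain NumPy. BACKEND is a hypothetical variable name used only to report which library was picked.

```python
# Guarded import: prefer cupynumeric when it is installed, otherwise use NumPy.
# BACKEND is a hypothetical name used only for reporting.
try:
    import cupynumeric as np
    BACKEND = "cupynumeric"
except ImportError:
    import numpy as np
    BACKEND = "numpy"

# Everything below is ordinary NumPy-style code, identical for both backends.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
row_sums = a.sum(axis=1)

print(f"backend: {BACKEND}")
print(f"row sums: {row_sums}")
```

The rest of the script never needs to know which backend it got, which is the whole point of a drop-in replacement.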
Behind the scenes, cuNumeric converts each NumPy call (for example np.sin, np.linalg.svd, fancy indexing, broadcasting, reductions, and so on) into Legate tasks. These tasks will:
- Partition your arrays into tiles that are sized to fit GPU memory.
- Schedule each tile on the best available device (GPU or CPU).
- Overlap computation with communication when the workload spans multiple GPUs or nodes.
- Spill tiles to NVMe/SSD when the dataset outgrows GPU RAM.
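To make the partitioning idea concrete, here is a toy sketch in plain NumPy. It is not how Legate is implemented. It only illustrates that an element-wise operation can be split into independent per-tile tasks whose results are stitched back together. The tiled_sin helper and the tile size are made up for the illustration.

```python
import numpy as np

def tiled_sin(x, tile_size):
    """Apply np.sin tile by tile, mimicking independent per-tile tasks."""
    out = np.empty_like(x)
    # Each slice is one "task"; a real runtime could place these on different devices.
    for start in range(0, x.shape[0], tile_size):
        stop = min(start + tile_size, x.shape[0])
        out[start:stop] = np.sin(x[start:stop])
    return out

x = np.linspace(0.0, 10.0, 1_000)
tiled = tiled_sin(x, tile_size=256)

# The tiled result matches the result of a single np.sin call.
print(np.allclose(tiled, np.sin(x)))
```

Because no tile depends on any other, a scheduler is free to run them in any order and on any device, which is what makes this kind of decomposition attractive.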
cuNumeric's API mirrors NumPy's almost one-to-one, so existing scientific or data science code can scale from a laptop to a multi-GPU cluster without being rewritten.
Performance benefits
So, all of this sounds great, right? However, it only makes sense if it delivers a concrete performance improvement over NumPy, and Nvidia makes strong claims that it does. Data scientists, machine learning engineers and data engineers typically use NumPy a lot, so this could be an important consideration for the systems we write and maintain.
I don't currently have a cluster of GPUs or a supercomputer to test this on, but my desktop PC has an NVIDIA GeForce RTX 4070 Ti GPU. I'll use that to test some of NVIDIA's claims.
(base) tom@tpr-desktop:~$ nvidia-smi
Sun Jun 15 15:26:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.75 Driver Version: 566.24 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti On | 00000000:01:00.0 On | N/A |
| 32% 29C P8 9W / 285W | 1345MiB / 12282MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I'll install cuNumeric and NumPy on my PC to run the comparison tests. This will help assess whether Nvidia's claims are accurate and show the performance differences between the two libraries.
Setting up the development environment
As always, I like to set up a separate development environment to run my tests. That way, nothing I do in that environment will affect my other projects. At the time of writing, cuNumeric cannot be installed on Windows, so I'll use WSL2 Ubuntu instead.
I set up my environment using Miniconda, but feel free to use whichever tool you prefer.
$ conda create -n cunumeric-env python=3.10 -c conda-forge
$ conda activate cunumeric-env
$ conda install -c conda-forge -c legate cupynumeric
$ conda install -c conda-forge ucx cuda-cudart cuda-version=12
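Before running any benchmarks, it can be worth checking that both libraries are visible from the activated environment. This is a minimal sketch using only the standard library. The is_installed helper is a name I made up, and it looks the modules up without importing them, so it does not start the Legate runtime.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

for name in ("numpy", "cupynumeric"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If cupynumeric shows as missing, the usual culprit is running the check outside the activated conda environment.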
Code Example 1 – Simple Matrix Multiplication
Matrix multiplication is the bread and butter of the mathematical operations that underpin so many AI systems, so it makes sense to try it first.
Note that in all my examples, I run the NumPy and cuNumeric code snippets five times in a row and average the timings. I also perform a warm-up step before timing to account for overheads such as just-in-time (JIT) compilation.
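The warm-up-then-average methodology described above can be captured in a small helper. This is a sketch under my own assumptions: time_average is a hypothetical name, and it uses time.perf_counter() where the scripts in this article use time.time(). For GPU work, the function being timed would also need to include whatever step forces the result back to the host.

```python
import time

def time_average(fn, runs=5, warmup=1):
    """Call fn() `warmup` times untimed, then return the mean of `runs` timed calls."""
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Usage with a stand-in workload (summing a range of integers).
avg = time_average(lambda: sum(range(100_000)))
print(f"average: {avg:.6f}s")
```

The scripts in this article inline the same logic rather than using a helper, so that each run's timing can be printed as it happens.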
import time
import gc
import argparse
import sys


def benchmark_numpy(n, runs):
    """Runs the matrix multiplication benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print(f"Generating two {n}x{n} random matrices on CPU...")
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = np.matmul(A, B)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed. The @ operator is a convenient
        # shorthand for np.matmul.
        C = A @ B
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C  # Clean up the result matrix
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.4f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the matrix multiplication benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Import numpy for the canonical sync

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print(f"Generating two {n}x{n} random matrices on GPU...")
    A = cn.random.rand(n, n).astype(np.float32)
    B = cn.random.rand(n, n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    C_warmup = cn.matmul(A, B)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(C_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        C = A @ B
        # Synchronize by converting the result to a host-side NumPy array.
        np.array(C)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.4f}s\n")
    return avg


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark matrix multiplication on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # The dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
To run the NumPy side of things, I use the regular python example1.py command-line syntax. The syntax for running under Legate is more complicated. What it does is disable Legate's automatic configuration, then launch the example1.py script under Legate using one CPU, one GPU, and zero OpenMP threads.
Here is the output.
(cunumeric-env) tom@tpr-desktop:~$ python example1.py
--- NumPy (CPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0976s
Run 2: time = 0.0987s
Run 3: time = 0.0957s
Run 4: time = 0.1063s
Run 5: time = 0.0989s
NumPy common: 0.0994s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric
[0 - 7f2e8fcc8480] 0.000000 {5}{module_config}: Module numa can't detect resources.
[0 - 7f2e8fcc8480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f2e8fcc8480] 0.000049 {4}{threads}: reservation ('GPU ctxsync 0x55cd5fd34530') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0113s
Run 2: time = 0.0089s
Run 3: time = 0.0086s
Run 4: time = 0.0090s
Run 5: time = 0.0087s
cuNumeric common: 0.0093s
Well, that's an impressive start. cuNumeric has registered a roughly 10x speedup over NumPy.
You can ignore the warnings that Legate outputs. These are informational and indicate that Legate couldn't find sufficient details about the CPU/memory layout (NUMA) or enough CPU cores to manage the GPU.
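Another way to read these timings is as throughput. The sketch below assumes the usual approximation that multiplying two n x n matrices takes about 2 * n**3 floating-point operations, and plugs in the averages printed above. The matmul_gflops helper is a hypothetical name. Bear in mind that the cuNumeric timing also includes copying the result back to the host, so its figure understates the raw GPU rate.

```python
def matmul_gflops(n, seconds):
    """Approximate throughput of an n x n matmul, counting about 2 * n**3 operations."""
    return 2 * n**3 / seconds / 1e9

n = 3000
numpy_avg = 0.0994      # seconds, from the NumPy run above
cunumeric_avg = 0.0093  # seconds, from the cuNumeric run above

print(f"NumPy:     {matmul_gflops(n, numpy_avg):.0f} GFLOP/s")
print(f"cuNumeric: {matmul_gflops(n, cunumeric_avg):.0f} GFLOP/s")
print(f"speedup:   {numpy_avg / cunumeric_avg:.1f}x")
```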
Code Example 2 – Logistic Regression
Logistic regression is a fundamental tool in data science because it provides a simple and interpretable way to model and predict binary outcomes (yes/no, pass/fail, click/no click). This example measures how long it takes to train a simple binary classifier on synthetic data. The script first generates n samples with d features (X) and a corresponding random 0/1 label vector (y). For each of the five timed runs, it initializes the weight vector w to zeros, then runs 500 iterations of batch gradient descent: computing the linear predictions z = X.dot(w), applying the sigmoid p = 1 / (1 + exp(-z)), computing the gradient grad = X.T.dot(p - y) / n, and updating the weights with w -= 0.1 * grad. The script records the elapsed time for each run, cleans up memory, and finally prints the average training time.
import time
import gc
import argparse
import sys


# --- Reusable Training Function ---
# By putting the training loop in its own function, we avoid code duplication.
# The `np` argument allows us to pass in either the numpy or cupynumeric module.
def train_logistic_regression(np, X, y, iters, alpha):
    """Performs a fixed number of gradient descent iterations."""
    # Ensure w starts on the correct device (CPU or GPU)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        z = X.dot(w)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T.dot(p - y) / X.shape[0]
        w -= alpha * grad
    return w


def benchmark_numpy(n_samples, n_features, iters, alpha, runs):
    """Runs the logistic regression benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random dataset on CPU...")
    X = np.random.rand(n_samples, n_features)
    y = (np.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = train_logistic_regression(np, X, y, iters, alpha)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    # `runs` is passed in explicitly rather than read from the global `args`.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        _ = train_logistic_regression(np, X, y, iters, alpha)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.3f}s\n")
    return avg


def benchmark_cunumeric(n_samples, n_features, iters, alpha, runs):
    """Runs the logistic regression benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print("Generating random dataset on GPU...")
    X = cn.random.rand(n_samples, n_features)
    y = (cn.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    w_warmup = train_logistic_regression(cn, X, y, iters, alpha)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(w_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        w = train_logistic_regression(cn, X, y, iters, alpha)
        # Synchronize by converting the final result back to a NumPy array.
        np.array(w)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        del w
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.3f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark logistic regression on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    # Hyperparameters for the model
    parser.add_argument(
        "-n", "--n_samples", type=int, default=2_000_000, help="Number of data samples"
    )
    parser.add_argument(
        "-d", "--n_features", type=int, default=10, help="Number of features"
    )
    parser.add_argument(
        "-i", "--iters", type=int, default=500, help="Number of gradient descent iterations"
    )
    parser.add_argument(
        "-a", "--alpha", type=float, default=0.1, help="Learning rate"
    )
    # Benchmark control
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # Dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n_samples, args.n_features, args.iters, args.alpha, args.runs)
    else:
        benchmark_numpy(args.n_samples, args.n_features, args.iters, args.alpha, args.runs)
And the output.
(cunumeric-env) tom@tpr-desktop:~$ python example2.py
--- NumPy (CPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 12.292s
Run 2: time = 11.830s
Run 3: time = 11.903s
Run 4: time = 12.843s
Run 5: time = 11.964s
NumPy common: 12.166s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example2.py --cunumeric
[0 - 7f04b535c480] 0.000000 {5}{module_config}: Module numa can't detect resources.
[0 - 7f04b535c480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f04b535c480] 0.001149 {4}{threads}: reservation ('GPU ctxsync 0x55fb037cf140') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 1.964s
Run 2: time = 1.957s
Run 3: time = 1.968s
Run 4: time = 1.955s
Run 5: time = 1.960s
cuNumeric common: 1.961s
Not as impressive as our first example, but a roughly 6x speedup of an already fast NumPy program is not to be sniffed at.
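As a sanity check on the maths in this example, the gradient expression used in the training loop can be compared against a finite-difference estimate. The article doesn't name the loss being minimized, so this sketch assumes it is the mean binary cross-entropy, which is the loss whose gradient is X.T.dot(p - y) / n. The loss and gradient helpers, the small dataset and the random seed are all my own choices for illustration.

```python
import numpy as np

def loss(w, X, y):
    """Mean binary cross-entropy for logistic regression (assumed loss)."""
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def gradient(w, X, y):
    """Analytic gradient, the same expression used in the training loop."""
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return X.T.dot(p - y) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (rng.random(200) > 0.5).astype(np.float64)
w = rng.normal(size=3) * 0.1

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)

print(np.allclose(numeric, gradient(w, X, y), atol=1e-6))
```

Agreement between the two estimates confirms that the update in the benchmark really is a gradient descent step on that loss.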
Code Example 3 – Solving a Linear System
This script benchmarks how long it takes to solve a dense 3000 x 3000 system of linear equations. This is a fundamental operation in linear algebra, used to solve equations of the form Ax = b, where A is a large grid of numbers (in this case, a 3000 x 3000 matrix) and b is a list of numbers (a vector).
The goal is to find the unknown list of numbers x that makes the equation true. It's a computationally intensive task at the heart of many scientific simulations, engineering problems, financial models, and even AI algorithms.
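To make the Ax = b idea concrete before scaling it up, here is a tiny NumPy-only sketch with a 3 x 3 system. The numbers are my own illustrative choice, picked so that the solution is x = [2, 3, -1].

```python
import numpy as np

# A small 3 x 3 system A x = b with a known solution.
A = np.array([[2.0, 1.0, -1.0],
              [-3.0, -1.0, 2.0],
              [-2.0, 1.0, 2.0]])
b = np.array([8.0, -11.0, -3.0])

x = np.linalg.solve(A, b)
print(x)

# Check the answer by substituting it back into the equation.
print(np.allclose(A @ x, b))
```

The benchmark that follows does exactly the same thing, only with a random 3000 x 3000 matrix.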
import time
import gc
import argparse
import sys  # Import sys to check arguments

# Note: The library imports (numpy and cupynumeric) are done *inside*
# their respective functions to keep them separate and avoid import errors.


def benchmark_numpy(n, runs):
    """Runs the linear solve benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random system on CPU...")
    A = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)

    # 2. Perform one untimed warm-up run. This is good practice even for
    # the CPU to ensure caches are warm and any one-time setup is done.
    print("Performing warm-up run...")
    _ = np.linalg.solve(A, b)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        x = np.linalg.solve(A, b)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the result to be safe with memory
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the linear solve benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    # This ensures we're not timing the data transfer in our main loop.
    print("Generating random system on GPU...")
    A = cn.random.randn(n, n).astype(np.float32)
    b = cn.random.randn(n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run. This handles JIT
    # compilation and other one-time GPU setup costs.
    print("Performing warm-up run...")
    x_warmup = cn.linalg.solve(A, b)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(x_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        x = cn.linalg.solve(A, b)
        # Synchronize by converting the result to a host-side NumPy array.
        # This is guaranteed to block until the GPU has finished.
        np.array(x)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the GPU array result
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark linear solve on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    # Use parse_known_args() to handle potential extra arguments from Legate
    args, unknown = parser.parse_known_args()

    # The dispatcher logic: check if "--cunumeric" is in the command line.
    # It's a simple and effective way to switch between modes.
    if "--cunumeric" in sys.argv or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
The output.
(cunumeric-env) tom@tpr-desktop:~$ python example4.py
--- NumPy (CPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.133075s
Run 2: time = 0.126129s
Run 3: time = 0.135849s
Run 4: time = 0.137383s
Run 5: time = 0.138805s
NumPy common: 0.134248s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example4.py --cunumeric
[0 - 7f29f42ce480] 0.000000 {5}{module_config}: Module numa can't detect resources.
[0 - 7f29f42ce480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f29f42ce480] 0.000053 {4}{threads}: reservation ('GPU ctxsync 0x562e88c28700') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.009685s
Run 2: time = 0.010043s
Run 3: time = 0.009966s
Run 4: time = 0.009739s
Run 5: time = 0.009383s
cuNumeric common: 0.009763s
That's another strong result. The cuNumeric run is roughly 14 times faster than the NumPy run (an average of 0.0098s versus 0.134s).
Code Example 4 – Sorting
Sorting is such a fundamental part of everything that happens in computing that most developers don't even think about it, because modern computers are so fast. But let's see how much of a difference cuNumeric can make to this ubiquitous operation. We'll sort a large (30,000,000-element) 1D array of numbers.
# benchmark_sort.py
import time
import sys
import gc

# Array size
n = 30_000_000  # 30 million elements


def benchmark_numpy():
    import numpy as np

    print(f"Sorting an array of {n} elements with NumPy (5 runs)\n")
    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")


def benchmark_cunumeric():
    import cupynumeric as np

    print(f"Sorting an array of {n} elements with cuNumeric (5 runs)\n")
    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        # Force GPU sync
        _ = np.linalg.norm(np.zeros(()))
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
        _ = np.linalg.norm(np.zeros(()))
    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")


if __name__ == "__main__":
    if "--cunumeric" in sys.argv:
        benchmark_cunumeric()
    else:
        benchmark_numpy()
The output.
(cunumeric-env) tom@tpr-desktop:~$ python example5.py
--- NumPy (CPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.588777s
Run 2: time = 0.586813s
Run 3: time = 0.586745s
Run 4: time = 0.586525s
Run 5: time = 0.583783s
NumPy common: 0.586529s
-----------------------------
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example5.py --cunumeric
[0 - 7fd9e4615480] 0.000000 {5}{module_config}: Module numa can't detect resources.
[0 - 7fd9e4615480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7fd9e4615480] 0.000082 {4}{threads}: reservation ('GPU ctxsync 0x564489232fd0') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.010857s
Run 2: time = 0.007927s
Run 3: time = 0.007921s
Run 4: time = 0.008240s
Run 5: time = 0.007810s
cuNumeric common: 0.008551s
-------------------------------
Even more impressive performance from cuNumeric and Legate.
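When a swap of backend produces a speedup this large, it's worth confirming that the output is still correct as well as fast. This is a minimal sketch of such a check, written against plain NumPy on a small array. The array size and seed are arbitrary choices of mine, and the same assertions should apply unchanged to a cuNumeric array, although I haven't shown that here.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000).astype(np.float32)
result = np.sort(data)

# Every element must be no larger than its right-hand neighbour.
is_sorted = bool(np.all(result[:-1] <= result[1:]))

# Sorting must keep the size and the extreme values of the input.
same_size = result.shape == data.shape
same_extremes = bool(result[0] == data.min() and result[-1] == data.max())

print(is_sorted, same_size, same_extremes)
```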
Summary
This article introduced cuNumeric, an Nvidia library designed as a high-performance, drop-in replacement for NumPy. The key point is that data scientists can accelerate existing Python code on NVIDIA GPUs with minimal effort: change a single import line and run the script with the legate command.
Two key components power the technology.
- Legate: Nvidia's open source runtime layer that automatically converts high-level Python operations into tasks. It manages the distribution of these tasks across one or more GPUs and intelligently handles data partitioning, memory management (even spilling to disk if necessary), and communication optimization.
- cuNumeric: A user-friendly library that mirrors the NumPy API. When you make a call like np.matmul(), cuNumeric converts it into a task that the Legate engine executes on the GPU.
I was able to verify Nvidia's performance claims by running four benchmark tests on my desktop PC (using an NVIDIA RTX 4070 Ti GPU) and comparing cuNumeric on the GPU with standard NumPy on the CPU.
The results show a significant performance improvement for cuNumeric.
- Matrix multiplication: ~10x faster than NumPy.
- Logistic regression training: ~6x faster.
- Solving linear equations: ~14x faster.
- Large array sorting: another massive improvement, almost 70x faster.
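The speedup figures can be recomputed directly from the average timings printed in the benchmark output above. The dictionary layout and names in this short sketch are my own. The numbers are copied from the runs shown earlier.

```python
# Average timings in seconds (NumPy on CPU, cuNumeric on GPU),
# copied from the benchmark output above.
results = {
    "Matrix multiplication": (0.0994, 0.0093),
    "Logistic regression": (12.166, 1.961),
    "Linear solve": (0.134248, 0.009763),
    "Sorting": (0.586529, 0.008551),
}

speedups = {name: cpu / gpu for name, (cpu, gpu) in results.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x")
```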
In conclusion, I've shown that cuNumeric delivers on its promise, making the immense computational power of the GPU accessible to the broader Python data science community without a steep learning curve or a complete code rewrite.
For more information and links to related resources, see the original Nvidia announcement on cuNumeric here.

