This is part 1 of a two-part series on distributed computing with Ray. This part shows you how to use Ray on your local PC, and part 2 shows you how to extend Ray to a multi-server cluster in the cloud.
You’ve just bought a brand new 16-core laptop or desktop and want to test its capabilities by doing some heavy calculations.
You’re a Python programmer, but not an expert yet, so you open up your favorite LLM and ask something like:
“I want to count the number of prime numbers within a specified input range. Please give me the Python code to do so.”
After a few seconds, your LLM gives you some code. After a few short exchanges and some tweaks, you end up with something like this:
import math, time, os

def is_prime(n: int) -> bool:
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    r = int(math.isqrt(n)) + 1
    for i in range(3, r, 2):
        if n % i == 0:
            return False
    return True

def count_primes(a: int, b: int) -> int:
    c = 0
    for n in range(a, b):
        if is_prime(n):
            c += 1
    return c

if __name__ == "__main__":
    A, B = 10_000_000, 20_000_000
    total_cpus = os.cpu_count() or 1
    # Start "chunky"; we can sweep this later
    chunks = max(4, total_cpus * 2)
    step = (B - A) // chunks
    print(f"CPUs~{total_cpus}, chunks={chunks}")
    t0 = time.time()
    results = []
    for i in range(chunks):
        s = A + i * step
        e = s + step if i < chunks - 1 else B
        results.append(count_primes(s, e))
    total = sum(results)
    print(f"total={total}, time={time.time() - t0:.2f}s")
When you run this program, it works perfectly. The only problem is that it takes quite a while to run (depending on the size of your input range, it can take anywhere from 30 to 60 seconds). That probably won’t be acceptable.
So what do you do now? There are several options, but the three most common are probably:
– Parallelize your code using threads or multiprocessing
– Rewrite your code in a “fast” language like C or Rust
– Try libraries like Cython, Numba, NumPy, etc.
These are all viable options, but each has its drawbacks. Options 1 and 3 significantly increase code complexity, and the middle option may require you to learn a new programming language.
What if I told you there’s another way? One that requires minimal changes to your existing code, and in which the runtime is automatically distributed across all available cores.
That is exactly what the third-party Ray library promises to do.
What is Ray?
The Ray Python library is an open-source distributed computing framework designed to make it easy to scale your Python programs from your laptop to a cluster with minimal code changes.
Ray makes it easy to scale and distribute compute-intensive application workloads, from deep learning to data processing, across clusters of remote computers, while also delivering practical runtime improvements on laptops, desktops, and even remote cloud-based compute clusters.
Ray provides a rich set of libraries and integrations built on a flexible distributed execution framework, making distributed computing easily accessible to everyone.
In short, Ray lets you parallelize and distribute Python code with minimal effort, whether it’s running locally on your laptop or on a large cloud-based cluster.
Using Ray
In the rest of this article, we’ll cover the basics of using Ray to speed up CPU-intensive Python code, and we’ll set up some example code snippets to show you how easy it is to incorporate Ray’s functionality into your own workloads.
If you’re a data scientist or machine learning engineer, there are some important concepts you need to understand first to get the most out of Ray. Ray is made up of several components.
Ray Data is a scalable library designed for data processing in ML and AI tasks. It provides flexible, high-performance APIs for AI tasks such as batch inference, data preprocessing, and data ingestion for ML training.
Ray Train is a flexible and scalable library designed for distributed machine learning training and fine-tuning.
Ray Tune is used for hyperparameter tuning.
Ray Serve is a scalable library for deploying models behind online inference APIs.
Ray RLlib is used for scalable reinforcement learning.
As you can see, Ray is very focused on large language models and AI applications, but there’s one last important component that we haven’t mentioned yet, and it’s the one we’ll use in this article.
Ray Core is designed to scale CPU-intensive, general-purpose Python applications. It distributes Python workloads across all available cores on whatever system it’s running on.
This article only discusses Ray Core.
Two important concepts to understand within Ray Core are tasks and actors.
Tasks are stateless workers or services, implemented in Ray by decorating regular Python functions.
Actors (or stateful workers) are used, for example, when you need to track and maintain the state of a shared variable across a distributed cluster. Actors are implemented by decorating regular Python classes.
Actors and tasks are both defined using the same @ray.remote decorator. Once defined, they are executed using the special .remote() method provided by Ray. Let’s look at an example of this next.
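As a minimal sketch of the two ideas side by side, the block below turns a plain function into a task and a plain class into an actor. The names square, Counter and run_with_ray are illustrative only and do not come from Ray or from this article. Ray is imported inside run_with_ray so that the plain definitions can be tried without it, and ray.remote(...) is called as an ordinary function, which has the same effect as writing it as a decorator.

```python
def square(x: int) -> int:
    """Plain function; it becomes a stateless Ray task once wrapped."""
    return x * x


class Counter:
    """Plain class; it becomes a stateful Ray actor once wrapped."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int) -> int:
        self.value += amount
        return self.value


def run_with_ray() -> tuple[list[int], int]:
    """Run the task and the actor on a local Ray instance (requires Ray)."""
    import ray  # imported here so the definitions above work without Ray

    ray.init(ignore_reinit_error=True)

    square_task = ray.remote(square)     # same as decorating with @ray.remote
    counter_actor = ray.remote(Counter)  # same decorator, applied to a class

    # Tasks: each .remote() call returns an ObjectRef immediately.
    squares = ray.get([square_task.remote(i) for i in range(4)])

    # Actor: one instance lives in its own process and keeps its state.
    counter = counter_actor.remote()
    running_totals = ray.get([counter.add.remote(s) for s in squares])

    ray.shutdown()
    return squares, running_totals[-1]
```

On a machine with Ray installed, calling run_with_ray() runs the four task calls in parallel and then feeds their results to a single actor, which accumulates them in self.value.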
Setting up the development environment
Before you start coding, you should set up a development environment to keep your projects siloed so they don’t interfere with each other. I’ll use conda for this, but feel free to use your favorite tool. I run my code using a Jupyter notebook in a WSL2 Ubuntu shell on Windows.
$ conda create -n ray-test python=3.13 -y
$ conda activate ray-test
(ray-test) $ conda install ray[default]
Code example – Counting prime numbers
Let’s take the example given at the beginning: counting the number of prime numbers in the range 10,000,000 to 20,000,000.
Run the original Python code and measure how long it takes.
import math, time, os

def is_prime(n: int) -> bool:
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    r = int(math.isqrt(n)) + 1
    for i in range(3, r, 2):
        if n % i == 0:
            return False
    return True

def count_primes(a: int, b: int) -> int:
    c = 0
    for n in range(a, b):
        if is_prime(n):
            c += 1
    return c

if __name__ == "__main__":
    A, B = 10_000_000, 20_000_000
    total_cpus = os.cpu_count() or 1
    # Start "chunky"; we can sweep this later
    chunks = max(4, total_cpus * 2)
    step = (B - A) // chunks
    print(f"CPUs~{total_cpus}, chunks={chunks}")
    t0 = time.time()
    results = []
    for i in range(chunks):
        s = A + i * step
        e = s + step if i < chunks - 1 else B
        results.append(count_primes(s, e))
    total = sum(results)
    print(f"total={total}, time={time.time() - t0:.2f}s")
And the output?
CPUs~32, chunks=64
total=606028, time=31.17s
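The listing splits the range into equal chunks, with the last chunk absorbing any remainder. As a small sanity check, the sketch below isolates that splitting logic in a hypothetical helper called split_range (the name is an assumption, not part of the original code) and confirms on a small range that counting chunk by chunk gives the same answer as counting in one pass.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division by odd numbers up to sqrt(n), as in the listing."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True


def count_primes(a: int, b: int) -> int:
    """Count primes in the half-open range [a, b)."""
    return sum(1 for n in range(a, b) if is_prime(n))


def split_range(a: int, b: int, chunks: int) -> list[tuple[int, int]]:
    """Hypothetical helper: split [a, b) into `chunks` contiguous pieces.

    Every piece holds (b - a) // chunks numbers except the last,
    which also absorbs the remainder.
    """
    step = (b - a) // chunks
    bounds = []
    for i in range(chunks):
        s = a + i * step
        e = s + step if i < chunks - 1 else b
        bounds.append((s, e))
    return bounds


pieces = split_range(0, 1_000, 7)
chunked_total = sum(count_primes(s, e) for s, e in pieces)
print(pieces[0], pieces[-1])
print(chunked_total == count_primes(0, 1_000))  # chunking must not change the count
```

Because the pieces are contiguous and non-overlapping, they can be counted in any order, or at the same time, without affecting the total. That property is what makes this workload a good fit for parallel execution.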
Now, can you improve on that using Ray? Yes, just follow this simple 4-step process.
Step 1 – Initialize Ray. Add these two lines to the top of your code.
import ray
ray.init()
Step 2 – Create the remote function. This is easy. Simply decorate the function you want to optimize with the @ray.remote decorator. The function to decorate is the one that does the most work. In this example, that’s the count_primes function.
@ray.remote(num_cpus=1)
def count_primes(start: int, end: int) -> int:
    ...
    ...
Step 3 – Launch the parallel tasks. Call the remote function using Ray’s .remote() method.
refs.append(count_primes.remote(s, e))
Step 4 – Wait until all tasks have completed. Each Ray task returns an ObjectRef when called. This is a promise from Ray: it means Ray has started the remote execution of the task and will return its value at some point in the future. Pass all the ObjectRefs returned by your tasks to the ray.get() function. This blocks until all tasks have completed.
results = ray.get(refs)
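If the idea of a promise is new to you, the futures in Python’s standard library follow the same submit-now, collect-later pattern. The sketch below is an analogy only and is not Ray code: pool.submit() plays the role of .remote() and future.result() plays the role of ray.get(). Note that on standard CPython a thread pool will not speed up CPU-bound work the way Ray’s worker processes do.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future immediately, much like .remote() returns an ObjectRef
    futures = [pool.submit(square, i) for i in range(5)]
    # result() blocks until the value is ready, much like ray.get()
    results = [f.result() for f in futures]

print(results)
```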
Let’s put this all together. As you can see, the changes to the original code are minimal. The only additions are four lines of code and a print statement that shows the number of nodes and cores in use.
import math
import time
# -----------------------------------------
# Change No. 1
# -----------------------------------------
import ray
ray.init()

def is_prime(n: int) -> bool:
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    r = int(math.isqrt(n)) + 1
    for i in range(3, r, 2):
        if n % i == 0:
            return False
    return True

# -----------------------------------------
# Change No. 2
# -----------------------------------------
@ray.remote(num_cpus=1)  # pure-Python loop → 1 CPU per task
def count_primes(a: int, b: int) -> int:
    c = 0
    for n in range(a, b):
        if is_prime(n):
            c += 1
    return c

if __name__ == "__main__":
    A, B = 10_000_000, 20_000_000
    total_cpus = int(ray.cluster_resources().get("CPU", 1))
    # Start "chunky"; we can sweep this later
    chunks = max(4, total_cpus * 2)
    step = (B - A) // chunks
    print(f"nodes={len(ray.nodes())}, CPUs~{total_cpus}, chunks={chunks}")
    t0 = time.time()
    refs = []
    for i in range(chunks):
        s = A + i * step
        e = s + step if i < chunks - 1 else B
        # -----------------------------------------
        # Change No. 3
        # -----------------------------------------
        refs.append(count_primes.remote(s, e))
    # -----------------------------------------
    # Change No. 4
    # -----------------------------------------
    total = sum(ray.get(refs))
    print(f"total={total}, time={time.time() - t0:.2f}s")
Well, was it all worth it? Let’s run the new code and see what we get.
2025-11-01 13:36:30,650 INFO worker.py:2004 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
/home/tom/.local/lib/python3.10/site-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
nodes=1, CPUs~32, chunks=64
total=606028, time=3.04s
Well, the results speak for themselves. The Ray code is about 10x faster than the regular Python code. Not too shabby.
Where does this speed increase come from? Ray can spread the workload across all the cores on your system. Cores are like mini CPUs. When I ran the original Python code, only one core was used. That’s fine, but if your CPU has multiple cores, as most modern PCs do, you’re leaving money on the table, so to speak.
In my case, the CPU has 24 cores, so it’s no surprise that the Ray code is much faster than the non-Ray code.
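To put a number on the improvement, here is the arithmetic using the two timings reported above. The CPU count is taken from the "CPUs~32" line in the program output, and treating that figure as the number of usable parallel workers is an assumption; it most likely counts logical CPUs rather than physical cores.

```python
serial_seconds = 31.17  # plain Python run reported above
ray_seconds = 3.04      # Ray run reported above
reported_cpus = 32      # "CPUs~32" in the program output (assumed logical CPUs)

speedup = serial_seconds / ray_seconds
efficiency = speedup / reported_cpus  # 1.0 would be perfect linear scaling

print(f"speedup    ~ {speedup:.2f}x")
print(f"efficiency ~ {efficiency:.0%} of ideal linear scaling")
```

A speedup well below the CPU count is normal for a run this short. Starting the worker processes, scheduling the tasks, and chunks that take unequal time (larger numbers need more trial divisions) all eat into the ideal figure.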
Monitoring Ray jobs
Another point worth mentioning is that Ray makes it very easy to monitor job execution through its dashboard. Notice that the output you get when you run the Ray sample code includes the following:
... -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Because you’re running this on your desktop, you’ll see a local URL. If you were running this on a cluster, the URL would point to a location on the cluster’s head node.
When you click the provided URL, the Ray dashboard opens in your browser.
From its main screen, you can drill down to monitor various aspects of your Ray program using the menu links at the top of the page.
Using Ray actors
We mentioned earlier that actors are an integral part of Ray Core processing. Actors are used to coordinate and share data between Ray tasks. For example, say you want to set a global limit that all running tasks must respect. Suppose you have a pool of worker tasks, but you want them to make at most 5 calls per second between them. A first attempt that looks like it should work is to keep the limit in a global variable.
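As a hedged sketch of that first attempt (not the author’s original listing), the code below paces calls with a module-level timestamp. The names acquire_slot, call_api_with_global_limit and run_with_ray are hypothetical. Ray is imported inside run_with_ray so the pacing logic can be tried on its own in a single process, where it behaves correctly.

```python
import time

GLOBAL_QPS = 5.0
_next_time = 0.0  # module-level state: every worker process gets its own copy


def acquire_slot(qps: float = GLOBAL_QPS) -> None:
    """Sleep until the next free slot, tracked in a module-level global."""
    global _next_time
    now = time.monotonic()
    wait = _next_time - now
    if wait > 0:
        time.sleep(wait)
    # Reserve the following slot, one interval after this one
    _next_time = max(now, _next_time) + 1.0 / qps


def call_api_with_global_limit(n_calls: int, qps: float = GLOBAL_QPS) -> int:
    done = 0
    for _ in range(n_calls):
        acquire_slot(qps)  # looks global, but only paces the current process
        done += 1          # stand-in for the real API call
    return done


def run_with_ray(num_workers: int = 10, calls_each: int = 20) -> float:
    """Launch the limiter as Ray tasks and return the elapsed time (requires Ray)."""
    import ray

    ray.init(ignore_reinit_error=True)
    task = ray.remote(call_api_with_global_limit)
    t0 = time.time()
    ray.get([task.remote(calls_each) for _ in range(num_workers)])
    elapsed = time.time() - t0
    ray.shutdown()
    return elapsed
```

Within one process the limiter holds the rate at qps. Launched as several Ray tasks, each worker process keeps its own _next_time, so each worker paces only itself and the combined rate is roughly the per-worker rate multiplied by the number of workers running at once.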
I used global variables to limit the rate at which the tasks run. The code is syntactically correct and runs without error. Unfortunately, it doesn’t produce the results I expected. This is because each Ray task runs in its own process space and has its own copy of the global variables. Global variables are not shared between tasks. So when you run this kind of code, you’ll see output like this:
Total calls: 200
Intended GLOBAL_QPS: 5.0
Expected time if truly global-limited: ~40.00s
Actual time with 'global var' (broken): 3.80s
Observed cluster QPS: ~52.6 (should have been ~5.0)
To fix this, use actors. Recall that actors are just Python classes decorated with @ray.remote. Here is the code with the actor:
import time, ray

ray.init(ignore_reinit_error=True, log_to_driver=False)

# This is our actor
@ray.remote
class GlobalPacer:
    """Serialize calls so cluster-wide rate <= qps."""
    def __init__(self, qps: float):
        self.interval = 1.0 / qps
        self.next_time = time.time()

    def acquire(self):
        # Wait inside the actor until we can proceed
        now = time.time()
        if now < self.next_time:
            time.sleep(self.next_time - now)
        # Reserve the next slot; guard against drift
        self.next_time = max(self.next_time + self.interval, time.time())
        return True

@ray.remote
def call_api_with_limit(n_calls: int, pacer):
    done = 0
    for _ in range(n_calls):
        # Wait for global permission
        ray.get(pacer.acquire.remote())
        # fake API call (no extra sleep here)
        done += 1
    return done

if __name__ == "__main__":
    NUM_WORKERS = 10
    CALLS_EACH = 20
    GLOBAL_QPS = 5.0  # cluster-wide cap
    total_calls = NUM_WORKERS * CALLS_EACH
    expected_min_time = total_calls / GLOBAL_QPS
    pacer = GlobalPacer.remote(GLOBAL_QPS)
    t0 = time.time()
    ray.get([call_api_with_limit.remote(CALLS_EACH, pacer) for _ in range(NUM_WORKERS)])
    dt = time.time() - t0
    print(f"Total calls: {total_calls}")
    print(f"Global QPS cap: {GLOBAL_QPS}")
    print(f"Expected time (if capped at {GLOBAL_QPS} QPS): ~{expected_min_time:.2f}s")
    print(f"Actual time with actor: {dt:.2f}s")
    print(f"Observed cluster QPS: ~{total_calls/dt:.1f}")
The limiter code is encapsulated in a class (GlobalPacer) and decorated with @ray.remote. Because every task talks to the same actor instance, the limit applies to all running tasks. Run the updated code and see the difference in the output.
Total calls: 200
Global QPS cap: 5.0
Expected time (if capped at 5.0 QPS): ~40.00s
Actual time with actor: 39.86s
Observed cluster QPS: ~5.0
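The measured time lands just under 40 seconds because the first call is granted immediately and each later call waits one interval after the previous one. The sketch below replays the actor’s slot arithmetic with a simulated clock instead of time.time() and time.sleep(). PacerLogic and simulate_grants are hypothetical names, and the model assumes the actor handles one request at a time with no network or scheduling delay.

```python
class PacerLogic:
    """Slot arithmetic of GlobalPacer.acquire, driven by an explicit clock."""

    def __init__(self, qps: float, now: float = 0.0) -> None:
        self.interval = 1.0 / qps
        self.next_time = now

    def reserve(self, now: float) -> float:
        """Return the time at which a request arriving at `now` is granted."""
        wait = max(0.0, self.next_time - now)
        granted = now + wait  # the real actor sleeps for `wait` seconds here
        # Reserve the next slot; never schedule it in the past
        self.next_time = max(self.next_time + self.interval, granted)
        return granted


def simulate_grants(total_calls: int, qps: float) -> list[float]:
    """Grant times when requests queue up faster than the pacer releases them."""
    pacer = PacerLogic(qps)
    clock = 0.0
    grants = []
    for _ in range(total_calls):
        # The actor is busy until the previous grant, so the next request
        # is processed at that moment.
        clock = pacer.reserve(clock)
        grants.append(clock)
    return grants


grants = simulate_grants(200, 5.0)
print(f"first grant at {grants[0]:.1f}s, last grant at {grants[-1]:.1f}s")
```

With 200 calls at 5 calls per second, the last grant falls at (200 - 1) × 0.2 = 39.8 seconds in this model, which is close to the 39.86s measured with the real actor.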
Summary
This article introduced Ray, an open-source Python framework that lets you scale computationally intensive programs from a single core to multiple cores to clusters with minimal code changes.
We briefly discussed Ray’s main components: Ray Data, Ray Train, Ray Tune, Ray Serve, and Ray Core, and emphasized that Ray Core is ideal for general-purpose CPU scaling.
We covered some of the key concepts in Ray Core, including tasks (stateless parallel functions), actors (stateful workers for shared state and coordination), and ObjectRefs (future promises for task return values).
To demonstrate the benefits of using Ray, we started with a simple CPU-intensive example (counting prime numbers in a range) to show how a naive Python implementation is slow when run on a single core.
Instead of rewriting your code in another language or using complex multiprocessing libraries, you can use Ray to parallelize your workload in just four easy steps and a few lines of code:
- Start Ray with ray.init()
- Decorate the function with @ray.remote to turn it into a parallel task
- Launch tasks concurrently using .remote()
- Collect the task results using ray.get()
This approach reduced the execution time for the prime-counting example from about 30 seconds to about 3 seconds on a 24-core machine.
We also saw how easy it is to monitor running jobs using Ray’s built-in dashboard, and how to access it.
Finally, we showed an example of how to use Ray actors, and why global variables are not suitable for coordinating multiple tasks: each worker has its own memory space.
In part two of this series, we’ll look at how to take things to the next level by letting your Ray jobs use even more CPU power when scaling to large multi-node clusters in the cloud via Amazon Web Services.

