I was working on a script the other day and it drove me crazy. It worked, sure, but it was just… slow. Really slow. I felt that if I understood where the holdup was, I could make it faster.
My first instinct was to start tweaking. Optimize the data loading? Rewrite the for loop? But I stopped myself. I've fallen into that trap before, spending hours "optimizing" a piece of code only to find that overall execution time barely changed as a result. Donald Knuth had a point when he said, "Premature optimization is the root of all evil."
I decided to take a more systematic approach. I was going to know for sure, not guess. To get hard data showing exactly which functions were eating most of the clock cycles, I had to profile the code.
This article explains the exact process I used. We take an intentionally slow Python script and two great tools to identify its bottlenecks with surgical precision.
The first of these tools is cProfile, a powerful profiler built into Python. The other is snakeviz, a handy tool that converts profiler output into an interactive visual map.
Setting up the development environment
Before you start coding, let's set up your development environment. Best practice is to create a separate Python environment where you can install and experiment with the required software, knowing that whatever you do won't affect the rest of your system. We'll use conda here, but you can use any method you're comfortable with.
# create our test environment
conda create -n profiling_lab python=3.11 -y
# Now activate it
conda activate profiling_lab
Now that your environment is set up, you need to install snakeviz for visualization and numpy for the sample script. cProfile is already built into Python, so there's nothing more to do there. We'll be using Jupyter Notebook to run the script, so we'll install that too.
# Install our visualization tool, numpy, and jupyter
pip install snakeviz numpy jupyter
Now type jupyter notebook at the command prompt. You should see a Jupyter Notebook open in your browser. If that doesn't happen automatically, look at the jupyter notebook terminal output. Near the bottom of it there is a URL that you need to copy and paste into your browser to open Jupyter Notebook.
Your URL will be different from mine, but it will look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da
Once the tool is ready, let's take a look at the code we'll be working on.
Our "problem" script
To properly test a profiling tool, you need a script that exhibits obvious performance issues. I wrote a simple program that simulates a processing problem involving memory, iteration, and CPU cycles. That makes it an ideal candidate for investigation.
# run_all_systems.py
import time
import math

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Task 1: A Calibrated CPU-Bound Bottleneck ---
def cpu_heavy_task(iterations):
    print(" -> Running CPU-bound task...")
    result = 0
    for i in range(iterations):
        result += math.sin(i) * math.cos(i) + math.sqrt(i)
    return result

# --- Task 2: A Calibrated Memory/String Bottleneck ---
def memory_heavy_string_task(iterations):
    print(" -> Running Memory/String-bound task...")
    report = ""
    chunk = "report_item_abcdefg_123456789_"
    for i in range(iterations):
        report += f"|{chunk}{i}"
    return report

# --- Task 3: A Calibrated "Thousand Cuts" Iteration Bottleneck ---
def simulate_tiny_op(n):
    pass

def iteration_heavy_task(iterations):
    print(" -> Running Iteration-bound task...")
    for i in range(iterations):
        simulate_tiny_op(i)
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    print("--- Starting FINAL SLOW Balanced Showcase ---")
    cpu_result = cpu_heavy_task(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task(iterations=LOOP_ITERATIONS)
    print("--- FINAL SLOW Balanced Showcase Finished ---")
Step 1: Collect data using cProfile
The first tool, cProfile, is a deterministic profiler built into Python. You can run it from your code and record detailed statistics about every function call.
import cProfile, pstats, io

pr = cProfile.Profile()
pr.enable()

# Run the function you want to profile
run_all_systems()

pr.disable()

# Dump the stats to a string and print the top 10 entries by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())
Here is the output.
--- Starting FINAL SLOW Balanced Showcase ---
 -> Running CPU-bound task...
 -> Running Memory/String-bound task...
 -> Running Iteration-bound task...
--- FINAL SLOW Balanced Showcase Finished ---
         275455984 function calls in 30.497 seconds
   Ordered by: cumulative time
   List reduced from 47 to 10 due to restriction <10>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   30.520   15.260 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
        2    0.000    0.000   30.520   15.260 {built-in method builtins.exec}
        1    0.000    0.000   30.497   30.497 /tmp/ipykernel_173802/1743829582.py:41(run_all_systems)
        1    9.652    9.652   14.394   14.394 /tmp/ipykernel_173802/1743829582.py:34(iteration_heavy_task)
        1    7.232    7.232   12.211   12.211 /tmp/ipykernel_173802/1743829582.py:14(cpu_heavy_task)
171796964    4.742    0.000    4.742    0.000 /tmp/ipykernel_173802/1743829582.py:31(simulate_tiny_op)
        1    3.891    3.891    3.892    3.892 /tmp/ipykernel_173802/1743829582.py:22(memory_heavy_string_task)
 34552942    1.888    0.000    1.888    0.000 {built-in method math.sin}
 34552942    1.820    0.000    1.820    0.000 {built-in method math.cos}
 34552942    1.271    0.000    1.271    0.000 {built-in method math.sqrt}
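Incidentally, pstats can reshape this table on its own before you reach for a visualizer. A minimal sketch (the `demo` function is just an illustrative stand-in for a real workload): `strip_dirs()` shortens the long file paths, and sorting by `"tottime"` ranks functions by time spent in their own body rather than cumulatively.

```python
import cProfile, pstats, io

def demo():
    # A stand-in workload, just to have something to profile
    return sum(i * i for i in range(10_000))

pr = cProfile.Profile()
pr.enable()
demo()
pr.disable()

s = io.StringIO()
# strip_dirs() drops the long directory prefixes; "tottime" sorts by
# time spent inside each function itself, excluding sub-calls
stats = pstats.Stats(pr, stream=s).strip_dirs().sort_stats("tottime")
stats.print_stats(5)
report = s.getvalue()
```

Sorting by tottime is handy when cumulative time keeps pointing at wrappers and orchestrators rather than the functions doing the actual work.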
That's a lot of numbers, and they're difficult to interpret. This is where snakeviz comes into its own.
Step 2: Visualize bottlenecks using snakeviz
This is where the magic happens. Snakeviz takes your profiling output and turns it into an interactive, browser-based graph that helps you find the bottlenecks.
So let's use the tool to visualize what we have. Since we're using Jupyter Notebook, we need to load the extension first.
%load_ext snakeviz
And run it like this.
%%snakeviz
run_all_systems()
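If you're working outside Jupyter, the same workflow still applies: dump the collected stats to a file and point the snakeviz command-line tool at it. A minimal sketch (`profile_demo` and the `profile_results.prof` filename are illustrative):

```python
import cProfile

def profile_demo():
    # Illustrative stand-in for the real workload
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
profile_demo()
pr.disable()

# Write the stats in the binary format snakeviz understands
pr.dump_stats("profile_results.prof")
# Then, from a terminal:  snakeviz profile_results.prof
```

snakeviz starts a small local web server and opens the same interactive view in your browser.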
The output is divided into two parts. First comes a visualization that looks something like this.
What you see is a top-down "icicle" graph. It represents the call hierarchy from top to bottom.
Top: Python running the script (<built-in method builtins.exec>).
Next: the __main__ module executing the script.
The memory-intensive part of the run is not labeled on the graph. That's because its share of the time is much smaller than the time spent in the other two intensive functions. As a result, it appears as a much smaller, unlabeled block to the right of the cpu_heavy_task block.
For analysis, snakeviz also offers a sunburst chart. It's a bit like a pie chart, but built from a set of increasingly larger concentric arcs: the time a function takes to execute is represented by the angular extent of its arc. The root function is the circle at the center of the viz, with the subfunctions it calls arranged around it, and so on. This article doesn't cover that display type.
This visual check has a much bigger impact than staring at a table of numbers. I no longer had to guess where to look. The data was staring me in the face.
The visualization is immediately followed by a block of text that details the timing of different parts of the code, similar to the cProfile output. There were over 30 lines in total, so I'll only show the first dozen.
   ncalls  tottime    percall    cumtime    percall    filename:lineno(function)
----------------------------------------------------------------
        1  9.581      9.581      14.3       14.3       1062495604.py:34(iteration_heavy_task)
        1  7.868      7.868      12.92      12.92      1062495604.py:14(cpu_heavy_task)
171796964  4.717      2.745e-08  4.717      2.745e-08  1062495604.py:31(simulate_tiny_op)
        1  3.848      3.848      3.848      3.848      1062495604.py:22(memory_heavy_string_task)
 34552942  1.91       5.527e-08  1.91       5.527e-08  ~:0(<built-in method math.sin>)
 34552942  1.836      5.313e-08  1.836      5.313e-08  ~:0(<built-in method math.cos>)
 34552942  1.305      3.778e-08  1.305      3.778e-08  ~:0(<built-in method math.sqrt>)
        1  0.02127    0.02127    31.09      31.09      <string>:1(<module>)
        4  0.0001764  4.409e-05  0.0001764  4.409e-05  socket.py:626(send)
       10  0.000123   1.23e-05   0.0004568  4.568e-05  iostream.py:655(write)
        4  4.594e-05  1.148e-05  0.0002735  6.838e-05  iostream.py:259(schedule)
...
...
...
Step 3: Fix it
Of course, tools like cProfile and snakeviz won't tell you how to resolve performance issues. But now that we know exactly where the problems are, we can apply targeted fixes.
# final_showcase_fixed_v2.py
import time
import math
import numpy as np

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Fix 1: Vectorization for the CPU-Bound Task ---
def cpu_heavy_task_fixed(iterations):
    """
    Fixed by using NumPy to perform the complex math on an entire array
    at once, in highly optimized C code instead of a Python loop.
    """
    print(" -> Running CPU-bound task...")
    # Create an array of numbers from 0 to iterations-1
    i = np.arange(iterations, dtype=np.float64)
    # The same calculation, but vectorized, is orders of magnitude faster
    result_array = np.sin(i) * np.cos(i) + np.sqrt(i)
    return np.sum(result_array)

# --- Fix 2: Efficient String Joining ---
def memory_heavy_string_task_fixed(iterations):
    """
    Fixed by using a list comprehension and a single, efficient ''.join() call.
    This avoids creating millions of intermediate string objects.
    """
    print(" -> Running Memory/String-bound task...")
    chunk = "report_item_abcdefg_123456789_"
    # A list comprehension is fast and memory-efficient
    parts = [f"|{chunk}{i}" for i in range(iterations)]
    return "".join(parts)

# --- Fix 3: Eliminating the "Thousand Cuts" Loop ---
def iteration_heavy_task_fixed(iterations):
    """
    Fixed by recognizing the task can be a no-op or a bulk operation.
    In a real-world scenario, you'd find a way to avoid the loop entirely.
    Here, we demonstrate the fix by simply removing the unnecessary loop.
    The goal is to show that the cost of the loop itself was the problem.
    """
    print(" -> Running Iteration-bound task...")
    # The fix is to find a bulk operation or eliminate the need for the loop.
    # Since the original function did nothing, the fix is to do nothing, but faster.
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    """
    The main orchestrator now calls the FAST versions of the tasks.
    """
    print("--- Starting FINAL FAST Balanced Showcase ---")
    cpu_result = cpu_heavy_task_fixed(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task_fixed(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task_fixed(iterations=LOOP_ITERATIONS)
    print("--- FINAL FAST Balanced Showcase Finished ---")
You can now rerun cProfile with the updated code.
import cProfile, pstats, io

pr = cProfile.Profile()
pr.enable()

# Run the function you want to profile
run_all_systems()

pr.disable()

# Dump the stats to a string and print the top 10 entries by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())
#
# start of output
#
--- Starting FINAL FAST Balanced Showcase ---
 -> Running CPU-bound task...
 -> Running Memory/String-bound task...
 -> Running Iteration-bound task...
--- FINAL FAST Balanced Showcase Finished ---
         197 function calls in 6.063 seconds
   Ordered by: cumulative time
   List reduced from 52 to 10 due to restriction <10>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    6.063    3.031 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
        2    0.000    0.000    6.063    3.031 {built-in method builtins.exec}
        1    0.002    0.002    6.063    6.063 /tmp/ipykernel_173802/1803406806.py:1(<module>)
        1    0.402    0.402    6.061    6.061 /tmp/ipykernel_173802/3782967348.py:52(run_all_systems)
        1    0.000    0.000    5.152    5.152 /tmp/ipykernel_173802/3782967348.py:27(memory_heavy_string_task_fixed)
        1    4.135    4.135    4.135    4.135 /tmp/ipykernel_173802/3782967348.py:35(<listcomp>)
        1    1.017    1.017    1.017    1.017 {method 'join' of 'str' objects}
        1    0.446    0.446    0.505    0.505 /tmp/ipykernel_173802/3782967348.py:14(cpu_heavy_task_fixed)
        1    0.045    0.045    0.045    0.045 {built-in method numpy.arange}
        1    0.000    0.000    0.014    0.014 <__array_function__ internals>:177(sum)
This is a great result that shows the power of profiling: we put effort only into the parts of the code that mattered. Just to be on the safe side, I also ran snakeviz against the modified script.
%%snakeviz
run_all_systems()

The most notable change is that total execution time has dropped from about 30 seconds to about 6 seconds. That's a 5x speedup, achieved by addressing the three major bottlenecks visible in the "before" profile.
Let's look at each one individually.
1. iteration_heavy_task
Before (the problem)
In the first image, the large bar on the left, iteration_heavy_task, is the biggest bottleneck at 14.3 seconds.
- Why was it so slow? This task was a classic case of "death by a thousand cuts." The function simulate_tiny_op did very little, but it was called millions of times from inside a pure Python for loop. The sheer overhead of the Python interpreter setting up and tearing down those function calls was the entire reason for the slowness.
The fix
The fixed version, iteration_heavy_task_fixed, recognizes that the goal can be achieved without the loop. In our showcase, that meant removing the meaningless loop entirely. In real-world applications, it means finding a single "bulk" operation to replace the iterative one.
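To make the idea concrete, here is a minimal sketch of the same pattern (square, slow_sum, and fast_sum are illustrative names, not from the showcase script): the work moves into a builtin that consumes the whole sequence, so the Python-level helper call disappears from the hot loop.

```python
def square(x):
    return x * x

def slow_sum(n):
    # One Python function call per element: the "thousand cuts" pattern
    total = 0
    for i in range(n):
        total += square(i)
    return total

def fast_sum(n):
    # One bulk operation: the builtin sum() consumes a generator directly
    return sum(i * i for i in range(n))
```

Both compute the same value; the fast version simply avoids millions of interpreter-level call frames.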
After (the result)
In the second image, the iteration_heavy_task bar is gone entirely. Its execution time is now so small that it doesn't even show up on the chart. We fixed the 14.3-second problem.
2. cpu_heavy_task
Before (the problem)
The second major bottleneck, clearly visible as the large orange bar on the right, is cpu_heavy_task at 12.9 seconds.
- Why was it so slow? Like the iteration task, this function was limited by the speed of the Python for loop. Although the individual math operations were fast, the interpreter had to process millions of calculations one by one, which is extremely inefficient for numerical work.
The fix
The fix was vectorization with the NumPy library. Instead of a Python loop, cpu_heavy_task_fixed creates a NumPy array and performs all the math operations (np.sin, np.cos, np.sqrt) on the entire array at once. These operations run in highly optimized, precompiled C code, completely bypassing the slow Python interpreter loop.
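As a sanity check, the two versions really do compute the same quantity. A small-scale sketch (loop_version, vectorized_version, and the tiny n are illustrative):

```python
import math
import numpy as np

def loop_version(n):
    # Interpreted: one trip through the Python bytecode loop per element
    total = 0.0
    for i in range(n):
        total += math.sin(i) * math.cos(i) + math.sqrt(i)
    return total

def vectorized_version(n):
    # Compiled: the whole array is processed inside NumPy's C routines
    i = np.arange(n, dtype=np.float64)
    return float(np.sum(np.sin(i) * np.cos(i) + np.sqrt(i)))
```

The results agree to floating-point precision; only the execution model differs.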
After (the result)
Just like the first bottleneck, the cpu_heavy_task bar is missing from the "after" diagram. Its execution time dropped from 12.9 seconds to about half a second.
3. memory_heavy_string_task
Before (the problem)
In the first image, memory_heavy_string_task was present, but its execution time was short compared to the other two big problems, so it was relegated to a small unlabeled area on the far right. It was a relatively minor issue.
The fix
The fix for this task was to replace the inefficient report += "..." string concatenation with a more efficient approach: build a list of all the string parts, then call "".join() just once at the end.
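The two approaches are behaviorally identical; only the memory churn differs. A small sketch (concat_version and join_version are illustrative names):

```python
def concat_version(count, chunk):
    # Each += may allocate a brand-new string and copy everything so far
    report = ""
    for i in range(count):
        report += f"|{chunk}{i}"
    return report

def join_version(count, chunk):
    # Build the parts once, then allocate the final string a single time
    parts = [f"|{chunk}{i}" for i in range(count)]
    return "".join(parts)
```

At millions of iterations the += version's repeated copying dominates, while join's single final allocation stays cheap.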
After (the result)
The second image shows a successful outcome. With the two 10-plus-second bottlenecks resolved, memory_heavy_string_task_fixed is now the new dominant bottleneck, at 4.34 seconds of the 5.22-second total execution time.
Snakeviz also lets you look inside this fixed function. The biggest contributor is now the orange bar labeled <listcomp> (the list comprehension), which takes 3.52 seconds. In other words, the most time-consuming part of the fixed code is building the huge list of strings in memory before joining them.
Summary
This article provides a practical guide to identifying and resolving performance issues in Python code, arguing that developers should use profiling tools to measure where the time goes instead of relying on intuition and guesswork.
We demonstrated a systematic workflow using two key tools:
- cProfile: Python's built-in profiler, used to collect detailed data about function calls and execution time.
- snakeviz: A visualization tool that transforms cProfile data into an interactive "icicle" graph, making it easy to see which parts of your code spend the most time.
The article uses a case study of an intentionally slow script designed with three major bottlenecks:
- An iteration-bound task: a function called millions of times inside a loop, showing the performance cost ("death by a thousand cuts") of Python's function call overhead.
- A CPU-bound task: a for loop performing millions of mathematical calculations, highlighting the inefficiency of pure Python for heavy number crunching.
- A memory-bound task: a large string built inefficiently with repeated += concatenation.
By analyzing snakeviz's output, we pinpointed these three issues and applied targeted fixes.
- The iteration bottleneck was fixed by eliminating the unnecessary loop.
- The CPU bottleneck was solved by vectorization with NumPy, which performs mathematical operations in fast compiled C code.
- The memory bottleneck was fixed by collecting the string parts in a list and making a single efficient "".join() call.
These fixes provided a dramatic speedup, cutting the script's execution time from 30 seconds to just 6 seconds. Finally, rerunning the profiler revealed a new, smaller bottleneck even after the major ones were resolved: performance tuning is an iterative process driven by measurements.

