Measuring application performance often reveals a large gap between actual performance and theoretical performance. With a growing ecosystem of products that have high performance needs, such as high performance computing (HPC), gaming, or the current landscape of large language models (LLMs), it is essential to accurately calculate the performance of your application.
Simply measuring theoretical GFLOPS (giga floating point operations per second) is not enough, because applications rarely reach these maximums in the real world. This is where roofline models come into play. They provide a clear visual method for estimating application performance and highlight the key role of hardware-specific optimization.
Why simple metrics aren’t enough
When you think about measuring performance, a few metrics come to mind:
- Running time: This tells you how long a task took but offers no insight into why.
- Cycles per instruction (CPI): This only measures the processor’s compute performance.
- Serial vs. parallel execution: This measures compute performance while overlooking hardware optimizations.
- Floating point operations per second (FLOP/s): This only represents a theoretical maximum that is often not achieved in real-world scenarios.
These are good metrics, but they usually do not provide enough information. For example, peak floating point operations per second is a theoretical limit that is often not achieved. Using it as the only metric is not sufficient because it ignores data movement, a common performance limiter.
Roofline modeling
The roofline model is a powerful tool that visually maps application performance to the capabilities of a specific hardware architecture, such as a CPU or GPU. The model gets its name from the shape of the graph it generates, which features a “roof” made of a sloped line and a flat horizontal line. This shape represents the ultimate performance limit imposed by the hardware.
In this modeling approach, two parameters define the hardware-imposed limits on achievable performance:
- Data movement: The time it takes to move data. It is calculated as the total data size divided by the system’s peak memory bandwidth.
- Computation: The time required for the calculation. It is determined by dividing the total number of floating-point operations by the system’s peak compute performance (usually measured in GFLOP/s).
The total running time of an application is determined by the larger of these two values: max {data_movement, computation}.
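The max {data_movement, computation} rule can be sketched in a few lines of Python. The device peaks and workload sizes below are hypothetical values chosen only for illustration; they do not describe any particular GPU or application.

```python
def estimated_runtime_s(total_bytes, total_flops, peak_bw_bytes_per_s, peak_flops_per_s):
    """Return (runtime, data_movement_time, computation_time) in seconds."""
    data_movement_s = total_bytes / peak_bw_bytes_per_s
    computation_s = total_flops / peak_flops_per_s
    # The slower of the two phases bounds the total running time.
    return max(data_movement_s, computation_s), data_movement_s, computation_s


# Hypothetical device: 1 TB/s memory bandwidth, 10 TFLOP/s peak compute.
PEAK_BW = 1.0e12
PEAK_FLOPS = 10.0e12

# Hypothetical workload: moves 4 GB of data and performs 8 GFLOP.
runtime_s, data_s, compute_s = estimated_runtime_s(4.0e9, 8.0e9, PEAK_BW, PEAK_FLOPS)
limiter = "data movement" if data_s >= compute_s else "computation"
print(f"data movement: {data_s:.4f} s, computation: {compute_s:.4f} s")
print(f"estimated running time: {runtime_s:.4f} s (limited by {limiter})")
```

In this made-up case the data movement term dominates, so faster arithmetic alone would not shorten the running time.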
Despite improvements in hardware compute performance, data movement is often the bottleneck. Roofline modeling introduces the concept of arithmetic intensity (AI): the number of floating-point operations performed for each byte of data moved from memory.
- Algorithms with high arithmetic intensity are considered compute-hungry. Their performance is limited by how quickly calculations can be performed.
- Algorithms with low arithmetic intensity are considered data-hungry. Their performance is limited by how quickly data can be moved.
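As a rough illustration, the sketch below estimates the arithmetic intensity of two textbook kernels and compares it with the point where a device's two limits meet (peak compute divided by peak bandwidth, often called the ridge point). The device peaks are hypothetical, and the byte counts assume every operand is moved to or from memory exactly once, which ignores caches.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def classify(ai, peak_flops_per_s, peak_bw_bytes_per_s):
    """Label a kernel by comparing its intensity with the device's ridge point."""
    ridge = peak_flops_per_s / peak_bw_bytes_per_s
    return "compute-hungry" if ai >= ridge else "data-hungry"


# Hypothetical device: 10 TFLOP/s and 1 TB/s, so the ridge is at 10 FLOP/byte.
PEAK_FLOPS = 10.0e12
PEAK_BW = 1.0e12
FP32_BYTES = 4

# Vector addition c = a + b: 1 FLOP per element, 2 reads + 1 write per element.
n = 1_000_000
vec_add_ai = arithmetic_intensity(n, 3 * FP32_BYTES * n)

# Square matrix multiply C = A @ B: 2*m^3 FLOPs, three m x m matrices moved once.
m = 1024
matmul_ai = arithmetic_intensity(2 * m**3, 3 * FP32_BYTES * m**2)

for name, ai in [("vector add", vec_add_ai), ("matmul", matmul_ai)]:
    print(f"{name}: AI = {ai:.3f} FLOP/byte -> {classify(ai, PEAK_FLOPS, PEAK_BW)}")
```

Under these assumptions the vector addition lands far below the ridge, while the matrix multiply lands well above it.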
Understanding the graph
Figure: Roofline graph. Image license: Creative Commons Attribution-Share Alike 4.0 International.
The roofline graph plots attainable FLOP/s (y-axis) against arithmetic intensity (x-axis). The “roof” itself indicates the hardware limits. The sloped part of the roof represents the peak data bandwidth (GB/s), while the flat part represents the peak compute performance (GFLOPS). Note that both axes in the image use a logarithmic scale.
- Points below the roof: Indicate suboptimal performance and show how much room there is for improvement.
- Points hitting the sloped line: Data-hungry applications. Their performance is limited by data bandwidth.
- Points hitting the flat line: Compute-hungry applications. They use the full computing power of the processor.
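The shape of the roof follows from taking the smaller of the two hardware limits at each arithmetic intensity. The sketch below evaluates that bound for a hypothetical device (the peak values are assumptions, not measurements) at a few intensities spaced logarithmically, as on the graph's x-axis.

```python
def attainable_flops_per_s(ai, peak_flops_per_s, peak_bw_bytes_per_s):
    """Roofline bound: the sloped bandwidth line capped by the flat compute line."""
    return min(peak_flops_per_s, ai * peak_bw_bytes_per_s)


# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s peak memory bandwidth.
PEAK_FLOPS = 10.0e12
PEAK_BW = 1.0e12

# Sample the roof at arithmetic intensities from 0.1 to 1000 FLOP/byte.
intensities = [0.1, 1.0, 10.0, 100.0, 1000.0]
roof = [attainable_flops_per_s(ai, PEAK_FLOPS, PEAK_BW) for ai in intensities]

for ai, bound in zip(intensities, roof):
    region = "sloped (bandwidth)" if ai * PEAK_BW < PEAK_FLOPS else "flat (compute)"
    print(f"AI = {ai:7.1f} FLOP/byte -> roof = {bound / 1e12:5.2f} TFLOP/s [{region}]")
```

A measured point is then compared with this bound at its own arithmetic intensity: the further below the roof it sits, the more headroom remains.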
Why is roofline modeling important?
Roofline modeling provides a visual and intuitive way to understand application performance and shows key characteristics such as operational intensity, GPU capabilities, and attainable FLOP/s. This kind of modeling helps programmers make targeted, hardware-aware optimizations in the places where they can achieve better results.
- Bottleneck analysis: A visual aid lets developers easily see where the bottleneck is: memory or compute. When an application is memory intensive, developers can focus on improving data locality through techniques such as caching and loop tiling. If it is compute intensive, the focus may shift to enabling more parallel computation or taking advantage of compiler optimizations.
- Hardware and software design: Software engineers should not be afraid of the underlying hardware. Instead, they should embrace the hardware design and optimize for it. Software engineers can use roofline modeling insights to embrace and optimize for the specific architecture they are using.
Roofline modeling in action
To perform roofline modeling, you need to profile your application to understand its performance. From profiling you can obtain metrics such as floating point operations (FLOPs) and memory bandwidth usage, both of which are required for roofline modeling. This article discusses two such tools: ncu, NVIDIA’s Nsight Compute CLI for GPU analysis, and the PyTorch profiler, which is specific to applications using PyTorch.
For detailed CUDA kernel optimization and accurate FLOP/byte calculations, ncu provides direct GPU hardware counter information. In contrast, torch.profiler.profile provides a higher-level perspective within PyTorch and helps you understand the overall application behavior, including operator-level performance, tensor memory usage, and both CPU and GPU activity.
Profiling with ncu
ncu is the command line interface used for profiling CUDA kernels [2]. You can view the results directly in your terminal or save them to a log file for later analysis. To build a roofline model, you need to capture the specific metrics from which arithmetic intensity can be calculated.
We use the PyTorch ImageNet repository [3] as our example. It is a good choice because it is easy to understand, is well documented by PyTorch, and works with the profiler.
Step 1: Run the ncu command to collect metrics
The first step is to run the application through ncu to collect the required hardware-level data. The command looks like this:
ncu --log-file <log_file_name> \
    --metrics <list_of_metrics_separated_by_comma> \
    --target-processes all \
    python3 <your_application.py application_arguments>
- Log file: The log file that stores the results.
- Metrics: This is the most important parameter and lists the metrics you want to capture. To calculate the arithmetic intensity, consider the following:
  - dram__sectors_write.sum: Total number of DRAM sectors written
  - dram__sectors_read.sum: Total number of DRAM sectors read
  - smsp__sass_thread_inst_executed_op_fadd_pred_on.sum: Total number of floating point additions
  - smsp__sass_thread_inst_executed_op_fmul_pred_on.sum: Total number of floating point multiplications
  - smsp__sass_thread_inst_executed_op_ffma_pred_on.sum: Total number of floating point fused multiply-add operations
- Target processes: The all flag ensures that the entire application is profiled.
With these values filled in, the ncu command becomes:
ncu --log-file logs_example --metrics dram__sectors_write.sum,\
dram__sectors_read.sum,\
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,\
smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,\
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
    --target-processes all python3 \
    main.py /imagenet --arch resnet50 --epochs 1 --batch-size 10 \
    --print-freq 10 --seed 42
Step 2: Calculate FLOPs from the metrics
Once the profiler has run, you can aggregate the collected metrics to calculate the total number of floating point operations. The formula is:
[FLOPs = 2 * FMA_count + FADD_count + FMUL_count]
- FLOPs: The total count of floating point operations.
- FMA_count: Fused multiply-add (FMA) operations usually count as two FLOPs (one multiplication and one addition). This is given by the smsp__sass_thread_inst_executed_op_ffma_pred_on.sum metric.
- FADD_count: This is given by the smsp__sass_thread_inst_executed_op_fadd_pred_on.sum metric.
- FMUL_count: This is given by the smsp__sass_thread_inst_executed_op_fmul_pred_on.sum metric.
Step 3: Calculate the bytes transferred
Next, calculate the total data transferred to and from DRAM. The ncu metrics provide the number of DRAM sectors read and written. Assume a typical sector size of 32 bytes for recent GPUs:
[Total_DRAM_bytes = (dram__sectors_read.sum + dram__sectors_write.sum) * 32]
Step 4: Calculate the arithmetic intensity
With the FLOPs and the total bytes, you can calculate the arithmetic intensity:
[AI = FLOPs / Total_DRAM_bytes]
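Steps 2 to 4 can be sketched as a small Python helper. The metric names match the ncu metrics listed earlier, but the counter values are invented for illustration, and the sketch assumes each metric has already been summed over all profiled kernels in the log.

```python
SECTOR_BYTES = 32  # DRAM sector size assumed in the text

FFMA = "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum"
FADD = "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"
FMUL = "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum"
DRAM_READ = "dram__sectors_read.sum"
DRAM_WRITE = "dram__sectors_write.sum"


def total_flops(metrics):
    """FLOPs = 2 * FMA_count + FADD_count + FMUL_count."""
    return 2 * metrics[FFMA] + metrics[FADD] + metrics[FMUL]


def total_dram_bytes(metrics):
    """Total_DRAM_bytes = (sectors read + sectors written) * sector size."""
    return (metrics[DRAM_READ] + metrics[DRAM_WRITE]) * SECTOR_BYTES


# Hypothetical totals, standing in for values aggregated from an ncu log.
metrics = {
    FFMA: 4_000_000_000,
    FADD: 500_000_000,
    FMUL: 500_000_000,
    DRAM_READ: 200_000_000,
    DRAM_WRITE: 50_000_000,
}

flops = total_flops(metrics)
dram_bytes = total_dram_bytes(metrics)
ai = flops / dram_bytes
print(f"FLOPs = {flops:.3e}, DRAM bytes = {dram_bytes:.3e}, AI = {ai:.3f} FLOP/byte")
```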
Step 5: Calculate the execution time
Finding application performance in FLOP/s requires the runtime as well. For this you can use NVIDIA Nsight Systems (nsys), a system-wide profiler that can accurately measure the execution time of application segments. Run the application again, this time under nsys, to generate a time-based report. From this report you can extract the total GPU running time.
nsys profile -f true -o <your_nsys_output_file.qdrep> python3 \
    <your_application.py application_arguments>
With these values filled in, the nsys command becomes:
nsys profile -f true -o time.qdrep python3 main.py /imagenet \
    --arch resnet50 --epochs 1 --batch-size 10 --print-freq 10 \
    --seed 42
After running this command, you can read GPU_RUNNING_TIME from the generated report.
Step 6: Calculate application performance
Finally, calculate the achieved performance in FLOP/s by dividing the total FLOPs by the running time:
[FLOP/s = FLOPs / GPU_RUNNING_TIME]
This value gives the “attainable FLOP/s” that can be plotted on the roofline graph.
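To show how the result of Step 6 relates to the roof, the sketch below converts a FLOP count and a running time into FLOP/s and compares it with the hardware bound at the same arithmetic intensity. All numbers are hypothetical: the FLOP count, byte count, and running time stand in for values taken from the ncu and nsys reports, and the device peaks are assumptions rather than vendor specifications.

```python
def achieved_flops_per_s(total_flops, gpu_running_time_s):
    """FLOP/s = FLOPs / GPU_RUNNING_TIME."""
    return total_flops / gpu_running_time_s


def roofline_bound(ai, peak_flops_per_s, peak_bw_bytes_per_s):
    """Highest FLOP/s the hardware allows at this arithmetic intensity."""
    return min(peak_flops_per_s, ai * peak_bw_bytes_per_s)


# Hypothetical measurements (placeholders for ncu and nsys results).
FLOPS_TOTAL = 9.0e9        # from Step 2
DRAM_BYTES_TOTAL = 8.0e9   # from Step 3
GPU_RUNNING_TIME = 0.02    # seconds, from Step 5

# Hypothetical device peaks.
PEAK_FLOPS = 10.0e12       # 10 TFLOP/s
PEAK_BW = 1.0e12           # 1 TB/s

ai = FLOPS_TOTAL / DRAM_BYTES_TOTAL
achieved = achieved_flops_per_s(FLOPS_TOTAL, GPU_RUNNING_TIME)
bound = roofline_bound(ai, PEAK_FLOPS, PEAK_BW)
fraction_of_roof = achieved / bound

print(f"point on the graph: x = {ai:.3f} FLOP/byte, y = {achieved / 1e9:.1f} GFLOP/s")
print(f"roof at this x: {bound / 1e9:.1f} GFLOP/s ({fraction_of_roof:.0%} reached)")
```

The pair (ai, achieved) is the point to plot; the fraction of the roof reached is a quick indication of how much headroom remains.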
Profiling with torch
For applications written in PyTorch, the built-in torch.profiler.profile provides a user-friendly way to collect performance data. It offers developers two options:
- Using the profiler context manager
- Targeted profiling of specific neural network layers
Profiler context manager
Wrap the part of the code you want to profile in the torch.profiler.profile() context manager. In the with statement you can define the activities to trace (CPU, CUDA, or both), set a schedule to profile specific training steps, and choose whether to record tensor shapes, memory usage, or FLOPs. Within the context, you need to call prof.step() at the end of each iteration to signal the profiler to advance, especially if a schedule is used.
with profile(
    activities=<arguments>,
    schedule=torch.profiler.schedule(<arguments>),
    record_shapes=<True|False>,
    profile_memory=<True|False>,
    with_flops=<True|False>
) as prof:
    ...
    prof.step()
- activities: Specifies whether to profile CPU, CUDA, or both.
- schedule: Helps to profile multiple steps in the training loop. If you use the schedule parameter, you must call prof.step() so that the profiler moves to the next step.
- record_shapes: Whether to record the shapes of the tensors.
- profile_memory: Whether to capture memory usage.
- with_flops: This is experimental, but it is used to estimate FLOPs per operator.
With these values filled in, the profiler call becomes:
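import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    ...  # run one training iteration here, then call prof.step()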
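For a self-contained impression of what the context manager returns, here is a minimal sketch that profiles a toy model on the CPU only. The model, input sizes, and iteration count are made up for illustration, and no schedule is used, so prof.step() is not needed. The FLOP figure comes from the experimental with_flops estimate, which covers only a subset of operators, so treat it as approximate.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for a real network; sizes are arbitrary.
model = torch.nn.Linear(256, 128)
x = torch.randn(32, 256)

with profile(
    activities=[ProfilerActivity.CPU],  # CPU only, so no GPU is required
    record_shapes=True,                 # shapes are used by the FLOP estimate
    with_flops=True,                    # experimental per-operator FLOP estimate
) as prof:
    with torch.no_grad():
        for _ in range(3):
            model(x)

# Aggregate events by operator name and sum the estimated FLOPs.
events = prof.key_averages()
total_flops = sum(evt.flops for evt in events)

print(events.table(sort_by="cpu_time_total", row_limit=5))
print("Estimated FLOPs:", total_flops)
```

On a machine with a GPU, adding ProfilerActivity.CUDA to activities and moving the model and input to the device would include GPU activity in the same table.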
Targeted profiling of specific neural network layers
The profiler can also be used in a more targeted way to analyze specific layers of a neural network. This helps you see whether a particular layer contributes significantly more to the running time than other layers, which gives developers the option to change that layer. This approach is very easy to use, but most of the time the first option works better. The results of the PyTorch profiler can also be exported and visualized in TensorBoard.
profiler.start()
self.conv2(x)
profiler.stop()
LLMs and roofline modeling
Now for the topic that everyone has been waiting for: does roofline modeling help calculate LLM performance? The short answer is yes.
LLMs are complex neural network architectures with billions of parameters, and they process massive datasets. Training is a very resource intensive task, but inference and fine-tuning of the model must also be efficient.
- Bottleneck: LLMs can suffer from bottlenecks during inference because the number of parameters they work with is enormous. These parameters are the model weights, and moving them causes memory bandwidth issues. Roofline modeling lets you profile the exact layers responsible for the bottlenecks.
- Hardware selection: Most organizations fine-tune existing models rather than training from scratch, so choosing the right infrastructure is important for managing costs. For example, selecting hardware according to the LLM architecture, or optimizing the model to run on a specific architecture, can reduce training and inference costs.
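A back-of-the-envelope sketch can show why model weights tend to create a memory bandwidth bottleneck during inference. It estimates the arithmetic intensity of a single linear layer as a function of batch size. The layer dimensions and the 2-byte (16-bit) value size are assumptions chosen for illustration, and the estimate ignores caches, attention, and any key-value cache, so it is a simplification rather than a model of a full LLM.

```python
def linear_layer_ai(batch, d_in, d_out, bytes_per_value=2):
    """Approximate FLOP/byte for y = x @ W with each tensor moved once."""
    flops = 2 * batch * d_in * d_out                      # one multiply + one add per weight per row
    weight_bytes = d_in * d_out * bytes_per_value         # weights are read once
    activation_bytes = batch * (d_in + d_out) * bytes_per_value  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)


# Hypothetical layer size; batch 1 resembles generating one token at a time.
D_IN, D_OUT = 4096, 4096
batch_sizes = [1, 8, 64, 256]
intensities = {b: linear_layer_ai(b, D_IN, D_OUT) for b in batch_sizes}

for b, ai in intensities.items():
    print(f"batch = {b:3d} -> AI = {ai:7.2f} FLOP/byte")
```

With a batch of one, the layer performs roughly one FLOP per byte of weights read, which places it deep in the data-hungry region of most rooflines; larger batches reuse each weight more often and move the layer toward the compute-hungry side.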
Conclusion
The roofline model provides a powerful visual analysis for application performance optimization. Visualizing application performance across memory and compute gives clear guidance when choosing the best way to approach optimization. This article only considers naive roofline models, but there are more advanced techniques, such as hierarchical roofline models and the addition of ceilings for specific compute optimizations.
References
[1] https://docs.nersc.gov/tools/performance/roofline/
[2] https://docs.nvidia.com/nsight-compute/nsightcomputecli/index.html

