This is a guest post co-written with Meta's PyTorch team and is a continuation of Part 1 of this series, where we demonstrate the performance and ease of running PyTorch 2.0 on AWS.
Machine learning (ML) research has proven that large language models (LLMs) trained with significantly large datasets result in better model quality. In the last few years, the size of current generation models has increased significantly, and these models require modern tools and infrastructure to be trained efficiently and at scale. PyTorch Distributed Data Parallel (DDP) helps process data at scale in a simple and robust manner, but it requires the model to fit on one GPU. The PyTorch Fully Sharded Data Parallel (FSDP) library breaks this barrier by enabling model sharding to train large models across data parallel workers.
Distributed model training requires a cluster of worker nodes that can scale. Amazon Elastic Kubernetes Service (Amazon EKS) is a popular Kubernetes-conformant managed service that greatly simplifies the process of running AI/ML workloads, making it more manageable and less time-consuming.
In this blog post, AWS collaborates with Meta's PyTorch team to discuss how to use the PyTorch FSDP library to achieve linear scaling of deep learning models on AWS seamlessly using Amazon EKS and AWS Deep Learning Containers (DLCs). We demonstrate this through a step-by-step implementation of training 7B, 13B, and 70B Llama2 models using Amazon EKS with 16 Amazon Elastic Compute Cloud (Amazon EC2) p4de.24xlarge instances (each with 8 NVIDIA A100 Tensor Core GPUs and 80 GB HBM2e memory per GPU) or 16 EC2 p5.48xlarge instances (each with 8 NVIDIA H100 Tensor Core GPUs and 80 GB HBM3 memory per GPU), achieving near linear scaling in throughput and ultimately enabling faster training time.
The following scaling chart shows that the p5.48xlarge instances offer 87% scaling efficiency with FSDP Llama2 fine-tuning in a 16-node cluster configuration.
Challenges of training LLMs
Businesses are increasingly adopting LLMs for a range of tasks, including virtual assistants, translation, content creation, and computer vision, to enhance efficiency and accuracy in a variety of applications.
However, training or fine-tuning these large models for a custom use case requires a large amount of data and compute power, which adds to the overall engineering complexity of the ML stack. This is also due to the limited memory available on a single GPU, which restricts the size of the model that can be trained and also limits the per-GPU batch size used during training.
To address this challenge, various model parallelism techniques such as DeepSpeed ZeRO and PyTorch FSDP have been created to allow you to overcome this barrier of limited GPU memory. This is done by adopting a sharded data parallel technique, where each accelerator holds just a slice (a shard) of a model replica instead of the entire replica, which dramatically reduces the memory footprint of the training job.
This post demonstrates how you can use PyTorch FSDP to fine-tune the Llama2 model using Amazon EKS. We achieve this by scaling out compute and GPU capacity to address the model requirements.
FSDP overview
In PyTorch DDP training, each GPU (referred to as a worker in the context of PyTorch) holds a complete copy of the model, including the model weights, gradients, and optimizer states. Each worker processes a batch of data and, at the end of the backward pass, uses an all-reduce operation to synchronize gradients across the different workers.
Having a replica of the model on each GPU restricts the size of the model that can be accommodated in a DDP workflow. FSDP helps overcome this limitation by sharding model parameters, optimizer states, and gradients across data parallel workers while still preserving the simplicity of data parallelism.
This is demonstrated in the following diagram, where in the case of DDP, each GPU holds a complete copy of the model state, including the optimizer state (OS), gradients (G), and parameters (P): M(OS + G + P). In FSDP, each GPU holds only a slice of the model state, including the optimizer state (OS), gradients (G), and parameters (P): M<partition number>(OS + G + P). Using FSDP results in a significantly smaller GPU memory footprint compared to DDP across all workers, enabling the training of very large models or the use of larger batch sizes for training jobs.
This, however, comes at the cost of increased communication overhead, which is mitigated through FSDP optimizations such as overlapping communication and computation with features like pre-fetching. For more detailed information, refer to Getting Started with Fully Sharded Data Parallel (FSDP).
FSDP offers various parameters that allow you to tune the performance and efficiency of your training jobs. Some of the key features and capabilities of FSDP include:
- Transformer wrapping policy
- Flexible mixed precision
- Activation checkpointing
- Various sharding strategies to suit different network speeds and cluster topologies:
  - FULL_SHARD – Shard model parameters, gradients, and optimizer states
  - HYBRID_SHARD – Full shard within a node, DDP across nodes; supports a flexible sharding group for a full replica of the model (HSDP)
  - SHARD_GRAD_OP – Shard only gradients and optimizer states
  - NO_SHARD – Similar to DDP
For more information about FSDP, refer to Efficient Large-Scale Training with Pytorch FSDP and AWS.
The following figure shows how FSDP works for two data parallel processes.
Solution overview
In this post, we set up a compute cluster using Amazon EKS, which is a managed service to run Kubernetes in the AWS Cloud and on-premises data centers. Many customers are embracing Amazon EKS to run Kubernetes-based AI/ML workloads, taking advantage of its performance, scalability, reliability, and availability, as well as its integrations with AWS networking, security, and other services.
For our FSDP use case, we use the Kubeflow Training Operator on Amazon EKS, which is a Kubernetes-native project that facilitates fine-tuning and scalable distributed training for ML models. It supports various ML frameworks, including PyTorch, which you can use to deploy and manage PyTorch training jobs at scale.
Using the PyTorchJob custom resource of the Kubeflow Training Operator, we run training jobs on Kubernetes with a configurable number of worker replicas, which allows us to optimize resource utilization.
The following are a few components of the training operator that play a role in our Llama2 fine-tuning use case:
- A centralized Kubernetes controller that orchestrates distributed training jobs for PyTorch.
- PyTorchJob, a Kubernetes custom resource for PyTorch, provided by the Kubeflow Training Operator, to define and deploy Llama2 training jobs on Kubernetes.
- etcd, which is related to the implementation of the rendezvous mechanism for coordinating the distributed training of PyTorch models. The etcd server, as part of the rendezvous process, facilitates the coordination and synchronization of the participating workers during distributed training.
The following diagram illustrates the solution architecture.
Most of the details are abstracted by the automation scripts that we use to run the Llama2 example.
We use the following code references in this use case:
What is Llama2?
Llama2 is an LLM pre-trained on 2 trillion tokens of text and code. It is one of the largest and most powerful LLMs available today. You can use Llama2 for a variety of tasks, including natural language processing (NLP), text generation, and translation. For more information, refer to Getting started with Llama.
Llama2 is available in three different model sizes:
- Llama2-70b – This is the largest Llama2 model, with 70 billion parameters. It is the most powerful Llama2 model and can be used for the most demanding tasks.
- Llama2-13b – This is a medium-sized Llama2 model, with 13 billion parameters. It offers a good balance between performance and efficiency, and can be used for a variety of tasks.
- Llama2-7b – This is the smallest Llama2 model, with 7 billion parameters. It is the most efficient Llama2 model and can be used for tasks that don't require the highest level of performance.
This post enables you to fine-tune all of these models on Amazon EKS. To provide a simple and reproducible experience of creating an EKS cluster and running FSDP jobs on it, we use the aws-do-eks project. The example will also work with a pre-existing EKS cluster.
A scripted walkthrough is available on GitHub for an out-of-the-box experience. In the following sections, we explain the end-to-end process in more detail.
Provision the solution infrastructure
For the experiments described in this post, we use clusters with p4de (A100 GPU) and p5 (H100 GPU) nodes.
Cluster with p4de.24xlarge nodes
For our cluster with p4de nodes, we use the following eks-gpu-p4de-odcr.yaml script:
Using eksctl and the preceding cluster manifest, we create a cluster with p4de nodes:
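For example, a minimal invocation, assuming the manifest above is saved locally as eks-gpu-p4de-odcr.yaml:

```bash
# Create the EKS cluster and its p4de node group from the manifest
eksctl create cluster -f ./eks-gpu-p4de-odcr.yaml
```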
Cluster with p5.48xlarge nodes
A Terraform template for an EKS cluster with p5 nodes is located in the following GitHub repo.
You can customize the cluster via the variables.tf file and then create it using the Terraform CLI:
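A minimal sketch of the usual Terraform workflow, run from the directory that contains the template:

```bash
# Initialize the working directory, preview the changes, then create the cluster
terraform init
terraform plan -out tfplan
terraform apply tfplan
```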
You can verify the cluster availability by running a simple kubectl command:
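For example:

```bash
# List the cluster nodes and their status
kubectl get nodes
```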
The cluster is healthy if the output of this command shows the expected number of nodes in Ready status.
Deploy prerequisites
To run FSDP on Amazon EKS, we use the PyTorchJob custom resource. It requires etcd and the Kubeflow Training Operator as prerequisites.
Deploy etcd with the following code:
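A minimal sketch, assuming an etcd deployment manifest is available locally (the path below is illustrative, not the exact file from the example repo):

```bash
# Deploy the etcd service used as the rendezvous backend for PyTorchJob
kubectl apply -f ./etcd.yaml

# Verify that the etcd pod reaches Running status
kubectl get pods | grep etcd
```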
Deploy the Kubeflow Training Operator with the following code:
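One common way to install the operator is the standalone kustomize overlay from the upstream repository; the release ref below is an example and should be pinned to the version you want:

```bash
# Install the Kubeflow Training Operator (standalone overlay)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# Verify that the training-operator pod is running
kubectl get pods -n kubeflow
```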
Build and push an FSDP container image to Amazon ECR
Use the following code to build an FSDP container image and push it to Amazon Elastic Container Registry (Amazon ECR):
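A hedged sketch of a typical build-and-push flow; the repository name, tag, and region below are placeholders:

```bash
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com
export IMAGE=fsdp
export TAG=latest

# Build the FSDP training image from the local Dockerfile
docker build -t ${REGISTRY}/${IMAGE}:${TAG} .

# Create the ECR repository if it does not already exist, log in, and push
aws ecr create-repository --repository-name ${IMAGE} --region ${AWS_REGION} || true
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${REGISTRY}
docker push ${REGISTRY}/${IMAGE}:${TAG}
```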
Create the FSDP PyTorchJob manifest
Insert your Hugging Face token in the following snippet prior to running it:
Configure your PyTorchJob with a .env file or directly in your environment variables as below:
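A hedged example of the kind of settings involved; the variable names below follow common conventions and are illustrative, not the exact contents of the project's .env file:

```bash
# Hugging Face token used to download the Llama2 weights (replace the placeholder)
export HF_TOKEN=<your_huggingface_token>

# Illustrative job settings: container image, model size, and cluster shape
export IMAGE_URI=<account>.dkr.ecr.<region>.amazonaws.com/fsdp:latest
export MODEL_NAME=meta-llama/Llama-2-7b-hf
export NUM_NODES=4
export GPU_PER_NODE=8
```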
Generate the PyTorchJob manifest using the fsdp template and the generate.sh script, or create it directly using the script below:
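As one possible approach, the template can be rendered by substituting the exported variables into it; the file names here are assumptions:

```bash
# Render the PyTorchJob manifest from the template using the exported environment variables
envsubst < fsdp.yaml-template > fsdp.yaml
```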
Run the PyTorchJob
Run the PyTorchJob with the following code:
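Assuming the manifest generated in the previous step is named fsdp.yaml:

```bash
# Submit the PyTorchJob to the cluster
kubectl apply -f ./fsdp.yaml
```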
You will see the specified number of FSDP worker pods created; after pulling the image, they will enter a Running state.
To see the status of the PyTorchJob, use the following code:
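For example, assuming the job is named fsdp as in the manifest above:

```bash
# High-level job status and events
kubectl describe pytorchjob fsdp

# Status of the individual worker pods
kubectl get pods | grep fsdp-worker
```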
To stop the PyTorchJob, use the following code:
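For example:

```bash
# Delete the job and its worker pods
kubectl delete -f ./fsdp.yaml
```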
After a job is complete, it needs to be deleted before initiating a new run. We have also observed that deleting the etcd pod and letting it restart prior to launching a new job helps avoid a RendezvousClosedError.
Scale the cluster
You can repeat the preceding steps of creating and running jobs while varying the number and instance type of worker nodes in the cluster. This enables you to produce scaling charts like the one shown earlier. In general, you should see a reduction in GPU memory footprint, a reduction in epoch time, and an increase in throughput when more nodes are added to the cluster. The previous chart was produced by conducting several experiments using a p5 node group varying from 1 to 16 nodes in size.
Observe the FSDP training workload
Observability of generative artificial intelligence workloads is important to allow visibility into your running jobs as well as to aid in maximizing the utilization of your compute resources. In this post, we use a few Kubernetes-native and open source observability tools for this purpose. These tools enable you to track errors, statistics, and model behavior, making AI observability a crucial part of any business use case. In this section, we show various approaches for observing FSDP training jobs.
Worker pod logs
At the most basic level, you need to be able to see the logs of your training pods. This can easily be done by using Kubernetes-native commands.
First, retrieve a list of pods and locate the name of the one that you want to see logs for:
Then view the logs for the selected pod:
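For example (the pod name is a placeholder):

```bash
# List the worker pods for the running job
kubectl get pods

# Stream the logs of the selected worker pod
kubectl logs -f <pod-name>
```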
Only one worker (elected leader) pod log will list the overall job statistics. The name of the elected leader pod is available at the beginning of each worker pod log, identified by the key master_addr=.
CPU utilization
Distributed training workloads require both CPU and GPU resources. To optimize these workloads, it's important to understand how these resources are utilized. Fortunately, some great open source utilities are available that help visualize CPU and GPU utilization. For viewing CPU utilization, you can use htop. If your worker pods contain this utility, you can use the command below to open a shell into a pod and then run htop.
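A sketch, assuming htop is installed in the worker image and using a placeholder pod name:

```bash
# Open an interactive shell in a worker pod, then run htop inside it
kubectl exec -it <worker-pod-name> -- /bin/bash
htop
```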
Alternatively, you can deploy an htop daemonset similar to the one provided in the following GitHub repo.
The daemonset will run a lightweight htop pod on each node. You can exec into any of these pods and run the htop command:
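For example, with a placeholder daemonset pod name:

```bash
kubectl exec -it <htop-pod-name> -- htop
```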
The following screenshot shows the CPU utilization on one of the nodes in the cluster. In this case, we are looking at a p5.48xlarge instance, which has 192 vCPUs. The processor cores are idle while the model weights are downloaded, and we see rising utilization while the model weights are being loaded into GPU memory.
GPU utilization
If the nvtop utility is available in your pod, you can exec into the pod using the command below and then run nvtop.
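A sketch, assuming nvtop is present in the worker image:

```bash
kubectl exec -it <worker-pod-name> -- nvtop
```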
Alternatively, you can deploy an nvtop daemonset similar to the one provided in the following GitHub repo.
This will run an nvtop pod on each node. You can exec into any of those pods and run nvtop:
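For example:

```bash
kubectl exec -it <nvtop-pod-name> -- nvtop
```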
The following screenshot shows the GPU utilization on one of the nodes in the training cluster. In this case, we are looking at a p5.48xlarge instance, which has 8 NVIDIA H100 GPUs. The GPUs are idle while the model weights are downloaded, then GPU memory utilization increases as the model weights are loaded onto the GPUs, and GPU utilization spikes to 100% while the training iterations are underway.
Grafana dashboard
Now that you understand how your system works at the pod and node level, it's also important to look at metrics at the cluster level. Aggregated utilization metrics can be collected by the NVIDIA DCGM Exporter and Prometheus and visualized in Grafana.
An example Prometheus-Grafana deployment is available in the following GitHub repo.
An example DCGM exporter deployment is available in the following GitHub repo.
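A hedged sketch of installing the stack with Helm; the chart sources are the public prometheus-community and NVIDIA DCGM exporter repositories, and the namespace is an assumption:

```bash
# Install kube-prometheus-stack (Prometheus and Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# Install the NVIDIA DCGM exporter to publish GPU metrics
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring
```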
A simple Grafana dashboard is shown in the following screenshot. It was built by selecting the following DCGM metrics: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_GPU_TEMP, and DCGM_FI_DEV_POWER_USAGE. The dashboard can be imported into Grafana from GitHub.
The following dashboard shows one run of a Llama2 7b single-epoch training job. The graphs show that as the streaming multiprocessor (SM) clock increases, the power draw and temperature of the GPUs increase as well, along with GPU and memory utilization. You can also see that there were no XID errors and the GPUs were healthy during this run.
Since March 2024, GPU observability for EKS has been supported natively in CloudWatch Container Insights. To enable this functionality, just deploy the CloudWatch Observability add-on in your EKS cluster. Then you will be able to browse pod, node, and cluster level metrics through pre-configured and customizable dashboards in Container Insights.
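One way to enable the add-on, with placeholder cluster name and region:

```bash
aws eks create-addon \
  --addon-name amazon-cloudwatch-observability \
  --cluster-name <cluster-name> \
  --region <region>
```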
Clean up
If you created your cluster using the examples provided in this blog, you can execute the following code to delete the cluster and any resources associated with it, including the VPC:
For eksctl:
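For example, using the same manifest that created the cluster:

```bash
eksctl delete cluster -f ./eks-gpu-p4de-odcr.yaml
```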
For terraform:
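For example, from the directory containing the template:

```bash
terraform destroy
```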
Upcoming features
FSDP is expected to include a per-parameter sharding feature, aiming to further improve its memory footprint per GPU. Additionally, the ongoing development of FP8 support aims to improve FSDP performance on H100 GPUs. Finally, when FSDP is integrated with torch.compile, we hope to see additional performance improvements and the enablement of features like selective activation checkpointing.
Conclusion
In this post, we discussed how FSDP reduces the memory footprint on each GPU, enabling the training of larger models more efficiently and achieving near linear scaling in throughput. We demonstrated this through a step-by-step implementation of training a Llama2 model using Amazon EKS on p4de and p5 instances, and used observability tools like kubectl, htop, nvtop, and DCGM to monitor logs as well as CPU and GPU utilization.
We encourage you to take advantage of PyTorch FSDP for your own LLM training jobs. Get started at aws-do-fsdp.
About the Authors
Kanwaljit Khurmi is a Principal AI/ML Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their machine learning solutions on AWS. Kanwaljit specializes in helping customers with containerized, distributed computing and deep learning applications.
Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He's a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges.
Ana Simoes is a Principal Machine Learning Specialist, ML Frameworks at AWS. She supports customers deploying AI, ML, and generative AI at a large scale on HPC infrastructure in the cloud. Ana focuses on supporting customers to achieve price-performance for new workloads and use cases for generative AI and machine learning.
Hamid Shojanazeri is a Partner Engineer at PyTorch working on open source, high-performance model optimization, distributed training (FSDP), and inference. He is the co-creator of llama-recipe and a contributor to TorchServe. His main interest is to improve cost-efficiency, making AI more accessible to the broader community.
Less Wright is an AI/Partner Engineer in PyTorch. He works on Triton/CUDA kernels (Accelerating Dequant with SplitK work decomposition); paged, streaming, and quantized optimizers; and PyTorch Distributed (PyTorch FSDP).