Customize DeepSeek-R1 distilled models using Amazon SageMaker HyperPod recipes – Part 1

Increasingly, organizations across industries are turning to generative AI foundation models (FMs) to enhance their applications. To achieve optimal performance for specific use cases, customers are adopting and adapting these FMs to their unique domain requirements. This need for customization has become even more pronounced with the emergence of new models, such as those released by DeepSeek.

However, customizing DeepSeek models effectively while managing computational resources remains a significant challenge. It requires technical expertise in tuning the model architecture, setting training and fine-tuning parameters, and managing distributed training infrastructure, among other tasks. This often forces companies to choose between model performance and practical implementation constraints, creating a critical need for more accessible and streamlined model customization solutions.

In this two-part series, we discuss how you can reduce the DeepSeek model customization complexity by using the pre-built fine-tuning workflows (also called “recipes”) for both the DeepSeek-R1 model and its distilled variations, released as part of Amazon SageMaker HyperPod recipes.

In this first post, we build a solution architecture for fine-tuning DeepSeek-R1 distilled models and demonstrate the approach with a step-by-step example of customizing the DeepSeek-R1 Distill Qwen 7b model using recipes, achieving an average improvement of 25% across all the ROUGE scores, with a maximum of 49% on the ROUGE-2 score, with both SageMaker HyperPod and SageMaker training jobs. The second part of the series will focus on fine-tuning the DeepSeek-R1 671b model itself.

At the time of this writing, the DeepSeek-R1 model and its distilled variations for Llama and Qwen were the latest released recipes. Check out sagemaker-hyperpod-recipes on GitHub for the latest released recipes, including support for fine-tuning the DeepSeek-R1 671b parameter model.

Amazon SageMaker HyperPod recipes

At re:Invent 2024, we announced the general availability of Amazon SageMaker HyperPod recipes. SageMaker HyperPod recipes help data scientists and developers of all skill sets get started training and fine-tuning popular publicly available generative AI models in minutes with state-of-the-art training performance. These recipes include a training stack validated by Amazon Web Services (AWS), which removes the tedious work of experimenting with different model configurations, minimizing the time it takes for iterative evaluation and testing. They automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

Recipes, paired with the resilient infrastructure of AWS (Amazon SageMaker HyperPod and Amazon SageMaker Model Training), provide a resilient training environment for fine-tuning FMs such as DeepSeek-R1 with out-of-the-box customization.

To help customers quickly use DeepSeek’s powerful and cost-efficient models to accelerate generative AI innovation, we released new recipes to fine-tune six DeepSeek models, including DeepSeek-R1 distilled Llama and Qwen models, using supervised fine-tuning (SFT), Quantized Low-Rank Adaptation (QLoRA), and Low-Rank Adaptation (LoRA) techniques. In this post, we introduce these new recipes and walk you through a solution to fine-tune a DeepSeek Qwen 7b model for an advanced medical reasoning use case.

Solution overview

At its core, as depicted in the following diagram, the recipe architecture implements a hierarchical workflow that begins with a recipe specification: a comprehensive configuration defining the training parameters, model architecture, and distributed training strategies. These recipes are processed through the HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture. The launcher interfaces with underlying cluster management systems such as SageMaker HyperPod (Slurm or Kubernetes) or training jobs, which handle resource allocation and scheduling. It’s a familiar NeMo-style launcher with which you can choose a recipe and run it on your infrastructure of choice (SageMaker HyperPod or training jobs).

For example, after choosing your recipe, you can pre-train or fine-tune a model by running python3 main.py recipes=recipe-name. Alternatively, you can use a launcher script, which is a bash script that is preconfigured to run the chosen training or fine-tuning job on your cluster. You can check out main.py (NeMo-style launcher) and the launcher scripts for DeepSeek in the GitHub repository hosting SageMaker HyperPod recipes.

A key component of this architecture is the HyperPod training adapter for NeMo, which is built on the NVIDIA NeMo framework and the Neuronx Distributed training package. The adapter loads data, creates models, and facilitates efficient data parallelism, model parallelism, and hybrid parallelism strategies, enabling optimal utilization of computational resources across the distributed infrastructure. The architecture’s modular design allows for scalability and flexibility, making it particularly effective for training LLMs that require distributed computing capabilities.

You can run these recipes using SageMaker HyperPod or as SageMaker training jobs. For organizations that require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. SageMaker training jobs, on the other hand, are tailored for organizations that want a fully managed experience for their training workflows. To learn more about these service features, refer to Generative AI foundation model training on Amazon SageMaker.

In the next sections, we go over the solution architecture for these services before presenting a step-by-step implementation example for each.

SageMaker HyperPod

To submit jobs using SageMaker HyperPod, you can use the HyperPod recipes launcher, which provides a straightforward mechanism to run recipes on both Slurm and Kubernetes. After you choose your orchestrator, you can choose your recipe’s launcher and have it run on your HyperPod cluster. The launcher interfaces with your cluster using Slurm or Kubernetes native constructs. For this post, we use the HyperPod recipes launcher mechanism to run the training on a Slurm cluster. The following image shows the solution architecture for SageMaker HyperPod.

SageMaker training jobs

The workflow for SageMaker training jobs begins with an API request that interfaces with the SageMaker control plane, which manages the orchestration of training resources. The system uses the training jobs launcher to efficiently run workloads on a managed cluster.

The architecture uses Amazon Elastic Container Registry (Amazon ECR) for container image management. Training jobs are executed across a distributed cluster, with seamless integration to multiple storage solutions, including Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. All of this runs under the SageMaker managed environment, providing optimal resource utilization and security.

This design simplifies the complexity of distributed training while maintaining the flexibility needed for diverse machine learning (ML) workloads, making it an ideal solution for enterprise AI development. The following image shows the solution architecture for SageMaker training jobs.

Solution walkthrough

For this solution, consider a use case for a healthcare industry startup that aims to create an accurate, medically verified chat assistant application that bridges complex medical information with patient-friendly explanations. By fine-tuning DeepSeek-R1 Distill Qwen 7b using the FreedomIntelligence/medical-o1-reasoning-SFT dataset, you can use its medical reasoning capabilities to produce content that maintains clinical accuracy.

Prerequisites

You need to complete the following prerequisites before you can run the DeepSeek-R1 Distill Qwen 7B model fine-tuning notebook.

Make the following quota increase requests for SageMaker. You need to request a minimum of one p4d.24xlarge instance (with 8 x NVIDIA A100 GPUs) and a maximum of two p4d.24xlarge instances (depending on the time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

- P4 instances (p4d.24xlarge) for training job usage: 1–2
- P4 instances (p4d.24xlarge) for HyperPod clusters (“ml.p4d.24xlarge for cluster usage”): 1–2

If you choose to use HyperPod clusters to run your training, set up a HyperPod Slurm cluster following the documentation at Tutorial for getting started with SageMaker HyperPod. Alternatively, you can use the AWS CloudFormation template provided in the AWS Workshop Studio at Amazon SageMaker HyperPod Own Account and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
(Optional) If you choose to use SageMaker training jobs, you can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)

Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:

git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
cd sagemaker-distributed-training-workshop/18_sagemaker_training_recipes/ft_deepseek_qwen_lora

Next, we run the model_trainer_deepseek_r1_recipe_lora.ipynb notebook to fine-tune the DeepSeek-R1 model using QLoRA on SageMaker.

Prepare the dataset

To prepare the dataset, you need to load the FreedomIntelligence/medical-o1-reasoning-SFT dataset, tokenize and chunk the dataset, and configure the data channels for SageMaker training on Amazon S3. Complete the following steps:

Format the dataset by applying the prompt format for DeepSeek-R1 Distill Qwen 7B:

def generate_prompt(data_point):
    full_prompt = f"""
    Below is an instruction that describes a task, paired with an input that provides further context.
    Write a response that appropriately completes the request.
    Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

    ### Instruction:
    You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
    Please answer the following medical question.

    ### Question:
    {data_point["Question"]}

    ### Response:
    {data_point["Complex_CoT"]}

    """
    return {"prompt": full_prompt.strip()}

Load the FreedomIntelligence/medical-o1-reasoning-SFT dataset and split it into training and validation datasets:

from datasets import load_dataset

dataset_name = "FreedomIntelligence/medical-o1-reasoning-SFT"

# Load dataset from the hub: the first 5% is held out for validation
train_set = load_dataset(dataset_name, 'en', split="train[5%:]")
test_set = load_dataset(dataset_name, 'en', split="train[:5%]")

...

train_dataset = train_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

test_dataset = test_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)
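To make the split concrete, the following minimal sketch shows how a leading 5% hold-out partitions a dataset by row index. The function name percent_split and the rounding rule are assumptions made for illustration; the Hugging Face datasets library applies its own rounding, so exact boundaries can differ slightly on datasets whose size does not divide evenly.

```python
def percent_split(num_examples: int, holdout_pct: int = 5) -> tuple[range, range]:
    """Return (train_indices, validation_indices) for a leading hold-out slice.

    Hypothetical helper: it mirrors the intent of a "[5%:]" / "[:5%]" split,
    assuming the boundary is rounded to the nearest index.
    """
    if not 0 <= holdout_pct <= 100:
        raise ValueError("holdout_pct must be between 0 and 100")
    boundary = round(num_examples * holdout_pct / 100)
    validation_indices = range(0, boundary)        # leading hold-out rows
    train_indices = range(boundary, num_examples)  # all remaining rows
    return train_indices, validation_indices


train_idx, val_idx = percent_split(1000)
print(len(train_idx), len(val_idx))
```

The two slices never overlap, so no validation example is seen during fine-tuning.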

Load the DeepSeek-R1 Distill Qwen 7B tokenizer from the Hugging Face Transformers library and generate tokens for the train and validation datasets:

from transformers import AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
max_seq_length = 1024

# Initialize a tokenizer by loading a pre-trained tokenizer configuration, using the fast tokenizer implementation if available.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

...

train_dataset = train_dataset.map(tokenize, remove_columns=["prompt"])
test_dataset = test_dataset.map(tokenize, remove_columns=["prompt"])
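The tokenize function itself is elided above. As a rough illustration of the chunking part of this step, the sketch below splits a flat list of token IDs into blocks of at most max_seq_length tokens. chunk_token_ids is a hypothetical helper, not the notebook's implementation, and keeping the final short block is an assumption.

```python
from typing import Iterator


def chunk_token_ids(token_ids: list[int], max_seq_length: int = 1024) -> Iterator[list[int]]:
    """Yield consecutive blocks of at most max_seq_length token IDs.

    Hypothetical helper for illustration; the notebook's tokenize function
    may truncate or pad instead of chunking this way.
    """
    if max_seq_length <= 0:
        raise ValueError("max_seq_length must be positive")
    for start in range(0, len(token_ids), max_seq_length):
        yield token_ids[start:start + max_seq_length]


# Stand-in token IDs; a real run would use the tokenizer's input_ids.
fake_ids = list(range(2500))
blocks = list(chunk_token_ids(fake_ids, max_seq_length=1024))
print([len(block) for block in blocks])
```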

Prepare the training and validation datasets for SageMaker training by saving them as Arrow files, which is the format required by SageMaker HyperPod recipes, and constructing the S3 paths where these files will be uploaded:

train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test"

# Save both splits in Arrow format directly to Amazon S3
train_dataset.save_to_disk(train_dataset_s3_path)
test_dataset.save_to_disk(val_dataset_s3_path)  # test_dataset is the validation split created earlier

The dataset above will be used in the examples for both SageMaker training jobs and SageMaker HyperPod.

Option A: Fine-tune using SageMaker training jobs

To fine-tune the model using SageMaker training jobs with recipes, this example uses the ModelTrainer class.

The ModelTrainer class is a newer and more intuitive approach to model training that significantly enhances user experience and supports distributed training, Build Your Own Container (BYOC), and recipes. For more information about ModelTrainer, refer to Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer.

To set up the fine-tuning workload, complete the following steps:

Select the instance type and the container image for the training job, and define the checkpoint path where the model will be stored:

instance_type = "ml.p4d.24xlarge"

image_uri = (
    f"658645717510.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
)

checkpoint_s3_path = f"s3://{bucket_name}/deepseek-r1-distilled-qwen-7b-recipe-lora/checkpoints"

Create the ModelTrainer function to encapsulate the training setup from a selected recipe:

from sagemaker.modules.configs import CheckpointConfig, Compute, InputData, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

instance_count = 1

# Working overrides for the custom dataset
recipe_overrides = {
    ...
    "trainer": {
        "num_nodes": instance_count,
        ...
    },
    ...
    "use_smp_model": False, # Required for PEFT
    "model": {
        "hf_model_name_or_path": model_id,
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/test",
        },
    },
}

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0
)

model_trainer = ModelTrainer.from_recipe(
    training_image=image_uri,
    training_recipe="fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora",
    recipe_overrides=recipe_overrides,
    requirements="./requirements.txt",
    compute=compute_configs,
    ...
    checkpoint_config=CheckpointConfig(
        s3_uri=f"{checkpoint_s3_path}/{job_prefix}"
    ),
)

You can point to the specific recipe with the training_recipe argument and override the recipe arguments by providing a dictionary as the recipe_overrides argument. In the previous example:

num_nodes: Indicates the number of instances that will be used for the fine-tuning execution
checkpoint_dir: Location in the container where the job will save model checkpoints

The ModelTrainer class simplifies the experience by encapsulating code and training setup directly from the selected recipe. In this example:

training_recipe: hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora defines the fine-tuning setup for the LoRA technique
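Conceptually, the overrides dictionary is layered on top of the recipe's defaults. The following sketch shows one way such a nested merge can behave, using plain dictionaries. deep_merge and the sample default values are hypothetical and only illustrate the idea; the SageMaker Python SDK handles the actual merge internally and may differ in detail.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where values in overrides replace those in defaults.

    Nested dictionaries are merged key by key; any other value is replaced.
    Hypothetical helper, not part of the SageMaker Python SDK.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative defaults only; the real recipe YAML contains many more keys.
recipe_defaults = {
    "trainer": {"num_nodes": 4, "max_steps": 50},
    "model": {"data": {"train_dir": None, "val_dir": None}},
}
overrides = {
    "trainer": {"num_nodes": 1},
    "model": {"data": {"train_dir": "/opt/ml/input/data/train"}},
}
effective = deep_merge(recipe_defaults, overrides)
print(effective["trainer"])
```

Keys that are not overridden keep their default values, which is why only the settings that differ from the recipe need to appear in the overrides.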

Set up the input channels for ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation datasets
Submit the training job:

# Start the training job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)
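The training and validation directory overrides used earlier follow from how SageMaker training jobs expose input channels: each channel is mounted inside the container under /opt/ml/input/data/<channel_name>. The sketch below makes that mapping explicit. channel_dir is a hypothetical helper, and the channel names train and test are assumed from the paths in the overrides.

```python
from pathlib import PurePosixPath

# Base directory where SageMaker training jobs mount input channels.
INPUT_DATA_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_dir(channel_name: str) -> str:
    """Return the in-container directory for an input channel (hypothetical helper)."""
    if not channel_name or "/" in channel_name:
        raise ValueError("channel_name must be a non-empty name without slashes")
    return str(INPUT_DATA_ROOT / channel_name)


# Channel names assumed from the train_dir and val_dir overrides.
data_overrides = {
    "train_dir": channel_dir("train"),
    "val_dir": channel_dir("test"),
}
print(data_overrides)
```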

Option B: Fine-tune using SageMaker HyperPod with Slurm

To fine-tune the model using HyperPod, make sure your cluster is up and ready by following the prerequisites. To access the login or head node of the HyperPod Slurm cluster from your development environment, follow the login instructions at Log in to your cluster in the Amazon SageMaker HyperPod workshop.

Alternatively, you can also use AWS Systems Manager and run a command like the following to start the session. You can find the cluster ID, instance group name, and instance ID on the Amazon SageMaker console.

aws ssm start-session --target sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id] --region region_name
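The target string follows the pattern shown in the command above. As a small convenience, this sketch assembles it from its three parts. build_ssm_target and the sample identifiers are hypothetical placeholders, not real resources, and the helper only formats a string without calling AWS.

```python
def build_ssm_target(cluster_id: str, instance_group_name: str, instance_id: str) -> str:
    """Build the --target value in the form used by the command above.

    Hypothetical helper; it only formats the string and does not call AWS.
    """
    for label, value in (
        ("cluster_id", cluster_id),
        ("instance_group_name", instance_group_name),
        ("instance_id", instance_id),
    ):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"sagemaker-cluster:{cluster_id}_{instance_group_name}-{instance_id}"


# Placeholder values for illustration only.
target = build_ssm_target("abc123", "login-group", "i-0123456789abcdef0")
print(f"aws ssm start-session --target {target} --region us-west-2")
```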

In the cluster’s login or head node, run the following commands to set up the environment. Run sudo su - ubuntu to run the remaining commands as the ubuntu user, unless you have a specific user ID to access the cluster and your POSIX user is created through a lifecycle script on the cluster. Refer to the multi-user setup for more details.

# create a virtual environment
python3 -m venv ${PWD}/venv
source venv/bin/activate

# clone the recipes repository and set up the environment
git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt

Create a squash file using Enroot to run the job on the cluster. The Enroot runtime offers GPU acceleration, rootless container support, and seamless integration with high performance computing (HPC) environments, making it ideal for running our workflows securely.

# create a squash file using Enroot
REGION=<region>
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}

After you’ve created the squash file, update the recipes_collection/config.yaml file with the absolute path to the squash file (created in the preceding step), and update the instance_type if needed. The final config file should have the following parameters:

...

cluster_type: slurm 
...

instance_type: p4d.24xlarge
...

container: /fsx/<path-to-smdistributed-modelparallel>.sqsh
...

Download the prepared dataset that you uploaded to S3 into the FSx for Lustre volume attached to the cluster. Run the following commands to download the files from Amazon S3:

aws s3 cp s3://{bucket_name}/{input_path}/train /fsx/ubuntu/deepseek/data/train --recursive
aws s3 cp s3://{bucket_name}/{input_path}/test /fsx/ubuntu/deepseek/data/test --recursive

Update the launcher script for fine-tuning the DeepSeek-R1 Distill Qwen 7B model. The launcher scripts serve as convenient wrappers for executing the training script (the main.py file), which streamlines the process of fine-tuning and parameter adjustment. For fine-tuning the DeepSeek-R1 Qwen 7B model, you can find the specific script at:

launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh

Before running the script, you need to modify the location of the training and validation files and update the HuggingFace model ID and, optionally, the access token for private models and datasets. The script should look like the following (update recipes.trainer.num_nodes if you’re using a multi-node cluster):

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # HuggingFace pretrained model name or path
HF_ACCESS_TOKEN="hf_xxxx" # Optional HuggingFace access token

TRAIN_DIR="/fsx/ubuntu/deepseek/data/train" # Location of training dataset
VAL_DIR="/fsx/ubuntu/deepseek/data/test" # Location of validation dataset

EXP_DIR="/fsx/ubuntu/deepseek/results" # Location to save experiment info including logging, checkpoints, etc.

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-deepseek-r1-distilled-qwen-7b-fine-tuning" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN"

You can view the recipe for this fine-tuning job at the following path, overriding any additional parameters as needed:

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning.yaml
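The launcher script passes parameters as dotted key=value overrides. To illustrate how a nested set of overrides corresponds to those command-line arguments, the following sketch flattens a dictionary into that form. to_overrides is a hypothetical helper, and it performs no quoting or escaping, so it is only suitable for simple values like the ones in the script.

```python
def to_overrides(config: dict, prefix: str = "") -> list[str]:
    """Flatten a nested dict into dotted key=value strings (hypothetical helper)."""
    args = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so nested keys become a dotted path.
            args.extend(to_overrides(value, dotted))
        else:
            args.append(f"{dotted}={value}")
    return args


overrides = {
    "recipes": {
        "trainer": {"num_nodes": 1},
        "model": {
            "data": {
                "train_dir": "/fsx/ubuntu/deepseek/data/train",
                "val_dir": "/fsx/ubuntu/deepseek/data/test",
            },
        },
    },
}
for arg in to_overrides(overrides):
    print(arg)
```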

Submit the job by running the launcher script:

bash launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh

You can monitor the job using Slurm commands such as squeue and scontrol show to view the status of the job and the corresponding logs. After the job is complete, the trained model will also be available in the results folder, as shown in the following code:

cd results
ls -R
.:
checkpoints  experiment

./checkpoints:
full

./checkpoints/full:
steps_50

./checkpoints/full/steps_50:
config.json  pytorch_model.bin

./experiment:
...
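When several checkpoints are saved, it helps to pick the most recent one before uploading. The sketch below scans a checkpoints/full directory for folders named steps_<N>, matching the layout in the listing above, and returns the one with the highest step. latest_checkpoint is a hypothetical helper, and the assumption that every checkpoint folder follows the steps_<N> pattern comes only from that listing.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

STEP_DIR = re.compile(r"^steps_(\d+)$")


def latest_checkpoint(full_dir: Path) -> Optional[Path]:
    """Return the steps_<N> subdirectory with the largest N, or None if absent."""
    candidates = []
    for child in full_dir.iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically so that steps_100 ranks above steps_50 and steps_9.
    return max(candidates, key=lambda item: item[0])[1]


# Demonstration on a temporary directory that imitates the layout above.
with tempfile.TemporaryDirectory() as tmp:
    full = Path(tmp) / "checkpoints" / "full"
    for step in (50, 100, 9):
        (full / f"steps_{step}").mkdir(parents=True)
    newest_name = latest_checkpoint(full).name
print(newest_name)
```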

Upload the fine-tuned model checkpoint to Amazon S3 for evaluating the model using the validation data:

aws s3 cp /fsx/<path_to_checkpoint> s3://{bucket_name}/{model_prefix}/qwen7b --recursive

Evaluate the fine-tuned model

To objectively evaluate your fine-tuned model, you can run an evaluation job on the validation portion of the dataset.

You can run a SageMaker training job and use ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text. The SageMaker training job will compute ROUGE metrics for both the base DeepSeek-R1 Distill Qwen 7B model and the fine-tuned one. You can access the code sample for ROUGE evaluation in the sagemaker-distributed-training-workshop on GitHub. Please refer to this notebook for details.
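To give a sense of what these metrics capture, here is a simplified ROUGE-N computation based on n-gram overlap. It is a sketch for intuition only: it uses whitespace tokenization and no stemming, so its numbers will not match the evaluation script or standard ROUGE packages. rouge_n and ngrams are hypothetical helpers.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 2) -> dict[str, float]:
    """Simplified ROUGE-N: precision, recall, and F1 over clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n(
    candidate="the patient shows signs of acute infection",
    reference="the patient shows clear signs of infection",
    n=2,
)
print(scores)
```

In this example, three of the six bigrams in each sentence are shared, so precision, recall, and F1 all come out to 0.5.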

Full the next steps:

Define the S3 path where the fine-tuned checkpoints are stored, the instance_type, and the image URI to use in the training job:

import sagemaker

trained_model = "<S3_PATH>"  # S3 path of the fine-tuned checkpoint
instance_type = "ml.p4d.24xlarge"

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.4",
    instance_type=instance_type,
    image_scope="training"
)
# Example: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4-gpu-py311

Create the ModelTrainer function to encapsulate the evaluation script and define the input data:

from sagemaker.modules.configs import Compute, InputData, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="evaluate_recipe.py",
)

# Define the compute
...

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    compute=compute_configs,
    ...
    hyperparameters={
        "model_id": model_id,  # Hugging Face model id
        "dataset_name": dataset_name
    }
)

# Pass the input data
train_input = InputData(
   channel_name="adapterdir",
   data_source=trained_model,  # S3 path where the fine-tuned checkpoint is stored
)

test_input = InputData(
   channel_name="testdata",
   data_source=test_dataset_s3_path, # S3 path where the test data is stored
)

# Check input channels configured
data = [train_input, test_input]

Submit the training job:

# Start the training job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

The following table shows the task output for the fine-tuned model and the base model.

Model	Rouge 1	Rouge 2	Rouge L	Rouge L Sum
Base	0.36362	0.08739	0.16345	0.3204
Fine-tuned	0.44232	0.13022	0.17769	0.38989
% Difference	21.64207	49.01703	8.7121	21.68871
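The last row of the table is the relative change from the base score to the fine-tuned score. The following sketch recomputes it from the rounded values in the table; because the inputs are rounded, the results agree with the table only to about two decimal places. percent_difference is a helper name chosen for this illustration.

```python
base = {"Rouge 1": 0.36362, "Rouge 2": 0.08739, "Rouge L": 0.16345, "Rouge L Sum": 0.3204}
fine_tuned = {"Rouge 1": 0.44232, "Rouge 2": 0.13022, "Rouge L": 0.17769, "Rouge L Sum": 0.38989}


def percent_difference(before: float, after: float) -> float:
    """Relative change from before to after, expressed as a percentage."""
    return (after - before) / before * 100


differences = {name: percent_difference(base[name], fine_tuned[name]) for name in base}
average = sum(differences.values()) / len(differences)

for name, value in differences.items():
    print(f"{name}: {value:.2f}%")
print(f"Average: {average:.2f}%")
```

The average of the four values is roughly 25%, which matches the figure quoted in the introduction.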

Our fine-tuned model demonstrates remarkable efficiency, achieving about a 22% improvement in the Rouge 1 and Rouge L Sum scores on the reasoning task after only one training epoch. The most significant gain appears in the Rouge 2 score, which measures bigram overlap, with about a 49% increase, indicating better alignment between generated and reference summaries.

Notably, preliminary experiments suggest these results could be further enhanced by extending the training duration. Increasing the number of epochs shows promising potential for additional performance gains while maintaining computational efficiency.

Clean up

To clean up your resources and avoid incurring additional charges, follow these steps:

Delete any unused SageMaker Studio resources
(Optional) Delete the SageMaker Studio domain
Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
If you created a HyperPod cluster, delete the cluster to stop incurring costs. If you created the networking stack from the HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

Conclusion

In the first post of this two-part DeepSeek-R1 series, we discussed how SageMaker HyperPod recipes provide a powerful yet accessible solution for organizations to scale their AI model training capabilities with large language models (LLMs), including DeepSeek. The architecture streamlines complex distributed training workflows through its intuitive recipe-based approach, reducing setup time from weeks to minutes.

We recommend starting your LLM customization journey by exploring our sample recipes in the Amazon SageMaker HyperPod documentation. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.

To begin using the SageMaker HyperPod recipes, visit the sagemaker-hyperpod-recipes repo on GitHub for comprehensive documentation and example implementations. Our team continues to expand the recipe ecosystem based on customer feedback and emerging ML trends, making sure that you have the tools needed for successful AI model training.

In our second post, we discuss how these recipes can further be used to fine-tune the DeepSeek-R1 671b model. Stay tuned!

About the Authors

Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in assisting customers with containerized applications and high-performance computing solutions.

Bruno Pistone is a Senior Worldwide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS cloud and Amazon Machine Learning stack. His expertise includes end-to-end machine learning, model customization, and generative AI. He enjoys spending time with friends, exploring new places, and traveling to new destinations.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past five years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker AI Training team. He holds a Masters in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business, and is a named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.
