Large language models (LLMs) deliver strong results on general tasks, but they often struggle with specialized work that requires understanding proprietary data, internal processes, or domain-specific terminology. Amazon Nova Forge addresses this by enabling you to build your own frontier models using Amazon Nova. You can start development from early model checkpoints, mix proprietary data with Amazon Nova-curated training data, and host custom models securely on AWS. A key capability is data mixing, which blends your training data with curated datasets. This helps the model absorb your domain while retaining broad reasoning, instruction-following, and language capabilities, which prevents the catastrophic forgetting that often undermines domain customization.
Successful customization requires careful hyperparameter tuning. Learning rate, data mixing ratio, checkpoint selection, and training strategies all interact in ways that can silently undermine a training run. If any of them is wrong, you trade one problem for another. This post covers the art (strategic trade-offs) and science (metric-driven decisions) of hyperparameter tuning on Amazon Nova Forge to help you avoid expensive failed training runs.
Fine-tuning for domain-specific tasks means improving performance in one area without degrading the model’s general capabilities, and getting that balance right is harder than it looks. This post walks through how to navigate that balance, from choosing the right customization technique for your data and task, to configuring the training parameters that most influence outcomes, like learning rate, batch size, and checkpointing. We also cover the common mistakes that lead to wasted training runs and how to catch them early, before they burn through compute.
By the end, you’ll know how to improve domain performance without degrading general capabilities and how to avoid the expensive failures that come from getting the balance wrong.
The hyperparameter tuning challenge
Achieving this balance is harder than it appears. Three fundamental challenges make hyperparameter tuning particularly difficult on domain-specialized models.
Challenge 1: Catastrophic forgetting
When you train a model on narrow domain data, the model can overwrite general capabilities it learned during pre-training. This phenomenon, called catastrophic forgetting, shows up as degraded performance on tasks outside your training domain. The model becomes highly specialized but loses instruction-following ability, reasoning capability, and broad knowledge. In production, this means a customer service model fine-tuned on your support tickets may no longer reason about ambiguous requests or maintain coherent multi-turn conversations.
This creates a stability-flexibility tradeoff. Ideally, the model is flexible enough to learn an organization’s domain but stable enough to retain general capabilities. Nova Forge addresses this through data mixing, which blends your training data with curated datasets during training, and checkpoint selection, which lets you choose how much existing alignment to preserve.
Challenge 2: Finding the right learning rate
The learning rate controls how much the model’s weights change in response to each batch of training examples. It’s the most sensitive hyperparameter across all customization techniques. A learning rate that’s too high causes the model to overshoot the optimal state, destabilize during training, or forget base capabilities rapidly. A learning rate that’s too low wastes compute on very slow convergence. The right value depends on your data distribution, mixing ratio, and training technique.
Nova Forge provides calibrated service defaults for each training technique that account for these interactions. When you use data mixing, the sensitivity increases further. Deviating from the default learning rate when mixing Nova data with your own data is the most common source of training instability, so these service defaults are the recommended starting point.
Challenge 3: Baseline performance constraints
Reinforcement fine-tuning (RFT) is a technique that improves model behavior by generating multiple candidate responses and scoring them against quality criteria. The model learns by comparing its own outputs and reinforcing the better ones. RFT works at its full capacity within a specific range of baseline task accuracy, measured by how often the model produces correct or high-quality responses before fine-tuning. If baseline accuracy is too low (the model rarely produces correct responses), there aren’t enough good examples for reward-guided exploration to learn from. If baseline accuracy is already very high, additional training yields diminishing returns and risks degrading existing performance. This means RFT can’t close large competence gaps where the model fundamentally lacks the knowledge or reasoning ability to attempt a task. It refines and strengthens behaviors the model can already partially demonstrate, rather than teaching entirely new capabilities from scratch.
The Nova Forge pipeline addresses both bounds. For low-baseline scenarios, run supervised fine-tuning (SFT) first to establish the foundational capabilities needed for effective reward-based learning. For high-baseline tasks, make sure that your reward function has discriminative power within the model’s quality range. If most responses already score highly, RFT has no meaningful signal to optimize against.
The Nova Forge customization pipeline
Understanding these challenges frames how the Amazon Nova Forge customization pipeline is designed to address them. Nova Forge provides three complementary customization techniques, each serving a distinct purpose in the model development lifecycle.
| Technique | What it does | When to use | Input data |
| Continued pre-training (CPT) | Expands foundation model (FM) knowledge through self-supervised learning on large quantities of unlabeled, domain-specific proprietary data. CPT teaches the model domain terminology and patterns from your text corpus. | You need the model to understand specialized vocabulary, industry concepts, or organizational knowledge that doesn’t exist in the base model. | Large volumes of unlabeled domain text. Nova Forge supports CPT with data mixing and three checkpoint options (pre-trained, mid-trained, and post-trained), each suited to different data scales and downstream requirements. |
| Supervised fine-tuning (SFT) | Customizes model behavior using a training dataset of input-output pairs specific to your target tasks. SFT teaches the model “given X, output Y” behavior through demonstrations. | You need the model to follow specific response formats, adopt particular tones, or perform structured tasks like classification or extraction. | 1,000–10,000 high-quality demonstrations per task. Quality, consistency, and diversity matter more than volume. Nova Forge supports SFT with data mixing using Amazon Nova-curated datasets, including reasoning-instruction-following categories that preserve general capabilities. |
| Reinforcement fine-tuning (RFT) | Steers model output toward preferred outcomes using reward signals. RFT optimizes the model within a behavioral neighborhood established by prior training for single-turn or multi-turn conversational tasks. | You have a clear reward function that can evaluate response quality and want to push performance beyond what SFT alone achieves. | Prompts and a reward function. Nova Forge supports bringing your own external reward environment through AWS Lambda, enabling custom verification logic for domain-specific quality assessment. |
When all three stages are used together (CPT, then SFT, then RFT), they produce the strongest results. However, each stage is optional, depending on your data availability, task type, and starting point. CPT is only needed when the base model lacks domain vocabulary or knowledge your task requires. SFT and RFT can be used independently or combined depending on what your task demands.
Figure 1: The Amazon Nova Forge customization pipeline. CPT teaches domain knowledge from unlabeled text, SFT teaches task-specific behavior from demonstrations, and RFT optimizes performance using reward signals. Each stage is optional, and the full pipeline (CPT, then SFT, then RFT) produces the strongest results when all three are applicable to your use case.
Amazon SageMaker AI offers different environments for customization: SageMaker Serverless provides a UI-driven experience with automatic compute provisioning, SageMaker AI training jobs (SMTJ) provide a fully managed experience without cluster management, and Amazon SageMaker HyperPod offers specialized environments for advanced distributed training scenarios.
Strategic decisions
With the customization pipeline in view, the next step is understanding the qualitative trade-offs that shape your configuration. These strategic decisions matter as much as any individual hyperparameter value: checkpoint selection, data mixing, and training mode.
Checkpoint selection (most impactful decision)
For CPT, checkpoint selection is more impactful than any hyperparameter. Amazon Nova Forge provides three checkpoint options, each suited to different data scales and downstream requirements.
- Pre-trained checkpoints are the most flexible and offer the fastest convergence. These checkpoints accept new patterns readily and work best for large-scale CPT with substantial token budgets exceeding 100 billion tokens. When using pre-trained checkpoints with large datasets, you can use a higher learning rate (such as 1e-4) to accelerate knowledge absorption. You then need to gradually reduce the learning rate back to roughly 1e-6 before running SFT, which lets the model “settle” into what it learned without overshooting. Be aware that pre-trained checkpoints have no instruction tuning. After CPT, you must run SFT to make the model useful for downstream tasks.
- Mid-trained checkpoints balance flexibility and alignment. They accept domain knowledge while retaining some instruction-following behavior. Use mid-trained checkpoints for medium-sized datasets where you want faster domain adaptation than post-trained checkpoints offer but more stability than pre-trained ones. Mid-trained checkpoints work well for Full Rank training, which updates every parameter in the model during fine-tuning, with large, structured datasets.
- Post-trained checkpoints are the most resistant to new patterns but preserve instruction-following and general capabilities. Use post-trained checkpoints for smaller-scale CPT where preserving alignment matters more than maximizing domain knowledge absorption. Post-trained checkpoints are the recommended starting point for LoRA (Low-Rank Adaptation), which freezes the original model weights and trains small adapter matrices on top, and for other parameter-efficient fine-tuning methods, as they maintain the model’s existing capabilities while allowing targeted adaptation. For small datasets or later-stage checkpoints, use conservative learning rate values from the service defaults.

Figure 2: Checkpoint selection for continued pre-training. Pre-trained checkpoints offer maximum flexibility for large datasets but require SFT afterward to restore instruction-following. Post-trained checkpoints preserve alignment and suit smaller datasets or parameter-efficient methods like LoRA.
Data mixing strategy
Without data mixing, training on narrow domain data can cause the model to become unstable, resulting in erratic training behavior (gradient instability or loss spikes) or a sudden degradation in performance.
When configuring data mixing, keep your customer data at around 50 percent of the total mix for most use cases. For SFT, always include the “reasoning-instruction-following” category in your Nova data mix. This single category significantly improves generic benchmark performance after fine-tuning. Skipping this category is a common cause of degraded reasoning performance in fine-tuned models.
Data mixing is very sensitive to learning rate. Deviating from the default learning rate when using data mixing causes instability. This is the most common mistake practitioners make. If you observe training instability with data mixing, the learning rate is the first suspect.
Finding the optimal mixing ratio requires experimentation. Keep your domain data constant and vary the Nova data proportion across multiple runs. Domain performance typically stays constant while general capabilities keep improving the more Nova data is mixed in. Place your highest-quality data toward the end of training for better convergence.
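To plan such a sweep, it can help to work out how much Nova data each target ratio implies while the domain data stays fixed. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and measuring the mix in samples (rather than tokens) is an assumption, not something the service prescribes.

```python
from typing import Dict, List


def nova_samples_needed(domain_samples: int, domain_fraction: float) -> int:
    """Nova-curated samples required so domain data makes up `domain_fraction` of the mix."""
    if not 0.0 < domain_fraction <= 1.0:
        raise ValueError("domain_fraction must be in (0, 1]")
    total = domain_samples / domain_fraction
    return round(total - domain_samples)


def mixing_sweep(domain_samples: int, domain_fractions: List[float]) -> List[Dict[str, float]]:
    """One planned run per ratio: domain data is held constant, Nova data varies."""
    return [
        {
            "domain_fraction": fraction,
            "domain_samples": domain_samples,
            "nova_samples": nova_samples_needed(domain_samples, fraction),
        }
        for fraction in domain_fractions
    ]


if __name__ == "__main__":
    # 50 percent is the suggested starting point; the other ratios are illustrative.
    for run in mixing_sweep(10_000, [0.75, 0.5, 0.25]):
        print(run)
```

Comparing domain and general benchmark scores across these runs shows where adding more Nova data stops paying off.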
Training mode: Low-Rank Adaptation (LoRA) vs Full Rank
Amazon Nova Forge supports two training modes that determine how model parameters are updated during training:
- LoRA updates only adapter layers, offering lower compute costs, faster iteration, and compatibility with on-demand inference. LoRA achieves near Full Rank performance for most tasks while being more forgiving of suboptimal hyperparameters. The default alpha scaling factor of 64 works for most tasks. Increase alpha if LoRA is under-adapting to your data or decrease it if LoRA is over-adapting and losing general capabilities. Use post-trained checkpoints as your starting point for LoRA training.
- Full Rank updates all model parameters, providing maximum adaptation capacity. Full Rank requires Amazon Bedrock Provisioned Throughput for deployment (On-Demand is only available for LoRA-based customization) and more compute during training. Use Full Rank when you have validated your pipeline and your deployment architecture justifies the additional cost. Mid-trained checkpoints work well for Full Rank training with large, structured datasets.
Start with LoRA to validate your pipeline, data quality, and reward function (for RFT). Graduate to Full Rank when you have confirmed the approach works and your production requirements justify it (for example, model performance or cost constraints).
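One way to build intuition for how alpha and rank interact is the scaling used in the standard LoRA formulation, where the adapter update is multiplied by alpha divided by rank. The sketch below assumes that convention. Whether Nova Forge applies exactly this scaling internally is an assumption, and `lora_scaling` is a hypothetical helper, not a service API.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Effective multiplier on the adapter update under the standard LoRA convention (alpha / rank)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


if __name__ == "__main__":
    # (rank, alpha) pairs chosen for illustration.
    for rank, alpha in [(32, 64), (64, 64), (32, 128)]:
        print(f"rank={rank:>3} alpha={alpha:>3} -> scaling={lora_scaling(alpha, rank):.1f}")
```

Under this convention, raising alpha at a fixed rank amplifies the adapter update, while raising rank at a fixed alpha adds capacity and lowers the multiplier, which is one reason the two knobs are not interchangeable.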
Recommended workflow
Applying these strategic decisions to your specific situation depends on what data and objectives you have. The following paths map your starting conditions to the right sequence of techniques.
If you have labeled demonstrations and a verifiable reward function (SFT then RFT):
- Start with SFT using LoRA to teach the target behavior and establish baseline competency.
- Enable data mixing with “reasoning-instruction-following” included to preserve the model’s ability to follow structured prompts and produce well-formatted outputs during domain adaptation.
- Use default learning rates without modification.
- Monitor validation loss to select the best SFT checkpoint.
- Graduate to RFT on the SFT checkpoint to optimize further through reward signals.
- Consider Full Rank training only after validating the approach with LoRA.
- Test thoroughly on both your domain task and general benchmarks before production deployment (see the Experiments and insights section for an example).
If you can define verifiable outcomes but can’t easily label responses at scale (RFT only):
- Evaluate base model performance on a representative sample of your task first.
- Proceed with RFT directly if the base model achieves more than roughly 5 percent positive reward.
- Fall back to SFT if reward scores are consistently near zero. The model needs baseline competency before reward-guided learning can take effect.
If the base model lacks domain vocabulary or knowledge your task requires, start with CPT:
- Run CPT to absorb domain knowledge from unlabeled text.
- Follow with SFT. Pre-trained checkpoints used for CPT have no instruction tuning, so SFT is required after CPT to make the model useful.
- Optionally follow with RFT to further optimize performance.
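The three paths above can be summarized as a small decision helper. This is an illustrative sketch only: the function and argument names are hypothetical, the 5 percent threshold comes from the guidance above, and real projects will weigh more factors than these flags capture.

```python
from typing import List, Optional


def recommend_pipeline(
    lacks_domain_knowledge: bool,
    has_labeled_demos: bool,
    has_reward_function: bool,
    positive_reward_rate: Optional[float] = None,
    min_positive_rate: float = 0.05,
) -> List[str]:
    """Map starting conditions to an ordered list of customization stages."""
    if lacks_domain_knowledge:
        # CPT must be followed by SFT; RFT is an optional final stage.
        stages = ["CPT", "SFT"]
        return stages + ["RFT"] if has_reward_function else stages
    if has_labeled_demos:
        return ["SFT", "RFT"] if has_reward_function else ["SFT"]
    if has_reward_function:
        if positive_reward_rate is None:
            raise ValueError("Evaluate the base model on a representative sample first")
        if positive_reward_rate > min_positive_rate:
            return ["RFT"]
        # Reward is near zero: establish baseline competency before RFT.
        return ["SFT", "RFT"]
    raise ValueError("Need unlabeled domain text, labeled demonstrations, or a reward function")


if __name__ == "__main__":
    print(recommend_pipeline(False, True, True))
    print(recommend_pipeline(False, False, True, positive_reward_rate=0.12))
    print(recommend_pipeline(True, False, False))
```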
Parameter configuration
With strategic decisions made, you can now optimize the specific hyperparameters that govern how each technique executes. This section provides guidance for each technique.
Learning rate configuration
Learning rate controls how quickly the model updates based on training signals. Service defaults represent tested configurations that work across diverse use cases.
- For CPT: Start at service defaults. For large datasets exceeding one trillion tokens, you can use a higher learning rate (such as 1e-4) to accelerate knowledge absorption, but you need a ramp-down stage to reduce the learning rate back to roughly 1e-6 for model stability before SFT. The constant_steps parameter controls how many steps the model trains at the peak learning rate before this ramp-down stage begins. Increase constant_steps for very large token runs where more steps at full learning rate help domain absorption. For smaller datasets or later-stage checkpoints, use the default (lower) learning rate from the start.
- For SFT: Stick to service defaults, especially with data mixing. The recommended learning rate is 1e-5 for LoRA and 5e-6 for Full Rank SFT. Deviating from the default learning rate when mixing Nova data causes instability. If you observe training instability with data mixing, the learning rate is the first suspect.
- For RFT: Start at service defaults. Adjust in small multiplier increments only if needed. If reward drops suddenly and doesn’t recover, the learning rate is likely too high. Even a small multiplier increase can drop performance below baseline.
Set warmup steps to roughly 15 percent of your total training steps. Warmup stabilizes initial training by gradually increasing the learning rate rather than starting at the full value.
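To make the warmup, constant, and ramp-down phases concrete, here is a minimal sketch of such a schedule. Only constant_steps and the 15 percent warmup guideline come from this post. The function names, the linear warmup, and the linear ramp-down shape are assumptions for illustration, and the actual schedule the service applies may differ.

```python
def warmup_steps_for(total_steps: int, warmup_fraction: float = 0.15) -> int:
    """Warmup length as a fraction of total training steps, rounded to the nearest step."""
    return round(total_steps * warmup_fraction)


def learning_rate_at(
    step: int,
    total_steps: int,
    peak_lr: float,
    min_lr: float,
    warmup_steps: int,
    constant_steps: int,
) -> float:
    """Learning rate at `step`: linear warmup, hold at peak, then linear ramp-down (assumed shape)."""
    if step < warmup_steps:
        # Warmup: grow linearly toward the peak value.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = warmup_steps + constant_steps
    if step < decay_start:
        # Constant phase: train at the peak learning rate.
        return peak_lr
    # Ramp-down: interpolate from peak_lr to min_lr over the remaining steps.
    decay_steps = max(total_steps - decay_start, 1)
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    total = 312
    warmup = warmup_steps_for(total)
    for step in (0, warmup, 150, total):
        lr = learning_rate_at(step, total, peak_lr=1e-4, min_lr=1e-6,
                              warmup_steps=warmup, constant_steps=100)
        print(f"step {step:>3}: lr = {lr:.2e}")
```

Plotting or printing the schedule before launching a long CPT run is a cheap way to confirm that the learning rate actually reaches the low value you expect before SFT begins.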
Batch size and training duration
Batch size (controlled by global_batch_size) is the batch parameter across all training techniques (CPT, SFT, RFT) and all environments (SageMaker Serverless, SMTJ, HyperPod). It defines the number of training samples processed per optimizer step. For CPT and SFT, this is straightforward: one sample equals one input-output pair (SFT) or one token sequence (CPT). RFT introduces an additional parameter, number_generation, that controls how many candidate responses are generated per prompt for reward scoring. This parameter doesn’t exist in CPT or SFT recipes, because those techniques train directly on the provided data rather than generating candidates. When number_generation is present, batch size semantics differ between environments. Getting this wrong leads to unexpected behavior.
- On SMTJ (RFT only): Batch size means prompts per step. Each prompt generates N candidate responses (controlled by number_generation). Total samples per step equals batch size multiplied by number of generations.
- On SageMaker HyperPod (RFT only): Batch size means total samples per step (prompts multiplied by generations). Translate carefully when moving configurations between environments.
For CPT, target 2-20 million tokens per step. Use 20 million for large token budgets and 2 million for smaller budgets. Calculate global batch size as the nearest power of two to tokens per step divided by max sequence length. For example, 4 million tokens per step with a sequence length of 4096 gives about 977 sequences, which rounds to a batch size of 1024. Smaller batch sizes produce noisier gradients, which can aid generalization and enable faster iteration. Larger batch sizes produce smoother gradients but may over-smooth domain-specific signals. Start with moderate batch sizes for stability.
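The CPT batch size rule can be written as a short helper. This is a sketch: `cpt_global_batch_size` is a hypothetical name, and interpreting "nearest power of two" as nearest on a log2 scale is an assumption.

```python
import math


def cpt_global_batch_size(tokens_per_step: int, max_seq_len: int) -> int:
    """Nearest power of two (on a log2 scale) to tokens_per_step / max_seq_len."""
    sequences = tokens_per_step / max_seq_len
    if sequences < 1:
        raise ValueError("tokens_per_step must be at least max_seq_len")
    return 2 ** round(math.log2(sequences))


if __name__ == "__main__":
    # 4 million tokens per step at a sequence length of 4096.
    print(cpt_global_batch_size(4_000_000, 4096))
```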
Match your max sequence length to your data distribution. Don’t exceed what your data needs. Smaller context lengths improve token throughput and reduce training costs. For CPT, process at most one epoch of your dataset. Avoid repeating data, as multiple epochs on limited CPT data lead to overfitting and loss of general capabilities. Monitor validation loss to track progress. For SFT, Full Rank training typically needs fewer epochs than LoRA. LoRA training can tolerate slightly more epochs. Monitor validation loss to detect overfitting and select the best checkpoint.
RFT-specific parameters
RFT introduces additional parameters not present in CPT or SFT.
- Number of generations controls how many candidate responses the model generates per prompt for the reward function to compare. Fewer candidates mean faster training but less signal diversity. Too many candidates add noise without improving signal and nearly double training time. Moderate values hit the best accuracy-to-time ratio. Increase if your task has high variance in response quality. Decrease for rapid reward function iteration during development.
- KL-Divergence Loss Coefficient constrains how far the model’s policy can drift from its original behavior. This parameter is available on SMTJ only. A low coefficient lets the model explore freely but risks finding shortcuts that game the reward function. A high coefficient prevents meaningful learning by pulling the model back to its starting point. Increase it if KL divergence spikes during training, to balance real learning against behavioral drift.
- Reasoning Effort controls how much chain-of-thought reasoning the model performs before answering. High reasoning effort produces the best accuracy but increases latency and serving cost. Low reasoning effort offers faster inference with modest accuracy trade-offs. Use high for maximum accuracy during validation, then consider lowering it for latency-sensitive production deployments.
- Lambda Concurrency Limit (SMTJ only) controls parallel AWS Lambda functions for reward evaluation. Increase significantly for fast reward functions to avoid evaluation throughput becoming a bottleneck.
Remember that batch size semantics differ between platforms. On SMTJ, global_batch_size means prompts per step where each generates N candidates. On SageMaker HyperPod, global_batch_size means total samples (prompts multiplied by generations). Translate carefully between environments.
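A small conversion helper makes this translation explicit when you move an RFT recipe between environments. The sketch below only encodes the semantics described above. The function names are hypothetical and are not part of any SageMaker SDK.

```python
def smtj_to_hyperpod_batch_size(prompts_per_step: int, number_generation: int) -> int:
    """SMTJ counts prompts per step; HyperPod counts total samples (prompts x generations)."""
    return prompts_per_step * number_generation


def hyperpod_to_smtj_batch_size(total_samples_per_step: int, number_generation: int) -> int:
    """Inverse conversion; the total must divide evenly into whole prompts."""
    if total_samples_per_step % number_generation != 0:
        raise ValueError("total samples per step must be a multiple of number_generation")
    return total_samples_per_step // number_generation


if __name__ == "__main__":
    # 32 prompts per step with 8 candidates each on SMTJ.
    hyperpod_batch = smtj_to_hyperpod_batch_size(32, 8)
    print(hyperpod_batch, hyperpod_to_smtj_batch_size(hyperpod_batch, 8))
```

Copying a global_batch_size of 32 from SMTJ to HyperPod without conversion would, in this example, shrink the effective number of prompts per step from 32 to 4.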
Regularization parameters
Regularization parameters help prevent overfitting, especially on smaller datasets.
- Weight decay defaults to zero. Increase modestly if you observe overfitting on small datasets. Weight decay applies L2 regularization to constrain parameter magnitudes.
- Dropout (hidden and attention) defaults to zero. Increase hidden dropout modestly for smaller datasets to reduce overfitting. Increase attention dropout cautiously, as high values can hurt complex reasoning capabilities.
- Clip ratio, age tolerance, and refit frequency are advanced SageMaker HyperPod parameters. Clip ratio limits how much the policy can change in a single training step. Age tolerance determines how long training data stays valid before being considered too stale. Refit frequency controls how often the model collects fresh training data. Defaults work for most use cases. Only adjust these advanced settings if you understand the specific stability issue you’re addressing.
Experiments and insights
With these hyperparameters in mind, we ran a series of HPO experiments using Amazon Nova 2.0 across public benchmarks including CoCoHD, MedReason, and LLaVA-CoT. The following table summarizes the experimental configurations and key findings for each parameter sweep.
| Dataset | LoRA Rank | Alpha | GBS | LR | Max Steps | Warmup Steps | Base Target Perf. | SFT Target Perf. | Ranking | Relative Perf. Diff |
| MedReason | 32 | 64 | 32 | 1.00E-05 | 312 | 47 | 57.38% | 63.54% | 2 | 10.75% ↑ |
| MedReason | 64 | 64 | 32 | 1.00E-05 | 312 | 47 | 57.38% | 63.78% | 1 | 11.16% ↑ |
| MedReason | 32 | 64 | 32 | 5.00E-06 | 312 | 47 | 57.38% | 63.33% | | |
| MedReason | 32 | 64 | 32 | 1.00E-05 | 624 | 94 | 57.38% | 61.42% | | |
| LLaVA-CoT | 64 | 64 | 32 | 1.00E-05 | 312 | 47 | 16.22% | 68.47% | 1 | 322.13% ↑ |
| LLaVA-CoT | 32 | 128 | 32 | 1.00E-05 | 312 | 47 | 16.22% | 65.77% | 2 | 305.49% ↑ |
We ran LoRA SFT on Amazon Nova 2 Lite using Nova Forge with rank 32, alpha 64, batch size 32, 15 percent warmup, and 1 epoch, sweeping only the learning rate to isolate its effect on target accuracy. The service default of 1e-5 produced the best result at 63.54 percent, a 10.75 percent relative lift over the v4 base. Dropping the learning rate to 5e-6 lowered target performance (63.33 percent) without meaningfully protecting general capabilities, as MMLU, IFEval, and GPQA scores were within noise of the 1e-5 run. Doubling to 2 epochs at the same learning rate dropped accuracy to 61.42 percent, confirming that overtraining on narrow domain data erodes both domain and general performance.
We varied LoRA rank (32 vs 64) and alpha (64 vs 128) on a multimodal reasoning task where the base model starts at only 16.22 percent accuracy. The best configuration, rank 64 with alpha 64, lifted accuracy to 68.47 percent, a 322 percent relative improvement over the base. Doubling alpha to 128 at rank 32 produced a similar target gain at 65.77 percent, but at a meaningfully higher general-capability regression cost. For tasks where the baseline accuracy is low, increasing rank is a higher-leverage adjustment than increasing alpha. Alpha should be increased only when LoRA is under-adapting, and decreased if the model is losing general capabilities.
No single hyperparameter configuration works best for all use cases. These recommended defaults are strong starting points, not guarantees of optimal performance.
Common pitfalls and how to avoid them
The following table summarizes the most common mistakes practitioners should avoid when tuning Amazon Nova Forge models.
| Pitfall | Symptom | Solution |
| Skipping SFT before RFT | RFT produces no improvement or degrades performance | Run SFT first to get the model into the right behavioral neighborhood before RFT optimization. |
| Deviating from default LR with data mixing | Training instability, loss spikes, capability collapse | Stick to service defaults when using data mixing. This is the most common mistake. |
| Poor reward function quality | Accuracy decreases despite training, or model games the metric | Refine your reward function before changing any training parameter. Validate with at least two independent judges. |
| Multiple epochs on limited CPT data | Overfitting, loss of general capabilities, memorization | Process at most one epoch of your CPT dataset. Monitor validation loss to detect overfitting early. |
| Mismatched reasoning settings | Inference behavior doesn’t match training behavior | Match reasoning_enabled between training and inference. If you train with reasoning, infer with reasoning. |
When tuning models with Nova Forge, invest in your reward function before anything else. A poor reward function will decrease accuracy regardless of other hyperparameter choices, while a refined one produces consistent gains on identical infrastructure. Make sure that your reward function has discriminative power within the model’s quality range, because if everything scores high, RFT has no gradient to optimize.
The same validation discipline applies to LLM-as-judge selection. Your judge model must reliably distinguish quality differences within the model’s output range. Validate judge agreement with at least two independent evaluators before committing to a training run.
Be aware that training environment stability mechanisms differ between platforms. SMTJ applies a continuous KL penalty as a soft constraint, while SageMaker HyperPod uses gradient clipping as a hard cap per step. Both achieve comparable accuracy, but they require different tuning intuitions. Don’t assume parameters transfer directly between environments.
Throughout all of this, prioritize data quality over quantity. Filtering aggressively and making sure training examples accurately represent the target behavior will outperform simply scaling up low-quality data.
Measuring success
When you apply proper hyperparameter tuning, the results can be substantial. The AWS China Applied Science team demonstrated this in their evaluation of Amazon Nova Forge, achieving a 17 percent F1 score improvement on a complex Voice of Customer classification task while maintaining near-baseline MMLU scores.
Key metrics to monitor
Training loss should decrease steadily without sudden spikes. Spikes often indicate learning rate issues or data quality problems.
Validation loss reveals overfitting. If validation loss increases while training loss decreases, you’re overfitting. Reduce epochs, increase regularization, or add more diverse data.
KL divergence (for RFT) shows how far the policy has drifted. Sudden spikes suggest the model is making large, potentially unstable updates. Increase the KL loss coefficient if this occurs.
Reward metrics (for RFT) should improve steadily. If reward improves rapidly then plateaus or drops, the model may be gaming the reward function. Revisit your reward design.
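Two of these checks are easy to automate over logged metrics. The following sketch flags training loss spikes and the overfitting pattern described above. The function names, the window size, the spike factor, and the patience value are illustrative assumptions that you would tune to your own logging cadence.

```python
from typing import List


def loss_spikes(losses: List[float], window: int = 5, factor: float = 1.5) -> List[int]:
    """Indices where the loss exceeds `factor` times the mean of the preceding `window` values."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = sum(losses[i - window:i]) / window
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes


def is_overfitting(train_loss: List[float], val_loss: List[float], patience: int = 3) -> bool:
    """True if validation loss rose while training loss fell at each of the last `patience` evaluations."""
    if len(train_loss) != len(val_loss):
        raise ValueError("train_loss and val_loss must have the same length")
    if len(train_loss) <= patience:
        return False
    recent = range(len(train_loss) - patience, len(train_loss))
    return all(
        val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]
        for i in recent
    )


if __name__ == "__main__":
    train = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0]
    val = [2.1, 1.9, 1.8, 1.85, 1.9, 2.0]
    print("overfitting:", is_overfitting(train, val))
    print("spikes at:", loss_spikes([1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0]))
```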
Conclusion
Optimizing model customization with Amazon Nova Forge requires balancing art and science. The art involves understanding trade-offs: checkpoint selection, data mixing strategy, and training mode decisions shape your outcome more than any single hyperparameter. The science involves systematic tuning: learning rate, batch size, and technique-specific parameters require careful configuration based on your data and objectives.
Data and reward quality matter more than any hyperparameter. Before tuning training parameters, optimize your data pipeline and reward function. Start with service defaults, especially for learning rate and data mixing, as these defaults exist because they work across a wide range of use cases.
For most production scenarios, the strongest pipeline is SFT followed by RFT. RFT refines existing capability but can’t recover from a low baseline, so supervised fine-tuning needs to establish solid performance first. Data mixing should be treated as essential for production workloads, not optional. It prevents catastrophic forgetting and provides the optimization stability needed for reliable results.
When working with continued pre-training, checkpoint selection is the most impactful decision you’ll make. Match checkpoint flexibility to your data scale: earlier checkpoints for large-scale domain adaptation, later checkpoints for smaller datasets where preserving instruction-following behavior matters.
To get started with Amazon Nova Forge, explore the Amazon Nova documentation and the SageMaker HyperPod recipes repository on GitHub. For hands-on examples of data mixing in action, see the Nova Forge data mixing blog post. For a deeper dive into RFT with Nova Forge, see the Reinforcement fine-tuning for Amazon Nova: Teaching AI through feedback blog post.
Acknowledgements
The authors would like to thank Zheng Du, Bharathan Balaji, Anjie Fang, and Mengnong Xu from the AWS AGI Customization Science team for their technical guidance.