Sunday, April 19, 2026

In deep learning, classification models must not only make predictions but also express confidence. That is where the Softmax activation function comes into play. Softmax takes the raw, unbounded scores produced by a neural network and transforms them into well-defined probability distributions, allowing each output to be interpreted as the likelihood of a particular class.

This property makes Softmax the basis for multiclass classification tasks, ranging from image recognition to language modeling. This article provides an intuitive understanding of how Softmax works and why its implementation details matter more than they might appear to. Please check the full code here.
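Concretely, softmax maps logits z to exp(z_i) / Σ_j exp(z_j). A minimal illustration (the tensor values here are made up for demonstration):

```python
import torch

# Illustrative raw scores (logits) for one sample over three classes
logits = torch.tensor([2.0, 1.0, 0.1])

# Softmax: exponentiate each score, then normalize by the sum
probs = torch.exp(logits) / torch.exp(logits).sum()

print(probs)        # the largest logit receives the largest probability
print(probs.sum())  # the probabilities sum to 1
```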

Implementing Naive Softmax

import torch

def softmax_naive(logits):
    exp_logits = torch.exp(logits)
    return exp_logits / exp_logits.sum(dim=1, keepdim=True)

This function implements Softmax activation in its simplest form: exponentiate each logit and normalize by the sum of the exponentials over the classes, producing a probability distribution for each input sample.

This implementation is mathematically correct and readable, but numerically unstable. A large positive logit can cause overflow, and a large negative logit can underflow to zero. As a result, this version should be avoided in real training pipelines.

Sample logits and target labels

This example defines a small batch containing three samples and three classes, covering both well-behaved and failing cases. The first and third samples contain reasonable logit values and behave as expected during the softmax calculation. The second sample deliberately contains extreme values (1000 and -1000) to demonstrate numerical instability. This is where a naive Softmax implementation breaks down.

The target tensor specifies the correct class index for each sample; it is used to compute the classification loss and to observe how the instability propagates during backpropagation.

# Batch of 3 samples, 3 classes
logits = torch.tensor([
    [2.0, 1.0, 0.1],      
    [1000.0, 1.0, -1000.0],  
    [3.0, 2.0, 1.0]
], requires_grad=True)

targets = torch.tensor([0, 2, 1])

Forward pass: Softmax output and failure example

During the forward pass, the naive Softmax function is applied to the logits to generate class probabilities. For normal logit values (the first and third samples), the output is a valid probability distribution whose values lie between 0 and 1 and sum to 1.

However, in the second sample, exp(1000) overflows to infinity while exp(-1000) underflows to zero. As a result, an invalid operation (infinity divided by infinity) occurs during normalization, producing NaN values and zero probabilities. Once NaN appears at this stage, it contaminates all subsequent calculations and makes the model unusable for training.

# Ahead move
probs = softmax_naive(logits)

print("Softmax probabilities:")
print(probs)

Target probabilities and loss breakdown

Here, we extract the predicted probability corresponding to the true class of each sample. The first and third samples return valid probabilities, but the second sample's target probability is 0.0, caused by numerical underflow in the softmax calculation. Computing the loss -log(p) then takes the logarithm of 0.0, which yields +∞.

This makes the overall loss infinite, a serious failure during training. When the loss becomes infinite, the gradient calculation becomes unstable, and the resulting NaNs effectively stop learning during backpropagation.

# Extract target probabilities
target_probs = probs[torch.arange(len(targets)), targets]

print("\nTarget probabilities:")
print(target_probs)

# Compute loss
loss = -torch.log(target_probs).mean()
print("\nLoss:", loss)

Backpropagation: gradient destruction

Once backpropagation is triggered, the consequences of the infinite loss become immediately visible. The gradients of the first and third samples remain finite because their softmax outputs are well-behaved. However, in the second sample, the log(0) operation in the loss produces NaN gradients across all classes.

These NaNs propagate backward through the network, contaminating weight updates and effectively breaking training. This is why numerical instability at the softmax-loss boundary is so dangerous: once NaN appears, recovery is nearly impossible without restarting training.

loss.backward()

print("\nGradients:")
print(logits.grad)

Numerical instability and its consequences

Separating softmax and cross-entropy introduces serious numerical stability risks due to exponential overflow and underflow. If the logits are large, the probabilities can go to infinity or zero, resulting in log(0) and NaN gradients that quickly break training. At production scale, this is a certainty rather than a rare edge case. Without a stable fused implementation, large-scale multi-GPU training runs will fail unexpectedly.

The core of the numerical problem arises from the fact that computers cannot represent arbitrarily large or arbitrarily small numbers. Floating-point formats such as FP32 have strict limits on the magnitudes they can store. When Softmax computes exp(x), large positive values grow past the maximum representable number to infinity, while large negative values shrink all the way to zero. Once a value reaches infinity or zero, subsequent operations such as division and logarithm fail and produce invalid results.
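These limits are easy to observe directly. The cutoffs below are properties of FP32, whose largest finite value is about 3.4e38, so exp(x) overflows once x exceeds roughly 88.7:

```python
import torch

# FP32 tops out near 3.4e38, so exp(x) overflows to inf for x above ~88.7
print(torch.exp(torch.tensor(88.0)))     # still finite, about 1.65e38
print(torch.exp(torch.tensor(89.0)))     # inf: past the FP32 maximum
print(torch.exp(torch.tensor(-1000.0)))  # 0.0: underflow to zero

# Once inf or 0 appears, the downstream operations break
inf = torch.exp(torch.tensor(1000.0))
print(inf / inf)                     # nan: invalid division during normalization
print(torch.log(torch.tensor(0.0)))  # -inf: log(0) in the loss
```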

Implementing a stable cross-entropy loss using LogSumExp

This implementation computes the cross-entropy loss directly from the raw logits without explicitly computing the softmax probabilities. To maintain numerical stability, the logits are first shifted by subtracting the maximum value for each sample, ensuring that the exponential function stays within a safe range.

The normalization term is then calculated using the LogSumExp trick, after which the original (unshifted) target logit is subtracted to obtain the correct loss. This approach avoids overflow, underflow, and NaN gradients, and mirrors the way cross-entropy is implemented in production-grade deep learning frameworks.

def stable_cross_entropy(logits, targets):
    # Find max logit per sample
    max_logits, _ = torch.max(logits, dim=1, keepdim=True)

    # Shift logits for numerical stability
    shifted_logits = logits - max_logits

    # Compute LogSumExp
    log_sum_exp = torch.log(torch.sum(torch.exp(shifted_logits), dim=1)) + max_logits.squeeze(1)

    # Compute loss using ORIGINAL logits
    loss = log_sum_exp - logits[torch.arange(len(targets)), targets]

    return loss.mean()
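As a sanity check, this sketch compares the function above against PyTorch's built-in `F.cross_entropy`, which uses the same fused log-softmax formulation internally (the sample values below are illustrative only):

```python
import torch
import torch.nn.functional as F

def stable_cross_entropy(logits, targets):
    # Shift by the per-sample max, apply LogSumExp, subtract the target logit
    max_logits, _ = torch.max(logits, dim=1, keepdim=True)
    shifted_logits = logits - max_logits
    log_sum_exp = torch.log(torch.sum(torch.exp(shifted_logits), dim=1)) + max_logits.squeeze(1)
    return (log_sum_exp - logits[torch.arange(len(targets)), targets]).mean()

logits = torch.tensor([[2.0, 1.0, 0.1], [3.0, 2.0, 1.0]])
targets = torch.tensor([0, 1])

# Both paths should agree on well-behaved inputs
print(stable_cross_entropy(logits, targets))
print(F.cross_entropy(logits, targets))
```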

Stable forward and backward passes

Running the stable cross-entropy implementation on the same extreme logits produces a finite loss and well-defined gradients. The LogSumExp formulation keeps all intermediate calculations within a safe numerical range, even when one sample contains very large values (1000 and -1000). As a result, backpropagation completes successfully without producing NaNs, and every class receives a meaningful gradient signal.

This confirms that the instability seen earlier was not caused by the data itself, but by naively separating softmax and cross-entropy. The problem is completely solved by using a numerically stable fused loss formulation.

logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [1000.0, 1.0, -1000.0],
    [3.0, 2.0, 1.0]
], requires_grad=True)

targets = torch.tensor([0, 2, 1])

loss = stable_cross_entropy(logits, targets)
print("Stable loss:", loss)

loss.backward()
print("\nGradients:")
print(logits.grad)
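Putting the two implementations side by side on one extreme sample makes the contrast explicit. Both functions from earlier are repeated here so the snippet runs on its own:

```python
import torch

def softmax_naive(logits):
    # Direct exponentiation: overflows/underflows on extreme logits
    exp_logits = torch.exp(logits)
    return exp_logits / exp_logits.sum(dim=1, keepdim=True)

def stable_cross_entropy(logits, targets):
    # LogSumExp with a max shift: all intermediates stay in a safe range
    max_logits, _ = torch.max(logits, dim=1, keepdim=True)
    shifted_logits = logits - max_logits
    log_sum_exp = torch.log(torch.sum(torch.exp(shifted_logits), dim=1)) + max_logits.squeeze(1)
    return (log_sum_exp - logits[torch.arange(len(targets)), targets]).mean()

logits = torch.tensor([[1000.0, 1.0, -1000.0]])
targets = torch.tensor([0])

# Naive path: exp(1000) -> inf, inf/inf -> nan, so the loss is nan
naive_loss = -torch.log(softmax_naive(logits)[torch.arange(1), targets]).mean()

# Stable path: finite loss from the same data
stable_loss = stable_cross_entropy(logits, targets)

print("Naive loss: ", naive_loss)
print("Stable loss:", stable_loss)
```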

Conclusion

In practice, the gap between the formula and real-world code causes many training failures. Although softmax and cross-entropy are mathematically well-defined, a naive implementation ignores the finite-precision limits of IEEE 754 hardware, and underflow and overflow become inevitable.

The key fix is simple but essential: shift the logits before exponentiation and work in the logarithmic domain as much as possible. Most importantly, training rarely requires explicit probabilities; stable log-probabilities are sufficient and far safer. If your loss suddenly turns to NaN in production, it often means that Softmax is being computed manually where it shouldn't be.
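In practice this means preferring the framework's fused primitives. A minimal sketch, assuming PyTorch: `F.cross_entropy` and `F.log_softmax` both apply the LogSumExp shift internally, so they stay finite even on the extreme logits from the earlier example:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 1.0, -1000.0]])
targets = torch.tensor([2])

# Fused loss: finite even though exp(1000) alone would overflow
print(F.cross_entropy(logits, targets))

# Stable log-probabilities, without ever materializing the probabilities
print(F.log_softmax(logits, dim=1))
```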


I am a Civil Engineering graduate from Jamia Millia Islamia, New Delhi (2022) and have a strong interest in data science, especially neural networks and their applications in various fields.
