The normal distribution is probably the most widely used distribution, and sadly much real-world data is not normal. When confronted with extremely skewed data, it is tempting to use a log transformation to normalize the distribution and stabilize the variance. I recently worked on a project using Epoch AI data to analyze the energy consumption of training AI models [1]. Since there is no official data on energy usage for each model, the energy for each model was estimated by multiplying the hardware's power draw by the training time. The new variable, energy (kWh), was heavily right-skewed, with extreme, far-reaching outliers (Fig. 1).
To deal with this skewness and heteroscedasticity, my first instinct was to apply a log transformation to the energy variable. The distribution of log(energy) looks much more normal (Fig. 2), and a Shapiro-Wilk test did not reject normality (p ≈ 0.5).
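As a quick sketch of that check in R (assuming, as in the models below, a data frame called df with the energy stored in Energy_kWh):

# Compare the raw and log-transformed distributions, then test normality of the log
hist(df$Energy_kWh, breaks = 50, main = "Energy (kWh)")
hist(log(df$Energy_kWh), breaks = 50, main = "log of Energy (kWh)")
shapiro.test(log(df$Energy_kWh))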

Modeling Dilemma: Log Transformation or Log Link
The visualization looked good, but when I moved on to modeling, I faced a dilemma. Should I model the log-transformed response variable (log(Y) ~ X), or should I model the original response variable with a log link function (Y ~ X, link = "log")? I also considered two distributions, Gaussian (normal) and Gamma, and combined each with both log approaches. This gave the four candidate models below, all fitted with R's generalized linear model function glm().
all_gaussian_log_link <- glm(Energy_kWh ~ Parameters +
                               Training_compute_FLOP +
                               Training_dataset_size +
                               Training_time_hour +
                               Hardware_quantity +
                               Training_hardware,
                             family = gaussian(link = "log"), data = df)

all_gaussian_log_transform <- glm(log(Energy_kWh) ~ Parameters +
                                    Training_compute_FLOP +
                                    Training_dataset_size +
                                    Training_time_hour +
                                    Hardware_quantity +
                                    Training_hardware,
                                  data = df)

all_gamma_log_link <- glm(Energy_kWh ~ Parameters +
                            Training_compute_FLOP +
                            Training_dataset_size +
                            Training_time_hour +
                            Hardware_quantity +
                            Training_hardware + 0,
                          family = Gamma(link = "log"), data = df)

all_gamma_log_transform <- glm(log(Energy_kWh) ~ Parameters +
                                 Training_compute_FLOP +
                                 Training_dataset_size +
                                 Training_time_hour +
                                 Hardware_quantity +
                                 Training_hardware + 0,
                               family = Gamma(), data = df)
Model comparison: AIC and diagnostic plots
The four models were compared using the Akaike Information Criterion (AIC), an estimator of prediction error. Typically, the lower the AIC, the better the model.
AIC(all_gaussian_log_link, all_gaussian_log_transform, all_gamma_log_link, all_gamma_log_transform)
df AIC
all_gaussian_log_link 25 2005.8263
all_gaussian_log_transform 25 311.5963
all_gamma_log_link 25 1780.8524
all_gamma_log_transform 25 352.5450
Of the four models, the two that log-transform the response have much lower AIC values than the two that use a log link. Because the differences in AIC between the log-transformed and log-link models were substantial (311 and 352 vs. 1780 and 2005), I also examined diagnostic plots to further verify that the log-transformed models fit better.
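The diagnostics themselves are the standard plots for a fitted glm object; a minimal sketch, shown here for one of the four models:

# Residuals vs. fitted, Q-Q, scale-location, and residuals vs. leverage for one model
par(mfrow = c(2, 2))
plot(all_gamma_log_transform)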




Based on the AIC values and the diagnostic plots, I decided to move forward with the log-transformed Gamma model: it had the second-lowest AIC, and its residual and fit plots looked better than those of the log-transformed Gaussian model.
We started to look into which explanatory variables have been helpful and which interactions have been essential. The ultimate mannequin I selected is:
glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity +
    Training_hardware + 0, family = Gamma(), data = df)
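Getting to this model involved comparing nested fits; the sketch below shows the kind of comparison involved (not my exact selection procedure), with the reduced model stored as glm3, the name it carries in the plotting code further down:

# Reduced log-transformed Gamma model with the interaction term
glm3 <- glm(log(Energy_kWh) ~ Training_time_hour * Hardware_quantity +
              Training_hardware + 0,
            family = Gamma(), data = df)

# Compare it to the full model by AIC, and check each remaining term's contribution
AIC(all_gamma_log_transform, glm3)
drop1(glm3, test = "F")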
Interpreting the coefficients
However, when I began interpreting the model's coefficients, something felt off. Since only the response variable was log-transformed, the effects of the predictors are multiplicative, and the coefficients have to be exponentiated to read them on the original scale: a one-unit increase in x multiplies the outcome by exp(β) [2].
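As a tiny numerical illustration with a made-up coefficient:

beta <- 0.05
exp(beta)               # ~1.051: each one-unit increase in x multiplies y by about 1.05
(exp(beta) - 1) * 100   # ~5.1% increase in y per unit of x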
Looking at the results table for the model below: Training_time_hour, Hardware_quantity, and their interaction term Training_time_hour:Hardware_quantity are continuous, so their coefficients represent slopes. On the other hand, because I specified + 0 in the model formula, every level of the categorical Training_hardware acts as an intercept; in other words, each hardware type serves as the intercept β₀ when its dummy variable is active.
> glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity +
    Training_hardware + 0, family = Gamma(), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour -1.587e-05 3.112e-06 -5.098 5.76e-06 ***
Hardware_quantity -5.121e-06 1.564e-06 -3.275 0.00196 **
Training_hardwareGoogle TPU v2 1.396e-01 2.297e-02 6.079 1.90e-07 ***
Training_hardwareGoogle TPU v3 1.106e-01 7.048e-03 15.696 < 2e-16 ***
Training_hardwareGoogle TPU v4 9.957e-02 7.939e-03 12.542 < 2e-16 ***
Training_hardwareHuawei Ascend 910 1.112e-01 1.862e-02 5.969 2.79e-07 ***
Training_hardwareNVIDIA A100 1.077e-01 6.993e-03 15.409 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.020e-01 1.072e-02 9.515 1.26e-12 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.014e-01 1.018e-02 9.958 2.90e-13 ***
Training_hardwareNVIDIA GeForce GTX 285 3.202e-01 7.491e-02 4.275 9.03e-05 ***
Training_hardwareNVIDIA GeForce GTX TITAN X 1.601e-01 2.630e-02 6.088 1.84e-07 ***
Training_hardwareNVIDIA GTX Titan Black 1.498e-01 3.328e-02 4.501 4.31e-05 ***
Training_hardwareNVIDIA H100 SXM5 80GB 9.736e-02 9.840e-03 9.894 3.59e-13 ***
Training_hardwareNVIDIA P100 1.604e-01 1.922e-02 8.342 6.73e-11 ***
Training_hardwareNVIDIA Quadro P600 1.714e-01 3.756e-02 4.562 3.52e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 1.538e-01 3.263e-02 4.714 2.12e-05 ***
Training_hardwareNVIDIA Quadro RTX 5000 1.819e-01 4.021e-02 4.524 3.99e-05 ***
Training_hardwareNVIDIA Tesla K80 1.125e-01 1.608e-02 6.993 7.54e-09 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 1.072e-01 1.353e-02 7.922 2.89e-10 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 9.444e-02 2.030e-02 4.653 2.60e-05 ***
Training_hardwareNVIDIA V100 1.420e-01 1.201e-02 11.822 8.01e-16 ***
Training_time_hour:Hardware_quantity 2.296e-09 9.372e-10 2.450 0.01799 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.05497984)
    Null deviance: NaN on 70 degrees of freedom
Residual deviance: 3.0043 on 48 degrees of freedom
AIC: 345.39
When I converted these slopes to rates of change in the response, the effect of each continuous variable was practically zero and even slightly negative.
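That conversion is just an exponentiation of the estimates; a sketch, with the fitted log-transformed Gamma model stored as glm3 (the name used in the plotting code further down):

# Multiplicative effect and percent change per unit of each continuous predictor
slopes <- coef(glm3)[c("Training_time_hour", "Hardware_quantity")]
exp(slopes)              # both essentially 1, i.e. no multiplicative effect
(exp(slopes) - 1) * 100  # e.g. exp(-1.587e-05) - 1 is roughly -0.0016% per extra hour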
All the intercepts back-transformed to roughly 1 kWh on the original scale. These results were meaningless: at least one slope should grow with such huge energy consumption. I suspected that a log-link model with the same predictors might behave differently, so I fit the model again.
glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity +
    Training_hardware + 0, family = Gamma(link = "log"), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour 1.818e-03 1.640e-04 11.088 7.74e-15 ***
Hardware_quantity 7.373e-04 1.008e-04 7.315 2.42e-09 ***
Training_hardwareGoogle TPU v2 7.136e+00 7.379e-01 9.670 7.51e-13 ***
Training_hardwareGoogle TPU v3 1.004e+01 3.156e-01 31.808 < 2e-16 ***
Training_hardwareGoogle TPU v4 1.014e+01 4.220e-01 24.035 < 2e-16 ***
Training_hardwareHuawei Ascend 910 9.231e+00 1.108e+00 8.331 6.98e-11 ***
Training_hardwareNVIDIA A100 1.028e+01 3.301e-01 31.144 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.057e+01 5.635e-01 18.761 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.093e+01 5.751e-01 19.005 < 2e-16 ***
Training_hardwareNVIDIA GeForce GTX 285 3.042e+00 1.043e+00 2.916 0.00538 **
Training_hardwareNVIDIA GeForce GTX TITAN X 6.322e+00 7.379e-01 8.568 3.09e-11 ***
Training_hardwareNVIDIA GTX Titan Black 6.135e+00 1.047e+00 5.862 4.07e-07 ***
Training_hardwareNVIDIA H100 SXM5 80GB 1.115e+01 6.614e-01 16.865 < 2e-16 ***
Training_hardwareNVIDIA P100 5.715e+00 6.864e-01 8.326 7.12e-11 ***
Training_hardwareNVIDIA Quadro P600 4.940e+00 1.050e+00 4.705 2.18e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 5.469e+00 1.055e+00 5.184 4.30e-06 ***
Training_hardwareNVIDIA Quadro RTX 5000 4.617e+00 1.049e+00 4.401 5.98e-05 ***
Training_hardwareNVIDIA Tesla K80 8.631e+00 7.587e-01 11.376 3.16e-15 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 9.994e+00 6.920e-01 14.443 < 2e-16 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.058e+01 1.047e+00 10.105 1.80e-13 ***
Training_hardwareNVIDIA V100 9.208e+00 3.998e-01 23.030 < 2e-16 ***
Training_time_hour:Hardware_quantity -2.651e-07 6.130e-08 -4.324 7.70e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 1.088522)
    Null deviance: 2.7045e+08 on 70 degrees of freedom
Residual deviance: 1.0593e+02 on 48 degrees of freedom
AIC: 1775
This time, Training_time_hour and Hardware_quantity increase total energy consumption by about 0.18% per additional hour and 0.07% per additional chip, respectively. Their interaction, on the other hand, trims energy use by a tiny amount (on the order of 2×10⁻⁵% per unit of the product). These results make much more sense given that Training_time_hour can reach up to 7,000 hours and Hardware_quantity up to 16,000 chips.
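Those percentages come straight from exponentiating the log-link coefficients; for example, assuming the refitted log-link model is stored as glm3_alt (the name used in the plotting code below):

(exp(coef(glm3_alt)["Training_time_hour"]) - 1) * 100                     # ~0.18% per extra hour
(exp(coef(glm3_alt)["Hardware_quantity"]) - 1) * 100                      # ~0.07% per extra chip
(exp(coef(glm3_alt)["Training_time_hour:Hardware_quantity"]) - 1) * 100   # ~ -2.7e-05%, the tiny interaction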

To better visualize the difference, I created two plots comparing each model's predictions (shown as dashed lines) with simple linear fits to the raw data (solid lines). The left panel uses the log-transformed Gamma GLM: its dashed lines are almost flat and close to zero, nowhere near the lines fitted to the raw data. The right panel uses the log-link Gamma GLM: here the dashed lines align much more closely with the actual fitted lines.
# glm3 is the log-transformed Gamma model, glm3_alt the log-link Gamma model
library(dplyr)      # for %>% and mutate()
library(ggplot2)
library(patchwork)  # for combining p1 + p2

test_data <- df[, c("Training_time_hour", "Hardware_quantity", "Training_hardware")]

prediction_data <- df %>%
  mutate(
    pred_energy1 = exp(predict(glm3, newdata = test_data)),                   # back-transformed log model
    pred_energy2 = predict(glm3_alt, newdata = test_data, type = "response")  # log-link model
  )

y_limits <- c(min(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2),
              max(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2))

p1 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, colour = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy1), method = "lm", se = FALSE,
              linetype = "dashed", linewidth = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", y = "log of Energy (kWh)") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, colour = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy2), method = "lm", se = FALSE,
              linetype = "dashed", linewidth = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", colour = "Training Time Level") +
  theme_minimal() +
  theme(axis.title.y = element_blank())

p1 + p2

Why the log transformation fails
To understand why a log-transformed model cannot capture the underlying effects the way a log-link model can, let's look at what happens when you apply a log transformation to the response variable.
Suppose y is some function of x plus an error term:

y = f(x) + ε
Applying the log transformation to y compresses both f(x) and the error together:

log(y) = log(f(x) + ε)
In other words, you are now modeling an entirely new response variable, log(y). When you fit your own function g(x), which in my case is g(x) = Training_time_hour * Hardware_quantity + Training_hardware, you are asking it to capture the combined effect of the compressed f(x) and the compressed error term.
In contrast, with a log link you model the original y rather than a transformed version of it. The model predicts y by exponentiating your function g(x):

E[y] = exp(g(x))
The model minimizes the difference between the actual and predicted y directly, so the error term stays intact on the original scale:

y = exp(g(x)) + ε
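To make the contrast concrete, here is a small, purely illustrative simulation (simulated data, not the Epoch AI dataset): y grows exponentially in x, with noise added on the original scale. In this setup the log-link fit typically recovers the underlying curve much more closely than the back-transformed log model.

# Toy example: exponential mean curve with additive noise on the original scale
set.seed(42)
x <- runif(300, 0, 10)
f <- exp(4 + 0.3 * x)                       # true mean curve
y <- pmax(f + rnorm(300, sd = 30), 1)       # additive noise, kept positive

m_logtransform <- lm(log(y) ~ x)                                # model the transformed response
m_loglink      <- glm(y ~ x, family = gaussian(link = "log"))   # model y with a log link

# Mean squared error of each fit against the true curve, on the original scale
mean((exp(fitted(m_logtransform)) - f)^2)   # back-transformed log model
mean((fitted(m_loglink) - f)^2)             # log-link model, usually much closer here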
Conclusion
Log-transforming a variable is not the same as using a log link, and it does not always yield reliable results. Under the hood, the log transformation changes the variable itself, distorting both the signal and the noise. Understanding this subtle mathematical difference behind a model is just as important as searching for the best-fitting one.
[1] Epoch AI. Data on Notable AI Models. Retrieved from https://epoch.ai/data/notable-ai-models
[2] University of Virginia Library. Interpreting Log Transformations in a Linear Model. Retrieved from https://library.virginia.edu/data/articles/interpreting-log-transformations-in-a-linear-model

