says “this buyer’s likelihood of leaving is 0.4” and your code does predict(X) >= 0.5, you’ve simply made a pricing choice: you determined that the price of sending a retention supply to a buyer who would have stayed is precisely equal to the price of shedding one who would have left, and on the IBM Telco dataset (arguably the most-recycled churn dataset on Kaggle and GitHub), that call is mistaken by an element of 13.
I assembled a corpus of 36 publicly accessible IBM Telco churn analyses (Kaggle notebooks, GitHub repositories, weblog posts, peer-reviewed papers), and the reporting sample is hanging: roughly 9 in ten report classification accuracy or F1, simply over one in seven report a revenue curve, and none use survival evaluation to compute lifetime worth.
The result’s a literature the place the identical dataset has been re-modelled a whole bunch of occasions, and each default-threshold mannequin leaves cash on the ground: about $86 per buyer in avoidable burn on the usual 20% check cut up, and scaled to a 100,000-subscriber guide with the identical churn profile that will signify $8.6 million in recoverable value; the IBM Telco churn fee (26.5% annual) is unusually excessive, and a more healthy B2C SaaS guide with 5–8% annual churn would see the per-customer determine drop by roughly 3–4×, so what stays invariant throughout any cost-sensitive setting isn’t the headline greenback quantity however the asymmetry — 13× costlier to overlook a churner than to over-treat a loyalist.
This piece lays out three issues, so as: first, what the IBM Telco literature experiences and what it leaves out; second, find out how to compute the greenback value of a misclassification utilizing public 2026 B2C SaaS benchmarks and Kaplan-Meier survival evaluation, with no hand-waved CAC; third, why the textbook Bayes-optimal threshold system loses to a brute-force sweep when the mannequin is educated on SMOTE-balanced information, and what to do about it.
Each quantity on this article is reproducible from the scripts linked on the finish.
1. The 36-article hole
The IBM Telco Buyer Churn dataset is small (7,032 cleaned rows), tidy, labelled, and has been the canonical introductory churn dataset on Kaggle for practically a decade.
To get a really feel for what the general public corpus really measures, I listed 36 analyses throughout Kaggle, GitHub, and the most important data-science blogs, scoring every one on ten reporting dimensions starting from F1 rating to a CAC-and-LTV-grounded revenue curve.
The sample that emerged is in Determine 2.

Three findings value pulling out:
- Saturated: F1, accuracy, AUC, confusion-matrix screenshots, and SMOTE-vs-no-SMOTE comparisons seem in 80–90% of the corpus, with hyperparameter tuning through Optuna or grid search a near-universal trope.
- Unusual: a revenue curve (complete greenback value of misclassifications as a operate of choice threshold) seems in fewer than 15% of the analyses I reviewed, and when it does seem, the FN/FP value numbers are normally picked from a textbook instance with out anchoring them to actual CAC or LTV.
- Absent: not one of the 36 analyses I listed compute buyer lifetime worth through survival evaluation on
tenure; most both skip LTV completely or use the steady-state Skok systemLTV = ARPU / monthly_churn_rate, which assumes a homogeneous buyer base — a powerful declare for a dataset the place contract kind, fee technique, and tenure all materially shift retention.
Skipping survival evaluation issues as a result of the edge choice is a operate of LTV: for those who misjudge LTV by 2x you misjudge the price of a missed churner by 2x, and the cost-optimal threshold strikes with it.
The following two sections construct the lacking piece, then plug it again into the edge drawback.
2. The price of an error, in {dollars}
Three numbers decide the greenback value of each prediction (ARPU, gross margin, and CAC); two come straight out of the dataset, one from public 2026 trade benchmarks.
import pandas as pd
df = pd.read_csv("telco.csv")
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna().reset_index(drop=True)
arpu = df["MonthlyCharges"].imply() # $64.80
mean_tenure = df["tenure"].imply() # 32.42 months
churn_rate = (df["Churn"] == "Sure").imply() # 26.58 %
realised_ltv_churned = df.loc[df["Churn"] == "Sure",
"TotalCharges"].imply() # $1,531.80
ARPU and tenure are dataset-native: imply month-to-month cost is $64.80, imply noticed tenure is 32.4 months, and the realised LTV of churners is $1,531.80 (simply the typical TotalCharges over clients who already left), so holding ARPU mounted, three widespread LTV framings give very completely different ceilings:

None of those three is the correct reply, and the primary one is actively mistaken. ARPU × mean_tenure is the framing most tutorials attain for, and it’s damaged on the root. ARPU isn’t a property of a buyer; it’s the joint consequence of each function that additionally drives churn — contract kind, fee technique, product bundle, family construction. The income leaves with the shopper, so ARPU and tenure usually are not unbiased portions, and the textbook decomposition LTV ≈ E[ARPU] × E[lifetime] is simply legitimate when Cov(ARPU, lifetime) = 0. In any churn dataset value modelling that covariance is non-zero — if it have been zero, the mannequin would don’t have anything to foretell — and a buyer with $80/month Month-to-month Costs and eight months of tenure isn’t a “high-revenue buyer”; their $80 and their 8 months are two readings of the identical underlying danger profile.
Layer on the truth that ARPU varies inside a single buyer’s lifetime — a promo-onboarding cohort paying $30/month for the primary three months and $80/month thereafter, a long-tenured loyalty buyer grandfathered right into a $50/month fee whereas new clients pay $90/month for a similar product — and multiplying the imply of 1 distribution by the imply of one other describes a buyer who exists nowhere within the information.
The Skok system a minimum of makes use of a steady-state churn fee, but it surely assumes that fee is fixed ceaselessly. The realised LTV is an actual quantity, but it surely solely describes the shoppers you’ve already misplaced.
Not one of the three tells you when a new buyer breaks even on acquisition value — for that you simply want an actual retention curve, which is the subsequent part, however first a phrase on CAC.
For B2C SaaS within the 2026 benchmarks, CAC ranges from about $68 (eCommerce) to over $200 (fintech), with mid-market subscription merchandise clustering round $150 ([1], [2]); telecom subscriber acquisition value is materially increased ($300+ as soon as handset subsidies are amortised), so $150 is a conservative anchor for this dataset, and selecting the next quantity would solely make the burn calculation in Part 4 bigger.

Gross margin for B2C SaaS sits within the 70–85% band, with 75% as the same old midpoint that matches David Skok’s modelling assumptions for steady-state SaaS economics ([3]).
That provides us the constructing blocks for the price of a single prediction error.
CAC = 150
ARPU = 64.80
REMAINING_TENURE_MONTHS = 18
FN_COST = CAC + ARPU * REMAINING_TENURE_MONTHS # $1,316.40
FP_COST = 100 # typical marketing campaign value (midpoint)
ratio = FN_COST / FP_COST # 13.2 : 1
A false unfavourable (telling a buyer they may keep after they really depart) prices you the brand new acquisition spend ($150) plus 18 months of foregone income ($64.80 × 18 = $1,166.40), for $1,316.40 complete, whereas a false optimistic (flagging a buyer as a churn danger after they have been going to remain) prices roughly $100 of marketing campaign and low cost expense, leaving a price ratio of 13.2 to 1.
A be aware on what the framework is and isn’t. The $150 CAC and the $100 false-positive value on this article are placeholders; CAC varies materially by acquisition channel, and the $100 is shorthand for no matter your actual retention intervention prices — a reduction, a CSM name, a bundle improve, a product investigation. None of those are interchangeable, and a blanket low cost isn’t a retention technique: it’s a deferral mechanism that retains clients solely till the low cost expires whereas paying to retain clients who have been by no means going to depart (and, worse, coaching them to count on the subsequent low cost). Actual retention technique maps churn drivers — a buyer leaving as a result of backup_online retains failing is retained by fixing backup_online, not by a ten% off e mail — and allocates finances towards product enchancment, with the marketing campaign value as a short-term bridge whereas engineering catches up. The revenue curve here’s a threshold-setting software that operates after you’ve determined your retention playbook (what intervention applies to whom, at what actual value); it’s not an alternative choice to that call. Deal with the $150 and the $100 as a single consultant pair; phase them, and the framework segments with them.
That ratio is your complete cause threshold = 0.5 is the mistaken default: the choice boundary ought to mirror the asymmetry, and we are going to get to the precise system, however first comes the LTV piece.
3. The LTV revenue curve
Most churn writeups deal with lifetime worth as a static greenback quantity you multiply by a hazard fee.
Survival evaluation does higher.
It measures retention immediately from the info and turns LTV right into a curve: the cumulative contribution margin per buyer as a operate of months since acquisition, beginning at −CAC on day zero (you’ve paid to accumulate the shopper and earned nothing) and climbing as every surviving month provides ARPU × gross_margin × P(nonetheless alive) to the steadiness.
The Kaplan-Meier estimator does the heavy lifting, with tenure because the length and Churn == "Sure" because the occasion, producing the general curve in Determine 4.
from lifelines import KaplanMeierFitter
import numpy as np
CAC, GROSS_MARGIN, HORIZON = 150, 0.75, 72
kmf = KaplanMeierFitter().match(df["tenure"], (df["Churn"] == "Sure").astype(int))
months = np.arange(0, HORIZON + 1)
S = kmf.survival_function_at_times(months).values
month-to-month = ARPU * GROSS_MARGIN * S
ltv_curve = np.cumsum(month-to-month) - CAC
ltv_curve[0] = -CAC # day zero, solely CAC is sunk

Three readings value pulling out:
- Breakeven at month 3 (in expectation, not per buyer): throughout the unique acquired cohort, the survival-weighted cumulative contribution covers the
$150acquisition value by month 3, and CAC payback beneath twelve months is the David Skok rule of thumb that Telco beats by an element of 4. That is the correct quantity for budgeting cohort-level retention spend, however it’s a cohort common that hides bimodal variance: a buyer who churns in month 1 individually contributes one month of margin ($48.60) and by no means recoups their CAC, whereas a 70-month survivor contributes properly over$3,400. The Kaplan-Meier weighting bakes these early losses in accurately — they only don’t get a star on the curve. - LTV on the 72-month horizon ≈ $2,527 per buyer: mixed with the
$150CAC, that’s an LTV:CAC ratio of about 17.8:1, properly above the three:1 ground most SaaS traders search for, and a helpful sanity verify that the dataset describes a wholesome unit-economics enterprise reasonably than a death-spiral one. - Churn-reduction uplift is modest on the cohort stage: a ten% discount in churn lifts terminal LTV by ~2.8% and a 20% discount lifts it by ~5.7%, so the raise is actual however not heroic, and the acquisition choice issues greater than the retention intervention.
Segmenting the identical calculation by Contract (Determine 5) is the place the framework earns its preserve.

A two-year-contract buyer is value about $3,372 over the 72-month horizon, whereas a month-to-month buyer is value $1,620 (lower than half), with the identical ARPU and the identical CAC, so your complete delta is retention; from a marketing-spend perspective, the right-side acquisition goal (the shopper you ought to be keen to pay extra to accumulate) is the contract-locked one, although they appear “much less worthwhile monthly” in any given snapshot.
That is the sort of choice the usual IBM Telco evaluation can not make, as a result of it by no means computes survival-conditional LTV within the first place.
4. The classification revenue curve
With FN value, FP value, and survival-based LTV in hand, the edge query turns into a one-dimensional optimization: prepare a mannequin, get predicted chances on the check set, sweep the edge from 0 to 1, compute complete greenback value at every threshold, and decide the minimal.
The mannequin here’s a tuned XGBoost educated with SMOTE on the prepare fold solely, the usual Telco recipe.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
y = (df["Churn"] == "Sure").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]),
drop_first=True).astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.20, stratify=y, random_state=42)
scaler = StandardScaler().match(X_tr)
X_tr_s, X_te_s = scaler.remodel(X_tr), scaler.remodel(X_te)
X_tr_b, y_tr_b = SMOTE(random_state=42).fit_resample(X_tr_s, y_tr)
mannequin = XGBClassifier(n_estimators=400, max_depth=5,
learning_rate=0.05,
subsample=0.9, colsample_bytree=0.9,
random_state=42, eval_metric="logloss")
mannequin.match(X_tr_b, y_tr_b)
probs = mannequin.predict_proba(X_te_s)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)
totals = []
for t in thresholds:
pred = (probs >= t).astype(int)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
totals.append(fn * 1316.40 + fp * 100)
The result’s in Determine 6.

The numbers, on a 1,407-row check set:

Transferring from 0.5 to the empirical minimal recovers $121,160 on the check set, which is $86.11 per buyer, and making use of that to a 100,000-subscriber guide provides the headline $8.6M; your mileage will differ along with your CAC, your ARPU, and your retention curve, however the multiplier (13× costlier to overlook a churner than to over-treat a loyalist) is what makes the hole massive.
When the textbook system loses to the edge sweep
Open any cost-sensitive classification reference (Provost and Fawcett’s Knowledge Science for Enterprise is the canonical one [4]) and one can find the Bayes-optimal threshold system:
t* = C_FP / (C_FP + C_FN)
Plug in our value ratio: t* = 100 / (100 + 1316.40) ≈ 0.0706, which is right math, and on a mannequin with calibrated chances that threshold minimizes anticipated value; however the sweep provides t = 0.03, and at that threshold the test-set value is $8,661 decrease than at 0.07, so the place is the hole?
The Bayes Optimum system assumes the mannequin’s predicted chances are calibrated: a prediction of 0.5 ought to correspond to a 50% true churn likelihood, however our mannequin is educated on a SMOTE-balanced set, which inflates the minority class to 50% throughout coaching, and tree-based learners then output chances biased towards increased values, with the mannequin’s “0.07” mapping to roughly the true 0.03 in calibrated likelihood house; the textbook system isn’t mistaken, it’s being utilized to an out-of-spec enter.
There are two clear fixes:
- Calibrate the chances first: apply Platt scaling or isotonic regression on a held-out set, then use the Bayes-optimal threshold on the calibrated output, with scikit-learn’s
CalibratedClassifierCVdoing it in a single line. - Skip calibration and sweep: it’s low cost, it tolerates calibration drift, and on a small dataset like Telco the test-set sweep is extra dependable than a calibration mannequin match on a whole bunch of held-out rows.
In follow, for manufacturing techniques with common re-training, the sweep is what most groups ship; the system is the correct factor to show (with calibration because the asterisk), and each ought to seem in any trustworthy writeup of a cost-sensitive churn mannequin.
Neither one exhibits up within the IBM Telco corpus I listed.
5. What the subsequent IBM Telco article ought to report
Three concrete shifts would make the subsequent 36 IBM Telco analyses extra helpful than the final 36:
Report a revenue curve, not a confusion matrix. F1 at threshold 0.5 is a match metric (helpful for rating fashions when it’s important to decide one, ineffective for deciding find out how to ship one), and the curve in Determine 6 has extra decision-relevant data than each accuracy comparability within the corpus mixed.
Anchor LTV in survival evaluation, not steady-state assumptions. Kaplan-Meier on tenure is 30 strains of Python; the breakeven quantity, the LTV at horizon, and the contract-segmented curve give advertising and marketing operations a usable finances to spend on retention plus a defensible reply to “which buyer ought to we purchase tougher?”, and the Skok system stays a fantastic sanity verify reasonably than the load-bearing LTV estimate.
Disclose the calibration assumption once you quote a Bayes-optimal threshold. Both calibrate first or be aware explicitly that the edge reported is the empirical minimal from a sweep, and Wang et al. ([5]) make a carefully associated argument with a extra elaborate metric (e-Income) that makes use of survival evaluation end-to-end, with the identical core thought.
Phase the intervention, not simply the rating. A revenue curve assumes a single FP value (the worth of “treating” one false alarm), however in follow the most affordable intervention for a high-value loyalist is completely different from the most affordable for an at-risk new account: a bundle improve for one, a price-pain investigation for the opposite, do-nothing for the third; segment-aware FP prices (and segment-aware thresholds) are the pure follow-up to the framework right here.
I went into this anticipating the hole can be within the modeling. It isn’t.
The IBM Telco dataset has been mined to bedrock for predictive accuracy, and what it may nonetheless educate is whether or not our pipelines result in good choices, not simply correct predictions.
That requires three issues: greenback prices on errors, actual retention curves on clients, and an trustworthy threshold on the classifier — 4 scripts and a Kaplan-Meier match get you there.
References
[1] Genesys Development Advertising, Buyer Acquisition Price Benchmarks: 44 Statistics Each Advertising Chief Ought to Know in 2026 (2026), genesysgrowth.com.
[2] Confirmed SaaS, CAC Payback Benchmarks 2026: SaaS Buyer Acquisition Price (2026), proven-saas.com.
[3] D. Skok, SaaS Metrics 2.0: Detailed Definitions (2014, up to date 2024), forentrepreneurs.com.
[4] F. Provost and T. Fawcett, Knowledge Science for Enterprise (2013), O’Reilly Media, ch. 7–8.
[5] Y. Wang, S. Albrecht, et al., e-Income: A Enterprise-Aligned Analysis Metric for Revenue-Delicate Buyer Churn Prediction (2025), arXiv:2507.08860.
[6] C. Davidson-Pilon, lifelines: survival evaluation in Python (2019), Journal of Open Supply Software program 4(40), 1317.
[7] W. Verbeke, T. Verdonck and S. Maldonado, Revenue-driven choice bushes for churn prediction (2018), European Journal of Operational Analysis, 284(3).
[8] N. El Attar and M. El-Hajj, A scientific evaluate of buyer churn prediction approaches in telecommunications (2026), Frontiers in Synthetic Intelligence.
Code, information, and reproducible scripts for each determine can be found on request. The dataset is the IBM Telco Buyer Churn dataset, totally artificial pattern information printed by IBM in its official repository (github.com/IBM/telco-customer-churn-on-icp4d) beneath the Apache License 2.0, which allows use, spinoff evaluation, and publication with attribution. The info is artificial and comprises no actual clients or PII.
Thanks for studying! If in case you have any questions or want to join, be happy to succeed in out to me on LinkedIn. 👋

