Choosing an Experimentation Platform: A Retrospective

by root June 7, 2026


There comes a moment, in every company that wants to ship products people love, when “we should experiment more” turns into “we can’t keep experimenting like this.” Hand-tuned holdouts; traffic-allocation tickets bouncing between PMs and engineers; analyst calendars booked weeks out. The desire to be data-driven outgrows the machinery that was supposed to make it so.

That was where we sat at Manychat last year. We chose Eppo, but that decision is the smallest part of the story, and the part you can least transplant to your company. What I want to share instead is the process I walked through to get there, what I got wrong along the way, and what surprised me on the other side of the contract (yep, doctors hate me for this trick).

A note on timing. We picked Eppo at an unusually exciting moment in the industry, as the vendor map was shifting under us mid-evaluation. Eppo itself had been acquired by Datadog some months before. Statsig had recently been acquired by OpenAI, and would later be sold on to Amplitude. I don’t think any of what I describe below depends on that particular news cycle, but I want to acknowledge that some of it shaped our mood while we were deciding.

I break what follows into three acts: before the decision, during it (making the decision), and after.

Before

Let me get you in the mood we were in before any of this happened. As I onboarded to the company, an engineer told me that if there were two simultaneous opportunities to run experiments, his team would simply postpone the second idea to a later sprint because of the technical headache of configuring the two allocations. The risk of getting it wrong ultimately outweighed the excitement to test. That is quite literally anti-velocity at best, no experiment at worst. And for the one experiment that could be configured, copy-pasting boilerplate allocation logic was their bread and butter.
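To make the allocation headache concrete, here is a minimal sketch of one common way to assign users to two simultaneous experiments without hand-configuring each split: hash the user ID together with the experiment name. This is not the code our teams used, nor a description of how Eppo or Statsig do it; the function names, the experiment names, and the 100-bucket granularity are all assumptions for illustration.

```python
import hashlib


def bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    # Salting the hash with the experiment name makes a user's bucket in one
    # experiment effectively independent of their bucket in another.
    key = f"{experiment}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % buckets


def assign(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Return 'treatment' or 'control'; the same inputs always give the same arm."""
    return "treatment" if bucket(user_id, experiment) < treatment_pct else "control"


# Hypothetical experiment names; every user gets an arm in both at once.
users = [f"user-{i}" for i in range(10_000)]
arms_a = [assign(u, "onboarding-copy") for u in users]
arms_b = [assign(u, "pricing-banner") for u in users]
share_a = arms_a.count("treatment") / len(users)
share_b = arms_b.count("treatment") / len(users)
```

Because the hash is salted per experiment, a user's arm in one test tells you essentially nothing about their arm in the other, which is the property the hand-configured setup made risky to get right.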

An analyst on the other side of that same pipe described herself as a “human microservice”; she meant the holdout groups, defined by hand, refreshed by hand, passed on to the engineer, and so on … an exciting opportunity to experience the entire flow in first-person POV, indeed. But, irony aside, that was the moment the case for a platform stopped being abstract.

I had seen versions of this room before. At Marktplaats, some years earlier, I had written the in-house Python libraries that try to absorb this kind of pain, and we saw time-to-insight go down from days to hours in the tail cases.

I watched the same build-or-buy debate play out again at Adevinta, globally, at a larger scale, where it landed on building rather than buying. Luckily for us at Manychat, by the end of 2025 the platform options had matured enough that, for a company our size and at that moment, buying was the obvious move.

We wanted the tool that would give us the best shot at getting our experimentation program where we want it: cutting-edge statistics, yes, but more importantly a tool that nudges its users toward conclusive experiments by default; product managers included.

Two things stood between us and the choice. The first was simple: we had named the pain, but it was only anecdotal so far. Leadership had a (good) notion of what was broken, and I had heard devs and product managers grumble about the current stack when I first met them. But none of that was the same kind of object as a vendor requirements list. Until we could put the two side by side, we couldn’t tell which capabilities were nice-to-haves and which were the point.

The second was harder. The decision carried a lot of weight because, no matter how you put it, there is always a lock-in element to any platform; culturally, if not technically. And resources are finite: we couldn’t POC every platform on the market, let alone absorb the opportunity cost of having to reverse the decision and start over. Picking one to bet on, in a single sitting, with no chance to course-correct, would have been asking to be wrong. And with the options being so similar in most ways, finding the best one for us was a matter of precision. We needed a way to break a single high-stakes decision into smaller, lower-stakes ones that built on each other.

Interviews, and de-risking the decision

I started with interviews. PMs, product analysts, engineers, marketers. The point was to convert anecdote into something we could hold up against a vendor’s feature list. The engineer’s calendar story, the analyst’s “human microservice”, the PM who had given up on running atomic experiments and was bundling changes into bigger releases instead, postponing some of them entirely: these became the job description for the tool. I can’t overstate how much this paid me back later. Every time the process drifted, and it drifted, the interviews were the anchor we came back to. They were also what made the whole effort credible inside the organization: telling my CPO why we were spinning up a POC was a different conversation when I could quote a specific friction back to her.

For the single-shot problem, we phased the discovery into three layers, each going one level deeper in the evaluation:

Desk research. Read the vendor docs, sketch a long list. Most platforms self-eliminated here, before we ever entered a sales funnel. Plenty of Claude Code at this step, too.
Demos. A focused conversation with each shortlisted vendor. A bit of sales pitch, sure, but mostly us probing the areas we had decided mattered most.
POC. Hands on the platforms, with real data and real evaluators, only for the two finalists.

Each layer narrowed the field and bought us information at a “price” we could afford. By the time we reached the POC we were down to two, and the decision in front of us had shrunk to something we could actually hold. Statsig, or Eppo?

There’s one part of this I would repeat on day one of any future platform decision, in any category: the interviews that define the pain points. They were the single biggest unlock of the whole stage. Running close behind them, sponsorship. And I don’t mean just from my director, who asked to push it forward. I kept peers and stakeholders who would have to back / adopt the decision in the loop the whole way through. By the time the POC ended, the decision surprised no one.

At the end of “before” we had a shortlist of two, and the discipline of how we had narrowed to them. We knew what worked for us. The harder question was still waiting: between two platforms that both cleared our bar, which was actually better for us? How would we define “better” conceptually, and how would we agree on it practically?

During

It was the debrief, after the POC, and the analysts on the panel were taking turns talking. Two of them, who knew our stack best, finished their summary with a sentence along the lines of:

“As a product analyst, I would be really happy to move forward with either of them.”

I sat with that for a moment. The consolidated scores agreed with them: the two platforms came in at 4.36 and 4.47 on a five-point scale, across more than twenty weighted criteria. By any reasonable read, it was a tie. I had spent weeks building a process that would point clearly at one platform, and the process had just told me, in the voice of the peers I trusted most to spot a meaningful difference, that there was no meaningful difference from their seat.
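For readers who have not built one, a consolidated score of this kind is a weighted average of per-criterion ratings. The sketch below shows the arithmetic on three made-up criteria; the criterion names, weights, and ratings are hypothetical and are not our actual matrix.

```python
def weighted_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 1-5 ratings; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(ratings[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights and ratings, for illustration only.
weights = {"statistics": 3, "pm_workflow": 2, "sdk_integration": 1}
platform_a = {"statistics": 5, "pm_workflow": 4, "sdk_integration": 4}
platform_b = {"statistics": 4, "pm_workflow": 5, "sdk_integration": 4}

score_a = weighted_score(platform_a, weights)  # (15 + 8 + 4) / 6 = 4.5
score_b = weighted_score(platform_b, weights)  # (12 + 10 + 4) / 6 = 4.33...
```

In this toy example, raising the hypothetical pm_workflow weight from 2 to 4 flips the ranking (4.375 against 4.5), which is one reason to read a gap this small as a tie rather than a verdict.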

What I learned in that moment, and wouldn’t have learned without the panel, is that analyst-grade rigor has become table stakes. The marginal value of choosing one modern experimentation platform over another doesn’t accrue to your scorecard; it accrues somewhere else. Where, exactly, was the question I now had to answer.

So I needed a decision I could defend; to myself first, then to my data director and CPO, then to the teams who would inherit it. Coin flips and personal preferences are bad foundations for a multi-year contract. And the tie meant the tiebreaker couldn’t be invented after the fact; it had to reflect what we actually wanted from the next few years of experimentation at Manychat.

In particular, we weren’t choosing between two snapshots; we were choosing between two trajectories. Eppo’s bet was on guided, opinionated, PM-shaped *cough* proof *cough* workflows; Statsig’s was on power-user flexibility. Both were defensible for sure. But we had said, recall:

We wanted the tool that would give us the best shot at getting our experimentation program where we want it: cutting-edge statistics, yes, but more importantly a tool that nudges its users toward conclusive experiments by default (…)

I noticed what didn’t happen. The POC plan called for PMs to trial both platforms and feed scores back into the matrix. They mostly didn’t, because of bandwidth. One head of marketing operations and one PM gave me unprompted impressions, and the rest of the PM-side evidence and input stayed thin. The absence of PM feedback did something counterintuitive: it increased the weight I gave to PM-facing UX / workflows, and governance, in the final call. The logic is asymmetric. Analysts are adaptable, power users if you will; they will work their way through whatever interface you hand them. PM onboarding is not adaptable in the same way. If the platform our analysts rated equally is also the one that lowers the barrier for our PMs, that is a decision; the reverse, choosing the analyst-equivalent platform our PMs would have struggled with, would have been quiet self-sabotage.

In short, we could finally say: everything else near-equal, usability for non-technical folks is what sets the two platforms apart.

So we picked Eppo. The trajectory question is what tipped it: on a long horizon, Eppo lined up better with where we wanted experimentation to live; closer to experimenting teams, and beyond just the analyst. Knowledge management as a first-class object. Reporting that doesn’t need a deck rebuilt around it. Statsig had its advantages too: CUPED (a variance-reduction technique) inside its power calculator, a standalone metrics explorer, a more flexible analysis surface. We accepted these as year-one gaps to work around while Eppo was being revamped inside Datadog and acquiring these features too.
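Since CUPED comes up here, a brief sketch of the idea may help: each user's outcome is adjusted using a pre-experiment covariate (typically the same metric measured before the test), which keeps the mean but shrinks the variance when the two are correlated. The code below is a textbook-style illustration on made-up numbers, using only the standard library; it is not a description of how Statsig or Eppo implement it, and the function name is my own.

```python
from statistics import mean, pvariance


def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x))."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    # theta = cov(x, y) / var(x); this raises ZeroDivisionError if x is constant.
    theta = cov_xy / var_x
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Made-up data: x is the pre-experiment metric, y the in-experiment metric.
pre_period = [1.0, 2.0, 3.0, 4.0, 5.0]
in_period = [2.1, 3.9, 6.2, 7.8, 10.1]

adjusted = cuped_adjust(in_period, pre_period)
variance_before = pvariance(in_period)
variance_after = pvariance(adjusted)
```

Lower variance means a narrower confidence interval for the same sample size, which is why having it inside a power calculator matters: planned run times get shorter. In practice theta is usually estimated once on data pooled across arms, a detail this sketch leaves out.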

Looking back, the lesson I take away from it is double-edged. The decision needed more rigor than instinct wanted, and then less faith in that rigor than I expected. The scorecard mattered because it forced everyone to be specific, and because it created a sense of trust and credibility in the outcome. It gave me 360-degree coverage, but the call came from the moments inside it: the analyst tie, and the vision question. Six months after signing, a curious colleague could ask me how we had picked, and I could walk them through the panel, the scorecard, the corrections, and the vision/framing question. That’s a win for me.

After

I think I expected, somewhere I would not admit aloud, that signing the contract was the finish line. I had spent weeks building a credible decision system, a process, and had spent a few hours on vendor calls. The week we signed I had a quiet day. I sat down at my desk and started a working document about what would happen next. Legend has it that I’m still writing it.

The clean-water metaphor I had used in the proposal kept coming back to me. We had laid the pipes; that was the SDK integration, the data plumbing, the warehouse connections. The platform itself too, if you will. Pipes get you flow, but not clean water. In the worst case, pipes contaminate it instead (more crap output, faster). Clean water is what comes out of pipes when the rest of the system (the source, the treatment, the people who maintain it) does its job. Experiments work the same way: a platform gets you the flow, but the trustworthy results come from governance and process, from people, and from how seriously the organization treats the difference between testing an idea and launching a feature.

The tool is ready; the organization isn’t yet ready for the tool.

Until that point I had been deep in the cost of the contract, but not in the cost of bridging the gap between “the tool is here now” and “the organization is ready to use it”.

I had told colleagues, in the weeks leading up to signing, that a chunk of the analytics team’s capacity would slowly ramp up to a new equilibrium once Eppo was live. As of writing, I’m still hopeful that will materialise a quarter or two from now; but not before we get some things in place first. Velocity, the mere act of experimenting more in a given period, also has to wait.

Signing didn’t buy time back yet, nor did it bring us more experiments immediately. The work that started the day after signing all had to happen before the platform could deliver the velocity potential we knew it had: forming a cross-functional integration team, drafting the experiment lifecycle, configuring Eppo protocols (part of its governance framework), certifying our first success metrics and guardrails, migrating a knowledge base, designing a training curriculum. In short, what was ahead was not a tool problem. Rather, a governance, process, and people one.

Three legs of a stool

For experiments to actually be trustworthy at Manychat, three things have to be present at the same time: tooling and engineering integration, so experiments can flow through the platform; process and governance, so the experiments that flow through are properly designed and decided; and people and skills, so the best practices are followed in practice and not only on paper. Drop any one of the three and the whole thing leans.

We had the tool and the connections now. Process and governance was mostly on the data science team: a five-stage experiment lifecycle (Propose, Design, Run, Analyse, Decide); a certified set of success and guardrail metrics; all of it encoded into the platform’s own protocol templates so that the rails weren’t a Notion page but a feature of the tool. People and skills are to be materialised in ad hoc Eppo-delivered tool quick-starts, and an Experimentation 101 and 102 curriculum in the long run. There is also an ongoing argument for a graduated autonomy model, PMs paired with analysts at first, more independence over time; that’s the dot on the horizon.
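To illustrate what “rails as a feature of the tool” means in the abstract, here is a tiny sketch of the five-stage lifecycle as a gated sequence. It is not Eppo's protocol API, and the idea of a per-stage checklist flag is an assumption of mine for illustration; only the stage names come from our lifecycle.

```python
from enum import Enum


class Stage(Enum):
    PROPOSE = 1
    DESIGN = 2
    RUN = 3
    ANALYSE = 4
    DECIDE = 5


def advance(stage: Stage, checklist_done: bool) -> Stage:
    """Move to the next stage only when the current stage's checklist is complete."""
    if not checklist_done:
        raise ValueError(f"{stage.name} checklist is incomplete")
    if stage is Stage.DECIDE:
        raise ValueError("DECIDE is the final stage")
    return Stage(stage.value + 1)


# An experiment cannot reach RUN without passing through DESIGN first.
current = advance(Stage.PROPOSE, checklist_done=True)
```

The difference from a wiki page is that skipping a step fails loudly instead of depending on someone remembering the rule.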

The other thing

A milder lesson: signing Eppo was where my job description changed. I had walked into the project as the Staff responsible for choosing a tool. I walked out doing change management: onboarding teams, teaching, leaning on PMs about lifecycle compliance, spending credibility I had banked for other things. It was totally worth it for me, though.

Closing notes

If I had to compress all of this, these would be the few lines I’d fit it in:

A credible decision is the deliverable, not the platform. The platform is an artifact. The decision is what your organization will live inside for years.

In the same spirit, pipes are not water. A tool is necessary infrastructure for trustworthy experimentation, but not sufficient. The work starts, not ends, on the day the contract is signed.

I’m writing all of this knowing the experimentation tools market is in motion; the vendor churn I flagged up top has not stopped. Whatever the map looks like by the time you read this, the bits of process that survived for me are probably the bits worth borrowing: the interviews, the phased discovery, the vision framing, and the honest budgeting for what comes after.

If you want to dive into the details over a virtual cup of coffee, feel free to ping me on LinkedIn! I’d be happy to share ideas with you.

Also check out my personal page for more pieces like this.
