Great experiments kick off a virtuous cycle: they inspire others to question their own assumptions, and that in turn locks in a culture of hypothesis thinking that spreads through an organization. I’m often amazed by how quickly a company can go from 1 experiment a month, to 10, to 100.
Unfortunately, this progression is never perfectly smooth. As you run more tests, collect more data, and include more people in the process, a whole new set of challenges starts to appear. This final post is all about those pitfalls – the scaling challenges that can stop a growing testing culture in its tracks. To illustrate them, I’ll be sharing examples from two titans of the testing world: Booking.com and Airbnb.
Experimenting in more places
Booking is a company that’s famous for designing its entire development culture around constant experimentation. But like every testing program, they started with humble beginnings. The first version of their experimentation platform supported only a handful of experiments at a time, and when they got started a decade ago the team couldn’t imagine the scale they’d eventually hit. Now fast-forward to the present:
Overall, on a daily basis, all members of our departments run and analyse more than a thousand concurrent experiments to quickly validate new ideas. These experiments run across all our products, from mobile apps and tools used by hoteliers to customer service phone lines and internal systems. Experimentation has become so ingrained in Booking.com culture that every change, from entire redesigns and infrastructure changes to bug fixes, is wrapped in an experiment…Such democratization is only possible if running experiments is cheap, safe and easy enough that anyone can go ahead with testing new ideas, which in turn means that the experiment infrastructure must be generic, flexible and extensible enough to support all current and future use cases.
Through all this exponential growth, Booking has had to continuously revamp its testing infrastructure to support new kinds of experiments. At each stage of growth, new teams brought new touchpoints and new challenges. From website conversion optimization, they expanded to native mobile testing, and from there to experiments deep in the backend technology stack. Today, Booking has a team of over 40 developers and statisticians continuously working on improving their testing platform.
At Optimizely, we’ve taken our own version of this journey. Like Booking, we’ve devoted 8+ years and tens of millions of dollars in R&D to expand our “experimentation footprint.” From simple website testing, we’ve followed the lead of the savviest teams to add native mobile experimentation and server-side testing across 10 different languages. Along the way, we’ve had to work through maddeningly subtle problems. Experiments have to work everywhere, but they also have to be rigorous, performant, secure, compliant, consistent, and reliable. If any one of those things goes wrong, in any part of the stack, it can instantly undermine years of trust built up in a testing culture.
If there’s one lesson I can distill from that journey, it’s this: don’t choose your testing platform based on the first experiment you want to run, or even the tenth. Think about where your hundredth or thousandth test might run – and make sure that you’re building or buying technology that can scale far beyond it.
Collecting more data
Scaling experimentation doesn’t just mean supporting new use cases: it also means gathering more data. A lot more data. Enough to take your cleverly designed analytics pipeline and blow it up, over and over again. I’ve lived through this particular challenge three or four times now, so in a strange way, it was comforting to read about Airbnb’s long saga of scaling their Experiment Reporting Framework (ERF):
The number of concurrent experiments running in ERF has grown from a few dozen (in 2014) to about 500…More impressively, the number of metrics computed per day has grown exponentially…Today we compute ~2500 distinct metrics per day and roughly 50k distinct experiment/metric combinations.
The post lays out a series of challenges along the way:
- Early on, all experiment analysis happened through a simple script that ran once a day. This all worked fine, until one day it started taking more than 24 hours to run. Suddenly, they were dealing with “a swamped Hadoop cluster and dissatisfied end users.” (“Dissatisfied” is probably an understatement. When the same thing happened to us at Microsoft, we’d start a test on Monday and not see data until Thursday. Everyone went ballistic.)
- To unclog the pipes, they realized they had to rebuild the entire backend. So they made a whole new technology called Airflow – now a thriving open-source project. And it worked: analysis times went from over 24 hours to under 45 minutes. Problem solved!
- Just kidding: “Adoption immediately began to thrive. This led to more experiments and a huge influx of new metrics. Things quickly got out of hand with some experiments adopting 100 or more metrics.” Soon, it all got slow again, and on top of that, the UI became hopelessly overcrowded.
- So they switched to a new system that precomputed more data. Problem…solved? Not exactly. “It was hugely successful, but left users wanting to dive deeper into metrics. To accommodate this need, we launched…” — ok, I think we all get the idea.
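The precomputation idea in that last step can be sketched in a few lines. This is an illustrative toy, not Airbnb’s actual pipeline: the event fields and function name are assumptions, but the principle is the same – roll raw events up into small per-experiment aggregates once, so the reporting UI never has to scan raw logs.

```python
from collections import defaultdict

def precompute_daily_aggregates(events):
    """Roll raw metric events up into (experiment, variant, metric) sums,
    so dashboards read a handful of aggregate rows instead of raw logs."""
    agg = defaultdict(lambda: {"count": 0, "total": 0.0})
    for e in events:
        key = (e["experiment"], e["variant"], e["metric"])
        agg[key]["count"] += 1
        agg[key]["total"] += e["value"]
    return dict(agg)

events = [
    {"experiment": "home-banner", "variant": "control",   "metric": "bookings", "value": 1.0},
    {"experiment": "home-banner", "variant": "treatment", "metric": "bookings", "value": 0.0},
    {"experiment": "home-banner", "variant": "control",   "metric": "bookings", "value": 1.0},
]
agg = precompute_daily_aggregates(events)
# The control arm saw 2 booking events totalling 2.0.
```

The trade-off Airbnb ran into is visible even here: precomputing every experiment/metric combination is fast to read but expensive to produce, and users inevitably want drill-downs the aggregates can’t answer.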
I love this story because at every stage, Airbnb’s platform was a victim of its own success. As experimentation got easier, users expected more from it! And as the data became more critical to running the business, analysts expected to get deeper insights and generate them faster than ever.
At Optimizely, we’ve certainly felt this pressure. Our engineers have spent many sleepless nights helping sites like The New York Times and The Gap sustain experimentation through traffic spikes like Election Day and Black Friday. Through it all, we’ve learned that it’s essential to build infrastructure that can handle billions of incoming events – while also supporting open-ended exploration through segmentation, advanced metrics, and other interactions. And it has to be fast, which is why we’ve always insisted on real-time results. If something breaks on Monday, you can’t afford to wait until Tuesday to find out. Or as Booking says:
Sometimes experiments introduce such severe bugs, removing, for example, the ability for certain customers to book, that the overall business is immediately impacted. Therefore, it must be possible to attribute, in real time (in our current system that means less than a minute), the impact of every experiment on both the overall business and the health of the systems.
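A minimal version of such a guardrail check might look like the sketch below. The threshold and metric names are my assumptions, not Booking’s: their real system streams events and attributes business impact per experiment within a minute, but the core decision is the same – flag any experiment whose treatment arm drops a key metric past an allowed margin.

```python
def guardrail_breached(control_rate: float, treatment_rate: float,
                       max_relative_drop: float = 0.05) -> bool:
    """Return True if the treatment arm drops a key business metric
    (e.g. booking conversion) by more than the allowed relative margin."""
    if control_rate == 0:
        return False  # no baseline signal yet; can't attribute a drop
    relative_change = (treatment_rate - control_rate) / control_rate
    return relative_change < -max_relative_drop

# A treatment converting at 2.0% against a 3.0% control is a ~33% drop,
# far past a 5% guardrail – this experiment should be halted immediately.
assert guardrail_breached(control_rate=0.03, treatment_rate=0.02)
```

In production, a check like this runs continuously against streaming aggregates rather than point estimates, and a breach typically pauses the experiment automatically rather than waiting for a human to notice.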
Adding more experimenters
When I talk to engineers at Booking and Airbnb, they’re justifiably proud of the way they’ve overcome these technical hurdles. But when I ask where they still struggle, I hear the same answer: “It’s not about the technology, it’s about teaching people to use it.” No matter how solid the code is, it’s still so easy to pick the wrong metrics and get tricked by statistics. At Airbnb, “consistency of experiments and avoiding mistakes is a big problem.” Booking has steered a large part of its effort from helping teams run tests to helping teams plan and document them:
Enabling everyone to come up with new hypotheses is key to democratizing experimentation and moving away from a product organization where only product managers decide what features to test next. Therefore, Booking.com’s experiment platform acts as a searchable repository of all the previous successes and failures, dating back to the very first experiment, which everyone can consult and audit.
We’ve seen the same challenges at Optimizely, which is why we were so excited to join forces with Experiment Engine last year. In January, we released Optimizely Program Management to offer our customers their own version of Booking’s idea repository. Already, we’re seeing teams like the BBC and IBM turbo-charge their culture by allowing everyone in the organization to submit ideas, score them by impact and effort (and love!), record lessons from past experiments, and measure program velocity.
More broadly, I’ve been struck again and again by how important usability and reusability are for sustaining a culture of experimentation. Before Optimizely, I would hack together A/B tests by writing my own randomized bucketing in code and sifting through the results by querying raw data. This approach works fine for a small, savvy team that knows what it’s doing. But it falls apart the moment you want to expand to multiple teams testing in different places.
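For the curious, the hand-rolled bucketing I’m describing usually amounts to something like this sketch. The hash scheme and names are illustrative, not any particular platform’s implementation – the key property is that assignment is deterministic, so the same user always lands in the same variant without storing any state:

```python
import hashlib

def bucket(user_id: str, experiment: str, variants: list) -> str:
    """Deterministically assign a user to a variant by hashing
    (experiment, user_id). No assignment table needed: re-running
    the function always yields the same answer."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    index = int(digest, 16) % len(variants)
    return variants[index]

# Stable: the same user always sees the same variant of this experiment.
assert bucket("user-42", "checkout-test", ["control", "treatment"]) == \
       bucket("user-42", "checkout-test", ["control", "treatment"])
```

Writing this once is easy; the hard part is everything around it – mutually exclusive experiments, traffic ramping, consistent bucketing across web, mobile, and backend – which is exactly where one-off scripts stop scaling.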
If you’re a developer, the challenge usually isn’t building your own experiments, but building a repeatable process for everyone else to run experiments. That means not just easy code interfaces for building tests, but helpful user interfaces that guide experimenters through each step of building a good test. And as a product manager, I’m consistently humbled by users who tell me, “My favorite thing about your product is the documentation.”
As I read stories like Airbnb’s and Booking’s, I can’t help asking myself – “is it really worth all the trouble?” It feels like reading the story of early explorers or astronauts, going out into the unknown and hitting one challenge after another. At each point, they must have considered just pulling the plug on experimentation and going back to guesswork. It’s certainly easier.
What’s remarkable, though, is that none of them gave up. Just the opposite: at each stage, they discovered a way to make experimentation radically easier, and in doing so, they unlocked the scientific method for a whole new branch of the business. Experimentation pitfalls became opportunities to innovate. And eventually, those innovations piled up to form a sustainable competitive advantage from experimentation.
Not so long ago, this daunting path was only open to the most successful technology companies. At Optimizely, we’ve made it our mission to change that – eliminating one pitfall at a time. If you’re not already experimenting at scale, we’re here to help.
Originally published at https://blog.optimizely.com/2018/05/31/failing-to-scale-across-teams
Jon Noronha is Director of Product Management at Optimizely, where he helps companies around the world build a culture of experimentation through A/B testing and feature flags.