Running cool experiments is easily one of my favorite parts of working in data science.
Most experiments don’t deliver big wins, so the winners make for fun stories. We’ve had a few of these at IntelyCare, and I’m sharing each story in a way that highlights a concept related to experimentation.
In this post, I’ll share the story of how we avoided doing something stupid by running an experiment first, and I’ll use it to discuss the multiple comparisons problem.
Background: IntelyCare hires nurses at scale… and it’s COVID 😷
IntelyCare connects nurses with work opportunities ranging from full-time positions to individual shifts. For individual shifts, clinicians work for IntelyCare as employees (the agency model), which means we’re hiring nurses 24/7.
You may have suppressed this memory, but in 2020 and 2021 we had this global pandemic. Hiring nurses during the pandemic was nothing short of a rock fight. We had full business permission to try everything and anything that could help us hire nurses more quickly and efficiently.
The problem: Lots of applies, but not so many new hires
Working anywhere in healthcare means submitting a big pile of paperwork — licenses, immunizations, and certifications, on top of the usual resumes, references, and background checks.
IntelyCare is no different. And even though we make it all phone-friendly and digital, submitting all this paperwork is about as fun as filing your taxes. And that means many people who apply give up somewhere between creating an account and completing a shift.
The solution: Just throw money at it! 💸
We tried lots of things (including different referral incentives). One easy-to-try proposal was to just pay clinicians an extra $100 when they complete their first shift.
Why $100? Because it’s a nice round number and looks good on Marketing materials. You might be surprised how many business decisions are made this way (unless you’re in marketing, in which case it’s perfectly normal).
The idea was so easy we almost went live without a test; there was a lot of pressure to move quickly. But science prevailed, and instead of offering $100 to everybody, we randomly offered bonuses ranging from $0 to $100 in increments of $25.
Clinicians were informed of the bonus via email throughout the application process. (Unless you had a $0 bonus — no email for you.)
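Mechanically, a randomization like this doesn’t need much. Here’s a minimal sketch in Python of how the assignment could be wired up; the function, IDs, and hashing approach are illustrative assumptions, not our production code.

```python
import hashlib

BONUS_LEVELS = [0, 25, 50, 75, 100]  # dollars

def assign_bonus(applicant_id: str) -> int:
    """Deterministically hash an applicant ID into one of the five bonus arms.

    Hash-based assignment keeps an applicant's arm stable if they re-enter the
    funnel, without storing a separate randomization table.
    """
    digest = hashlib.sha256(applicant_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(BONUS_LEVELS)
    return BONUS_LEVELS[bucket]

# Assign arms and decide who gets the bonus email (the $0 arm gets none).
for applicant_id in ["A-1001", "A-1002", "A-1003"]:
    bonus = assign_bonus(applicant_id)
    print(applicant_id, bonus, "email" if bonus > 0 else "no email")
```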
We ran this test for several months to give candidates sufficient time to complete their applications. By the time we circled back to make a decision, we had several thousand applicants at each bonus level.
Spillovers? Always a possibility, but they seem unlikely here. Demand for nursing talent was insanely high at the time, and I have a hard time imagining clinicians with high bonuses stealing all the shifts from those with low (or no) bonuses, which would exaggerate the impact of the high bonus. There were plenty of shifts to go around.
Technical aside: Multiple comparisons
If you ever run a test like this, chances are some higher-up will ask you to “slice and dice” or “cut” or perhaps “dig into” the data 100 different ways. This is fun but also dangerous. Wait, dangerous?! Let’s discuss.
- Datasets are finite and noisy, which means anytime you test a hypothesis using your dataset there’s a chance your answers are incorrect. Sorry, I didn’t make the rules.
- To understand the risk of an incorrect answer, we look at the variance of the dataset. Knowing the variance helps us judge whether a statistic is “close to” or “far away from” another possible answer (e.g., “Does a marketing campaign have a non-zero impact on sales?”).
- Suppose, given the amount of noise in my data, there’s a 5% chance I draw a false conclusion for a given hypothesis. I’m curious to know if a marketing campaign increased sales, and my boss wants to know how the impact differs for men, women, old people, young people, people in Idaho, people in Florida, … etc. See the danger now? If I ask 20 questions, there’s a good chance at least one of the answers is wrong (there’s a quick calculation after this list). And if that means your company starts marketing like crazy to teenagers in Idaho, that could be an expensive mistake!
- While your slicing and dicing isn’t a machine-learning model, you can overfit your analysis by asking too many questions. Just as machine-learning engineers have ways to avoid overfitting models, analysts need ways to avoid drawing overfit conclusions from a finite dataset.
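To put a number on that danger: under the simplifying assumption that the 20 cuts are independent and the null is actually true for all of them, the chance of at least one false positive is 1 - 0.95^20 ≈ 64%. A quick Python check (the numbers here are purely illustrative):

```python
import numpy as np

alpha = 0.05      # false-positive rate for a single test
n_questions = 20  # number of cuts the boss asks for

# Analytical: chance of at least one false positive across 20 independent tests
fwer = 1 - (1 - alpha) ** n_questions
print(f"Familywise error rate: {fwer:.0%}")  # ~64%

# Simulation: 100,000 analysts each run 20 tests on pure noise
rng = np.random.default_rng(0)
false_positives = rng.random((100_000, n_questions)) < alpha
print(f"Simulated: {false_positives.any(axis=1).mean():.0%}")  # ~64%
```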
Call before you dig: 1-BON-FER-RONI
So what is an analyst to do? There are several standard remedies, most of which work by making it harder to reject a null hypothesis.
- Adjust the p-value threshold required for “statistical significance,” dividing it by the number of tests (Bonferroni correction).
- Rank the p-values and use that ranking to decide which results still count as significant, controlling the false discovery rate (Benjamini-Hochberg). A minimal sketch of both this and Bonferroni follows this list.
- Instead of taking the experiment results at face value, use them to update some Bayesian prior representing your current-best view of the world (Bayesian Model Averaging). You can use this to combine results from several tests, when appropriate.
- Bootstrapping — sample from the experimental data with replacement, compute your test statistic, repeat a zillion times, and then consider a full distribution of test statistics. Bootstrapping does not immediately solve your multiple comparisons problem, but knowing the variance of your test statistics can help you be a more critical consumer of p-values.
- Dynamic stopping rules. List out your hypotheses. As results come in, stop testing each hypothesis as soon as the evidence is clear but continue to test other hypotheses with additional data. Eventually, you run out of data or you run out of hypotheses. Why do we not revisit our prior hypotheses with the additional data? Because we’d be right back in multiple comparisons hell. The sequential nature of the exercise ties our hands to the mast so we don’t go swimming after sirens.
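Here’s the promised sketch of the first two adjustments using statsmodels; the p-values below are made up purely for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Raw p-values from, say, 8 different cuts of an experiment (made-up numbers)
pvals = np.array([0.001, 0.008, 0.020, 0.041, 0.049, 0.120, 0.400, 0.900])

# Bonferroni: effectively requires p < alpha / n_tests
reject_bonf, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate using ranked p-values
reject_bh, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for p, rb, rh in zip(pvals, reject_bonf, reject_bh):
    print(f"p = {p:.3f}   Bonferroni reject: {rb}   BH reject: {rh}")
```

Notice how Bonferroni rejects fewer hypotheses than Benjamini-Hochberg. That’s the point: the more questions you ask, the more skeptical you have to be of each individual answer.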
Back to the bonuses
We’re a curious bunch and so considered looking at several cuts of our experiment data: location, age, qualification, and more. Wouldn’t it be amazing if bonuses were ineffective for nurses… except for nurses younger than 30 years old living in Rhode Island with active Netflix accounts? Many marketing teams are ready to jump at exactly these kinds of “patterns” and I’m kindly going to ask you to show me your Bonferroni receipts.
After taking multiple comparisons into account, we found one dimension that was truly meaningful — whether the applicant was a nurse or a nursing assistant (CNA).
Without a bonus, nurses and nursing assistants went on to complete a shift at about the same rate. Nursing assistants were more likely to start working with a bonus of any amount. Nurses, on the other hand, were less likely to start working! (And yes these are all stat sig different from no bonus, for all you skeptics out there).
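For the skeptics, the comparison behind that claim is a two-proportion test at each bonus level versus the $0 group. Here’s a sketch with statsmodels; the counts below are hypothetical placeholders, not our real numbers.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: first-shift completions and applicants for the $0 arm
# vs. the $25 arm within one segment (say, CNAs). Not IntelyCare's real data.
completions = [480, 560]
applicants = [4000, 4000]

z_stat, p_value = proportions_ztest(count=completions, nobs=applicants)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

In practice you’d run a comparison like this per segment and per bonus level, which is exactly why the multiple comparisons machinery above matters.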
For any readers from outside healthcare, it’s important to know that nurses can easily earn between 2X and 4X the hourly rate of a nursing assistant. These populations differ in so many ways, which is why we put this dimension at the top of our sequential-testing list.
Years later, I still scratch my head at these results and wonder why completion rates decreased among nurses when we offered more money. Maybe no gift is better than a cheap gift? Hospitals at the time were offering signing bonuses as high as $25,000 for full-time work, so an extra $100 may have looked trivial by comparison.
What’s the optimal bonus amount?
After running this test, we did away with bonuses for nurses. Maybe some bonus greater than $100 would have improved our funnel metrics? That’s another test for another day.
For CNAs, note the large difference between the no bonus group and the $25 bonus group (nearly 5 full percentage points). From there, each additional $25 has a much smaller effect, and somewhere between $50 and $100 the marginal benefit from bigger bonuses reaches zero. We ended up going with $25 to give us room to bump things up at specific times and places as needed.
Remember, the initial proposal was to give $100 to everyone. Had we done that, we would have spent an extra $1M in bonuses in one year and likely recruited about the same number of people.
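For a sense of where a number like that comes from, here’s a back-of-the-envelope version of the math. The volumes are hypothetical placeholders (I’m not sharing our real counts); only the bonus amounts come from the experiment.

```python
# Extra spend from the flat $100 proposal versus what we actually did
# ($0 for nurses, $25 for CNAs). Volumes below are made up for illustration.
nurse_completions = 5_000   # hypothetical annual first-shift completions
cna_completions = 8_000     # hypothetical annual first-shift completions

extra_spend = nurse_completions * (100 - 0) + cna_completions * (100 - 25)
print(f"Extra annual spend: ${extra_spend:,}")  # $1,100,000 with these made-up volumes
```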
Key takeaways for those who made it this far
- You don’t need fancy machinery to run an impactful test. For this test, all we needed was (1) random assignment and (2) a way to send 4 variations of an email. We’re lucky to have a nice data warehouse and a CRM, but we honestly could have run this off spreadsheets.
- We have a strong preference for nice, round numbers in our promotions, but we found a $25 bonus was basically as effective as a $100 bonus. We’ve run other tests showing that bonuses are more about timing and presentation than about the sheer dollar amount.
- It’s tempting to cut a dataset 900 different ways and then chase the best-looking cuts with promotions or other interventions. Explore away, but watch out for the multiple comparisons problem.