Can Privacy Preservation Transform Marketing Analytics?
With all the buzz around new data privacy regulations, Google’s decision to eliminate third-party cookies, and Apple’s recent move to enforce user consent for app tracking, privacy preservation has become part of every marketing conversation. The newly announced delay in the launch of Google’s Privacy Sandbox is doing nothing to kill that buzz. In fact, it’s giving brands time to cast a wider net and find a privacy-preserving solution that’s a good fit for them in the long run.
But while many have heard of privacy preservation and understand its general principles, deploying it in real life remains highly technical. As a result, very few marketing professionals to date have actually put privacy-preserving techniques to the test. At Neustar, we’ve been actively working over the years to bridge the gap between theory and practice, and we've just released a new study, Privatized Machine Learning for Marketing Analytics, to help marketers and their analytics teams measure the benefits of privacy preservation without sugarcoating the tradeoffs involved.
In this new study published this week in I-COM's Frontiers of Marketing Data Science Journal, we examined two leading privacy-preserving techniques—k-anonymization and model calibration—to see how they might affect the performance of standard multi-touch attribution (MTA) processes, one of the most important marketing use cases for granular customer data today.
Is attribution way off-base when privacy preservation is turned on? Do we need a supercomputer to make it work? We found that robust data protection can be achieved without losing much in attribution effectiveness or bringing computing power to its knees. Each brand is different, of course, but our results are encouraging and should open the door to more practical implementations of privacy preservation in the future.
Before diving into our results and recommendations, let’s first review how we got here.
With great power comes great responsibility
Customer data has never been so plentiful: We sign up for loyalty programs, shop online, and get products delivered to our door. We carry our smartphones everywhere, use social media, and log in to watch our favorite shows. Marketers are quick to collect that data, directly or indirectly, and use it to personalize our interactions with their brand and optimize their marketing campaigns.
Most of us appreciate a better customer experience and improved targeting in the advertising we receive. But "with great power comes great responsibility," as Peter Parker's uncle might say. You don't have to be a Spider-Man fan to see that personal data can be abused if it falls into the wrong hands—or into Doc Ock's tentacles. It's no big surprise, then, that jurisdictions around the world are adopting strict data privacy regulations. Some are already in effect, like GDPR in Europe and CCPA in California, and are having a chilling effect on many data analytics initiatives.
What is sensitive data?
It is only natural that many of us have fears about our sensitive data. We don't always want people to know what websites we visit, how much we drink, or whether we're regular cannabis users. But a bad actor might be able to use other, seemingly innocuous characteristics to reconstruct our identity.
In a seminal study twenty years ago, Latanya Sweeney demonstrated that 87% of the U.S. population could be uniquely identified with nothing more than their gender, zip code, and date of birth. Those are not personal identifiers (the way an SSN, email address, or username might be) but quasi-identifiers that can be combined to the same effect. If quasi-identifiers can be used to access sensitive data, they’re just as sensitive as the data they link to.
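To make the idea concrete, here is a minimal sketch (with made-up records, not data from the study) of how an analyst might check what share of a dataset is pinned down uniquely by quasi-identifiers alone:

```python
from collections import Counter

# Hypothetical records reduced to three quasi-identifiers:
# (gender, zip code, date of birth). No real individuals.
records = [
    ("F", "30301", "1994-03-14"),
    ("M", "30301", "1994-03-14"),
    ("F", "30301", "1994-03-14"),
    ("M", "10001", "1987-11-02"),
    ("F", "60601", "1990-07-21"),
]

counts = Counter(records)

# A record is re-identifiable if no one else shares its combination.
unique = [r for r in records if counts[r] == 1]
share_unique = len(unique) / len(records)

print(f"{share_unique:.0%} of records are uniquely identified "
      f"by (gender, zip, birth date)")  # → 60% in this toy example
```

Even in this tiny example, most records are singled out by three seemingly innocuous attributes—exactly the effect Sweeney measured at population scale.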
Using k-anonymity to roll up input data
To help develop defenses against the problem of re-identification in sensitive datasets, Sweeney and Italian professor Pierangela Samarati introduced the concept of k-anonymity: a guarantee that even if an analyst (or a hacker) goes through all the possible combinations of quasi-identifiers present in the dataset, they will always find at least k customers with the same attributes.
For instance, if there are too few people of the same age (say 27 years of age) and zip code (say 30301) in the dataset, a data scientist might roll up everyone’s age to a five-year bucket (like 25-30), or their zip code to a higher order (like 3030*). There are many ways to k-anonymize datasets—it’s an active research field that continues to borrow from the latest advances in data mining—but every algorithm has the same objective: to roll up the attributes in the dataset only as far as necessary to meet the k threshold. No need to go as far as 20-30 for age if 25-30 is enough, or roll up zip codes to 303* if 3030* is enough.
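As a rough illustration of the rollup idea, here is a simplified sketch. Real k-anonymization algorithms generalize adaptively, only as far as needed to meet the threshold; this version uses fixed bucket sizes, and the function names and data are our own, not from the study:

```python
from collections import Counter

def generalize(age, zip_code, age_bucket=5, zip_digits=4):
    """Roll up one (age, zip) pair: bucket the age, truncate the zip."""
    low = (age // age_bucket) * age_bucket
    rolled_zip = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (f"{low}-{low + age_bucket - 1}", rolled_zip)

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(rows)
    return min(counts.values()) >= k

# Hypothetical raw records: (age, zip code).
raw = [(27, "30301"), (28, "30302"), (26, "30303"),
       (27, "30304"), (29, "30305"), (25, "30309")]

rolled = [generalize(a, z) for a, z in raw]
print(rolled[0])                      # → ('25-29', '3030*')
print(is_k_anonymous(rolled, k=5))    # → True
```

After the rollup, all six records collapse into a single ('25-29', '3030*') group, so any query against these quasi-identifiers returns at least five matching customers.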
Adding privatization to the model calibration process
Another way for data scientists to address privacy concerns is to leave the data unchanged and instead use a specific form of calibration to estimate the model (in our case, the MTA model) every step of the way. That’s what data scientists generally refer to when they talk about differential privacy—a concept introduced 15 years ago by Microsoft researchers in the U.S. and their academic colleagues in Israel.
Its exact implementation can be a bit arcane to the uninitiated, but the basic principle is straightforward: when a model is estimated against the dataset, it should be impossible to determine the contribution of any one single person to the model’s result. If we delete that person from the dataset, repeat the model estimation, and get a substantially different result, then a specific level of noise needs to be added to blur the difference: just enough noise to add data protection, but not so much that the result becomes useless.
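Our study applies this idea to model calibration, but the same principle is easiest to see on the classic building block of differential privacy: the Laplace mechanism applied to a simple count query. The sketch below (with hypothetical function names and data, not the study's implementation) shows how the privacy parameter ε sets the noise level:

```python
import math
import random

def laplace_noise(scale):
    """Draw from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(values, predicate, epsilon):
    """Noisy count of matching records.

    Removing one person changes a count by at most 1 (sensitivity 1),
    so noise drawn at scale 1/epsilon masks any individual's presence.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(scale=1.0 / epsilon)

purchases = [0, 1, 1, 0, 1, 0, 0, 1]  # hypothetical conversion flags
noisy = private_count(purchases, lambda v: v == 1, epsilon=0.5)
print(noisy)  # true count is 4; the noisy answer varies around it
```

Note that the noise scale is 1/ε: the smaller the ε, the stronger the privacy guarantee and the noisier the answer—which is why the very low ε values in our results below are noteworthy.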
Using a simulated dataset
In our new study, we combined k-anonymization and private model calibration, and we tested their respective (and cumulative) efficiency using a simulated dataset where each individual was assigned a base propensity to make a purchase, exposed to advertising messages on two separate media channels, and conditioned to make a purchase when a certain utility threshold was reached.
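The actual data-generating process is detailed in the paper; the sketch below only illustrates the general shape of such a simulation, with made-up channel lifts, exposure rates, and purchase threshold:

```python
import random

random.seed(42)

# Assumed (illustrative) utility lift per ad exposure on each channel.
CHANNEL_LIFT = {"search": 0.15, "online_video": 0.10}
PURCHASE_THRESHOLD = 0.5

def simulate_individual():
    """One simulated customer: base propensity plus per-channel ad lift."""
    utility = random.uniform(0.0, 0.4)  # base propensity to purchase
    exposures = {ch: random.random() < 0.5 for ch in CHANNEL_LIFT}
    for ch, exposed in exposures.items():
        if exposed:
            utility += CHANNEL_LIFT[ch]
    return {"exposures": exposures,
            "purchase": utility >= PURCHASE_THRESHOLD}

population = [simulate_individual() for _ in range(100_000)]
rate = sum(p["purchase"] for p in population) / len(population)
print(f"overall conversion rate: {rate:.1%}")
```

Because the threshold, base propensities, and channel lifts are all known by construction, the true incremental contribution of each channel is known too—which is what lets us measure attribution bias exactly.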
Why use a simulated dataset instead of real-life data? Most real-life datasets are incomplete and data scientists spend nearly half of their time cleaning up and normalizing data. This has implications for machine learning, where part of the data is used to train the model and the rest to test it. We wanted to make sure that those upfront data quality considerations didn’t interfere with our results.
We also wanted an experimental framework where we could test different methodologies and their combinations, as well as benchmark their predictions against a known, labeled outcome. With simulated data, we know the real answers, so we can benchmark the accuracy of different methods against the actual data generating process. And we can examine the impact of counterfactual scenarios too, allowing us to look at what would have happened under different conditions.
Results and recommendations
So, how did our analysis pan out?
With k-anonymization, the value of k can have a significant bearing on the results. Large k values are good for privacy preservation (lots of similar people) but the information loss can be substantial (too much generalization). Small k values help retain more value in the data, but privacy guarantees aren’t as strong, and the computational resources required to k-anonymize the dataset can be daunting.
In the case of MTA, we found that relatively large k values don’t lead to major bias in the attribution results. In a dataset with a million records, a k value of 10,000 (that is, 1% of total records) offers robust privacy preservation and comes with an attribution bias of only 1%. In other words, the credit assigned to each channel (like search, or online video) deviates by only about 1% from its true value. Working with relatively small k values requires more horsepower to assemble the model (a 20X increase in computational time), but it’s still very feasible.
On the model calibration front, we found that the noise injected into the process added variance to the results, but no significant bias to attribution even at very low values for ε (0.02), and no substantial computational penalty either (a very manageable 2-3X increase in computational time).
Towards a new integrated view of privacy preservation
What does that mean for brands and publishers looking to invest in privacy preservation?
Our takeaway from this research is that data protection and utility aren’t mutually exclusive. For MTA, the attribution error associated with data protection is minimal, and the computational penalty isn't prohibitive.
Ultimately, the choice of which privacy preservation method to employ—or combination of methods—lies in the balance between the measurement accuracy considered acceptable, the computational costs involved in the process, and the level of privacy protection deemed necessary for the system. Every dataset is different, and every brand works with a different mix of channels. But the bottom line is that there's no technical reason for brands to stay on the sidelines of privacy preservation.
Interested in learning more about our analysis? Please read our research paper: Privatized Machine Learning for Marketing Analytics in the latest edition of the Frontiers of Marketing Data Science Journal. Ready to jumpstart your privacy preservation journey? Contact us and we’ll help you get started.