Differential Privacy: A Marketer’s Guide
“Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population.”
Cynthia Dwork and Aaron Roth, The Algorithmic Foundations of Differential Privacy
What is differential privacy?
There are two ways to define differential privacy: as a technique, and as a discipline.
As a technique, it describes a sophisticated method for injecting small amounts of random noise into statistical algorithms, so that analysts may perform useful aggregate analyses on a sensitive dataset, while obscuring the effect of every individual data subject within that dataset.
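To make the noise-injection idea concrete, here is a minimal sketch in Python of the classic Laplace mechanism applied to a counting query (the toy dataset, field names, and epsilon value are illustrative, not a production mechanism):

```python
import random

def private_count(records, predicate, epsilon=0.5):
    """Count the records matching `predicate`, plus Laplace noise.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true count by at most 1), so Laplace(0, 1/epsilon)
    noise yields epsilon-differential privacy for this query.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two iid Exponential(rate=epsilon) draws
    # is distributed as Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Toy dataset: did each user click an ad?
users = [{"clicked": i % 3 == 0} for i in range(1000)]
noisy_clicks = private_count(users, lambda u: u["clicked"])
```

Each run returns a slightly different answer: the analyst still sees an accurate aggregate, but no single user’s presence or absence noticeably changes the result.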
As a discipline, it covers the above technique, but also other ‘privacy-preserving techniques’ that share the same practical objective without necessarily injecting random noise into the process. Good examples are k-anonymity, where individuals with identical attributes are grouped into micro-segments before any data release, and l-diversity, which additionally requires each micro-segment to contain a suitably diverse mix of sensitive attribute values. Advanced cryptography-based approaches are possible as well.
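The k-anonymity idea can be sketched in a few lines of Python. This is a toy version (the field names and the value of k are hypothetical, and real systems also generalize values, e.g. exact age into an age band, rather than only suppressing small groups):

```python
from collections import defaultdict

def k_anonymize(records, quasi_identifiers, k=5):
    """Group records by their quasi-identifier values and suppress
    any group with fewer than k members, so that every released
    record is indistinguishable from at least k-1 others."""
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].append(r)
    # Only release groups in which each person hides among k-1 others.
    return {key: grp for key, grp in groups.items() if len(grp) >= k}

people = [
    {"zip": "10001", "age_band": "30-39"},
    {"zip": "10001", "age_band": "30-39"},
    {"zip": "10001", "age_band": "30-39"},
    {"zip": "94105", "age_band": "40-49"},  # unique, so suppressed
]
released = k_anonymize(people, ["zip", "age_band"], k=3)
```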
At Neustar, when we talk about differential privacy, we’re casting a wide net and referring to the overall discipline.
Where does it come from?
Differential privacy emerged from the computer science and semantic security literature, and was first formalized in 2006 by Microsoft researchers Cynthia Dwork and Frank McSherry in the U.S., and academic colleagues Kobbi Nissim and Adam Smith in Israel.
If you’re interested in its origins and a non-technical discussion of its key principles, check out this excellent introduction by Salil Vadhan, Kobbi Nissim et al. for the Harvard Privacy Tools Project: Differential privacy: A primer for a non-technical audience.
Why is it called differential privacy?
The ‘difference’ in ‘differential privacy’ refers to the difference between two versions of a sensitive dataset: one that contains a specific user, and one without that user. A differentially private algorithm guarantees that an attacker looking at its output cannot reliably tell those two versions apart.
It also conveys the important notion that the ultimate objective of a differential privacy algorithm is to find the right balance between privacy loss and data accuracy.
If we wanted to absolutely guarantee everyone’s privacy in a sensitive dataset, we would simply replace their data with random observations, but the utility of that dataset would be zero. If, on the other hand, we wanted to guarantee the accuracy of aggregate insights from the dataset, we wouldn’t exclude anyone’s data from it, but then everyone’s privacy would be at risk.
It’s a sliding scale, and in practice, statisticians set a parameter on that scale (commonly called epsilon, or the ‘privacy budget’) for how much privacy loss they can afford in order to minimize the loss (the difference) in data accuracy.
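That sliding scale shows up directly in simulation. The sketch below (Python, with arbitrary example settings) draws Laplace noise at several privacy levels and shows that a smaller privacy-loss parameter means a noisier, less accurate answer:

```python
import random
import statistics

def laplace_noise(epsilon):
    # Laplace(0, 1/epsilon): the standard noise scale for a
    # sensitivity-1 query under epsilon-differential privacy.
    return random.expovariate(epsilon) - random.expovariate(epsilon)

# Average absolute error of a noisy count at different privacy settings:
# stronger privacy (smaller epsilon) means larger typical error.
for epsilon in (0.01, 0.1, 1.0):
    mean_error = statistics.mean(abs(laplace_noise(epsilon)) for _ in range(5000))
    print(f"epsilon={epsilon:<5} mean error ~ {mean_error:8.2f}")
```

The expected absolute error here is 1/epsilon, so a tenfold increase in privacy protection costs roughly a tenfold increase in error.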
Who uses differential privacy?
Differential privacy developed as a field of theoretical investigation, from computer science to mathematics, but it came out of the need to respond to the very real possibility of data attacks on sensitive datasets, so it was only natural that it would quickly transition to practical applications.
Today, there are large-scale deployments of differential privacy at the U.S. Census Bureau, Google, Apple, Uber, and Microsoft. It’s used to learn from large user datasets without exposing people’s private information, like internet whereabouts, social media usage, physical locations, commuting habits, financial transactions, or medical records. In short, any personal data that, if leaked, could harm the individual.
Why should marketers care?
Because increasingly, marketing is being driven by user-level data. The marketing benefits of user-level data are enormous, from targeting to measurement, from attribution to valuation, but consumers and lawmakers are rightly concerned about data leaks and the misuse of personal data. With growing privacy regulations around the world, marketers need a way to maximize the usefulness of their first-party data without jeopardizing the privacy of their customers. It’s exactly for this type of balancing act that differential privacy was developed.
Can’t marketers simply anonymize their user data?
No. For years, marketers and publishers thought they could address the problem by anonymizing their first-party data, but well-publicized incidents have shown that replacing personally identifiable information (PII) with anonymous identifiers is often inadequate and vulnerable to attacks.
What sorts of attacks?
Re-identification attacks on supposedly anonymized datasets are the ones that gather the most press. They typically use outside information to unveil someone’s identity in a sensitive dataset, and in some cases, like in a game of musical chairs, they can even jeopardize the privacy of people who are not in the dataset.
There are other types of attacks to consider. Database reconstruction attacks, for instance, use aggregate results from multiple queries and cleverly recombine them to reconstruct the sensitive data in a database. Another example is membership inference attacks, where attackers can determine whether someone is part of a dataset or a particularly sensitive subset of that dataset.
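The simplest form of reconstruction attack, a differencing attack, fits in a few lines. In this toy illustration (all names and figures are made up), an interface that only ever returns aggregates still leaks one person’s exact value:

```python
# Toy differencing attack: two accurate aggregate answers are
# recombined to reveal one person's private value.
salaries = {"alice": 120_000, "bob": 95_000, "carol": 105_000}

def sum_query(dataset, exclude=None):
    # An "aggregate-only" interface that still honors filters.
    return sum(v for name, v in dataset.items() if name != exclude)

total = sum_query(salaries)                        # query 1: everyone
total_minus_alice = sum_query(salaries, "alice")   # query 2: everyone but alice
recovered = total - total_minus_alice              # alice's exact salary
```

Differential privacy defeats this attack because each answer carries calibrated noise, so the difference between two queries no longer pinpoints any individual.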
Every brand’s reputation rides on how well it treats its customers. In a data-driven economy, that means that every brand’s reputation rides on how well it treats its customer data. The stakes are too high for marketers to ignore.
Can differential privacy be used for measurement and attribution?
Absolutely. In fact, that’s the primary use case for marketers. By adding noise to the analysis of datasets, or obfuscating individual-level data in some other way, marketers can still measure campaign performance at a very granular level, while creating plausible deniability of association with the original dataset for anyone included in the analysis.
For example, a digital platform might no longer tell us that a particular person was exposed exactly five times to a given creative, but rather somewhere in the four-to-six range. Or it might tell us that a person who’s been exposed five times has a 10% better chance of purchasing a product than someone who’s been exposed four times.
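That four-to-six style of reporting can be sketched as a noisy, bucketed frequency query. This is a toy mechanism with assumed parameter values; real platforms use calibrated, audited implementations:

```python
import random

def reported_exposure_range(true_exposures, epsilon=1.0, halfwidth=1):
    """Report an ad-exposure count as a range instead of an exact value:
    add Laplace(0, 1/epsilon) noise (as the difference of two
    exponential draws), round, then widen by +/- halfwidth."""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    center = max(0, round(true_exposures + noise))
    return (max(0, center - halfwidth), center + halfwidth)

# A user exposed 5 times might be reported as, e.g., the (4, 6) range.
low, high = reported_exposure_range(5)
```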
The bottom line is that those insights, even when they’re derived from micro-segment analyses, can be applied to user-level records, and therefore fed into multi-touch attribution (MTA) models and other performance analysis systems. When done right, the loss in accuracy is inconsequential.
Can it help with omnichannel measurement?
Yes. The main problem with omnichannel measurement today is that some platforms are open while others are closed. Closed digital platforms (e.g., walled gardens) require user authentication, and that direct relationship with their users has some privacy implications above and beyond those that already exist on the open web. For marketers, data from the closed platforms have traditionally been a bit of a black box, and difficult to reconcile with data from other channels.
But with differential privacy, those closed platforms can now aggregate insights (like impressions and clicks) at the micro-segment level and make it possible for outside players to assign those insights to individual users. It’s not user-level data, so privacy isn’t breached, but it’s granular enough to be meaningful for advertisers.
Can it be used for segmentation and targeting?
Absolutely. Browser companies are currently working on applying differential privacy principles to obfuscate user data inside a privacy sandbox at the browser level and make aggregate data available to marketers via direct API calls (combined with strong encryption). That data can be used to create a new suite of population segments and target individuals within those segments.
Neustar is already working with browsers, app developers, and operating systems to establish these direct API integrations and help marketers and publishers gain access to aggregate marketing intelligence without violating user privacy.
What use is it for publishers?
For publishers, differential privacy is the best way to balance the privacy needs of their audience with the addressability needs of their advertiser clients. Publishers are at the front lines of the privacy debate. Whether they’re on the open web, a closed platform, or a combination of the two, they need to be able to plug into the overall marketing ecosystem without skirting privacy expectations.
Differential privacy gives them the tools to package their first-party data in ways that bring new value to advertisers. It’s a must-have to safeguard CPMs at a time when third-party cookies and other personal identifiers are going away.
Is differential privacy really necessary if you capture consent?
Yes. Providing consent and opt-out mechanisms is not just a great idea; it’s now the law in many places around the world. But giving people an opportunity to opt in or out does little to protect their privacy if they’re not fully informed of the consequences of using or disclosing their personal information.
In a way, differential privacy can be seen as a mechanism to protect people’s privacy despite themselves. For instance, they may opt in to have their personal data collected in a clinical trial, or in a consumer survey, only to realize later on that their data is being used to deny them medical insurance or to bombard them with unwanted offers. Even more insidiously, they may opt out of data collection and later realize that their opt-out was interpreted as a decision to conceal compromising information about themselves.
By definition, differential privacy offers the same guarantees whether someone opted in or out. It can help marketers and media companies comply with privacy regulations even in cases where consent is misunderstood.
I heard differential privacy would mean "the end of research as we know it." Is that true?
Microsoft’s Frank McSherry has the best response to that pushback:
“I really like this last one. ‘Research, as we know it’ at the moment, seems to allow any random idiot to demand access to data, independent of the privacy cost to the participant. Instead, differential privacy, and techniques like it, align accuracy with privacy concerns, and challenge researchers to frame their questions in a way that best respects privacy. I’m ok with that change.”