Is Deterministic Data Facing an Identity Crisis?
Neustar’s Michael Schoen, VP of Product Management for Marketing Solutions, was interviewed by Ad Tech expert Lauren Fisher and Digital Advertising specialist Nicole Perrin for an episode of Behind the Numbers, an eMarketer podcast. They discussed the role that deterministic data is playing in helping companies understand customer identities and why deterministic data isn't the be-all and end-all solution for understanding who a consumer is, or isn't. In this transcript of the podcast, he also explains why probabilistic data gets a bad rap as “guessing” but should not be discounted, because it’s sometimes more accurate than deterministic data.
Q: How important is identity to marketers and why is it important that they understand which devices belong to which individuals or households?
Michael Schoen: Identity has become a critical topic for marketers today. As marketing has continued to advance, marketers are very focused on understanding who they are reaching with their media and what impact that has on their business results. And that’s impossible to do if you’re not able to recognize the consumers that you’re actually reaching. That’s important across all devices. As media has become fragmented across so many devices and so many channels, it’s become an increasing challenge for marketers. And what’s really critical is for them to understand not just how multiple devices belong to a consumer, but who that consumer is from an offline perspective.
Q: Cross-device is important but that offline piece is too.
MS: We’re pretty formal about the difference between cross-device and identity. Cross-device identification is an important part of identity, but it’s not the whole thing. From our perspective, identity starts with understanding the consumer from an offline perspective. We speak a lot about what we call fractional identifiers. They would include my name, my address, my multiple phone numbers, my multiple email addresses. Each of those is an important part of my identity, and only when you have a complete understanding of who I am from an offline perspective, can you begin to connect that to my multiple devices.
Q: And in that example, all of those fractional identifiers are deterministic, right?
MS: In the parlance of cross-device identity, deterministic and probabilistic are typically used to refer to the linkages between offline and online. But if you consider deterministic to be explicit data, you’re correct, all the offline data would also be deterministic or also explicit. What’s interesting about that is if you think of your offline identity data, it comes in a variety of forms: I receive mail that is addressed to Michael Schoen or Michael A. Schoen, but also Michel Schoen or Michaela Schoen … all the various misspellings that various marketers have. We actually capture all of those as part of the fractional identity, because it’s important to understand how marketers understand me so we can piece together all the different linkages.
Q: Please define probabilistic and deterministic.
MS: I don’t particularly like the terms probabilistic and deterministic, because they seem to imply a value judgment, where deterministic sounds like it’s a gold standard and probabilistic sounds like you’re guessing, and that’s not really true. I prefer the terms explicit and implicit. When you’re talking about deterministic or explicit, you’re talking about typically an email address that is associated with a cookie or a mobile ad ID. And when we’re referring to implicit data, it’s all of the signals that are associated with how consumers are observed on the web and through various interactions that allow you to draw conclusions. So it might be the fact that you’ve seen a consumer on a device at a particular IP address at a particular point in time. And when you have enough of that signal, it gives you a lot of data to draw really firm conclusions.
Q: In the industry, there have been a lot of assumptions that probabilistic is guessing. But there are probably cases where probabilistic is very accurate and deterministic is inaccurate.
MS: Yes, that’s important to understand. Deterministic seems like it’s known, but really, it’s just explicit, and that doesn’t mean it’s always true. And in many cases, it’s simply that truth can be ambiguous. If you think about known interactions that may happen on my devices, my wife may log in to my laptop to check her email, or when I’m traveling, one of my co-workers might log in to my laptop to check on something, and that is a deterministic or known signal that may be observed. It doesn’t mean that my device belongs to my colleague or my wife, and you need to have the probabilistic data to help disambiguate between those signals.
In the market for linkages, there is an economy in which players like Neustar or LiveRamp are acquiring these deterministic linkages from publishers, and whenever there is an economic incentive, there is always some form of fraud. We see that too: because we are buying linkages from publishers, there’s an incentive for those publishers to create additional linkages to drive incremental income.
Very often, it’s not fraud that’s directly perpetrated by our partners or LiveRamp’s partners, but there ends up being a downstream economy. A publisher that we may be working with signs on an affiliate network to generate additional linkages. Somewhere down the line you end up with some identity farm in India or China where folks are logging in with tons of email addresses. Some of that fraud is really easy to detect … when we see hundreds or thousands of email addresses associated with a single device or single IP address, we know that that’s not true. But some of those linkages are harder to ferret out, so we use a very discrete quality mechanism that allows us to validate the effectiveness and quality of the publishers we work with.
In the market today, we discard about 50% of the deterministic linkages that we observe. So even in this market for known linkages, we end up using only about half of it, because the rest of it isn’t high enough quality.
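The "obvious fraud" check described above, where far too many email addresses map to a single device, can be sketched in a few lines of Python. This is only an illustration: the `(email, device_id)` tuple layout, the function names, and the default threshold of 100 are assumptions made for the example, not Neustar's actual data model or quality mechanism.

```python
from collections import defaultdict


def flag_suspicious_devices(linkages, max_emails_per_device=100):
    """Return the device IDs linked to more distinct emails than the threshold.

    `linkages` is an iterable of (email, device_id) pairs. The threshold is a
    hypothetical cutoff chosen for illustration.
    """
    emails_by_device = defaultdict(set)
    for email, device_id in linkages:
        emails_by_device[device_id].add(email)
    return {
        device_id
        for device_id, emails in emails_by_device.items()
        if len(emails) > max_emails_per_device
    }


def drop_suspicious_linkages(linkages, max_emails_per_device=100):
    """Keep only the linkages whose device was not flagged as suspicious."""
    linkages = list(linkages)
    flagged = flag_suspicious_devices(linkages, max_emails_per_device)
    return [(email, device) for email, device in linkages if device not in flagged]
```

The same grouping could be keyed on IP address instead of device ID. A check like this only catches the easy cases; the harder-to-detect linkages mentioned above would need more than a simple count.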
Q: When you discard 50% of the linkages, is that because they are all coming from fraud or do you think that some of it just isn’t as high quality?
MS: It’s a combination of the two. There’s some very obvious fraud that we throw away, maybe 20%-30% falls into that bucket, where even the most naïve check would say this is not believable. But some of it does come from the ambiguous linkages, where it’s not fraud, it’s just a linkage that you need to discard when you are trying to disambiguate.
As an example, I referred to my wife logging in to my laptop. The way that the probabilistic, or implicit, signal allows you to differentiate that is simply by looking at the IP address activity associated with these devices over time. My phone and my laptop are almost always observed together. And this week alone they may have been observed by Neustar at home, in San Francisco, at the San Francisco airport, in L.A., at Newark airport, and in your offices here today. When you see these two devices together at so many IP addresses over enough time, your confidence that these two devices really belong to the same individual is quite high. And the belief that this device belongs to my wife simply doesn’t hold up.
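A rough sketch of that co-occurrence idea in Python follows. It counts the distinct IP addresses at which two devices were both observed on the same day. The `(device_id, ip, day)` observation format, the function name, and the use of a whole day as the time window are all assumptions made for illustration; a production system would presumably weigh many more signals.

```python
from collections import defaultdict


def shared_ip_count(observations, device_a, device_b):
    """Count distinct IPs where both devices were seen on the same day.

    `observations` is an iterable of (device_id, ip, day) tuples. This layout
    is a hypothetical one chosen for the example.
    """
    seen = defaultdict(set)  # device_id -> set of (ip, day) sightings
    for device_id, ip, day in observations:
        seen[device_id].add((ip, day))
    together = seen[device_a] & seen[device_b]
    return len({ip for ip, _day in together})


observations = [
    ("laptop", "ip_home", "mon"), ("phone", "ip_home", "mon"),
    ("wife_phone", "ip_home", "mon"),
    ("laptop", "ip_airport", "tue"), ("phone", "ip_airport", "tue"),
    ("laptop", "ip_office", "wed"), ("phone", "ip_office", "wed"),
    ("wife_phone", "ip_home", "tue"),
]

print(shared_ip_count(observations, "laptop", "phone"))       # 3
print(shared_ip_count(observations, "laptop", "wife_phone"))  # 1
```

The laptop and phone travel together across three locations, while the laptop and the second phone only ever overlap at home, which is the kind of contrast that lets the implicit signal outweigh an occasional login.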
Q: What do you do when you have a truly shared device … an iPad, a TV, a gaming console? Do you just throw it out?
MS: It’s a really interesting topic. From Neustar’s perspective, the credibility of our identity is really important. We differentiate between household linkages and individual linkages. When we have confidence that a device belongs to a household, we’ll associate that device with the household. We will only then further associate that device with a specific individual when we have enough confidence and enough signal to indicate that. Again, in my household, there’s a home computer that both my wife and I use and there is a deterministic signal from both of us. There are not enough probabilistic signals that this device belongs only to one of us, because that device never travels, and so that device belongs to the household and neither to me nor to my wife.
When we work with marketers, we help them to understand the difference between household and individual linkages and the increased scale that is provided if you are to leverage the household linkages as an alternative. It is always the case that the sum of individual-level linkages is less than the sum of household linkages. So for my household, I may have two devices connected to me, three to my wife, but nine devices connected to the household. If you’re looking to influence the purchase decisions that are made at the household level, marketing to the household is the right decision.
Q: So do you show that? That these individuals belong to this household?
MS: Yes, and it’s a key component of why cross-device is not identity. When you have true offline identity, you’re able to associate individuals to a household.
There is some growing understanding in the market about the value of this implicit signal data on top of explicit data. Most often today, it’s still understood as being a tradeoff on scale alone: folks still generally believe that deterministic is the most accurate data source, but that it’s limited in scale because the volume of those direct known linkages between email addresses and devices is limited, and that probabilistic signal can be used to achieve greater scale. Our perspective is that including probabilistic signals not only increases scale, but increases accuracy, for the reasons we’ve discussed.
What some folks are starting to do, when they recognize the value of combining both, is acquire some source of a deterministic graph and then look to separately combine that with a probabilistic graph. That has its own issues as well, because the combination of two graphs is not as strong, from either a scale or an accuracy perspective, as creating a single graph that leverages all of the signals.
As an example, we may see my work laptop always located at my office (if I didn’t travel as much as I do), my home computer always located at my home, and a phone that travels between home and work. If you were to build a deterministic graph, those devices would all be separate. If you then tried to combine it with a probabilistic graph, you’d still end up with three separate devices. Only if you combine the probabilistic signal that shows the phone traveling between both locations with the deterministic signal that shows I’m logged in to both devices are you able to create a single cluster that combines them.
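One way to read this example is that each kind of signal, taken alone, falls short of the confidence needed to link two devices, while the pooled evidence clears the bar. The Python sketch below illustrates that reading. The additive scores, the 0.8 threshold, and all function and device names are assumptions invented for the example and are not a description of how any vendor builds its graph.

```python
from collections import defaultdict


def confident_edges(scores, threshold):
    """Return the device pairs whose evidence score meets the threshold."""
    return [pair for pair, score in scores.items() if score >= threshold]


def add_scores(*score_maps):
    """Pool evidence by summing scores per device pair (an assumed model)."""
    total = defaultdict(float)
    for score_map in score_maps:
        for pair, score in score_map.items():
            total[pair] += score
    return dict(total)


def clusters(devices, edges):
    """Group devices into connected components with union-find."""
    parent = {device: device for device in devices}

    def find(device):
        while parent[device] != device:
            parent[device] = parent[parent[device]]  # path halving
            device = parent[device]
        return device

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for device in devices:
        groups[find(device)].add(device)
    return sorted(sorted(group) for group in groups.values())


devices = ["home_pc", "phone", "work_laptop"]
THRESHOLD = 0.8  # hypothetical confidence needed to link two devices
explicit = {("home_pc", "phone"): 0.5, ("phone", "work_laptop"): 0.5}
implicit = {("home_pc", "phone"): 0.5, ("phone", "work_laptop"): 0.5}

# Two graphs built separately, then merged: no edge survives either threshold.
separate = confident_edges(explicit, THRESHOLD) + confident_edges(implicit, THRESHOLD)
print(clusters(devices, separate))  # [['home_pc'], ['phone'], ['work_laptop']]

# One graph built from all signals: the pooled evidence links all three devices.
combined = confident_edges(add_scores(explicit, implicit), THRESHOLD)
print(clusters(devices, combined))  # [['home_pc', 'phone', 'work_laptop']]
```

The design point is the order of operations: thresholding each source before merging throws away partial evidence that would have been decisive once pooled.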
Q: You mentioned that this is really the foundation of understanding customer value and the overall effects of your marketing on the organization. Where are companies on their journey today in terms of using identity to understand the results of their marketing efforts?
MS: They are clearly on a journey. There are some use-cases where they clearly understand that identity is the key. For example, when it comes to activities like onboarding, it’s very obvious that identity is the key to onboarding and marketers are really well attuned to that. When it comes to the marketing analytics space, understanding the effectiveness of marketing, most marketers are still very focused on the media exposure data and the underlying algorithms to determine things like multi-touch attribution, and don’t understand as clearly the importance of identity in that process.
It’s something that we work closely with marketers to understand. As the leading provider of identity resolution, we understand that marketing analytics is very much a “garbage in, garbage out” situation. And if you’re not able to recognize media exposures across multiple devices connected to the offline individual who may take a marketing action, or an action influenced by marketing, and do that accurately, it doesn’t matter how good your algorithms are. You’re going to end up with a bad result.
Q: To step back in the marketing cycle, let’s talk about segmentation, which is foundational; it comes before the measurement analytics piece. How does identity fit into segmentation, and how reliable is it?
MS: We take a flexible approach to segmentation, because a lot of it depends on how well a marketer understands their consumer from an offline perspective. We see a lot of diversity in terms of how much of their customer base they actually have data on. Many marketers have a loyalty program and some significant portion, but generally a minority, of their customers may be members of that loyalty program. So they may have a lot of data about 20%-30% of their customers and then very little data about the rest of those customers. We’ll take their CRM data, connect it through identity, and then add additional attributes to help them better enrich their understanding of that customer.
Q: How difficult is it to pair that identity data with their CRM? In most cases, are you seeing that happen?
MS: We are. We work with marketers in the anonymous unidentified space, particularly around marketing measurement and media effectiveness. But we also work with marketers in the CRM space, to help them better understand and correct their offline data. Many consumers have a single point of interaction with a brand in which they provide known identity information. That identity information ages pretty quickly. We’ve found within two years, 60% of customers have had some change in their offline identity – they’ve changed a last name or they’ve moved or a phone number has changed, or they’ve changed an email address. And if marketers are blind to that, they are unable to market effectively.
Q: Where is this headed in the next 24 months?
MS: There is going to be a continued evolution in marketers’ understanding of this space and we’re excited about helping them in that education process. There’s also going to be increased awareness of the privacy implications. Already we’re seeing the changes related to GDPR and the impact that has had on the ability of marketers, through various walled gardens, to understand the consumers that they are reaching. There is also the legislation in California that is changing the rules here in the U.S. It’s something we’re working closely on with our clients so they understand the implications and provide the necessary controls to their consumers. Because, ultimately, marketers want to provide the best, most personalized experiences they can to their consumers, but do it in a way that’s truly sensitive to consumer privacy.