We all know that crude oil varies widely in quality based on its density and sulfur content:
[Chart: crude oil types by density and sulfur content. Source: U.S. Energy Information Administration, based on Energy Intelligence Group, International Crude Oil Market Handbook.]
If data is indeed the new oil, it stands to reason that data varies in quality as well. So, what are the equivalent axes that would determine high and low data quality? Unfortunately, there are no simple answers. To give you an idea as to why, here are just three axes that we regularly encounter when measuring TV/video content and advertising:
- Survey-based vs. passive behavioral data. Back in the day, when detailed consumer information was hard to come by, surveys were a fantastic way to gather information on consumer tastes, interests and preferences. But these days, embedded meters track consumer behavior passively and are in general more dependable than survey results. For example, a meter recording which TV programs you watched will be more accurate than asking, "What program did you watch 10 days ago?" As a result, today's best data practitioners generally prefer passive measures when they are available, since self-reported data are more prone to recall error and other biases.
- Panel vs. census. Again, back in the day it would have been cost-prohibitive to gather information on every American, so panels were created to represent the entire U.S. population, and statisticians determined proper methods to "weight" and "balance" panel segments so they could be projected to the overall population (a minimal weighting sketch appears after this list). In today's Big Data world, panels are relatively more expensive, and why query a small subset of the population when you can have cost-effective information on everyone? For example, many media researchers prefer the vast data from set-top boxes or other census-level TV-viewing sources over smaller TV-viewing panels, although debate continues here because panels typically offer more granular persons-based data. Nonetheless, the train has left the station, and today's best practitioners prefer census data even with its shortcomings.
- Deterministic vs. probabilistic. "Deterministic" means that the specific ID being targeted is known factually to be a certain type of buyer, whether from the advertiser's own records, frequent-shopper card data, etc. "Probabilistic" means lookalike modeling: it is not known whether the ID is a buyer, but the ID resembles buyers demographically, geographically or in terms of the websites it visits. Probabilistic represents the great majority of all programmatic digital buying ($32B/year in the U.S. and still growing). Best practitioners greatly prefer deterministic data and are willing to pay a higher data CPM for it, because probabilistic data is merely lookalike modeling, which, with demographics/geographics as predictors, has been shown to be only 6% better than random (Simmons, "Empowering ROI With Psychographics," ARF, June 2017).
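To make the "weight" and "balance" idea in the panel-vs.-census bullet concrete, here is a minimal sketch of one common approach, post-stratification weighting. The demographic cells and all counts below are hypothetical, invented purely for illustration:

```python
# Minimal post-stratification sketch: weight panel cells so their
# shares match known population shares. All figures are hypothetical.

# Hypothetical population shares by gender/age cell (e.g., from census data)
population_share = {
    ("F", "18-34"): 0.15, ("F", "35-54"): 0.17, ("F", "55+"): 0.19,
    ("M", "18-34"): 0.15, ("M", "35-54"): 0.16, ("M", "55+"): 0.18,
}

# Hypothetical panel counts for the same cells (young men over-recruited)
panel_count = {
    ("F", "18-34"): 120, ("F", "35-54"): 140, ("F", "55+"): 150,
    ("M", "18-34"): 260, ("M", "35-54"): 170, ("M", "55+"): 160,
}

panel_total = sum(panel_count.values())

# Each panelist in a cell gets weight = population share / panel share,
# so the weighted cell shares reproduce the population distribution.
weights = {
    cell: population_share[cell] / (panel_count[cell] / panel_total)
    for cell in panel_count
}

for cell, w in sorted(weights.items()):
    print(f"{cell}: weight = {w:.2f}")
```

Each panelist's viewing is then multiplied by his or her cell weight (plus an overall population-to-panel scale factor) when projecting panel behavior to the full population.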
As is clear from these three examples alone, there is no absolute when it comes to assessing data quality. The good news is that numerous industry groups are embracing this challenge and are in the process of proposing data quality guidelines. Here are some of the most notable efforts now in progress:
- The Trust and Transparency Forum, led by Andrew Susman and Mike Donahue. The agenda of this group's original meeting, held at the UN, was led by Jerry (Yoram) Wind, a professor of marketing at The Wharton School of the University of Pennsylvania.
- A consultancy called JOLT! is seeking funding from the ARF, the ANA, CIMM and the CRE to create data labels and recommend an approach for compliance and governance. The JOLT! initiative is being managed by Jim Spaeth, Alice Sylvester and Gerard Broussard. In fact, Gerard Broussard arguably started the current industry focus on this area when he wrote "Enriching Media Data: Quality is Key Requisite for Maximizing ROI" for CIMM in June 2015. Once the industry data label is agreed upon, JOLT! will look to uncover the target-segment offerings that most closely represent advertiser clients' intended consumers.
- The Trustworthy Accountability Group (TAG) was created with a focus on four core areas: eliminating fraudulent digital advertising traffic, combating malware, fighting ad-supported internet piracy to promote brand integrity, and promoting brand safety through greater transparency. AdAge recently reported that 49 companies have received TAG's "Certified Against Fraud" seal.
- The Alliance for Audited Media (AAM) is launching AAM Quality Certification, a program developed to minimize digital ad fraud. The program addresses fraud by differentiating good publishers, ensuring that they are doing everything they can to serve marketers' ads only to humans. The initiative has been developed with close collaboration and input from Dr. Augustine Fou, one of the industry's most well-regarded cyberfraud authorities.
- RMT, using its DriverTags™ technology, is working with MVPDs to derive psychological data from set-top box data, which multiplies the predictive power of probabilistic targets. (Full disclosure: Bill Harvey is a principal in RMT.)
- Lucid has devised a scalable method of reliably sampling any defined audience to determine the "density" of a given attribute. For example, what portion of an "auto intenders" segment are actually auto intenders?
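As a rough illustration of what such a density measurement involves, here is a minimal sketch under assumed numbers (this is not Lucid's actual methodology): sample IDs from the segment, verify the attribute for each, and report the observed proportion with a confidence interval.

```python
import math

# Hypothetical: of 400 IDs sampled from an "auto intenders" segment,
# 212 were verified as actual auto intenders.
n, hits = 400, 212
p_hat = hits / n  # observed attribute density

# 95% Wilson score interval for the true density
z = 1.96
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))

print(f"density ≈ {p_hat:.1%}, 95% CI [{center - margin:.1%}, {center + margin:.1%}]")
```

The point of reporting an interval rather than a single number is that two segments with the same headline density can differ greatly in how precisely that density is known, which is itself a dimension of data quality.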