Artie Bulgrin invented the idea of a calibration panel with Project Blueprint, which showed that media duplication patterns were far from random. As a proof of concept, the project used a calibration panel with a sample size of 3000 to ground a fusion of media data from other sources, so that the fused data would have the right duplication patterns.
Years later, in 2019, the World Federation of Advertisers (WFA), the Association of National Advertisers (ANA), and other advertiser associations in many countries announced their own blueprint for cross-media audience measurement, and it included a calibration panel. In this case, the duplication patterns were only a subset of all of the possible things that could be calibrated, that is, controlled for, including:
These details were not covered in the published documents.
Nielsen also uses calibration when combining panel and big data. In addition to all of the foregoing dimensions in which calibration is possible and relevant, Nielsen also corrects errors in big data which are caught by the meters on the same devices. Thus editing rules, logic checks, and adjustments based upon truth sets are all part of the raison d’être of calibration panels.
In the modern age of big data, which I had a little bit to do with starting, the core reason calibration panels make sense to almost everyone is that individual pools of convenience samples ought not to be slushed together naively. Rather, some ideal structuring framework should cause each pool to get its appropriate weight and to overlap with the others as in the real world.
The real questions are in the details. As always. What are the chosen control points for the calibration? You can’t calibrate everything simultaneously because the cells become too small for any affordable calibration panel size. Therefore you have to choose what to calibrate on. Here AI and ML (machine learning) both come in handy because ultimately they can optimize the choice of control points. AI and ML will greatly increase the number of options that can be pragmatically considered in terms of what to calibrate on.
In terms of sample size, you cannot really estimate that until you specify what you will be calibrating on.
The most advanced work that I know of in this field is by Pete Doe at Nielsen (for whom Bill Harvey Consulting is an advisor). In local markets, one calibration model being tested with clients calibrates each station by daypart and by 10 sex/age groups. This means that the station’s rating by daypart will be based on the panel, while the way the programs and quarter hours within that daypart vary in ratings will be based on the big data.
This model shows high plausibility in its data, increases report-to-report stability, and virtually eliminates cases of zero audience. However, it can reduce the high peaks in sports and news by as much as 5%, and sometimes more. These peaks are what one might term “advantageous sampling error”, and Nielsen is refining the data science to improve the transition from the old measurement to this new one.
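One way to picture the daypart calibration described above is as a simple ratio adjustment: the panel sets the level of the daypart, and the big data sets the shape within it. The following is a minimal sketch under that assumption, not Nielsen’s actual method; the function name `calibrate_daypart` and all of the ratings are hypothetical.

```python
def calibrate_daypart(big_data_qh, panel_daypart_rating):
    """Scale big-data quarter-hour ratings so their mean equals the panel's daypart rating.

    Assumption: a plain ratio adjustment. Production methods are more elaborate.
    """
    big_data_mean = sum(big_data_qh) / len(big_data_qh)
    if big_data_mean == 0:
        # No shape information in the big data: fall back to a flat panel level.
        return [panel_daypart_rating] * len(big_data_qh)
    factor = panel_daypart_rating / big_data_mean  # the correction factor
    return [rating * factor for rating in big_data_qh]


# Hypothetical quarter-hour ratings from big data for one station daypart.
big_data_qh = [0.8, 1.0, 1.2, 1.0]
# Hypothetical panel rating for the same station daypart.
panel_rating = 1.2

calibrated = calibrate_daypart(big_data_qh, panel_rating)
calibrated_mean = sum(calibrated) / len(calibrated)
print([round(r, 2) for r in calibrated], round(calibrated_mean, 2))
```

In this sketch the relative pattern across quarter hours is untouched, while the daypart average is forced to agree with the panel, so any sampling error in the panel’s daypart rating passes straight through to every quarter hour.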
One has to be careful in thinking through this question of what to calibrate on. The whole point of big data is to increase sample size and thereby increase stability. However, if you calibrate to a smaller panel, the greater variability of the smaller panel will affect the stability of the calibrated big data. Therefore you ought to apply calibration with extreme precision, choosing the calibration points by optimization, and not fooling yourself about the sample size you will need.
I bring up the point of fooling oneself because historically, first there came the idea of calibration panels, then came the notion that they can be small and still do the job. Where did this idea come from?
Small calibration panels will have two problems:
Just to clarify as much as possible, the term “calibration” in the context of audience measurement has a very specific meaning: it means that some of the data will be conformed to a part of the total data set which the researchers in charge consider to be the most accurate. I have referred to these as the “calibration points”. The calibration points are the datapoints which are considered the closest to truth, to which other data will be calibrated (forced to agree).
Let’s look at some specific examples which illustrate the important relationship between sample size and the practicality of having more versus fewer calibration points. Although logic makes it obvious that the more calibration points the better, this only works out with a large calibration sample size. If part of the goal of the researchers in charge is to get away with the smallest possible calibration panel for reasons of economy, then it will turn out to be impractical to have more than the smallest possible number of calibration points.
First, let’s look at the idea of a calibration panel consisting of 3000 households. This is about the smallest size I’ve heard bandied about in recent years. The average minute or second rating level today for television is about 0.1. The average minute or second rating level today for the average page in the top 10,000 websites/apps used by the top 100 advertisers is below 0.001. In order to be applicable across TV and digital and across panel and big data let’s call these household ratings for now. Later we will look at people data.
In a 3000 sample, these ratings turn out to represent 3 households for TV and far less than one household for digital. In a 3000 sample, any media vehicle with less than a 0.033 rating is likely to show up as a zero audience. This is before the act of calibration.
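The arithmetic above can be checked with a short sketch. It assumes that a rating of 0.1 means 0.1% of panel households, which is the reading that yields the 3 households quoted; the function names are illustrative only.

```python
def expected_households(rating_pct, panel_size):
    """Expected number of panel households in the audience at a rating given in percent."""
    return rating_pct / 100.0 * panel_size


def one_household_rating(panel_size):
    """Rating (in percent) that corresponds to exactly one panel household."""
    return 100.0 / panel_size


PANEL_SIZE = 3000

tv_households = expected_households(0.1, PANEL_SIZE)         # average TV minute
digital_households = expected_households(0.001, PANEL_SIZE)  # average digital page
zero_floor = one_household_rating(PANEL_SIZE)                # below this, expect zeros

print(round(tv_households, 2), round(digital_households, 3), round(zero_floor, 3))
```

Any vehicle whose rating falls below `zero_floor` corresponds to less than one whole household in the panel, which is why it will usually register as zero audience.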
In this scenario, the safest calibration points might be, by specific day/date, the average network-by-network ratings in television, and if digital is also to be calibrated, the average platform-by-platform ratings, for a manageable number of platforms, with the long tail thrown together as a single platform type calibration point. Let’s assume this approach gives us 100 TV calibration points and 100 digital calibration points.
From TV experience, the correction factors applied by the calibration/conforming method tend to be small (e.g. within plus or minus 20%) – that is, panel and big data do not differ that much – however within the full distribution of cases there are always some that are quite large (plus or minus in the range of 20% to over 100%). “Plus or minus” means that the correction to the “truth standard” (actually the “truth proxy” in these calibration cases) might be an upward correction or a downward one. For any given calibration point e.g. ESPN, it might be upward one day and downward the next.
We do not have experience in calibrating digital audiences to a truth proxy e.g. panel. The panels used in digital have never quite reached the degree of confidence that the currency TV panels have achieved in many countries. This is a whole different subject to come back to at another time; there is no law of physics that says digital panels cannot be as rigorous as television panels.
Focusing on TV for now, which is where today’s calibration discussions are focused, consider a 3000 panel and average ratings of 0.1 (= 3 households). The average minute or second rating for a TV network on a given day is an average over every minute or second that the network is on the air. Having all of these observations provides the effect of a somewhat larger sample size, but only to a limited degree, so the tolerance range around the panel’s rating estimate at 95% confidence will be about plus or minus 50%. That means the rating could be anywhere between 0.05 and 0.15 (and one time out of 20 it will tend to be further away from 0.1, because we are choosing to look at 95% confidence).
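To see how averaging across minutes relates to the plus or minus 50% figure, here is a rough sketch. It assumes a simple random sample and the normal approximation to the binomial, which is my simplification and not a description of how any currency panel computes its tolerances; real panels have design effects, and the approximation is crude when only about 3 households are expected. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal critical value for 95% confidence


def relative_tolerance(rating_pct, effective_n):
    """Half-width of the 95% range, expressed as a fraction of the rating itself."""
    p = rating_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / effective_n)
    return Z_95 * standard_error / p


def implied_effective_n(rating_pct, relative_half_width):
    """Effective sample size that would produce the given relative tolerance."""
    p = rating_pct / 100.0
    return Z_95 ** 2 * (1.0 - p) / (p * relative_half_width ** 2)


# A single minute at a 0.1 rating in a 3000 household panel.
single_minute = relative_tolerance(0.1, 3000)
# Effective sample size implied by a tolerance of plus or minus 50%.
effective_n = implied_effective_n(0.1, 0.50)

print(f"{single_minute:.0%}", round(effective_n))
```

Under these assumptions a single minute has a tolerance of more than 100% of the rating, and a tolerance of 50% corresponds to an effective sample of roughly 15,000, which illustrates the "somewhat larger sample size" effect of averaging over the broadcast day.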
This means that the panel being used as the truth proxy will tend to be unstable and by conforming big data to that instability, the big data will be forced to be unstable.
In order to reduce that instability, the data scientists involved might institute a lookback period. Averaging across more days in the past creates in effect rolling averages which smooth out instabilities. However, for programs or networks which sometimes or always have special sports events, or for news networks where what is happening that day causes audience spikes, this smoothing will bring down the highest upward audience spikes.
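The smoothing effect of a lookback period can be illustrated with a trailing average. This is a minimal sketch with made-up daily ratings and a hypothetical 7-day window; actual lookback rules may weight days differently.

```python
def trailing_mean(series, window):
    """Average each value with up to `window - 1` preceding values."""
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        chunk = series[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical daily ratings: a steady 0.10 with one special event at 0.40.
daily = [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.40, 0.10]
smoothed = trailing_mean(daily, window=7)

event_day = 6
print(daily[event_day], round(smoothed[event_day], 3))
```

On the event day the smoothed value sits far below the actual spike, and the spike then lingers in the following days’ averages, which is the trade-off between stability and fidelity to genuine peaks.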
Now let’s look at the larger calibration panel sample size scenario. For example, what if the calibration panel is 42,000 households, i.e. about the current size of Nielsen’s U.S. national panel? Although the sample size is 14 times larger than in the prior case, the laws of statistics dictate that the tolerance range shrinks only by the square root of 14, which is about 3.74. Thus the tolerance range will be about 3.74 times smaller: instead of 50% it will be about 13%.
This means that with a large sample size calibration panel, it will be practical to minimize the use of lookback periods and maximize the number of calibration points. For example, one could control not only for network but also for dayparts within each network, and not only for households but also for people by some number of sex/age groups. For people the sample sizes will be larger: the Nielsen national panel, for example, with over 42,000 households, is also measuring each of over 66,000 persons using pushbutton peoplemeters (gradually to be replaced by wearable passive peoplemeters).
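The square-root relationship can be verified directly. The sketch below takes the 50% tolerance for the 3000 panel as given and assumes tolerance scales with one over the square root of sample size; the helper `panel_size_for_tolerance` is illustrative.

```python
import math

small_panel = 3000
large_panel = 42000
small_panel_tolerance = 0.50  # figure quoted for the 3000 household panel

size_ratio = large_panel / small_panel
shrink_factor = math.sqrt(size_ratio)
large_panel_tolerance = small_panel_tolerance / shrink_factor


def panel_size_for_tolerance(base_size, base_tolerance, target_tolerance):
    """Panel size needed to reach a target tolerance under square-root scaling."""
    return base_size * (base_tolerance / target_tolerance) ** 2


print(size_ratio, round(shrink_factor, 2), f"{large_panel_tolerance:.1%}")
print(panel_size_for_tolerance(small_panel, 0.50, 0.25))
```

The same scaling shows why precision is expensive: halving the tolerance requires four times the panel.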
At these larger sample size levels, AI and ML can test all possible combinations of calibration strategies to determine the optimum approach to ensure the most accurate sex, age, and ethnicity ratings by network dayparts, without introducing instability to the big data. This would provide the optimal calibration panel at whatever sample size the industry believes it can afford. However, if it is less than the present size of the Nielsen panel in the U.S., the optimal calibration approach would be to use the fewest calibration points with a lookback period and Bayesian adjustment strategies, and to be prepared to tolerate instabilities and higher degrees of granular inaccuracy.
Coming back to where we started regarding duplication patterns between TV and digital, for example, today’s discussions about calibration panels assume that ad tagging will be the way digital big data are collected by each advertiser, and that those tags will be overlaid on the calibration panel with a further privacy-protecting procedure called Virtual IDs (VIDs), which purposely reduce the signal-to-noise ratio. The results of Truthset show that the signal-to-noise ratio in matching identity graphs today is about 50%, so VIDs will reduce this to below 50% in the name of privacy protection, even though clean rooms now achieve the privacy protection with no drop in signal-to-noise ratio.
We shall definitely need to use calibration panels to integrate big data for the foreseeable future, but we ought not to have unrealistic expectations about being able to cut research costs during a period of intensified audience fragmentation and complexification, unless we are also willing to accept higher risks in the investment of a trillion dollars a year in marketing actions.
As Robert A. Heinlein often wrote, “There ain’t no such thing as a free lunch (TANSTAAFL).”
Posted at MediaVillage through the Thought Leadership self-publishing platform.
The opinions expressed here are the author's views and do not necessarily represent the views of MediaVillage.org/MyersBizNet.