PPM continued: Does "reliability" trump accuracy?
Dec 3, 08:49 AM

In last Friday’s blog posting, I took a brief look at my hypothesis that sample size issues might be one of the causes of the problems that Arbitron is having with PPM results. For example, the fact that New York’s classic rock station, WAXQ, skyrocketed in the PPM results might not be due to either (A) a bias against classic rock in the diary system that was eliminated by PPMs or (B) a bias in favor of classic rock in the PPM system that wasn’t present in the diary system. It might simply be a sample wobble.

Here’s why I think that hypothesis is possible: I haven’t researched exact numbers here, but let’s say Arbitron has been getting back about 12,000 diaries per book, and let’s tentatively assume that WAXQ genuinely has about a 3.0% share of listening in the market, i.e., that about 3% of New Yorkers at any given moment are WAXQ P-1s (in other words, it’s their favorite station).

Plug the numbers “3” and “12,000” into the standard formula for sampling error (take 100 minus 3, multiply that by 3, divide by 12,000, and take the square root of that) and you’ll find that due to sample size alone, ignoring all other factors (e.g., the type of people that return diaries vs. not), there’s roughly a 32% chance any given study result will be off from reality by about 0.15 share points or more.

(Math note: One “standard error” sets the limits of the range at roughly the “68% confidence level,” i.e., the range you should get an answer within about 2/3 of the time. So there’s about a 32% chance any given finding will be outside that range. Is that clear?)

In other words, if with a huge, huge, huge sample size you’d get a 3.0 finding for WAXQ, with a 12,000-person sample size you’ll get a result that’s outside of the 2.85-to-3.15 range about 1/3 of the time. But it’s relatively unlikely (only about 5% of the time) that you’ll get a result that’s outside double that range (or 2.7 to 3.3).
However, with the PPM, Arbitron is planning to cut its sample size by 2/3 in terms of number of participants. (Lots more data from each participant, but fewer unique individuals.) That means (play around with the equation if you don’t believe me) their error range will almost double: cutting the sample to a third multiplies the standard error by the square root of 3, or about 1.7. In other words, about 1/3 of the time, you’ll get a result outside of roughly the 2.7 to 3.3 range, and there’s a 5% chance you’ll get a result outside of roughly the 2.4 to 3.6 range.

So if WAXQ’s first PPM result is, say, a 3.7, does that say anything conclusive about inherent PPM bias vs. diary bias? Not necessarily! We could simply be looking at the one book in 20 in which WAXQ should be expected to have a big sample wobble, and the wobble happened to be up. (Or, to put it another way, it might be the one station out of 20 in any given ratings report that wobbles that much.)
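To make the arithmetic above easy to replay, here is a minimal Python sketch of the same formula. It assumes simple random sampling with one observation per person, which is the post's simplification and not Arbitron's actual estimation method; the function name `share_standard_error` and the constant `TRUE_SHARE` are made up for illustration.

```python
import math

def share_standard_error(share_pct, sample_size):
    """One standard error, in share points, under simple random sampling."""
    return math.sqrt(share_pct * (100 - share_pct) / sample_size)

TRUE_SHARE = 3.0  # assumed "real" WAXQ share, as in the post

# 12,000 = rough diary sample per book; 4,000 = that sample cut by 2/3
for n in (12_000, 4_000):
    se = share_standard_error(TRUE_SHARE, n)
    print(f"n={n:>6}: 1 SE = {se:.2f}, "
          f"~68% range {TRUE_SHARE - se:.2f} to {TRUE_SHARE + se:.2f}, "
          f"~95% range {TRUE_SHARE - 2 * se:.2f} to {TRUE_SHARE + 2 * se:.2f}")
```

The two standard errors differ by a factor of the square root of 3, which is why the ranges widen by about 1.7 times when the sample is cut to a third.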
“Reliable”

Arbitron has convinced most of the radio industry, including many researchers, that this is not a problem, arguing that the results are reliable, by which they mean consistent.

Sure they’re consistent! If WAXQ happens to have a 3.7 share within a batch of 4,000 PPM-carriers, they’re going to have the same 3.7 the next week and the next and the next (subject to minor variations due to other factors, of course), because it’s the same frigging people every week! And the next and the next, until slowly, over the course of maybe several quarters, the panel gets refreshed with a new set of respondents. Doesn’t make it right. “Right” might still be a 3.0.

So, a big question is whether the declines for Urban and Spanish stations (the ones that triggered the problems with the New York City Council, etc.) are going to be real and permanent declines due to the substantive nature of the change from diaries to PPM, or whether we’re simply seeing some “down” sample wobbles.

Some data exists to study this. For example: Did the same patterns occur with the classic rock stations (or the Urban or Hispanic stations) in the Philadelphia and Houston PPM tests? Inquiring minds should want to know. To paraphrase “Dirty Harry,” you know, I’m curious myself.
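The difference between "consistent" and "right" can be put in numbers with a small sketch. It uses the textbook result that the difference between two equally noisy estimates has a standard error of SE × sqrt(2 × (1 − correlation)). The 0.27 figure is the simple-random-sample standard error for a 3.0 share at 4,000 people; the 0.9 panel correlation is purely hypothetical, chosen only to illustrate the point, and is not an Arbitron figure.

```python
import math

def se_of_change(se_level, correlation):
    """Standard error of (this period minus last period) when both periods
    have the same standard error and the given correlation between them."""
    return se_level * math.sqrt(2 * (1 - correlation))

SE_LEVEL = 0.27          # one standard error for a 3.0 share at n=4,000
PANEL_CORRELATION = 0.9  # hypothetical: the same people measured both times

independent = se_of_change(SE_LEVEL, 0.0)           # two separate samples
panel = se_of_change(SE_LEVEL, PANEL_CORRELATION)   # one continuing panel

print(f"error in the level itself:         {SE_LEVEL:.2f} either way")
print(f"error in change, separate samples: {independent:.2f}")
print(f"error in change, same panel:       {panel:.2f}")
```

Under these assumptions the panel makes period-to-period changes much steadier, but the uncertainty about the level itself stays where it was: a panel that starts at 3.7 can stay near 3.7 even if the true share is 3.0.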
Response bias

But there’s another possibility, too.

I believe it’s true that only about 14% of people who Arbitron would like to carry PPMs around all day every day for several months actually are both contactable and willing to do it. And I know that Arbitron applies weighting within age/sex/race cells to make sure each cell is represented proportionally.

So let’s take a cell like African-American M25-34, and let’s take two representative guys: Robert listens to WLTW (soft rock), WINS (all news), and WNYC (public radio), whereas Darrell listens to WQHT (hiphop) and WRKS (R&B). Arbitron wants each of these two guys to carry a clunky pager around with them for several months. As noted above, on average, only about 14% of people who Arbitron wants to do this will do this.

Is it possible that Robert is more likely to do this than Darrell? I.e., could African-American M25-34s like Robert, as a class, be willing to go along, say, 16% of the time, while African-American M25-34s like Darrell, as a class, are only willing to go along, say, 10% of the time? (By the way, I think the Darrells of the world might be the ones acting rationally here: I know I wouldn’t carry a clunky pager around for several months for a market research firm unless they gave me a lot of compensation. Which I don’t think Arbitron does.)

To put this in “30 Rock” terms, is Toofer (the argyle sweater-wearing alumnus of the Harvard Crocodilios glee club) more likely to be willing to help out Arbitron by carrying around a clunky pager for several months than Tracy Jordan (the style-conscious hiphop fan and movie star (“Black Cop/White Cop” and “Who Dat Ninja?”)) is?

If so, that’s a different issue entirely! That’s called “response bias.” It’s not so much how many African-American M25-34s (to use one example of an age/sex/race cell) are surveyed, but whether the ~14% who participate are representative of the ~86% who don’t.
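Here is a rough sketch of how that could play out inside a single cell. The 16% and 10% cooperation rates are the "say" figures from the example above; the 50/50 split of the cell between Robert-type and Darrell-type listeners is a made-up assumption, and the function name `panel_mix` is illustrative only.

```python
def panel_mix(population_mix, cooperation_rate):
    """Share of each listener type among the people who agree to participate."""
    joiners = {k: population_mix[k] * cooperation_rate[k] for k in population_mix}
    total = sum(joiners.values())
    return {k: v / total for k, v in joiners.items()}

# Hypothetical: the cell is split evenly between the two listener types
population_mix = {"Robert-type": 0.5, "Darrell-type": 0.5}
# The "say" cooperation rates from the example
cooperation_rate = {"Robert-type": 0.16, "Darrell-type": 0.10}

mix = panel_mix(population_mix, cooperation_rate)
for listener, share in mix.items():
    print(f"{listener}: {population_mix[listener]:.0%} of the cell, "
          f"{share:.1%} of the panel")
```

Under those assumptions the Darrell-type listeners are half of the cell but well under half of the panel. Weighting the whole cell up to its population size scales everyone in it by the same factor, so it would leave that lopsided mix untouched.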
We (by which I mean savvy observers) always knew that there was a certain form of response bias involved in diaries, e.g., that alternative rock listeners tended to not return diaries. But it’s very reasonable to expect a different type of response bias in this PPM approach. And, hypothetically, it’s possible that this response bias is what’s hammering stations that serve African-American and Latino audiences.

More later this week.

References

Standard error calculator here. Plug in “14,000,000” for NYC population, “.03” for Proportion (i.e., a 3.0 share), and “12,000” or “4,000” for sample size, and click the “95%” button for “Confidence Level,” which is TWO standard errors. (Thank you, Australian federal government!)

Found reference to “Arbitron’s low-to-mid teens SPI” (which I simplified to “14%” above) in a column by radio researcher Dr. Roger Wimmer (here, about halfway down the page). (Note that Roger seems to buy into the argument that consistency is what’s most important.)

Kurt —
After all those years of inserting “Star Trek” references and analogies into your columns, it’s excellent to see you moving on to making “30 Rock” references instead!
Keep working those mind grapes!
— Laura Holt · Dec 5, 05:53 AM · #
Kurt:
The Australian National Statistical Service standard error calculator you reference above assumes simple random sampling with one observation per sampling unit. That is not the case for Arbitron AQH ratings. Therefore, the calculator standard errors will be considerably larger than the Arbitron PPM standard errors.
You are using the equivalent of the Australian National Statistical Service calculator to make this point on reliability. But the reliability of Arbitron AQH estimates is more complicated than that. So your reliability comparisons are not appropriate. Most importantly, the reliability of AQH ratings is based on the number of quarter hours measured for each panelist. That’s what makes Arbitron PPM estimates reliable.
You say that Arbitron defines reliable as consistent. That’s not the case. That’s a separate issue. It is true that PPM estimates are more stable than diary estimates. Or another way of saying that is that the reliability of the month to month change estimates is better with a panel than with two independent samples.
The bottom line is that we’ve been saying that we’ve designed the PPM sample so that monthly PPM AQH ratings are as reliable as quarterly diary AQH ratings. For some demos, the reliability of the monthly PPM AQH ratings is a bit better. And, as a consequence of the panel, the month to month change in ratings is more reliable than the quarter to quarter diary changes. It’s important that we recognize that these are two separate concepts.
Monthly PPM AQH ratings are as reliable as quarterly diary AQH ratings due to more quarter hours measured per person.
PPM AQH ratings are more valid than diary AQH ratings due to passive instrument.
Month to month change in PPM AQH ratings is more reliable than quarter to quarter change in diary AQH ratings due to use of a panel.
Sub-monthly (even hourly) PPM estimates are more reliable than sub-monthly diary estimates, due to use of a panel.
There is better population coverage for PPM AQH ratings due to sampling of CPO households.
— John Snyder · Jun 23, 08:07 AM · #
Are there variability data (e.g., standard deviations) and sample size data available so that one can calculate a standard error for the “averages” Arbitron reports (which I assume are means), such as AQH and cume?
— Frank LeFever · Aug 14, 08:04 PM · #