
Data Quality Matters More Than Data Analysis

Making decisions based on bad data is worse than making decisions with no data.

When we read the official data reporting the daily increases in new cases of COVID-19 and deaths, we tend to believe it. We are not China; our data must be correct, right?

The evidence seems to say otherwise.

For example, half of U.S. states said they could not provide data on nursing home deaths (or declined to do so), and some states said they do not track these deaths at all. To get a sense of how many uncounted deaths this could represent, NBC News tallied 2,246 deaths associated with long-term care facilities—and that’s only from the states that provided the data, so it probably represents a floor for the real number of deaths in such facilities. In Italy and France, nursing homes have been a large source of undercounted deaths, sometimes because the management of a facility does not want to tarnish its image. It seems reasonable to expect that the United States might suffer from the same problem.

In addition, we already know that official data was not correct. In February and early March, the United States lacked the capacity to perform tests in most of its territory. The official count of cases was extremely low, probably not because the virus was contained but because testing was. Data is only as good as the consistency with which it is collected.

There are other reasons to question official data. For many European countries, “today’s confirmed new cases” refers to the tests whose results have been transcribed to the database today, not to the new infections that took place today. The difference is significant, considering that in Italy it might take up to 20 days to take a swab, process it, and record the result in the central database.

Moreover, swabs are mostly taken from symptomatic patients, and conservative estimates tell us that it might take anywhere from three to 14 days for an infected patient to show symptoms. Put the two numbers together and it turns out that official data could lag the real number of cases by up to 34 days! These numbers refer to Italy; other countries might be faster but the point stands: official data on case count lags reality by weeks.
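The back-of-the-envelope arithmetic above can be written out in a few lines of Python. The delay ranges are the ones cited in the text for Italy and are used purely for illustration, not as measured parameters:

```python
# Illustrative sketch of the worst-case reporting lag described above.
# Both delay ranges are taken from the figures cited in the text (Italy)
# and are assumptions for illustration only.

SYMPTOM_ONSET_DAYS = (3, 14)   # infection -> first symptoms
PROCESSING_DAYS = (0, 20)      # symptoms -> swab taken, processed, recorded

def worst_case_lag(onset, processing):
    """Maximum days between an infection and its appearance in official data."""
    return onset[1] + processing[1]

print(worst_case_lag(SYMPTOM_ONSET_DAYS, PROCESSING_DAYS))  # prints: 34
```

With faster processing the worst case shrinks, but the structure of the problem stays the same: the chart you see today describes infections that happened weeks ago.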

The Larger Problem

The larger problem is that the delay is not constant across results. If all results were delayed by the same number of days, we would still be able to extract trends. Delayed ones, yes, but still trends. The delay, however, is not constant. One test might have a three-day delay and the next one 20. Trends based on official data might be wrong.

One could argue that the delay could be “averaged out” and, over large numbers, treated as constant. The problem is that delays tend to cluster. One county might process data faster while another is slower. One week we might have timely tests, and the next week we might lack reagents and incur large delays.
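The clustering effect is easy to simulate. In the toy sketch below, true infections rise steadily every day, but even-numbered weeks report results within a few days while odd-numbered weeks take 10 to 20 days. All numbers are invented for illustration. The reported curve dips even though the epidemic never slows:

```python
import random

random.seed(0)

# Toy simulation: true infections grow steadily, but reporting delays
# cluster by week (fast one week, slow the next). The reported curve
# then dips even while true cases keep rising.

DAYS = 28
true_cases = [100 + 10 * d for d in range(DAYS)]  # steadily rising

reported = [0] * (DAYS + 30)
for day, n in enumerate(true_cases):
    fast_week = (day // 7) % 2 == 0  # even weeks report quickly, odd weeks slowly
    for _ in range(n):
        delay = random.randint(1, 3) if fast_week else random.randint(10, 20)
        reported[day + delay] += 1

# True cases rise every single day; the reported series does not.
true_monotone = all(b >= a for a, b in zip(true_cases, true_cases[1:]))
reported_monotone = all(b >= a for a, b in zip(reported[:DAYS], reported[1:DAYS]))
print(true_monotone, reported_monotone)  # prints: True False
```

An analyst looking only at the reported series would see a “decline” that exists nowhere in reality, which is exactly why trends built on official data can mislead.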

These problems are in addition to others regarding testing, notably that the number of tests made does not equal the number of people tested. To be discharged from the hospital, people might require multiple tests to validate their recovery; moreover, healthcare workers frequently undergo routine tests. If we do not know who gets tested and how data is reported, we cannot know whether curves about new daily cases represent new daily cases or testing policies or operational capabilities.
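The distinction between tests performed and people tested is worth making concrete. The test log below is invented for illustration; the point is only that a naive count of tests can substantially overstate the number of distinct people screened:

```python
# Minimal sketch of why test counts overstate people tested: the same
# person (a patient awaiting discharge, a healthcare worker on routine
# screening) can appear several times. The log entries are invented.

test_log = [
    "patient_A", "patient_B", "patient_A",   # patient_A retested before discharge
    "nurse_C", "nurse_C", "nurse_C",         # routine staff testing
    "patient_D",
]

tests_performed = len(test_log)        # what many dashboards report
people_tested = len(set(test_log))     # what we usually want to know

print(tests_performed, people_tested)  # prints: 7 4
```

Without person-level reporting conventions, a rise in “tests” can reflect discharge protocols or staff screening rather than broader surveillance of the population.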

Finally, official data does not include the number of indirect deaths—those who die of unrelated medical conditions because they cannot be treated with the usual responsiveness by a local hospital clogged with COVID-19 patients.

Put all of this together and it’s a mess. Official data is incorrect and we do not know by how much. I do not understand how people can still look at charts and trends without asking themselves how good the data is.

The habit of analyzing data without first validating it is an instance of the ludic fallacy, in which, as in an exercise from a statistics textbook, we begin with the implicit assumption that the data is correct.

This is the real world, however, and data cannot be assumed correct. Some investigative work is required. Making a chart without first validating the integrity of the data is a purely hedonic performance, an expression of scientism—the ritualistic display of competence which has become the backbone of those institutions whose members don’t know what they are talking about but are very good at not giving a damn about it.

The root problem is that when people do not understand a field well enough to be able to discern competence, they have to resort to proxies to evaluate it: credentials, jargon, charts, and the ability to perform similar superficial rituals.

Unfortunately, schools have long since given up teaching any skill whose related exam cannot be passed through sheer imitation. Scientism is the dangerous result.

“Let’s Use the Data We Have”

No. Making decisions based on bad data is worse than making decisions with no data.

If you have no data, at least you either make the conservative choice or you recognize the need to go get data upon which you can rely. Making decisions with bad data is dangerous. It might lead one to underestimate the danger posed by the virus and to underreact.

The first rule of models is that you spend time discussing them only after having protected the population from immediate risks. Protect first, measure second, model third. Not the other way around. Not when one infected person might infect others and the situation can grow out of hand before we know it.

We made enough dangerous choices in January, February, March, and April. Let’s play it safe in May. Let’s go dig up some good data, and let’s not use charts and models until we have it.