On my next birthday I will reach my father’s age when he died of a major heart attack. My dad was, in some ways, like every other World War II veteran. He worked very hard, smoked, and was never sick enough to visit a physician. Upon his death in the hospital, his physician mentioned that my dad had likely suffered several less severe attacks and that men like him typically treated these as heartburn by taking antacids.
It was true. When cleaning out my dad’s home we found a variety of antacid containers, conspicuously located within reach of his favorite spots in the living room, on the front porch, and in the bedroom. Twenty-five years ago, this was an interesting anecdote. Today, it is a signal for how we might finally make meaningful improvements to the quality of healthcare in this country by tapping into the power of big data.
To get there, we’re first going to have to get past some stubborn hang-ups that have kept big data in healthcare from reaching its full potential. That process needs to start with understanding how big data works.
The Three Vs of Big Data
In 2001, Doug Laney of META Group Inc. coined the phrase “3 Vs of big data,” referring to the need to control data volume, velocity and variety when conducting deep analytics with fragmented data sets. His thesis was that any potential application of big data analytics needs to consider how the method depends on the size of the data set, the speed at which new data is processed, and the disparate sources of the data. Let’s consider the implications of each of these three dimensions for healthcare analytics through the lens of my dad’s scenario.
My father’s doctor’s observation that many patients self-diagnose and treat minor heart attacks as heartburn cannot be considered adequate evidence to support a specific intervention. There are many questions that must be answered before this casually observed pattern might be considered a signal: How sensitive is an increase in the use of antacids in detecting an imminent heart attack? How often does an increase actually precede an attack? Are there other factors that have an impact on the answers to these questions?
If the objective is to discover a meaningful and reliable indicator of high probability for a major heart attack, then these questions must be answered through the application of statistical and data mining techniques. To ensure that these techniques will provide the desired results, they must be applied to a volume of data large enough to include historical data for a sufficient number of people to present the methods with many examples of all possible combinations of antacid use and clinical outcomes.
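The first two questions posed earlier correspond to two standard measures: sensitivity (of the people who had a heart attack, how many showed an antacid increase beforehand) and positive predictive value (of the people who showed an increase, how many went on to have a heart attack). The short Python sketch below shows how each is computed. The counts are invented purely for illustration and do not come from any study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actual heart attacks that were preceded by a flagged increase."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Share of flagged increases that were actually followed by a heart attack."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts, invented for illustration only.
true_pos = 40    # antacid increase flagged, heart attack followed
false_neg = 60   # no increase flagged, heart attack occurred anyway
false_pos = 960  # antacid increase flagged, no heart attack followed

sens = sensitivity(true_pos, false_neg)
ppv = positive_predictive_value(true_pos, false_pos)

print(f"Sensitivity: {sens:.0%}")
print(f"Positive predictive value: {ppv:.0%}")
```

With these made-up numbers, the signal would catch 40 percent of attacks, yet only 4 percent of alerts would be followed by one. Estimating either figure reliably is exactly why a large volume of historical data is needed.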
Big data provides the opportunity to link and access new data sources that are outside of the healthcare domain. In our scenario, the need for variety is obvious. We are seeking a signal in retail sales data—a disparate data source—in order to assign risk of a medical event. But it is important to acknowledge two serious issues with the access and application of this data that have prevented this type of research from becoming a practical reality.
The first is the individual’s right to privacy. The ethical access and use of electronic data is a serious social issue. Most people are comfortable with their data being accessed as long as their consent is required and its use is restricted to a specific purpose. There is certainly disagreement on what constitutes real consent, but it is likely that many people will permit access if personal benefit can be demonstrated. It is, therefore, important that this value be documented with real examples of results demonstrating the direct connection between this information and improved outcomes (e.g., saving a life).
The second issue is a methodological one. With access to high volumes of data containing a large number of data elements, it is likely that some relationships between a characteristic and an outcome will appear to be significant when they are not. For example, in our scenario, a relationship between antacids and heart attack may result from a common relationship to a third variable that hasn’t been controlled for in the samples. Interpreting these results requires subject matter expertise that can offer a plausible medical mechanism for the relationship; in our scenario, that mechanism is patients mistaking minor heart attacks for heartburn.
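To see why spurious relationships become likely as the number of data elements grows, consider a simplified calculation. The Python sketch below assumes that every candidate characteristic is tested independently at the same significance level and that none of them is truly related to the outcome. Both are idealized assumptions, and the figure of 1,000 characteristics is hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests that look significant by chance alone."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n independent null tests looks significant."""
    return 1 - (1 - alpha) ** n_tests


# Hypothetical setup: 1,000 candidate characteristics, each tested at the 5% level.
n_tests = 1000
alpha = 0.05

expected = expected_false_positives(n_tests, alpha)
at_least_one = prob_at_least_one_false_positive(n_tests, alpha)

print(f"Expected chance findings: {expected:.0f}")
print(f"Probability of at least one chance finding: {at_least_one:.4f}")
```

Under these assumptions, roughly 50 characteristics would appear significant with no real relationship behind them, which is why a plausible mechanism matters as much as the statistic.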
More than 23 billion credit card transactions are processed in the U.S. every year. That’s 63 million each day, 2.6 million each hour, and 44,000 every minute. Intermediaries process more than 1.2 billion fee-for-service claims every year for more than 1 million providers, or about 2,300 every minute. Combining dozens of data sources, each at these high levels of velocity, and searching in real time for patterns that suggest some type of intervention requires new, efficient computing and analytic capabilities.
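The per-day, per-hour and per-minute figures follow directly from the annual totals. The short Python sketch below reproduces the arithmetic, assuming a 365-day year and a constant rate around the clock (real traffic is, of course, uneven).

```python
TRANSACTIONS_PER_YEAR = 23_000_000_000  # credit card transactions, from the text
CLAIMS_PER_YEAR = 1_200_000_000         # fee-for-service claims, from the text


def rates(annual_total: float) -> dict:
    """Break an annual total into average daily, hourly and per-minute rates."""
    per_day = annual_total / 365
    per_hour = per_day / 24
    per_minute = per_hour / 60
    return {"day": per_day, "hour": per_hour, "minute": per_minute}


card_rates = rates(TRANSACTIONS_PER_YEAR)
claim_rates = rates(CLAIMS_PER_YEAR)

print(f"Card transactions per day:    {card_rates['day']:,.0f}")
print(f"Card transactions per hour:   {card_rates['hour']:,.0f}")
print(f"Card transactions per minute: {card_rates['minute']:,.0f}")
print(f"Claims per minute:            {claim_rates['minute']:,.0f}")
```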
In our scenario, adequate velocity might be a daily or weekly review of new data to alert the care manager. In this case, the risk can only be recognized over a period of time sufficient to suggest a measurable increase in the purchase of antacids. Can the increased use of antacids be detected with enough reliability to prevent the event? Of course, this depends not only on the risk measure, but also on the speed with which the healthcare system can intervene.
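As a rough illustration of what a weekly review could look like, the Python sketch below compares a person’s average purchases over the most recent weeks with their own earlier baseline. Everything here is hypothetical: the function name, the window lengths and the 1.5x threshold are assumptions chosen for the example, not a validated clinical rule.

```python
from statistics import mean


def flag_increase(weekly_purchases, baseline_weeks=8, recent_weeks=4,
                  ratio_threshold=1.5):
    """Return True if recent average purchases exceed the earlier baseline.

    weekly_purchases: antacid purchase counts per week, oldest first.
    All window sizes and the threshold are illustrative assumptions.
    """
    needed = baseline_weeks + recent_weeks
    if len(weekly_purchases) < needed:
        return False  # not enough history to judge a change

    recent = weekly_purchases[-recent_weeks:]
    baseline = weekly_purchases[-needed:-recent_weeks]

    recent_mean = mean(recent)
    baseline_mean = mean(baseline)
    # Flag only when there are recent purchases and they exceed the
    # baseline average by at least the chosen ratio.
    return recent_mean > 0 and recent_mean >= ratio_threshold * baseline_mean


# Hypothetical purchase histories, invented for illustration.
steady = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
rising = [1, 1, 0, 1, 1, 1, 0, 1, 2, 2, 3, 3]

print(flag_increase(steady))  # False
print(flag_increase(rising))  # True
```

The sketch makes the trade-off concrete: a longer recent window gives a more reliable signal but delays the alert, leaving the care manager less time to intervene.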
Veracity: The Fourth Dimension