Exciting developments have been taking place recently at the Danville, Pennsylvania-based Geisinger Health System, an integrated health system long renowned for its operational innovations. Chief among them has been a push by senior leaders at Geisinger to develop and implement an enterprise-wide unified data architecture (UDA), something that remains futuristic for most patient care organizations nationwide, yet is happening now at Geisinger.
For that accomplishment, the editors of Healthcare Informatics have named Geisinger Health System a semifinalist winner in the 2017 Healthcare Informatics Innovator Awards Program.
At Geisinger, senior vice president and CIO John Kravitz; Bipin Karunakaran, vice president of data management; and Joseph Scopelliti, IT director of data management, have been helping to lead their colleagues in leveraging data and analytics. As large numbers of professionals at Geisinger move to leverage data for a widening range of purposes, it has become increasingly clear that a broad-based, unified data architecture will be needed to serve the cresting wave of demand for data and analytics. Thus, over the past two years, Kravitz, Karunakaran, Scopelliti, and other healthcare IT leaders at Geisinger have concluded that the organization would need to rework its data infrastructure to support its groundbreaking work in population health management, care management, clinical transformation, and other key areas.
As Scopelliti wrote in his team’s Innovator Awards submission, “The project was to create a Unified Data Architecture (UDA), which integrates all of the analytic platforms at Geisinger Health System. The key component of the UDA would be the creation of the Big Data (Hadoop) platform. This platform was the first phase of the project. In a one-year timeframe, the team established a big-data platform, based on Hadoop and other open-source components. In this first year, we have developed code for a source ingestion pipeline (which pulls in source data, performs the necessary transformations, and loads the data into various views, each of which has specific benefits to the data analysts). We have pulled in all of the source data currently populating the data warehouse (EDW), plus additional sources not in the EDW. Additionally, we've done work with the non-discrete data (using the NLP capabilities of Hadoop), and now can analyze the thyroid and pulmonary clinic notes. Further, we've decided that all new development should be done on the big data platform (instead of the EDW) wherever possible; case in point being the work we did on Hadoop for BPCI (Bundled Payments for Care Improvement).”
Scopelliti added in his team’s submission that “Geisinger has taken a bold step with this project, even the first phase (building out the big data platform), as we plan to deviate from industry standard and the common opinion that Big Data should augment the EDW, not replace it. We are on our way to proving that we CAN replace the EDW. By running analytics from our Hadoop infrastructure, we have all of the benefits of distributed computing, plus the additional benefits of late binding and the ability to deal with non-discrete data, such as we find in clinic notes. I have included a presentation we recently did at the Healthcare Data and Analytics Association conference, which gives more background on the work we did, and benefits achieved.”
Scopelliti spoke recently to HCI Editor-in-Chief Mark Hagland regarding Geisinger’s unified data architecture initiative. Below are excerpts from that interview.
How did your unified data architecture initiative begin?
Geisinger has a long history of analytics. There are a couple of organizations in the country, like Intermountain and Geisinger, that have been doing this for a long time. And honestly, the start of it was 1995-1996, when we implemented our Epic EHR. And ten years later, leadership said, we’ve got all this clinical data; we need a data warehouse. So in 2008, we went live with the first iteration of our EDW; we called it CDIS, the Clinical Decision Intelligence System. The beauty of this is a couple of things. Number one, we pulled in not only EHR data, but financial data and claims data, because we have a health plan, and other types of data as well.
In the past, if data analysts wanted to do some research or analysis, they would have to request it from the data team. Now, all of a sudden, with the data warehouse, they could do this themselves, and data analytics exploded, in a good way. And IBM came in and helped us with this. And we ran with that until 2012. And then we decided that we needed a different data warehouse, so we moved to a Teradata data warehouse with stronger computing capability. And we’re still running that. We have thousands of reports and dashboards that are running on CDIS.
So last year, executive leadership decided that we needed to move beyond the CDIS data warehouse to a unified data architecture. The way I see the UDA, it’s an integration of all of our key data platforms. So for example, we’re doing some work with Cerner on population health, via their HealtheIntent platform. And Epic is going to be coming out with an EDW of their own; they keep changing its name. The point is that we’re tying all of our key analytics platforms together. But one major component of this UDA is this new data platform based on Hadoop.
It was excellent meeting with John Kravitz, Bipin Karunakaran, and other members of your IT team in Danville last summer, and to hear about the progress that you had already made by that time, on your UDA initiative.
Yes, last summer, we were putting in 10-to-12-hour days, six to seven days a week, to create the Hadoop platform. But that platform is a key component of the UDA. A colleague and I did a presentation at a conference a few months back, and we made a statement. Most people in healthcare see a big data platform as a supplement; they’re very hesitant to put all their eggs in one basket. But we see this as a replacement. We think we can retire the data warehouse, and that the UDA will effectively take its place, with most of the work being on the Hadoop platform. And our goal is to achieve that within 18 months.
How hard is it to move forward at that kind of pace?
Well, it is hard. The first step is setting up the infrastructure of the new Hadoop platform. It’s commodity hardware, so we set up all these servers and all these nodes, and got it functional. Then, you’re taking all the traditional data warehouse sources, roughly 20, including Epic EHR data, other clinical data, financial data, and claims data, and channeling all those sources into the data platform. But the key is that the team wrote a data ingestion pipeline, in MapReduce and Java code. The idea is that it makes the platform very accessible; it allows us to ingest new source data very quickly. So now we have a quick way of ingesting data into the Hadoop platform. So you build the hardware infrastructure, set up the underlying code to ingest data, and now we have the task of migrating all of the existing analytics programs onto that infrastructure.
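The ingestion flow Scopelliti describes, in which each raw source record is parsed, keyed, and grouped into views for analysts, can be sketched conceptually. Geisinger’s actual pipeline is written in Java as MapReduce jobs running on Hadoop; the Python below, with invented field names and record formats, only illustrates the general map and reduce stages.

```python
# Conceptual sketch of a map/reduce-style ingestion step. Field names and the
# "|"-delimited record format are hypothetical, for illustration only.
from collections import defaultdict

def map_phase(raw_rows):
    """Map: parse each raw source record, emit (patient_id, record) pairs."""
    for row in raw_rows:
        patient_id, source, value = row.split("|")
        yield patient_id, {"source": source, "value": value}

def reduce_phase(mapped_pairs):
    """Reduce: group records by patient, producing one merged view per key."""
    grouped = defaultdict(list)
    for key, record in mapped_pairs:
        grouped[key].append(record)
    return dict(grouped)

# Mock records standing in for EHR, claims, and lab source feeds.
raw = ["1001|ehr|bp_120_80", "1002|claims|cpt_99213", "1001|labs|tsh_2.1"]
views = reduce_phase(map_phase(raw))
```

In a real Hadoop job the map and reduce stages run in parallel across the cluster’s nodes, which is what makes ingesting a new source largely a matter of writing a new parser for the map phase.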
My team’s primary task is building and maintaining the hardware and software for this program. And any new development, we’re trying to do on the new platform. One example is BPCI, the Bundled Payments for Care Improvement initiative; we’re a part of that federal program for bundled payments, and we did the development for it on the Hadoop platform. We’ve also been doing a lot of work on sepsis on the Hadoop platform; and we’re doing a lot of NLP on it, analyzing free-text thyroid and pulmonary clinic notes. This is something we couldn’t have done using a traditional data warehouse, which really requires discrete data in rows and columns. In Hadoop, we can use NLP to analyze free-text data. So we’ve broken ground with thyroid and pulmonary conditions, but there’s no limit to this.
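As a toy illustration of what analyzing free-text notes makes possible, the snippet below pulls a discrete measurement out of narrative text. The pattern and the sample note are invented for this example; Geisinger’s actual extraction runs on NLP tooling in the Hadoop platform.

```python
import re

# Hypothetical example: extract nodule sizes from a free-text thyroid note.
# A data warehouse needs this as a discrete column; NLP can recover it from prose.
NODULE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*cm\s+nodule", re.IGNORECASE)

def extract_nodule_sizes(note):
    """Return every nodule size (in cm) mentioned in a free-text note."""
    return [float(size) for size in NODULE_PATTERN.findall(note)]

note = ("Thyroid ultrasound shows a 1.2 cm nodule in the right lobe; "
        "prior 0.8 cm nodule stable.")
sizes = extract_nodule_sizes(note)
```

Production clinical NLP is far more involved than one regular expression, of course; the point is only that the measurement lives in prose, not in a row-and-column structure.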
What have been the biggest challenges so far in all this work?
Certainly, getting as much work done successfully as we have in such a short time has been huge. We did a presentation at the Healthcare Data and Analytics Association conference; it’s really an all-analytics conference. And when we told them what we did, jaws dropped. And that was in the same breath as our saying, we think we can replace the data warehouse. And we’re kind of beyond that already. So the second big challenge, which still faces us, is converting existing analytics to the new platform, and that’s underway now. And, as with anything, change is difficult for some people.
Also, you’re really trying to move to a self-service mode for the end-users of data, correct?
Yes, exactly. And to do that, we have to take the semantic layer into account. So we’re pushing Tableau right now as a key front-end tool so that end-users can self-serve. And we’ve never really had a good data model for this data. End-users were used to working off a particular vendor’s data model, with all its limitations; but we’re creating our own data model. Why that’s so important is that, finally, analysts will have the ability to take data that’s already cleansed and governed, and they can really focus on just the analytics. And further, when you have a tool like Tableau, you really want to set it on top of modeled data, for it to work optimally.
What have been the key lessons learned so far in this work?
First of all, we have to embrace change. This is IT. You can’t keep the same technology forever, or it’s stale. So you have to look at what is the next big thing, and what we can take advantage of for greater efficiency. ROI is very important. And one thing about Hadoop is that it’s open-source, it’s commodity hardware. In other words, we can go to any vendor and buy some HP servers and use those. We don’t have to buy a specific vendor’s appliance that is a combination of hardware and software that is so expensive.
And in the old days, you’d take only the data that made sense for the use cases you had. We’ve got 30 terabytes of data in our CDIS. Now, with the Hadoop platform, we’ve got 600 terabytes of raw storage. Hadoop, by default, keeps three copies of everything, for built-in high availability. So we’re up to a capacity of 200 terabytes of usable data.
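The arithmetic behind those figures follows from Hadoop’s default behavior: HDFS stores three replicas of each data block, so usable capacity is raw capacity divided by the replication factor. A quick check of the numbers cited:

```python
# HDFS keeps three copies of every block by default (dfs.replication = 3),
# so usable capacity is raw disk capacity divided by the replication factor.
RAW_CAPACITY_TB = 600   # raw storage cited for the Hadoop cluster
REPLICATION_FACTOR = 3  # HDFS default

usable_tb = RAW_CAPACITY_TB // REPLICATION_FACTOR
```

The replication factor is tunable per file, but three is the common default, and it is what makes the cluster tolerant of individual node failures.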
That’s a lot of data!
Yes, it’s amazing, right? That’s our capacity. And we have a new mindset now: if we need one piece of data, don’t just go for that one piece of data, get it all. If cardiology needs one or two tables of data, we’ll tell cardiology, let’s go for all of your data. So it’s a different mindset as well.
What would you tell your colleagues about the work you and your colleagues have been doing, and how they should understand it, in the context of data architecture work they might be doing?
I would say, keep your eyes on Geisinger. This is our plan—to get rid of our data warehouse in 18 months. So I’m telling colleagues, keep your eyes on Geisinger. I think we’re going to win this. I think we can do it; and we’re happy to share what we’ve done. We have done a lot. It’s been a crazy ride, but it’s also been an exciting one, and we’re making big strides here.