In June, just eight months into its $100 million, five-year, enterprise big-data initiative, leaders at the vast University of Pittsburgh Medical Center (UPMC) health system in Pittsburgh, Pa. were able to announce that, using the foundational architecture of their recently created enterprise data warehouse, researchers at the University of Pittsburgh and UPMC were able to electronically integrate for the first time clinical and genomic information on 140 patients previously treated for breast cancer. Adrian V. Lee, Ph.D., a renowned expert in the molecular and cellular biology of breast cancer, and director of the Women’s Cancer Research Center at the University of Pittsburgh Cancer Institute, has been leading his colleagues in research on differences between pre-menopausal and post-menopausal breast cancer. Now, leveraging the organization’s data warehouse capabilities and its core electronic health record (EHR), Dr. Lee and his colleagues are mining breast cancer data available to them and applying genomics data to the care of 140 patients who have been treated at UPMC for breast cancer. The work of Lee and his colleagues is building on the creation of the data warehouse, which in turn required collaboration among several vendor partners, including Oracle, IBM, Informatica, and dbMotion.
HCI Editor-in-Chief Mark Hagland spoke recently with Dr. Lee about the work that he is helping to lead in Pittsburgh, and about its implications for development of what is variously being called personalized and precision medicine. Below are excerpts from that interview.
Dr. Lee, when I met you in Pittsburgh in January, you were just beginning to roll out this initiative.
Yes, we had just started the project when you and I met. The goal relates to the fact that we now have an incredible capacity to sequence patients, and are now creating the capability to impact patient care, but progress has occurred so quickly that we have no information infrastructure to store or analyze that data. So this is what we wanted to build with Oracle.
So, what were the basic building blocks of the program?
Oracle has installed an Exadata server here at Pitt and UPMC that can handle large amounts of data and provides fast access to it. We also use Cohort Explorer, a tool that is part of something called the TRC (the Translational Research Center), which helps you do SQL queries on the database. And what I told you when you were here was that we were starting out with a very small, discrete use case. We took a very unique set of patients, 140 breast cancer patients, with tumors that were sent to a national consortium that has sequenced their tumors: the Cancer Genome Atlas, run by the National Institutes of Health. It’s the largest-ever effort to sequence and analyze the genome for cancer patients, and involves doing every molecular test possible on 10,000 tumors. And obviously, for that scale, only the NIH could manage an effort that big. What happens is that multiple sites submit tumors; we send off the actual tissue, and it’s frozen. They then send it out to the data centers, and each data center does something different; some sequence, some measure the gene expression. The three major centers are Baylor College of Medicine, the Broad Institute in Boston, and Washington University in St. Louis.
This is a very complicated structure; you have all these medical centers submitting tissue, with tons of analysis, etc. And it’s taken management to do all this. And it creates large data; the data is now about 720 terabytes.
Yes. When we sequence a single tumor, it generates a terabyte of data. And that’s the raw data; once you start analyzing it, it gets worse. Looking at the sequencing center, Wash U has sequenced 14 petabytes of data already. It’s like a little village. This is why we need new systems, because we are fundamentally changing the way data is used.
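To put the figures Dr. Lee cites in perspective, the back-of-the-envelope arithmetic can be sketched in a few lines of Python. This is only an illustration, assuming roughly one terabyte of raw data per sequenced tumor (his figure) and decimal units (1,000 terabytes per petabyte); the function names are hypothetical, and analysis output, which he notes adds more, is not counted.

```python
# Rough raw-storage estimate, assuming about 1 TB of raw data per tumor.
TB_PER_TUMOR = 1.0   # figure quoted in the interview
TB_PER_PB = 1000.0   # decimal units: 1 PB = 1,000 TB (an assumption)


def raw_storage_tb(n_tumors, tb_per_tumor=TB_PER_TUMOR):
    """Return the raw sequencing data volume in terabytes."""
    return n_tumors * tb_per_tumor


def tb_to_pb(terabytes):
    """Convert terabytes to petabytes."""
    return terabytes / TB_PER_PB


if __name__ == "__main__":
    cohorts = [("140-patient pilot", 140), ("10,000-tumor atlas", 10_000)]
    for label, n_tumors in cohorts:
        tb = raw_storage_tb(n_tumors)
        print(f"{label}: {tb:g} TB raw ({tb_to_pb(tb):g} PB)")
```

Under those assumptions, the 140-patient pilot alone implies about 140 terabytes of raw data, and 10,000 tumors implies about 10 petabytes before any analysis is layered on top.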
Are you involved in the Internet2 initiative?
We likely are, because we have a supercomputing center here. Pittsburgh has the largest supercomputing center in the world, with two times 15 terabytes of RAM. And the supercomputing center is an independent center.
So please tell me more about what you’ve been doing with these particular patients.
So we took those 140 patients… They’re special in that we sent their tumors to this consortium. The University of Pittsburgh and UPMC together are the largest single submitter of tumor tissue; we’ve submitted in all cancer areas. I submitted in breast cancer, because that’s my specialty. The 140 women are breast cancer patients; that was the first use case. But since then, 600 tumors have been submitted in all cancers.
So that makes us unique, because we have 140 patients for whom we know all of their clinical information that sits in the UPMC system, and we know all their genomic or molecular information that sits within the consortium. And the nice thing about the consortium is that that data is all made public; it’s made public to everyone.
All 140 patients have the same mutation?
No, they all have different mutations. Ultimately, once we have the system built, we’ll be able to translate what we’re beginning to learn, into actionable care plans for individual patients. For instance, if you take Tylenol and another patient takes Tylenol, you’re going to respond differently to the Tylenol, you’ll metabolize it differently because of your genetics. So by understanding the differences in tumors, we hope to personalize the therapy; it’s the whole idea of personalized medicine.
How far away are you from the concept of personalized or precision medicine?
As you know, it was in the news recently that the actress Angelina Jolie had a family risk of breast and ovarian cancer, and they sequenced her DNA and found that she had a mutation in a gene, BRCA1; and because she has that mutant gene, she was at 80-percent risk of getting breast cancer. So she had a prophylactic mastectomy. Some women respond well to what’s called a PARP [poly ADP ribose polymerase] inhibitor. The therapy is the inhibitor; you want to inhibit PARP, because they also don’t have BRCA1. Once they lose BRCA1, they become susceptible to PARP inhibition.
So we are in an unparalleled time at the moment in terms of collecting and analyzing data. We now have great tools that help us; but the bottleneck has been around the ability to store data and share it. And we’ll need things like Internet2 to help us. I think most of the supercomputing centers are on that network. It’s not only about the storage of the data, but also about the transfer of the data, and finally, analytics. So we’re trying to solve the storage issue, and so the storage is this warehouse. And the warehouse does two things with this Oracle server: we’re trying to combine together all the clinical information in a central warehouse, and then alongside it, bring together all of the molecular characterization data. And we would then like to rapidly do searches between them, and that’s what this Oracle system is really good at. The architecture allows you to rapidly search clinical data.
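The kind of cross-search Dr. Lee describes, with clinical records on one side and molecular data on the other, comes down to a join on a shared patient identifier. The sketch below is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in for the Oracle system, and the table names, column names, and sample rows are all hypothetical, not the actual warehouse schema or real patient data.

```python
import sqlite3


def build_demo_warehouse():
    """Create an in-memory database with hypothetical clinical and molecular tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE clinical (
            patient_id        TEXT PRIMARY KEY,
            menopausal_status TEXT,
            stage             TEXT
        );
        CREATE TABLE molecular (
            patient_id TEXT,
            gene       TEXT,
            mutated    INTEGER  -- 1 if a mutation was found, else 0
        );
    """)
    # Made-up sample rows for illustration only.
    conn.executemany("INSERT INTO clinical VALUES (?, ?, ?)", [
        ("P001", "pre", "II"),
        ("P002", "post", "I"),
        ("P003", "pre", "III"),
    ])
    conn.executemany("INSERT INTO molecular VALUES (?, ?, ?)", [
        ("P001", "BRCA1", 1),
        ("P002", "BRCA1", 0),
        ("P003", "BRCA1", 0),
        ("P003", "TP53", 1),
    ])
    return conn


def patients_with_mutation(conn, gene, menopausal_status):
    """Return IDs of patients with a mutation in `gene` and the given clinical status."""
    rows = conn.execute("""
        SELECT c.patient_id
        FROM clinical AS c
        JOIN molecular AS m ON m.patient_id = c.patient_id
        WHERE m.gene = ? AND m.mutated = 1 AND c.menopausal_status = ?
        ORDER BY c.patient_id
    """, (gene, menopausal_status)).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    warehouse = build_demo_warehouse()
    print(patients_with_mutation(warehouse, "BRCA1", "pre"))
```

Keeping the clinical and molecular tables separate but keyed on the same identifier mirrors the side-by-side arrangement described in the interview; a production system would add indexing, access controls, and de-identification that this sketch leaves out.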
So where are you now in that process?
We’ve tested the system, and it’s all worked; we kind of tweaked the architecture. It went live [in the third week of June]. So it’s installed on the UPMC system; it uses the Exadata server, and we can now log onto the system called TRC and analyze these 140 patients. So the first goal was, can we load them, and can we execute on them? It was a lot of work, with a lot of people. It took nine months, and we had to overcome a bunch of obstacles. But the fact that we could do it and reach that goal was very important, because this is simply the building block for the bigger picture. Now we’re going to try to load more patients and more data, and move into other cancers; next, we’re going to begin to load ovarian and head-and-neck cancers. The timeline hasn’t been set yet, but it will probably be early next year. The more we can load and the more we can ask the data, the more we can learn.
So working with or manipulating data like this requires this level of supercomputing?
Yes. You can see the scale. The New York Times has run several articles on big data challenges of hospitals in New York. This is like what happened with Google and web search and with Amazon and online retailing. We have a limitless capacity to produce data, but still a limited capacity to store, share, analyze, and use data, and that’s the problem.
What is going to come together in organizations like yours around the country over the next couple of years?
Well, I think you’ll see that the use of the data is going to lead to rapid improvements in care delivery; that’s the end goal for us, to produce personalized care. We will see that on the provider side, just in terms of outcomes. And I think you’ll see that consuming the data and creating knowledge from it will shift us more towards evidence-based medicine. You see this already; we have this health information network in Pennsylvania where the hospitals are sharing data from the network. For example, a patient comes in unconscious, and they know nothing, but there’s data in the HIE.
Sharing information is of great benefit; and you’ll see that in the research sphere. There have actually been recent crowdsourcing efforts, where you use social networking efforts to advance thinking rapidly. Think now about your daily life; there’s virtually nothing you do that doesn’t use a computer. It’s hard to predict in 10 years where we’ll be. Most likely, most of us will be sequenced by then; and likely that data will be in your EMR; this will change the way you see your primary care physician, basically.
And one important point of the project is that it’s scalable. It will grow over time, and hopefully, others should be able to install it and use it in their systems. So I hope we set a model for how others can do it.