Analysis tools for NCI genomic databases

What we were presenting today is a set of analysis tools that is overlaid on the Genomic Data Commons, the NCI Genomic Data Commons, which is the place where we store all the data of large genomics projects that are done either with the NCI or with NCI funding, and even without NCI funding but in the States. That’s where the data resides. The problem is that this data is of a really large scale, petabytes in size, so it has become very difficult for your regular researcher to use the data unless they are in a large institution. Being that the projects are funded by federal taxpayer money, we want everybody to be able to access the data, not only because of how it was paid for but also because this data is gathered from samples that patients have given in order for the data to be available. So if we don’t make it available we are basically lying to our donors. As I said, when it is available only by download it reaches only a very small population.

So what we have done is the overlay, DAVE; everything has an acronym these days, and this one is the Data Analysis, Visualisation and Exploration tool. When you go to DAVE you can select the types of tumours that you want to see, you can select the age of the patients, you can select a lot of features of the patients and then look at what the molecular changes in that tumour are. Then you can start selecting cohorts that you can compare against each other. This is something that is done without downloading a single byte; it’s something that happens on the website, online, and then when your results are aggregated and finalised you can download the data. Obviously this is very useful for researchers but it’s also very useful for patients and patient advocates and clinicians, because once you have the results of your analysis you can go here and try to see if you find cases that match your diagnosis. Then you can look at how they responded to therapy and the different options that you have. So it helps people to understand the information on the tumours and take charge of their treatment.

It’s in the very early stages; we are just dealing with DNA at this point. Very soon we will get the RNA information, and protein is following hot on its heels. So we will keep updating it. One of the things that I always say whenever I do these presentations is I ask people to go and look at it and play with it, even if they are not scientists; it’s very easy to use. Then I want them to tell us what is wrong with it, because if you cannot understand what you can do with the tool then we did something wrong. It needs to be self-evident. I have this idea that if you need to read a 500-page user manual to use a tool the tool is useless; nobody will do that. So this is fairly self-explanatory, and I invite everybody who sees this video to go and try to destroy the system and then tell us all the things that are right or wrong about it. We like hearing, ‘You did a great job,’ I’m not going to tell you that we don’t, but we like hearing more when people say, ‘Oh, but it would be great if we could do this or that.’ Sometimes those things are already on our list of things to do, but many, many, many times we get ideas from users who are not experienced users and we go, ‘Oh, yes, right. We never thought about that but it’s a good idea.’ So the tool only gets better when we get the input from the users.

How would an untrained patient be able to comprehend this data?

Originally when we were setting this up we were thinking of researchers. So you did the selections of your cohorts and we had Excel spreadsheets that researchers understand. But then when we got the first presentations of those I immediately went, ‘Yeah, yeah, yeah, but that’s only for researchers.’ So what we have is you can download the Excel spreadsheets, you can download the files and do the work on your own, but when we do the presentations they’re all graphic presentations, so pie charts galore, graphs that go up and down, things that my fourteen-year-old understands perfectly fine. In fact, I use him as one of my guinea pigs from time to time; I tell him, ‘Okay, get online and tell me what you can see.’ It’s actually very easy to understand, and if the user does not readily understand it we have a lot of support material, mostly in the form of YouTube videos because, again, as I said, a 500-page manual is not helping anybody. But we have recorded sessions in which we demonstrate how you can use it and what your results mean. So if you have any doubts you can go to the videos and then try to learn what’s happening from them.

Do you find there are barriers with using international data?

Most definitely: the laws in different countries. Each country decides to deal with privacy, and privacy is the main issue, in different ways. There are some countries that don’t have privacy laws whatsoever; there are some countries that have privacy laws that cover only health, and the US tends to be that one. European countries have very stringent privacy laws. None of those models is wrong, it all depends on what the country wants to do, but the system needs to operate in the context of going across national borders because there are many places where this data resides. So we do believe that being able to do the analysis online without downloading the files helps solve the problem of privacy, because you can go, check the data, aggregate it, basically make it even more de-identified than it already is, because now you have a lot of patients that are reported together instead of a single one, and then you can download the result without ever having to download the file itself. So you maintain privacy, and it solves the issue of crossing borders where the rules are different, because you are doing everything basically in the cloud, if you want to put it that way. You actually never got the file. So that has been one of our guiding principles, basically, and it’s the guiding principle of the Safe Harbour idea.

So I do believe that this is the way of the future: we are going to be exchanging data more and more without actually exchanging the data, just doing the analysis of the data where it sits and then bringing the results back. Because, really, if you trust that the data was done correctly you want to do the analysis. If you don’t trust that the data was done correctly you probably don’t want to add it into your analysis anyway. So you don’t really need the file to verify that it was done correctly.

Are there many other organisations you are working with on this?

Most definitely. In the genomics space there’s GA4GH, the Global Alliance for Genomics and Health, which is an alliance of many, many countries. We are trying to decide on common standards that everybody should agree on. We are part of GA4GH; NCI is one of the founding members. So we most definitely follow the rules, but the interesting thing is that when GA4GH was put together there were no systems dealing with these large datasets. GDC was probably the first one to do it, and we did it without the guidance being there because it was still being discussed. But then after the success of the database it was basically kind of a feedback loop, where we did it thinking of what GA4GH would like to see and then GA4GH saw the database and went, ‘Ooh, this is what we want to do.’ So we are feeding off each other.

But in genomics, with the International Cancer Genome Consortium and other groups like that, we have been participants since the very beginning; we have been founding members of all those organisations, and we do believe that the only way to really, truly understand cancer is to go across national frontiers. So we have to do it in collaborations. What a gastric cancer looks like in the United States is very different from what a gastric cancer looks like in Japan, but that doesn’t mean that we cannot learn and apply the lessons we learn from one ethnic group to the other. So it is vital to be able to participate in these, if you want, ‘UN of genomics’ types of organisations.

AACR 2018
