Big data, machine learning and drug discovery in oncology

Published: 18 Dec 2020
Prof Bissan Al-Lazikani - Institute of Cancer Research, London, UK

Prof Bissan Al-Lazikani speaks to ecancer in an online interview for the virtual ENA 2020 meeting about her work on big data, machine learning and drug discovery in oncology.

Prof Al-Lazikani talks about her work figuring out how to use vast amounts of interdisciplinary data to inform drug discovery and development decisions. She explains that the aim is to minimise risks and avoid misinvestment in targets or projects that may be dead ends.

She highlights the canSAR database, an integrated knowledge base that brings together experimental and clinical measurements from across different fields relevant to drug discovery. She then mentions some examples of how these techniques have been applied.

Prof Al-Lazikani goes on to discuss why it is important to have interdisciplinary data, how this work avoids having a disconnect with real-world data or practice, and how she can see this field developing in the foreseeable future.

Drug discovery is a notoriously difficult field because, if we think about it, it's actually trying to make really important decisions, decisions that will end up costing not only a lot of money and a lot of time but potentially opportunities, based on imperfect information and imperfect data. So a lot of my work over the past ten years has focussed on how we can use the vast amounts of interdisciplinary data that relate to drug discovery to inform drug discovery and development decisions in a really objective, data-driven way. The aim is to help us find novel, innovative paths to novel therapeutics while at the same time minimising the risks and the misinvestment in targets or in drug discovery projects that are less likely to bear fruit at the end of a long activity.

So in my talk I really describe our thinking on this. I describe firstly the establishment of our canSAR database which next year will be celebrating its tenth anniversary. And canSAR is a massively integrative knowledge base that brings together tens of billions of experimental and clinical measurements from across many, many disparate and interdisciplinary fields, all of which bear on drug discovery. So this could be information from the clinic, from large-scale genomic profiling and other molecular profiling information of cancer patients, all the way to medicinal chemistry, structural biology, systems biology, pharmacology and much more.

So firstly I describe how we bring together all of this information into a single unified knowledge base that allows the researcher to investigate all of these areas very rapidly and identify connections and relationships that may otherwise not be easy to find if you are just looking around in the literature.
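
To give a flavour of how this kind of integration can surface connections, here is a minimal sketch, assuming a simple graph representation built with networkx in Python. The node identifiers, edge types and the two-hop query are illustrative assumptions for this example only, not canSAR's actual schema or implementation.

```python
# Minimal sketch (illustrative only): heterogeneous drug-discovery evidence
# held as one graph so connections across data types can be queried quickly.
import networkx as nx

kb = nx.Graph()

# Nodes from different disciplines: a gene, a compound, a structure, a trial.
# Identifiers below are illustrative placeholders.
kb.add_node("GENE:AR", kind="gene")
kb.add_node("COMPOUND:example_inhibitor", kind="compound")
kb.add_node("STRUCTURE:example_pdb_entry", kind="structure")
kb.add_node("TRIAL:example_trial", kind="clinical_trial")

# Edges encode experimental or clinical relationships between those entities.
kb.add_edge("GENE:AR", "COMPOUND:example_inhibitor", relation="inhibited_by")
kb.add_edge("GENE:AR", "STRUCTURE:example_pdb_entry", relation="has_structure")
kb.add_edge("COMPOUND:example_inhibitor", "TRIAL:example_trial", relation="tested_in")

# A simple query: everything within two hops of a gene of interest, i.e.
# relationships a researcher could easily miss scanning the literature by hand.
nearby = nx.single_source_shortest_path_length(kb, "GENE:AR", cutoff=2)
for node, distance in sorted(nearby.items(), key=lambda item: item[1]):
    print(distance, node, kb.nodes[node]["kind"])
```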

Then, secondly, I describe a whole suite of AI and machine learning technologies that we've developed to learn from these billions of interconnected data points and really help the drug discoverer generate the right hypotheses, design the best and most critical experiments in the lab, and then pursue them. Then I describe how we have actually applied this over the past ten or so years, not only in our own drug discovery but also in many publications we've put out, to show how one can delve into the total mess that can come out of, for example, a patient profiling or functional genomics effort, where so much beautiful data is produced but there is chaos in trying to select which target to go after, which gene is most likely to eventually deliver the best and most innovative drug, and in designing those experiments.
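
As a rough illustration of the general idea of learning to prioritise targets from integrated features, here is a minimal sketch assuming a standard scikit-learn workflow. The feature set, the toy training labels and the choice of a random forest are all assumptions made for this example; they are not the actual canSAR machine learning pipeline.

```python
# Minimal sketch (illustrative only): score candidate drug targets using a
# classifier trained on features drawn from disparate data types.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-gene features:
# [alteration frequency in tumours, structural druggability score,
#  network connectivity, number of chemical probes available]
X_train = np.array([
    [0.40, 0.9, 0.7, 12],   # targets that historically proved tractable ...
    [0.25, 0.8, 0.5,  8],
    [0.05, 0.1, 0.2,  0],   # ... versus targets that went nowhere
    [0.02, 0.2, 0.1,  1],
])
y_train = np.array([1, 1, 0, 0])  # 1 = tractable, 0 = dead end

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score new candidate genes coming out of, say, a patient-profiling screen.
candidates = {"GENE_A": [0.30, 0.7, 0.6, 3],
              "GENE_B": [0.10, 0.2, 0.1, 0]}
for gene, features in candidates.items():
    score = model.predict_proba([features])[0, 1]
    print(f"{gene}: prioritisation score {score:.2f}")
```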

So I describe how we did that with some examples. I describe how, by using this, we really can identify new opportunities that were almost hidden in plain sight, just because of the noise generated by all the data around them, and exemplify how we validated these experimentally. I also describe how you can design, and think about, the best experiments. So this isn't just about selecting the winner; it's also about identifying the dead ends as early as possible so that you don't waste too much time on them.

Then after that I describe how we applied this technology in collaboration with prostate cancer clinicians, in order to take a very objective, comprehensive view of the disease networks and the communication within prostate cancer cells, and to identify the best drug targets to go after in this field, whether established targets or newly druggable ones.
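
As a simple sketch of the general network idea, not the published prostate cancer analysis itself, one could rank proteins in a disease signalling network by how much they mediate the network's communication and then filter to those flagged as druggable. The toy edges and the druggability annotations below are illustrative assumptions.

```python
# Minimal sketch (illustrative only): rank druggable nodes in a disease
# network by betweenness centrality, a simple proxy for how much each node
# mediates communication in the network.
import networkx as nx

disease_net = nx.Graph()
disease_net.add_edges_from([
    ("AR", "FOXA1"), ("AR", "HOXB13"), ("AR", "NCOA2"),
    ("PTEN", "PIK3CA"), ("PIK3CA", "AKT1"), ("AKT1", "AR"),
])

# Hypothetical annotation of which proteins look tractable for drugging.
druggable = {"AR", "PIK3CA", "AKT1"}

centrality = nx.betweenness_centrality(disease_net)

ranked = sorted(
    (n for n in disease_net.nodes if n in druggable),
    key=lambda n: centrality[n],
    reverse=True,
)
for node in ranked:
    print(f"{node}: centrality {centrality[node]:.2f}")
```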

Why is it important to have interdisciplinary data?

A key thing about the importance of this interdisciplinary data is that historically drug discovery has been a field where almost every group of people does its own activity and then almost throws it over a fence to the next group. So if we think about it in this molecularly driven world that we have, you have the biologists, or even clinicians, who do the big molecular profiling or functional genomic experiments, whichever they are, come up with the hypotheses, perhaps do some target validation at their end, and then at some point, in effect, hand it over to the drug discovery team, who then think about hit generation, if it's small molecules for example, do the screening and devise the best screening technologies. There you're often having to work between chemistry and structural biology, but then at some point it just gets handed over to the next stage, and then handed over to the next stage again.

Now, of course, the disadvantage of that is that when you're really far down the line, three years and millions of dollars later, you find yourself asking, 'Well, what should I have done really early on to make this better or to refine my hypotheses?' That's why what we try to do with our approach and our knowledge base is to bring all the questions you're going to ask along that path right to the beginning. So you can ask the questions about the structural biology, the systems biology, the model organisms, the assay types, the assay signals, the biomarkers, all of those, right at the beginning. This has two advantages. The first, and the key one I think, is that this way, right at the beginning, you can start to identify where the likely pitfalls are and do the right experiments really early on to find out whether they really are pitfalls or whether they actually have a solution. But it also really crystallises the thinking about the best experiments and the best route, and speeds up the process altogether.

How do you ensure there isn’t a disconnect with real-world data and practice?

It's really interesting. One of the reasons for developing this approach and bringing together these tens of millions of experimental data points in one place is precisely to minimise this disconnect with observations in the real world. Increasingly we're bringing in information that reports, for example, what clinicians may be observing in the clinic beyond the initial studies. The really lovely thing about this is that, like I said, it helps you map out relationships that you may not necessarily come across in the sanitised preclinical laboratory setting but that really direct you to things you do then go on to observe in the real world, so to speak.

However, an important point to bring up is how dependent we are on the data that currently exist. One of the problems we all have as a field is that data are so inconsistent between different areas. There are areas of biology that are beautifully well studied, with vast swathes of structural biology, molecular profiling, biomarkers and pathway mapping all happening, and then there are other areas of biology, equally valid for cancer research, about which there is so little information. The same goes for chemistry, clinical information and so on.

That's an exciting challenge in the sense that we're starting to develop methodologies that try to identify the missing information and guide us towards the experiments needed to fill it in. But it also really points to the importance of the community as a whole sharing more information and more knowledge, and actively going to fill in the gaps in our knowledge rather than working on the same areas again and again.
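
One very simple way to picture that kind of gap analysis is to tabulate how much evidence of each type exists per gene and flag genes with none. This is a minimal sketch using pandas; the genes and counts are invented placeholders, not real coverage figures, and the real methodologies described here are considerably more sophisticated.

```python
# Minimal sketch (illustrative only): flag genes with no evidence in one or
# more data types as candidate knowledge gaps for the community to fill.
import pandas as pd

coverage = pd.DataFrame(
    {"structures": [85, 0, 2],
     "bioactive_compounds": [4200, 3, 0],
     "clinical_trials": [60, 0, 0]},
    index=["WELL_STUDIED_GENE", "UNDERSTUDIED_GENE_1", "UNDERSTUDIED_GENE_2"],
)

# Rows with a zero anywhere indicate a missing class of evidence.
gaps = coverage[(coverage == 0).any(axis=1)]
print(gaps)
```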

How do you see this developing in the foreseeable future?

The way I see it, really my dream, is to have fluidity in the information flow from the clinic back to drug discovery and drug development and back to the clinic, and to use AI technologies and knowledge bases as, essentially, the engine oil that helps this very rapid transfer of information. Cancer, as we know, is a disease that is fuelled by change; it's fuelled by evolution. The cancer is always one step ahead, and the only way to beat it is to make sure that there is a free information flow from the clinic to the pre-clinic and back, all the time, and across all of those areas of drug discovery and development that I mentioned. That's what I see in the future: real, big, strong empowerment of this information flow, decision making and communication between these different parts, fuelled, powered and oiled by data and AI.