Beyond p-values: When does clinical impact outweigh statistical significance?
Prof Matthew Sydes - University College London, London, UK
GU ASCO is an educational meeting. I think over the years it has shifted to include more primary results than it used to. Last year they decided to include a lunchtime session, an optional session, run by the excellent statistician Steve Goodman, and apparently people were queuing out of the door in order to attend. So they decided to put something in the main session this year to provide an educative element about basic clinical trial statistics, and they were kind enough to ask me to do it. So the aim yesterday was really to help people bob back up to the surface with their statistical knowledge: not to get into anything new and complicated but to give people a refresher.
So the title they gave me yesterday was Beyond p-values. It was about clinical and statistical significance which is a complicated topic to cover in twenty minutes, particularly bearing in mind the basics you’re asking me to cover as well. The argument I was trying to make is it’s a false dichotomy to pick on those two things. What we’re really looking for is something that’s clinically important as well as being statistically significant. So by statistically significant we were trying to get to the idea that this is convincing data that people are looking at or how convincing is the data in terms of the strength of the evidence that people have collected? But we have to put that in the context of the clinical relevance of the data that people have collected – are they seeing an effect that would be clinically important?
So I drew people this little matrix about clinically important results and statistically convincing results and suggested that they should be asking themselves the question: how important is the effect that has been determined and how convinced are they that this is a real effect? So if the answer is yes and yes, it’s positive to both of those, it should be in the top right hand corner of this matrix: clinically relevant and statistically convincing. And the example I gave was Karim Fizazi’s LATITUDE data from 2017 but there are many things that could fit into that box.
Sometimes we’ll have things that are clinically important, where they’re seeing a big effect but it’s statistically uncertain. At the moment if the p-value is greater than 0.05 people have a tendency to think that there’s no effect. Well, actually, there might be an effect; what we haven’t got is convincing evidence that there’s an effect. So if they see that, what they should be doing is going away and getting more evidence or putting it in the context of evidence that we’ve already got.
Then we’ve also got things that may be statistically convincing but the effect is too small to be clinically important. Is it a good use of the effort we’ve gone to, or a good use of time and money, to target those kinds of effects?
I’ve been involved with STAMPEDE for many years. It’s worked out to be a rather remarkable trial. The concept of the multi-arm, multi-stage approach was really that of our director, Max Parmar, back in the early 2000s. The concept, and it’s one I’ve very much grown up believing in, is if you’ve got many things that are worthy of looking at then rather than setting up competing trials or doing things in sequence you should put them all into one trial. Given that we know that most new treatments don’t work we should be using a series of interim analyses to see if we can move away early from things that aren’t going to be sufficiently interesting. So that’s the basics of the multi-arm, multi-stage approach.
Then we’ve grown that to start to draw in new questions. If you’ve got something that’s important and worth asking and there’s already an ongoing trial, rather than set up a competing trial we should try and incorporate it.
So with the clinical aspects really led originally by Nick James, who is a remarkable chief investigator and a man with boundless energies, and then drawing in other excellent clinical experts, STAMPEDE has gone on to ask a number of questions over the years.
The most recently reported comparison, first result of a comparison from STAMPEDE, was the M1 radiotherapy comparison where the lead comparison chief investigator was Chris Parker from the Royal Marsden, the Institute of Cancer Research, in Sutton in the UK. This was a trial looking at men who had metastatic prostate cancer who were starting long-term hormone therapy for the first time. Now, for men with localised prostate cancer we know that generally it’s a good idea to give radiotherapy to the prostate. This was now looking at whether for men whose disease had spread elsewhere, should we also be irradiating the prostate. There was good biological rationale to support that. We recruited just over 2,000 men to the study so we’ve got lots of power there to assess the question. What we found was that overall there was no good evidence of an effect to support the use of radiotherapy.
There was a smaller Dutch trial called HORRAD that reported a few months before STAMPEDE and they had broken patients down by whether they had high volume metastatic disease or low volume metastatic disease, whether it had spread to many places or only spread to a few places. What they saw was a very substantial effect for men whose prostate cancer had spread a little bit but no effect for men whose prostate cancer had spread a lot. So before we looked at the data in STAMPEDE we pre-specified that we would do that subgroup analysis as well in the patients for whom we could determine metastatic volume, which was the majority of them. So we worked out what power we would have separately for each of those comparisons, and then as well as the overall effect we reported the effect in those subgroups, really saying that if there’s going to be an effect and it’s not in everybody then we believe it would be in the low volume patients. And that’s what we saw: a strong effect on overall survival for the men with low volume metastatic disease. There was no evidence of an effect in men with high volume metastatic disease. The point estimates we saw in both those groups were almost identical to what HORRAD saw.
One of the things I went through yesterday was a simple sample size calculation with the audience. I think it’s quite helpful for clinicians to remember how these are calculated. So the example I gave was that we imagined a condition where median survival is four years: half of the people are going to live longer, half the people are going to live less time than four years. We were going to target a 25% relative improvement, a hazard ratio of 0.75. That’s a pretty standard effect size that one might look for in cancer trials. Now, to work out how many patients we need there are a number of characteristics that one is going to put in. So we need to put in our α and our β. Our α relates to our type 1 error. Effectively a type 1 error is saying that something is going on when nothing is going on, it’s kind of a false positive. Our α is our acceptable risk of making that type 1 error and we usually set that as 5% for traditional reasons. Our β is our acceptable risk of making a type 2 error. A type 2 error is the converse of that, that’s saying that nothing is going on when actually it is, of missing something, a false negative. So we generally take that as slightly more relaxed, 10% or 20%, and the power is just 1 take away that, so 90% or 80%. These are traditional values, you can pick any values.
So if we take the α and the β of, say, 5% and 20% and we put them together, I’m sure it was 20% I used, we put them together with that hazard ratio we’re looking for, what we get is a number of events that we require in order to see an effect. So the number of events was something like 382. By events we’re talking here about deaths in this example. So it’s miserable: you’re recruiting patients until you’ve got 382 deaths. It’s real people and I hate thinking about it in those terms. But how many patients do you need to recruit in order to do that? Well, that depends on that survival, effectively the event rate, the rate at which the events are going to come in; the accrual rate, how quickly you recruit patients and over how long; and how quickly you want an answer. So I gave an example of what if we recruit for three years and we want an answer in five years? Put that together with four years’ median survival to look for those 382 events and we needed… I can’t remember the number, it was about 900-1,000 patients you’d need to recruit at a rate of about 300 patients per year.
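As a rough illustration of that calculation, here is a minimal Python sketch. It is not the calculation used in the talk: it assumes Schoenfeld's approximation for the number of events, a two-sided α, 1:1 allocation, exponential survival and uniform accrual, and the function names are made up for this example. Different approximations give slightly different event counts, which is why it will not land exactly on 382.

```python
import math
from statistics import NormalDist


def events_required(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate events needed for a 1:1 two-arm trial (Schoenfeld's formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type 1 error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)


def prob_event(hazard, accrual_years, total_years):
    """Chance that a patient has died by the time of analysis.

    Assumes exponential survival and patients entering uniformly over the
    first `accrual_years` of a study analysed at `total_years`.
    """
    min_follow_up = total_years - accrual_years
    mean_survival = (math.exp(-hazard * min_follow_up)
                     - math.exp(-hazard * total_years)) / (hazard * accrual_years)
    return 1 - mean_survival


def patients_required(hazard_ratio, median_control_years,
                      accrual_years, total_years, alpha=0.05, power=0.80):
    """Patients to recruit so the expected deaths reach the event target."""
    events = events_required(hazard_ratio, alpha, power)
    control_hazard = math.log(2) / median_control_years
    research_hazard = control_hazard * hazard_ratio
    p_control = prob_event(control_hazard, accrual_years, total_years)
    p_research = prob_event(research_hazard, accrual_years, total_years)
    # Equal allocation, so average the event probabilities of the two arms.
    return math.ceil(events / ((p_control + p_research) / 2))


events = events_required(0.75)
patients = patients_required(0.75, median_control_years=4,
                             accrual_years=3, total_years=5)
print(events, patients, round(patients / 3))  # events, patients, patients per year
```

Under these assumptions the sketch gives about 380 events and somewhere between 900 and 1,000 patients, which is in line with the figures quoted from memory above.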
So that gives us a sense and then from there one can play around, one can investigate the impact of, say, looking for a larger effect or for a smaller effect. If you look for a larger effect, a hazard ratio of 0.65, it’s much easier to pick out the signal from the noise and you’d need far fewer patients but you’ve got to think about whether the effect size is plausible.
If you look for a smaller effect it may be more modest, you’ve got to ask whether it’s clinically relevant, but you need many, many more events and therefore many, many more patients. So a hazard ratio of 0.85, a much more modest effect which translates into an improvement in median survival from four years to about four years and eight months, would require nearly three times the number of events, or was it four times the number of events, and three times the number of patients. So suddenly you go from 900 to 1,000 patients to nearly 3,000 patients. In three years that’s impossible for most people. So it’s about thinking through what the implications are and then being generous when you read people’s papers, thinking about what they’ve done to try to get the study that they’ve then presented.
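To get a feel for how sharply the event target moves with the effect size, here is a small Python sketch comparing hazard ratios of 0.65, 0.75 and 0.85. It is an illustration under assumptions rather than the calculation from the talk: it uses Schoenfeld's approximation with a two-sided α of 5%, 80% power and 1:1 allocation, and it converts the hazard ratio into a median survival by assuming exponential survival, where the median scales as one over the hazard ratio. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def events_required(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate events needed for a 1:1 two-arm trial (Schoenfeld's formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type 1 error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)


def improved_median(median_control_years, hazard_ratio):
    """Median survival on the research arm, assuming exponential survival."""
    return median_control_years / hazard_ratio


baseline = events_required(0.75)
for hr in (0.65, 0.75, 0.85):
    events = events_required(hr)
    # Express the improved median in whole years and months for readability.
    years, months = divmod(round(improved_median(4, hr) * 12), 12)
    print(f"HR {hr}: {events} events ({events / baseline:.1f}x the HR 0.75 target), "
          f"median about {years}y {months}m")
```

With these assumptions a hazard ratio of 0.85 needs roughly three times the events of 0.75, while 0.65 needs fewer than half, and a four-year median moves to about four years and eight months at 0.85.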
It’s really important in any study that we choose the right outcome measures. Now we know in many instances we want to look at overall survival, that’s true in cancer trials. It’s a very objective outcome measure and, where prognosis isn’t very good, then it’s very easy to get to those events, miserably, right? Where patients are doing better then we need to choose different outcome measures and we need to agree what those outcome measures are. So Chris Sweeney and the ICECaP group have been starting to look at harmonisation of outcome measures. There’s also, in prostate cancer and in many, many other disease settings, not just in cancer but beyond, a lovely body of work called COMET, which is Core Outcome Measures in Effectiveness Trials, led by Paula Williamson. The work was initially funded by the Medical Research Council’s network of hubs for trial methodology research and then got funding from the European Union. Effectively it’s trying to define core outcome sets that should be used to support research: things we should collect for all patients, regardless of what the outcome measures are of the trial, to support meta-analysis and to support our facilitation of patients. So loads of MD students and PhD students are out doing research trying to determine core outcome sets that should be collected. So whenever one is thinking about what outcome measure to use, one could turn to COMET and look.
There’s also a really great piece of research from the DELTA2 consortium; the first author was Jonathan Cook from Oxford. They set out guidance on how to choose a primary outcome measure for a clinical trial. One of the key bits that jumps out at me is choosing an outcome measure that’s really important to at least one key stakeholder group, an outcome measure that really makes a meaningful difference with a justifiable improvement for a group. By a group I’m thinking that the patients are a really important part of that, not just the practitioners’ perspective but the patient perspective as well, what’s important.
Think about examples where we may have a potentially clinically important effect but we’re not sure whether it’s statistically convincing overall. That’s really what phase II studies are. Phase II studies are looking at early outcome measures; we see something that may be clinically important, it’s statistically convincing within the outcome measures we pick, but we then go off and collect more evidence. So a nice example of something we should potentially explore further is the combination of zoledronic acid and celecoxib that we looked at amongst the original five research arms of STAMPEDE. There, when we looked at our interim outcome measures, it wasn’t looking sufficiently interesting and our Data Monitoring Committee looked at our predefined criteria and said, ‘That’s not interesting enough. You should probably stop recruiting.’ So we stopped recruiting those patients and we followed them up; they continued to have the treatment, or most of their treatment. And then we reported a paper in The Journal of Clinical Oncology in 2017; the first author was Malcolm Mason. What we saw for that combination in a subset of metastatic patients was a hazard ratio of 0.78, I think it was 0.78, very close to the hazard ratio of 0.75 we were targeting, which could be a clinically relevant effect, and statistically there was some evidence that some people might find convincing.
Personally, given that we’d stopped early, I think it’s an intriguing result, clinically and statistically. So it’s not going to define practice, we wouldn’t want to claim a win for it because of the design of the study, but it is the sort of thing people might look at and say, ‘We should go and get some more data.’ Our conclusion in the paper was that where people are already using zoledronic acid or other bisphosphonate drugs, then potentially they should go and do studies looking at other COX-2 inhibitors.
There are folk within the STAMPEDE trial management group who look and say, ‘Actually, we should have another go. Within STAMPEDE we could look again. We could do another study looking at bisphosphonates and a COX-2 inhibitor and collect more information.’ So whether that would be more believable done by another group or whether it would be done efficiently by putting it back into STAMPEDE is up for discussion. But I think that’s a nice example of something that’s intriguing and we need to go and get more data.