When Can We Be Confident about Estimates of Treatment Effects?

DR. GUYATT: Thank you very much. There was increasing recognition, as Dr. Montori has described, of the limitations of relying on study design alone and the necessity of taking other factors into account. That was, in part, the motivation behind the formation of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group and the GRADE approach to assessing quality of evidence that has emerged since publication in the British Medical Journal in 2004.2 Dr. Schünemann, would you like to tell us how GRADE works and how GRADE conceptualizes and deals with confidence in estimates, otherwise known as quality of evidence?

DR. SCHÜNEMANN: Dr. Guyatt has described how confidence in estimates of effect is important for decision making, using an example from the medical literature on hormone replacement therapy. Dr. Glasziou described the origin of evidence hierarchies, and Dr. Montori then described how knowledge has changed over time and how we now look at the overall confidence in estimates of effect by focusing on more than just the original study design.

One of the underlying concerns with these evidence hierarchies has been that over a period of approximately 15 to 20 years, over 100 such systems were developed, and some of these systems were conceptually inappropriate. In 2000, a group of international scientists, practitioners, public health researchers, and individuals from many different disciplines got together and formed the GRADE Working Group.2 The GRADE approach has now been adopted by over 70 organizations, and the group has, in particular, analyzed the factors that make us more confident or less confident in estimates of treatment effect or management effect.

The group has also determined the factors that influence our confidence when we move from evidence to recommendations; in other words, the factors that influence the development of healthcare recommendations. The group conducted these tasks systematically by exploring all factors that might increase or decrease our confidence. My colleagues have already referred to what we typically call risk of bias. This is when studies, although described as having a certain study design, still have limitations, and we would lower our confidence in the results of these studies. We’ve also talked about inconsistency between studies and the lack of applicability of study conditions and outcomes to clinical situations.

There are 2 additional factors (from a total of 5 factors) that would lower our confidence in a body of evidence. The first of these is concerned with the number of patients in the studies, or the number who experience outcomes, being very small, resulting in lower precision.

The fifth factor that lowers our confidence in the estimates of effect is concerned with studies, for whatever reason, remaining unpublished. In other words, their results remain unknown. This is typically the case when studies do not validate the investigators’ anticipated effects. This publication bias can lead to an important distortion of the overall estimate of effect.

Grading of Recommendations Assessment, Development and Evaluation looks at 5 factors that can lower your confidence in the estimates of treatment effect. This is particularly relevant when we look at the initial study design. There is wide agreement that randomized controlled trials give us higher confidence when we start looking at evidence because randomization is the best way to protect against bias and confounding. Therefore, in GRADE, randomized controlled trials are initially classified as high-quality evidence and observational studies as low, or "two plus," quality of evidence. In GRADE, there are a total of 4 categories of quality of evidence, with randomized controlled trials starting at the very top and observational studies starting in the second-lowest of these 4 categories.

If there are no important limitations or reasons to lower our confidence in the estimates of effect in observational studies, then, as Dr. Glasziou described, there may be reasons for why we are more confident in the estimates of effect from observational studies. There are indeed 3 factors in GRADE that allow us to have higher confidence in estimates of effect. I will just mention one of them quickly.

Large treatment effects or management effects generally increase our confidence that an intervention has a positive influence on an outcome; think about the use of insulin in diabetic ketoacidosis. Although there are no randomized controlled trials, the observations indicate that insulin significantly improves the outcomes of diabetic ketoacidosis, ie, prevents death and complications from diabetic ketoacidosis, and it does so with a very large effect.

There are 5 factors that lower our confidence in the estimate of effect. According to GRADE, the evaluation is done for each of the outcomes determined as important for decision making by a guideline panel. An overall estimate of the confidence in these estimates of effect for a given healthcare question is obtained by GRADE.
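The rating logic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not an official GRADE tool: the function name, the factor labels, and the simplification that each concern lowers the rating by exactly one level are assumptions made for the example. In practice, rating up or down is a matter of panel judgment, and only one of the 3 rating-up factors (a large effect) is named in this discussion, so rating up is represented here as a simple count.

```python
# Illustrative sketch only; names and the one-level-per-concern rule are assumptions.

# The 4 GRADE categories, ordered from lowest to highest confidence.
LEVELS = ["very low", "low", "moderate", "high"]

# Shorthand labels (hypothetical) for the 5 factors that lower confidence.
DOWNGRADE_FACTORS = {
    "risk_of_bias",
    "inconsistency",
    "indirectness",      # lack of applicability to the clinical situation
    "imprecision",
    "publication_bias",
}


def rate_confidence(design, concerns=(), upgrades=0):
    """Return a GRADE-style confidence level for ONE outcome.

    design   -- "randomized" (starts at high) or "observational" (starts at low)
    concerns -- labels from DOWNGRADE_FACTORS judged to be serious
    upgrades -- number of rating-up factors judged to apply (0 to 3)
    """
    if design == "randomized":
        start = LEVELS.index("high")
    elif design == "observational":
        start = LEVELS.index("low")
    else:
        raise ValueError(f"unknown study design: {design!r}")

    concerns = set(concerns)
    unknown = concerns - DOWNGRADE_FACTORS
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")
    if not 0 <= upgrades <= 3:
        raise ValueError("upgrades must be between 0 and 3")

    # Assumption: every serious concern lowers the rating by one level.
    level = start - len(concerns) + upgrades
    # Clamp to the 4 available categories.
    level = max(0, min(level, len(LEVELS) - 1))
    return LEVELS[level]


if __name__ == "__main__":
    # The evaluation is done separately for each important outcome.
    print(rate_confidence("randomized"))                                     # high
    print(rate_confidence("randomized", ["imprecision", "publication_bias"]))  # low
    print(rate_confidence("observational", upgrades=1))                      # moderate
```

The example makes one point from the discussion concrete: study design sets only the starting level, and a body of randomized trials with serious limitations can end up rated no higher than observational evidence with a very large effect.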

As Dr. Montori indicated, one of the big contributions of the GRADE Working Group has been to shift the thinking in guideline panels towards outcomes that are important for the patient and towards defining which outcomes are important or unimportant for decision making. GRADE makes another important contribution by helping define the factors that then influence the development of recommendations. This is based on 4 factors. The first looks at the balance between benefits and downsides, after having determined what outcomes are important for decision making.

The second is to evaluate how the intervention or management strategy influences the utilization of resources. The third is how important all of the outcomes are relative to each other; for example, a management strategy may reduce mortality, with temporary nausea as the side effect. Under such circumstances, it is likely that people affected by the recommendation would place a higher value on preventing mortality than experiencing temporary nausea.

The fourth factor is how confident we can be in any of the estimates that affect decision making, ie, moving from the evidence to recommendations. The GRADE approach then provides the strength of recommendations and specific recommendations after evaluating these 4 factors and focusing on what is important to patients. This approach has been used in probably over a thousand recommendations by now and is, as I indicated earlier, widely disseminated.

DR. GUYATT: Thank you very much. I think we’re going to have opportunities to talk about the strength of recommendation issues in another one of these sessions. Let’s focus on the confidence in estimates, or quality of evidence, issues. Does anybody want to comment on what anybody else has said up to now?

DR. MONTORI: One of the things that Dr. Schünemann just pointed out, which I think was also very poorly recognized at the beginning but is now a really important issue, is the general notion of the corruption of evidence. One of the aspects of confidence in estimates that we haven’t really gotten into concerns the studies that are not available in the published record, also called publication bias. What you see published and disseminated beyond the medical journals to the media and public is only a subset of the research that is being conducted. The fact that new approaches, such as GRADE, take that practice into account improves our confidence in the estimates of effect.

This has been critical in areas such as the use of antidepressants, where evidence highlighting the harmful effects of antidepressants was essentially hidden from public view. Estimates that were presented to the Food and Drug Administration were not the same estimates that were published in the medical journals; the latter suggested that antidepressants were more potent. This is the critical element of corruption of the evidence, which is taken into account by approaches such as GRADE. The related element is the issue of fraud, which thankfully, continues to be a smaller problem for which there is, to my knowledge, no specific solution except to promote ethical conduct of research, transparency, and accountability.

DR. GUYATT: Would anyone else like to comment at this point?

DR. GLASZIOU: On Dr. Montori’s point there, I think the major message is that practitioners should be cautious and a little skeptical about any new recommendations that they see because of the difficulties of getting high-quality evidence on the basis of published literature alone and the awareness of problems such as publication bias.

This is even more important for full-time clinicians. If you’re looking at a guideline or a set of recommendations, the basic first question you need to ask is whether the writers have used any sort of hierarchy of evidence in their processing. If they haven’t, then how do they sift through the evidence to find what’s good and what’s bad? If they have used the more traditional approaches, whose history I went through earlier, then that’s a step forward, but it’s still using a pretty primitive sort of tool. The more the writers shift towards something that looks like the GRADE approach described by Dr. Schünemann, the more confidence you can have that the group developing the guideline has used a modern approach to find the best quality evidence, giving you a reliable grading of that evidence.

DR. SCHÜNEMANN: I have to follow up on what both Dr. Montori and Dr. Glasziou said. It also becomes clear that it is extremely challenging for most practitioners to perform this evaluation by themselves, because of how sophisticated research methodology has become and how well equipped one has to be in order to identify flaws, as well as, perhaps, some intentional suppression of data. In this context, obviously the work of the GRADE Working Group is important. It becomes critical to focus on whether those who have developed recommendations have used an appropriate approach.

In addition, the issue of faith in this approach, as well as the trust in those developing recommendations becomes an extremely important issue. Therefore, those who carry out evidence assessment and evaluation need to do this appropriately because practitioners will have to put increasing trust in such evaluations, since it is simply infeasible to do this by themselves in their practice.

DR. GUYATT: Dr. Schünemann has just emphasized the need to look at preprocessed evidence and particularly in the format of guidelines, and I think we would all advise people that one test of whether the guidelines have handled things appropriately is use of the GRADE approach. There are other approaches, which have similar elements, but the wide adoption of the GRADE approach and how carefully it’s been thought out leaves it as a good litmus test for giving us greater confidence in the recommendations in a guideline.

Practitioners will certainly want to keep in mind that to make optimal decisions, you need systematic summaries of the best evidence. GRADE is also applied appropriately to systematic reviews, which are systematic summaries of the best available evidence. Does anybody want to comment on the use of GRADE in the context of systematic reviews?