Interim analyses of randomised controlled trials are sometimes published before the final results are available. In several cases, the treatment effects were noticeably different after patient recruitment and follow-up completed. We therefore conducted a literature review of peer-reviewed journals to compare the reported treatment effects between interim and final publications and to examine the magnitude of the difference.
We performed an electronic search of MEDLINE from 1990 to 2014 (keywords: ‘clinical trial’ OR ‘clinical study’ AND ‘random*’ AND ‘interim’ OR ‘preliminary’), and we manually identified the corresponding final publication. Where the electronic search produced a final report in which the abstract cited interim results, we found the interim publication. We also manually searched every randomised controlled trial in eight journals, covering a range of impact factors and general medical and specialist publications (1996–2014). All paired articles were checked to ensure that the same comparison between interventions was available in both.
In all, 63 studies are included in our review, and the same quantitative comparison was available in 58 of these. The final treatment effects were smaller than the interim ones in 39 (67%) trials and the same size or larger in 19 (33%). There was a marked reduction, defined as a ≥20% decrease in the size of the treatment effect from interim to final analysis, in 11 (19%) trials compared to a marked increase in 3 (5%), p = 0.057. The magnitude of percentage change was larger in trials where commercial support was reported, and increased as the proportion of final events at the interim report decreased in trials where commercial support was reported (interaction p = 0.023). There was no evidence of a difference between trials that stopped recruitment at the interim analysis where this was reported as being pre-specified versus those that were not pre-specified (interaction p = 0.87).
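The comparison of 11 marked reductions against 3 marked increases can be illustrated with an exact sign test. The abstract does not name the test used, so the sketch below is an assumption: it treats the 14 trials with a marked change as independent and tests whether reductions and increases are equally likely. The function names, and the symmetric 20% threshold for a marked increase, are illustrative rather than taken from the review.

```python
from math import comb


def sign_test_two_sided(k: int, n: int) -> float:
    """Exact two-sided binomial (sign) test of k 'successes' out of n against p = 0.5."""
    tail = min(k, n - k)
    # Probability of a count at least as extreme in one tail, doubled for two sides.
    one_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


def classify_change(interim_effect: float, final_effect: float,
                    threshold: float = 0.20) -> str:
    """Classify the relative change in effect size from interim to final analysis.

    Effects are assumed to be positive magnitudes on a common scale (an assumption;
    the abstract does not state the scale used).
    """
    change = (final_effect - interim_effect) / interim_effect
    if change <= -threshold:
        return "marked reduction"
    if change >= threshold:
        return "marked increase"
    return "no marked change"


if __name__ == "__main__":
    # 11 marked reductions and 3 marked increases among 14 trials with a marked change.
    print(f"p = {sign_test_two_sided(11, 14):.3f}")  # p = 0.057
    print(classify_change(0.50, 0.35))               # marked reduction
```

Under these assumptions the exact test gives p = 0.057, which matches the reported value.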
Published interim trial results were more likely to be associated with larger treatment effects than those based on the final report. Publishing interim results should be discouraged, in order to have reliable estimates of treatment effects for clinical decision-making, regulatory authority reviews and health economic analyses. Our work should be expanded to include conference publications and manual searches of additional journal publications.
The use of data monitoring committees in the conduct of clinical trials has increased and evolved, but there is a lack of published information on when data monitoring committees are needed and utilized, the acceptable range of data monitoring committee practices, and appropriate qualifications of data monitoring committee members.
To gain a better understanding of data monitoring committee operations and areas for improvement, the Clinical Trials Transformation Initiative conducted a survey and set of focus groups. A total of 143 respondents completed the online survey: 76 data monitoring committee members, 52 sponsors involved with organization of data monitoring committees, and 15 statistical data analysis center representatives. There were 42 focus group participants, including data monitoring committee members; patients and/or patient advocate data monitoring committee members; institutional review board and US Food and Drug Administration representatives; industry, government, and non-profit sponsors; and statistical data analysis center representatives.
Participants indicated that the primary responsibility of a data monitoring committee is to be an independent advisory body representing the interests of trial participants by assessing the risk and benefit ratio in ongoing trials. They noted that data monitoring committees must have access to unmasked data in order to perform this role. No clear consensus emerged regarding specific criteria for requiring a data monitoring committee for a given trial, and some participants felt data monitoring committees may be overused. Respondents offered suggestions for the data monitoring committee charter and communications with sponsors, institutional review boards, and regulators. Overall, data monitoring committee members reported that they are able to function independently and their recommendations are almost always accepted by the sponsor. Participants indicated that there are no standards or guidelines pertaining to qualifications of data monitoring committee members. Furthermore, only 8% (6/72) of data monitoring committee member survey respondents received any formal training, and 94% (68/72) were not aware of any training programs.
Findings from the survey and focus groups provide a better understanding of contemporary data monitoring committee operations and insights regarding challenges and best practices. Overall, it was clear that increased training will be needed to prepare the next generation of qualified data monitoring committee members to meet the growing demand. These findings can be used by Clinical Trials Transformation Initiative and others to develop recommendations and tools to improve data monitoring committee operations and the overall quality of trial oversight.
In clinical research, minimizing patients lost to follow-up is essential for data validity. Researchers can employ better methodology to prevent patient loss. We examined how orthopedic surgery patients’ contact information changes over time to optimize data collection for long-term outcomes research.
Patients presenting to orthopedic outpatient clinics completed questionnaires regarding methods of contact: home phone, cell phone, mailing address, and e-mail address. They reported currently available methods of contact, if they changed in the past 5 and 10 years, and when they changed. Differences in the rates of change among methods were assessed via Fisher’s exact tests. Whether participants changed any of their contact information in the past 5 and 10 years was determined via multivariate modeling, controlling for demographic variables.
Among 152 patients, 51% changed at least one form of contact information within 5 years, and 66% changed at least one form within 10 years. The rate of change for each contact method was similar over 5 (15%–28%) and 10 years (26%–41%). One patient changed all four methods of contact within the past 5 years and seven within the past 10 years. Females and younger patients were more likely to change some type of contact information.
The type of contact information least likely to change over 5–10 years is influenced by demographic factors such as sex and age, with females and younger participants more likely to change some aspect of their contact information. Collecting all contact methods appears necessary to minimize patients lost to follow-up, especially as technological norms evolve.
In settings like the Ebola epidemic, where proof-of-principle trials have provided evidence of efficacy but questions remain about the effectiveness of different possible modes of implementation, it may be useful to conduct trials that not only generate information about intervention effects but also themselves provide public health benefit. Cluster randomized trials are of particular value for infectious disease prevention research by virtue of their ability to capture both direct and indirect effects of intervention, the latter of which depends heavily on the nature of contact networks within and across clusters. By leveraging information about these networks—in particular the degree of connection across randomized units, which can be obtained at study baseline—we propose a novel class of connectivity-informed cluster trial designs that aim both to improve public health impact (speed of epidemic control) and to preserve the ability to detect intervention effects.
We propose several designs for cluster randomized trials with staggered enrollment, in each of which the order of enrollment is based on the total number of ties (contacts) from individuals within a cluster to individuals in other clusters. Our designs can accommodate connectivity based either on the total number of external connections at baseline or on connections only to areas yet to receive the intervention. We further consider a "holdback" version of the designs in which control clusters are held back from re-randomization for some time interval. We investigate the performance of these designs in terms of epidemic control outcomes (time to end of epidemic and cumulative incidence) and power to detect intervention effect, by simulating vaccination trials during an SEIR-type epidemic outbreak using a network-structured agent-based model. We compare results to those of a traditional Stepped Wedge trial.
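The ordering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy network, and the choice to enroll the most externally connected clusters first are all assumptions. The optional `treated` argument mimics the variant that counts only ties to clusters that have not yet received the intervention.

```python
def external_ties(edges, cluster_of, treated=frozenset()):
    """Count, for each cluster, ties from its members to individuals in other clusters.

    edges      -- iterable of (person, person) contact pairs
    cluster_of -- dict mapping each person to a cluster label
    treated    -- clusters already receiving the intervention; ties to them are ignored
    """
    counts = {c: 0 for c in set(cluster_of.values())}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue  # within-cluster contact, not an external tie
        if cv not in treated:
            counts[cu] += 1
        if cu not in treated:
            counts[cv] += 1
    return counts


def enrollment_order(edges, cluster_of, treated=frozenset()):
    """Order untreated clusters by external ties, most connected first (an assumption)."""
    counts = external_ties(edges, cluster_of, treated)
    candidates = [c for c in counts if c not in treated]
    # Break equal counts by label so the order is reproducible.
    return sorted(candidates, key=lambda c: (-counts[c], c))


if __name__ == "__main__":
    # Hypothetical toy network: six people in three clusters.
    cluster_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"}
    edges = [(1, 2), (1, 3), (2, 3), (3, 5), (4, 5), (5, 6)]
    print(external_ties(edges, cluster_of))     # A has 2, B has 4, C has 2
    print(enrollment_order(edges, cluster_of))  # ['B', 'A', 'C']
```

Recomputing the order with `treated` updated after each enrollment step would give the variant based on connections to areas yet to receive the intervention.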
In our simulation studies, connectivity-informed designs lead to a 20% reduction in cumulative incidence compared to comparable traditional study designs, but have little impact on epidemic length. Power to detect intervention effect is reduced in all connectivity-informed designs, but "holdback" versions provide power that is very close to that of a traditional Stepped Wedge approach.
Incorporating information about cluster connectivity in the design of cluster randomized trials can increase their public health impact, especially in acute outbreak settings. Using this information helps control outbreaks—by minimizing the number of cross-cluster infections—with very modest cost in terms of power to detect effectiveness.
Bayesian statistics are an appealing alternative to the traditional frequentist approach to designing, analysing, and reporting of clinical trials, especially in rare diseases. Time-to-event endpoints are widely used in many medical fields. There are additional complexities to designing Bayesian survival trials which arise from the need to specify a model for the survival distribution. The objective of this article was to critically review the use and reporting of Bayesian methods in survival trials.
A systematic review of clinical trials using Bayesian survival analyses was performed through PubMed and Web of Science databases. This was complemented by a full text search of the online repositories of pre-selected journals. Cost-effectiveness, dose-finding studies, meta-analyses, and methodological papers using clinical trials were excluded.
In total, 28 articles met the inclusion criteria: 25 were original reports of clinical trials and 3 were re-analyses of a clinical trial. Most trials were in oncology (n = 25), were randomised controlled (n = 21) phase III trials (n = 13), and half considered a rare disease (n = 13). Bayesian approaches were used for monitoring in 14 trials and for the final analysis only in 14 trials. In the latter case, Bayesian survival analyses were used for the primary analysis in four cases, for the secondary analysis in seven cases, and for the trial re-analysis in three cases. Overall, 12 articles reported fitting Bayesian regression models (semi-parametric, n = 3; parametric, n = 9). Prior distributions were often incompletely reported: 20 articles did not define the prior distribution used for the parameter of interest. When the prior was specified, over half of the trials (n = 12) used only non-informative priors for monitoring and the final analysis. Indeed, no articles fitting Bayesian regression models placed informative priors on the parameter of interest. The prior for the treatment effect was based on historical data in only four trials. Decision rules were pre-defined in eight cases when trials used Bayesian monitoring, and in only one case when trials adopted a Bayesian approach to the final analysis.
Few trials implemented a Bayesian survival analysis and few incorporated external data into priors. There is scope to improve the quality of reporting of Bayesian methods in survival trials. Extension of the Consolidated Standards of Reporting Trials statement for reporting Bayesian clinical trials is recommended.
Futility (inefficacy) interim monitoring is an important component in the conduct of phase III clinical trials, especially in life-threatening diseases. Desirable futility monitoring guidelines allow timely stopping if the new therapy is harmful or if it is unlikely to be shown to be sufficiently effective were the trial to continue to its final analysis. There are a number of analytical approaches that are used to construct futility monitoring boundaries. The most common approaches are based on conditional power, sequential testing of the alternative hypothesis, or sequential confidence intervals. The resulting futility boundaries vary considerably with respect to the level of evidence required for recommending stopping the study.
We evaluate the performance of commonly used methods using event histories from completed phase III clinical trials of the Radiation Therapy Oncology Group, Cancer and Leukemia Group B, and North Central Cancer Treatment Group.
We considered published superiority phase III trials with survival endpoints initiated after 1990. There are 52 studies available for this analysis from different disease sites. Total sample size and maximum number of events (statistical information) for each study were calculated using protocol-specified effect size, type I and type II error rates. In addition to the common futility approaches, we considered a recently proposed linear inefficacy boundary approach with an early harm look followed by several lack-of-efficacy analyses. For each futility approach, interim test statistics were generated for three schedules with different analysis frequency, and early stopping was recommended if the interim result crossed a futility stopping boundary. For trials not demonstrating superiority, the impact of each rule is summarized as savings on sample size, study duration, and information time scales.
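Two of the quantities described above can be sketched numerically. The abstract does not state which formulas were used, so the following are assumptions: the maximum number of events is approximated with Schoenfeld's formula for 1:1 randomization, and conditional power is computed under the usual Brownian-motion approximation with the design alternative as the assumed drift. Function names and default values are illustrative only.

```python
from math import ceil, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def required_events(hazard_ratio: float, alpha: float = 0.025,
                    power: float = 0.90) -> int:
    """Schoenfeld approximation to the number of events (1:1 allocation, one-sided alpha)."""
    z_sum = _STD_NORMAL.inv_cdf(1 - alpha) + _STD_NORMAL.inv_cdf(power)
    return ceil(4 * z_sum ** 2 / log(hazard_ratio) ** 2)


def conditional_power(z_interim: float, info_frac: float, alpha: float = 0.025,
                      power: float = 0.90, drift=None) -> float:
    """Probability of crossing the final critical value given the interim z-statistic.

    info_frac -- fraction of total statistical information observed, in (0, 1)
    drift     -- expected final z-statistic; defaults to the design alternative
    """
    if not 0 < info_frac < 1:
        raise ValueError("info_frac must lie strictly between 0 and 1")
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)
    if drift is None:
        drift = z_crit + _STD_NORMAL.inv_cdf(power)
    b_value = z_interim * sqrt(info_frac)               # B(t) = Z(t) * sqrt(t)
    expected_final = b_value + drift * (1 - info_frac)  # E[B(1) | B(t)]
    return 1 - _STD_NORMAL.cdf((z_crit - expected_final) / sqrt(1 - info_frac))


if __name__ == "__main__":
    print(required_events(0.75))
    for z in (-0.5, 0.0, 1.0):
        cp = conditional_power(z, info_frac=0.5)
        # A hypothetical rule: recommend stopping when conditional power falls below 10%.
        print(f"z = {z:+.1f}: CP = {cp:.3f}, stop = {cp < 0.10}")
```

A conditional power rule compares this probability with a fixed threshold at each look. The threshold of 10% used here is a placeholder, not a value from the study.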
For negative studies, our results show that the futility approaches based on testing the alternative hypothesis and repeated confidence interval rules yielded less savings (compared to the other two rules). These boundaries are too conservative, especially during the first half of the study (<50% of information). The conditional power rules are too aggressive during the second half of the study (>50% of information) and may stop a trial even when there is a clinically meaningful treatment effect. The linear inefficacy boundary with three or more interim analyses provided the best results. For positive studies, we demonstrated that none of the futility rules would have stopped the trials.
The linear inefficacy boundary futility approach is attractive from statistical, clinical, and logistical standpoints in clinical trials evaluating new anti-cancer agents.
Factorial analyses of 2 × 2 trial designs are known to be problematic unless one can be sure that there is no interaction between the treatments (A and B). Instead, we consider non-factorial analyses of a factorial trial design that addresses clinically relevant questions of interest without any assumptions on the interaction. Primary questions of interest are as follows: (1) is A better than the control treatment C, (2) is B better than C, (3) is the combination of A and B (AB) better than C, and (4) is AB better than A, B, and C.
A simple three-step procedure that tests the first three primary questions of interest using a Bonferroni adjustment at the first step is proposed. A Hochberg procedure on the four primary questions is also considered. The two procedures are evaluated and compared in limited simulations. Published results from three completed trials with factorial designs are re-evaluated using the two procedures.
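For reference, the Hochberg step-up procedure can be sketched as below. This shows only the generic procedure applied to a set of p-values; the three-step procedure is not specified in enough detail here to reproduce, and the p-values in the example are hypothetical.

```python
def hochberg(p_values, alpha=0.05):
    """Hochberg step-up procedure.

    Returns a list of booleans, one per hypothesis in the input order,
    indicating which null hypotheses are rejected at family-wise level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    reject = [False] * m
    # Start from the largest p-value and step down until one meets its threshold.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        if p_values[idx] <= alpha / (m - rank + 1):
            # Reject this hypothesis and every hypothesis with a smaller p-value.
            for j in order[:rank]:
                reject[j] = True
            break
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for the four primary questions.
    print(hochberg([0.040, 0.030, 0.020, 0.045]))  # [True, True, True, True]
    print(hochberg([0.010, 0.030, 0.020, 0.200]))  # [True, False, False, False]
```

In the first example every p-value exceeds the Bonferroni threshold of 0.0125, yet all four hypotheses are rejected because the largest p-value is below 0.05. This illustrates why Hochberg is never less powerful than Bonferroni.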
Both suggested procedures (that answer multiple questions) require a 50%–60% increase in per arm sample size over a two-arm design asking a single question. The simulations suggest a slight advantage to the three-step procedure in terms of power (for the primary and secondary questions). The proposed procedures would have formally addressed the questions arising in the highlighted published trials arguably more simply than the pre-specified factorial analyses used.
Factorial trial designs are an efficient way to evaluate two treatments, alone and in combination. In situations where a statistical interaction between the treatment effects cannot be assumed to be 0, simple non-factorial analyses are possible that directly assess the questions of interest without the zero interaction assumption.
Many clinical trial designs are impractical for community-based clinical intervention trials. Stepped wedge trial designs provide practical advantages, but few descriptions exist of their clinical implementational features, statistical design efficiencies, and limitations.
To enhance the efficiency of stepped wedge trial designs by evaluating the impact of design characteristics on statistical power for the British Columbia Telehealth Trial.
The British Columbia Telehealth Trial is a community-based, cluster-randomized, controlled clinical trial in rural and urban British Columbia. To determine the effect of an Internet-based telehealth intervention on healthcare utilization, 1000 subjects with an existing diagnosis of congestive heart failure or type 2 diabetes will be enrolled from 50 clinical practices. Hospital utilization is measured using a composite of disease-specific hospital admissions and emergency visits. The intervention comprises online telehealth data collection and counseling provided to support a disease-specific action plan developed by the primary care provider. The planned intervention is sequentially introduced across all participating practices. We adopt a fully Bayesian, Markov chain Monte Carlo–driven statistical approach, wherein we use simulation to determine the effect of cluster size, sample size, and crossover interval choice on type I error and power to evaluate differences in hospital utilization.
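The sequential introduction of the intervention can be pictured as a cluster-by-period exposure matrix. The sketch below builds such a matrix for a generic stepped wedge layout. The number of steps, the even allocation of practices to steps, and the function name are assumptions for illustration; the Bayesian power simulation itself is not reproduced.

```python
import random


def stepped_wedge_schedule(n_clusters: int, n_steps: int, seed: int = 0):
    """Return a cluster-by-period matrix of 0 (control) and 1 (intervention).

    Clusters are randomly assigned to one of n_steps crossover times as evenly
    as possible. Period 0 is an all-control baseline, so there are n_steps + 1 periods.
    """
    rng = random.Random(seed)
    clusters = list(range(n_clusters))
    rng.shuffle(clusters)  # random order of crossover
    step_of = {c: 1 + i % n_steps for i, c in enumerate(clusters)}
    n_periods = n_steps + 1
    return [
        [1 if period >= step_of[c] else 0 for period in range(n_periods)]
        for c in range(n_clusters)
    ]


if __name__ == "__main__":
    # 50 practices as in the trial; 10 steps is a hypothetical choice.
    schedule = stepped_wedge_schedule(n_clusters=50, n_steps=10)
    for row in schedule[:5]:
        print("".join(str(x) for x in row))
    exposed = [sum(row[p] for row in schedule) for p in range(11)]
    print("clusters on intervention per period:", exposed)
```

Changing `n_steps`, or the calendar length of each period, corresponds to the crossover-interval choices whose effect on power the simulations examine.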
For our Bayesian stepped wedge trial design, simulations suggest moderate decreases in power when crossover intervals from control to intervention are reduced from every 3 to 2 weeks, and dramatic decreases in power as the numbers of clusters decrease. Power and type I error performance were not notably affected by the addition of nonzero cluster effects or a temporal trend in hospitalization intensity.
Stepped wedge trial designs that intervene in small clusters across longer periods can provide enhanced power to evaluate comparative effectiveness, while offering practical implementation advantages in geographic stratification, temporal change, use of existing data, and resource distribution. Current population estimates were used; however, models may not reflect actual event rates during the trial. In addition, temporal or spatial heterogeneity can bias treatment effect estimates.
In testing for non-inferiority of anti-infective drugs, the primary endpoint is often the difference in the proportion of failures between the test and control group at a landmark time. The landmark time is chosen to approximately correspond to the qth historic quantile of the control group, and the non-inferiority margin is selected to be reasonable for the target level q. For designing these studies, a troubling issue is that the landmark time must be pre-specified, but there is no guarantee that the proportion of control failures at the landmark time will be close to the target level q. If the landmark time is far from the target control quantile, then the pre-specified non-inferiority margin may no longer be reasonable. Exact variable margin tests have been developed by Röhmel and Kieser to address this problem, but these tests can have poor power if the observed control failure rate at the landmark time is far from its historic value.
We develop a new variable margin non-inferiority test where we continue sampling until a pre-specified proportion of failures, q, have occurred in the control group, where q is the target quantile level. The test does not require any assumptions on the failure time distributions, and hence, no knowledge of the true
Our new test is exact and has power comparable to (or greater than) its competitors when the true control quantile from the study equals (or differs moderately from) its historic value. Our nivm R package performs the test and gives confidence intervals on the difference in failure rates at the true target control quantile. The tests can be applied to time to cure or other numeric variables as well.
A substantial proportion of new anti-infective drugs being developed use non-inferiority tests in their development, and typically, a pre-specified landmark time and its associated difference margin are set at the design stage to match a specific target control quantile. If through changing standard of care or selection of a different population the target quantile for the control group changes from its historic value, then the appropriateness of the pre-specified margin at the landmark time may be questionable. Our proposed test avoids this problem by sampling until a pre-specified proportion of the controls have failed.
Patients, clinicians, and policymakers alike need access to high-quality scientific evidence in order to make informed choices about health and healthcare, but the current national clinical trials enterprise is not yet optimally configured for the efficient creation and dissemination of such evidence. However, new technologies and methods hold significant potential for accelerating the rate at which we are able to translate raw findings gathered from both patient care and clinical research into actionable knowledge. We are now entering a period in which the quantitative sciences are emerging as the critical disciplines for advancing knowledge about health and healthcare, and statisticians will increasingly serve as critical mediators in transforming data into evidence. In this new, data-centric era, biostatisticians not only need to be expert at analyzing data but should also be involved directly in diverse efforts, including the review and analysis of research portfolios in order to optimize the relevance of research questions, the use of "quality by design" principles to improve reliability and validity of each individual trial, and the mining of aggregate knowledge derived from the clinical research enterprise as a whole. In order to meet these challenges, it is imperative that we (1) nurture and build the biostatistical workforce, (2) develop a deeper understanding of the biological and clinical context among statisticians, (3) facilitate collaboration among biostatisticians and other members of the clinical trials enterprise, (4) focus on communication skills in training and education programs, and (5) enhance the quantitative capacity of the research and clinical practice worlds.
The emergence, post approval, of serious medical events, which may be associated with the use of a particular drug or class of drugs, is an important public health and regulatory issue. The best method to address this issue is through a large, rigorously designed safety study. Therefore, it is important to elucidate the statistical issues involved in these large safety studies.
Two such studies are PRECISION and EAGLES. PRECISION is the primary focus of this article. PRECISION is a non-inferiority design with a clinically relevant non-inferiority margin. Statistical issues in the design, conduct and analysis of PRECISION are discussed.
Quantitative and clinical aspects of the selection of the composite primary endpoint, the determination and role of the non-inferiority margin in a large safety study and the intent-to-treat and modified intent-to-treat analyses in a non-inferiority safety study are shown. Protocol changes that were necessary during the conduct of PRECISION are discussed from a statistical perspective. Issues regarding the complex analysis and interpretation of the results of PRECISION are outlined. EAGLES is presented as an example of a large, rigorously designed safety study in which a non-inferiority margin could not be determined by a strong clinical/scientific method. In general, when a non-inferiority margin cannot be determined, the width of the 95% confidence interval is a way to size the study and to assess the cost–benefit of relative trial size.
A non-inferiority margin, when it can be determined by a strong scientific method, should be included in a large safety study. Although these studies could not be called "pragmatic," they are examples of best real-world designs to address safety and regulatory concerns.
To reduce research costs in the context of pragmatic trials, consideration is given to using administrative data (Medicare claims) to ascertain clinical outcomes.
In the historical context of the Women’s Health Initiative, selected cardiovascular events derived from Medicare claims were compared with those documented and adjudicated in this large-scale prevention trial.
Classification performance varies somewhat by type of outcome, but hazard ratios and confidence intervals derived from the two data sources were quite comparable.
These encouraging results provided the needed support to launch a new embedded pragmatic trial of physical activity that will rely heavily on Medicare claims to ascertain cardiovascular disease incidence in the majority of those randomized.
A learning health care system ideally adapts to the pace of change, incorporates new clinical research paradigms, and leverages electronic health record systems and clinical decision support systems to narrow the divide between research and clinical practice.
An adaptive clinical trial can be embedded into the sites and practice of clinical care in a highly pragmatic way to simultaneously generate high-quality data on treatment efficacy and improve the care of patients. This approach can be expanded into a pragmatic platform trial, meaning a trial that is intended to evaluate multiple treatments for a disease or diseases, possibly in combination, and with the available treatments potentially changing over time. This strategy is illustrated using a trial currently being implemented in Europe and funded by the European Union, evaluating three different "domains" of treatments for patients with severe community-acquired pneumonia requiring intensive care.
Simulation studies demonstrate that this approach has the potential to save lives while identifying the best treatment strategies for this critically ill population.
Patients are likely to benefit if we can merge clinical trials and decision support into a single continuous learning process.
Randomized clinical trials provide gold-standard evidence for the efficacy of interventions, but have limitations, including highly selected populations that make inference on effectiveness difficult and a lack of ability to adapt and change midstream.
We propose two innovations for pragmatic trial design.
Evidence-based evolutionary testing, a framework that allows adaptation of interventions and rapid-cycle innovation, preserves the power of randomization while acknowledging the need for adaptation and learning. An opt-out consent framework increases the fraction of the target population who participate in trials, but may lead to dampening of effect sizes.
Pragmatic trials offer numerous advantages in the evaluation of behavioral interventions in health. Statistical innovations, including evidence-based evolutionary testing and opt-out framing of consent and enrollment processes, can enhance the power of pragmatic trials and lead to more rapid progress.
Despite the wide use of designs with statistical stopping guidelines that allow a randomized clinical trial to stop early for efficacy, there are unsettled debates about the potentially harmful consequences of such designs. These concerns include the possible over-estimation of treatment effects in early stopped trials and a newer argument of a "freezing effect" that will halt future randomized clinical trials on the same comparison since an early stopped trial represents an effective declaration that randomization to the unfavored arm is unethical. The purpose of this study is to determine the degree of bias in designs that allow for early stopping and to assess the impact on estimation if indeed future experimentation is "frozen" by an early stopped trial.
We perform simulations to study the effect of early stopping. We simulate a collection of trials and contrast the treatment-effect estimates (risk differences and ratios) with the simulation truth. Simulations consider various scenarios of between-study variation, including an empirically derived distribution of effects from the clinical literature.
Across the trials whose true effects are sampled from a uniform distribution, estimates from trials that stop early for efficacy deviate minimally from the simulation truth (median bias of the estimate of risk difference is 0.005). Over-estimation becomes appreciable only when the true effect is close to the null value 0 (median bias of the risk difference estimate is 0.04) or when stopping happens with 40% information or less; however, stopping under these situations is rare. We also find slight reverse bias of the estimated treatment effect (median bias of the risk difference estimate is –0.002) among trials that do not cross the early stopping boundaries but continue to the final analysis. Similar results occur with relative risk estimates. In contrast, Bayesian estimation of the treatment effect shrinks the estimate from trials stopping early and pulls back under-estimation from completed trials, largely rectifying any over-estimation among trials that terminate early. Regarding the so-called freezing effect, the pooled effects from meta-analyses that include truncated randomized clinical trials show an unimportant deviation from the true value, even when no subsequent trials are conducted after a truncated randomized clinical trial.
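The direction of these biases can be reproduced in a much simpler setting than the one studied. The sketch below is not the authors' simulation. It assumes a single fixed true risk difference, one interim look at half the planned sample, and an interim efficacy boundary of z = 2.797 (approximately the O'Brien-Fleming-type value for two equally spaced looks at one-sided 0.025). It reports mean rather than median bias, and all names and parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def simulate(true_pc=0.30, true_pt=0.40, n_per_arm=200, n_trials=2000,
             z_interim=2.797, seed=1):
    """Simulate two-arm binary-outcome trials with one interim efficacy look.

    Returns the number of trials stopped early and the mean bias of the
    risk-difference estimate in early-stopped and in completed trials.
    """
    rng = random.Random(seed)
    half = n_per_arm // 2
    early, completed = [], []
    for _ in range(n_trials):
        control = [rng.random() < true_pc for _ in range(n_per_arm)]
        treated = [rng.random() < true_pt for _ in range(n_per_arm)]
        # Interim analysis on the first half of each arm.
        pc, pt = sum(control[:half]) / half, sum(treated[:half]) / half
        se = sqrt(pc * (1 - pc) / half + pt * (1 - pt) / half)
        if se > 0 and (pt - pc) / se >= z_interim:
            early.append(pt - pc)  # stop early and report the interim estimate
        else:
            completed.append(sum(treated) / n_per_arm - sum(control) / n_per_arm)
    true_rd = true_pt - true_pc
    return {
        "n_early": len(early),
        "bias_early": mean(early) - true_rd if early else float("nan"),
        "bias_completed": mean(completed) - true_rd if completed else float("nan"),
    }


if __name__ == "__main__":
    result = simulate()
    print(result)
```

In this fixed-effect setting the trials that stop early overestimate the effect, while those that continue show a small bias in the opposite direction. Averaging over a distribution of true effects, as the study does, changes the magnitudes.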
Group sequential designs with stopping rules seek to minimize exposure of patients to a disfavored therapy and speed dissemination of results, and such designs do not lead to materially biased estimates. The likelihood and magnitude of a "freezing effect" is minimal. Superiority demonstrated in a randomized clinical trial stopping early and designed with appropriate statistical stopping rules is likely a valid inference, even if the estimate may be slightly inflated.
Pragmatic clinical trials embedded within health care systems provide an important opportunity to evaluate new interventions and treatments. Networks have recently been developed to support practical and efficient studies. Pragmatic trials will lead to improvements in how we deliver health care and promise to more rapidly translate research findings into practice.
The National Institutes of Health (NIH) Health Care Systems Collaboratory was formed to conduct pragmatic clinical trials and to cultivate collaboration across research areas and disciplines to develop best practices for future studies. Through a two-stage grant process including a pilot phase (UH2) and a main trial phase (UH3), investigators across the Collaboratory had the opportunity to work together to improve all aspects of these trials before they were launched and to address new issues that arose during implementation. Seven Cores were created to address the various considerations, including Electronic Health Records; Phenotypes, Data Standards, and Data Quality; Biostatistics and Design Core; Patient-Reported Outcomes; Health Care Systems Interactions; Regulatory/Ethics; and Stakeholder Engagement. The goal of this article is to summarize the Biostatistics and Design Core’s lessons learned during the initial pilot phase with seven pragmatic clinical trials conducted between 2012 and 2014.
Methodological issues arose from the five cluster-randomized trials, also called group-randomized trials, including consideration of crossover and stepped wedge designs. We outlined general themes and challenges and proposed solutions from the pilot phase including topics such as study design, unit of randomization, sample size, and statistical analysis. Our findings are applicable to other pragmatic clinical trials conducted within health care systems.
Pragmatic clinical trials using the UH2/UH3 funding mechanism provide an opportunity to ensure that all relevant design issues have been fully considered in order to reliably and efficiently evaluate new interventions and treatments. The integrity and generalizability of trial results can only be ensured if rigorous designs and appropriate analysis choices are an essential part of their research protocols.
Independent central review of clinical imaging remains the standard for oncology clinical trials with registration potential. A limited independent central review strategy has been proposed for solid tumor trials based on concordance between central and local evaluation of response. Concordance between independent central review and local evaluation of response in hematological malignancies is not known.
We retrospectively evaluated concordance between prospectively performed central and local assessments of response using the Revised Response Criteria for Malignant Lymphoma across two international, open-label, single-arm, registration studies of brentuximab vedotin in patients with relapsed or refractory Hodgkin lymphoma (N = 102) or systemic anaplastic large-cell lymphoma (N = 58).
Overall objective response rates were similar between assessors for both the trial in Hodgkin lymphoma (75% independent central review, 72% local evaluation) and the trial in anaplastic large-cell lymphoma (86% independent central review, 83% local evaluation). Patient-specific objective response concordance was also substantial (Hodgkin lymphoma: kappa = 0.68; anaplastic large-cell lymphoma: kappa = 0.74). Median progression-free survival was similar between assessors for patients with anaplastic large-cell lymphoma (14.3 months by independent central review (95% confidence interval: 6.9, -); 14.5 months by local evaluation (95% confidence interval: 9.4, -)), but longer by local evaluation in patients with Hodgkin lymphoma (5.8 months by independent central review (95% confidence interval: 5.0, 9.0); 9.0 months by local evaluation (95% confidence interval: 7.1, 12.0)). Median duration of response was longer by local evaluation in both malignancies, which was primarily attributable to earlier computed tomography and positron emission tomography–based scoring of progression by independent central review.
A limited independent review audit strategy for clinical trials of some lymphomas appears feasible and practical based on substantial concordance in assessments of overall objective response by central and local evaluation in two international, prospective, registration trials in lymphoma. Some variability between assessors in the time-to-event endpoints was observed, which appeared attributable to earlier assignments of progression by independent central review compared with local evaluation.
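The kappa statistics quoted above measure agreement beyond chance between two assessors. As an illustration only, the sketch below computes Cohen's kappa from a cross-tabulation of central versus local response calls. The counts are hypothetical and are not the trial data.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table (rows: central, columns: local)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of patients on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical counts: [responder, non-responder] by central (rows) and local (columns).
table = [[40, 10],
         [5, 45]]
kappa = cohen_kappa(table)
print(round(kappa, 2))
```

With these made-up counts the observed agreement is 0.85 and the chance agreement is 0.50, so kappa is 0.70.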
Interim monitoring is a key component of randomized clinical trial design from both ethical and efficiency perspectives. In studies with time-to-event endpoints, timing of interim analyses is typically based on observing a pre-specified proportion of the total number of events required for the final analysis. While most randomized clinical trial designs pool events over the experimental and control arms in determining the analysis times, some designs use only the control-arm events for scheduling interim looks.
To evaluate the performance of the pooled and control-arm-based interim monitoring approaches and to propose a new procedure, the earliest information time procedure, that combines the benefits of the two approaches.
The analytical and logistical considerations for the procedures are presented. The methodology is illustrated on data from three published randomized clinical trials. The procedures are compared in a simulation study.
The control-arm approach results in a slight inflation of the study type I error in one-sided randomized clinical trial designs. When the new treatment is no better than the control treatment, the pooled-arm approach results in, on average, earlier stopping times than the control-arm approach. When the new treatment works exceptionally well, the average stopping times under the control-arm approach are earlier than those under the pooled approach. The proposed earliest information time procedure is shown to result in stopping times corresponding to the best (earliest) of the two approaches over the entire range of alternatives.
The earliest information time procedure may result in a slight inflation of the type I error (especially in small trials); when exact control of the type I error is required, it is necessary to use a simulation-based method to correct the inflation.
In time-to-event settings, the earliest information time procedure is an attractive alternative to the pooled and control-arm approaches. Improving the timing of interim analyses helps to minimize patient exposure to inferior treatments and to accelerate dissemination of the study results.
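One plausible reading of the earliest information time idea is that an interim look is triggered by whichever of the pooled or control-arm event targets is reached first. The sketch below encodes that reading only; the function names, the event targets, and the event times are hypothetical, and the published procedure may differ in its details (including any type I error correction).

```python
def trigger_time(event_times, required):
    """Calendar time of the `required`-th event, or None if not yet reached."""
    ordered = sorted(event_times)
    return ordered[required - 1] if len(ordered) >= required else None


def earliest_information_time(control, experimental, pooled_required, control_required):
    """Earliest of the pooled-arm and control-arm trigger times (assumed rule)."""
    pooled_t = trigger_time(control + experimental, pooled_required)
    control_t = trigger_time(control, control_required)
    reached = [t for t in (pooled_t, control_t) if t is not None]
    return min(reached) if reached else None


# Hypothetical event times in months since the start of the trial.
control = [2, 4, 6, 8, 10]
# Experimental arm with few, late events: the control-arm target is hit first.
effective = earliest_information_time(control, [5, 9, 13, 17], 6, 3)
# Experimental arm with many early events: the pooled target is hit first.
ineffective = earliest_information_time(control, [1, 2, 3, 3.5], 6, 3)
print(effective, ineffective)
```

In the first scenario the look occurs at month 6 (third control-arm event); in the second it occurs at month 4 (sixth pooled event).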
For the past few decades, randomized clinical trials have provided evidence for effective treatments by comparing several competing therapies. Their successes have led to numerous new therapies to combat many diseases. However, since their conclusions are based on the entire cohort in the trial, the treatment recommendation is for everyone, and may not be the best option for an individual. Medical research is now focusing more on providing personalized care for patients, which requires investigating how patient characteristics, including novel biomarkers, modify the effect of current treatment modalities. This is known as heterogeneity of treatment effects. A better understanding of the interaction between treatment and patient-specific prognostic factors will enable practitioners to expand the availability of tailored therapies, with the ultimate goal of improving patient outcomes. The Subpopulation Treatment Effect Pattern Plot (STEPP) approach was developed to allow researchers to investigate the heterogeneity of treatment effects on survival outcomes across values of a (continuously measured) covariate, such as a biomarker measurement.
Here, we extend the Subpopulation Treatment Effect Pattern Plot approach to continuous, binary, and count outcomes, which can be easily modeled using generalized linear models. With this extension of Subpopulation Treatment Effect Pattern Plot, these additional types of treatment effects within subpopulations defined with respect to a covariate of interest can be estimated, and the statistical significance of any observed heterogeneity of treatment effect can be assessed using permutation tests. The desirable feature that commonly used models are applied to well-defined patient subgroups to estimate treatment effects is retained in this extension.
We describe a simulation study to confirm that the proper Type I error rate is maintained when there is no treatment heterogeneity, and a power study to show that the statistics have power to detect treatment heterogeneity under alternative scenarios. As an illustration, we apply the methods to data from the Aspirin/Folate Polyp Prevention Study, a clinical trial evaluating the effect of oral aspirin, folic acid, or both as a chemoprevention agent against colorectal adenomas. The pre-existing R software package stepp has been extended to handle continuous, binary, and count data using Gaussian, Bernoulli, and Poisson models, and it is available on the Comprehensive R Archive Network.
The extension of the method and the availability of new software now permit STEPP to be applied to the full range of clinical trial end points.
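The core idea of estimating a treatment effect within overlapping subpopulations ordered by a covariate can be sketched as follows. This is a simplified illustration and not the stepp package: the window size and step parameters, the difference-in-means effect measure, and the data are all assumptions, and the permutation test is omitted.

```python
def sliding_windows(n, size, step):
    """Index ranges of overlapping subpopulations over n covariate-sorted patients."""
    starts = list(range(0, max(n - size, 0) + 1, step))
    if starts[-1] + size < n:
        starts.append(n - size)  # make sure the last window reaches the top values
    return [(s, s + size) for s in starts]


def mean(values):
    return sum(values) / len(values)


def windowed_effects(patients, size, step):
    """patients: (covariate, treated, outcome) tuples.

    Returns (median covariate in window, treated mean minus control mean)."""
    ordered = sorted(patients, key=lambda p: p[0])
    results = []
    for lo, hi in sliding_windows(len(ordered), size, step):
        window = ordered[lo:hi]
        treated = [y for _, t, y in window if t == 1]
        control = [y for _, t, y in window if t == 0]
        if treated and control:  # skip windows lacking one of the arms
            results.append((window[len(window) // 2][0], mean(treated) - mean(control)))
    return results


# Hypothetical data: the benefit of treatment grows with the covariate.
patients = [(x, x % 2, float(x * (x % 2))) for x in range(12)]
effects = windowed_effects(patients, size=4, step=2)
print(effects)
```

A rising sequence of window-specific effects, as in this toy example, is the kind of pattern that a formal test of heterogeneity would then assess.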
The use of adaptive designs has been increasing in randomized clinical trials. Sample size re-estimation is a type of adaptation in which nuisance parameters are estimated at an interim point in the trial and the sample size re-computed based on these estimates. The Secondary Prevention of Small Subcortical Strokes study was a randomized clinical trial assessing the impact of single- versus dual-antiplatelet therapy and control of systolic blood pressure to a higher (130–149 mmHg) versus lower (<130 mmHg) target on recurrent stroke risk in a two-by-two factorial design. A sample size re-estimation was performed during the Secondary Prevention of Small Subcortical Strokes study, resulting in an increase in the planned sample size from 2500 to 3020, and we sought to determine the impact of the sample size re-estimation on the study results.
We assessed the results of the primary efficacy and safety analyses with the full 3020 patients and compared them to the results that would have been observed had randomization ended with 2500 patients. The primary efficacy outcome considered was recurrent stroke, and the primary safety outcomes were major bleeds and death. We computed incidence rates for the efficacy and safety outcomes and used Cox proportional hazards models to examine the hazard ratios for each of the two treatment interventions (i.e. the antiplatelet and blood pressure interventions).
In the antiplatelet intervention, the hazard ratio was not materially modified by increasing the sample size, nor did the conclusions regarding the efficacy of mono- versus dual-therapy change: there was no difference in the effect of dual- versus monotherapy on the risk of recurrent stroke (n = 3020 HR (95% confidence interval): 0.92 (0.72, 1.2), p = 0.48; n = 2500 HR (95% confidence interval): 1.0 (0.78, 1.3), p = 0.85). With respect to the blood pressure intervention, increasing the sample size resulted in less certainty in the results, as the hazard ratio for higher versus lower systolic blood pressure target approached, but did not achieve, statistical significance with the larger sample (n = 3020 HR (95% confidence interval): 0.81 (0.63, 1.0), p = 0.089; n = 2500 HR (95% confidence interval): 0.89 (0.68, 1.17), p = 0.40). The results of the safety analyses were similar with 3020 and 2500 patients for both study interventions. Other trial-related factors, such as contracts, finances, and study management, were impacted as well.
Adaptive designs can have benefits in randomized clinical trials, but do not always result in significant findings. The impact of adaptive designs should be measured in terms of both trial results and practical issues related to trial management. More post hoc analyses of study adaptations will lead to better understanding of the balance between the benefits and the costs.
In many randomized controlled trials, patients and doctors are more interested in the per-protocol effect than in the intention-to-treat effect. However, valid estimation of the per-protocol effect generally requires adjustment for prognostic factors associated with adherence. These adherence adjustments have been strongly questioned in the clinical trials community, especially after 1980 when the Coronary Drug Project team found that adherers to placebo had lower 5-year mortality than non-adherers to placebo.
We replicated the original Coronary Drug Project findings from 1980 and re-analyzed the Coronary Drug Project data using technical and conceptual developments that have become established since 1980. Specifically, we used logistic models for binary outcomes, decoupled the definition of adherence from loss to follow-up, and adjusted for pre-randomization covariates via standardization and for post-randomization covariates via inverse probability weighting.
The original Coronary Drug Project analysis reported a difference in 5-year mortality between adherers and non-adherers in the placebo arm of 9.4 percentage points. Using modern approaches, we found that this difference was reduced to 2.5 percentage points (95% confidence interval: –2.1 to 7.0).
Valid estimation of per-protocol effects may be possible in randomized clinical trials when analysts use appropriate methods to adjust for post-randomization variables.
Phase II clinical trials are important milestones to determine whether a dose-effect exists and to decide on future doses to use in confirmatory studies. To take into account the overall shape of the dose–response curve, modeling the relationship by linear or non-linear models is preferable to the classical pair-wise comparisons of the effect of each dose versus the placebo or the comparator. The multiple comparisons and modeling approach has been developed within the last 10 years to address this important question in the clinical development of drugs. Despite some recent publications referring to this methodology, few detailed applications have been shown so far and several practical questions remain to be addressed.
Starting from a set of candidate models, model selection using classical criteria is possible. However, it has limitations, as it does not take into account the uncertainty of the selection process itself. An attractive solution is to use model averaging, which applies appropriate weights to the parameters (e.g., the minimum effective dose) obtained from each model.
A discussion of the selection criteria is presented first. Two real examples are then used to present and discuss how to proceed with model selection and model averaging.
The first multiple comparisons and modeling approach papers addressed normal responses. More recently, an extension of this methodology has been proposed to deal with other types of responses, in particular binary, time-to-event and longitudinal data. Questions that remain are concerned with the choice of the candidate models and of their parameters’ guesstimates.
The analysis of clinical dose-finding studies using a modeling of the entire curve offers a promising alternative as compared with the classical multiple comparisons methods, while not compromising the necessary rigor of the analysis.
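To make the model-averaging step concrete, the sketch below weights the minimum effective dose estimated under each candidate model by Akaike weights. The choice of AIC-based weights is an assumption made for illustration (the abstract does not name the weighting scheme), and the model names, AIC values, and dose estimates are hypothetical.

```python
import math


def akaike_weights(aics):
    """Weights proportional to exp(-0.5 * (AIC_i - AIC_min)), normalised to sum to 1."""
    best = min(aics)
    raw = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]


def model_average(estimates, weights):
    """Weighted average of a parameter estimated under each candidate model."""
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical fits: (model name, AIC, estimated minimum effective dose in mg).
fits = [("linear", 100.0, 10.0), ("emax", 100.0, 20.0), ("quadratic", 110.0, 40.0)]
weights = akaike_weights([aic for _, aic, _ in fits])
averaged_med = model_average([med for _, _, med in fits], weights)
print([round(w, 3) for w in weights], round(averaged_med, 2))
```

Here the two best-fitting models share almost all of the weight, so the averaged estimate sits close to midway between their two doses rather than at the dose of a single selected model.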
Recent advances in medical research suggest that the optimal treatment rules should be adaptive to patients over time. This has led to an increasing interest in studying dynamic treatment regimes: sequences of individualized treatment rules, one per stage of clinical intervention, each of which maps present patient information to a recommended treatment. There has been a recent surge of statistical work for estimating optimal dynamic treatment regimes from randomized and observational studies. The purpose of this article is to review recent methodological progress and applied issues associated with estimating optimal dynamic treatment regimes.
We discuss sequential multiple assignment randomized trials, a clinical trial design used to study treatment sequences. We use a common estimator of an optimal dynamic treatment regime that applies to sequential multiple assignment randomized trials data as a platform to discuss several practical and methodological issues.
We provide a limited survey of practical issues associated with modeling sequential multiple assignment randomized trials data. We review some existing estimators of optimal dynamic treatment regimes and discuss practical issues associated with these methods including model building, missing data, statistical inference, and choosing an outcome when only non-responders are re-randomized. We mainly focus on the estimation and inference of dynamic treatment regimes using sequential multiple assignment randomized trials data. Dynamic treatment regimes can also be constructed from observational data, which may be easier to obtain in practice; however, care must be taken to account for potential confounding.
Bayesian predictive probabilities can be used for interim monitoring of clinical trials to estimate the probability of observing a statistically significant treatment effect if the trial were to continue to its predefined maximum sample size.
We explore settings in which Bayesian predictive probabilities are advantageous for interim monitoring compared to Bayesian posterior probabilities, p-values, conditional power, or group sequential methods.
For interim analyses that address prediction hypotheses, such as futility monitoring and efficacy monitoring with lagged outcomes, only predictive probabilities properly account for the amount of data remaining to be observed in a clinical trial and have the flexibility to incorporate additional information via auxiliary variables.
Computational burdens limit the feasibility of predictive probabilities in many clinical trial settings. The specification of prior distributions brings additional challenges for regulatory approval.
The use of Bayesian predictive probabilities enables the choice of logical interim stopping rules that closely align with the clinical decision-making process.
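A minimal sketch of a predictive probability calculation is given below for the simplest case: a single-arm trial with a binary outcome. It assumes a uniform Beta(1, 1) prior and a final success criterion of posterior Pr(p > p0) exceeding a threshold. These choices, and all numbers, are illustrative assumptions and not the settings studied in the article.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(y, m, a, b):
    """Probability of y successes in m future patients given p ~ Beta(a, b)."""
    return math.comb(m, y) * math.exp(log_beta(a + y, b + m - y) - log_beta(a, b))


def posterior_prob_exceeds(a, b, p0):
    """Pr(p > p0) for p ~ Beta(a, b) with integer a and b.

    Uses the identity Pr(Beta(a, b) > p0) = Pr(Binomial(a + b - 1, p0) <= a - 1)."""
    n = a + b - 1
    return sum(math.comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(a))


def predictive_probability(x, n, n_max, p0, threshold):
    """Probability that the final analysis succeeds, given x successes in n so far."""
    a, b = 1 + x, 1 + n - x  # posterior under the assumed Beta(1, 1) prior
    m = n_max - n            # patients still to be observed
    total = 0.0
    for y in range(m + 1):
        # Would the trial be declared a success if y of the remaining m respond?
        if posterior_prob_exceeds(a + y, b + m - y, p0) > threshold:
            total += beta_binomial_pmf(y, m, a, b)
    return total


# Hypothetical interim looks at 20 of 40 patients, null response rate 0.5.
pp_high = predictive_probability(18, 20, 40, p0=0.5, threshold=0.95)
pp_low = predictive_probability(8, 20, 40, p0=0.5, threshold=0.95)
print(round(pp_high, 3), round(pp_low, 5))
```

Unlike the interim posterior probability, this quantity depends explicitly on how many patients remain to be observed, which is the property that makes it suited to futility monitoring.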
Missing data are unavoidable in most randomized controlled clinical trials, especially when measurements are taken repeatedly. If strong assumptions about the missing data are not accurate, crude statistical analyses are biased and can lead to false inferences. Furthermore, if we fail to measure all predictors of missing data, we may not be able to model the missing data process sufficiently. In longitudinal randomized trials, measuring a patient’s intent to attend future study visits may help to address both of these problems. Leon et al. developed and included the Intent to Attend assessment in the Lithium Treatment – Moderate dose Use Study (LiTMUS), aiming to remove bias due to missing data from the primary study hypothesis.
The purpose of this study is to assess the performance of the Intent to Attend assessment with regard to its use in a sensitivity analysis of missing data.
We fit marginal models to assess whether a patient’s self-rated intent predicted actual study adherence. We applied inverse probability of attrition weighting (IPAW) coupled with patient intent to assess whether there existed treatment group differences in response over time. We compared the IPAW results to those obtained using other methods.
Patient-rated intent predicted missed study visits, even when adjusting for other predictors of missing data. On average, the hazard of retention increased by 19% for every one-point increase in intent. We also found that more severe mania, male gender, and a previously missed visit predicted subsequent absence. Although we found no difference in response between the randomized treatment groups, IPAW increased the estimated group difference over time.
LiTMUS was designed to limit missed study visits, which may have attenuated the effects of adjusting for missing data. Additionally, IPAW can be less efficient and less powerful than maximum likelihood or Bayesian estimators, given that the parametric model is well specified.
In LiTMUS, the Intent to Attend assessment predicted missed study visits. This item was incorporated into our IPAW models and helped reduce bias due to informative missing data. This analysis should both encourage and facilitate future use of the Intent to Attend assessment along with IPAW to address missing data in a randomized trial.
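The weighting idea behind IPAW can be illustrated with a deliberately simplified sketch: each observed patient is weighted by the inverse of the estimated probability of remaining in the study, here estimated within strata of a hypothetical intent score. The study itself used weighted marginal models for longitudinal data, so this single-time-point version, its data, and its function name are assumptions for illustration only.

```python
def ipaw_mean(records):
    """Inverse-probability-of-attrition weighted mean outcome.

    records: (stratum, observed, outcome) tuples; outcome is None when missing."""
    totals, seen = {}, {}
    for stratum, observed, _ in records:
        totals[stratum] = totals.get(stratum, 0) + 1
        seen[stratum] = seen.get(stratum, 0) + (1 if observed else 0)
    num = den = 0.0
    for stratum, observed, outcome in records:
        if observed:
            weight = totals[stratum] / seen[stratum]  # 1 / Pr(observed | stratum)
            num += weight * outcome
            den += weight
    return num / den


# Hypothetical data: patients with low intent drop out more and have higher scores.
records = (
    [("high intent", True, 10.0)] * 4
    + [("low intent", True, 20.0)] * 2
    + [("low intent", False, None)] * 2
)
observed = [y for _, seen_flag, y in records if seen_flag]
complete_case = sum(observed) / len(observed)
weighted = ipaw_mean(records)
print(round(complete_case, 2), weighted)
```

The complete-case mean (13.33) under-represents the low-intent patients, whereas the weighted mean (15.0) restores each stratum to its share of the randomized sample.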
In June 2013, a 1-day workshop on Dynamic Treatment Strategies (DTSs) and Sequential Multiple Assignment Randomized Trials (SMARTs) was held at the University of Pennsylvania in Philadelphia, Pennsylvania. These two linked topics have generated a great deal of interest as researchers have recognized the importance of comparing entire strategies for managing chronic disease. A number of articles emerged from that workshop.
The purpose of this survey of the DTS/SMART methodology (which is taken from the introductory talk in the workshop) is to provide the reader of the collected articles presented in this volume with sufficient background to appreciate the more detailed discussions in the articles.
The way that the DTS arises naturally in clinical practice is described, along with its connection to the well-known difficulties of interpreting the analysis by intention-to-treat. The SMART methodology for comparing DTS is described, and the basics of estimation and inference presented.
The DTS/SMART methodology can be a flexible and practical way to optimize ongoing clinical decision making, providing evidence (based on randomization) for comparative effectiveness.
The DTS/SMART methodology is not a solution for unstandardized study protocols.
The DTS/SMART methodology has growing relevance to comparative effectiveness research and the needs of the learning healthcare system.
Cancer affects millions of people worldwide each year. Patients require sequences of treatment based on their response to previous treatments to combat cancer and fight metastases. Physicians provide treatment based on clinical characteristics, changing over time. Guidelines for these individualized sequences of treatments are known as dynamic treatment regimens (DTRs) where the initial treatment and subsequent modifications depend on the response to previous treatments, disease progression, and other patient characteristics or behaviors. To provide evidence-based DTRs, the Sequential Multiple Assignment Randomized Trial (SMART) has emerged over the past few decades.
To examine and learn from past SMARTs investigating cancer treatment options, to discuss potential limitations preventing the widespread use of SMARTs in cancer research, and to describe courses of action to increase the implementation of SMARTs and collaboration between statisticians and clinicians.
There have been SMARTs investigating treatment questions in areas of cancer, but their novelty and perceived complexity have limited their use. By building bridges between statisticians and clinicians, clarifying research objectives, and furthering methods work, there should be an increase in SMARTs addressing relevant cancer treatment questions. Within any area of cancer, SMARTs develop DTRs that can guide treatment decisions over the disease history and improve patient outcomes.
Due to the cost and complexity of conducting a sequential multiple assignment randomized trial (SMART), it is desirable to pre-define a small number of personalized regimes to study.
We propose a simulation-based approach to studying personalized dosing strategies in contexts for which a therapeutic agent’s pharmacokinetic and pharmacodynamic properties are well understood. We take dosing of warfarin as a case study, as its properties are well understood. We consider a SMART in which there are five intervention points at which dosing may be modified, following a loading phase of treatment.
Realistic SMARTs are simulated, and two methods of analysis, G-estimation and Q-learning, are used to assess potential personalized dosing strategies.
In settings where outcome modelling may be complex due to the highly non-linear nature of the pharmacokinetic and pharmacodynamic mechanisms of the therapeutic agent, G-estimation provides the more promising method of estimating an optimal dosing strategy. Using it in combination with the simulated SMARTs, we were able to improve simulated patient outcomes and suggest which patient characteristics were needed to best individually tailor dosing. In particular, our simulations suggest that current dosing should be determined by an individual’s current coagulation time as measured by the international normalized ratio (INR), their last measured INR, and their last dose. Tailoring treatment only based on current INR and last warfarin dose provided inferior control of INR over the course of the trial.
The ability of the simulated SMARTs to suggest optimal personalized dosing strategies relies on the pharmacokinetic and pharmacodynamic models used to generate the hypothetical patient profiles. This approach is best suited to therapeutic agents whose effects are well studied.
Prior to investing in a complex randomized trial that involves sequential treatment allocations, simulations should be used where possible in order to guide which dosing strategies to evaluate.
Background Recent research has proposed a new method for defining a favorable outcome in traumatic brain injury and stroke research.
Purpose This new method is called the sliding dichotomy, and it is suggested as a potential solution to the problem of underpowered clinical trials.
Methods We present a brief simulation study and graphical comparison of the power of each method to detect varying treatment effect sizes.
Results Simulations of a patient population similar to the National Acute Brain Injury Study: Hypothermia (NABISH) study indicate that the sliding dichotomy method does not result in higher power than traditional methods.
Conclusions Although the sliding dichotomy may present gains in power in some cases, several aspects of the patient population need to be considered in choosing between sliding dichotomy and traditional definitions of favorable outcomes. Clinical Trials 2012; 0: 1–11. http://ctj.sagepub.com
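The difference between a fixed and a sliding dichotomy can be sketched as follows. The outcome scale (a five-point Glasgow Outcome Scale, 1 = dead to 5 = good recovery), the prognosis bands, and the cut-offs are illustrative assumptions and are not taken from the study.

```python
# Hypothetical minimum score counted as favorable within each baseline prognosis band.
SLIDING_CUTOFFS = {"poor": 3, "intermediate": 4, "good": 5}
# Conventional fixed dichotomy (assumed): moderate disability or better.
FIXED_CUTOFF = 4


def favorable_fixed(score):
    """Same cut-off for every patient, regardless of baseline prognosis."""
    return score >= FIXED_CUTOFF


def favorable_sliding(score, band):
    """Cut-off depends on the patient's baseline prognosis band."""
    return score >= SLIDING_CUTOFFS[band]


# Hypothetical patients: (prognosis band, observed outcome score).
patients = [("poor", 3), ("poor", 2), ("intermediate", 4), ("good", 4), ("good", 5)]
fixed = [favorable_fixed(score) for _, score in patients]
sliding = [favorable_sliding(score, band) for band, score in patients]
print(fixed)
print(sliding)
```

Under the sliding definition a poor-prognosis patient who reaches a score of 3 counts as a success, while a good-prognosis patient who reaches only 4 does not. Whether this reclassification raises or lowers power depends on the mix of prognoses in the trial population.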