G. Ferber, K. Abt, K. Fichte, R. Luthringer
Neuropsychobiology. 1999, 39(2), 92-100.
1 Introduction
2 Design and Planning Issues
2.1 The Study Population
2.2 Overall Development Considerations
2.3 Goals of the Study
2.4 Data Sources
The Parallel Group Design
The Cross-Over Design
The Factorial Design
Stratification by Time Blocks
Stratification by Additional Factors
Dose Response Trials
2.6 The Multiplicity Issue
2.7 Choice of Inferential Strategy and Analytical Methods
2.8 Sample Size Determination
2.9 Techniques to Avoid Bias and Other Design Issues
Blinding
Randomisation
Study Protocol
3 Analysis Issues
3.1 Sample Distributions
3.2 Estimation and Testing of Effects
3.3 Coping with Missing Values and Protocol Violations
3.4 Techniques of Coping with Multiplicity
Multiple Testing Approaches
Multivariate Methods
Modelling Dependencies
Summary Measures
Descriptive Data Analysis
4 Safety Issues
5 Reporting
6 References
Guidelines
Textbooks
Review Articles
Methodological Articles
Applications to Pharmaco-Electrophysiology
This guideline is a supplement to the "Guidelines for pharmaco-EEG studies in man" [1]. Other supplements have been issued previously covering data acquisition [2] and mapping and evoked potentials [3]. For related guidelines, see also [4] and [5]. The present guideline is intended to provide support with respect to the statistical aspects of the design and analysis of pharmacodynamic studies in man, in particular electroencephalographic studies. However, the present guideline applies as well to general pharmacodynamic evaluations in the early phases of clinical drug development. In view of the inherent variability of measurements in medicine from subject to subject and from condition to condition within subjects, the separation of treatment - and other - effects from phenomena caused by mere random variation is possible only by application of tools from the field of mathematical statistics. This application demands careful planning of the design and analysis of a study in order to avoid systematic errors (biases) in the conclusions to be gained from the data. Therefore, an essential part of the guideline is devoted to planning issues of a statistical nature.
Some of the procedures recommended here for both planning and analysis follow the Biostatistical Guidelines established by the CPMP Working Party on Efficacy of Medicinal Products [6], with a selection, however, towards the needs of studies in healthy volunteers and in patients in the early phase of a pharmacological evaluation. The ICH Topic E9 Guidelines "Statistical principles for clinical trials" [7], which represent a consensus between regulatory agencies of Japan, the European Union and the United States of America, are also based on the CPMP guidelines. The present guideline cannot and should not give a detailed listing of statistical tools to be applied; rather, the reader will often be referred to the published and accessible statistical literature. Also, mathematical formulas are avoided in the text in order to improve legibility for the non-statistician. Whenever the investigator feels unsure about statistical issues in the planning of a study, advice from a statistician should be sought. The basics of most of the topics treated or mentioned in this guideline are covered in many textbooks on medical statistics such as [8] or [9].
During the development of a drug, clinical trials focusing on pharmacodynamic and, in particular, electrophysiologic recordings will, in general, be conducted in Phase I in healthy volunteers and in Phase II in patients. Safety as well as efficacy are the primary goals of such studies. The trials for which this guideline applies will usually be conducted once the maximal well tolerated dose has been established. Typically their objective is to investigate pharmacodynamic characteristics of a compound, e. g. time or dose-effects. Although pharmacodynamic recordings may also be used during the process of establishing the maximal well tolerated dose, they will not be the primary interest in such studies. Therefore, this latter type of studies will not be covered in the present guideline.
Each group (sample) of study subjects is only representative for a defined study population. Ideally this is the population described by the inclusion and exclusion criteria for subjects laid down in the study protocol. This population will always differ from the general population and sometimes even from the target population of a drug.
Results derived from a sample from one population may be invalid for another population. Sometimes biomedical reasoning may allow the transfer of results from one population (e. g. young healthy volunteers) to another one (e. g. patients), but in principle this is subject to erroneous conclusions. In early clinical drug development phases however, where the number of subjects exposed must be kept to a minimum for ethical reasons, and where emphasis is on pharmacological mechanisms more than on generalisation to clinical routine, there are good reasons to investigate drug effects in a homogeneous, i. e. restricted population. This is because the resulting reduction in variability may lead to an increase in power and/or a decrease in the number of subjects required under otherwise constant conditions.
Pharmacodynamic trials are usually set up to compare the effects of one or more doses of a drug to placebo and possibly to a reference drug. In particular in studies in volunteers, there is, in general, no good reason to exclude placebo from a trial. In many cases, emphasis will be on the time course of a drug effect.
A trial may be set up to investigate the efficacy of a substance, i.e. desired effects. However, the observation of possible expected or unexpected undesired effects, i.e. the safety of the drug, will always be of concern. Electrophysiologic measurements can be used for both goals. In an investigation for efficacy one will, in general, be interested in protection against a type I error, i.e. erroneously claiming efficacy. With respect to safety, avoidance of a type II error, i.e. overlooking an existing effect, is of concern.
In general, inferences from a study can range from confirmation of an already predefined working hypothesis, or a predefined description of a set of observations, to data driven exploration and generation of new hypotheses. In many studies all types of inference will be used. In pharmacodynamic studies emphasis will often be on descriptive and exploratory methods. Confirmation, to be reached by a "confirmatory" test result, follows from keeping to the preset "level of significance" α, e.g. α = 0.05. This is the maximally tolerated type I error probability, i.e. the probability of claiming the existence of a true treatment effect difference when in reality this effect does not exist.
Trials may be set up to show differences between, or equivalence of, true treatment effects. Originally, statistical tests were exclusively based on null hypotheses of no differences between the true effects of two or more conditions or treatments. In order to provide evidence for the difference of true treatment effects, the trial must provide results that have a small probability under the null hypothesis. In pharmacodynamic trials, the investigation of dose-response relationships is often of this type. Tests for differences can be adapted to this special situation (see 2.5). However, if the goal of a study is to establish equivalence of the true effects of two treatments or conditions with respect to a measurement, the null hypothesis should specify that there is a difference between these effects that is at least as large as what is considered the smallest medically relevant difference. Again, to produce evidence of equivalence, the trial is to produce results that are improbable under the null hypothesis. Failure to reject a null hypothesis of no difference of the true effects alone is not sufficient to assume equivalence. It should be noted that in phase I and phase II trials, a simple question of equivalence of two treatment effects will be rarely of interest. A more common task might be the identification of one of a number of doses of a new treatment that produces an effect equivalent to that of a standard treatment. For the topic of equivalence testing see, for example, [16] or [17].
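The reversed null hypothesis used for equivalence can be illustrated with two one-sided tests on a difference in means. The following Python sketch is not part of the guideline: the function name `tost_equivalence`, the equivalence margin `margin` (standing for the smallest medically relevant difference) and the data in any call are hypothetical, and a large-sample normal approximation is assumed in place of the t-distribution, which makes the result too liberal for the small samples typical of early-phase trials.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_equivalence(x, y, margin, alpha=0.05):
    """Two one-sided tests for equivalence of two means (normal approximation).

    Null hypothesis: the true difference is at least `margin` in absolute value.
    Equivalence is claimed only if BOTH one-sided null hypotheses are rejected.
    """
    diff = mean(x) - mean(y)
    se = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    z = NormalDist()
    p_lower = 1 - z.cdf((diff + margin) / se)  # H0: true difference <= -margin
    p_upper = z.cdf((diff - margin) / se)      # H0: true difference >= +margin
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha
```

With a very narrow margin the same data will fail to show equivalence even though a test for difference would also be non-significant, which is the point made above: failure to reject "no difference" is not evidence of equivalence.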
Tests and confidence intervals can be defined in a one-sided way (one is interested only in a positive difference or only in a negative difference) or in a two-sided way (differences in both directions are of interest or must be considered possible). In trials looking for simple differences, there is wide agreement that the two-sided methods should be used. In dose dependency studies and in equivalence trials, one-sided methods are justifiable.
Studies will yield data which are measured/observed via various types of variables:
In electrophysiologic trials the variable values are measured/observed at different time points and locations (e.g. electrode positioning), and the choice of the variables determines the techniques of statistical analysis (see section 3) which, in general, have to be chosen before start of the study (see 2.7). Any standardization or pre-processing - e. g. logarithmic transformations, differences from baseline or percentages of baseline - and adjustments for covariates should be considered as well. It has to be assured that individual measurements are stochastically independent; in the particular situations of cross-over and repeated measurement designs this must apply to the residuals, and in multivariate analysis to the vectors of measurement.
In a parallel group design each subject is prospectively randomized to one treatment. Since, in general, subjects can be considered stochastically independent with regard to measurements or observations, this design is optimal in matching statistical modeling with clinical reality. Comparisons between treatment effects will have to be made against the interindividual variability which sometimes may be large compared to the effect differences of interest. As a consequence, the parallel group design may require more subjects than the cross-over design although the latter requires more assumptions to be valid, see below.
In a cross-over design each subject is randomised to a sequence of treatments with appropriate washout intervals in between. This means that each subject serves as his own control and differences between treatment effects can be compared against specifically defined intraindividual variability. When modelling such data, the following effects must be taken into account: those of the treatments, of the sequences of treatments, of periods of treatment application, and of the subjects, where the latter are considered as random effects. In particular, one has to pay attention to the possibility of treatment dependent carryover effects, although it may be impossible (in particular in the "two by two" cross-over) or inappropriate to incorporate all these effects into the model. Cross-over designs are optimal if treatment effects are expected to be fully reversible after sufficiently long washout periods for the chemical substance. However, carryover effects may also be due to learning and accommodation effects that differ between the treatments compared, and these effects may last much longer than the presence of the substance in the body. One way to increase protection against problems induced by treatment dependent carryover effects is to apply "extra period" designs (see [10]). Pharmacodynamic trials in volunteers are probably one of the most appropriate domains for cross-over designs. These designs also include Latin squares, which yield balance with respect to application periods and treatments, and Williams squares, which additionally give approximate balance with respect to carryover effects. Textbooks on the cross-over design include [10], [11] and [12].
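As an illustration of the squares mentioned above, the following Python sketch builds treatment sequences with the usual cyclic construction of a Williams design. It is a sketch under assumptions, not the guideline's own method: the function names are hypothetical, treatments are labelled 0 to t-1, at least two treatments are assumed, and for an odd number of treatments the mirrored square is appended so that each treatment follows each other treatment equally often.

```python
from collections import Counter

def williams_design(t):
    """Return treatment sequences (one row per sequence) for t >= 2 treatments."""
    # First row interleaves low and high labels: 0, 1, t-1, 2, t-2, ...
    first, lo, hi = [0], 1, t - 1
    while len(first) < t:
        first.append(lo)
        lo += 1
        if len(first) < t:
            first.append(hi)
            hi -= 1
    # The remaining rows are cyclic shifts of the first row (a Latin square).
    rows = [[(x + shift) % t for x in first] for shift in range(t)]
    # For an odd number of treatments the mirrored square is added as well.
    if t % 2 == 1:
        rows += [row[::-1] for row in rows]
    return rows

def carryover_counts(rows):
    """Count how often each treatment is immediately followed by each other one."""
    return Counter((a, b) for row in rows for a, b in zip(row, row[1:]))
```

Calling `carryover_counts(williams_design(4))` allows the balance of immediate predecessors to be checked directly. Subjects would then be randomised to the rows.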
This design is of interest for testing several treatment effects in combination, e. g. the effects of two or more drugs at several dosages. A factorial design can be performed as parallel-group or as cross-over design.
Time blocks - see also 2.9 - are randomized sequences of treatments successively applied to different subjects, each block containing each treatment equally often. This stratification is highly recommended to assure balance of treatments in the time course of the trial. Time blocks are not intended to be included into a model as an additional factor.
Stratification by age groups or sex, for example, would be attractive from a medical and a statistical point of view. However, in practice the applicability of such designs is often limited by logistic constraints. On the other hand, subgroup analyses by factors not accounted for in the design phase will have limited value except for exploratory analyses because of the usually limited number of subjects in pharmacodynamic trials.
The investigation of a dose-response relationship is much more straightforward if we are looking for monotonic trends only rather than for a minimum or maximum to occur at an intermediate dose. Dose-response relationships can either be established directly using dose or its logarithm (or log[1 + dose] to accommodate placebo) as a covariate in a quantitative model or by looking for monotonicity (or more complex patterns) across the ordered dose groups.
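The use of log[1 + dose] as a covariate can be sketched as a simple least-squares fit. The Python example below is a minimal illustration under assumptions: the function name `log_dose_fit` is hypothetical, a straight-line relationship between the response and log(1 + dose) is assumed, and neither subject effects nor residual variability are modelled.

```python
from math import log

def log_dose_fit(doses, responses):
    """Least-squares slope and intercept of response on log(1 + dose).

    Placebo enters as dose 0, because log(1 + 0) = 0 is well defined.
    """
    x = [log(1 + d) for d in doses]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(responses) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, responses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A slope clearly different from zero would indicate a monotonic trend with dose. Whether that can be claimed in a confirmatory sense depends on the inferential strategy chosen in advance.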
In most pharmacodynamic studies more than one significance test will be performed or more than one confidence interval will be constructed. In this case, and if the overall probability of erroneously rejecting valid null hypotheses is to be confined, e.g. to α = 5 %, the investigator will face the problem of "multiplicity" of testing. Sources of multiplicity are several variables, time points and locations of measurement/observation, and comparisons of more than two treatments. In the simplest case of two significance tests to be performed for the same group of subjects, multiplicity implies that the actual overall error probability of erroneously rejecting at least one null hypothesis when both are true (e.g. no true treatment effect differences in two variables under defined conditions) may be increased to a maximum of 2α = 10 % when α = 5 % is used as the significance level in each test. By application of the so-called Bonferroni adjustment to the individual levels, here ½α = 2.5 %, the "α-inflation" will be reduced to the originally tolerated overall error probability of 2(½α) = 5 %.
The value 2α = 10 % can be illustrated by casting two twenty-faced dice (two differently marked icosahedrons with faces "1" to "20"): of the 20 x 20 = 400 possible different results, 20 + 20 - 1 = 39, or 9.75 % of all castings in the long run, will show at least one "20". When using α = 1/20 = 5 % as the individual error probability, 9.75 % is the probability of at least one null hypothesis rejection when both are true and the two test variables are mutually independent, whereas the probability is 10 % if the two rejections are mutually exclusive. In pharmacodynamic reality two test variables are usually dependent, and the α-inflation will be less than 9.75 %. However, it is only in exceptional cases that use can be made of this, because of the limited information available at the planning stage.
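The dice arithmetic can be checked by direct enumeration. The following Python sketch uses exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

alpha = Fraction(1, 20)   # individual significance level, 5 %
n_tests = 2

# Enumerate all 20 x 20 outcomes of two twenty-faced dice and count
# those showing at least one "20".
hits = sum(1 for a in range(1, 21) for b in range(1, 21) if a == 20 or b == 20)
p_dice = Fraction(hits, 400)

# Probability of at least one false rejection for independent tests.
p_independent = 1 - (1 - alpha) ** n_tests

# Upper bound, reached when the two rejections are mutually exclusive.
p_upper_bound = n_tests * alpha

# Bonferroni adjustment: each test is performed at alpha / n_tests,
# so the upper bound returns to the originally tolerated alpha.
alpha_bonferroni = alpha / n_tests
p_after_adjustment = n_tests * alpha_bonferroni
```

Here `p_dice` and `p_independent` agree at 39/400, `p_upper_bound` is 1/10, and `p_after_adjustment` equals the original 1/20.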
In general, for N ≥ 2 tests, the Bonferroni adjustment for each test is α/N, but it should be planned for at most N = 3 tests. Otherwise, the decrease in test power and/or the increase in the necessary sample size becomes excessive. Also, with four or more tests, the medical relevance of the overall result may be seriously diminished. A number of improvements of the Bonferroni method exist, but they do not alter the situation fundamentally. Bonferroni and related α-adjustments must not be used for safety testing unless the analysis of the study has been planned to address these questions in a confirmatory way. A more detailed discussion of the problems of multiplicity can be found in [18]. For further techniques to cope with multiplicity see also section 3.4.
In case of many interesting individual null hypotheses and when application of multivariate analysis is not an issue or not possible, the investigator must decide - prior to data collection - upon the inferential strategy for coping with the multiplicity problem. Three strategies may be applied: "Confirmatory", "Descriptive" and "Exploratory" Data Analysis, abbreviated as CDA, DDA and EDA, respectively. In the analysis of the data of a particular study, principles of all three strategies may be used. In the present context, CDA implies individual significance tests and confidence intervals each performed/constructed after α-adjustments such that the maximally tolerated overall error probability is α, e.g. α = 5 %. DDA (see section 3.4 for more details) generally aims at a particularly interesting, predefined subarea of possibly many individual tests to show a confirmatory result for the whole subarea and not necessarily for the individual tests in the subarea. In addition, DDA allows for inferences from mere descriptive p-values at predefined test locations. EDA may be used when the data are to provide hints for the design of subsequent studies, that is, for the generation of hypotheses rather than for testing them as is the case in CDA and often in DDA. To this end, the analysis techniques applied in CDA and DDA may also be used in EDA, but without confirmatory interpretation of the results (e.g. p-values) as "significant" in relation to a tolerated overall error probability α, e.g. α = 5 %. EDA often appears to be an appropriate strategy in pharmacodynamic studies of Phase I.
Together with the choice of the strategy, and except for EDA, the analytical methods also have to be chosen in advance. Otherwise, the methods might be chosen according to the eventual appearance of the data, an approach which would invalidate the pre-chosen tolerated error probabilities α and β. Replacement methods will also have to be pre-specified in case the assumptions are not met for applying the methods specified in the first place. In some situations, e.g. with missing values or protocol violations, changes of pre-specified methods become necessary. This is acceptable as long as the changes are made, and documented to have been made, before unblinding of the treatment allocation, e.g. at the latest at a blinded review meeting at the end of the data collection period.
The number of subjects in a study should always be large enough to provide a reliable answer to the questions addressed, but should also be the minimum necessary to achieve this aim. This number is usually determined by the primary efficacy objectives of the study. Sample size determination requires the investigator to make the following specifications:
Moreover, and in general, an estimate or at least a guess of the respective variability is required. (Note: The variability guesses are explicitly needed only for parametric methods but may approximately be used also for non-parametric sample size determination. For more precise sample size determinations in this case see [19] and [20]). Since the error probabilities for erroneously rejecting true null hypotheses are highly dependent upon the inferential strategy to be used, the values of these probabilities must enter the sample size determination based on the required α-adjustments. (The latter does not apply to the inferences to be drawn from an EDA.) The values of α and β are conventionally set at 5 % and at 10 % (the latter not more than 20 %), respectively. In CDA and in the confirmatory parts of DDA, sample size calculation should be performed for all planned individual tests relating to the primary efficacy objectives of the study. The largest resulting sample size should be taken as definitive for the study and documented in the study protocol. An overview of the literature on sample size calculations for equivalence trials can be found in [14], while [15] gives an overview of software tools for sample size determination.
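How α, β, the variability guess and an α-adjustment enter the calculation can be sketched for the simplest case. The Python example below is an assumption-laden illustration, not a recommendation: it uses the common large-sample normal approximation for a two-sided comparison of two means in a parallel group design, the function name `n_per_group` and its parameters are hypothetical, and a Bonferroni adjustment is assumed when `n_tests` is larger than one.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, beta=0.10, n_tests=1):
    """Approximate subjects per group to detect a true difference `delta`.

    sigma   : guessed standard deviation of the variable
    alpha   : tolerated overall two-sided type I error probability
    beta    : tolerated type II error probability (power = 1 - beta)
    n_tests : number of confirmatory tests sharing alpha (Bonferroni)
    """
    z = NormalDist()
    alpha_adjusted = alpha / n_tests
    z_alpha = z.inv_cdf(1 - alpha_adjusted / 2)
    z_beta = z.inv_cdf(1 - beta)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
```

Calling the function once per planned confirmatory test and taking the largest result corresponds to the rule stated above. For small samples, a calculation based on the t-distribution would give slightly larger numbers.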
In many studies sample sizes are limited by cost and/or logistic constraints. It is important that the relation between the power of the tests and the detectable true treatment effect differences at a chosen α-level, given the limited sample size, is investigated before starting the study.
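When the sample size is fixed in advance, the same relation can be read in the other direction. The following Python sketch again assumes a two-sided comparison of two means in a parallel group design with a large-sample normal approximation; the function names are hypothetical, and the power formula ignores the negligible contribution of the opposite tail.

```python
from math import sqrt
from statistics import NormalDist

def detectable_difference(n, sigma, alpha=0.05, beta=0.10):
    """Smallest true difference detectable with power 1 - beta, n per group."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)) * sigma * sqrt(2 / n)

def approximate_power(n, delta, sigma, alpha=0.05):
    """Approximate power to detect a true difference delta with n per group."""
    z = NormalDist()
    return z.cdf(delta / (sigma * sqrt(2 / n)) - z.inv_cdf(1 - alpha / 2))
```

If the detectable difference for the affordable sample size is larger than any medically plausible effect, this is known before the study starts rather than after.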
For a sample size based on the choice of the inferential strategy early consultation of a statistician is recommended.
The two most important design techniques for avoiding bias in clinical trials are blinding and randomisation, and these should be normal features of most controlled clinical trials. Most such trials follow a double-blind approach in which the medications are pre-packed in accordance with a suitable randomisation schedule, and supplied to the trial centre(s) labelled only with the patient number and the treatment period (in case of cross-over) so that no-one involved in the conduct of the trial is aware of the specific treatment allocated to any particular subject (or subject period), not even by a code letter.
Blinding is intended to limit the occurrence of conscious and unconscious bias in the conduct and interpretation of a clinical trial arising from the influence which the knowledge of treatment may have on the recruitment and allocation of subjects, on their expectations and attitudes and on the conduct of the trial.
In a double blind trial neither the subject nor any of the staff involved in the treatment or clinical evaluation of the subject is aware of the treatment received. In a single blind trial the investigator and/or his staff are aware of the treatment but not the subject. The double blind trial is the optimal approach. This requires that the medications to be applied during the trial cannot be distinguished in any way (appearance, taste, etc.) either before or during administration.
Randomisation assures equal chances of the assignment of each treatment (treatment sequence) to a subject in a clinical trial. During subsequent analysis of the trial data, it provides the sound statistical basis for the quantitative evaluation of the evidence relating to treatment effects. It also tends to produce treatment groups in which the distributions of prognostic factors (known and unknown) are similar. In combination with blinding, randomisation helps to avoid possible bias in the selection of patients arising from the predictability of treatment assignments.
The randomisation schedule of a clinical trial documents the random allocation of treatments or treatment sequences (in case of cross-over studies) to subjects. In the simplest situation it is a sequential list of treatments, or treatment sequences in a cross-over study, or corresponding codes by subject number. Different designs will require different randomisation schedules.
Randomisation will usually be performed in blocks, particularly in time blocks. In a Latin Square design, each square will correspond to a block. In general, care must be taken to choose time block lengths which are sufficiently short to limit possible imbalance, but which are long enough to avoid predictability towards the end of the sequence in a block. Investigators should generally be blind to the block length; the use of two or more different time block lengths, randomly selected for each block, can achieve the same purpose. (Adapted from [6], chapter 7)
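Randomisation in time blocks of randomly varying length can be sketched as follows. This Python example is only an illustration: the function name `randomise_in_blocks`, the treatment labels in any call and the choice of block lengths as one or two repetitions of the treatment list are assumptions, and only complete blocks are returned, so a few unused assignments may remain at the end of the schedule.

```python
import random

def randomise_in_blocks(treatments, n_subjects, multiples=(1, 2), seed=None):
    """Return a list of randomised blocks covering at least n_subjects.

    Each block contains every treatment equally often. The block length is
    len(treatments) times a multiple drawn at random for each block, which
    makes the end of a block harder to predict.
    """
    rng = random.Random(seed)
    blocks, allocated = [], 0
    while allocated < n_subjects:
        block = list(treatments) * rng.choice(multiples)
        rng.shuffle(block)
        blocks.append(block)
        allocated += len(block)
    return blocks
```

For a cross-over study the same function could be applied to treatment sequences instead of single treatments.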
All planning issues of a study will be laid down in the so-called Study Protocol. Such a protocol, signed by all responsible personnel (in particular the sponsor, the investigator, and the statistician) prior to the start of the study, may become a necessity in later documentation for regulatory agencies. It also forces the personnel to adhere to the study plan as agreed upon before. Such an agreement may include a data driven analysis when so decided in the planning phase (see "EDA", section 2.7). More technical details and adaptations of the analysis should be laid down in an analysis plan to be finalised after the blinded review meeting, i.e. before unblinding with respect to treatment allocation (see also 3.3 below).
The methods used for statistical analysis will be determined by the choice of variables and of transformations of these variables. In general, quantitative variables which do not have a unimodal distribution should be avoided (see e. g. [25]). If parametric methods are considered, a fairly symmetric distribution is required. Moreover, transformations toward the normal should be considered (see [26], [27]).
In this chapter methods will be discussed disregarding the problem of multiplicity. Therefore, these methods apply only for one variable measured at one time point and location, otherwise they have to be seen in conjunction with the methods to deal with multiplicity discussed in 3.4 below.
The elementary description of the characteristics of a quantitative variable is conventionally made by giving a measure of location and of variability. If the distribution is close to normal (symmetric, unimodal and without outliers) mean and standard deviation are good estimators. In other cases the median and the interquartile range or the MAD (median absolute deviation from the median) are to be preferred. (Note that for some distributions, in particular for proportions, variability is a function of location.)
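The difference between the two sets of descriptive measures shows up as soon as a single outlier is present. The following Python sketch computes both; the function name `describe` is hypothetical, the MAD is left unscaled, and the quartiles follow the default method of the standard library, which is one of several conventions.

```python
from statistics import mean, median, quantiles, stdev

def describe(values):
    """Location and variability measures for one quantitative variable."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": med,
        "iqr": q3 - q1,
        # Median absolute deviation from the median.
        "mad": median(abs(v - med) for v in values),
    }
```

For values such as 1, 2, 3, 4 and 100, the mean and standard deviation are dominated by the single extreme value, whereas the median and the MAD are not.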
Tests are the classical tool for confirmatory data analysis and are also central to the confirmatory parts of a DDA. In simple situations, such as the two-sample comparison, nonparametric tests are usually to be preferred over their parametric counterparts. This is because they need far fewer assumptions to be met, although randomisation of treatments and independence of variable values (or residuals) are required just as for parametric analysis. The loss of efficiency with nonparametric tests, if the assumptions of a parametric analysis are also met, is minimal. For nonparametric methods, no transformation of variables is needed as is sometimes necessary for parametric analyses. The most common tests are the two-sample Wilcoxon-Mann-Whitney test and the Kruskal-Wallis Analysis of Variance for the parallel group design and the one-sample Wilcoxon-Signed-Rank test for differences between treatment effects in the cross-over design if sequence effects are negligible. Under certain assumptions the Friedman Analysis of Variance plays an important role in cross-over designs with more than two treatments. Contingency table techniques, such as extended Mantel-Haenszel techniques, usually aimed at nominal variables, may also be applied to ordinal data, e.g. for trend analyses (see e.g. the overview of Koch and Edwards [21]). In more complex situations some nonparametric methods have been developed but are not yet in common use or contained in statistical program packages. In such situations restriction to two-sample tests will be a good choice (see "DDA", section 3.4).
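For the small groups typical of pharmacodynamic trials, the two-sample Wilcoxon-Mann-Whitney test can be carried out exactly by enumerating all possible allocations of the observed values to the two groups. The Python sketch below is an illustration under assumptions: the function names are hypothetical, the p-value is two-sided, ties are counted as half, and the enumeration is only practical for small samples.

```python
from itertools import combinations

def u_statistic(x, y):
    """Number of pairs with x > y, counting ties as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)

def exact_wmw_pvalue(x, y):
    """Two-sided exact p-value from the permutation distribution of U."""
    pooled = list(x) + list(y)
    m, n = len(x), len(y)
    centre = m * n / 2                      # expected U under the null hypothesis
    observed = abs(u_statistic(x, y) - centre)
    indices = range(m + n)
    extreme = total = 0
    for chosen in combinations(indices, m):
        chosen_set = set(chosen)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in indices if i not in chosen_set]
        if abs(u_statistic(group_x, group_y) - centre) >= observed:
            extreme += 1
        total += 1
    return extreme / total
```

Because the validity of the permutation distribution rests on the random allocation of treatments, the example also shows why randomisation remains a requirement for nonparametric tests.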
Parametric models, in particular if they describe more complex situations, are based on many assumptions, for which it is often difficult if not impossible to verify that they hold even approximately. Transformations (see 2.4) and, in particular, rank transformations [13] may be of some help, but often provide only partial solutions. On the other hand, parametric models allow taking into account covariates and factors that might influence the outcome. Including too many factors and covariates in a model may become problematic because, with the limited number of subjects per factor level combination, models may become unstable, and moreover, the interpretation of the results may become difficult. In cross-over trials, period effects and baseline values, i.e. the values of the response variable measured before the start of treatment, are the most likely candidates for inclusion in the model (see also 2.5 above).
Confidence intervals combine the information from estimates of location and significance tests. This method of inference from the study group to the defined study population is superior to that of a test alone. Whenever possible, therefore, confidence intervals should be constructed rather than tests performed only. The assumptions underlying a confidence interval are those of the corresponding test. Confidence intervals corresponding to nonparametric tests exist; however, they tend to be conservative and are not supported by some of the popular software packages.
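The conservativeness of nonparametric intervals can be seen in a simple case. The following Python sketch constructs the distribution-free confidence interval for a median that corresponds to the sign test, using order statistics; this particular interval is chosen here only as an example, the function name `median_ci` is hypothetical, and a continuous distribution is assumed.

```python
from math import comb

def median_ci(data, conf=0.95):
    """Order-statistic confidence interval for the median.

    Returns (lower, upper, actual_coverage) for the narrowest interval of the
    form (x_(k), x_(n-k+1)) whose coverage is at least `conf`, or None if the
    sample is too small to reach that coverage.
    """
    x = sorted(data)
    n = len(x)
    best = None
    tail = 0.0
    for k in range(1, n // 2 + 1):
        tail += comb(n, k - 1) / 2 ** n   # P(Binomial(n, 1/2) <= k - 1)
        coverage = 1 - 2 * tail
        if coverage < conf:
            break
        best = (x[k - 1], x[n - k], coverage)
    return best
```

With ten observations the interval runs from the second smallest to the second largest value, and its actual coverage is about 97.9 % rather than the nominal 95 %, because only a discrete set of coverage levels is attainable.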
Missing values represent a potential source of bias in a clinical trial. Hence, every effort should be undertaken to fulfill all the requirements of the protocol concerning the collection of data and their subsequent management. However, in reality there will almost always be some missing data. A study may be regarded as valid nonetheless, provided the methods of dealing with missing values are sensible, and particularly if these methods are pre-defined in the analysis plan of the protocol. Pre-definition of methods may be updated during the blind review (a review to plan details of the analysis under blind conditions, i.e. before the treatment codes have been broken). Unfortunately, no universally applicable methods of handling missing values can be recommended. An investigation should be made concerning the sensitivity of the results of analysis to the method of handling missing values, especially if the number of missing values is substantial. Again, consulting a statistician is recommended.
A similar approach should be adopted to the exploration of the influence of outliers, the statistical definition of which is, to some extent, arbitrary. Clear identification of a particular value as an outlier is most convincing when justified medically as well as statistically, and the medical context will then often define the appropriate action. Any outlier procedure set out in the protocol should not favour any treatment a priori. Once again, this aspect of the analysis plan can be usefully updated during the blind review. If no procedure for dealing with outliers was foreseen in the study protocol, one analysis with the actual values and at least one other analysis eliminating or reducing the outlier effect should be performed and differences between their results discussed. (From [6], 10.2) This procedure may as well be part of the analysis section of the Study Protocol.
It should be noted that nonparametric methods essentially remove the analytical problem of outliers (e.g. use of medians rather than averages), and coping with missing values is easier by restriction to two-sample comparisons (see section 3.2 and DDA in 3.4).
Nevertheless, outliers, missing values and protocol violators will always be possible sources of bias and a concern in interpreting results. In Phase III trials to show superiority, the use of the Intention-to-Treat (ITT) sample, i. e. all patients randomised and treated at least once, and an appropriate (conservative) imputation scheme to replace missing values is accepted practice. Since in Phase I and Phase II generalisation to the future patient population is not the central issue of a trial, more restricted samples may be more appropriate for the primary analysis. In particular, in cross-over trials, a drop out may cause severe imbalances. Therefore, in these trials, replacement of discontinuations which do not seem drug related may be recommended. In order to safeguard against the introduction of bias, it is recommended to perform an additional analysis with the data available, but without replacement of discontinuations.
There are many sources of multiplicity in pharmacodynamic trials, in particular in electrophysiologic trials, with different degrees of inherent structures. The most important ones are multiple comparisons of treatment effects (with a partial order in case of several doses, active control and placebo), multiple time points (typically three to ten, with a strict order), locations (e. g. 19 to 32 electrode positions with right left symmetry) and electrophysiological parameters (e. g. in four to five frequency bands in the traditional EEG). For studies with repeated drug administration, there is a multiplicity on a second time scale.
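The most important techniques to cope with multiplicity can be categorised as multiple testing approaches, multivariate methods, modelling of dependencies, summary measures and Descriptive Data Analysis (DDA).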
In general, a combination of these techniques will be used, adapted to the structure of the various sources of multiplicity.
In multiple testing approaches, sets of individual tests are considered and conditions are sought under which the overall level of significance is maintained. The so-called closed test procedure is central to these methods. In addition to the Bonferroni method mentioned in 2.6 and some sharpenings such as Bonferroni-Holm, Simes and Hochberg, ordered test procedures [22] deserve special mention. Implications for the power (β-correction) of the test procedure must not be forgotten with the latter procedures. The advantage of all these procedures is their generality and the simple fact that the individual tests may show individual confirmatory significances while the global significance level (global error probability) is confined to the prechosen α, e. g. α = 5 %. However, the limitations outlined in 2.6 above should be kept in mind.
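Two of the Bonferroni sharpenings named above can be sketched in a few lines. This is a minimal illustration in plain Python; the function names and the p-values are hypothetical. Note that the Hochberg step-up procedure relies on additional assumptions about the dependence between the tests (e.g. independence), whereas Bonferroni-Holm does not.

```python
def holm(pvalues, alpha=0.05):
    """Bonferroni-Holm step-down procedure; returns True where a hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first hypothesis that cannot be rejected
    return reject

def hochberg(pvalues, alpha=0.05):
    """Hochberg step-up procedure: find the largest ordered p-value that meets
    its threshold, then reject it together with all smaller ones."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank in range(m - 1, -1, -1):
        if pvalues[order[rank]] <= alpha / (m - rank):
            for i in order[: rank + 1]:
                reject[i] = True
            break
    return reject

# Hypothetical p-values from four individual tests.
p = [0.010, 0.040, 0.030, 0.005]
holm_result = holm(p)          # [True, False, False, True]
hochberg_result = hochberg(p)  # [True, True, True, True]
```

In this example plain Bonferroni (each test at 0.05/4) and Bonferroni-Holm reject the same two hypotheses, while Hochberg rejects all four because the largest p-value already lies below the unadjusted level.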
Multivariate methods consider all or a subset of the measurements of a subject (under one treatment in the case of cross-over designs) as one vector-valued observation. Parametric methods are Hotelling's T² and its generalisation, the Multivariate Analysis of Variance (MANOVA). The problem common to all parametric methods, i. e. that critical assumptions often cannot be verified or are likely not to be met, becomes even more severe with these multivariate methods. In addition, a large number of covariances have to be estimated, a requirement incompatible with the usually small sample size of pharmacodynamic trials. Finally, results in terms of linear combinations of the components (i. e. the individual variables) of the measurement vector are difficult to interpret.
A nonparametric method based on the sum of normalised ranks of individual measurements has been proposed by O'Brien [23]. It avoids the assumption of multivariate normality and the need to estimate covariances. However, the difficulties in interpreting the results remain.
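The rank-sum idea can be sketched as follows, assuming that all variables are oriented so that larger values indicate a stronger effect. Each variable is ranked over the pooled subjects, the ranks are summed per subject, and the two groups are compared on these sums. The exact permutation test used for the final comparison is a choice made here for a self-contained illustration, not necessarily the comparison used in [23]; the data and function names are hypothetical.

```python
from itertools import combinations

def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def rank_sum_scores(group_a, group_b):
    """One score per subject: the sum over variables of the subject's pooled rank."""
    pooled = group_a + group_b
    n_vars = len(pooled[0])
    per_var = [midranks([subj[v] for subj in pooled]) for v in range(n_vars)]
    scores = [sum(per_var[v][s] for v in range(n_vars)) for s in range(len(pooled))]
    return scores[: len(group_a)], scores[len(group_a):]

def permutation_p(scores_a, scores_b):
    """Exact two-sided permutation p-value for the difference in mean scores."""
    pooled = scores_a + scores_b
    n_a, n_b, total = len(scores_a), len(scores_b), sum(pooled)

    def stat(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = stat(sum(scores_a))
    hits = count = 0
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        if stat(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            hits += 1
    return hits / count

# Hypothetical data: four subjects per group, three variables per subject.
placebo = [(1.0, 2.1, 0.5), (1.2, 1.9, 0.7), (0.9, 2.0, 0.6), (1.1, 2.2, 0.4)]
drug = [(1.8, 2.9, 1.1), (2.0, 3.1, 1.3), (1.7, 3.0, 1.2), (1.9, 2.8, 1.0)]

scores_placebo, scores_drug = rank_sum_scores(placebo, drug)
p_value = permutation_p(scores_placebo, scores_drug)
```

Because no covariance matrix is estimated, the sketch works even with very small groups, but the single global p-value does not say which variables carry the effect.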
One method to overcome some problems of multivariate models is to make use of constraints on the covariance matrix that can either be derived from the study design or are imposed as additional assumptions. Repeated measurements analyses and mixed linear models are of special interest here. The robustness of the results obtained by these parametric methods should be assessed carefully, since they are based on quite strong assumptions about the data. On the other hand, these results can, in general, be interpreted much more easily than those of general multivariate methods.
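The gain from constraining the covariance matrix can be made concrete by counting the parameters to be estimated. The sketch below compares an unstructured matrix with two commonly used constrained structures; the choice of compound symmetry and first-order autoregression as examples is an assumption for illustration, and the function names are hypothetical.

```python
def n_covariance_parameters(p, structure="unstructured"):
    """Number of free variance and covariance parameters for p repeated measurements."""
    if structure == "unstructured":
        return p * (p + 1) // 2   # p variances plus p(p-1)/2 covariances
    if structure in ("compound_symmetry", "ar1"):
        return 2                  # one common variance and one correlation parameter
    raise ValueError(f"unknown structure: {structure}")

def compound_symmetry(p, variance, rho):
    """Covariance matrix with equal variances and equal correlations."""
    return [[variance if i == j else rho * variance for j in range(p)]
            for i in range(p)]

# Ten time points, the upper end of the range typical for these trials.
counts = {s: n_covariance_parameters(10, s)
          for s in ("unstructured", "compound_symmetry", "ar1")}
```

For ten time points the unstructured matrix has 55 parameters against 2 for either constrained structure, which is why the constrained models remain estimable in small trials, at the price of assumptions whose plausibility has to be checked.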
In many cases it is possible to reduce multiple observations to one or a small set of meaningful derived variables, thus avoiding or reducing the problem of multiplicity. In the case of CDA or DDA the derivation should be specified a priori and be described in the Study Protocol. If any coefficients are needed, these should have been estimated from an independent sample, not from the trial data. Examples of such summary measures are the maximum, the time point of the maximum or the area under the curve for time courses, equivalent dipoles, instantaneous power derived from "momentary maps" for a spatial array of amplitudes, and discriminant functions or factor loadings derived from a database of reference subjects which should be distinct from the trial subjects.
If well-established summary measures are at hand, this is a very powerful way of reducing multiplicity. Of course, this method breaks down if the investigator has only vague ideas about the effects to be expected on individual measurements. A numerical example of the dangers of summary measures is given in [24].
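The three summary measures named above for time courses can be computed as in the following sketch. The trapezoidal rule for the area under the curve and the example values are assumptions for illustration; the protocol should state the exact definition to be used.

```python
def summary_measures(times, values):
    """Maximum, time point of the maximum and trapezoidal area under the curve."""
    peak = max(values)
    t_peak = times[values.index(peak)]  # first time point at which the maximum occurs
    auc = sum((t1 - t0) * (v0 + v1) / 2
              for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:]))
    return {"max": peak, "t_max": t_peak, "auc": auc}

# Hypothetical time course of one subject: hours after dosing and change from baseline.
times = [0, 1, 2, 4, 6]
values = [0.0, 2.0, 3.0, 1.0, 0.0]
measures = summary_measures(times, values)
```

Five repeated measurements per subject are thus reduced to one value per summary measure, so that a single test per measure replaces a test at every time point.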
Descriptive Data Analysis (DDA) [18] is a generally applicable concept and of particular use in the early phases of drug development when there is little a priori knowledge about treatment effects. It is suitable for studies with treatment effects measured or observed in many variables, at many locations and many time points. In spite of this multiplicity situation, the investigator quite often will be able to predefine a subarea of any number of desired individual variable/location/time-point combinations ("test points", preferably involving two-group comparisons only) which appear most important with respect to the goals of the study. In this subarea all individual null hypotheses may then be tested, each at a significance level that needs only moderate α-adjustment, e. g. α/3 or larger. If at least a desired and predefined percentage of test points show nominal significance, each at the adjusted α-level (which is determined by the desired percentage), the "global" null hypothesis comprising all individual null hypotheses in the subarea is rejected in a confirmatory way at level α according to the Hailperin-Rüger test. For reasonable conclusions the individual (nominal) significances should display medically plausible patterns. All individual tests outside the predefined subarea may be performed in the descriptive way only, each at a "descriptive" level α, e. g. α = 5 %. Again, but without confirmatory support, medically plausible patterns of "descriptive" significances may lead to valuable conclusions. Corresponding conclusions, if applicable, may be drawn from the nominal significances in the predefined subarea when the Rüger test did not show confirmatory significance.
As in CDA, it is important in DDA that all individual null hypotheses to be tested are defined before the start of data collection, or at the latest before unblinding of the data. This includes the definition of the subarea for the Rüger test. With respect to its confirmatory part, DDA has an advantage over other techniques of coping with multiplicity: it allows the investigator to immediately judge treatment effects at individual test points. When applicable, points in the predefined subarea showing nominal significance (though not necessarily individual confirmatory significance) may be interpreted as supporting the global confirmatory significance relating to the whole subarea. Close co-operation between physician and statistician is required to elaborate the structure of the data to be obtained in order to fully exploit the possibilities of this technique.
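The global test on the predefined subarea can be sketched as follows, assuming the usual formulation of the Rüger test: with n test points and a predefined minimum number k of significant points, each individual test is carried out at level kα/n, and the global null hypothesis is rejected at level α if at least k of them are significant. The function name and the p-values are hypothetical.

```python
def ruger_test(pvalues, k, alpha=0.05):
    """Global test over a predefined subarea of test points.

    Returns (reject_global, adjusted_level, n_significant).
    """
    n = len(pvalues)
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and the number of test points")
    adjusted_level = alpha * k / n  # determined by the desired fraction k/n
    n_significant = sum(p <= adjusted_level for p in pvalues)
    return n_significant >= k, adjusted_level, n_significant

# Hypothetical subarea of 12 test points; at least one third (k = 4) must be
# significant, so each test is performed at alpha/3.
subarea_p = [0.001, 0.004, 0.012, 0.015, 0.016, 0.03, 0.04, 0.08,
             0.20, 0.35, 0.50, 0.90]
reject_global, adjusted_level, n_significant = ruger_test(subarea_p, k=4)
```

Here five of the twelve test points are significant at the adjusted level α/3, so the global null hypothesis for the subarea is rejected at α = 5 %. Both the subarea and k have to be fixed before unblinding for this conclusion to be confirmatory.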
Pharmacodynamic studies may address safety issues as their primary goal (e. g. vigilance or drug interactions). In this case the methods described in chapters 2 and 3 apply. In addition, there are safety issues common to other Phase I or Phase II trials that also apply to pharmacodynamic studies, such as rules for early termination of a trial. It must be kept in mind that in safety testing, as a rule, no α-adjustment must be applied (see also the last paragraph of 2.6 above).
A detailed report should be prepared documenting the study design and methods of data collection, analysis and results of all measurements and observations including adverse events, and demographic and anamnestic information. For DDA a graphical or tabular display of all individual test points and of the subarea predefined for the Hailperin-Rüger global test is necessary, and the corresponding significances should be identified.
Publications in scientific journals will, in general, not allow for such completeness. However, some statistical aspects should be covered: study design (including the basis for sample size determination), number of subjects treated, number of subjects included in the analysis samples and a summary of the reasons for exclusion; also, the definition of the variables of primary interest including data collection and pre-processing (see also guidelines for data acquisition [2]). A clear distinction between preplanned Confirmatory or Descriptive Data Analysis on the one hand and data-driven Exploratory Data Analysis on the other is necessary. The statistical methods used must also be reported (a p-value without mention of the test it was obtained with is not acceptable), together with a summary of the methods used to deal with missing, incomplete or implausible data.
[8] Armitage P, Berry G: Statistical methods in medical research, ed 2. Oxford, Blackwell, 1987.
[9] Sachs L: Angewandte Statistik, ed 8. Berlin, Heidelberg, New York, Tokyo, Springer, 1997.
[11] Senn S: Cross-over trials in clinical research. Chichester, Wiley, 1993.
[13] Conover WJ: Practical nonparametric statistics, ed 2. New York, Wiley, 1980.