P value adjustments for multiple tests in multivariate binomial models

Peter H. Westfall; S. Stanley Young

doi:10.1080/01621459.1989.10478837

Business Administration

Research output: Contribution to journal › Article › peer-review

167 Scopus citations

Abstract

Data from rodent carcinogenicity (preclinical) and clinical studies involving new drugs may be modeled as having come from multivariate binomial distributions. In two-year rodent carcinogenicity studies, there are typically 20–50 tissues examined for occurrence of any of several possible lesions. For a particular treatment group, the number of occurrences of a particular lesion at a particular tissue may be modeled as binomial, and the vector of such frequencies may be considered multivariate binomial with unspecified dependence structure. The same model may also apply to clinical side-effects data; in this case the marginal frequencies may represent occurrences of events ranging from headaches to ingrown toenails. Frequently, the goal of such studies is to isolate site-specific significant differences between treatment and control groups. For example, in rodent carcinogenicity analyses it is generally not sufficient to claim that a new compound causes an increase in tumors at some unspecified site; rather, the report should identify the particular sites where unusual increases are noted. Such an analysis requires separate tests for each site. False significances may easily occur when multiple tests are performed. When a marginal significance criterion p ≤ .05 is used, experimentwise false significance rates as large as 44% have been reported (Haseman, Winbush, and O’Donnell 1986). Others have reported the experimentwise false significance rate much lower; for example, Gart, Chu, and Tarone (1979) reported 8%–10% for each sex and species combination of a two-sex, two-species experiment. In this article it is proposed that the experimentwise false significance rate be controlled by adjusting all p values for the multiplicity of testing using vector-based resampling methods. 
This analysis is an extension of the bootstrap method described by Westfall (1985) to the multisample case, with particular application to models useful in clinical and preclinical biopharmaceutical analyses; it is also similar to the methodology proposed by Brown and Fears (1981). Assuming no differences between treatment and control groups (the null case), one may estimate the multivariate binomial distribution or permutation distribution conveniently via vector resampling. Using this estimated distribution, one may easily estimate (via Monte Carlo) the probability that the smallest p value in the study is smaller than any given threshold. An adjusted p value is then defined as the probability that the smallest p value in the study is less than or equal to the observed p value for the given test. This methodology is compared to the usual Bonferroni-style adjustments, and it is demonstrated that these adjustments are grossly conservative in certain instances because of their failure to account for dependence between tests and the discreteness of the data. Results of bootstrap and permutation resampling adjustments tend to be similar, particularly for large sample sizes. The approaches are philosophically different: Bootstrap resampling is preferable if an unconditional analysis is desired [Upton (1982) demonstrated that nominal and actual Type I errors are closer and that statistical power is greater in the univariate two-sample case] whereas permutation resampling gives essentially exact results and is preferable if a conditional analysis is desired [Yates (1984) gave philosophical arguments for favoring the conditional approach].
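
The adjusted p value defined in the abstract can be illustrated with a short Monte Carlo sketch. This is not the authors' code. It assumes a 0/1 matrix `y` with one row per subject and one column per site, a boolean `group` vector marking treated subjects, and one-sided Fisher exact tests as the marginal tests (the abstract does not say which marginal test is used). The function names `raw_p_values` and `minp_adjusted`, the `method` switch, and the pooled-with-replacement bootstrap scheme are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.stats import fisher_exact


def raw_p_values(y, group):
    """One-sided Fisher exact p value per site (more events in treated)."""
    y = np.asarray(y)
    group = np.asarray(group, dtype=bool)
    n_t, n_c = group.sum(), (~group).sum()
    p = np.empty(y.shape[1])
    for j in range(y.shape[1]):
        a = y[group, j].sum()    # events in the treated group at site j
        c = y[~group, j].sum()   # events in the control group at site j
        table = [[a, n_t - a], [c, n_c - c]]
        _, p[j] = fisher_exact(table, alternative="greater")
    return p


def minp_adjusted(y, group, n_resamples=2000, method="permutation", seed=0):
    """Estimate P(min resampled p <= observed p_j) for each site j."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    group = np.asarray(group, dtype=bool)
    n = y.shape[0]
    p_obs = raw_p_values(y, group)
    count = np.zeros_like(p_obs)
    for _ in range(n_resamples):
        if method == "permutation":
            # Shuffle whole subject vectors (rows) without replacement.
            idx = rng.permutation(n)
        else:
            # Assumed bootstrap scheme: draw rows with replacement
            # from the pooled data, keeping the group sizes fixed.
            idx = rng.integers(0, n, size=n)
        # Resampling rows keeps the dependence between sites intact.
        p_star = raw_p_values(y[idx], group)
        # Small tolerance guards against floating-point ties.
        count += p_star.min() <= p_obs + 1e-12
    return p_obs, count / n_resamples
```

Calling `minp_adjusted(y, group)` returns the raw and adjusted p values per site. Because whole rows are resampled, both the dependence between sites and the discreteness of the counts carry through to the distribution of the smallest p value, which is the reason the abstract gives for the conservativeness of Bonferroni-style adjustments.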

Original language	English
Pages (from-to)	780-786
Number of pages	7
Journal	Journal of the American Statistical Association
Volume	84
Issue number	407
DOIs	https://doi.org/10.1080/01621459.1989.10478837
State	Published - Sep 1989

Keywords

Bootstrap
Clinical trial
Permutation test
Rodent carcinogenicity study
Simultaneous test procedure

Cite this

@article{bca89cf6e2f74acbaabf7907bf941b66,

title = "P value adjustments for multiple tests in multivariate binomial models",

keywords = "Bootstrap, Clinical trial, Permutation test, Rodent carcinogenicity study, Simultaneous test procedure",

author = "Westfall, {Peter H.} and Young, {S. Stanley}",

year = "1989",

month = sep,

doi = "10.1080/01621459.1989.10478837",

language = "English",

volume = "84",

pages = "780--786",

journal = "Journal of the American Statistical Association",

issn = "0162-1459",

number = "407",

}

TY - JOUR

T1 - P value adjustments for multiple tests in multivariate binomial models

AU - Westfall, Peter H.

AU - Young, S. Stanley

PY - 1989/9

Y1 - 1989/9

KW - Bootstrap

KW - Clinical trial

KW - Permutation test

KW - Rodent carcinogenicity study

KW - Simultaneous test procedure

UR - http://www.scopus.com/inward/record.url?scp=0000553457&partnerID=8YFLogxK

U2 - 10.1080/01621459.1989.10478837

DO - 10.1080/01621459.1989.10478837

M3 - Article

AN - SCOPUS:0000553457

SN - 0162-1459

VL - 84

SP - 780

EP - 786

JO - Journal of the American Statistical Association

JF - Journal of the American Statistical Association

IS - 407

ER -
