Sample Size Tables for Logistic Regression

STATISTICS IN MEDICINE, VOL. 8, 795-802 (1989)

SAMPLE SIZE TABLES FOR LOGISTIC REGRESSION
F. Y. HSIEH*
Department of Epidemiology and Social Medicine, Albert Einstein College of Medicine, Bronx, N Y 10461, U.S.A.

SUMMARY
Sample size tables are presented for epidemiologic studies which extend the use of Whittemore’s formula. The
tables are easy to use for both simple and multiple logistic regressions. Monte Carlo simulations are
performed which show three important results. Firstly, the sample size tables are suitable for studies with
either high or low event proportions. Secondly, although the tables can be inaccurate for risk factors having
double exponential distributions, they are reasonably adequate for normal distributions and exponential
distributions. Finally, the power of a study varies both with the number of events and the number of
individuals at risk.
KEY WORDS: Logistic regression; Sample size

INTRODUCTION AND ASSUMPTIONS
Logistic regression is commonly used in the analysis of epidemiologic data to examine the
relationship between possible risk factors and a disease. In follow-up studies the proportion of
individuals with disease (event) is usually low, but it is higher in case-control studies. In this paper
I present tables of the required number of subjects in such studies for event proportions ranging
from 0.01 to 0.50, covering most follow-up and case-control studies.
In the logistic regression model, the dependent variable (the disease status) is a dichotomous
variable taking the values 0 for non-occurrence and 1 for occurrence. If the independent variable
(such as the risk factor) is also dichotomous, the approximate required sample size can be found
from published sample size tables for the comparison of two proportions.1 For matched case-control
studies, sample size calculations can be obtained from Dupont.2 The sample size tables
which I present in this paper are derived from Whittemore’s formula.3 The tables assume that the
risk factors are continuous and have a joint multivariate normal distribution. The following
section describes the sample size tables and their use.
THE SAMPLE SIZE TABLES
Tables I to V display the required sample size for a study using logistic regression with only one
covariate (that is, risk factor). To use the tables one must specify (1) the probability P of events at
the mean value of the covariate, and (2) the odds ratio r of disease corresponding to an increase of
one standard deviation from the mean value of the covariate.
The tables give five choices of percentage values for the one-tailed significance level $\alpha$ per cent
and the power $1 - \beta$ per cent: (I) $\alpha = 5$, $1 - \beta = 70$; (II) $\alpha = 5$, $1 - \beta = 80$; (III) $\alpha = 5$, $1 - \beta = 90$;
(IV) $\alpha = 5$, $1 - \beta = 95$; (V) $\alpha = 1$, $1 - \beta = 95$. As explained in Appendix I, the sample size for an
odds ratio r is the same as that required for an odds ratio 1/r. For example, the sample sizes for
odds ratios of 2 and 2.5 are the same as those required for odds ratios 0.5 and 0.4, respectively.

* Current address: Anaquest, BOC Health Care, 100 Mountain Avenue, Murray Hill, NJ 07974, U.S.A.

0277-6715/89/070795-08$05.00
© 1989 by John Wiley & Sons, Ltd.

Received September 1988
Revised January 1989

When there is more than one covariate in the model, multiple logistic regression may be used to
estimate the relationship of a covariate to disease, adjusting for the other covariates. The sample
size required to detect such a relationship is greater than that listed in Tables I to V. For the
calculation of sample size, let $\rho$ denote the multiple correlation coefficient relating the specific
covariate of interest to the remaining covariates. One must specify (1) the probability P of an
event at the mean value of all the covariates, and (2) the odds ratio r of disease corresponding to an
increase of one standard deviation from the mean value of the specific covariate, given the mean
values of the remaining covariates. The sample size read from Tables I to V should then be divided
by the factor $1 - \rho^2$ to obtain the required sample size for the multiple logistic regression model.
This method yields an approximate upper bound rather than an exact value for the sample size
needed to detect a specified association. Unlike Whittemore’s formula,3 this method does not
require the user to specify the coefficients of the remaining covariates.
EXAMPLES
Whittemore3 used Hulley’s data4 to calculate the sample size for a follow-up study designed to test
whether the incidence of coronary heart disease (CHD) among white males aged 39-59 is related
to their serum cholesterol level. For this study, the probability of a CHD event during an
18-month follow-up for a man with a mean serum cholesterol level is 0.07. To detect an odds
ratio of 1.5 for an individual with a cholesterol level of one standard deviation above the mean
using a one-tailed test with a significance level of 5 per cent and a power of 80 per cent, we need
614 individuals (from Table II).
To detect the same effect while controlling for the effects of triglyceride, and assuming that the
correlation coefficient of cholesterol level with log triglyceride level is 0.4, we would need
614/(1-0.16)=731 individuals for the study.
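
The adjustment in this example is simple enough to script. The following is a minimal sketch, not part of the original paper: the function name `adjust_for_covariates` is hypothetical, and rounding up to a whole subject is an assumption made here.

```python
import math


def adjust_for_covariates(n_univariate: float, rho: float) -> int:
    """Divide a univariate sample size by 1 - rho^2 and round up.

    n_univariate is the value read from Tables I to V; rho is the multiple
    correlation of the covariate of interest with the remaining covariates.
    """
    if not abs(rho) < 1.0:
        raise ValueError("rho must satisfy |rho| < 1")
    return math.ceil(n_univariate / (1.0 - rho ** 2))


n_table = 614  # Table II entry for P = 0.07, r = 1.5
rho = 0.4      # correlation of cholesterol with log triglyceride
print(adjust_for_covariates(n_table, rho))
```

With the inputs of the CHD example this gives the 731 individuals quoted above; setting `rho` to zero returns the table value unchanged.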
DISCUSSION
These sample size tables do not explicitly require knowledge of the number of covariates in the
regression model. The results in Appendix I indicate that the number of covariates is not
important, and that the inclusion of new covariates which do not increase the multiple correlation
coefficient with the covariate of interest does not affect sample size. An adjustment of the overall
P-value for multiple significance testing may be needed when several covariates are of interest as
potential risk factors for the disease.
Where one is interested in the effect of one specific covariate, Appendix I also shows that the
sample sizes given in the tables may be used whatever the values of the coefficients of the remaining
covariates.
The results of Monte Carlo simulations in Appendix II indicate that, when there is only one
covariate in the model, the given sample sizes are reasonably accurate for both normal and
exponential distributions of the covariate, although the tables can be inaccurate for some
distributions, such as the double exponential. When there are two covariates having a bivariate
normal distribution, the values in the tables overestimate the required sample size, but to an
acceptable degree. The simulations also show that the power of the test varies both with the
number of events and with the number of individuals at risk.
Whittemore3 has found the required sample size to be very sensitive to the distribution of the
covariates. I recommend that when a covariate is not normally distributed, leaving the adequacy

Table I. Sample size required for univariate logistic regression having an overall event proportion P and an odds ratio r at one standard
deviation above the mean of the covariate when $\alpha = 5$ per cent (one-tailed) and $1 - \beta = 70$ per cent

P

0.6

0.7

0.8

0.9

1.1

0.01
0.02
0.03
0.04

1799
925
633
487
400
341
300
268
244
225
195
175
159
147
137
120
108
100
93
89
85

3732
1909
1301
997
815
694
607
542
491
451
390
346
314
289
268
232
207
190
177
167
159

9601
4900
3334
2550
2080
1767
1543
1375
1245
1140
984
872
788
723
670
576
514
469
435
409
388

43222
22040
14980
11450
9332
7920
6911
6154
5566
5095
4389
3885
3507
3213
2977
2554
2271
2069
1918
1801
1706

52828
26938
18308
13993
11404
9678
8445
7520
6801
6225
5362
4746
4284
3924
3636
3119
2773
2527
2342
2198
2083

0.05

0.06
0.07
0.08
0.09
0.10
0.12
0.14
0.16
0.18
0.20
0.25
0.30
0.35
0.40
0.45
0.50

Odds ratio r
1.2
1.3
1.4
14403
7349
4997
3822
3116
2646
2310
2058
1862
1705
1470
1302
1176
1078
1000
859
765
697
647
608
576

6933
3540
2410
1844
1505
1279
1117
996
902
827
714
633
572
525
487
420
374
342
318
299
284

1.5

1.6

1.7

1.8

1.9

4198 2878 2132 1665 1351 1128
2147 1474 1094 856 696 583
1463 1006 748 587 478 402
1121 772 575 452 369 311
916 631 471 371 304 256
779 538 402 317 260 220
681 471 352 278 229 194
608 421 315 249 206 175
551 382 286 227 187 160
505 351 263 209 173 147
437 304 229 182 151 129
388 270 204 163 135 116
351 245 185 148 124 107
323 226 171 137 115 99
300 210 160 128 107 93
259 182 139 112 94 82
232 163 125 101 86 75
212 150 115 93 79 70
198 140 108 88 75 66
186 132 102 83 71 63
177 126 97 80 68 60

2.0

2.5

3.0

964
500
345
268
222
191
169
152
139
129
114
103
94
88
83
73
67
63
60

546
291
205
163
137
120
108
99
92
86
78
72
67
64
61
56
52
50
48
47
45

386
214
157
129
112
100
92
86
81
77
72
68
65
62
60
57

57
55

55

53
52
51
50

Note: To obtain sample sizes for multiple logistic regression, divide the number from the table by a factor of $1 - \rho^2$, where $\rho$ is the multiple
correlation coefficient relating the specific covariate to the remaining covariates.

Table II. Sample size required for univariate logistic regression having an overall event proportion P and an odds ratio r at one standard
deviation above the mean of the covariate when $\alpha = 5$ per cent (one-tailed) and $1 - \beta = 80$ per cent

P

0.6

0.7

0.8

0.9

1.1

0.01
0.02
0.03

2334
1199
821
632
518
443
389
348
317
291
254
227
206
191
178
155
140
129
121
115
110

4872
2492
1699
1302
1064

12580
6421
4368
3342
2726
2315
2022
1802
1631
1494
1289
1142
1032
947
878
755
673
614
570
536
509

56741
28935
19666
15031
12251
10397
9073
8080
7307
6689
5762
5100

69359
35367
24037
18371
14912
12706
11087
9873
8929
8174
7041
6231
5624
5152
4774

0.04
0.05

0.06
0.07
0.08
0.09
0.10
0.12
0.14
0.16
0.18
0.20
0.25
0.30
0.35
0.40
0.45
0.50

905

792
707
641
588
509
452
410
377
350
303
271
248
231
218
207

4604
4218
3909
3352
2982
2717
2518
2364
2240

4095

3641
3318
3075
2886
2735

Odds ratio r
1.2
1.3
1.4
18889
9637
6554
5012
4086
3470
3029
2699
2442
2236
1928
1708
1542
1414
1311
1126
1003
915
848
797
756

9076
4635
3155
2414
1970
1674
1463
1304
1181
1082
934
828
749
687
638
549
490
448
416
391
372

5485
2804
1911
1464
1196
1018
890
794
720
660
571
507
459
422
392
339
303
277
258
243
231

1.5

1.6

1.7

1.8

1.9

2.0

2.5

3.0

900
618
477
392
336
296
266
242
223
195
175
160
148
139
122
111
103
96
92
88

751
517
401
330
284
250
225
206
190
167
150
137
128
120
106
96
90
85
81
78

642

367
260
206
174
152
137
125
116
109
98
91
85
80
77
70
66
63
61
59
57

480
267
196
160
139
125
115
107
101
96
89
84
80
77
75
71
68
66
64
63
62

---2771 2158 1746 1453 1237 690

3751
1921 1422 1110
1311 972 760
1006 747 585
823 612 481
701 522 411
614 458 361
548 410 323
497 372 294
457 342 271
396 297 236
352 265 211
320 241 192
294 222 178
274 207 166
237 180 145
213 162 131
195 149 121
182 140 114
172 132 108
164 126 103

444
344
285
245
217
196
179
166
146
132
121
113
106
94
86
81
76
73
70

Note: To obtain sample sizes for multiple logistic regression, divide the number from the table by a factor of $1 - \rho^2$, where $\rho$ is the multiple
correlation coefficient relating the specific covariate to the remaining covariates.

of the calculated sample size in doubt, one should perform a transformation5 of the covariate to
achieve normality before using Tables I to V.
In conclusion, the methods in this paper provide slightly conservative estimates of the required
sample size for normally distributed covariates. The tables are simple to use and are suitable for a
variety of epidemiologic studies.

Table III. Sample size required for univariate logistic regression having an overall event proportion P and an odds ratio r at one standard
deviation above the mean of the covariate when $\alpha = 5$ per cent (one-tailed) and $1 - \beta = 90$ per cent

P

0.6

0.7

0.8

0.9

1.1

0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.10
0.12
0.14
0.16
0.18
0.20
0.25
0.30
0.35
0.40
0.45
0.50

3192
1640
1123
864
709
605
532
476
433
398
347
310
282
261
243
212
192
177
166
157
150

6706
3430
2338
1792
1465
1246
1090
973
882
810
700
622
564
518
482
417
373
342
318
300
286

17383
8873
6036
4618
3767
3199
2794
2490
2254
2065
1781
1578
1426
1308
1214
1043
930
849
788
741
703

78551
40056
27225
20809
16959
14393
12560
11185
10116
9260
7977
7061
6373
5839
5411
4641
4128
3761
3486
3272
3101

96029
48966
33279
25435
20729
17591
15350
13670
12362
11317
9748
8627
7787
7133
6610
5669
5042
4593
4257
3996
3787

Odds ratio r
1.2
1.3
1.4
26120
13327
9063
6930
5651
4798
4189
3732
3377
3092
2666
2361
2133
1955
1813
1557
1387
1265
1173
1102
1045

12529
6398
4355
3333
2720
2311
2019
1800
1630
1494
1289
1143
1034
949
881
758
676
618
574
540
513

7554
3863
2632
2017
1648
1402
1226
1094
991
909
786
698
632
581
540
466
417
382
355
335
319

1.5
5154
2639
1801
1382
1131
963
843
753
683
628
544
484
439
404
376
326
292
268
250
236
225

1.6

1.7

1.8

1.9

2.0

3797 2948 2377 1972 1674
1948 1516 1225 1020 869
1332 1038 842 702 600
1024 800 650 544 466
839 657 534 448 385
715 561 458 385 332
627 493 403 340 293
561 442 362 306 265
510 402 330 279 242
469 370 304 258 224
407 322 266 226 197
363 288 238 203 178
330 263 218 186 164
305 243 202 173 153
284 227 189 163 144
247 198 166 144 128
222 179 151 131 117
205 166 140 122 109
192 155 131 115 103
181 147 125 110 99
173 141 120 105 95

2.5

3.0

917
488
345
274
231
202
182
167
155
145
131
121
113
107
102
94
88
84
81
78
76

627
349
256
210
182
163
150
140
132
126
117
110
105
101
98
93
89
86
84
83
81

Note: To obtain sample sizes for multiple logistic regression, divide the number from the table by a factor of $1 - \rho^2$, where $\rho$ is the multiple
correlation coefficient relating the specific covariate to the remaining covariates.

Table IV. Sample size required for univariate logistic regression having an overall event proportion P and an odds ratio r at one standard
deviation above the mean of the covariate when $\alpha = 5$ per cent (one-tailed) and $1 - \beta = 95$ per cent

P

0.6

0.7

0.8

0.01

4001
2055
1407
1083
888
759
666
597
543
499
435
388
353
326
305
266
240
222
208
197
188

8439
4316
2942
2255
1843
1568
1372
1225
1110
1019
881
783
710
652
607
524
469
430
401
378
359

21927
11192
7614
5825
4751
4036
3525
3141
2843
2604
2247
1991
1799
1650
1531
1316
1173
1071
994
935
887

0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.10
0.12
0.14
0.16
0.18
0.20
0.25
0.30
0.35
0.40
0.45
0.50

0.9

1.1

99209 121290
50591 61847
34384 42033
26281 32126
21419 26182
18178 22219
15863 19389
14127 17266
12776 15614
11696 14293
10075 12312
8918 10897
8049 9835
7374 9010
6834 8349
5861 7160
5213 6368
4750 5802
4403 5377
4133 5047
3917 4783

Odds ratio r
1.2
1.3
1.4
32967
16820
11438
8747
7132
6056
5287
4710
4262
3903
3365
2980
2692
2468
2288
1965
1750
1596
1481
1391
1319

15795
8066
5490
4202
3429
2914
2546
2270
2055
1883
1626
1442
1304
1196
1110
956
853
779
724
681
647

9511
4863
3314
2539
2074
1764
1543
1377
1248
1145
990
879
796
732
680
587
525
481
448
422
401

1.5

1.6

6478
3318
2264
1737
1421
1210
1060
947
859
789
684
608
552
508
473
410
367
337
315
297
283

4765
2445
1671
1284
1052
898
787
704
640
588
511
456
414
382
356
310
279
257
240
228
217

1.7

1.8

1.9

2.0

2.5

3692 2971 2461 2084 1130
1898 1532 1272 1081 601
1301 1052 876 747 425
1002 812 678 580 337
822 668 559 480 284
703 572 480 413 249
617 504 424 365 224
553 452 381 329 205
503 412 348 301 190
464 380 322 279 179
404 332 282 246 161
361 298 254 222 148
329 272 233 204 139
304 252 216 190 132
284 236 203 179 126
248 207 179 159 115
224 188 163 145 108
207 175 152 136 103
194 164 143 129 99
185 156 137 123 96
177 150 132 119 94

3.0
764
425
312
255
221
199
183
170
161
153
142
134
128
123
120
113
108

105
103
101
99

Note: To obtain sample sizes for multiple logistic regression, divide the number from the table by a factor of $1 - \rho^2$, where $\rho$ is the multiple
correlation coefficient relating the specific covariate to the remaining covariates.

APPENDIX I: SAMPLE SIZE FORMULAE
Let Y denote the disease status, and let Y = 1 if the disease occurs and Y = 0 otherwise. Let $X_1$,
$X_2, \ldots, X_k$ denote the covariates, which are assumed to have a joint multivariate normal
distribution. The logistic regression model specifies that the conditional probability of disease

Table V. Sample size required for univariate logistic regression having an overall event proportion P and an odds ratio r at one standard
deviation above the mean of the covariate when $\alpha = 1$ per cent (one-tailed) and $1 - \beta = 95$ per cent

P

0.6

0.7

0.01

5897
3030
2074
1596
1309
1118
982
879
800
736

12367
6326
4312
3305
2701
2299
2011
1795
1627
1493
1292
1148
1040
956
889
768
688
630
587
553
527

0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.10
0.12
0.14
0.16
0.18
0.20
0.25
0.30
0.35
0.40
0.45
0.50

640

572
521
481
449
392
354
326
306
290
271

0.8

0.9

Odds ratio r
1.2
1.3
1.4

1.1

32029 144672 176857
16349 73774 90182
11122 50141 61290
8508 38325 46844
6940 31235 38177
5895 26508 32398
5148 23132 28211
4588 20600 25175
4153 18631 22768
3804 17055 20842
3282 14692 17953
2908 13004 15889
2628 11738 14341
2410 10753 13137
2236 9966 12174
1923 8548 10441
1713 7602 9285
1564 6927 8460
1452 6421 7840
1365 6027 7359
1295 5712 6914

48120
24552
16695
12767
10411
8839
7717
6875
6221
5697
4911
4350
3929
3602
3340
2869
2554
2330
2162
2031
1926

23090
11792
8026
6143
5013
4260
3722
3318
3004
2753
2316
2107
1906
1749
1623
1397
1247
1139
1058
996
945

13930
7123
4854
3719
3038
2584
2260
2017
1828
1677
1450
1288
1166
1072
996
860
769
704
656
618
587

1.5

1.6

1.7

1.8

9509
4870
3323
2550
2086
1777
1556
1390
1261
1158
1003
893
810
746
694
601
539
495
462
436
416

7011
3597
2459
1890
1549
1321
1158
1037
942
866
752
671
610
562
524
456
411
378
354
335
320

5447
2801
1919
1478
1213
1037
911
816
743
684
596
533
485
449
419
366
331
306
287
272
260

4395
2266
1556
1201
988
846
745
669
610
562
491
441
403
373
349
307
278
258
243
231
222

1.9

2.0

2.5

3.0

3650 3100 1706 1172
1887 1609 908 651
1300 1111 642 478
1006 863 509 391
830 714 429 339
712 614 376 305
628 543 338 280
565 490 310 261
516 448 288 247
477 415 270 235
418 366 243 218
376 330 224 206
345 303 210 196
321 283 199 189
301 266 190 183
266 236 174 173
242 216 163 166
225 202 156 161
213 192 150 157
203 183 146 154
195 177 142 152

Note: To obtain sample sizes for multiple logistic regression, divide the number from the table by a factor of $1 - \rho^2$, where $\rho$ is the multiple
correlation coefficient relating the specific covariate to the remaining covariates.

occurrence $P = P(X) = P(Y = 1 \mid X_1, \ldots, X_k)$ is related to $X_1, \ldots, X_k$ by

$$\log[P/(1 - P)] = \theta_0 + \theta_1 x_1 + \cdots + \theta_k x_k.$$

Assume, without loss of generality, that among the k covariates $X_1$ is the covariate of primary
interest. We wish to test the null hypothesis $H_0: \boldsymbol{\theta} = [0, \theta_2, \ldots, \theta_k]$ against the alternative
hypothesis $H_1: \boldsymbol{\theta} = [\theta^*, \theta_2, \ldots, \theta_k]$.
Let $\rho$ denote the multiple correlation coefficient of $X_1$ with $X_2, \ldots, X_k$. If each of the covariates
$X_i$ has been normalized to have mean zero and variance one, the sample size N needed to test at
level $\alpha$ and power $1 - \beta$ can be approximated, according to Whittemore,3 by

$$N \exp(\theta_0) = [V^{1/2}(\boldsymbol{\theta}_0) Z_\alpha + V^{1/2}(\boldsymbol{\theta}^*) Z_\beta]^2 / \theta^{*2}, \qquad (1)$$

where $V(\boldsymbol{\theta}) = [\exp(\boldsymbol{\theta}'\Sigma\boldsymbol{\theta}/2)(1 - \rho^2)]^{-1}$, $\boldsymbol{\theta}_0 = (0, \theta_2, \ldots, \theta_k)$, $\boldsymbol{\theta}^* = (\theta^*, \theta_2, \ldots, \theta_k)$, $\Sigma$ is the
correlation matrix of $X_1, \ldots, X_k$, and $Z_\alpha$ and $Z_\beta$ are standard normal deviates with probabilities $\alpha$ and
$\beta$, respectively, in the upper tail. When there is only one covariate in the model, (1) reduces to

$$NP = [Z_\alpha + \exp(-\theta^{*2}/4) Z_\beta]^2 / \theta^{*2}. \qquad (2)$$

Note that (2) relates the power of the test directly to the expected number of events NP, implying
that power will be independent of sample size N given a fixed number of events. However, because
of deviations from the approximate formula (1), the above statement is not accurate. Monte Carlo
simulations in Appendix II show that the power of the test is an increasing function of sample size
even when the number of events remains constant.
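
To make the reading of (2) concrete, the sketch below computes the expected number of events NP that the uncorrected formula would call for, and the sample sizes that follow for a few event proportions. It is an illustration only: the function name `events_required` is hypothetical, and it deliberately ignores the dependence on N that the simulations reveal.

```python
import math
from statistics import NormalDist


def events_required(odds_ratio: float, alpha: float, power: float) -> float:
    """Expected number of events NP from formula (2), before any correction."""
    theta = math.log(odds_ratio)                 # log odds ratio per SD
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # one-tailed deviate
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + math.exp(-theta ** 2 / 4.0) * z_beta) ** 2 / theta ** 2


events = events_required(1.5, alpha=0.05, power=0.80)
for p in (0.05, 0.10, 0.20):
    # Under (2) alone, N is just the event count divided by P.
    print(p, math.ceil(events / p))
```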
Whittemore3 suggested that the approximation could be improved by a multiplying factor of
$1 + 2P\delta$, where

$$\delta = [1 + (1 + \theta^{*2}) \exp(5\theta^{*2}/4)] [1 + \exp(-\theta^{*2}/4)]^{-1}. \qquad (3)$$

The required sample size may then be written as a function of P and $\theta^*$ as follows:

$$N_1 = [Z_\alpha + \exp(-\theta^{*2}/4) Z_\beta]^2 (1 + 2P\delta) / (P\theta^{*2}). \qquad (4)$$

In this formula the value of $\theta^*$ represents the log odds ratio of disease corresponding to an increase
in X of one standard deviation from the mean. In practice, as in Tables I to V, one would specify
the value of the odds ratio r instead of the value of $\theta^*$.
Because the standard normal distribution is symmetric about the mean 0, the sample size to
detect a log odds ratio $\theta^*$, or an odds ratio $r = \exp(\theta^*)$, is the same as that needed to detect a log
odds ratio $-\theta^*$, or an odds ratio $1/r = \exp(-\theta^*)$.
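
Formula (4), with the correction term of (3), can be evaluated directly. The sketch below is one possible transcription, not the program used to produce the tables: the function name `sample_size` is hypothetical, and the normal deviates come from Python's `statistics.NormalDist`, so results may differ from printed entries by a subject or two depending on how the tables were rounded.

```python
import math
from statistics import NormalDist


def sample_size(p_event: float, odds_ratio: float, alpha: float, power: float) -> float:
    """Univariate sample size from formula (4), before rounding."""
    theta_sq = math.log(odds_ratio) ** 2         # squared log odds ratio per SD
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # one-tailed deviate
    z_beta = NormalDist().inv_cdf(power)
    shrink = math.exp(-theta_sq / 4.0)
    # Correction term delta from formula (3).
    delta = (1.0 + (1.0 + theta_sq) * math.exp(5.0 * theta_sq / 4.0)) / (1.0 + shrink)
    base = (z_alpha + shrink * z_beta) ** 2
    return base * (1.0 + 2.0 * p_event * delta) / (p_event * theta_sq)


n = sample_size(0.07, 1.5, alpha=0.05, power=0.80)
print(math.ceil(n))
# Symmetry: r and 1/r require the same sample size.
print(math.isclose(n, sample_size(0.07, 1 / 1.5, alpha=0.05, power=0.80)))
```

For the CHD example (P = 0.07, r = 1.5, 5 per cent one-tailed, 80 per cent power) the result rounds up to the 614 read from Table II, and the final line confirms that only the square of the log odds ratio enters the calculation.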
According to Whittemore’s formula,3 the sample size calculation for the multivariate case
requires specification of the coefficients for each covariate and their correlation matrix. This is
impractical for routine use. Whittemore has already pointed out that the inclusion of covariates
which are correlated with $X_1$ but independent of the event (that is, $\theta_2 = \cdots = \theta_k = 0$)
leads to a loss of power when testing the relationship of $X_1$ to the event. Since $V(\boldsymbol{\theta}) = V(-\boldsymbol{\theta}) \geq 0$,
it can be shown from (1) that the sample size required for $\theta_2 = \cdots = \theta_k = 0$ is an approximate
upper bound after applying the correction of (3). Therefore, I suggest using approximate sample
size $N_M$, derived from (1) by substituting $\theta_2 = \cdots = \theta_k = 0$, for multiple logistic regression:

$$N_M = N_1 / (1 - \rho^2). \qquad (5)$$

This formula provides maximum sample sizes and does not require the specification of coefficients
for each covariate, or the full correlation matrix.
APPENDIX II: MONTE CARLO POWER SIMULATIONS
To check the accuracy of the calculated sample size tables, Monte Carlo simulations were carried
out for various proportions of events and for covariates with different distributions. A set of 1000
trials was generated for five event proportions, four odds ratios and three distributions. Let m and
s be the mean and standard deviation, respectively, of the covariates; let U and G be standard
uniform and standard normal variables, respectively. The distributions of covariates were
generated as follows:
1. Normal distribution: $m + sG$.
2. Double exponential distribution: $m + Is \log U$, where $I = -1$ if $U^* < 0.5$ and $I = 1$ otherwise.
$U^*$ is a standard uniform variable independent of U.
3. Exponential distribution: $m - s - s \log U$.

Under the alternative hypothesis that the specified association between covariate and disease
occurrence exists, the above mean and standard deviation are specified by the odds ratios through
the following equations:

Diseased group: $m = \log(\text{odds ratio})$ and $s = \exp(-m^2/4)$,
Non-diseased group: $m = 0$ and $s = 1$.

When two covariates $(X_1, X_2)$ have a joint bivariate normal distribution with a correlation
coefficient $\rho$, the first covariate ($X_1$) was generated by the above procedure 1 and the second
covariate ($X_2$) was obtained from $X_2 = \rho X_1 + G(1 - \rho^2)^{1/2}$.
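
The generators described above translate almost line for line into code. The following sketch uses only Python's standard `random` module; all function names are hypothetical, the seed is arbitrary, and the uniform draw is taken as `1 - random()` (an assumption made here) so that the logarithm never receives zero. It covers covariate generation only, not the fitting of the logistic model.

```python
import math
import random


def group_parameters(odds_ratio: float, diseased: bool) -> tuple:
    """Mean m and standard deviation s for the diseased or non-diseased group."""
    if not diseased:
        return 0.0, 1.0
    m = math.log(odds_ratio)
    return m, math.exp(-m ** 2 / 4.0)


def normal_covariate(rng: random.Random, m: float, s: float) -> float:
    return m + s * rng.gauss(0.0, 1.0)               # procedure 1


def double_exponential_covariate(rng: random.Random, m: float, s: float) -> float:
    u = 1.0 - rng.random()                           # U in (0, 1]
    sign = -1.0 if rng.random() < 0.5 else 1.0       # I, driven by U*
    return m + sign * s * math.log(u)                # procedure 2


def exponential_covariate(rng: random.Random, m: float, s: float) -> float:
    u = 1.0 - rng.random()
    return m - s - s * math.log(u)                   # procedure 3


def second_covariate(rng: random.Random, x1: float, rho: float) -> float:
    """X2 = rho * X1 + G * sqrt(1 - rho^2), with X1 from procedure 1."""
    return rho * x1 + rng.gauss(0.0, 1.0) * math.sqrt(1.0 - rho ** 2)


rng = random.Random(1989)
m, s = group_parameters(1.5, diseased=True)
draws = [exponential_covariate(rng, m, s) for _ in range(20000)]
print(sum(draws) / len(draws))   # sample mean, expected to be near m
```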
The results are shown in Table VI. Formula (4) seems to slightly underestimate the power for a
covariate with a normal distribution and severely overestimate the power for a double exponential
covariate. For an exponential covariate, formula (4) overestimates the power when both odds ratio

Table VI. Estimated power from Monte Carlo simulations (1000 repetitions) for normal, double exponential and
exponential covariates, compared with formula (4)

                                           Odds ratio r (significance level)
Event         No. of events/               1.30    1.30    1.50    1.50    1.70    1.70    2.00    2.00
proportion P  sample size     Source       (0.05)  (0.01)  (0.05)  (0.01)  (0.05)  (0.01)  (0.05)  (0.01)

0.02          20/1000         Formula (4)  0.296   0.110   0.533   0.265   0.741   0.466   0.923   0.744
                              Normal       0.326   0.111   0.532   0.281   0.742   0.475   0.923   0.757
                              Double exp.  0.217   0.078   0.338   0.125   0.522   0.266   0.729   0.434
                              Exponential  0.265   0.112   0.485   0.226   0.718   0.420   0.947   0.755

0.05          30/600          Formula (4)  0.388   0.164   0.680   0.404   0.874   0.661   0.980   0.902
                              Normal       0.418   0.163   0.723   0.463   0.907   0.702   0.989   0.945
                              Double exp.  0.263   0.107   0.439   0.193   0.672   0.384   0.851   0.648
                              Exponential  0.364   0.151   0.675   0.376   0.903   0.667   0.994   0.936

0.10          40/400          Formula (4)  0.443   0.201   0.751   0.487   0.920   0.749   0.990   0.941
                              Normal       0.456   0.206   0.813   0.546   0.949   0.834   0.999   0.979
                              Double exp.  0.294   0.107   0.341   0.291   0.733   0.463   0.905   0.732
                              Exponential  0.431   0.208   0.787   0.516   0.955   0.785   1.000   0.978

0.20          60/300          Formula (4)  0.520   0.260   0.832   0.599   0.959   0.843   0.996   0.972
                              Normal       0.562   0.283   0.894   0.711   0.990   0.917   1.000   0.994
                              Double exp.  0.350   0.139   0.630   0.331   0.823   0.588   0.970   0.468
                              Exponential  0.603   0.298   0.869   0.652   0.977   0.898   1.000   0.996

0.50          100/200         Formula (4)  0.568   0.301   0.866   0.655   0.969   0.871   0.996   0.972
                              Normal       0.599   0.327   0.888   0.697   0.986   0.929   1.000   0.995
                              Double exp.  0.368   0.152   0.674   0.397   0.855   0.643   0.975   0.895
                              Exponential  0.627   0.336   0.891   0.687   0.991   0.921   1.000   1.000

Table VII. Estimated power from Monte Carlo simulations (1000 repetitions) for bivariate normal covariates compared
with formula (4)

                                                        Odds ratio r (significance level)
Event         No. of events/               Correlation  1.50    1.50    1.70    1.70    2.00    2.00
proportion P  sample size     Source       coefficient  (0.05)  (0.01)  (0.05)  (0.01)  (0.05)  (0.01)

0.05          30/600          Formula (4)  0.0          0.680   0.404   0.874   0.661   0.980   0.902
                                           0.3          0.644   0.366   0.845   0.611   0.970   0.867
                                           0.7          0.438   0.193   0.624   0.339   0.827   0.568
                              Normal       0.0          0.723   0.463   0.907   0.702   0.989   0.945
                                           0.3          0.659   0.392   0.897   0.661   0.987   0.923
                                           0.7          0.457   0.217   0.647   0.377   0.870   0.627

0.20          60/300          Formula (4)  0.0          0.832   0.599   0.959   0.843   0.996   0.972
                                           0.3          0.799   0.551   0.943   0.801   0.993   0.955
                                           0.7          0.578   0.304   0.769   0.502   0.917   0.703
                              Normal       0.0          0.894   0.711   0.990   0.917   1.000   0.994
                                           0.3          0.849   0.635   0.975   0.890   0.999   0.992
                                           0.7          0.657   0.378   0.842   0.607   0.976   0.878

0.50          100/200         Formula (4)  0.0          0.866   0.655   0.969   0.871   0.996   0.972
                                           0.3          0.836   0.606   0.955   0.833   0.993   0.956
                                           0.7          0.619   0.341   0.796   0.538   0.918   0.732
                              Normal       0.0          0.888   0.697   0.986   0.929   1.000   0.995
                                           0.3          0.494   0.633   0.978   0.906   1.000   0.995
                                           0.7          0.640   0.351   0.871   0.658   0.979   0.897

Table VIII. Estimated power from Monte Carlo simulations (1000 repetitions) for different sample sizes and number of
events

                        Odds ratio r (significance level)
Number of  Sample       1.30    1.30    1.50    1.50    1.70    1.70    2.00    2.00
events     size         (0.05)  (0.01)  (0.05)  (0.01)  (0.05)  (0.01)  (0.05)  (0.01)

20         40           0.170   0.042   0.343   0.088   0.508   0.187   0.756   0.372
           100          0.272   0.084   0.482   0.200   0.698   0.420   0.887   0.668
           400          0.325   0.131   0.551   0.256   0.736   0.465   0.931   0.754

50         100          0.360   0.140   0.671   0.347   0.847   0.626   0.977   0.881
           250          0.507   0.243   0.831   0.571   0.959   0.826   0.999   0.990
           1000         0.566   0.316   0.869   0.662   0.984   0.927   1.000   1.000

100        200          0.583   0.301   0.883   0.702   0.986   0.932   1.000   0.998
           500          0.749   0.509   0.987   0.915   1.000   0.994   1.000   1.000
           2000         0.840   0.610   1.000   1.000   1.000   1.000   1.000   1.000


and event rate are low and underestimates the power otherwise. In general, if the covariate is
normally distributed, we are assured that the sample size obtained from the tables will be slightly
conservative. Table VII shows that formula (4) underestimates the power for bivariate normal
covariates, but to an acceptable degree. Table VIII shows the results of simulations using normal
covariates relating the number of events and the sample size to the power of the test. They show
that when the number of events remains constant, the power of the test varies with sample size.
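
The power implied by formula (4) for a given sample size can be obtained by solving (4) for $Z_\beta$. The sketch below does this; the rearrangement and the function name `approximate_power` are assumptions made here rather than something stated in the paper, so agreement with the "Formula (4)" rows of Table VI should be expected only approximately.

```python
import math
from statistics import NormalDist


def approximate_power(n: float, p_event: float, odds_ratio: float, alpha: float) -> float:
    """Power implied by formula (4) for sample size n (one-tailed test)."""
    theta_sq = math.log(odds_ratio) ** 2
    shrink = math.exp(-theta_sq / 4.0)
    # Correction term delta from formula (3).
    delta = (1.0 + (1.0 + theta_sq) * math.exp(5.0 * theta_sq / 4.0)) / (1.0 + shrink)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    root = math.sqrt(n * p_event * theta_sq / (1.0 + 2.0 * p_event * delta))
    z_beta = (root - z_alpha) / shrink
    return NormalDist().cdf(z_beta)


# Twenty expected events spread over increasingly large samples.
for n, p in ((40, 0.50), (100, 0.20), (400, 0.05)):
    print(n, p, round(approximate_power(n, p, 1.5, alpha=0.05), 3))
```

Because of the factor $1 + 2P\delta$, the computed power rises with sample size even though the expected number of events is held at 20, which is the same direction of effect as in Table VIII.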
ACKNOWLEDGEMENTS

I thank the reviewers for very helpful comments.
REFERENCES
1. Fleiss, J. Statistical Methods for Rates and Proportions, Wiley, New York, 1981.
2. Dupont, W. D. ‘Power calculations for matched case-control studies’, Biometrics, 44, 1157-1168 (1988).
3. Whittemore, A. ‘Sample size for logistic regression with small response probability’, Journal of the
American Statistical Association, 76, 27-32 (1981).
4. Hulley, S. B., Rosenman, R. H., Bawol, R. D. and Brand, R. J. ‘Epidemiology as a guide to clinical decisions: the
association between triglyceride and coronary heart disease’, New England Journal of Medicine, 302,
1383-1389 (1980).
5. Sokal, R. R. and Rohlf, F. J. Biometry, Freeman, San Francisco, 1969.

