Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist - BMC Medical Research Methodology
- de Vet, Henrica CW
- Wed Sep 22 2010
Box A. Internal consistency (n = 195) b
A1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 185 | 82 | 193 | 0.06
Design requirements
A2c | Was the percentage of missing items given? | 183 | 87 | 190 | 0.48
A3c | Was there a description of how missing items were handled? | 180 | 90 | 187 | 0.54
A4 | Was the sample size included in the internal consistency analysis adequate? | 177 | 87 | 185 | 0.06d
A5c | Was the unidimensionality of the scale checked? i.e. was factor analysis or IRT model applied? | 180 | 92 | 187 | 0.69
A6 | Was the sample size included in the unidimensionality analysis adequate? | 166 | 79 | 178 | 0.27
A7 | Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? | 179 | 85 | 187 | 0.31d
A8c | Were there any important flaws in the design or methods of the study? | 174 | 86 | 179 | 0.22d
Statistical methods
A9 | for Classical Test Theory (CTT): Was Cronbach's alpha calculated? | 179 | 93 | 187 | 0.27d,e
A10 | for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? | 151 | 91 | 165 | 0.17d,e
A11 | for IRT: Was a goodness of fit statistic at a global level calculated? e.g. χ2, reliability coefficient of estimated latent trait value (index of (subject or item) separation) | 154 | 93 | 167 | 0.46d,e
Box B. Reliability (n = 141) b
Design requirements
B1c | Was the percentage of missing items given? | 129 | 87 | 140 | 0.39
B2c | Was there a description of how missing items were handled? | 125 | 91 | 137 | 0.43d
B3 | Was the sample size included in the analysis adequate? | 127 | 77 | 139 | 0.35
B4c | Were at least two measurements available? | 129 | 98 | 140 | 0.72d
B5 | Were the administrations independent? | 129 | 73 | 139 | 0.18
B6c | Was the time interval stated? | 125 | 94 | 136 | 0.50d
B7 | Were patients stable in the interim period on the construct to be measured? | 126 | 75 | 138 | 0.24
B8 | Was the time interval appropriate? | 125 | 84 | 137 | 0.45
B9 | Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions | 127 | 83 | 138 | 0.30
B10c | Were there any important flaws in the design or methods of the study? | 117 | 77 | 129 | 0.08
Statistical methods
B11 | for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? | 119 | 86 | 133 | 0.59e
B12 | for dichotomous/nominal/ordinal scores: Was kappa calculated? | 111 | 81 | 127 | 0.32e
B13 | for ordinal scores: Was a weighted kappa calculated? | 111 | 83 | 127 | 0.42e
B14 | for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic | 108 | 81 | 124 | 0.35e
Box D. Content validity (n = 83) b
Design requirements
D1 | Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? | 62 | 79 | 83 | 0.33
D2 | Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) | 62 | 76 | 83 | 0.46
D3 | Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) | 62 | 66 | 83 | 0.21
D4 | Was there an assessment of whether all items together comprehensively reflect the construct to be measured? | 62 | 66 | 83 | 0.15
D5c | Were there any important flaws in the design or methods of the study? | 58 | 76 | 78 | 0.13
Box E. Structural validity (n = 118) b
E1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 99 | 78 | 116 | 0f
Design requirements
E2c | Was the percentage of missing items given? | 95 | 87 | 110 | 0.41
E3c | Was there a description of how missing items were handled? | 93 | 91 | 109 | 0.55
E4 | Was the sample size included in the analysis adequate? | 94 | 87 | 109 | 0.56d
E5c | Were there any important flaws in the design or methods of the study? | 89 | 84 | 103 | 0.27
Statistical methods
E6 | for CTT: Was exploratory or confirmatory factor analysis performed? | 92 | 90 | 106 | 0.51d,e
E7 | for IRT: Were IRT tests for determining the (uni-) dimensionality of the items performed? | 62 | 87 | 80 | 0.39e,f
Box F. Hypotheses testing (n = 170) b
Design requirements
F1c | Was the percentage of missing items given? | 158 | 87 | 168 | 0.41
F2c | Was there a description of how missing items were handled? | 159 | 92 | 169 | 0.60d
F3 | Was the sample size included in the analysis adequate? | 157 | 84 | 167 | 0.12d
F4 | Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? | 158 | 74 | 168 | 0.42
F5 | Was the expected direction of correlations or mean differences included in the hypotheses? | 159 | 75 | 169 | 0.26e
F6 | Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? | 159 | 82 | 168 | 0.29e
F7c | for convergent validity: Was an adequate description provided of the comparator instrument(s)? | 125 | 83 | 136 | 0.30
F8c | for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? | 124 | 81 | 135 | 0.35
F9c | Were there any important flaws in the design or methods of the study? | 131 | 81 | 145 | 0.17
Statistical methods
F10 | Were design and statistical methods adequate for the hypotheses to be tested? | 150 | 78 | 161 | 0.00d,e,f
Box G. Cross-cultural validity (n = 33) b
Design requirements
G1c | Was the percentage of missing items given? | 25 | 88 | 32 | 0.52
G2c | Was there a description of how missing items were handled? | 22 | 82 | 30 | 0.32
G3 | Was the sample size included in the analysis adequate? | 26 | 81 | 33 | 0.23
G4c | Were both the original language in which the HR-PRO instrument was developed, and the language in which the HR-PRO instrument was translated described? | 28 | 89 | 33 | 0.34d
G5c | Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages | 28 | 86 | 33 | 0.46
G6 | Did the translators work independently from each other? | 28 | 89 | 33 | 0.61
G7 | Were items translated forward and backward? | 28 | 100 | 33 | 1.00
G8c | Was there an adequate description of how differences between the original and translated versions were resolved? | 28 | 86 | 33 | 0.50
G9c | Was the translation reviewed by a committee (e.g. original developers)? | 25 | 88 | 31 | 0.56
G10c | Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? | 21 | 90 | 29 | 0.61
G11c | Was the sample used in the pre-test adequately described? | 28 | 79 | 32 | 0f
G12 | Were the samples similar for all characteristics except language and/or cultural background? | 26 | 81 | 31 | 0.41
G13c | Were there any important flaws in the design or methods of the study? | 26 | 85 | 31 | 0.42
Statistical methods
G14 | for CTT: Was confirmatory factor analysis performed? | 27 | 74 | 32 | 0.03e,f
G15 | for IRT: Was differential item function (DIF) between language groups assessed? | 13 | 77 | 23 | 0.28e,f
Box H. Criterion validity (n = 57) b
Design requirements
H1c | Was the percentage of missing items given? | 35 | 91 | 56 | 0.59d
H2c | Was there a description of how missing items were handled? | 35 | 97 | 56 | 0.79d
H3 | Was the sample size included in the analysis adequate? | 35 | 69 | 54 | 0.06
H4 | Can the criterion used or employed be considered as a reasonable 'gold standard'? | 37 | 62 | 57 | 0f
H5c | Were there any important flaws in the design or methods of the study? | 33 | 79 | 54 | 0.10
Statistical methods
H6 | for continuous scores: Were correlations, or the area under the receiver operating curve calculated? | 37 | 78 | 56 | 0.16e
H7 | for dichotomous scores: Were sensitivity and specificity determined? | 29 | 83 | 47 | 0.28e,f
Box I. Responsiveness (n = 79) b
Design requirements
I1c | Was the percentage of missing items given? | 71 | 82 | 76 | 0.14d
I2c | Was there a description of how missing items were handled? | 73 | 92 | 77 | 0.36d
I3 | Was the sample size included in the analysis adequate? | 72 | 72 | 76 | 0.40
I4c | Was a longitudinal design with at least two measurements used? | 73 | 100 | 78 | 1.00d
I5c | Was the time interval stated? | 73 | 89 | 78 | 0.25d
I6c | If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? | 72 | 78 | 75 | 0.17
I7c | Was a proportion of the patients changed (i.e. improvement or deterioration)? | 70 | 97 | 73 | 0.32d
Design requirements for hypotheses testing
For constructs for which a gold standard was not available:
I8 | Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? | 65 | 69 | 72 | 0.35
I9 | Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 60 | 78 | 65 | 0.19e
I10 | Was the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 61 | 90 | 66 | 0.05d,e
I11c | Was an adequate description provided of the comparator instrument(s)? | 56 | 70 | 63 | 0f
I12c | Were the measurement properties of the comparator instrument(s) adequately described? | 56 | 80 | 63 | 0.06
I13c | Were there any important flaws in the design or methods of the study? | 63 | 71 | 68 | 0.03
Statistical methods
I14 | Were design and statistical methods adequate for the hypotheses to be tested? | 63 | 73 | 67 | 0.21e,f
Design requirements for comparison to a gold standard
For constructs for which a gold standard was available:
I15 | Can the criterion for change be considered as a reasonable 'gold standard'? | 21 | 67 | 28 | 0f
I16c | Were there any important flaws in the design or methods of the study? | 12 | 67 | 21 | 0f
Statistical methods
I17 | for continuous scores: Were correlations between change scores, or the area under the Receiver Operator Curve (ROC) curve calculated? | 28 | 79 | 39 | 0.47e,f
I18 | for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? | 28 | 79 | 37 | 0.15e
Box J. Interpretability (n = 42) b
J1c | Was the percentage of missing items given? | 22 | 95 | 41 | 0.80
J2c | Was there a description of how missing items were handled? | 21 | 76 | 41 | 0.19
J3 | Was the sample size included in the analysis adequate? | 23 | 74 | 41 | 0f
J4c | Was the distribution of the (total) scores in the study sample described? | 23 | 74 | 41 | 0.08
J5c | Was the percentage of the respondents who had the lowest possible (total) score described? | 20 | 95 | 40 | 0.84
J6c | Was the percentage of the respondents who had the highest possible (total) score described? | 21 | 90 | 41 | 0.70
J7c | Were scores and change scores (i.e. means and SD) presented for relevant (sub) groups? e.g. for normative groups, subgroups of patients, or the general population | 21 | 76 | 41 | 0.05
J8c | Was the minimal important change (MIC) or the minimal important difference (MID) determined? | 19 | 89 | 40 | 0.26d
J9c | Were there any important flaws in the design or methods of the study? | 21 | 71 | 41 | 0f
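Several items above pair what appears to be a high percentage agreement with a kappa close to zero (item G11, for instance, shows 79 alongside 0). One way such a pattern can arise is that kappa corrects for chance agreement, and chance agreement is large when most ratings fall in a single category. The sketch below illustrates this with a two-rater Cohen's kappa on invented dichotomous ratings. This is an assumption made for illustration only: the table does not state here how its kappa values were computed, and the function names and data are hypothetical.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Percentage of items on which two raters gave the same rating."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    return 100.0 * sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    # Agreement expected by chance from each rater's marginal frequencies
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when no variation exists
    return (p_o - p_e) / (1.0 - p_e)


# Invented ratings (1 = yes, 0 = no) for 20 articles, two hypothetical raters.
# Balanced case: "yes" and "no" are equally common.
balanced_a = [1] * 8 + [0] * 8 + [1, 1] + [0, 0]
balanced_b = [1] * 8 + [0] * 8 + [0, 0] + [1, 1]

# Skewed case: almost every rating is "yes" (low dispersion).
skewed_a = [1] * 16 + [1, 1] + [0, 0]
skewed_b = [1] * 16 + [0, 0] + [1, 1]

for label, a, b in [("balanced", balanced_a, balanced_b),
                    ("skewed", skewed_a, skewed_b)]:
    print(f"{label}: agreement = {percent_agreement(a, b):.0f}%, "
          f"kappa = {cohen_kappa(a, b):.2f}")
# balanced: agreement = 80%, kappa = 0.60
# skewed: agreement = 80%, kappa = -0.11
```

Both invented data sets have the same 80% agreement, yet kappa drops from 0.60 to below zero once nearly all ratings land in one category, which is why the two measures are best read side by side.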