Background: readers of systematic reviews (SRs) and overview authors require valid, reliable, and practical means to evaluate the methodological quality and risk of bias of SRs. Evidence of the comparative reliability, usability, and applicability of common tools will inform how each should be used and interpreted.
Objective: to evaluate and compare the inter-rater and inter-centre reliability, usability, and applicability of AMSTAR, AMSTAR 2, and ROBIS.
Methods: using a random sample of 30 SRs of randomized trials, two review authors at each of three collaborating centres (Canada, Germany, Portugal) independently appraised the methodological quality or risk of bias of each SR using AMSTAR, AMSTAR 2, and ROBIS, and then reached consensus. Using Gwet’s AC1 statistic, we estimated inter-rater reliability between pairs of review authors and inter-centre reliability between the consensus decisions of the centres. To estimate usability, we calculated the median (interquartile range (IQR)) time to complete the appraisals and reach consensus. To inform applications of the tools, we tested for associations between methodological quality or risk of bias and the results and conclusions of the SRs.
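For readers unfamiliar with Gwet’s AC1, the following is a minimal sketch of the standard two-rater form of the statistic for categorical ratings. It is an illustration only, not the analysis code used in this study; the function name `gwet_ac1` and the example ratings are hypothetical, and confidence intervals are not computed.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters and nominal categories (illustrative sketch)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same, non-empty set of items")
    n = len(rater_a)
    if categories is None:
        # Assumption: if not supplied, the category set is whatever was observed.
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")

    # Observed agreement: share of items given the same rating by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: based on the average marginal proportion per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = 0.0
    for c in categories:
        pi_c = (counts_a[c] + counts_b[c]) / (2 * n)
        p_e += pi_c * (1 - pi_c)
    p_e /= q - 1

    return (p_a - p_e) / (1 - p_e)


# Hypothetical ratings of four SRs on one yes/no item.
reviewer_1 = ["yes", "yes", "no", "yes"]
reviewer_2 = ["yes", "no", "no", "yes"]
ac1 = gwet_ac1(reviewer_1, reviewer_2, categories=["yes", "no"])
```

The same function can be applied to the consensus decisions of two centres to obtain a pairwise inter-centre estimate, under the assumption that each centre is treated as a single rater.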
Results: the median (IQR) time for review authors to complete the assessments was 15.7 (11.3), 19.7 (12.1), and 28.7 (17.4) minutes for AMSTAR, AMSTAR 2, and ROBIS, respectively. The time to reach consensus was 2.6 (3.2), 4.6 (5.3), and 10.9 (10.8) minutes for AMSTAR, AMSTAR 2, and ROBIS, respectively. Inter-rater reliability varied by centre, but across all centres was substantial (AC1 0.61 to 0.80) to almost perfect (AC1 0.81 to 0.99) for 8/11 (73%) AMSTAR, 8/16 (50%) AMSTAR 2, and 13/24 (54%) ROBIS items. Inter-centre reliability was substantial to almost perfect for 6/11 (55%) AMSTAR, 12/16 (75%) AMSTAR 2, and 10/24 (42%) ROBIS items. Agreement on confidence in the results of the review (AMSTAR 2) ranged from slight (AC1 0.05, 95% confidence interval (CI) −0.17 to 0.27) to perfect (1.00) between review authors and from moderate (AC1 0.58, 95% CI 0.30 to 0.85) to substantial (AC1 0.74, 95% CI 0.30 to 0.85) across centres. Agreement on overall risk of bias in the SR (ROBIS) ranged from moderate (AC1 0.47, 95% CI 0.17 to 0.77) to almost perfect (AC1 0.96, 95% CI 0.89 to 1.00) between review authors and from poor (AC1 −0.21, 95% CI −0.55 to 0.13) to moderate (AC1 0.56, 95% CI 0.30 to 0.83) across centres. There was no clear relationship between centre-specific appraisals and the results or conclusions of the SRs.
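The verbal labels attached to the AC1 values above are consistent with the commonly used Landis and Koch agreement bands. The sketch below shows that mapping; it assumes those bands apply (only the substantial and almost perfect ranges are stated explicitly in this abstract), and the function name `agreement_label` is hypothetical.

```python
def agreement_label(ac1):
    """Map an AC1 value to a Landis and Koch style label (assumed bands)."""
    if ac1 >= 1.0:
        return "perfect"
    if ac1 < 0.0:
        return "poor"
    # Upper bounds of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, label in bands:
        if ac1 <= upper:
            return label
    return "almost perfect"


# Values reported in the Results, paired with the labels used there.
reported = {
    -0.21: "poor",
    0.05: "slight",
    0.47: "moderate",
    0.74: "substantial",
    0.96: "almost perfect",
    1.00: "perfect",
}
labels = {value: agreement_label(value) for value in reported}
```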
Conclusions: compared with AMSTAR 2 and ROBIS, AMSTAR appraisals were the quickest to complete, and review authors reached substantial or better agreement on a greater proportion of its items (most of them). Inter-centre reliability was highest for AMSTAR 2. Low levels of inter-centre reliability, particularly on overall AMSTAR 2 and ROBIS ratings, may limit readers’ ability to interpret the ratings applied by different review groups. Improved documentation may be needed to help review authors interpret and apply each tool consistently.
Patient or consumer involvement: none