Background: recently, a 'Risk of bias' (RoB) tool was developed for non-randomized studies (NRS) of interventions (ROBINS-I), which was later modified and adapted for NRS of environmental/nutritional exposures (ROBINS-E). However, the inter-rater reliability (IRR) and inter-consensus reliability (ICR) of these tools have yet to be independently verified.
Objectives: the objectives of our study were to establish the IRR, ICR, and evaluator burden of ROBINS-I and ROBINS-E.
Methods: an international team of evaluators from six participating centres appraised the RoB of a sample of NRS of either interventions or exposures, using ROBINS-I (n = 44) or ROBINS-E (n = 44), respectively. Evaluators were paired into teams that reviewed the same sample of study publications, allowing evaluation of ICR. After completing their individual adjudications, each pair of evaluators resolved conflicts through consensus. Evaluators also tracked the time taken to complete each step. For analysis of the IRR and ICR, we used Gwet's AC1 statistic. We categorized agreement among evaluators as follows: poor (< 0), slight (0.00 to 0.20), fair (0.21 to 0.40), moderate (0.41 to 0.60), substantial (0.61 to 0.80), near perfect (0.81 to 0.99), or perfect (1.00). To assess evaluator burden, we analyzed the average time taken for individual adjudications and for the consensus process. We used Microsoft Excel, Review Manager 5.3, and SAS 9.4 for data management and analysis.
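For readers unfamiliar with Gwet's AC1, the following is a minimal illustrative sketch (not the SAS code used in this study) of how the statistic can be computed for two raters, together with the agreement categories listed above. The function names and example labels are hypothetical; the formula follows Gwet's standard two-rater definition, where chance agreement is estimated from the average per-category classification probabilities.

```python
from collections import Counter

def gwet_ac1(rater1, rater2):
    """Gwet's AC1 chance-corrected agreement for two raters.

    rater1, rater2: equal-length sequences of category labels
    (e.g. RoB judgements such as "low", "moderate", "serious").
    """
    assert len(rater1) == len(rater2) and rater1
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    q = len(categories)
    # Observed agreement: proportion of items both raters judged identically.
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Average classification probability per category across both raters.
    counts = Counter(rater1) + Counter(rater2)
    pi = {k: counts[k] / (2 * n) for k in categories}
    # Chance-agreement probability under Gwet's model.
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (pa - pe) / (1 - pe)

def categorize(ac1):
    """Map an AC1 value to the agreement categories used in this study."""
    if ac1 < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial"),
                         (0.99, "near perfect")]:
        if ac1 <= upper:
            return label
    return "perfect"

# Hypothetical example: two raters' RoB judgements on four studies.
r1 = ["low", "low", "serious", "moderate"]
r2 = ["low", "moderate", "serious", "moderate"]
score = gwet_ac1(r1, r2)
print(round(score, 3), categorize(score))
```

Unlike Cohen's kappa, AC1 remains stable when ratings are concentrated in one category, which is why it is often preferred for RoB adjudications where most judgements may fall in the same band.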
Results: for both ROBINS-I and ROBINS-E, the IRR (Table 1) indicated slight agreement for evaluating 'bias due to confounding'. For ROBINS-I, agreement for the remaining domains ranged from fair to substantial. For ROBINS-E, agreement for the remaining domains ranged from poor to moderate, except for the domain 'bias in measurement of outcomes', for which there was near-perfect agreement. The overall bias assessments showed poor agreement for ROBINS-I and slight agreement for ROBINS-E.
For ICR, agreement ranged from slight to substantial for ROBINS-I (Table 2) and from poor to perfect for ROBINS-E (Table 2). The overall bias assessments showed slight and poor agreement between raters for ROBINS-I and ROBINS-E, respectively.
With regard to evaluator burden, the average total time taken (reading the study report + adjudication + consensus) was 42.7 ± 7.7 minutes for ROBINS-I and 48 ± 8.3 minutes for ROBINS-E. As an extension of this project, we are currently investigating whether training and additional supportive material would improve the IRR and ICR for both tools.
Conclusions: overall, ROBINS-I had better IRR and ICR than ROBINS-E. Assessment times for the two tools were similar. Measures to increase agreement between raters are required (e.g. detailed training, supportive material).
Patient or healthcare consumer involvement: healthcare consumers were not involved in this methods project.