Inter-rater reliability, inter-consensus reliability and evaluator burden of ROBINS-E and ROBINS-I: a cross-sectional study

Oral session: Investigating bias (2)


Tuesday 22 October 2019 - 14:00 to 15:30


All authors in correct order:

Jeyaraman MM1, Rabbani R1, Robson R2, Copstein L1, Al-Yousif N1, Xia J3, Pollock M4, Hofer K5, Balijepalli C1, Mansour S6, Bond K4, Fazeli M5, Ansari M7, Tricco A8, Abou-Setta AM1
1 George and Fay Yee Center for Healthcare Innovation, University of Manitoba, Canada
2 Li Ka Shing Knowledge Institute, St. Michael's Hospital, Canada
3 Nottingham Ningbo GRADE Centre, UK
4 Institute of Health Economics, Canada
5 Evidinno, Canada
6 University of Montreal, Canada
7 University of Ottawa, Canada
8 Dalla Lana School of Public Health, University of Toronto, Canada
Presenting author and contact person

Presenting author:

Maya Jeyaraman

Abstract text
Background: recently, a 'Risk of bias' (RoB) tool was developed for non-randomized studies (NRS) of interventions (ROBINS-I), and later modified and adapted for NRS of environmental/nutritional exposures (ROBINS-E). However, the inter-rater reliability (IRR) and inter-consensus reliability (ICR) of these tools have yet to be independently verified.

Objectives: to establish the IRR, ICR, and evaluator burden of ROBINS-I and ROBINS-E.

Methods: an international team of evaluators from six participating centres appraised the RoB of a sample of NRS of either interventions or exposures, using ROBINS-I (n = 44) or ROBINS-E (n = 44), respectively. Evaluators were paired into teams that reviewed the same sample of study publications, allowing the evaluation of ICR. After completing their individual adjudications, each pair of evaluators resolved conflicts through consensus. They also tracked the time taken to complete each step. For the analysis of IRR and ICR, we used Gwet's AC1 statistic. We categorized agreement among evaluators as follows: poor (< 0), slight (0.00 to 0.20), fair (0.21 to 0.40), moderate (0.41 to 0.60), substantial (0.61 to 0.80), almost perfect (0.81 to 0.99), or perfect (1.00). To assess evaluator burden, we analyzed the average time taken for individual adjudications and for the consensus process. We used Microsoft Excel, Review Manager 5.3, and SAS 9.4 for data management and analysis.
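For readers unfamiliar with Gwet's AC1, the statistic compares observed agreement between two raters with a chance-agreement term derived from category prevalence, which makes it more stable than Cohen's kappa when category distributions are skewed. A minimal sketch for the two-rater case follows; the function name and the example ratings are illustrative only and are not the study's adjudication data.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's AC1 agreement coefficient for two raters over nominal categories."""
    n = len(ratings_a)
    # Observed agreement: proportion of items both raters judged identically.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Category prevalence, averaged over both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {c: counts[c] / (2 * n) for c in categories}
    # Chance agreement under Gwet's formulation: mean of pi*(1-pi) over q-1 categories.
    q = len(categories)
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (pa - pe) / (1 - pe)

# Hypothetical domain judgements from two raters on four studies:
a = ["low", "low", "high", "high"]
b = ["low", "high", "high", "high"]
print(round(gwet_ac1(a, b, ["low", "high"]), 3))  # moderate agreement on this toy data
```

In practice, the study's analyses were run in SAS 9.4; this sketch only illustrates the quantity being estimated.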

Results: for both ROBINS-I and ROBINS-E, the IRR (Table 1) indicated slight agreement for evaluating 'bias due to confounding'. For ROBINS-I, agreement for the remaining domains ranged from fair to substantial. For ROBINS-E, agreement for the remaining domains ranged from poor to moderate, except for the domain 'bias in measurement of outcomes', for which there was almost perfect agreement. The overall bias assessments showed poor agreement for ROBINS-I and slight agreement for ROBINS-E.

For ICR, agreement ranged from slight to substantial for ROBINS-I (Table 2) and from poor to perfect for ROBINS-E (Table 2). The overall bias assessments showed slight agreement between raters for ROBINS-I and poor agreement for ROBINS-E.

With regard to evaluator burden, the average time taken (reading the study report + adjudication + consensus) was 42.7 ± 7.7 minutes for ROBINS-I and 48.0 ± 8.3 minutes for ROBINS-E. As an extension of this project, we are currently investigating whether training and additional supportive material would improve the IRR and ICR of both tools.

Conclusions: overall, ROBINS-I had better IRR and ICR than ROBINS-E. Assessment times for the two tools were similar. Measures to increase agreement between raters (e.g. detailed training, supportive material) are required.

Patient or healthcare consumer involvement: healthcare consumers were not involved in this methods project.