Is it time to trust the robots? The reliability and usability of machine learning tools for screening in systematic reviews

Session: 

Oral session: Innovative solutions to challenges of evidence production (3)

Date: 

Wednesday 23 October 2019 - 11:00 to 12:30

All authors in correct order:

Gates A1, Pillay J1, Guitard S1, Elliott S1, Dyson M1, Newton A2, Hartling L1
1 Alberta Research Centre for Health Evidence, University of Alberta, Canada
2 Department of Pediatrics, University of Alberta, Canada
Presenting author and contact person

Presenting author:

Allison Gates

Contact person:

Abstract text
Background: machine learning tools can expedite the completion of systematic reviews (SRs) by reducing manual screening workloads, yet their application has been minimal. Evidence of their benefits and enhanced usability may improve their acceptance within the SR community.

Objectives: we tested the performance of three tools when used to: 1) eliminate irrelevant records (Simulation A); and 2) replace one of two independent reviewers (Simulation B). We also evaluated the usability of each tool.

Methods: we selected three SRs completed at our Centre and subjected these to two retrospective screening simulations. Using each tool (Abstrackr, DistillerSR, and RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. To test their performance, we calculated the proportion missed, workload savings, and estimated time savings compared to independent screening by two reviewers. To test usability, reviewers at our Centre undertook a screening exercise in each tool and completed a user experience survey, incorporating the System Usability Scale (SUS).
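The three performance measures named above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the definitions here are assumptions: proportion missed is taken as the share of truly relevant records that the tool predicted irrelevant, workload savings as the share of records that would not need manual screening, and time savings as the records saved multiplied by a hypothetical screening rate (`minutes_per_record`). The `Record` class and all function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Record:
    truly_relevant: bool      # reference standard: included by the original review
    predicted_relevant: bool  # the tool's prediction after training


def proportion_missed(records):
    """Percentage of truly relevant records predicted irrelevant (assumed definition)."""
    relevant = [r for r in records if r.truly_relevant]
    if not relevant:
        return 0.0
    missed = sum(1 for r in relevant if not r.predicted_relevant)
    return 100 * missed / len(relevant)


def workload_savings(records):
    """Percentage of records that need no manual screening (assumed definition)."""
    if not records:
        return 0.0
    saved = sum(1 for r in records if not r.predicted_relevant)
    return 100 * saved / len(records)


def time_savings_hours(records, minutes_per_record=0.5):
    """Hours saved, given a hypothetical per-record screening time."""
    saved = sum(1 for r in records if not r.predicted_relevant)
    return saved * minutes_per_record / 60


# Toy data: 4 relevant records (1 missed by the tool) and 6 irrelevant
# records (5 correctly predicted irrelevant).
records = (
    [Record(True, True)] * 3
    + [Record(True, False)]
    + [Record(False, True)]
    + [Record(False, False)] * 5
)

print(proportion_missed(records))
print(workload_savings(records))
print(time_savings_hours(records))
```

In this sketch the denominator for workload savings is the set of records passed in; a simulation in which the tool replaces one of two reviewers would need a denominator that reflects the full dual-screening workload.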

Results: using Abstrackr, DistillerSR, and RobotAnalyst respectively, the median (range) proportion of records missed was 5 (0 to 28)%, 97 (96 to 100)%, and 70 (23 to 100)% in Simulation A and 1 (0 to 2)%, 2 (0 to 7)%, and 2 (0 to 4)% in Simulation B. The median (range) workload savings was 90 (82 to 93)%, 99 (98 to 99)%, and 85 (85 to 88)% for Simulation A and 40 (32 to 43)%, 49 (48 to 49)%, and 35 (34 to 38)% for Simulation B. The median (range) time savings was 154 (91 to 183), 185 (95 to 201), and 157 (86 to 172) hours for Simulation A and 61 (42 to 82), 92 (46 to 100), and 64 (37 to 71) hours for Simulation B. Based on the median (interquartile range (IQR)) SUS scores (/100), Abstrackr fell in the usable (79 (23)), DistillerSR the marginal (64 (31)), and RobotAnalyst the unacceptable (31 (8)) usability range (n = 8). Participants indicated that usability was contingent on six interdependent properties: user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s).
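For readers unfamiliar with how SUS scores out of 100 are derived, the sketch below applies the standard SUS scoring rule to one participant's ten responses (each on a 1 to 5 scale) and then summarises a set of scores as median and IQR. The function names and the example scores are hypothetical and are not the study's data. The IQR depends on the quantile method used; this sketch uses the default of Python's `statistics.quantiles`, which may differ from the method the authors applied.

```python
from statistics import median, quantiles


def sus_score(responses):
    """Standard SUS scoring for ten responses on a 1-5 scale.

    Odd-numbered items contribute (response - 1), even-numbered items
    contribute (5 - response); the sum is multiplied by 2.5 to give 0-100.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("each response must be between 1 and 5")
        if item_number % 2 == 1:
            total += response - 1
        else:
            total += 5 - response
    return total * 2.5


def summarise(scores):
    """Return (median, IQR) for a list of SUS scores."""
    q1, _, q3 = quantiles(scores, n=4)  # default 'exclusive' method
    return median(scores), q3 - q1


# Hypothetical scores for eight participants (illustration only).
example_scores = [10, 20, 30, 40, 50, 60, 70, 80]
print(sus_score([3] * 10))
print(summarise(example_scores))
```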

Conclusions: our findings support the cautious use of machine learning tools to replace the second reviewer (Simulation B); the workload savings were substantial and few, if any, records were erroneously excluded. Designing tools based on reviewers’ self-identified preferences may improve their usability.

Patient or healthcare consumer involvement: none