Background: automated approaches to improve the efficiency of systematic reviews are greatly needed. When testing any of these approaches, the criterion standard of comparison (gold standard) is usually human review authors. Yet human review authors make errors in inclusion and exclusion of references.
Objectives: to determine citation false inclusion and false exclusion rates in systematic reviews conducted by pairs of independent review authors. These rates can help in designing, testing and implementing automated approaches.
Methods: we identified all systematic reviews conducted between 2010 and 2017 by an evidence-based practice centre in the USA. Eligible reviews had to follow standard systematic review procedures, with dual independent screening of abstracts and full texts, in which inclusion of a citation by one review author was sufficient to advance it to the next level of screening. Disagreements between review authors were reconciled via consensus or arbitration by a third review author. Data were extracted from web-based commercial systematic review software. We defined a false inclusion or false exclusion as a decision made by a single review author that was inconsistent with the final list of included studies that underwent data extraction and analysis.
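As an illustration only, the definition above can be sketched in a few lines of Python. The abstract does not state the denominators used for the two rates, so this sketch assumes that the false inclusion rate is taken over decisions on citations absent from the final list and the false exclusion rate over decisions on citations present in it. The `Decision` record, the `error_rates` function and the toy data are hypothetical and are not the study's actual code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One screening decision by a single review author (hypothetical record)."""
    reviewer: str
    citation_id: str
    included: bool  # True if this review author voted to include the citation


def error_rates(decisions, final_included):
    """Compare each single-reviewer decision with the final included list.

    Assumed denominators (not stated in the abstract):
      false inclusion rate = false inclusions / decisions on citations NOT in the final list
      false exclusion rate = false exclusions / decisions on citations in the final list
    """
    false_incl = true_excl = false_excl = true_incl = 0
    for d in decisions:
        in_final = d.citation_id in final_included
        if d.included and not in_final:
            false_incl += 1
        elif d.included and in_final:
            true_incl += 1
        elif not d.included and in_final:
            false_excl += 1
        else:
            true_excl += 1

    on_excluded = false_incl + true_excl  # decisions about citations left out of the final list
    on_included = false_excl + true_incl  # decisions about citations kept in the final list
    return {
        "false_inclusion_rate": false_incl / on_excluded if on_excluded else None,
        "false_exclusion_rate": false_excl / on_included if on_included else None,
    }


# Toy data: four citations, each screened by two review authors.
final_included = {"c1", "c2"}
decisions = [
    Decision("A", "c1", True), Decision("B", "c1", False),   # B makes a false exclusion
    Decision("A", "c2", True), Decision("B", "c2", True),
    Decision("A", "c3", True), Decision("B", "c3", False),   # A makes a false inclusion
    Decision("A", "c4", False), Decision("B", "c4", False),
]
rates = error_rates(decisions, final_included)
print(rates)
```

Counting per decision rather than per citation mirrors the abstract's framing of 329,332 individual decisions; a per-reviewer or per-review breakdown would need a grouping step on top of this.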
Results: we included a total of 25 systematic reviews with 139,467 citations in the analysis, representing 329,332 inclusion and exclusion decisions from 86 unique review authors. The final systematic reviews included 5.48% of the references identified through bibliographic database search (95% confidence interval (CI) 2.38% to 8.58%). After abstract screening, the false inclusion rate was 12.59% (95% CI 8.72% to 16.45%) and the false exclusion rate was 10.76% (95% CI 7.43% to 14.09%).
Conclusions: this study of 329,332 screening decisions made by a large, experienced, and relatively homogeneous group of systematic review authors suggests that false inclusion and false exclusion rates among human review authors are not trivial. When judging the validity of a future automated study selection algorithm, it is important to keep in mind that the gold standard is not perfect, and that achieving error rates similar to those of human review authors is likely adequate and can save resources and time.
Patient or healthcare consumer involvement: not applicable