Background: the use of natural language processing (NLP) to support systematic review (SR) conduct is receiving increased attention. The observed accuracy of most approaches is inadequate to permit their use in autonomous reference screening. Yet, incorporating NLP in the review process in some fashion remains an attractive way to increase efficiency of SRs. Research is needed to improve our knowledge about practical uses of NLP in SRs.
Objectives: to examine the use of two NLP classifiers run in parallel, Naïve Bayes (NB) and Support Vector Machine (SVM), to screen records for exclusion from reviews, and to compare their error rates with those of human reviewers.
Methods: we analyzed three systematic reviews conducted by one of the United States Agency for Healthcare Research and Quality Evidence-Based Practice Centers. The reviews were chosen at random from a larger set in which dual screening was employed. The reviews addressed the topics of kidney stones, sickle cell disease and childhood obesity. For each SR, data were available for human screening decisions and modifications. We randomly selected 10% of screened references from each SR to train the NLP classifiers on inclusion/exclusion decisions. The classifiers then independently screened the remaining references. Where the classifiers did not reach consensus, references remained unreviewed. We calculated error rates (false exclusions) for the NLP machine-indicated exclusions and for individual human-indicated exclusions, and compared them to the dual human-screened results (reference standard). We repeated this process three times per review.
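The consensus rule and the error measure described in the Methods can be illustrated with a minimal Python sketch. It is not the study's code: the function names and toy data are hypothetical, and it assumes that a false omission rate means false exclusions divided by all exclusions (the standard definition; the abstract does not state the formula).

```python
from typing import List, Sequence


def consensus_exclusions(nb_exclude: Sequence[bool],
                         svm_exclude: Sequence[bool]) -> List[bool]:
    """Mark a reference as excluded only when both classifiers vote to exclude.

    References on which the classifiers disagree stay unreviewed (False here).
    """
    if len(nb_exclude) != len(svm_exclude):
        raise ValueError("classifier outputs must cover the same references")
    return [nb and svm for nb, svm in zip(nb_exclude, svm_exclude)]


def false_omission_rate(excluded: Sequence[bool],
                        reference_included: Sequence[bool]) -> float:
    """Share of exclusions that the reference standard actually included.

    Assumed definition: false exclusions / all exclusions.
    Returns 0.0 by convention when nothing was excluded.
    """
    if len(excluded) != len(reference_included):
        raise ValueError("inputs must cover the same references")
    n_excluded = sum(1 for e in excluded if e)
    if n_excluded == 0:
        return 0.0
    false_exclusions = sum(
        1 for e, inc in zip(excluded, reference_included) if e and inc
    )
    return false_exclusions / n_excluded


# Toy data (hypothetical): exclusion votes for five references.
nb_votes = [True, True, False, True, False]
svm_votes = [True, False, False, True, True]
# Dual human screening result: True means the reference was included.
reference = [False, False, True, True, False]

consensus = consensus_exclusions(nb_votes, svm_votes)
rate = false_omission_rate(consensus, reference)
print(consensus, rate)
```

In this toy case the classifiers agree to exclude two references, one of which the reference standard included. The same `false_omission_rate` function could be applied to a single human reviewer's exclusions to obtain the comparison reported in the Results.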
Results: the three reviews included 6378, 1245 and 7260 screened references. After removing training sets, the NLP classifiers achieved consensus exclusion for an average (across three iterations) of 28.4%, 41.0% and 16.1% of the screened references. Average false omission rates by NLP classifiers were 0.07%, 0% and 0.04% per review. Average false omission rates for single reviewers were 2.92%, 1.25% and 7.85%, respectively. Further analyses are planned and will be presented for 22 additional SRs comprising 124,584 dual-screened references.
Conclusions: across these three reviews, NB and SVM classifiers running independently achieved consensus exclusions with lower error rates than those of individual human reviewers. However, both the classifiers and the human reviewers produced false exclusions, reinforcing dual screening as best practice. More research is planned and required to test the combined NB/SVM classifiers as a second screener, or their use in autonomous screening to exclude irrelevant references, across topic domains and review types.
Patient or healthcare consumer involvement: none