Background: PubMed is a widely used database for retrieving health research. Staff at PubMed manually label each article with its research type, known as the Publication Type (PT), which includes whether or not the article is a randomized controlled trial (RCT). However, there is an estimated lag of 250 days between the publication of an article and the manual application of its tags. As a result, using the PubMed PT tag to identify studies risks missing nearly a year of the most recent research.
Objectives: we sought to develop and evaluate a living database of RCT articles published in PubMed, using a machine learning system to classify new abstracts daily on publication.
Methods: we developed a machine learning system (an ensemble of Support Vector Machines and Convolutional Neural Networks), which ‘learned’ to retrieve abstracts describing RCTs and to ignore other designs. The system was trained by analysing ~280,000 abstracts manually labelled by the Cochrane Crowd project. We calibrated a decision threshold that matched the specificity of PubMed manual tagging (i.e. would be expected to have an identical false positive rate). Initially, we used the trained model to identify RCTs from a bulk download of the PubMed database, published December 2018. Subsequently, we obtained and classified updates daily.
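The threshold calibration described above can be illustrated with a minimal sketch. It assumes that the ensemble outputs one score per abstract (higher meaning more likely to be an RCT) and that a labelled calibration set is available. The function names, the toy scores and the target value are hypothetical, and the procedure shown (choosing the lowest threshold that reaches a target specificity) is one plausible approach rather than the exact method used in this project.

```python
import numpy as np


def specificity(scores, labels, threshold):
    """Fraction of non-RCTs (label 0) scored below the threshold."""
    negatives = scores[labels == 0]
    return float(np.mean(negatives < threshold))


def sensitivity(scores, labels, threshold):
    """Fraction of RCTs (label 1) scored at or above the threshold."""
    positives = scores[labels == 1]
    return float(np.mean(positives >= threshold))


def calibrate_threshold(scores, labels, target_specificity):
    """Return the lowest threshold whose specificity reaches the target.

    An abstract is classified as an RCT when score >= threshold.
    Hypothetical helper: the names and the procedure are assumptions.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # np.unique returns the distinct scores in ascending order, so the
    # first candidate that reaches the target is the lowest one.
    for candidate in np.unique(scores):
        if specificity(scores, labels, candidate) >= target_specificity:
            return float(candidate)
    # No observed score is high enough: classify nothing as an RCT.
    return float("inf")


# Toy calibration set (illustrative values only, not project data).
toy_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
toy_labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])

threshold = calibrate_threshold(toy_scores, toy_labels, target_specificity=0.75)
print(threshold, sensitivity(toy_scores, toy_labels, threshold))
```

Choosing the lowest qualifying threshold keeps sensitivity as high as possible for the required false positive rate, because raising the threshold further can only remove true positives.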
Results: the final machine learning ensemble had 97.4% sensitivity and 96.8% specificity for identifying RCTs. As of March 2019, 29.6 million articles were indexed in PubMed. According to PubMed manually applied PT tags, 479,486 were identified as RCTs. Using the machine learning system, 539,779 articles were identified as RCTs (i.e. 60,293 additional articles).
Both the manual and machine learning approaches show a near-identical accelerating increase in RCT publications year-on-year until 2013 (see Figure 1). From 2013 to date, the manual approach shows publications remaining static, then falling to 13,343 in 2018 (54% of the 2013 number). However, the machine learning system finds that RCT publications continue to increase year-on-year, with 33,552 found in 2018. The vast majority of the additional trials found by the machine learning system were from the past five years.
Conclusions: a machine learning system can be applied to PubMed to produce a living database of RCTs, with new articles available on the same day as publication. Machine learning retrieves substantially more articles than manually applied indexes do; this is likely explained by the delay from publication to manual indexing. Using the PubMed PT tag alone is likely to miss a large proportion of recently published clinical trials.
Patient or healthcare consumer involvement: this project relies on data from Cochrane Crowd, where members of the public can contribute to systematic review production. The ‘crowd’ labelled hundreds of thousands of articles, which the machine learning system used to ‘learn’ how to do the task automatically.