Comparison of biomedical relationship extraction methods and models for knowledge graph creation
Biomedical research is growing at such a pace that scientists,
researchers, and practitioners are no longer able to keep up with the volume of
literature published in the domain. The knowledge presented in the literature
needs to be systematized in such a way that claims and hypotheses can be easily
found, accessed, and validated. Knowledge graphs can provide such a framework
for the semantic representation of knowledge from the literature. However, in order to
build a knowledge graph, it is necessary to extract knowledge as relationships
between biomedical entities and normalize both entities and relationship types.
In this paper, we present and compare several rule-based and machine learning-based
methods (Naive Bayes and Random Forests as examples of traditional machine learning
methods, and DistilBERT-, PubMedBERT-, T5-, and SciFive-based models as examples of
modern deep learning transformers) for scalable relationship extraction
from biomedical literature and for the integration of the extracted relations into knowledge graphs.
We examine how resilient these various methods are to unbalanced and fairly
small datasets. Our experiments show that transformer-based models handle both small
datasets (owing to pre-training on a large dataset) and unbalanced datasets well.
The best-performing model was the PubMedBERT-based model fine-tuned on balanced
data, with a reported F1-score of 0.92. The DistilBERT-based model followed with an
F1-score of 0.89, while running faster and requiring fewer computational resources.
BERT-based models performed better than T5-based generative models.
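
For illustration, the following is a minimal sketch of the transformer-based approach described above, framed as sentence-level relation classification with a PubMedBERT checkpoint fine-tuned through the Hugging Face Transformers Trainer. The checkpoint name, file names, relation label set, and hyperparameters are assumptions chosen for the example and do not reflect the paper's actual experimental setup.

    # Minimal sketch (assumptions): sentence-level relation classification with a
    # PubMedBERT checkpoint, fine-tuned via the Hugging Face Trainer API.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Hypothetical relation types between biomedical entities.
    LABELS = ["no_relation", "treats", "causes", "interacts_with"]

    # Assumed checkpoint name; any biomedical BERT variant can be substituted.
    CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=len(LABELS))

    # Hypothetical CSV files with columns "sentence" (sentence with the entity
    # pair marked) and "label" (integer index into LABELS).
    data = load_dataset("csv",
                        data_files={"train": "train.csv", "test": "test.csv"})

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True, max_length=256)

    data = data.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="pubmedbert-re",
                             per_device_train_batch_size=16,
                             num_train_epochs=3)

    trainer = Trainer(model=model, args=args,
                      train_dataset=data["train"],
                      eval_dataset=data["test"],
                      tokenizer=tokenizer)  # passing the tokenizer enables dynamic padding
    trainer.train()
    print(trainer.evaluate())

The same classification framing applies to the DistilBERT-based model by swapping the checkpoint, whereas the T5- and SciFive-based models are generative and would instead produce the relation label as output text.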