open access publication

Article, 2023

S1000: a better taxonomic name corpus for biomedical information extraction

BIOINFORMATICS, ISSN 1367-4803, Volume 39, Issue 6, DOI 10.1093/bioinformatics/btad369

Contributors

Luoma, Jouni [1]; Nastou, Katerina [2]; Ohta, Tomoko (ORCID 0000-0002-9418-8352) [3]; Toivonen, Harttu [1]; Pafilis, Evangelos (ORCID 0000-0001-5079-0125) [4]; Jensen, L. J. (ORCID 0000-0001-7885-715X) (Corresponding author) [2]; Pyysalo, Sampo (Corresponding author) [1]

Affiliations

  1. [1] Univ Turku, Dept Comp, TurkuNLP Grp, Turku 20014, Finland
     [NORA names: Finland; Europe, EU; Nordic; OECD]
  2. [2] Univ Copenhagen, Novo Nord Fdn Ctr Prot Res, Blegdamsvej 3, DK-2200 Copenhagen, Denmark
     [NORA names: KU University of Copenhagen; University; Denmark; Europe, EU; Nordic; OECD]
  3. [3] Textimi, Tokyo, Japan
     [NORA names: Japan; Asia, East; OECD]
  4. [4] Inst Marine Biol Biotechnol & Aquaculture, Hellen Ctr Marine Res, Iraklion 71003, Greece
     [NORA names: Greece; Europe, EU; OECD]

Abstract

Motivation: Recognizing mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora.

Results: We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods.

Availability and implementation: All resources introduced in this study are available under open licenses from the study's webpage, which contains links to a Zenodo project and three GitHub repositories associated with the study.

Data Provider: Clarivate