A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases

Bibliographic Details
Published in:	mSystems 2023-04, Vol.8 (2), p.e0128422-e0128422
Main Authors:	Feng, Jingzhang; Daeschel, Devin; Dooley, Damion; Griffiths, Emma; Allard, Marc; Timme, Ruth; Chen, Yi; Snyder, Abigail B
Format:	Article
Language:	eng
Subjects:	Applied and Industrial Microbiology; Artificial Intelligence; Automation; Classification schemes; Cognition & reasoning; Communicable Diseases; Databases, Nucleic Acid; Environmental monitoring; Epidemics; Epidemiology; Foodborne diseases; foodborne pathogen; Genomes; genomic surveillance; Genomics; Humans; Infectious diseases; informatics; Information processing; Metadata; Nucleotide sequence; Ontology; Pathogens; Public health; Research Article; Site location

Description
Summary:	Large, open-source DNA sequence databases have been generated, in part, through the collection of microbial pathogens by swabbing surfaces in built environments. Analyzing these data in aggregate through public health surveillance requires digitization of the complex, domain-specific metadata that are associated with the swab site locations. However, the swab site location information is currently collected in a single, free-text "isolation source" field, promoting the generation of poorly detailed descriptions with varying word order, granularity, and linguistic errors, which makes automation difficult and reduces machine-actionability. We assessed 1,498 free-text swab site descriptions that were generated during routine foodborne pathogen surveillance. The lexicon of free-text metadata was evaluated to determine the informational facets and the quantity of unique terms used by data collectors. Open Biological Ontologies (OBO) Foundry libraries were used to develop hierarchical vocabularies that are connected with logical relationships to describe swab site locations. Five informational facets, described by 338 unique terms, were identified via content analysis. Term hierarchy facets were developed, as were statements (called axioms) about how the entities within these five domains are related. The schema developed through this study has been integrated into a publicly available pathogen metadata standard, facilitating ongoing surveillance and investigations. The One Health Enteric Package became available at NCBI BioSample in 2022. The collective use of metadata standards increases the interoperability of DNA sequence databases and enables large-scale approaches to data sharing and artificial intelligence as well as big-data solutions to food safety. The regular analysis of whole-genome sequence data in collections such as NCBI's Pathogen Detection Database is used by many public health organizations to detect outbreaks of infectious disease.
However, isolate metadata in these databases are often incomplete and of poor quality. These complex, raw metadata must often be reorganized and manually formatted for use in aggregate analyses. These processes are inefficient and time-consuming, increasing the interpretative labor needed by public health groups to extract actionable information. The future use of open genomic epidemiology networks will be supported through the development of an internationally applicable vocabulary system with which swab site
ISSN:	2379-5077