This report contains only the results of the specific experiments and analyses in which the BfR (German Federal Institute for Risk Assessment) was involved. The full report on the results of all institutes is attached.
WP1 - Assess feasibility of Long-read metagenome sequencing on exemplar matrices. Investigate the use of Hi-C metagenomics.
WP1 was led by Sciensano and BfR, with contributions from all members of the consortium.
JRP12-WP1-T1- Assess feasibility/perform long-read metagenomics MinION from ‘defined’ microbial community.
A survey conducted within the consortium identified the methods and experience of the institutes with both short- and long-read sequencing technologies. It highlighted the different matrices examined (from a variety of human, animal, and environmental sources), the DNA extraction methods, and the bioinformatics tools used to detect bacterial species and/or antimicrobial resistance (AMR). The crucial step in any sequencing process is the extraction and isolation of DNA of sufficient quality, quantity, and purity. Without this, long-read metagenomes may yield biased results and fail to accurately associate bacteria or plasmids with AMR genes. WP1 therefore initiated an assessment of DNA extraction methods within the consortium, using pond water and water buffalo feces samples spiked with six bacterial species at various concentrations, along with un-spiked control samples. The two sample matrices, water buffalo feces (a complex matrix) and pond water (a simple matrix), were selected because they are common within the consortium.
BfR created a defined microbial community (DMC) comprising four Gram-negative isolates and two Gram-positive isolates with known complete genome sequences and diverse AMR profiles (see Table 1). Two differently composed communities, one with a fixed concentration of 10^5 CFU (colony-forming units)/ml and one with a range from 10^3 to 10^7 CFU/ml, were used to spike the two matrices. The matrices underwent screening for
Salmonella, ESBL- and/or carbapenemase-producing Enterobacteriaceae, as well as PCR testing for isolates harboring the mcr-1 to mcr-9 genes, to confirm that they were negative and thus suitable for DMC analysis. To preserve DNA integrity, the DMC samples and the matrices themselves were prepared in the commercial nucleic acid preservation solution DNA/RNA Shield (Zymo Research).
Each partner applied its own DNA extraction protocol in its laboratory, mostly based on commercially available kits and using bead beating, enzymatic lysis, or a combination of both. The DNA concentrations obtained ranged from 1 to 60 ng/μl, with a satisfactory 260/280 ratio (1.6-1.8). DNA fragment sizes (3-50 kb) varied among the methods. All project partners used the same long-read sequencing kit (ONT Ligation Sequencing Kit SQK-LSK109 with the Native Barcoding Expansion) to ensure comparable sequencing results within the consortium. Sequencing was performed with Illumina short-read technology (which varied across the consortium) and ONT MinION long-read technology (Flow Cell R9), aiming for ~5 Gbp per sample.
Short-read (Illumina) and long-read (ONT) sequencing were assessed on the six-species DMC spiked into water buffalo feces or pond water at varying concentrations. ONT long-read sequencing identified some of the bacterial species in the DMC, depending on the DNA extraction method. Microbial profiles clustered differently depending on the database used (SILVA/16S or complete genomes): Illumina data clustered by institute with the 16S SILVA database, whereas the FARMED_db (complete genomes) resulted in clustering by sample type. For MinION data analyzed with the FARMED_db, clustering was less clear and generally separated samples into blank (FB) and mixed-concentration (FM) types.
AMR content analysis was less affected by institute or DNA extraction method. Variations in bacterial composition were attributed to sequencing depth and DNA extraction method. Despite these sequencing differences, most institutes identified the spiked bacterial species in the DMC samples.
PacBio long-read sequencing, based on circular consensus sequencing (CCS), promises high-fidelity (HiFi) long reads of >10 kb for metagenome profiling and assembly. In parallel to Illumina and ONT sequencing, BfR used PacBio for DMC samples spiked into water buffalo feces. Although the DMC members and the water buffalo matrix were successfully identified, low read numbers prevented assembly by different programs. The extensive optimization required for PacBio, together with the inability to multiplex samples efficiently and the resulting high cost per sample, led to the decision to focus on the more efficient ONT methodology. The high input DNA concentrations and lengthy library preparation made PacBio unsuitable as a long-read alternative, as its data quality did not compensate for the additional workload and higher costs in metagenomic contexts.
Three promising commercially available DNA extraction kits were selected for further evaluation, with at least two institutes using each kit. Real-life 'simple' (WP1-T2) and 'complex' (WP1-T3) matrices were chosen by the respective institutes.
JRP12-WP1-T3- Assess feasibility/perform long-read metagenomics MinION from ‘complex’ sample matrices.
Through the German National Reference Laboratories for Salmonella and Escherichia coli, two complex sample matrices with bacterial contamination, ground insect (cricket) powder and sesame paste (tahini), were provided to the sequencing unit of the BfR. The matrices were chosen for their relevance to EU Regulation 2015/2283 and to ongoing international outbreaks with Salmonella serovars. DNA was extracted from 200 mg of each sample using the ZymoBiomics HMW DNA Kit. Despite challenges such as low DNA concentrations in cricket powder due to high salt levels and storage conditions, both sample types yielded sufficient DNA for ONT sequencing. Four complex matrix samples were sequenced using ONT technology:
Salmonella-contaminated tahini, uncontaminated tahini, Bacillus-positive cricket powder, and Salmonella-positive cricket powder.
The ONT rapid kit and the ONT Ligation Sequencing Kit (SQK-LSK109) with Native Barcoding Expansion were used for sequencing; the rapid kit has the shorter laboratory workflow. Analysis of sequencing data from the
Bacillus-contaminated insect sample correctly identified the contamination and allowed assembly of contigs, assignment of a Bacillus clade, and identification of AMR genes. However, sequencing of the Salmonella-contaminated tahini and insect samples yielded only a small number of Salmonella reads. The Salmonella-positive tahini sample showed four classified Salmonella reads, which could not be assembled or assigned to a specific serovar. The discrepancy between the Salmonella and Bacillus results may be attributed to different contamination levels, with the Salmonella contamination expected to be very low.
Short-read Illumina sequencing was also applied to these samples. The percentage of classified reads was far higher for long-read than for short-read sequencing. Long-read technology proved superior in identifying bacterial contaminants in complex matrices with challenging properties such as high fat or salt content. ONT sequencing, even with the rapid library preparation workflow, enabled the assembly of bacterial contigs, clade assignment, and
AMR identification. However, the level of bacterial contamination remained a crucial factor. For the matrices analyzed, long-read technology outperformed short-read sequencing in identifying selected bacterial contaminants.
To facilitate cross-institute comparison and determine detection limits, a commercially available mock community (ZymoBIOMICS Gut Microbiome Standard, GMS, cat. no. D6331) was used to give consistent output and allow method comparison between consortium partners. The GMS was inoculated into either PBS (PURE sample) or buffalo feces (SPIKED sample). Uninoculated buffalo feces (BLANK sample) was also analyzed to characterize the natural microflora. DNA was extracted from the three samples (PURE, SPIKED, and BLANK) with the Quick-DNA HMW MagBead Kit from Zymo Research (selected in WP1-T1) and yielded sufficient pure, long-fragment DNA to meet the Nanopore sequencing requirement for the ligation prep (min. 1 μg). The samples were sequenced with Nanopore and Illumina for comparison. All sequencing runs generated substantial genomic data suitable for bioinformatics analysis (see WP2-T1). Overall, the accuracy and sensitivity of the sequencing were significantly influenced by the experimental settings. A comparison by Sciensano, using detection criteria (see WP2-T1), of the KMA output data from APHA, BfR, Sciensano, SSI, and WBVR revealed that the institutes performed similarly in experiments with the GMS analyzed in pure form or spiked into a complex matrix. However, they used different sequencing multiplexing levels (1plex, 3plex, 4plex, 5plex, and 6plex), run times (24 h, 48 h, and 72 h), library preparation methods (rapid or ligation kit), and DNA extraction kits (Quick-DNA HMW MagBead Kit, Zymo Research, or GenFind V3, Beckman). The combinations of experimental settings were categorized from "ideal settings" (i.e., singleplex, 72 h, ligation kit, and Quick-DNA HMW MagBead Kit) to "cost-time efficient settings" (i.e., 6plex, 24 h, rapid kit, and GenFind V3 kit). As the experimental settings shifted from ideal to cost-time efficient, detection accuracy and sensitivity decreased and were no longer comparable to those of Illumina sequencing.
JRP12-WP1-T4- Perform Hi-C metagenomics
This task was undertaken in close collaboration by BfR and Sciensano.
Hi-C metagenomic sequencing, a recently developed technology, uses proximity ligation of DNA molecules, performed prior to short-read sequencing, to determine the genetic context of AMR genes: their genomic location, including on extrachromosomal elements such as plasmids, and their linkage to the bacterial host chromosome. Feasibility was explored using both artificial samples and those from the EFFORT project.
BfR constructed an artificial community (MC) consisting of four species with known antimicrobial resistance gene (ARG) profiles, genomes, and plasmid sequences. Hi-C and shotgun Illumina libraries were prepared and sequenced, and the data were analyzed using ProxiMeta and the in-house developed Hiccap pipeline. Both pipelines successfully linked most AMR genes to the species within the artificial community; in Hiccap, sensitivity and specificity can be adjusted through threshold parameters. While artificial communities lack the complexity of real samples, EFFORT project samples from an in-vivo experiment involving Salmonella enterica serovar Corvallis provided more realistic conditions. Hi-C metagenomics demonstrated high specificity and sensitivity (>90%) in assigning ARGs to species.
Sciensano used an artificial community (MC1) from BfR, comprising six species with known reference genomes, to construct a Hi-C library. Sequencing on Illumina MiSeq and analysis with the ProxiMeta and HiCExplorer tools showed successful linking of plasmids to host strains. Future work will explore the use of ONT-based assemblies as shotgun libraries. The results suggest that future studies may be able to bypass the shotgun dataset, reducing costs.
Challenges include the limited number of pipelines available for Hi-C data analysis. While ProxiMeta is an option, alternative open-source pipelines such as HAM-ART and HiCBin have been identified. Ongoing development of the in-house Hiccap pipeline aims to rely solely on a paired-end, deeply sequenced Hi-C library, which could eliminate the need for a paired-end deep-sequenced shotgun library and reduce sequencing costs. Although Hi-C data analysis is not yet standardized or routine, the Hiccap pipeline demonstrates that Hi-C metagenomic sequencing can identify the genomic identity of reads and assign ARGs to species. Despite the current lack of standardization, the unique capability of Hi-C metagenomic sequencing to link plasmids to hosts in metagenomic samples makes it a promising research tool for addressing the characterization of genetically modified microorganisms (GMM).
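The idea of assigning ARGs to host species from Hi-C contacts, with thresholds that trade sensitivity against specificity, can be illustrated with a minimal sketch. This is not the Hiccap or ProxiMeta algorithm: the function names, the two thresholds (min_links, min_fraction), the gene and species names, and the single-host simplification are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def assign_args(links, min_links=5, min_fraction=0.5):
    """Assign each ARG to the species it shares the most Hi-C read pairs with.

    links: iterable of (arg, species) tuples, one tuple per Hi-C read pair
    that connects an ARG-carrying sequence to a sequence of that species.
    Both thresholds are hypothetical; raising them favours specificity,
    lowering them favours sensitivity.
    """
    per_arg = defaultdict(Counter)
    for arg, species in links:
        per_arg[arg][species] += 1

    calls = {}
    for arg, counts in per_arg.items():
        species, n = counts.most_common(1)[0]
        if n >= min_links and n / sum(counts.values()) >= min_fraction:
            calls[arg] = species
    return calls

def sensitivity(calls, truth):
    """Fraction of known ARG-host pairs that were recovered."""
    hits = sum(1 for arg, species in truth.items() if calls.get(arg) == species)
    return hits / len(truth)

# Illustrative input only: gene and species names are placeholders.
LINKS = (
    [("blaCTX-M-1", "Escherichia coli")] * 8
    + [("blaCTX-M-1", "Klebsiella pneumoniae")] * 2
    + [("tet(A)", "Salmonella enterica")] * 3
)
TRUTH = {"blaCTX-M-1": "Escherichia coli", "tet(A)": "Salmonella enterica"}

if __name__ == "__main__":
    calls = assign_args(LINKS)
    print(calls)
    print(sensitivity(calls, TRUTH))
```

With the default thresholds the weakly supported tet(A) link is dropped; lowering min_links to 3 would recover it at the risk of accepting spurious contacts. A real analysis would also have to allow one plasmid-borne ARG to be linked to several hosts, which this sketch does not.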
WP2 - Bioinformatics tools to analyse the sequencing data and defining the characteristics within the sample.
To share and store sequencing data within the consortium, with the option of keeping it private until publication, APHA and Sciensano investigated suitable solutions, as OHEJP did not provide this facility. Up to 120 GB of MinION data and 192 GB of Illumina data were anticipated for each task comparison within the consortium. Three options were considered (Google Cloud, Google Drive, and ownCloud), each with its own advantages and disadvantages, and the consortium decided to use ownCloud (https://farmedejp12.owncloud.online/). Data storage must be planned not only for the volume of sequencing data and analysis output but also for the databases used in metagenome analysis.
JRP12-WP2-T1- Development/adaptation of a pipeline that can predict species within sample/matrix
Owing to the extensive data generated by metagenome sequencing, a resource-efficient analysis pipeline is crucial, whether the analysis is run in a laboratory or on-site. In the project proposal phase, tools were developed for interrogating metagenome data from short-read sequencing. At the start of the project, DTU introduced the KMA (k-mer alignment) pipeline, which uses an innovative mapping method to map raw reads and fasta files against a database (Clausen et al., 2018, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2336-6). This command-line-based pipeline can be used for bacterial community and AMR assignments from both long-read and short-read sequences. It was considered for the standard workflow in FARMED data analysis and is freely available at https://bitbucket.org/genomicepidemiology/kma.
BfR, SSI, WBVR, and APHA contributed to the following analyses led by Sciensano. The sequencing data from the UNSPIKED, PURE, and SPIKED samples in Task WP1-T3 were analyzed using KMA with the Custom-2 parameters, as detailed in the 12M-Report-Y4. A key consideration was the choice of reference database; several were tested for analyzing the metagenomics data, including gmstd_db, gmstdFP_db, DTU_db, newDTU_db, and FARMED_db. The comparison of gmstd_db, DTU_db, and newDTU_db revealed that the composition of the database (complete genomes vs. contigs/scaffolds) affects the KMA results. A database composed mainly of complete genomes, with contigs/scaffolds added only when complete genomes are unavailable, was recommended for accurate interpretation.
To analyze the UNSPIKED, PURE, and SPIKED samples, a new database called FARMED_db was created, consisting of DTU_db with appended complete genomes of Methanobrevibacter smithii and contig/scaffold lists of Veillonella rogosae and Prevotella corporis. The analysis of the PURE sample using gmstd_db, gmstdFP_db, and FARMED_db helped define detection thresholds for species identification based on the depth, template identity, and template coverage values generated by KMA. Based on the extensive data retrieved from KMA with the Custom-2 parameters, Sciensano proposed six detection levels with different confidence levels to simplify data interpretation. Depending on the detection level, the taxonomic identification can either be relied upon or requires additional analysis for confirmation.
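How thresholds on depth, template identity, and template coverage turn KMA output into detection calls can be sketched as follows. This is a simplified illustration, not the six-level scheme proposed by Sciensano: the two tiers and all cut-off values are hypothetical, and the column names (#Template, Template_Identity, Template_Coverage, Depth) are assumed to match the tab-separated KMA result table and should be checked against the KMA version in use.

```python
import csv
import io

# Hypothetical cut-offs, chosen only for illustration; the report's actual
# detection levels and their values are not reproduced here.
THRESHOLDS = {
    "confident": {"depth": 1.0, "identity": 95.0, "coverage": 90.0},
    "tentative": {"depth": 0.1, "identity": 90.0, "coverage": 50.0},
}

def classify(depth, identity, coverage):
    """Return the first (strictest) tier whose minimum values are all met."""
    for tier, cut in THRESHOLDS.items():
        if (depth >= cut["depth"]
                and identity >= cut["identity"]
                and coverage >= cut["coverage"]):
            return tier
    return "not detected"

def classify_res(text):
    """Classify every template in a tab-separated, KMA-style result table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    calls = {}
    for row in reader:
        calls[row["#Template"].strip()] = classify(
            float(row["Depth"]),
            float(row["Template_Identity"]),
            float(row["Template_Coverage"]),
        )
    return calls

# Made-up rows standing in for a real result file.
EXAMPLE = (
    "#Template\tTemplate_Identity\tTemplate_Coverage\tDepth\n"
    "species_A\t99.2\t98.5\t12.4\n"
    "species_B\t92.0\t61.0\t0.4\n"
    "species_C\t88.0\t12.0\t0.02\n"
)

if __name__ == "__main__":
    for template, call in classify_res(EXAMPLE).items():
        print(template, call)
```

A "tentative" call corresponds to the situation described above in which additional analysis is needed before the taxonomic identification can be relied upon.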