Faction II Comparative Genomics Group

From Compgenomics2017

Jump to: navigation, search



Salmonella enterica is a gram-negative, rod shaped bacteria (1). It is a facultative intracellular pathogen and can be divided into 6 subspecies ( enterica, salamae, arizonae, diarizonae, houtenae, and indica) (1). Salmonella enterica subsp. enterica, contains 2610 different serovars which are characterized by two surface antigens: the flagellar H antigen and the ologosaccharide O antigen. Here, we focus on Salmonella enterica subsp. enterica serovar Heidelberg which is the 7th most common serovar isolated from humans in the U.S and has been linked to multiple outbreaks linked to poultry (1).

The genome size of S. enterica serovar heidelberg ranges from 4.73 to 4.98 Mb and the GC% content is 52.1%. S. enterica serovar Heidelber's pathogenicity comes from many components found on mobile elements of the genome (2). One group of components is found on Salmonella Pathogenicity Islands (SPIs) (3). SPIs are segments of the genome that encode for genes that are virulence factors and are required for the infection of the host (3). They typically display a higher AT content that the rest of the genome due to the fact that they are possible elements of horizontal gene transfer. SPIs can play a role in virulence, host specification, and bacterial invasiveness from the host immune system (3).

S. enterica serovar Heidelberg also typically carry a large plasmid that contains virulence genes.These plasmids encode for virulence genes that are required for intracellular survival and possible anti-microbial resistance (2).


Here, we preformed a comparative genomics analysis on 26 “sporadic” S. enterica serovar Heidelberg isolates. These are clinically and environmentally derived isolates found within the U.S from 2014-2016. They were considered sporadic due to the fact that these isolates had no known outbreak associations. Therefore our objectives included:

1. To devise an approach to distinguish the isolate according to the line list meta data.

2. To analyze gene content differentially present/absent in strains between the outbreak and sporadic factions.

3. If differences are found, report possible functional implications.

We approached these objectives with various phylogenetic analysis to determine any distinguishable patterns within our sample set when linked back to the line list meta-data.


To first analyze phylogenetic relatedness, multiple techniques were utilized. To differentiate the isolates, we preformed a MASH analysis, an ANI analysis, and a WG-SNP analysis.


MASH extends the MinHash, an optimized set comparison algorithm, dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling accurate all-pairs comparisons between many large genomes and metagenomes. MASH provides mainly two functions for sequence comparison. One is sketch, the other is dist. When we run MASH to do genome comparison, first we need to run sketch function, which converts a sequence or collection of sequences into s MinHash sketch. Then, we should call dist function to compare two sketches and returns an estimate of the Jaccard index (i.e. the fraction of shared k-mers) and the MASH distance, which estimates the rate of sequence mutation under a simple evolutionary model. Moreover, the MASH distance is highly correlated with 1 - ANI.

MASH cluster of all 50 isolates and 8 reference genomes downloaded from NCBI. Each row of heatmap is labeled by name of isolates or reference genome. The labels of heatmap are colored by faction, red represents isolates are from faction 1 which are outbreak; green represents isolates are from faction 2 which are sporadic and blue represents reference genome which are downloaded from NCBI..


Average nucleotide identity determines genetic relatedness of prokaryotic organisms. Programs like OrthoANI that calculate average nucleotide identity utilize BLAST to perform their calculations. The whole genome of the first organism is split into short segments that are BLASTed against the genome of the second organism, and the scores for each segment are averaged to determine the average nucleotide identity.

SNP Analysis

Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide in the genome sequence is altered. To differentiate and classify outbreak isolates that have little diversity and require extensive genomic analysis, SNP analysis tools are used to construct a phylogenetic tree of genomes based on SNPs. Parsnp is one of SNP analysis tools, and it is designed to be quick in aligning the core genome of bacterial genomes. Input can be either draft assemblies or completed genomes, and outputs SNP calls, core genome phylogeny and multi-alignments. Fig1.

Snp fin.png

  Fig1: WG-SNP phylogenetic tree of all 50 isolates. Isolates are colored by PFGE-pattern.

Comparative Genomics


In order to further differentiate between the line list metadata, we sought to use computational methods to identify putative prophage sequences in each of the 26 isolates. Prophages are part of the bacterial mobilome, which is comprised of all mobile genetic elements in a genome. In prokaryotes, prophages and plasmids are an important part of the mobilome and can be transferred between individual bacteria via methods of horizontal gene transfer (HGT). Mechanisms of HGT include transformation and transduction of genetic material, or bacterial conjugation.

A well-documented software that can be used to identify prophage sequences in prokaryotic genomes is PHAST, a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. PHAST is an integrated search and annotation tool that combines genome-scale ORF prediction and translation (via GLIMMER), protein identification (via BLAST matching and annotation by homology), phage sequence identification (via BLAST matching to a phage-specific sequence database), tRNA identification, attachment site recognition and gene clustering density measurements using density-based spatial clustering of applications with noise (DBSCAN) and sequence annotation text mining. We implemented a slightly newer version of the software called PHASTER, which provides enhanced visuals in correlation with the provided sequence.

After running the program with the 26 sporadic isolates, we identified prophage IDs corresponding to putative prophage sequences and combined them to generate a binary heatmap shown in Fig 2.

Phage final.png

  Fig 2: Hierarchical clustering of prophage sequence content for all sporadic isolates. Isolates are colored by PFGE-pattern. Blue= JF6X01.0022, Yellow= JF6X01.0055,   
  Orange= JF6X01. 0167, Green= JF6X01. 0173.

The isolates are clustered into seven grouping based on prophage sequence content. This information can be used in conjunction with the line-list metadata, particularly PFGE pattern, as an additional method for categorizing the isolates.


To further discriminate the sporadic isolates, a SNP-based phylogenetic analysis was done on the plasmids associated with each isolate. The plasmid contig was manually removed from each isolate and SNPs were then identified against the reference plasmid sequence Salmonella enterica subsp. enterica serovar Heidelberg strain AMR588-04-00435 plasmid pAMR588-04-00435_37 (CP016575). A SNP-based phylogenetic tree was then construced (Fig 3).

P fin.png

  Fig.3 SNP-based Phylogenetic tree of the plasmids isolated from all 50 isolates. Isolates colored by PFGE-pattern. Blue= JF6X01.0022, Yellow= JF6X01.0051, 
  Orange= JF6X01.0167, Green= JF6X01.0173. F1 are outbreak isolates. F2 are sporadic isolates.

The isolates showed some clustering that matched the line-list meta-data. We viewed clustering based on PFGE pattern and some instances of groupings based on Source State, Isolated Date, and other epidemiological attributes. We continued to see clustering with the sporadic isoaltes: Cluster 1= SP0005, SP0006, SP008; Cluster 2=SP0004, SP0017, SP0019; Cluster 4= SP0009, SP0011, SP0012. These same groupings were found in the Prophage analysis as well as WG-SNP and MASH.


In order to find out differentially present/absent genes in 26 isolates from sporadic group and 24 isolates from outbreak group, we performed pangenome analysis.Pangenome represents the union of the gene sets which include "core genome" containing genes present in all isolates, a "dispensable genome" containing genes present in two or more isolates, and finally "unique genes" specific to single isolate. For pangenomics studies, Roary is one of the high speed stand alone pan genome pipeline which can analyze 128 samples in under 1 hour using 1 GB of RAM and a single processor. Roary accepts GFF3 files as an input and outputs the list of core genes,differentially present/absent genes and summary statistics.

Out of total 8773 genes from all isolates, 3664 genes belong to core genome, soft core genes(shared between 95%-99% of total isolates) are 494, shell genes (shared between 15%-95% of total isolates) are 418 and cloud genes(shared between 0%-15% of total isolates) are 4197. Figure 4 represents present genes in all sporadic and outbreak isolates.


 Fig.4 Pangenome Analysis of 24 Outbreak and 26 Sporadic samples. 

We identified the differentially present/absent genes in 3 possible clusters of outbreaks. Cluster 1 (SP0005,SP0006,SP0008) has total 38 unique genes, cluster 2 (SP0004,SP0017,SP0019) has 7 unique genes and cluster 4(SP0009,SP0011,SP0012) has 129 unique genes. Functionally important genes among the unique genes in all clusters are summarized in Figure 5.

Selection 084.png

 Fig.5 Unique Genes present in Cluster 1, 2 and 4



Shown below is a BRIG image comparing representative genomes from each clade determined from the SNP based tree. SP0014 is used at the reference genome. A large insertion can be shown to be present in SP0014 and a few other sporadic genomes of PFGE pattern JF6X01.0022 but absent from seemingly all other PFGE patterns and all the outbreak genomes.

Rings are colored by PFGE pattern and separated by outbreak and sporadic. Outbreak genomes are all colored teal (all are PFGE pattern JF6X01.0022). Sporadic PFGE pattern JF6X01.0022 are colored yellow (SP0014 falls into this group). Sporadic PFGE pattern JF6X01.0051 are blue. Sporadic PFGE pattern JF6X01.0167 is green. Sporadic PFGE pattern JF6X01.0173 is purple. The alternating red and blue ring on the inside shows the contig boundaries of the reference genome (SP0014).

Shown below is a BRIG image comparing representative plasmid sequences from each clade determined from the plasmid SNP based tree. It can be seen that one plasmid (labeled SPAdes_34_len43882 in the BRIG image) is present in all the genomes compared for this. This particular plasmid was found in 48 of the 50 Salmonella samples that we studied. OB0024 was used as the reference for this image since it contained 5 plasmid sequences while most other samples had only 2 or 3.

Rings are colored by PFGE pattern and separated by outbreak and sporadic. Outbreak samples are colored teal and sporadic samples are purple. The alternating red and blue ring on the inside shows boundaries between different plasmid sequences seen in the reference (OB0024).


We utilized various comparative genomic techniques to deduce any patterns that cluster the sporadic isolates and relate back to the line list metadata. Three cluster patterns were derived based on groupings found in the various analyses.

Cluster 1: SP0005, SP0006, SP0008

The first cluster identified contained isolates: SP0005, SP0007, SP0008 (Table1). These isolates were isolated from clinical patients in June 2015 in the southeast regional lab in the US. They showed a PFGE pattern of JF6X01.051. These isolates clustered together in both whole-genome techniques as well as the plasmid SNP phylogeny and the prophage analysis. This shows the phylogenetic relatedness of these specific isolates. The pangenomes of this cluster showed for 38 unique genes that were found in this cluster and not in any other isolates in the study. Two of these unique genes, group_5460 and virB2_2, encoded for the components of a Type IV secretion system. This system could have been obtained from a horizontal gene transfer event (13). The type IV secretion system has been shown to play a role in virulence in other Gram-negative bacteria and therefore, could provide some form of a virulence factor for this specific cluster of isolates (13).

Table 1: Group1.png

Cluster 2: SP0004, SP0017, SP0019

The second cluster identified contained isolates: SP0004, SP0014, and SP0017. These isolates were from clinical and environmental patients fround in the central and mountain regions of the U.S. They belonged to the PFGE pattern JF6X01.022. These isolates showed grouping patterns in both the whole-genome analyses as well as the plasmid and prophage analyses. This group contained 7 unique genes in their pangenomes that were not found in any of the other isolates. However, based on the CDC guidelines for recognizing clusters, this grouping would not be considered a cluster do to the fact that the grouping does not occur in the correct time range and the isolates come from multiple different regional labs.

Table 2: Group2.png

Cluster 4: SP0009, SP0011, SP0012

The final cluster identified contained the isolates: SP0009, SP0011, and SP0012 (Table 3). These isolates were isolated from clinical patients from March 2015 from the northeast regional lab in the US. These isoaltes showed the JF6X01.022 PFGE pattern. This cluster showed grouping patterns in both the whole-genome analyses as well as the plasmid and prophage analyses. The pangenome of this cluster contained 129 unique genes that were only found in this cluster. One of these unique genes, ampC_3, encodes for a beta-lactamase which is an enzyme that confers resistance to B-lactams (penicilin). This shows possible anti-microbial resistance unique to this cluster.

Also, this cluster contained genes encoding for the CcdA/CcdB Type II Toxin-antitoxin system (14). This system is used to synthesize a protein that is a inhibitor of cellular proliferation and the other protein is its specific antitoxin. This system is used to ensure that the daughter cells coming from cellular proliferation contain the plasmid of the mother cells (14). If a daughter cell does not contain that specific plasmid, the bacterial toxin encoded by this system will target kill that daughter cell. This ensures a clonal population containing the desired plasmid (14). This correlates with the fact that the Salmonella plasmid has been shown to be important in the pathogenicity of this organism (2).

Table 3: Group3.png

Here, we show various comparative genomic techniques in order to better profile these "sporadic" isolates in an attempt to distinguish any patterns that could possibly group these isolates into clusters. These isolates did not have any known outbreak association, however, possible clusters were identified based on phylogenetic clustering matching the line-list metadata. This shows the power of next-generation sequencing techniques in the world of pathogen surveillance. As PFGE becomes more outdated, surveillance techniques would be able identify outbreaks based on comparative genomic analyses.


1. Ibarra, J. A., & Steele-Mortimer, O. “Salmonella—the ultimate insider. Salmonella virulence factors that modulate intracellular survival.” Cellular Microbiology, 11(11), 1579-1586. 2009

2. Rotger, R., Casadesus, J. “The virulence plasmids of Salmonella” International Microbiology, Vol.2: 177-184, 1999.

3. Marcus, S. L., et al. “Salmonella pathogenicity islands: big virulence in small packages.” Microbes and Infection. 2(2), 14-156. 2000.

4. Leekitcharoenphon, P., Kaas, R. S., Thomsen, M. C. F., Friis, C., Rasmussen, S., & Aarestrup, F. M. “SnpTree - a web-server to identify and construct SNP trees from whole genome sequence data.” BMC Genomics, 13(Suppl 7), S6, 2012.

5. Kaas RS, Leekitcharoenphon P, Aarestrup FM, Lund O. “Solving the Problem of Comparing Whole Bacterial Genomes across Different Sequencing Platforms.” PLoS ONE 9(8), 2014.

6. Ondov, Brian D, et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome Biology, 2016. 7. Liu, Yi-Yen, et al. “Construction of a Pan-Genome Allele Database of Salmonella enterica Serovar Enteritidis for Molecular Subtyping and Disease Cluster Identification.” Frontier Microbiology, 2016.

8. Demvzuk Walter, et al. “Phage-Based Typing Scheme for Salmonella enterica Serovar Heidelber, a Causative Agent of Food Poisoings in Canada.” Journal of Clinical Microbiology, col. 41 no. 9, 2003. 9. Zhou, You, Yongjie Liang, Karlene H. Lynch, Jonathan J. Dennis, and David S. Wishart. "PHAST: A Fast Phage Search Tool." Nucleic Acids Research. Oxford University Press, 14 June, 2011.

10. Arndt, David, Jason R. Grant, Ana Marcu, Tanvir Sajed, Allison Pon, Yongjie Liang, and David S. Wishart. "PHASTER: A Better, Faster Version of the PHAST Phage Search Tool." Nucleic Acids Research 44.W1, 2016.

11. Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC. “Getting Started in Biological Pathway Construction and Analysis.” PLoS Computational Biology, 4(2): e16, 2016.

12. A. Stamatakis: "RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies". Bioinformatics, 2014.

13. Seubert, A., et al. “Type IV Sercretion Systems.” Nature Reviews in Microbiology, September, 2004.

14. Bahassi, El Mustapha., et al. “ Interactions of CcdB with DNA Gyrase Inactivation of GyrA, Poisoning of the Gyrase-DNA Complex, and The Antidote Action of CcdA.” Journal of Biological Chemistry vol. 274: 10936-10944. 1999.

15. https://www.cdc.gov/salmonella/pdf/CDC-Salmonella-Factsheet.pdf

Personal tools