Faction I Comparative Genomics Group

From Compgenomics2017

Jump to: navigation, search



Salmonella enterica subsp. enterica serovar Heidelberg is a serovar of the O:4 (B) serogroup of Salmonella (a gram negative, rod-shaped bacteria). It is frequently found in poultry and poultry meat, and is increasingly implicated in human foodborne illness. It is the fourth most common Salmonella serotype in human foodborne disease.


During our analyses, we pursued three main objectives:

  1. Develop a novel approach to distinguish our Salmonella isolates by relevant classifications, such as specific outbreak location.
  2. Identify differentially present gene features across outbreak groups (including pathogenicity islands, virulence factors, antimicrobial genes, #SNPs and insertions/deletions, and other genomic features).
  3. Research the possible functional implications of the genes identified in objective #2.

We are also particularly interested in finding a possible sequence-level explanation for the isolates with an unusually-short incubation period.


The data we have are: assembled genomes for 24 Salmonella enterica serovar Heidelberg isolates from outbreaks in 2013 (fasta files), assembled plasmids for the 23 isolates that contain one or more plasmids (fasta files), predicted gene locations (gff files), and annotated genomic features and genes (gff files).

Our approach for comparative genomics can be broadly divided into two types: whole genome approaches and phylogenetic approaches. Whole genome approaches can help us locate large regions that display variability across out isolates. They can provide a “big picture” view that can answer questions such as:

  • Is variation located to specific regions?
  • Are the plasmids the main source of variation?
  • How similar/different are the isolates across the whole genome?

Whole genome approaches are relatively low-resolution approaches that can locate homologous blocks, recombinations, rearrangements, large duplications or deletions, and unique genes. We will combine this approach with phylogenetic approaches, which perform clustering based on gene-content similarity, orthologs, SNPs, pan-genome, and core-genome. Phylogenetic approaches can identify whether certain isolates share features that set them apart from others (forming a cluster).

We can also classify our approaches in terms of the level of resolution they offer for examining differences among the isolates. Our lowest-resolution approach involves identified presence or absence of genes annotated using CARD and VFDB across all isolates. This approach is the easiest to interpret because we are working at the resolution of genes whose function has been established. The next approach, identifying presence or absence of genomic features, is a higher-resolution method that we developed. This approach can be trickier to interpret since the findings are not guaranteed to link back to biologically well-understood genomic features. At a higher-still resolution are orthologs and SNPS, and we have chosen tools to examine the isolates at this level as well. Interpretation at this level of resolution is considerably more difficult because functional research is likely to be lacking.


Presence or Absence of Genes

We compared the gene annotations from VDFB and CARD across all isolates, removed all the non-informative columns (isolates for which the presence/absence profile is the same across all genes) and examined the remaining isolates. We investigated whether they formed a known cluster (e.g. if they belong to a specific outbreak), which unique gene(s) they contained that wasn’t found in remaining isolates, and then also investigated the functional implications of the gene(s).

Pres abs method.jpg

Figure 1:Manipulation of data to study presence vs. absence of genes across all isolates.

Class Correlation

We developed a novel method which we call the Class Character Scoring Method, and test it’s utility in distinguishing the different outbreaks identified by the CDC. This method identifies features (e.g. genes) that segregate between classes (e.g. outbreak or PFGE pattern). The required inputs are an annotation (gff) file for each sample and metadata indicating classifications of the isolates. The outpus are vectors that score samples according to how well they fit into a given outbreak and identify features unique to a given outbreak.

This method begins with the construction of a matrix containing statistics that estimate the correlation between features and classes. To construct this matrix two matrices are created that quantify the relationship between a sample and a feature as well as a sample and a class. The matrix for samples and features contains a 1 in a cell if the sample for that cell contains the feature, and zero if it doesn’t. For this project I wrote a script to automatically extract features from a set of gff files and construct this matrix in csv format. The matrix for classes contains a 1 if the sample belongs to that class and a 0 if it doesn’t. This matrix is easy enough to construct in excel. These two matrices are then multiplied together to get a matrix of feature-class correlation statistics.

Quantifying sample-feature relationship, sample-class relationship, and their correlation:




Figure 2:Class Correlation methods explained

These two matrices are then multiplied together to get a matrix of feature-class correlation statistics. Separate vectors of feature-class correlations are then created from this correlation matrix. In each of these resulting vectors, there will be a feature-class correlation value for a feature, if the absolute value of the feature-class correlation is maximal for the class corresponding to the vector. These vectors are further refined by applying cutoff values to eliminate correlations near zero. The vectors are then all scaled to a size of 1, to create a set of final vectors called characteristic vectors. The dot product of these final vectors and a given samples vector of normalized feature content with produce that samples, class score which if relatively positive indicates likely membership in that class or if negative indicates lack of membership in that class. Further, the content of the characteristic vectors can indicate the presence or absence of features that distinguish a given class.

Average Nucleotide Identity (ANI)

Average Nucleotide Identity (ANI) is an in silico method to replace DNA-DNA hybridization that involves alignment of two genomes followed by pairwise comparison of matching regions. The typical species boundary threshold is 95%. Because our isolates are all of the same serovar, we expect our boundary should be higher than 95%. There are several different methods of ANI that general differ in choice of algorithm (e.g. MUMmer, BLAST) and input (e.g. whole assembly, 1020nt fragments). We implemented ANI using the ANI.rb script from the Kostas lab.

The input to ANI.rb is a set of fasta files, and the output is a matrix of average nucleotide identities from which a heatmap is generated using R.


Mash is a computational algorithm that implements MinHash sketches (MinHash is an optimized set comparison algorithm) to accurately and efficiently compute pairwise genomic comparisons. Mash reduces large sequences to small, representative sketches. Then a distance function compares the sketches and returns an approximation of the Jaccard index which is the fraction of shared k-mers. This Mash distance is a metric for estimating mutation rate between two compared sequences.

The parameters used in Mash to calculate pairwise distances for sequence data were Kmer size = 21 and Sketch size = 1000.

BLAST Ring Image Generator

BLAST Ring Image Generator (BRIG) is tool to visualize potential differences across prokaryotic genomes among categories such as antibiotic resistance genes and virulence factors. It BLASTs input sequences against a user-chosen reference sequence and creates a ring image of the results. Its input is fasta sequences (either protein sequences in amino acid format, or gene sequences in nucleotide format). It outputs a jpeg image file of the resulting ring that compares the genomes of interest.

Ortholog Clustering (OrthoFinder)

Orthologs are genes that evolved from a common ancestral gene via speciation (formation of new species from a biological population over evolutionary time). Orthologs generally retain the same function over evolutionary time; therefore, their identification is an important component of gene prediction in newly-sequences genomes. Two very closely-related organisms are likely to have similar DNA sequences between orthologs they share. We do not expect identification of orthologs to give us discriminatory power because our isolates are likely very closely-related; however, we will still apply the tool OrthoFinder to check for orthologs among our isolates.

OrthoFinder is a python tool for identification of orthologous protein sequence families. It’s input is aa directory of FASTA files with protein sequences, one per species. It outputs a file containing the orthologs, their composition, a presence/absence matrix of these groups across the different isolates, a phylogenetic classification, and some summary statistics.

Function: OrthoFinder has a 5 step algorithm:

  1. All-vs-all BLAST:

a.All the protein sequences are blasted against every other protein in the dataset to get an e value score, coverage, and identity

  1. Gene length and phylogenetic distance normalisation of the BLAST bit scores. This is done so that the best hits between all species achieve the same scores regardless of sequence length or phylogenetic distance. This is done to identify and remove the gene similarity dependency on gene length and phylogenetic distance.
  2. Delimitation of orthogroup sequence similarity thresholds using RBNHs: This step uses information from Reciprocal Best length-Normalised hits to define the lower limit of sequence similarity for putative cognate genes of each query gene.

a.To be included in the orthogroup graph a gene-pair must be an RBNH or produce a hit that is better scoring than the lowest scoring RBNH (irrespective of species) for either gene.

  1. Constructing an orthogroup graph for input into MCL: Putative cognate gene-pairs are identified as above and are connected in the orthogroup graph with weights given by the normalised BLAST bit scores.
  2. Clustering of genes into orthogroups using MCL

The output can be parsed into two categories - a core gene set and a variable gene set.

  • Core Gene Set: Any gene that is present in all the samples will be included in the Core gene set. Phylogenetic trees can be constructed to see if the outbreaks cluster together. The core genes can be examined to find any genes that have an uncommon number of SNPs across samples.
  • Variable Gene Set: Any gene that is absent in at least one of the samples will be part of the variable gene set. A presence/absence matrix can be made to check which genes may cause atypical behavior in some outbreaks.

SNP Trees (kSNP)

SNPs are DNA sequence variations that occur when a single nucleotide – A, T, C, or G – differs between members of a given species. Given the low degree of sequence polymorphisms in S. Heidelberg standard techniques that compare the presence/absence of genes is poorly informative and inefficient. A new typing strategy based on analysis of single nucleotide polymorphisms (SNPs) might be useful. Single nucleotide polymorphisms (SNPs) fall into two major groups: synonymous- SNPs and nonsynonymous- SNPs. The latter, if present in coding regions, introduce amino acid changes to the proteins. This in turn may influence the phenotype and be subjected to selection pressure. Screening for nonsynonymous- SNPs in resistance and virulence-conferring genes provides important insights into the molecular mechanisms and dynamics of the development of drug resistance and pathogenicity.

Synonymous SNPs are believed to be evolutionarily neutral and thus can be used for studying phylogenetic relationships among S. Heidelberg strains.

We chose the kSNP tool for this approach. Our approach for finding marker SNPs is outlined below.

KSNP method.png

Figure 3:A reference/alignment-free method (kSNP) was used to call SNPs. A PERL script maps the SNPs back to Genes. Association of these genes with virulence and antibiotic drug resistance was found by mapping these genes to annotation files.

kSNP commands used:

  • Create the input file for the kSNP3 –in option:
    MakeKSNP3infile GenomesDirectory kSNP3_in_list A
  • Make fasta file from infile for kChooser:
    MakeFasta all_assembly_list sh.fasta
  • Use kChooser to find optimum k-mer size:
    Kchooser sh.fasta
  • Use this k-mer size to run kSNP:
    kSNP3 -in in_list_A -outdir outbreak_kSNP -k 19 -vcf -ML


Presence or Absence of Genes

Pres abs res1.jpg

Figure 4:

Pres abs res2.jpg

Figure 5:

Class Correlation

Panther performed slightly better than Pfam for dispersion/differentiation based on incubation period (3 hours vs. 72 hours) and outbreak (1307COJF6-1 (Colorado) vs. 1307ALJF6-1 (Alabama) vs. 1306MLJF6-3 (New York, Kentucky, North Carolina, and Nebraska), and 1306MLJF6-1 (California). From these results we can conclude that with this class correlation scoring method we have a moderate ability to distinguished samples into the CDC defined outbreak categories.


Figure 6: Scores by incubation period


Figure 7: Scores by outbreak


Figure 8: Scores by outbreak and incubation period


Figure 9: Characteristic vector content for the 3hr incubation classes


Figure 10: Characteristic vector content for the 72hr incubation classes


Figure 11: Normalized scores for the 3hr incubation classes


Figure 12: Normalized scores for the 72hr incubation classes

Based on the results from the class correlation procedure, one can look up relevant genomic features either in an annotation database or a genomic feature lookup table to try to identify the biological relevance of the informative feature.

We looked up one of our relevant Panther features and found the following information that points to a functional relevance of this feature that separates some of our isolates among two classes:

  • Aminoglycoside N(3)-acetyltransferase IV
  • Gene: aacC4
  • Organism: Salmonella sp.
  • Status: Reviewed
  • Function: Resistance to antibiotics containing the 2-deoxy-streptamine ring including gentamicin, kanamycin, tobramycin, neomycin and apramycin.

Average Nucleotide Identity (ANI)

The ANI results were non-informative for sequence-level data, with the average nucleotide identity ranging from 99.99% to 100%. The ANI results for plasmid DNA only range from 99.89% to 100% and the distribution is shown in the heatmap below. The isolate pairs OB0008 and OB0010, as well as OB0016 and OB0018, seem more different from most of the other isolates. However, this result does not correspond with any of the epidemiologic data provided by CDC.


Figure 11:Heatmap of average nucleotide identity across all plasmid sequences


The results of MASH place isolates OB0022 and OB0009 in a cluster, and also suggest cluster containing OB0023, OB0024 (these two come from one outbreak, 1306MLJF6-1 – California), OB0004 (from 1307ALJF6-1 – Alabama), OB0012, OB0015, and OB0013 (1307COJF6-1 - Colorado).

Mash k21 sketch1000.png

Figure 12:Heatmap of Mash distances across all assembled genomes

BLAST Ring Image Generator


Figure 12:Heatmap of Mash distances across all assembled genomes

Ortholog Clustering (OrthoFinder)

The results of OrthoFinder were not discriminatory.

SNP Trees (kSNP)

The SNPs unique to each outbreak were mapped to Virulence Factor Database and only 2 genes from Alabama outbreak (SsaC and RatB) were found to be associated with virulence.

  1. SsaC - codes for Type III secretion system outer membrane ring protein. Nonsynonymous mutation is seen at position 22 ( Gly -> Ser)
  2. RatB - codes for putative outer membrane protein. A nonsynonymous mutation at position 297 changes Proline to Threonine.

KSNP results.jpg

Figure 13:The phylogenetic tree above shows the close clustering of four outbreak classes.

KSNP result.png

Figure 14:Number of SNPs present in one particular outbreak class and not present in other classes. It is seen that 63 SNPs characterize the Alabama outbreak samples. 28 SNPs characterize the Colorado outbreak samples and 2 SNPs distinguish Foster Farm 2 outbreak.

Personal tools