Faction I Genome Assembly Group

From Compgenomics2017

Jump to: navigation, search

Here are the contents from Faction 1: Outbreak

Members: Yanxi Chen, Carl Dyson, Sean Lucking, Chris Monaco, Shashwat Deepali Nagar, Jessica Rowell, Ankit Srivastava, Camila Medrano Trochez, Venna Wang, and Seyed Alireza Zamani


Contents

Introduction

Background

Samella.png

Figure 1:Salmonella Heidelberg (http://www.foodpoisonjournal.com)

Salmonella enterica subsp. enterica serovar Heidelberg is a gram-negative, rod shaped bacteria. It genome is approximately 4.75 Mb, containing ~4827 genes and having a GC content percentage of 52.1.

Heidelberg is an isolate of Salmonella enterica that can be found in chicken, beef, horses, etc. Thus, it maintains a means of contact transmission to humans. Exposure to these infected animals can lead to discomfort, illness, or even death. This serovar of Salmonella is one of the most common causes of enteric infections (food poisoning) in the United States.

People infected with Salmonella Heidelberg typically experience diarrhea, cramps, and fever 12 to 72 hours after exposure to the pathogen. The symptoms can be severe enough to require hospitalization, but with antibiotic treatment usually pass within 4 to 7 days. It can, however, pose a much more serious threat to children, the elderly, the immuno-compromised, or those without access to sufficient healthcare response.

Why is this isolate important? Serovar Heidelberg is a frequent culprit behind foodborne illnesses and outbreaks in the U.S.

Foodborne illness: illness resulting from exposure to contaminated food, pathogenic bacteria, viruses, or parasites.

Therefore it is a critical target in preventing future outbreaks and contamination in food sources.

This isolate is a commonly occurring outbreak, with a CDC-reported event occurring in multiple states as recently as November 28, 2016 linked to dairy bull calves from Wisconsin. This strain, as is often the case with this serovar, was resistant to multiple antibiotics. This factor compounds the importance of understanding and preventing the effectiveness of Salmonella Heidelberg from outbreaks in the future.

Data

We have received reads for 24 Salmonella enterica serovar Heidelberg outbreak isolates from 2013, which are sequenced using Illumina MiSeq (2nd Generation) and short paired-end reads (250 base pairs in length) from Centers for Disease and Prevention (CDC).

Objective

The objective of the Genome Assembly group is to clean and assemble the paired-end reads from the 24 isolates from Salmonella Heidelberg. Our work is to search and assess the best available tools and determine the best approach to assemble the bacterial genomes.

Methods

Pipeline

Flowchart.png


TrimGalore

Trimgalore is the most used tool to do quality control and trimming. It uses Cutadapt to find and remove adapters created by Illumina, primers, poly-A tails and other types of unwanted sequences. It also uses FastQC to perform the quality control of the reads.

trim_galore --illumina --clip_R1 20 --clip_R2 20 --three_prime_clip_R1 5 --three_prime_clip_R2 5 --paired <reads1 file> <reads2 file> -o <output directory>

Before-Per base sequence content.png After-Per base sequence content.png

Figure 2: Per base sequence content, before and after trimming

TrimGalore have two output files: a .zip file and a .html file.

Both files show all the calculated statistics:

- Basic Statistics: information such as the file name, file type, sequence length, and %GC

- Per base sequence quality: quality scores across all base positions in the read file

- Per tile sequence quality: heat plot showing average quality scores from flowcells (Illumina data)

- Per sequence quality scores: average quality scores per read in an input file, showing if subsets of sequences differ from the mean

- Per base sequence content: proportion at each base position for base-calling of A,C,G, or T

- Per sequence GC content: %GC in each sequence of the read file as compared to a theoretical normal distribution for the genome

- Per base N content: percentage of base positions where an N was called, indicating insufficient confidence for calling A,C,G, or T

- Sequence Length Distribution: total lengths of all of sequence fragments of the input file

- Sequence Duplication Levels: degree of duplication of sequences, which could indicate higher coverage or negative results such as PCR artifacts

- Overrepresented sequences: diversity of sequences contributing to the read, higher levels of overrepresentation could indicate contamination

- Adapter Content: utilizes k-mers to detect overrepresented adapter sequences that need to be trimmed

- Kmer Content: detects k-mers with positionally biased enrichment as compared to an even distribution at all base positions

KAT

KAT (Kmer Analysis Toolkit)[1] is a set of tools that is designed for assessing sequencing completeness for assembly, determining sequencing kmer bias, identifying contaminants and filtering them, quality check of genomic assemblies.

kat comp [options] <input_1> <input_2> [<input_3>]: The most common use case for this tool is to compare two (or three) K-mer hashes. The typical use case for this tool is to compare K-mers from two K-mer hashes both representing K-mer counts for reads. However, it is also common to compare K-mers generated from reads to those generated from an assembly. If comparing K-mers from reads to K-mers from an assembly, the larger (most likely the read) K-mer hash should be provided first, then the assembly K-mer hash second. The third optional jellyfish hash acts as a filter, restricting the analysis to the K-mers present on that set. The manual contains more details on specific use cases. Options:

 -o [ --output_prefix ] arg (=kat-comp)
                                       Path prefix for files generated by this
                                       program.
 -t [ --threads ] arg (=1)             The number of threads to use.
 -x [ --d1_scale ] arg (=1)            Scaling factor for the first dataset - 
                                       float multiplier
 -y [ --d2_scale ] arg (=1)            Scaling factor for the second dataset -
                                       float multiplier
 -i [ --d1_bins ] arg (=1001)          Number of bins for the first dataset.  
                                       i.e. number of rows in the matrix
 -j [ --d2_bins ] arg (=1001)          Number of bins for the second dataset. 
                                       i.e. number of rows in the matrix
 -C [ --canonical1 ]                   (DEPRECATED) If counting fast(a/q) for 
                                       input 1, this option specifies whether 
                                       the jellyfish hash represents K-mers 
                                       produced for both strands (canonical), 
                                       or only the explicit kmer found.
 -D [ --canonical2 ]                   (DEPRECATED) If counting fast(a/q) for 
                                       input 2, this option specifies whether 
                                       the jellyfish hash represents K-mers 
                                       produced for both strands (canonical), 
                                       or only the explicit kmer found.
 -E [ --canonical3 ]                   (DEPRECATED) If counting fast(a/q) for 
                                       input 3, this option specifies whether 
                                       the jellyfish hash represents K-mers 
                                       produced for both strands (canonical), 
                                       or only the explicit kmer found.
 -N [ --non_canonical_1 ]              If counting fast(a/q) for input 1, this
                                       option specifies whether the jellyfish 
                                       hash represents K-mers produced for 
                                       both strands (canonical), or only the 
                                       explicit kmer found.
 -O [ --non_canonical_2 ]              If counting fast(a/q) for input 2, this
                                       option specifies whether the jellyfish 
                                       hash represents K-mers produced for 
                                       both strands (canonical), or only the 
                                       explicit kmer found.
 -P [ --non_canonical_3 ]              If counting fast(a/q) for input 3, this
                                       option specifies whether the jellyfish 
                                       hash represents K-mers produced for 
                                       both strands (canonical), or only the 
                                       explicit kmer found.
 -m [ --mer_len ] arg (=27)            The kmer length to use in the kmer 
                                       hashes.  Larger values will provide 
                                       more discriminating power between kmers
                                       but at the expense of additional memory
                                       and lower coverage.
 -H [ --hash_size_1 ] arg (=100000000) If kmer counting is required for input 
                                       1, then use this value as the hash 
                                       size.  If this hash size is not large 
                                       enough for your dataset then the 
                                       default behaviour is to double the size
                                       of the hash and recount, which will 
                                       increase runtime and memory usage.
 -I [ --hash_size_2 ] arg (=100000000) If kmer counting is required for input 
                                       2, then use this value as the hash 
                                       size.  If this hash size is not large 
                                       enough for your dataset then the 
                                       default behaviour is to double the size
                                       of the hash and recount, which will 
                                       increase runtime and memory usage.
 -J [ --hash_size_3 ] arg (=100000000) If kmer counting is required for input 
                                       3, then use this value as the hash 
                                       size.  If this hash size is not large 
                                       enough for your dataset then the 
                                       default behaviour is to double the size
                                       of the hash and recount, which will 
                                       increase runtime and memory usage.
 -d [ --dump_hashes ]                  Dumps any jellyfish hashes to disk that
                                       were produced during this run.
 -g [ --disable_hash_grow ]            By default jellyfish will double the 
                                       size of the hash if it gets filled, and
                                       then attempt to recount.  Setting this 
                                       option to true, disables automatic hash
                                       growing.  If the hash gets filled an 
                                       error is thrown.  This option is useful
                                       if you are working with large genomes, 
                                       or have strict memory limits on your 
                                       system.
 -n [ --density_plot ]                 Makes a spectra_mx plot.  By default we
                                       create a spectra_cn plot.
 -p [ --output_type ] arg (=png)       The plot file type to create: png, ps, 
                                       pdf.  Warning... if pdf is selected 
                                       please ensure your gnuplot installation
                                       can export pdf files.
 -h [ --output_hists ]                 Whether or not to output histogram data
                                       and plots for input 1 and input 2
 -v [ --verbose ]                      Print extra information.
 --help                                Produce help message.

Example:

     kat kat comp -t 16 -n -o OB0001.comp.out OB0001_R1_val_1.fq OB0001_R2_val_1.fq

kat filter <mode>: Filtering tools. Contains tools for filtering k-mers and sequences based on user-defined GC and coverage limits. First argument should be the filter mode you wish to use:

 * kmer:            Filters a jellyfish k-mer hash using user defined properties.
 * seq:             Filters sequences in a file based on k-mers in a given hash


The kmer spectrum of a data set can be very helpful when diagnosing problems with an assembly or assessing data quality. The y-axis of kmer spectrum represents the number of distinct kmers while the x-axis shows how many time each kmer has appeared in the data set.

OB0010.out-main.mx.spec.png Trimmed-OB0010 R1.out-main.mx.spec.png

Figure 3: The first plot depicts the kmer spectra at K=27 for the unfiltered fragment reads. The unfiltered reads are not error corrected, hence the strong divergence at low kmer frequency and the slight shift to lower kmer frequencies compared to the corrected second graph. After error correction (second graph), the divergence is gone. The closer the low frequency corrected spectrum is to zero, the better the error correction algorithm worked. The second plot depicts the same spectra as the first but for trimmed data. One can see the dramatic effect that error correction has on the very low frequency end of the spectra. It is also clear the power law distribution at high frequencies, roughly proportional to (kmer frequency)-2.

kat plot spectra-cn [options] <matrix_file>

Creates a stacked histogram showing the level of duplication in an assembly.

Shows K-mer duplication levels, which correspond to copy number variation within an assembly by comparing K-mers found in sequenced reads, to K-mers found in an assembly of those reads. Uses matrix output from the "kat comp" tool.

Options:

 -p [ --output_type ] arg (=png)                The plot file type to create: png, ps, pdf.  Warning... if pdf is selected please ensure your gnuplot
                                                installation can export pdf files.
 -o [ --output ] arg                            The path to the output file
 -t [ --title ] arg (=Spectra Copy Number Plot) Title for plot
 -a [ --x_label ] arg (=X)                      Label for the x-axis (value taken from matrix metadata if present)
 -b [ --y_label ] arg (=Y)                      Label for the y-axis (value taken from matrix metadata if present)
 -x [ --x_max ] arg                             Maximum value for the x-axis (default value auto calculated from matrix, otherwise 1000)
 -y [ --y_max ] arg                             Maximum value for the y-axis (default value auto calculated from matrix if possible, otherwise, 1000000)
 -w [ --width ] arg (=1024)                     Width of canvas
 -h [ --height ] arg (=1024)                    Height of canvas
 -i [ --ignore_absent ]                         Ignore K-mers in reads but absent from the assembly
 -m [ --max_dup ] arg (=6)                      Maximum duplication level to show in plots
 -c [ --columns ] arg                           Comma separated string listing columns to show in plot.  If used, this overrides "--ignore_absent"
 -u [ --cumulative ]                            Plot cumulative distribution of kmers
 -v [ --verbose ]                               Print extra information.
 --help                                         Produce help message.

SPAdes

SPAdes[2] (St. Petersburg genome assembler) – is an assembly toolkit containing various assembly pipelines.

SPAdes has a read correction module for Illumina reads (BayesHammer), which works well on both single-cell and standard data sets. It also has IonHammer, which is the read error correction tool for IonTorrent data, which also works on both types of data.

Like most de novo assembler, SPAdes requires a k-mer value. The values of K are selected automatically based on the read length and data set type - they can also be specified manually.

To run SPAdes 3.10.1 you need at least one library of the following types:

  • Illumina paired-end/high-quality mate-pairs/unpaired reads
  • IonTorrent paired-end/high-quality mate-pairs/unpaired reads
  • PacBio CCS reads


spades.py --help gives:

SPAdes genome assembler v3.10.0

Usage: /data/home/snagar9/bin/compGenomics/bin/spades.py [options] -o <output_dir>

Basic options:
-o      <output_dir>    directory to store all the resulting files (required)
--sc                    this flag is required for MDA (single-cell) data
--meta                  this flag is required for metagenomic sample data
--rna                   this flag is required for RNA-Seq data
--plasmid               runs plasmidSPAdes pipeline for plasmid detection
--iontorrent            this flag is required for IonTorrent data
--test                  runs SPAdes on toy dataset
-h/--help               prints this usage message
-v/--version            prints version  
Input data:
--12    <filename>      file with interlaced forward and reverse paired-end reads
-1      <filename>      file with forward paired-end reads
-2      <filename>      file with reverse paired-end reads
-s      <filename>      file with unpaired reads
--pe<#>-12      <filename>      file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1       <filename>      file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2       <filename>      file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s       <filename>      file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or>    orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#>          <filename>      file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12      <filename>      file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1       <filename>      file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2       <filename>      file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s       <filename>      file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or>    orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12    <filename>      file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1     <filename>      file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2     <filename>      file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s     <filename>      file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or>  orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1   <filename>      file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2   <filename>      file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger        <filename>      file with Sanger reads
--pacbio        <filename>      file with PacBio reads
--nanopore      <filename>      file with Nanopore reads
--tslr  <filename>      file with TSLR-contigs
--trusted-contigs       <filename>      file with trusted contigs
--untrusted-contigs     <filename>      file with untrusted contigs  
Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler        runs only assembling (without read error correction)
--careful               tries to reduce number of mismatches and short indels
--continue              continue run from the last available check-point
--restart-from  <cp>    restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')
--disable-gzip-output   forces error correction not to compress the corrected reads
--disable-rr            disables repeat resolution stage of assembling  
Advanced options:
--dataset       <filename>      file with dataset description in YAML format
-t/--threads    <int>           number of threads
                                [default: 16]
-m/--memory     <int>           RAM limit for SPAdes in Gb (terminates if exceeded)
                                [default: 250]
--tmp-dir       <dirname>       directory for temporary files
                                [default: <output_dir>/tmp]
-k              <int,int,...>   comma-separated list of k-mer sizes (must be odd and
                                less than 128) [default: 'auto']
--cov-cutoff    <float>         coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset  <33 or 64>      PHRED quality offset in the input reads (33 or 64)
                                [default: auto-detect]

SMALT

SMALT[3] is a reference-guided assembler which aligns DNA sequencing reads with a reference genome.

SMALT also performed best when the sequence and reference are more distant which is an advantage for us since we did not know what species we had and would be choosing representative reference sequences for alignment.

This script[4] runs SMALT on all the genomes based on the identified reference genome.

The basix steps involve creating an index from the reference genome and then mapping the reads onto the indexed reference genome.

Samtools was used to convert the generated SAM file to FASTA format.

smalt --help gives:

SYNOPSIS:
    smalt <task> [TASK_OPTIONS] [<index_name> <file_name_A> [<file_name_B>]]

Available tasks:
    smalt check   - checks FASTA/FASTQ input
    smalt help    - prints a brief summary of this software
    smalt index   - builds an index of k-mer words for the reference
    smalt map     - maps single or paired reads onto the reference
    smalt sample  - sample insert sizes for paired reads
    smalt version - prints version information 

Help on individual tasks:
    smalt <task> -H

ABySS

ABySS[5] (Assembly By Short Sequencing) creates assemblies using short-read, paired-end sequences. Utilizes de Bruijn algorithms and multi-threading to efficiently process genomes with relatively low computational requirements. Initially, ABySS takes in a value for k-mer size, upon which it reads in the input sequence files and forms initial contigs using all possible substrings of length 'k'. Then ABySS uses mate-pair information from the read files to extend these contigs, identify overlaps, and remove ambiguous alignments.

Input:

  > abyss-pe name=<assembly name> k=<target k-mer size> in='readsfile_1.fa readsfile_2.fa'

Additionally, ABySS has an internal metrics output for quickly checking the efficiency and success of de novo assembly, though it is not as in-depth or visually complex as the metrics software we used to calculate optimal assembly statistics.

Input:

  > abyss-fac sample_contigs.fa

Sample Output:

        n       n:200   L50   min     N80     N50     N20     max     sum
      483     186     20      201     35523   72731   118749  271687  4771786 sample_contigs.fa

ABySS also has several parameters for optimizing assembly based on the quality of the input reads:

   a: maximum number of branches of a bubble [2]
   b: maximum length of a bubble (bp) [10000]
   c: minimum mean k-mer coverage of a unitig [sqrt(median)]
   d: allowable error of a distance estimate (bp) [6]
   e: minimum erosion k-mer coverage [sqrt(median)]
   E: minimum erosion k-mer coverage per strand [1]
   j: number of threads [2]
   k: size of k-mer (bp)
   l: minimum alignment length of a read (bp) [k]
   m: minimum overlap of two unitigs (bp) [30]
   n: minimum number of pairs required for building contigs [10]
   N: minimum number of pairs required for building scaffolds [n]
   p: minimum sequence identity of a bubble [0.9]
   q: minimum base quality [3]
   s: minimum unitig size required for building contigs (bp) [200]
   S: minimum contig size required for building scaffolds (bp) [s]
   t: minimum tip size (bp) [2k]
   v: use v=-v for verbose logging, v=-vv for extra verbose [disabled]

Velvet

Velvet[6] is a de novo genome assembler that is designed to assemble read sequencing by maniputating the Brujin graphs to remove errors and resolve repeats. It can produce contigs with N50 length of 50 kb on paired-end prokaryotic data and a 3kb length for regions of mammalian data. Velvet provide the following two functions:

1. Velveth: This command constructs the data; hashing the reads into kmers.

Input format:

          > ./velveth output_directory hash_length [[-file_format][-read_type] filename]]

example:

         > ./velveth output  71  -fastq.gz -separate -shortPaired OB0001_R1_val_1.fq.gz OB0001_R2_val_2.fq.gz

2. Velvetg: This command takes the k-mer result from velveth and build the de Bruijin graph and run simplification and error correction over the graph. Contigs are then extracted and the final files are produced.

Input format:

         > ./velvetg output_directory/ [-options]

example:

         > ./velvetg output_directory/ -cov_cutoff 2 –min_contig_lgth 150

Notes:

Hash lengths in velvet are limited to 31bp and there for it is crucial for is to adjust this upper limit. To do that, we adjusted the MAXKMERLENGTH parameter at compilation time using the command make MAXKMERLENGTH=121 to change the upper limit to 121bp. Increasing kmer length means storing longer words, Velvet will then require more memory, so adjust as necessary. The most important file generated is called the contigs.fa file wich contains the sequences of contigs longer than 2k, where k is the kmer length. However, if you have specified a min_contig_lgth, then all contigs value shorter then that number will be omitted.

VelvetOptimiser [7]: VelvetOptimiser is a Perl script that automatically optimizes the three primary parameter options (hash length, -exp_cov, -cov_cutoff) for velvet. It required Velvet, Perl, and Bioperl.

Velvet.jpg

Figure 4:Velvet Optimiser https://microbialinformaticsj.biomedcentral.com/articles/10.1186/2042-5783-3-2


Input format:

         > VelvetOptimiser.pl [options] -f 'velveth input line'

Example:

         > ./VelvetOptimiser.pl -d ~/contig/0004 -s 21 -e 97 -x 2 -f '-fastq.gz -shortPaired -separate OB0004_R1_val_1.fq.gz OB0004_R2_val_2.fq.gz' -t 4 —optFuncKmer  'n50'
  • This command finds the best kmer length from 21 to 97 for all odd numbers. It takes in separate short pair reads. Runs maximum number of 4 threads (t 4) and optimizes by n50 score.

Notes:

After running VelvetOptimiser with is a wide kmer range, the best kmer length we found for our samples is around 71 (confirms the kmergenie prediction).

Velvet Columbus

Velvet Columbus[8] is an extension of Velvet that allows the user to provide reference sequences along with mappings of sequencing reads onto those reference sequences. We ran Velvet Columbus with VelvetOptimiser using commands in format such as:

          > ./VelvetOptimiser.pl -d ~/ref_velvet/0024 -s 61 -e 83 -x 2 -f '-reference GCF_001690035.1_ASM169003v1_genomic.fna -shortPaired -bam OB0024.sorted.bam' -t 4 —optFuncKmer  'n50'

Notes: Reference read must be in FASTA format. Paired or unpaired read alignments can be in either in SAM or BAM format.

SOAPdenovo2

blah

MaSuRCA

MaSuRCA[9] is a de novo genome assembler that combines the benefits of deBruijn graph and Overlap-Layout-Consensus assembly approaches. Since version 3.2.1 it supports hybrid assembly with short Illumina reads and long high error Pacbio/MinIon data.

Compile and install:

System requirements: Only Linux is supported (May or may not compile under gcc for MacOS or Cygwin, Windows, etc).

Hardware requirements: Both Intel and AMD x64 architectures are supported.

Installation instructions: To install, first download the latest distribution from ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Then untar/unzip the package MaSuRCA-X.X.X.tgz, cd to the resulting folder and run './install.sh'. The installation script will configure and make all necessary packages. In the rest of this document, '/install_path' refers to a path to the directory in which './install.sh' was run.


Running the assembler:

IMPORTANT! Do not pre-process Illumina data before providing it to MaSuRCA. Do not do any trimming, cleaning or error correction. This WILL deteriorate the assembly.

To run MaSuRCA, first copy in your assembly directory the template configuration file '/install_path/sr_config_example.txt', which contains all the parameters for assembly.

Second, run the 'masurca' script which will generate from the configuration file a shell script 'assemble.sh'. This last script is the main driver of the assembly.

Finally, run the script 'assemble.sh' to assemble the data.


Output: The output file are two .fasta files located at result_folder/CA/9-termination, named 'genome.ctg.fasta' and 'genome.scf.fasta', for contigs and scaffolds, respectively.


Important parameters:

GRAPH_KMER_LENGTH: This is k-mer size for deBruijn graph values between 25 and 101 are supported, auto will compute the optimal size based on the read data and GC content.

JF_SIZE: This is mandatory jellyfish hash size – 10x the genome size is a good starting value.


Assembly Results

Quality Control and Pre-processing

de novo Assembly

Reference-guided Assembly

Performing reference-guided assembly requires selection of a reference genome to which the assembly tool aligns your reads. The choice of reference genome is critical; if your sample genome(s) differ greatly from your reference genome, the resulting assembly will be incorrect. Assembly metrics typically evaluate length of contigs (longer is better), number of contigs (fewer is better), and an assessment of misassembled contigs (for reference-guided assembly). They will not capture loss of information that can occur if, for example, there are mobile DNA elements in the sample genomes that are not present in the reference genome (and thus do not align and are dropped). To guide us in selection of an appropriate reference genome, we evaluated average nucleotide identity between each of our sample genomes and complete reference genomes of S. enterica ser. Heidelberg downloaded from NCBI Nucleotide database.


Average Nucleotide Identity(ANI)

Meta Assembly

Since every de novo assembler implements its own set of features for generating the final set of contigs, the resulting assemblies may differ considerably. Furthermore, assembling a genome is part art, part science; there are several useful metrics for evaluating the assemblies (as discussed above), but the final choice often requires a trade-off between number of contigs, accuracy, and other metrics. One option for addressing this problem is to generate meta-assemblies, which attempt to merge and reconcile assemblies from several de novo assemblers to generate an assembly of better quality than that derived from any of the de novo assemblers. We chose to run CISA (Contig Integrator for Sequence Assembly), a tool that integrates contigs from different assemblers via 4 phases: identifying the contigs, removing and splitting possibly-misassembled contigs and clipping uncertain regions at the ends of contigs, merging the contigs with a minimum 30% overlap and estimating repetitive regions, and merging the contigs based on the size of repetitive regions. CISA is written in Python and employs NUCmer and blastn to implement sequence alignments; thus it requires installation of Python, the MUMmer system, and Blast+. To use CISA, we edited the Merge.config and CISA.config files to include the location and information about our de novo assemblies, the paths for the dependencies, and the output directory, and then ran the two scripts (Merge.py converts the data format for CISA.py, and CISA.py performs the meta-assembly).

python ./Merge.py Merge.config
python ./CISA.py CISA.config

CISA performed better on our samples than each individual de novo assembler. Also, the meta-assemblies preserve the length of the genomes for each sample, whereas the maximum length of the reference-guided assembly is the length of the reference genome (which is shorter than some of our sample genomes). The assembler with the best assembly scores, SPAdes, also has shortened total genome lengths for some of samples with larger genomes. This is important because one factor that can increase the bacterial genome size is the presence of plasmids. Plasmids are extra-chromosomal segments of DNA that help bacteria adapt to their environment (e.g. through antimicrobial resistance). Downstream analyses will certainly involve evaluations of pathogenicity, virulence, and other aspects important to investigation of an outbreak; thus it is important to preserve these elements in our samples.

Cisa 960.png
Figure : Comparison of assembly scores across de novo assemblies, meta assembly, and reference-guided assembly

Polishing and Gap-filling

Assembly Quality Assessment

Discussion

Conclusion

The reference-guided assembly performed better than the best de novo assembly and the best meta-assembly​

- ANI plots showed the 26 references to be over 99% similar to our 24 samples​

- Since the goal of this exercise is to determine why this specific Salmonella Heidelberg has a shorter time of incubation, it would be wise to use a de novo assembly-based methodology to help identify novel* genes involved in pathogenesis.​

Although the reference-guided assembly performed better, it may have dropped some indels, genes, and transposable elements that didn’t align to the reference​

- The assembly may be better, but we are interested in capturing these elements for downstream analysis​

- The best de novo meta assembly is the best compromise to maintain specificity and sensitivity for mobile DNA elements

Developed Wrapper

Helpful Resources

References

1. Jackson, Brendan R. "Outbreak-associated Salmonella enterica Serotypes and Food Commodities, United States, 1998–2008-Volume 19, Number 8—August 2013-Emerging Infectious Disease journal-CDC." (2013).

2. Gabaldón, T., & Alioto, T. S. (2016). Whole-Genome Sequencing Recommendations. In Field Guidelines for Genetic Experimental Designs in High-Throughput Sequencing (pp. 13-41). Springer International Publishing.

3. Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

4. Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., ... & Yang, B. (2012). Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph. Briefings in functional genomics, 11(1), 25-37.

5. Nagarajan, N., & Pop, M. (2013). Sequence assembly demystified. Nature Reviews Genetics, 14(3), 157-167.

6. Meyer, Matthias, and Martin Kircher. "Illumina sequencing library preparation for highly multiplexed target capture and sequencing." Cold Spring Harbor Protocols 2010.6 (2010): pdb-prot5448.

7. Walker, B. J., Abeel, T., Shea, T., Priest, M., Abouelliel, A., Sakthikumar, S., … Earl, A. M. (2014). Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS ONE, 9(11), e112963.

8. Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., … Pevzner, P. A. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5), 455–477.

9. Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075.

10. Krueger, Felix. "Trim Galore." A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files (2015).

11. Bankevich, Anton, et al. "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing." Journal of computational biology 19.5 (2012): 455-477.

12. Zerbino, Daniel R. "Using the velvet de novo assembler for short‐read sequencing technologies." Current protocols in bioinformatics (2010): 11-5.

13. Simpson, Jared T., et al. "ABySS: a parallel assembler for short read sequence data." Genome research 19.6 (2009): 1117-1123.

14. Li, Ruiqiang, et al. "SOAP: short oligonucleotide alignment program." Bioinformatics 24.5 (2008): 713-714.

15. Zimin, Aleksey V., et al. "The MaSuRCA genome assembler." Bioinformatics 29.21 (2013): 2669-2677.

16. "Next Generation Sequencing (NGS)/De novo assembly." Wikibooks, The Free Textbook Project. 4 May 2016

17. Henson, Joseph, German Tischler, and Zemin Ning. "Next-generation sequencing and large genome assemblies." Pharmacogenomics 13.8 (2012): 901-915.

Personal tools