Faction II Genome Assembly Group

From Compgenomics2017

Jump to: navigation, search




Salmonella Heidelberg is a serovar of the O:4 serogroup, commonly found in poultry meat in North America. Outbreaks that were caused by chicken infections in 2013 were known to be multi drug resistant strains. In the United States, it is the seventh-most common serovar (Switt 2017). According to a population-based case-control study in 5 areas under FoodNet surveillance in 1996, one of the most significant risk factors found during the 5-day exposure period was eating eggs prepared outside of one's home (Hennessy 2004). Due to the irregularity of sporadic cases, it is of high priority to be able to obtain as much genomic data from such isolates so that the underlying epidemiological bases can be better understood.


Using 26 isolates of Salmonella Heidelberg classified as sporadic from the CDC. Sequence data generated from Illumina HiSeq, is 250 bp, and paired end.


  • Determine the best method to assemble the Salmonella genomes
  • Evaluate and compare the available tools
  • Assemble reads and combine results into super-assembly
  • Compare results from different tools and find the best assemblies

Main Genome Assembly Pipeline

Fraction 2 assembly workflow.png

Figure 1. Pipeline of Genome Assembly


Before we start preprocessing the data, we first explore the data to obtain the basic statistics of our illumina reads.

Faction 2 genome assembly comparing insert sizes to read lengths.png

Figure 2. 2X read length vs. insert size

Faction 2 genome assembly read depth.png

Figure 3. Average Read Depth of each sample

We decided to use Trim Galore! to trimm all of our data. The following image is the visualized quality report of preprocessing and post processing reads. Preprocessed2.jpg

Figure 4a. Preprocessing Reads


Figure 4b Post Processing Reads


Choosing the Reference Genome

On NCBI, we gathered 52 reference sequences belong to Salmonella Heidelberg. In order to compare these references, we mapped the assemblies of our 26 samples back to all of these reference sequences using Quast. Over half of the reference sequences show almost the same coverage for all of the samples(over 99%). They might be very similar sequences. We picked the reference that gives the highest coverage which is also just slightly higher than the second highest one. The Final Reference sequence we use is "Salmonella enterica subsp. enterica serovar Heidelberg strain SH12-007, complete genome"(Accession: CP016504.1 GI: 1045684451).


Figure 5. Coverage of 26 samples on all reference sequences.We mapped assemblies back to reference sequences using Quast. Different samples are colored with distinct color in favor of comparison. The final reference we picked is the first one on the graph. As we can see many of the references have the same value as the final one. They might be similar in sequence. Later you will see the coverage is slightly lower than those calculated by ABACAS. That's because Quast filtered contigs smaller than 500bp.

Choosing De Novo Assemblers

In order to decide which De Novo assemblers to test for our analysis, we consulted the GAGE-B study by T. Magoc et. al.[7] This study was created in order to determine a new "Gold Standard" for bacterial genome assembly. The GAGE-B study compared the performance of multiple genome assemblers on 12 bacterial samples with varying GC content. MaSuRCA and SPAdes performed the best by a clear margin in the results, with MaSuRCA slightly outperforming SPAdes. This caused us to expect the best results with MaSuRCA, but as will be seen in our results, MaSuRCA did not provide us with the best assemblies for our data. This could be due to the fact that MaSuRCA performs best with a combination of short and long reads, but we only had short reads. This could also be simply because even in the GAGE-B study, while MaSuRCA was the best for the HiSeq reads, SPAdes outperformed MaSuRCA on certain MiSeq reads and our data is taken from Illumina MiSeq reads. In the GAGE-B study, Velvet performed very poorly and ABySS was somewhere in the middle, but we decided to include these tools in our analysis anyway due to their widespread popularity and common use.

We later discovered another assembler called Unicyler that outperformed all the other tools we tested in our preliminary results. Unicycler is a newly developed tool that came out after the GAGE-B study and as such was not included in the GAGE-B study and did not appear in our preliminary results. Unicycler results and how they compared to our other final assembly results can be seen later in the "Merged Assemblies" section of this wiki.

De Novo Assembly

We tested SPAdes, Velvet, ABySS and MaSuRCA on all 26 samples and calculated the assembly score in the way we introduced above. For ABySS and Velvet, a scaffolding step was introduced in their assembly pipeline. So we ran Quast with option --scaffold which split the scaffolds back to contigs by detecting poly-"N"s on scaffold sequences. In both the boxplot and the heat map we show below, SPAdes is the best among these assemblers in terms of assembly score. We thus use SPAdes assembly in further comparison with assemblies produced by mergers.

Assemblers1.png Assembler2.png

Figure 6a & 6b| Assembly score of original assemblies generated by different assemblers.3a is the boxplot to compare the performance of different assemblers. Black dots outside of the box range are outliers produced by boxplot function. Individual samples can be distinguished by their colors. 3b is a heat map comparing assemblies of each sample produced by these assemblers. In most samples, SPAdes has the best performance.

Faction 2 genome assembly assembler score comparison.png

Reference Based Assembly


Figure 7. Reference-guided Assembly comparison Coverage map generated by BRIG. By mapping the reads back to the reference genome, we can see a gap on sample 1,2,3,5,6,8,25. This shows possibly large deletion between sample genome and the reference sequence.

Merged Assemblies

Postassembler1.png Postassembler2.png

Figure 8a & 8b|Assembly score after assembly merging comparing with SPAdes and Unicycler. Assembly scores of raw merger outputs are larger than SPAdes. However, after splitting the scaffolds in these outputs, the assembly scores drop significantly. Split assemblies have even lower assembly scores than the SPAdes output. In terms of assembly score, the software Unicycler performs the best.

Faction 2 genome assembly merged assembler score comparison.png

Post Assembly Improvements

Scaffolding and Sorting Contigs


Figure 9a & 9b & 9c|Contigs sorting with ABACAS. Some basic analysis of the final result after sorting assemblies by ABACAS. 6a, samples with low coverage also shows gaps in coverage map shown before. 6b, most samples show the same length of contigs unaligned indicating some unaligned contigs exist universally in this species, which may be plasmid seqeueces.

Polishing and Gap Closing

Faction 2 genome assembly pilon performance.png

Figure 10. Comparison of number of mismatches before and after PILON polishing

Potential Tools and usage for Assembly Pipeline

Above assembly pipeline depicts 3 ways to generate the final assemblies. Following are the tools for generating the reference assemblies and the DeNOVO assembly.



Found on: http://prinseq.sourceforge.net/

Prinseq allows for filtering, reformatting, and trimming reads. It can also provide summary statistics of your sequences in graphical and tabular format.

Command used:
Prinseq--‐lite.pl ---fastq <filename> read1.fastq --fastq2 <filename> read 2.fasta --trim_left 12 

--trim_right 20


Found on: http://www.usadellab.org/cms/?page=trimmomatic

Trimmomatic does: -removal of adapters using the ILLUMINACLIP argument -removes low quality bases and N bases below quality 3 (default); LEADING and TRAILING arguments -adjustable sliding window (default being 4 bases at a time) with threshold setter for average quality base -HEADCROP allows one to cut off a certain amount of bases from the 5’ end before trimming (remove bias)

Command Used:
java -jar <path to trimmomatic-0.36.jar> PE -phred33 input_forward.fq.gz input_reverse.fq.gz

output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 HEADCROP:20

We opted to use TrimGalore! because although Trimmomatic and TrimGalore! both produced quality trimmed results (seen from our output FastQC images in the lecture slides), TrimGalore! allowed for better parameter selection in bias removal prior to trimming.

Trimm Galore

Trimm Galore is the tool we use to trim all the reads. It performs individualized trimming at 5' and 3' end, allow selecting read source to trim off adapter sequence, allow multiple input file at the same time and it also generates FastQC report at the end.


   trim_galore --<read source> --clip_R1 <R1 5' end trim> --clip_R2 <R2 5' end trim> --three_prime_clip_R1 <R1 3' end trim>
   --three_prime_clip_R2 <R2 5' end trim> --length <read length thresh hold> --paired



Found on: http://soap.genomics.org.cn/soapdenovo.html

SOAPdenovo2 is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. It is designed to assemble Illumina GA short reads, and this current iteration of the algorithm allows for memory consumption reduction in graph construction, resolves repeat regions in contig assemblies, and increases coverage in scaffold construction. For the purposes of our preliminary runs, we ran SOAPdenovo2 not only through contig formations but also through scaffolding.

Command Used:
SOAPdenovo -63mer all -s ~/data/config -K 63 -R -o graph_prefix

A configuration files must be made which specifies maximum read length, average insert size, if sequence needs to be reversed, which parts the reads are used, where your reads are, etc. Refer to the GitHub linked in the website link above for an example of how to create a configuration file.

This assembler was not used for the final pipeline, as better results were obtained from different assemblers.


Velvet is a de novo assembler that is designed for short read sequencing technology. De Bruijn graph is utilized to resolve repeats and minimize errors. In our project, we use VelvetOptimiser to run velvet. VelvetOptimiser allows us to run velvet with multithread mode and it also search the best kmer size wth given parameter.

VelvetOptimiser.pl  -s <minimum kmer size> -e <maximum kmer sizer> -x <step size> -f '-<file format> -<read type> 
-separate <path to read 1> <path to read 2> -t <allowed number of threads> -d <output directory> --optFuncKmer <desired optimal parameter i.e. 'n50'>


MaSuRCA stands for Maryland Super Reads CABOG Assembler. It is a relatively new de novo assembler that uses a novel algorithm that combines the benefits of overlap layout consensus with deBruijn graphs. It can use only short reads, but works best with a combination of short and long reads and was listed in the GAGE-B study as a possible new gold standard tool for de novo asssembly.

As input MaSuRCA takes in the raw reads generated from Illumina (it has options to use reads from a couple other supported platforms as well) with absolutely no pre-processing or trimming. It takes all necessary input parameters and file paths in using a configuration file that generates a script called assemble.sh which can then be run to assemble the reads.


masurca configure_file.txt

Sample Configure File:

PE= pe insert_size sd_insert_size forward_read.fastq.gz reverse_read.fastq.gz



#set this to 1 for all Illumina-only assemblies

#these are the additional parameters to Celera Assembler
#set cgwErrorRate=0.25 for bacteria
CA_PARAMETER = cgwErrorRate=0.25

#minimum count k-mers used in error correction 1 means all k-mers are used. ONE CAN INCREASE TO 2 IF iLLUMINA COVERAGE >100


#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000

#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes


Smalt is a reference-guided assembler that maps the reads back to reference. Smalt should perform best when the target sample is closely related to the reference genome; therefore, picking reference genome becomes a very important component in the reference-guided assembly. Although reference-guided assembler generally performs better than de novo assembler, but it is highly biased as it does not account for anything other than the reference genome (unexpected plasmids).


   smalt <task> [TASK_OPTIONS] [<index_name> <file_name_A> [<file_name_B>]]

Available tasks:

   smalt check   - checks FASTA/FASTQ input
   smalt help    - prints a brief summary of this software
   smalt index   - builds an index of k-mer words for the reference
   smalt map     - maps single or paired reads onto the reference
   smalt sample  - sample insert sizes for paired reads
   smalt version - prints version information

General pipeline:

   smalt index reference.fasta

   smalt map -f sam -o <out.sam> -n 3 <reference> <read1> <read2>  

   samtools view -bS <out.sam> > <out.bam> 

   samtools sort <out.bam> <out_sorted.bam>

   samtools index <out_sorted.bam>

   bam2fastq --no-aligned --strict --force -o <unmaapped.fq> <Final.bam>

   samtools mpileup -f <reference> -gu <Final.bam> | bcftools call -c -O b -o <final.raw.bcf>

   bcftools view -O v <final.raw.bcf> | vcfutils.pl vcf2fq > <input.fq> 

   seqret -sequence <input.fq> -outseq <output.fasta>


Unicycler is a hybrid assembler pipeline that uses both short reads (required) and long reads (optional) for bacterial genomes. We decided to choose Unicycler as our final tool because it performs better compare to other tools we used.

Unicycler performs improvement on the basis of the graph generated by SPAdes. The algorithm "untangles" the graph by bridging to make a circular sequence and form a contig (scaffolding). It also has three modes for the scaffolding: conserve, normal and bold where conserve mode only merged with bridges, normal mode merged contigs when there are bridges and 1X multiplicity and bold mode merge whenever possible.

The following is the basic option for the unicycler:

Unicycler: a hybrid assembly pipeline for bacterial genomes


 -h, --help                           Show this help message and exit
 --help_all                           Show a help message with all program options
 --version                            Show Unicycler's version number


 -1 SHORT1, --short1 SHORT1           FASTQ file of first short reads in each pair (required)
 -2 SHORT2, --short2 SHORT2           FASTQ file of second short reads in each pair (required)
 -s UNPAIRED, --unpaired UNPAIRED     FASTQ file of unpaired short reads (optional)
 -l LONG, --long LONG                 FASTQ or FASTA file of long reads (optional)


 -o OUT, --out OUT                    Output directory (required)
 --verbosity VERBOSITY                Level of stdout and log file information (default: 1)
                                        0 = no stdout, 1 = basic progress indicators, 2 = extra info,
                                        3 = debugging info
 --min_fasta_length MIN_FASTA_LENGTH  Exclude contigs from the FASTA file which are shorter than this length
                                      (default: 1)
 --keep KEEP                          Level of file retention (default: 1)
                                        0 = only keep final files: assembly (FASTA, GFA and log),
                                        1 = also save graphs at main checkpoints,
                                        2 = also keep SAM (enables fast rerun in different mode),
                                        3 = keep all temp files and save all graphs (for debugging)


 -t THREADS, --threads THREADS        Number of threads used (default: 8)
 --mode {conservative,normal,bold}    Bridging mode (default: normal)
                                        conservative = smaller contigs, lowest misassembly rate
                                        normal = moderate contig size and misassembly rate
                                        bold = longest contigs, higher misassembly rate
 --linear_seqs LINEAR_SEQS            The expected number of linear (i.e. non-circular) sequences in the
                                      underlying sequence (default: 0)

Assembly Merger

CISA (Contig Integrator for Sequence Assembly)

Integrate the assemblies into a hybrid set of contigs. We ran CISA for two different sets of contigs. Set one comprised of contigs from Abyss, SPAdes and Velvet. Set two comprised of contigs from Abyss, SPAdes, Velvet and Masurca.


Merging Reads:
$ python Merge.py <config>
Running CISA :
$ python CISA.py <config>

Metassembler Merges and optimizes de novo genome assemblies.


$ metassemble --conf <conf-file> --outd <output-dir>



  • Uses pre-assembled contigs from a de novo assembler to generate scaffolds. It estimates the gap size between contigs to construct scaffolds based on their spatial relationship.
  • Syntax is as follows
./SSPACE_Standard_v3.0.pl -l <library txt file> -s <input contig file> -k 5 -a 0.70 -n 15 -z 0 -b <output> -p 1 
  • We found that scaffolding created larger contigs but didn't improve the overall quality of the contigs because it didn't fill the gaps that existed between the input contigs. This step was not performed for our final assemblies.

Gap Filler

  • GapFiller is a stand-alone program for closing gaps within pre-assembled scaffolds. Gaps are iteratively filled from the left and right edge by incorporating one overhang nucleotide at a time, provided the position is sufficiently covered.
  • Has a perl version and a C version (for both versions we had problems running it on the server due to specific version dependencies to its libraries)
  • Syntax is as follows
$ perl GapFiller.pl -l <library.txt> -s <genome.fasta>
  • We got relatively good results from this tool. However we decided to skip using this since PILON did a much better job.


  • Via alternative assemblies or incorporating alternative data, this tool focuses on deriving sequences best suited for closing gaps.
  • The tool depends upon the functionalities of matlab and blast tools for working out potential sequences.
  • We used the trimmed reads from the preassembly step as alternative data for the tool.
  • Syntax is as follows
$ run_fgap.sh <Matlab-libs> -d <genome.fasta> 
  • There are several reasons why this tool did not make the final cut.
    1. Produced good results for very few samples
    2. Its not open source. Thus comes prebuilt. Therefore working out its dependencies are not straightforward.
    3. Matlab library dependency is not desirable since we cannot easily upgrade this library Matlab without obtaining commercial licences


  • Pilon is a software tool which can be used to automatically improve draft assemblies and find variation among strains (including large event detection)
  • Pilon uses read alignment analysis to identify inconsistencies between the input genome and the evidence in the reads.
  • Syntax is as follows
$java –Xmx15G –jar pilon-1.16.jar --genome <genome.fasta> 
  • While this was the obvious choice, we ended up not using this because the Unicycler tool internal uses PILON for its final improvement stage

Discussion and Conclusion

Calculating Assembly Scores

We found three ways to calculate the assembly score when consulting works and lessons in previous years. Our results show these three assembly scores performs almost the same when comparing different assemblers(Not shown here, but can be found in our final result ppt). However, assembly score2 doesn't consider the similarity of assemblies to the reference sequence, which causes tragic results in the assembly group of 2015 as they wrote on their wiki page. Since we have reference sequences very similar to our sequenced isolates' genome, we picked score1 because it seems highlight the effect of "Reference Genome Coverage" by moving it out of log-transformation.




Figure 11. Different designs of the assembly score. 1a & 1b are two assembly scores used by assembly groups of previous years. 1c is the assembly score introduced by last year's lecture. Three assembly score show almost equal capacity when used to evaluate assemblies. The final assembly score we picked is the first one.

To decide which assembly method to use, we looked at specific criteria including N50, # of contigs, and genome coverage. Genome coverage plays a very important role in our assembly score as Salmonella Heidelberg has a well-established reference genome. We decided to go with the de novo assembly due to the fact that plasmids are very important factors in our isolates, and the de novo assembly approach is unbiased and allows us to get the most information from the reads.

After calculating all of the scores of our approach, unicycler was determined to be the best in most of the samples compared to that of metassembler and SPAdes alone.

Faction 2 genome assembly no of genes comparisons with final denovo assemblies.png

Figure 12. Overall comparison of final result using Assembly score


  1. Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. 2013. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics 14(Suppl 7):S6. 10.1186/1471-2105-14-S7-S6.
  2. Wences, A. H. & Schatz, M. C. Metassembler: merging and optimizing de novo genome assemblies. Genome Biology 16, 207 (2015).
  3. Zimin AV, Smith DR, Sutton G, Yorke Ja: Assembly reconciliation. Bioinformatics (Oxford, England). 2008, 24: 42-5. 10.1093/bioinformatics/btm542.
  4. Lin S-H, Liao Y-C. CISA: Contig Integrator for Sequence Assembly of Bacterial Genomes. Watson M, ed. PLoS ONE. 2013;8(3):e60843.
  5. Aleksey V. Zimin, Guillaume Marçais, Daniela Puiu, Michael Roberts, Steven L. Salzberg, James A. Yorke; The MaSuRCA genome assembler. Bioinformatics 2013; 29 (21): 2669-2677. doi: 10.1093/bioinformatics/btt476
  6. Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.
  7. Tanja Magoc, Stephan Pabinger, Stefan Canzar, Xinyue Liu, Qi Su, Daniela Puiu, Luke J. Tallon, Steven L. Salzberg; GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 2013; 29 (14): 1718-1725. doi: 10.1093/bioinformatics/btt273
  8. Switt, Andrea. "Dashboard." Salmonella Heidelberg - Food Safety - Dashboard. N.p., 06 Jan. 2017. Web. 03 Feb. 2017.
  9. Hennesy, TW, LH Cheng, and H. Kassenborg. "Egg Consumption is the principal risk factor for sporadic Salmonella serotype Heidelberg infections: a case-control study in FoodNet sites." Clinical infectious diseases : an official publication of the Infectious Diseases Society of America. U.S. National Library of Medicine. 15 Apr 2004. Web 06 Feb
Personal tools