Faction I Gene Prediction Group

From Compgenomics2017

Jump to: navigation, search

Here are the contents from Faction 1: Outbreak

Gene Prediction Group

Members: Adam Dabrowski, Mike Finlayson, Shareef Khalid, Ramyasree Madupuri, Rushika Pandya, Kalyani Patankar, Brandon Smith, Venna Wang, Kai Yuan



Gene Prediction

Gene prediction is the step which follows genome assembly. It is the first step in structural and functional annotation of genes. It is used to locate genes along a genomic DNA sequence. It involves identifying a protein coding region and a non-protein coding region (ncRNA) in a genomic DNA sequence.




Our group was tasked with predicting the locations (Start and Stop sites) of all coding and non coding regions of genes in the genome. Since not all ORFs are genes we had to differentiate between possible ORFs and actual genes.

Data that we received from group 1:

Assembly data of 24 isolates of Salmonella enterica subsp. enterica serovar Heidelberg from an outbreak in 2013. There were two sets of files a) Fasta files of the all of reads (including plasmids), b) Fasta files of assembled plasmids only.


Ab-initio (Intrinsic): Predicts genes using only the genomic DNA sequence. It searches for signals and content (specific sequence, codon usage, GC content) of the protein coding region and statistical property of given DNA sequence.

Examples: Prodigal, GenemarkS, Glimmer

Homology Based: The given genomic DNA sequence is compared with the extrinsic sequence (reference sequences) to find coding regions in the given genome.

Examples: BLAST, ORFfinder.

Ab initio Prediction Tools


Gene Locator and Interpolated Markov Modeler


The ORFs (Open Reading Frames) are scored, and those that score above the threshold value form the candidate set. It begins at a stop codon and then proceeds to score in reverse order to calculate the probability that each individual nucleotide is part of the coding region. It then computes the log-likelihood that a given interval on a DNA sequence was generated by a model of coding versus noncoding DNA. We run Glimmer using the inbuilt script g3-iterated.csh, which includes the following commands

  long-orfs -n -t 1.15 genom.seq run3.longorfs
  extract -t genom.seq run3.longorfs > run3.train
  build-icm -r run3.icm < run3.train
  glimmer3 -o50 -g110 -t30 genom.seq run3.icm run3.run1
  tail +2 run3.run1.predict > run3.coords
  upstream-coords.awk 25 0 run3.coords | extract genom.seq - > run3.upstream
  elph run3.upstream LEN=6 | get-motif-counts.awk > run3.motif
  set startuse = ‘start-codon-distrib -3 genom.seq run3.coords‘
  glimmer3 -o50 -g110 -t30 -b run3.motif -P $startuse genom.seq run3.icm run3
  OUTPUT: .detail



This tool uses an unsupervised training tool for prediction and finding new genes. It utilizes the newest design for the model of a gene which produces two signals near the start codon of the gene are a positional Markov model of fixed length and a variable length spacer. All ORFs predicted by the ‘typical’ model are included into the training set of the next version of ‘typical’ model regardless of its prediction as ‘typical’ or ‘atypical’. The LFinder algorithm is a new addition that adds to the localization of the original installation of the Gibbs algorithm for better alignment upstream. The overall rate of missed COG genes for all the gene finders was less than 2%

  Command: gm [options] seqfile
  OUTPUT: .gff, .fna, .fsa


Prokaryotic Dynamic-Programming Gene finding Algorithm

Prodigal [3]

This tool uses a two pass dynamic programming algorithm to calculate log likelihood scores and GC bias in codons to score and predict possible genes, and also trains dataset to get the correct transcription initiation site.

The first dynamic programming pass consists of, adjusting for GC bias by creating a GC ‘frame plot,’ which gives the preference of each codon for G’s and C’s. Then for each codon within an ORF calculates the GC bias on each position. In case of start site overlaps the highest scoring gene is taken. Then using the frame plot and length of the ORF to score codons and create a preliminary gene model.

The second dynamic programming pass takes each hexamer in the gene model created and calculates a signal to background log-likelihood function. It then treats starts and end codons as nodes, and predicts genes according to the connection of nodes, scores based on likelihood function.

Final gene scoring: Start codon score + ORF score Bonuses and penalties based on overlaps and distance from operons

    INPUT: FASTA file
    Command: prodigal -i my.genome.fna -o genes.gff -f gff
    OUTPUT: .gff or .fna

Choosing Ab-Initio Tools

We compared the outputs of the three programs. Summarized in the following table. It was also noted that almost all of the high confidence genes glimmer presented were already predicted by either GeneMarkS or Prodigal. For these reasons we only included Prodigal and GeneMarkS in our final pipeline. We compared our results with salmonella Heidelberg genome which was annotated using the NCBI pipeline. The results are summarized below.

Choosing AbInitio.jpg

Exact Matches are those genes whose both start and stop codons match exactly. The Stop column contains the number of genes which only match on the stop codon and Overlap column includes genes which overlap by > 60bp. Extra (FP) are genes which were predicted by the tool but not in the NCBI annotated genome. Missed (FN) is the count of genes which were present in the NCBI annotated genome but not in the NCBI genome. Sensitivity was calculated as: TP/(TP + FP) x100. Where TP = (Exact Matches + Stop + Overlap). Positive Prediction Rate (PPV) was calculated as: TP / Total Predicted Genes.

Homology Prediction Tools


BLAST for Basic Local Alignment Search Tool

BLASTN [4](nucleotide-nucleotide BLAST) is a BLAST program that uses a heuristic method to search nucleotide databases using a nucleotide query. It enables the user to compare a query sequence with a library of sequences to identify library sequences that resemble the query sequences above a certain threshold.

To run BLASTN for our data, we first created a library sequences of all Samonella enteria Heidelberg sequences from NCBI. Then we ran BLASTN with our sequenced data using the following command:

  INPUT: FASTA file (or Genbank format)

  blastn -query scaffolds/OB0009_ORFs.fasta -db cds_sh.fna -max_target_seqs 1 -outfmt "6 std qcovs" -out scaffold_blastout/OB0009_ORFs_blastout.gff

  OUTPUT: .gff (various formats such as HTML, plain text, XML formatting)

ncRNA Prediction Tools

Non-coding RNA is the term used for RNA molecules that are not translated into proteins. ncRNA are large and diverse and involved in a variety of complex mechanisms, especially gene regulation. ncRNA are predicted using ab-initio tools, example Infernal-Rfam, tRNAScan-SE and RNAmmer.


tRNAscan-SE[5] is a program that predicts the presence of 99-100% of tRNA genes in a sequence with less than 1 false positive per 15bg. It uses a two pass approach to increase speed without sacrificing sensitivity. In the first pass, it uses highly selective pre-filters to identify potential tRNA genes. Then in the second pass it uses infernal to evaluate the candidate tRNA genes with highly selective covariance models. tRNAscan-SE is considered industry standard for the prediction of tRNA genes.

  COMMAND: tRNAscan-SE -P <input file>
  OUTPUT: tRNAscan-SE specific table format


Following the standards of the NCBI Prokaryotic Genome Annotation Pipeline tRNA predictions below a tRNAscan-SE score of 20 are discarded.

Each assembly should have a tRNA for all amino acids:

21 of 24 assemblies had a tRNA for all 21 amino acids as well as selenocysteine

OB0005 missing Glu

OB0006 missing Glu

OB0009 missing Glu, His, Trp, and Ile

Summary tRNA Prediction Metrics:

Average number predicted: 80.6

Minimum number predicted: 74

maximum number predicted: 90

Aragorn vs Infernal

ARAGORN[6] predicts tRNA and tmRNA genes. tRNA genes are predicted using heuristic algorithms that identify homology with recognized tRNA sequences and base-pairing. tmRNA genes are identified with an updated version of the BRUCE program. The goal of ARAGORN is to be highly sensitive and fast, and much like how tRNAscan-SE is used for tRNA genes, these features of ARAGORN could make it a useful tool in speeding up the prediction of tmRNA genes which would otherwise be predicted using infernal.

  Command: aragorn -m -rp <input file>
  OUTPUT: either FASTA or ARAGORN specific


To evaluate ARAGORN we tested its performance on 390 tmRNA sequences from tmRNA versus infernal 1.1.2.


ARAGORN performed relatively poorly compared to infernal 1.1.2. It had lower sensitivity and higher false positives. Attempting to increase the sensitivity of ARAGORN by lowering its scoring threshold produced many more false positive results. In addition to is performance shortcomings, ARAGORN produced awkwardly formatted results without scores. One positive aspect of ARAGORN was that it was greater than 2000x faster than infernal. Considering ARAGORN’s performance shortcomings and our need for accurate results, we decided not to use it for this project.



It is ab-initio based tool which predicts rRNA molecule coding regions in the genomic DNA. The programme uses hidden markov models trained on two databases; the 5S rRNA database and the european rRNA database project (ERRD). The program involves a pre-screening step which generates ‘spotter-model’ which are constructed from highly conserved loci within a structural alignment of known RNA sequences. It detects the position of the genes. Then the flanking regions are extracted through the full HMM model. This two-level approach increases the speed of prediction.

  INPUT: fasta files
  COMMAND: ./rnammer -S bac  -multi -gff output.gff < input.fasta
  OUTPUT: gff/fasta/html_report

-S kingdom bacteria/archae/eukaryote

-multi runs all molecules and both strands in parallel



Rfam 12.0 (Infernal 1.1.2)

Rfam[8] is a database containing over 2,500 RNA sequence families of structural RNAs including non-coding RNA genes and cis-regulatory elements. Each of the families is represented by multiple sequence alignments (MSAs)[9] and covariance models (CMs)- A secondary structure profile for a RNA structural alignment (also called profile stochastic context-free grammars).

Infernal ("INFERence of RNA ALignment") is an ab-initio tool used to search for homologs of structural RNA sequences, and to make sequence- and structure-based RNA sequence alignments. The Infernal software package is used to build CMs from the alignments and to search those alignments against the Rfam CM database to identify homologs.

  Input: FASTA
  Command: cmscan --tblout <cmdb> <seqfile>
  Output: Infernal specific table format

Converting to gff

  File {if (substr($0, 0, 1) != "#") printf("%- 30s infernal%- 20s %- 10s %- 10s %- 8s %- 4s\n", $3, $1, $8, $9, $15, $10); } 


  awk -f File infernal-table-file > output-file.gff




Gene prediction pipeline.jpg

Merging steps

Merging Ab-initio results

When merging results from ab-initio tools we defined a possible gene match as any genes with an overlap of greater than 60 basepairs. Our literature survey revealed that for protein coding genes overlapping by more than 60bps are likely the same [], and this value is also used in various gene annotation pipelines []. In our results as well we saw most genes which overlapped by greater than 60bps were genes which only differed by start codon sites. Merging steps for ab-initio tools were as follows:

1. Keep all predictions which match exactly and overlap by more than 60bp.

2. Blast overlapping genes ageinst RefSeq database and keep the hits with the highest identity and coverage.

3. If no hit found keep the longer gene

The average number of genes predicted by the tools over all the samples is shown in the figure:


Merging ncRNA and Coding Gene Predictions

Our strategy for merging ncRNA and coding gene predictions follows the approach taken by the NCBI Prokaryotic Genome Annotation Pipeline. For ncRNA predictions there is no specification for merging them other that accepting all ncRNA predictions, however, for coding genes that overlap with tRNA or rRNA genes there is a process by which some genes are discarded.


If a protein coding gene overlaps with an rRNA gene, it is discarded.

If a protein coding gene overlaps by 30 bases or more with a tRNA gene either may be discarded according the following:

  If the protein coding gene is homologous to a known gene we consider it a confident prediction.
  If the tRNA gene is not a pseudogene we consider it a confident prediction.
  If either prediction is not a confident prediction, it is discarded, unless both aren’t confident in which case both are kept.


  2 protein coding genes were removed from OB0013 due to rRNA conflict
  1 protein coding gene was removed from OB0013 due to tRNA conflict
  1 protein coding gene was removed from OB0006 due to tRNA conflict


Our final results are located here: /data/projects/prediction/final. The final format for our predictions are in GFF3 format, which will be handed over to the Functional annotation group.

Our Final Predictions are summarized in the table below.



Lukashin A, Borodovsky M: GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 1998, 26(4):1107–1115. 10.1093/nar/26.4.1107

Delcher A, Bratke K, Powers E, Salzberg S: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007, 23(6):673–679. 10.1093/bioinformatics/btm009

Lomsadze, A., Tang, S., Gemayel, K., & Borodovsky, M. GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition.

Besemer, J., Lomsadze, A., & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic acids research, 29(12), 2607-2618.

Bocs S, Cruveiller S, Vallenet D, Nuel G, Médigue C. AMIGene: Annotation of MIcrobial Genes. Nucleic Acids Research. 2003;31(13):3723-3726.

Oliynyk et al., (2007) Complete genome sequence of the erythromycin-producing bacterium Saccharopolyspora erythraea NRRL23338 Nature Biotechnology 25, 447 - 453

John E. Karro, Yangpan Yan, Deyou Zheng, Zhaolei Zhang, Nicholas Carriero, Philip Cayting, Paul Harrrison, Mark Gerstein; Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res 2007; 35 (suppl_1): D55-D60. doi: 10.1093/nar/gkl851

Hori H. Methylated nucleosides in tRNA and tRNA methyltransferases. Frontiers in Genetics. 2014;5:144. doi:10.3389/fgene.2014.00144.

Gong H, Vu G-P, Bai Y, et al. A Salmonella Small Non-Coding RNA Facilitates Bacterial Invasion and Intracellular Replication by Modulating the Expression of Virulence Factors. Monack DM, ed. PLoS Pathogens. 2011;7(9):e1002120. doi:10.1371/journal.ppat.1002120.

Sweeney B.A., Roy P., Leontis N.B. (2015) An introduction to recurrent nucleotide interactions in RNA. Wiley Interdisciplinary Reviews: RNA, 6, 17–45

Harris KA, Lünse CE, Li S, Brewer KI, Breaker RR. Biochemical analysis of pistol self-cleaving ribozymes. RNA. 2015;21(11):1852-1858. doi:10.1261/rna.052514.115.

Tjaden B, Goodwin SS, Opdyke JA, et al. Target prediction for small, noncoding RNAs in bacteria. Nucleic Acids Research. 2006;34(9):2791-2802. doi:10.1093/nar/gkl356.

Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research. 1997;25(5):955-964.

Lasleett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA gnes in nucleotide sequences. Nucleic Acids Research. 2004;32(1):11-16. doi:10.1093/nar/gkh152.

Karin Lagesen, Peter Hallin. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res (2007) 35 (9): 3100-3108 doi:22 April 2007.

Eric P. Nawrocki. Annotating Functional RNAs in Genomes Using Infernal. Methods Mol Biol. 2014;1097:163-97. doi: 10.1007/978-1-62703-709-9_9.

Sam Griffiths-Jones, Alex Bateman. Rfam: an RNA family database. RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods Volume 1097 of the series Methods in Molecular Biology pp 163-197 doi: 02 December 2013.

Palleja, A., Harrington, E. D. & Bork, P. ( 2008 ). Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions? BMC Genomics 9, 335

Galens, Kevin et al. “The IGS Standard Operating Procedure for Automated Prokaryotic Annotation.” Standards in Genomic Sciences 4.2 (2011): 244–251. PMC. Web. 28 Apr. 2017.

Personal tools