Faction II Gene Prediction Group
All results available in /data/home/xzhou328/GenePrediction_Final [biogenome2017b.biology.gatech.edu]
We received assembly data of 26 isolates of 'Salmonella Heidelberg sporadic isolates from 2012-2014.
Gene prediction is the process of identifying the regions of genomic DNA that encode genes, which includes protein-coding genes and RNA genes. Gene prediction is simpler in Prokaryotes than in Eukaryotes for three main reasons. One, Prokaryotic genomes are gene-rich with their genomes comprised with at least 90% coding regions. Second, they do not have introns. Third, Prokaryotic genes are usually arranged into operons with a known operator, promoter, start site, coding region, and then a stop codon. But one of the biggest problems with prokaryotic gene prediction is determining which of two or more overlapping ORFs (Open reading frame) are true genes.
There are two main methods of gene prediction, Ab initio and homology based method. Ab inition gene prediction is a method for predicting genes based on intrinsic factors such as start and stop sites. It looks for promoter sequences and other functional parameters. Homology gene prediction is a method for predicting genes based on sequence similarity between the sequence in question and a database.
A typical gene prediction pipeline has two major goals: 1) Predicting the coding regions of a genome 2) Predicting the non-coding regions of a genome
Markov model is a stochastic model where future state is dependent on current state, but independent of any previous states. Therefore, it is known to have a property of memorylessness. Most of Gene Predicting programs use algorithm like Hidden Markov Models (HMMs). HMM is where the system being modeled is a Markov process with "hidden" states. And the hidden state determines emission probabilities. In some other Gene Predicting programs, Interpolated Markov Models (IMMs) are used. This algorithm is used since it is ideal to use the highest-order Markov model for a better prediction. For a better prediction, as moving up to higher order models, the estimation of the number of probabilities increases exponentially. Therefore, for DNA sequence, 4k+1 probabilities in a kth-order Markov model needs to be learned.
Glimmer is a gene-prediction software that is used for finding genes in bacteria, archaea, and viruses. Glimmer is an ab-initio prediction software and is in it's third iteration, Glimmer3. Glimmer3 uses an interpolated context model (ICM) approach which is an improvement to the interpolated Markov models used by previous versions. Using the ICM approach, Glimmer3 is much improved at correctly identifying the "true" start codon of a coding sequence. When tested on the previously annotated bacterial genomes, Glimmer3 showd a stat-site prediction accuracy for 3'5' matches of 99.5%. (Salzberg et. al., 2007)
The programs of the GeneMark family are ab-initio gene finders, meaning that they make no explicit use of information about DNA or proteins outside the sequence being studied. The previously developed GeneMark program that was made available by Dr. Borodovsky’s lab at Georgia Tech identified genes mainly as the ORF (open reading frame) where the gene resides. GeneMark uses a Bayesian formalism to calculate the a posterior probability of the presence of the genetic code in short DNA sequence fragments, and is thus considered to be a local approach (Besemer 2001). Improvements have been made to the original GeneMark program to handle the re-occuring problem with accurately predicting the 5’ end of the gene. These improvements have been channeled into two separate programs, GeneMark.hmm and GeneMarkS, which will be discussed in detail below.
Using an HMM framework that follows the logic of the genetic structure of the bacterial genome, the GeneMark.hmm program uses transitions between hidden Markov states to generate stretches of DNA sequence with coding or non-coding statistical patterns (Lukashin 1998). It incorporates the Viterbi algorithm, which is a dynamic programming algorithm for finding the most likely sequence of hidden states. Genes are divided into Typical or Atypical classes based on whether they exhibit codon usage patterns specific to the majority of genes in the given species.
The architecture of the hidden Markov model used in the GeneMark.hmm algorithm is shown below. In order to properly handle both forward and reverse DNA strands, as was done in the original GeneMark program, nine hidden states correspond to the functional units of bacterial genomes were defined. The models of Typical and Atypical genes were derived from the sets of protein-coding DNA sequence obtained by clusterization of the whole set of genes from the genome of a given species.
Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm. The hidden states of the model are represented as ovals in the figure, and arrows correspond to allowed transitions between the states. (Lukashin 1998).
To further improve the prediction of the translation start position as compared to the original GeneMark program, GeneMark.hmm incorporates ribosome binding site (RBS) information when predicting gene locations. For a predicted gene, the RBS is searched in the interval from 19 to 4 base pairs upstream to each alternative start codons located between the position of start codon, suggested by the Viterbi algorithm, and the position of start codon producing the longest (ORF) for the predicted gene.
The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, which is built into the prediction procedure.
The step-wise diagram of sequence data processing and model training is shown below. In the first step, the parameters of the heuristic Markov models are determined according to GeneMark.hmm procedures. Transition and initial probability parameters of heuristic models can be estimated from a sample DNA sequence as short as 400 base pairs. The Markov model of coding and non-coding regions derived at all subsequent steps of the regular cycle are called pseudonative models, since they were derived from real DNA sequences classified in-silico as coding and non-coding regions. The pseudonative models capture the species-specific oligonucleotide frequency patterns more closely than heuristic models. Iterations are repeated until the sequence parse are 99% identical to that of the previous iteration.
Figure 2. Step-by-step diagram of the GeneMarkS procedure. (Besemer 2001).
When compared directly to the Glimmer program using a B. subtilis test set, GeneMarkS yielded a more accurate output of predicted genes and thus may be a superior tool for ab-initio gene prediction.
Prodigal is short for Prokaryotic Dynamic Programming Gene-finding Algorithm, which is a prokaryotic gene recognition and translation initiation site identification software.Since other gene prediction methods works with accuracy dropping considerably in high GC-content genomes, Prodigal is introduced. It utilize dynamic programming with likelihood-function to get coding scores, therefore to predict genes based on these scores.
Fiure3:Illustration of the dynamic programming connections in Prodigal
The red arrows represent gene connections, and the black arrows represent intergenic connections.Each end of a gene is a node, and connections between these nodes represent either genes or the space between genes. The more complicated connections indicate overlapping genes. This algorithm is quite simple, following the rules of KISS-keep it simple and stupid and gives three modes, normal, anonymous and training mode, of gene predictions according to your commands.
FGenesB is an ab initio, Markov chain-based algorithm. FgenesB integrates model-based gene prediction with homology-based annotation, accompanied by operon, promoter and terminator prediction in bacterial sequences. FgenesB also finds tRNA and rRNA genes.
1. FgenesB pipeline main steps
1) Sequence-based gene prediction
2) Homology-based annotation
- Predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences
- Usage : rnammer [-S kingdom] [-m molecules] [-xml xml-file] [-gff gff-file] [-d] [-p] [-h hmmreport] [-f fasta-file] [sequence]
- RNAmmer consists of two components: A core Perl program, 'core-rnammer', and a wrapper, 'rnammer'.
- The wrapper sets up the search by writing one or more temporary configuration(s).
This is an example of RNAmmer gff file output.
The GFF file
The GFF file contains the following information
The name of the fasta entry in which the rRNA feature was predicted. The input can contain multiple entries.
The version of the RNAmmer prediction tool.
Start position of the rRNA feature
End position of the rRNA feature
The HMM alignment score of the calibrated full length model within the window defined by the spotter model
Direction (strand) of the rRNA feature
N/A (always '.')
rRNA subunit type (5/8s,16/18s,23/28s rRNA) E-value and score of the HMM alignment of the calibrated full length model within the window defined by the spotter model.
1) tRNAscan 1.3
- Identifies ∼97.5% of true tRNA genes and gives 0.37 false positives per million base pairs (Mbp).
- The algorithm uses a hierarchical, rule-based system in which each potential tRNA must exceed empirically determined similarity thresholds for two intragenic promoters, plus have the ability to form base pairings present in tRNA stem–loop structures.
2) Pavesi Algorithm
- The Pavesi algorithm identifies tRNAs not detected by tRNAscan 1.3, and vice versa.
- The combined sensitivities of these two programs exceed 99%; however, the combined false positive rate is about five times that of tRNAscan alone.
3) Covariance models
- Covariance models are able to capture both primary consensus and secondary structure information through the use of stochastic context-free grammars (SCFGs;
- RNA covariance models have the advantages of high sensitivity and high specificity.
- It is prohibitively CPU intensive.
- A tRNA covariance model identifies >99.98% of true tRNAs.
tRNAscan-SE combines three tRNA search methods to attain the specificity of covariance model analysis with the speed and sensitivities of optimized versions of tRNAscan 1.3 and the Pavesi search algorithm. tRNAscan-SE detects 99–100% of true tRNAs, giving fewer than one false positive per fifteen billion nucleotides of random sequence, at ∼1000–3000 times the speed of searching with tRNA covariance models.
Rfam 12.0 and Infernal 1.1
The Rfam builds library of covariance models.
In conjunction with Infernal Software gives the Non-coding RNA’s by Homology search.
Gives the output in different formats(Tabular , standard cmscan output).
cmscan -Z 9.502884 --cut_ga --rfam --nohmmonly --tblout results/SP0021.tblout --fmt 2 Rfam.cm Final_Assemblies/SP0021_N.fasta > results/SP0021.cmscan
Given below is the output in a Tabular form.
Where to start
Python script: startstop.py
input: two BED files output: number of matched start, stop, both, at least one match
So we use the result of Prodigal as the template and final results will based on Prodigal due to its correctness, we can trust its start and stop position more than any other standards.
What to merge
Compare the differences between each two tools using : bedtools intersect -a **.bed -b **.bed -f ** -r -u
So we just leave out Glimmer and using Prodial as template and try to merge it with both FGenesB and GeneMark to see the quality of the results merged or not.
How to merge
So our final results will be based on Prodigal, using 0.8 as the overlap threshold to merge Prodigal with FGenesB and GeneMark using : [bedtools intersect -a *.bed -b *.bed -f 0.8 -r -u]
Then using Rfam, tRNAscan and RNAmmer, same parameter as above to delete the overlapped start and stop sites from the merged results.
Finally, reflect the final merged BED file to the original Prodigal output GFF format file to generate the GenePrediction results.
All results available in /data/home/xzhou328/GenePrediction_Final [biogenome2017b.biology.gatech.edu]
Script dealing with original input fasta files using Prodigal, FGenesB and GeneMark.
input: a list of all input fasta files' names output: a brunch of BED files generated from above three tools
Prodigal: command used
prodigal –i input.fasta –o output –f gff(output format)
> Besemer, J. "GeneMarkS: A Self-training Method for Prediction of Gene Starts in Microbial Genomes. Implications for Finding Sequence Motifs in Regulatory Regions." Nucleic Acids Research 29.12 (2001): 2607-618. Web.
> Lukashin, Alexander V., and Mark Borodovsky. "GeneMark.hmm: New Solutions for Gene Finding." GeneMark.hmm: New Solutions for Gene Finding | Nucleic Acids Research | Oxford Academic. Oxford University Press, 01 Feb. 1998. Web. 17 Feb. 2017.
> Allen JE, Pertea M, Salzberg SL. Computational Gene Prediction Using Multiple Sources of Evidence. Genome Research. 2004;14(1):142-148. doi:10.1101/gr.1562804.
> Sallet, Erika, Jérôme Gouzy, and Thomas Schiex. "EuGene-PP: a next generation automated annotation pipeline for prokaryotic genomes." Bioinformatics (2014): btu366.
> Sallet, Erika, et al. "Next-generation annotation of prokaryotic genomes with EuGene-P: application to Sinorhizobium meliloti 2011." DNA research 20.4 (2013): 339-354.
> Hyatt, D., Chen, G., LoCascio, P., Land, M., Larimer, F. and Hauser, L. "Prodigal: prokaryotic gene recognition and translation initiation site identification." 1st ed. BioMed Central Ltd (2010).
> Lowe, T. M., & Eddy, S. R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research, 25(5), 955–964.
> Panwar B, Arora A, Raghava GP. Prediction and classification of ncRNAs using structural information. BMC Genomics. 2014 Feb 13;15:127. doi: 10.1186/1471-2164-15-127.
> Salzber S., et al. "Identifying bacterial genes and endosymbiont DNA with Glimmer." Bioinformatics 23:6, 673-679. 2007