Faction I Functional Annotation Group

From Compgenomics2017





Function Annotation

Function annotation is the assignment of biological function to a gene or set of genes. This information can include protein interactions, regulation of gene expression, biological function, or biochemical function.


The objective is to annotate the 24 Salmonella enterica serovar Heidelberg isolates from the 2013 outbreak, using the gene coordinates provided by the gene prediction group.


There are two types of tools used to annotate genomes: homology-based (extrinsic) tools and ab initio (intrinsic) tools. Homology-based tools predict the function of a gene by comparing its sequence against a database; the database used depends on the function being annotated. Ab initio tools base their predictions on inherent sequence features. Most of the ab initio tools used in this section are based on hidden Markov models; SignalP, which relies on artificial neural networks, is an exception.





SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.

  • Usage
Command: signalp -t gram- -f short $infile > $outfile 
  • Output
 name | Cmax | pos | Ymax | pos | Smax | pos | Smean | D | ? | Dmaxcut | Networks-used 

A "Y" or "N" in the "?" column indicates whether the sequence is predicted to contain a signal peptide. The predicted cleavage site is given by the "pos" column that follows "Ymax".
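The short output described above is easy to post-process. The following is a minimal sketch, assuming whitespace-separated columns in the order listed above and comment lines starting with "#". The function name, the record type and the example rows are made up for illustration and are not part of SignalP or of the project pipeline.

```python
from typing import Iterable, List, NamedTuple


class SignalPHit(NamedTuple):
    name: str
    cleavage_pos: int         # the "pos" column that follows Ymax
    d_score: float            # the "D" column
    has_signal_peptide: bool  # True when the "?" column is "Y"


def parse_signalp_short(lines: Iterable[str]) -> List[SignalPHit]:
    """Parse SignalP short-format lines into records (hypothetical helper)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if len(fields) < 12:
            continue  # not a complete result row
        hits.append(SignalPHit(
            name=fields[0],
            cleavage_pos=int(fields[4]),
            d_score=float(fields[8]),
            has_signal_peptide=(fields[9] == "Y"),
        ))
    return hits


# Made-up example rows that follow the column layout shown above.
example = [
    "# name  Cmax  pos  Ymax  pos  Smax  pos  Smean  D  ?  Dmaxcut  Networks-used",
    "gene_0001  0.812  23  0.790  23  0.953  12  0.842  0.810  Y  0.570  SignalP-noTM",
    "gene_0002  0.110  30  0.105  11  0.128  1  0.098  0.102  N  0.570  SignalP-noTM",
]
secreted = [hit.name for hit in parse_signalp_short(example) if hit.has_signal_peptide]
print(secreted)
```

Keeping only the rows flagged "Y" gives a candidate list of secreted proteins that can be compared with the LipoP and Phobius predictions.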


LipoP predicts lipoproteins and discriminates between lipoprotein signal peptides, other signal peptides and N-terminal membrane helices in Gram-negative bacteria.

  • Usage
Command: LipoP -short $infile > $outfile 


Phobius is a combined transmembrane protein topology and signal peptide predictor. The predictor is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states. Compared to TMHMM and SignalP, Phobius substantially reduces the errors that come from cross-prediction between transmembrane segments and signal peptides.

  • Usage
Command: phobius -short $infile > $outfile 


TMHMM predicts the presence of transmembrane helices in amino acid sequences using a hidden Markov model. This tool was developed at the Center for Biological Sequence Analysis at the Technical University of Denmark.

Biological Background

The identification of potential transmembrane helices is important as this information can help to determine if a given sequence encodes a transmembrane protein. Transmembrane proteins have a high biological significance as they often play a crucial role in maintaining cell homeostasis and mediating a cell's interaction with its environment.

Technical Background

The hidden Markov model on which this tool is based works by modeling the structural "grammar" of the experimentally verified transmembrane helices present in its training data set. The high sensitivity and specificity values reported for its original test set are known to drop when the program encounters signal peptides.

TMHMM can be run via the online webserver at http://www.cbs.dtu.dk/services/TMHMM/ or by downloading and installing the executable available for request at that site.

In this project we ran TMHMM as part of InterProScan, but it can also be run as a standalone tool.

An example of one of TMHMM's optional visual outputs indicating the probability of each classification type at each position of the inputted amino acid sequence


InterProScan is a Java based tool that leverages JMS technology to interact with wrappers for various bioinformatics programs that are used to detect protein features.

In this project we ran InterProScan with the following databases and tools (List and descriptions generated from InterProScan output):

  • TIGRFAM: Identifies protein families using hidden Markov models
  • SignalP (GRAM_NEGATIVE): Predicts the location of signal peptide cleavage sites in Gram-negative prokaryotic amino acid sequences
  • SUPERFAMILY: A database containing structural and functional annotation for all proteins and genomes
  • Panther (Protein Analysis Through Evolutionary Relationships): Classifies proteins by function using evolutionary relationships
  • Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
  • Hamap (201701.18) : High-quality Automated and Manual Annotation of Microbial Proteomes
  • Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
  • ProSiteProfiles (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
  • SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
  • CDD (3.14) : Prediction of CDD domains in Proteins
  • PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterize a protein family
  • PIRSF (3.01) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
  • ProSitePatterns (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
  • Pfam (30.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  • SignalP_EUK (4.1) : SignalP (organism type eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes.
  • ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
  • MobiDBLite (1.0) : Prediction of disordered regions in proteins
  • TMHMM: Predicts the location of transmembrane helices in amino acid sequences

We ran InterProScan with the following command:

 interproscan.sh -iprlookup -goterms -pa -f GFF3 -d $targetDir -i $file;
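Because the command above requests GFF3 output with GO terms, the GO annotations can be collected per protein from the ninth (attributes) column. The sketch below assumes that GO terms appear in an "Ontology_term" attribute and that the file may end with a "##FASTA" section. The function name and the example records are illustrative only and are not taken from the project results.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

GO_PATTERN = re.compile(r"GO:\d{7}")


def go_terms_by_sequence(gff_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect GO terms per sequence ID from GFF3 lines (hypothetical helper)."""
    terms = defaultdict(set)
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence data
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        seq_id, attributes = cols[0], cols[8]
        for attr in attributes.split(";"):
            key, _, value = attr.partition("=")
            if key == "Ontology_term":
                terms[seq_id].update(GO_PATTERN.findall(value))
    return {seq_id: sorted(gos) for seq_id, gos in terms.items()}


# Made-up records that mimic the assumed attribute layout.
example_gff = [
    "##gff-version 3",
    "\t".join(["gene_0001", "Pfam", "protein_match", "10", "150", "1.2E-20", "+", ".",
               'Name=PF00005;Ontology_term="GO:0005524","GO:0016887";Dbxref="InterPro:IPR003439"']),
    "\t".join(["gene_0002", "TMHMM", "protein_match", "5", "27", ".", "+", ".",
               "Name=TMhelix"]),
    "##FASTA",
    ">gene_0001",
]
go_annotations = go_terms_by_sequence(example_gff)
print(go_annotations)
```

Sequences without any GO-annotated match are simply absent from the returned dictionary.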


CRISPRs are segments of prokaryotic DNA containing short, repetitive sequences. CRISPR stands for Clustered Regularly Interspaced Short Palindromic Repeats. They are found in nearly 40% of all bacterial species. They play a role in biological functions such as host cell defence, DNA rearrangement, replication and regulation, and they serve as tools for evolutionary studies and strain typing. PILER-CR is a tool designed specifically to identify and classify CRISPR repeats. It performs local alignment and constructs piles of CRISPR repeats; identification depends on CRISPR array recognition criteria.

Commandline: ./pilercr -in <input_file.fasta> -out <output_file> 

PILER-CR was used to predict CRISPR arrays in the reads. The results were validated by running BLAST against CRISPRdb.

Results for OB0001: PILER-CR predicted 3 putative CRISPR arrays, 2 of which were confirmed by BLAST against CRISPRdb. No CRISPR array was predicted in the plasmid reads.


DOOR2 is an operon database covering 2072 bacterial genomes. The algorithm is a data-mining classifier trained on data from E. coli and B. subtilis. Features include intergenic distance, neighborhood conservation, phylogenetic distance, information from short DNA motifs, similarity score between GO terms of gene pairs, and length ratio between a pair of genes. DOOR2 does not offer a command-line tool, so the analysis was implemented through BLAST.

DOOR2 workflow

  • Blast Usage:
Command: makeblastdb -in [reference] -dbtype prot

Command: blastp -db [reference] -query $infile -evalue 1e-10 -perc_identity 80 -outfmt "6 stitle qseqid sseqid qcovs pident evalue" > $outfile 


The Virulence Factor Database (VFDB) was created in 2004 and is one of the most comprehensive virulence factor databases. It provides in-depth coverage of the major virulence factors of the best-characterized bacterial pathogens, with an emphasis on structural and functional biology.

This database contains information about virulence factors from more than thirty bacterial pathogens, including virulence-associated genes, protein structural features and functions.

The virulence factor database is divided into two datasets:

- Dataset A: Contains genes associated with experimentally verified virulence factors.

- Dataset B: Contains genes related to known and predicted virulence factors.

For this project, Dataset A with the verified virulence factors was used. Since this tool does not have a command-line option, a BLAST search was performed to annotate the Salmonella serovar Heidelberg isolates:

 `makeblastdb -in /data/home/cmt6/tools/vfdb-master/data/VFDB_setA_nt.fas -dbtype nucl`
 `blastn -query queryname -db /data/home/cmt6/tools/vfdb-master/data/VFDB_setA_nt.fas -outfmt "6 qseqid qstart qend qlen length qcovs pident evalue stitle" -max_hsps 2 -max_target_seqs 2`

After the BLAST search, the quality of the hits was assessed by their percentage of identical matches and their coverage (both parameters had to be greater than or equal to 90%). After the hits were filtered, the following results were obtained:

[Figure: Function Annotation.png]
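The identity and coverage filter applied to the VFDB hits can be sketched as follows. This is a minimal sketch that assumes the tab-separated column order given in the blastn -outfmt string above and treats "qcovs" as the coverage value; the function name and the example hits are made up for illustration and are not the project's actual script or results.

```python
import csv
import io
from typing import Dict, List, TextIO

# Column order taken from the -outfmt string of the blastn command.
COLUMNS = ["qseqid", "qstart", "qend", "qlen", "length",
           "qcovs", "pident", "evalue", "stitle"]


def filter_vfdb_hits(handle: TextIO,
                     min_identity: float = 90.0,
                     min_coverage: float = 90.0) -> List[Dict[str, str]]:
    """Keep hits whose identity and query coverage both reach the thresholds."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = []
    for row in reader:
        if (float(row["pident"]) >= min_identity
                and float(row["qcovs"]) >= min_coverage):
            kept.append(row)
    return kept


# Made-up hits: only the first one passes both thresholds.
example_hits = io.StringIO(
    "gene_0001\t1\t900\t900\t900\t100\t99.5\t0.0\texample virulence factor A\n"
    "gene_0002\t1\t300\t900\t300\t33\t98.0\t1e-50\texample virulence factor B\n"
    "gene_0003\t1\t880\t900\t880\t98\t85.2\t1e-90\texample virulence factor C\n"
)
vfdb_kept = filter_vfdb_hits(example_hits)
print([hit["qseqid"] for hit in vfdb_kept])
```

Both thresholds are inclusive, matching the "greater than or equal to 90%" criterion.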


[Figure: Computational Genomics.png]

We used the Comprehensive Antibiotic Resistance Database (CARD) to detect antibiotic resistance genes in our samples. CARD contains curated entries, each associated with one or more terms in a defined antibiotic resistance ontology. We used the Resistance Gene Identifier (RGI) software developed by CARD in our pipeline. RGI employs two models, the 'protein homolog model' and the 'protein variant model', and three algorithms ('Perfect', 'Strict' and 'Loose') with predefined cutoff parameters to identify antibiotic resistance genes in the input. For our analysis, we used both models but considered hits only from the Perfect and Strict algorithms.

We used amino acid sequences as input. Sequences found by RGI with more than 90% identity to the target hit were reported. The JSON output was parsed with a custom Python script.
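The custom script is not reproduced here. A minimal sketch of the selection step might look like the following, assuming that the RGI JSON report maps each query ID to a dictionary of hits and that each hit carries "type_match", "perc_identity" and "ARO_name" fields. The field names, function name and example data are assumptions for illustration and should be checked against the RGI version in use.

```python
import json
from typing import Any, Dict, List, Tuple


def select_rgi_hits(report: Dict[str, Any],
                    min_identity: float = 90.0,
                    accepted: Tuple[str, ...] = ("Perfect", "Strict"),
                    ) -> List[Tuple[str, str, str, float]]:
    """Return (query, gene name, algorithm, identity) for accepted hits."""
    selected = []
    for query_id, hits in report.items():
        if not isinstance(hits, dict):
            continue  # skip any non-hit entries at the top level
        for hit in hits.values():
            if not isinstance(hit, dict):
                continue
            identity = float(hit.get("perc_identity", 0.0))
            # Keep Perfect/Strict hits with strictly more than 90% identity.
            if hit.get("type_match") in accepted and identity > min_identity:
                selected.append((query_id, hit.get("ARO_name", ""),
                                 hit["type_match"], identity))
    return selected


# Made-up report in the assumed layout; gene names are placeholders.
example_json = """
{
  "gene_0001": {
    "hsp_1": {"type_match": "Strict", "perc_identity": 99.2,
              "ARO_name": "example_gene_A", "model_type": "protein homolog model"}
  },
  "gene_0002": {
    "hsp_1": {"type_match": "Loose", "perc_identity": 97.0,
              "ARO_name": "example_gene_B", "model_type": "protein homolog model"}
  },
  "gene_0003": {
    "hsp_1": {"type_match": "Strict", "perc_identity": 72.4,
              "ARO_name": "example_gene_C", "model_type": "protein variant model"}
  }
}
"""
rgi_selected = select_rgi_hits(json.loads(example_json))
print(rgi_selected)
```

Only the first example hit is kept: the second is a Loose hit and the third falls below the identity cutoff.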


  • Lihong Chen, Dandan Zheng, Bo Liu, Jian Yang, Qi Jin; VFDB 2016: hierarchical and refined dataset for big data analysis—10 years on. Nucleic Acids Res 2016; 44 (D1): D694-D697. doi: 10.1093/nar/gkv1239
  • Chen, Lihong et al. “VFDB: A Reference Database for Bacterial Virulence Factors.” Nucleic Acids Research 33.Database Issue (2005): D325–D328. PMC. Web. 7 Mar. 2017
  • Jian Yang, Lihong Chen, Lilian Sun, Jun Yu, Qi Jin; VFDB 2008 release: an enhanced web-based resource for comparative pathogenomics. Nucleic Acids Res 2008; 36 (suppl_1): D539-D542. doi: 10.1093/nar/gkm951
  • Chen, Lihong et al. “VFDB 2012 Update: Toward the Genetic Diversity and Molecular Evolution of Bacterial Virulence Factors.” Nucleic Acids Research 40.Database issue (2012): D641–D645. PMC. Web. 7 Mar. 2017.
  • Juncker, Agnieszka S. et al. “Prediction of Lipoprotein Signal Peptides in Gram-Negative Bacteria.” Protein Science : A Publication of the Protein Society 12.8 (2003): 1652–1662. Print.
  • Charles Bland, Teresa L Ramsey, Fareedah Sabree. “CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats” BMC Bioinformatics. 2007; 8: 209
  • Robert C Edgar “PILER-CR: Fast and accurate identification of CRISPR repeats” BMC Bioinformatics 2007, 8:18. DOI: 10.1186/1471-2105-8-18
  • Nikki Shariat et al “CRISPR-MVLST subtyping of Salmonella enterica subsp. enterica serovars Typhimurium and Heidelberg and application in identifying outbreak isolates” BMC Microbiology 2013, 13:254. DOI: 10.1186/1471-2180-13-254
  • Xie, C. et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Research 39, W316–W322 (2011).
  • Wu, J., Mao, X., Cai, T., Luo, J., Wei, L. KOBAS server: a web-based platform for automated annotation and pathway identification. Nucleic Acids Res 34, W720–W724 (2006).
  • Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
  • Caspi R., Billington R., Ferrer L., Foerster H., Fulcher C.A., Keseler I.M., Kothari A., Krummenacker M., Latendresse M., Mueller L.A., Ong Q., Paley S., Subhraveti P., Weaver D.S., Karp P.D. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research 44(1):D471-80 (2015).
  • Lukas Käll, Anders Krogh and Erik L. L. Sonnhammer. A Combined Transmembrane Topology and Signal Peptide Prediction Method. Journal of Molecular Biology, 338(5):1027-1036, May 2004.
  • Reynolds, Sheila M. et al. “Transmembrane Topology and Signal Peptide Prediction Using Dynamic Bayesian Networks.” PLOS Computational Biology 4.11 (2008): e1000213. PLoS Journals. Web.
  • Remmert, Michael et al. “HHblits: Lightning-Fast Iterative Protein Sequence Searching by HMM-HMM Alignment.” Nature Methods 9.2 (2012): 173–175. www.nature.com. Web.
  • Krogh, A. et al. “Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes.” Journal of Molecular Biology 305.3 (2001): 567–580. PubMed. Web.
  • “Just_Annotate_My_proteins (JAMp).” N.p., n.d. Web. 8 Mar. 2017.
  • Finn, Robert et al. “InterPro in 2017 - beyond protein families and domain annotations.” Nucleic Acids Res. (2017): D190-D199
  • Jones, Philip et al. “InterProScan 5: genome-scale protein function classification.” Bioinformatics (2014): 1236-1240
  • A. Conesa, S. Götz, J. M. Garcia-Gomez, J. Terol, M. Talon and M. Robles. "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research", Bioinformatics, Vol. 21, September, 2005, pp. 3674-3676.