Faction II Functional Annotation Group

From Compgenomics2017

Members: Khushbu Patel, Karan Kapuria, Angela Mo, Harrison Kim, David Lu, Christian Colon, Nolan English, Bowen Yang, Cong Gao

Introduction

Functional Annotation

Functional Annotation can be defined as the part of genome analysis that is customarily performed before a genome sequence is deposited in GenBank and described in a published paper.

Objectives

  1. Annotate Salmonella Heidelberg genomes using the information obtained by the genome prediction group.
  2. Merge the results from all tools and produce a single GFF file for each sporadic isolate.
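
The merge step in objective 2 could be sketched as follows. This is a minimal illustration and not the group's actual merge script: the function name `merge_gff` is hypothetical, and it assumes every tool's output has already been converted to tab-separated, nine-column GFF feature lines.

```python
from typing import Iterable, List


def merge_gff(texts: Iterable[str]) -> str:
    """Merge several GFF documents into one, sorted by sequence ID and position."""
    features: List[List[str]] = []
    for text in texts:
        for line in text.splitlines():
            # Skip blank lines and the header/comment lines written by each tool.
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Keep only well-formed feature lines (exactly nine columns);
            # this also drops any embedded FASTA section.
            if len(cols) == 9:
                features.append(cols)
    # Order features by sequence ID, then start, then end coordinate.
    features.sort(key=lambda c: (c[0], int(c[3]), int(c[4])))
    body = "\n".join("\t".join(c) for c in features)
    return "##gff-version 3\n" + body + ("\n" if body else "")
```

Each element of `texts` would be the contents of one tool's GFF output for the same isolate; the returned string can be written to a single file per isolate.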

Background

Pipeline

Screen Shot 2017-04-10 at 1.14.01 AM.png

Tools

LipoP 1.0

LipoP 1.0 predicts lipoproteins and discriminates between lipoprotein signal peptides, other signal peptides, and N-terminal membrane helices in Gram-negative bacteria. This hidden Markov model (HMM) based tool is able to predict 96.8% of the lipoproteins correctly with only 0.3% false positives in a set of SPaseI-cleaved, cytoplasmic, and transmembrane proteins.

Command Used: cat <input.fasta> | lipop_decode ~/LipoP1.0a/LipoP1.0.mod -SignalCutoff -3 > <output.gff>

Command Used: LipoPformat <output.gff> > <output.txt>

TMHMM

TMHMM is a membrane protein topology prediction method based on a hidden Markov model. It predicts transmembrane helices and discriminates between soluble and membrane proteins with a high degree of accuracy. Users can submit as many as 4000 protein sequences in FASTA format at a time. It recognizes the 20 amino acids and B, Z, and X, which are all treated equally as unknown. Accuracy drops, however, when signal peptides are present, because their hydrophobic region can be mistaken for a transmembrane helix; the effect is less pronounced for Gram-negative bacteria.

Command used: tmhmm --short /path_to/<input.fasta> > <output.txt>
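
The `--short` output can be parsed with a few lines of Python. The sketch below assumes the usual one-line-per-protein format, where the sequence ID is followed by tab-separated `key=value` fields such as `len`, `ExpAA`, `First60`, `PredHel`, and `Topology`. The function names are hypothetical.

```python
from typing import Dict


def parse_tmhmm_short(line: str) -> Dict[str, str]:
    """Parse one line of `tmhmm --short` output into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = {"id": fields[0]}
    for field in fields[1:]:
        # Each remaining field looks like "PredHel=2".
        key, _, value = field.partition("=")
        record[key] = value
    return record


def is_membrane_protein(record: Dict[str, str], min_helices: int = 1) -> bool:
    """Treat a protein as membrane-bound if it has at least `min_helices` predicted helices."""
    return int(record.get("PredHel", "0")) >= min_helices
```

The `min_helices` threshold is an illustrative choice; raising it is one way to reduce false positives caused by signal peptides.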


SignalP

SignalP detects signal peptides using neural networks trained on either a transmembrane-containing dataset or a non-transmembrane-containing dataset. It can be used on eukaryotic, Gram-negative, and Gram-positive organisms.

Command used: signalp -t <organism type> -f <format> [input.faa] > output.file

InterPro-Scan

InterProScan is software that integrates predictive information about protein function from different protein databases and scans the input protein (or nucleotide) sequence to locate predictive signatures. The following analyses are available:

TIGRFAM: Protein families based on HMMs

SFLD: Protein families based on HMMs

ProDom: Protein domain families generated from the UniProt Knowledge Database.

Hamap: High-quality Automated and Manual Annotation of Microbial Proteomes

SMART: SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs)

CDD: Prediction of CDD domains in proteins

ProSiteProfiles: PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

ProSitePatterns: PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

SUPERFAMILY: SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.

PRINTS: A fingerprint is a group of conserved motifs used to characterise a protein family

PANTHER: The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.

Gene3D: Structural assignment for whole genes and genomes using the CATH domain structure database

PIRSF: The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Pfam: A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)

Coils: Prediction of Coiled Coil Regions in Proteins

MobiDBLite: Prediction of disordered regions in proteins

 interproscan.sh -goterms -pa -f GFF3 -b <result directory> -i <input faa sequence> -T <temp directory>
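
To pull GO terms back out of the resulting GFF3 file, something like the following could be used. It is a sketch that assumes the GO terms appear in the ninth column under the standard GFF3 `Ontology_term` attribute as a comma-separated, optionally quoted list. That layout is an assumption and should be checked against the actual InterProScan output. The function names are hypothetical.

```python
from typing import Dict, List


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column ("key=value;key=value") into a dictionary."""
    attrs: Dict[str, str] = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs


def go_terms(gff_line: str) -> List[str]:
    """Return the GO terms attached to one GFF3 feature line (empty list if none)."""
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9:  # header, comment, or FASTA line
        return []
    raw = parse_attributes(cols[8]).get("Ontology_term", "")
    # Terms are comma-separated and may be wrapped in double quotes.
    return [term.strip().strip('"') for term in raw.split(",") if term.strip()]
```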
 

PilerCR

PilerCR is designed to identify the characteristic signature of CRISPR repeats by finding a chain of local alignments in which the repeats and spacers fall within the expected length ranges and show sequence conservation.

Usage: ./pilercr -in <fasta file> -out <output file>

Antibiotic Resistance Database

The Antibiotic Resistance Database (ARDB) is a comprehensive database of antibiotic resistance factors. To annotate, makeblastdb was first used to build a searchable BLAST database from the ARDB sequences, which was then searched with blastp.

Command Used: makeblastdb -in [database of choice] -dbtype prot

Command Used: blastp -db [database used in the previous command] -query [sample faa file] -outfmt "6 qseqid sseqid qstart qend evalue pident qcovs stitle" -out <output.txt>

We obtained the putative genes from the complete BLAST output file by filtering for hits with e-values <= 1e-10, percent identity >= 90, and query coverage >= 90. For contig regions where multiple hits were found, the resistance gene with the lowest e-value, highest percent identity, and highest query coverage was assigned as the annotation.
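
The threshold filter can be expressed as a short Python function. This is a minimal sketch, not the script the group used. It assumes the tabular output has eight tab-separated columns in the order qseqid, sseqid, qstart, qend, evalue, pident, qcovs, stitle, and `filter_hits` is a hypothetical name.

```python
from typing import Dict, Iterable, Iterator

# Assumed column order of the tabular BLAST output.
COLUMNS = ["qseqid", "sseqid", "qstart", "qend", "evalue", "pident", "qcovs", "stitle"]


def filter_hits(
    lines: Iterable[str],
    max_evalue: float = 1e-10,
    min_pident: float = 90.0,
    min_qcovs: float = 90.0,
) -> Iterator[Dict[str, str]]:
    """Yield BLAST hits that pass the e-value, identity, and coverage thresholds."""
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        if (
            float(hit["evalue"]) <= max_evalue
            and float(hit["pident"]) >= min_pident
            and float(hit["qcovs"]) >= min_qcovs
        ):
            yield hit
```

Passing an open file handle for `<output.txt>` as `lines` streams the hits without loading the whole file into memory.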

Virulence Factor Database

Virulence factors were obtained from: http://www.mgc.ac.cn/VFs/

The Virulence Factor Database (VFDB) is a comprehensive database that has curated information on virulence factors since 2004 and has been updated regularly since then. It contains information such as the structural features of the virulence factors, and the functions and mechanisms that pathogens use to circumvent host defense mechanisms and cause disease. We downloaded the full dataset, which covers all genes related to known and predicted virulence factors, as opposed to the core dataset, which contains only the genes associated with experimentally verified virulence factors.

To annotate, makeblastdb was first used to build a searchable BLAST database from the VFDB sequences, which was then searched with blastp.

Command Used: makeblastdb -in [database of choice] -dbtype prot

Command Used: blastp -db [database used in the previous command] -query [sample faa file] -outfmt "6 qseqid sseqid qstart qend evalue pident qcovs stitle" -out <output.txt>

We obtained the putative genes from the complete BLAST output file by filtering for hits with e-values <= 1e-10, percent identity >= 90, and query coverage >= 90. For contig regions where multiple hits were found, the virulence factor with the lowest e-value, highest percent identity, and highest query coverage was assigned as the annotation.
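
Choosing a single annotation when several hits compete might look like the sketch below. Two simplifications are assumptions of this example and are not stated in the text: a "contig region" is approximated by the query ID, and the three criteria are applied in the order listed (e-value first, then percent identity, then query coverage). `best_hit_per_query` is a hypothetical name.

```python
from typing import Dict, Iterable, Tuple

# (qseqid, sseqid, evalue, pident, qcovs)
Hit = Tuple[str, str, float, float, float]


def best_hit_per_query(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep one hit per query: lowest e-value, then highest identity, then highest coverage."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        qseqid, _, evalue, pident, qcovs = hit
        # Negating identity and coverage lets one ascending comparison rank all three.
        rank = (evalue, -pident, -qcovs)
        current = best.get(qseqid)
        if current is None or rank < (current[2], -current[3], -current[4]):
            best[qseqid] = hit
    return best
```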

DOOR2

Operons were obtained from the DOOR2 database (http://csbl.bmb.uga.edu/DOOR/displayspecies.php). Salmonella Heidelberg was used as the search query. Plasmid NC_017624(P) and chromosome NC_017623(C) from Salmonella enterica subsp. enterica serovar Heidelberg str. B182 were downloaded, with 4568 genes and 863 operons noted. Chromosome NC_011083(C) and plasmids NC_011081(P) and NC_011082(P) from Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 were also downloaded, with 4780 genes and 929 operons noted. The operon tables were downloaded and merged into one .opr file. After conversion to FASTA files, the following commands were used for BLAST:

Command Used: makeblastdb -in [database of choice] -dbtype prot

Command Used: blastp -db [database used in the previous command] -query [sample faa file] -outfmt "6 qseqid sseqid qstart qend evalue pident qcovs stitle" -out <output.txt>

We obtained the putative genes from the complete BLAST output file by filtering for hits with e-values <= 1e-10, percent identity >= 90, and query coverage >= 90.

Results

Overall Prediction Coverage

Screen Shot 2017-04-11 at 9.38.28 PM.png

TMHMM and Signal Peptide Predictions

Screen Shot 2017-04-10 at 1.15.51 AM.png

Virulence Factors

Screen Shot 2017-04-11 at 1.04.43 PM.png


Note: The gene counts shown here include only genes found on the chromosome. Samples 9, 13, 14, 17, and 24 each had one virulence factor gene on a plasmid, which is not displayed above.


Antibiotic Resistance Genes

ARDB.png

CRISPR

Screen Shot 2017-04-10 at 1.18.49 AM.png

Pathway Annotations

Screen Shot 2017-04-11 at 9.40.28 PM.png

References

E. L. L. Sonnhammer, G. von Heijne, and A. Krogh (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. In J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen, editors, Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology, pages 175-182, Menlo Park, CA. AAAI Press.

A. S. Juncker, H. Willenbrock, G. von Heijne, H. Nielsen, S. Brunak, and A. Krogh (2003) Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci., 12(8):1652-1662.
