NGS Bioinformatics Workshop 2.5 Meta-Analysis of Genomic Data, May 30th, 2012, IRMACS 10900. Facilitator: Richard Bruskiewich, Adjunct Professor, MBB


1 NGS Bioinformatics Workshop 2.5 Meta-Analysis of Genomic Data. May 30th, 2012, IRMACS 10900. Facilitator: Richard Bruskiewich, Adjunct Professor, MBB. Acknowledgment: Several slides courtesy of Professor Fiona Brinkman, MBB

2 Today’s Agenda
• A brief overview of the bioinformatics for:
  – SNP detection software
  – Proteins
  – Systems biology
  – Metagenomics (some resources; very brief…)
• Group feedback: bioinformatics needs at SFU?

3 NGS-based SNP Analysis Programs From: Nielsen et al. 2011. Nature Reviews Genetics 12:443-451

4 BIOINFORMATICS OF PROTEINS NGS Bioinformatics Workshop 2.5 Meta-Analysis of Genomic Data

5 From DNA to Protein to Systems ATGGAATTC…
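The DNA-to-protein step on this slide can be illustrated with a minimal sketch. It assumes the standard genetic code and reading frame 1; the function name `translate` and the use of "X" for unrecognized codons are choices made for this example only, and only the nine bases shown on the slide are used as input.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate coding DNA (reading frame 1) into one-letter amino acid codes."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unrecognized codon
        if residue == "*":  # a stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGAATTC"))  # MEF (Met-Glu-Phe)
```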

6 Amino Acid Properties – Venn Diagram

7 Polypeptides

8 Ramachandran Plot

9 Secondary Structure (SS) Prediction
Note major assumptions in all:
• Entire information for forming ss is contained in the primary sequence
• Side groups of residues will determine structure
Pattern recognition
• Looks for patterns in common ss’s like amphipathic alpha-helices (e.g. pattern of polar and non-polar residues)
Homology
• Predict ss of the central residue of a given segment from homologous segments (neighbors)
• Based on alignments of homologous residues from a protein family
• Assumption: homologous proteins = similar structure
• Extension: Use BLOSUM to detect similarity, or, better, use a Position Specific Scoring Matrix (PSSM)

10 SS Prediction Programs
• PredictProtein-PHD (72%) – http://www.predictprotein.org/
• PREDATOR (75%) – http://www-db.embl-heidelberg.de/jss/servlet/de.embl.bk.wwwTools.GroupLeftEMBL/argos/predator/predator_info.html
• PSIpred (77%) – http://bioinf.cs.ucl.ac.uk/psipred/ (PSSM generated by PSI-BLAST, better sequence database, won CASP competition for many years)
• Jpred (81%) – http://www.compbio.dundee.ac.uk/jpred/

11 Tertiary Structure. Lactate Dehydrogenase: Mixed α/β. Immunoglobulin Fold: β. Hemoglobin B Chain: α.

12 Tertiary Structure: Protein Folds Holm, L. and Sander, C. (1996) Mapping the protein universe. Science, 273, 595-603.

13 Protein Folds
• Folds: definition difficult and different criteria used for different classification systems
  – Normally formed around a separate hydrophobic core
• Current protein fold taxonomy
  – Very roughly …
  – Approx. 1000-2000 different estimated folds, depending on method of analysis, of which about half are estimated to be known (500-1000)
  – Average domain size approx. 150 aa (50 – 250 aa approx std dev)

14 Protein Fold Major Classes
• All alpha proteins (all α)
• All beta proteins (all β)
• Alpha/beta proteins (α/β) - Parallel strands connected by helices (βαβ motifs)
• Alpha plus beta proteins (α+β) - More irregular α and β combinations
• “Other” - Often subclassified now

15 Protein Fold Classification Curated/Semi Manual Classification –SCOP (Structural Classification Of Proteins) http://scop.mrc-lmb.cam.ac.uk/scop/ –CATH (Class, Architecture, Topology, Homologous superfamily) http://www.cathdb.info/

16 SCOP classification
• Family: clear evolutionary relationship
  – Residue identities >= 30%
  – OR known similar functions and structures (example: globins form a family though some are only 15% identical)
• Superfamily: Probable common evolutionary origin
  – Low sequence identities, but structural and functional features suggest common evolutionary origin (example: actin, the ATPase domain of heat shock proteins, and hexokinase form a superfamily)
• Fold: major structural similarity
  – Same major ss in same arrangement with the same topological connections
  – May occur by convergent evolution
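The 30% residue identity criterion for a SCOP family can be made concrete with a rough sketch. It assumes the two sequences are already aligned (same length, "-" for gaps) and counts identity over columns where both sequences have a residue; other definitions of percent identity use a different denominator. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over aligned columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment (hypothetical sequences): 4 identical residues in 5 ungapped columns.
identity = percent_identity("ACDE-F", "ACDQKF")
print(identity >= 30.0)  # True: passes the family identity threshold
```

Identity alone is not the whole SCOP criterion: as the globin example shows, proteins below the threshold can still be placed in one family on functional and structural grounds.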

17 SCOP example

18 CATH example

19 Protein Fold Classification
Automated Classification
–DALI http://ekhidna.biocenter.helsinki.fi/dali
–VAST (Vector Alignment Search Tool) http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

20 DALI/FSSP – Automated classification
Exhaustive all-against-all 3D structure comparison of protein structures currently in the PDB
Domain Classification # (DC_l_m_n_p)
• l: fold space attractor region
• m: globular folding topology/fold type (clusters of structural neighbours in fold space with average pairwise Z-scores, by Dali, above 2)
• n: functional family (PSI-Blast, clusters of identically conserved functional residues, E.C. numbers, Swissprot keywords)
• p: sequence family (>25% identities)

21 VAST – Automated classification
http://www.ncbi.nlm.nih.gov/Structure/VAST/vasthelp.html
• All-against-all BLAST comparison of NCBI’s MMDB (database of known protein structures at NCBI, derived from the PDB)
• Clustered into groups by a neighbor joining procedure, using BLAST p-value cutoffs of C or less (where C=10e-7, 10e-40 or 10e-80, to reflect three different levels of redundancy)
• A fourth level of classification is based on sequence identity

22 Motif and Domain Searching
• InterPro – an integration of tools (PROSITE, PFAM, PRINTS, PRODOM)
  – http://www.ebi.ac.uk/interpro/
• Expasy Tools has more…
  – PATTINPROT, to search for patterns in proteins yourself, etc…
• But first… Check if the analysis you want to do has already been done! e.g. www.ebi.ac.uk/proteome/ and db.psort.org

23 PhyloFacts. PhyloFacts includes hidden Markov models for classification of user-submitted protein sequences to protein families across the Tree of Life. http://phylogenomics.berkeley.edu/phylofacts/

24 Subcellular Localization Prediction – Example of the benefit of integrating results with a Bayesian approach

25 Localization Prediction - methods
• Several programs analyze single features: e.g. TargetP
• Initially one program analyzed multiple features: PSORT I (eukaryotes and prokaryotes), developed in 1990

26 PSORT I prediction method: Rule based Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)

27 Compositional Analysis
• Molecular Weight
• Amino Acid Frequency
• Isoelectric Point
• UV Absorptivity
• Solubility, Size, Shape
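Two of the compositional properties listed above, amino acid frequency and molecular weight, can be computed directly from the sequence. The sketch below assumes an unmodified peptide made of the 20 standard amino acids and uses average residue masses; the function names and the example peptide are illustrative only.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid mass minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free N- and C-termini


def amino_acid_frequency(sequence):
    """Return the fraction of each residue type in the sequence."""
    sequence = sequence.upper()
    if not sequence:
        return {}
    counts = Counter(sequence)
    return {aa: n / len(sequence) for aa, n in counts.items()}


def molecular_weight(sequence):
    """Average molecular weight in daltons of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


peptide = "MEF"  # hypothetical example peptide
print(amino_acid_frequency(peptide))
print(round(molecular_weight(peptide), 2))
```

Isoelectric point and UV absorptivity follow the same pattern (a per-residue lookup summed over the sequence) but depend on the pKa and extinction coefficient tables chosen, so they are left out here.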

28 SYSTEMS BIOLOGY NGS Bioinformatics Workshop 2.5 Meta-Analysis of Genomic Data

29 Systems Biology What is systems biology? ①Considers all (or many) of the proteins and genes in the system ②Links proteins and genes using interactions and functions ③Uses computational models to study system ④Provides insights into mechanisms, system dynamics, global properties

30 Molecular Interaction (MI) Network
• Nodes = Gene / Protein
• Edge = Interaction
• Possible interactions: phosphorylation, physical binding, transcriptional regulation, others?

31 Cytoscape http://www.cytoscape.org/
Cytoscape supports many use cases in molecular and systems biology, genomics, and proteomics:
• Load molecular and genetic interaction data sets in many formats
• Project and integrate global datasets and functional annotations
• Establish powerful visual mappings across these data
• Perform advanced analysis and modeling using Cytoscape plugins
• Visualize and analyze human-curated pathway datasets such as Reactome or KEGG

32 Cytoscape [Screenshot callouts: attributes for highlighted nodes / edges; change visible attributes; network navigation; visible networks; search for nodes; control tabs: Network, VizMapper, plugin tabs]

33 Cytoscape – Loading Data
Data Files:
1. Network (Simple Interaction Format)
2. Node attributes (tab-delimited)
3. Gene expression (tab-delimited)

34 Cytoscape – Loading Data
1. Network (Simple Interaction Format)
Format: gene1 interaction_type gene2
E.g.:
C1QB pp C1R
C1R pp C2
C2 pp C4
…
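To show what Cytoscape reads from such a file, here is a minimal sketch of a Simple Interaction Format parser. It assumes whitespace-delimited fields and no spaces inside gene names (files with tab delimiters and spaces in names would need `line.split("\t")` instead); the function names are made up for this example and are not part of Cytoscape.

```python
from collections import defaultdict


def parse_sif(lines):
    """Parse SIF lines into (source, interaction_type, target) edges."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line or isolated node: no edge to record
        source, interaction, *targets = fields
        for target in targets:  # SIF allows several targets on one line
            edges.append((source, interaction, target))
    return edges


def adjacency(edges):
    """Build an undirected neighbour map from the edge list."""
    neighbours = defaultdict(set)
    for source, _interaction, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return neighbours


# The three protein-protein ("pp") interactions from the slide.
edges = parse_sif(["C1QB pp C1R", "C1R pp C2", "C2 pp C4"])
print(sorted(adjacency(edges)["C1R"]))  # ['C1QB', 'C2']
```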

35 Cytoscape – Loading Data
2. Gene Attribute (tab-delimited table)
Maps data values to nodes
• Load File
• Check off “Show Text File Import Options”
• Check off “Transfer first line as attribute names..”
• Preview

36 Cytoscape – Loading Data
3. Gene expression (tab-delimited table)
Format: gene1 exp_cond1 exp_cond2 … sig_cond1 sig_cond2 …
• Expression value: fold-change or intensity from microarray
• Significance value: P-value for the test of whether expression differs between conditions (smaller values indicate stronger evidence of a difference)

37 Cytoscape – Network Style
In “Vizmapper” tab…
• Double-click “Node color”
• Select expression fold-change values (CMexp)
• Select “Continuous Mapping” as mapping type
• Can change color by double-clicking on arrows

38 Systems Biology Analyses
1. Differentially-expressed subnetworks: jActiveModules
2. Functional enrichment: BiNGO

39 Differentially-Expressed Subnetworks
• Search for sub-networks that contain a significant number of differentially-expressed genes (nodes)
• All genes in a sub-network interact…
• SO these highly differentially-expressed sub-networks may represent a critical pathway or complex involved in a condition of interest

40 Differentially-Expressed Subnetworks
jActive algorithm:
• Searches for sub-networks that contain a significant number of differentially-expressed genes (or nodes)
• Heuristic – won’t always find the optimum result
• Z-score signifies how likely it is to find a subnetwork with a similar number of DE genes by chance (a higher Z-score means less likely)
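The subnetwork Z-score can be sketched as follows. This follows the scoring scheme commonly described for jActiveModules (each gene's p-value is converted to a z-score, the z-scores are summed and divided by the square root of the subnetwork size, then calibrated against random gene sets of the same size); it is an illustration under those assumptions, not the plugin's actual code. The search heuristic itself is not shown, and all function and parameter names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def gene_z_score(p_value):
    """Convert a p-value (strictly between 0 and 1) into a z-score."""
    return NormalDist().inv_cdf(1.0 - p_value)


def aggregate_z(p_values):
    """Aggregate z-score of a subnetwork: sum of gene z-scores / sqrt(size)."""
    z_scores = [gene_z_score(p) for p in p_values]
    return sum(z_scores) / sqrt(len(z_scores))


def calibrated_score(subnet_p_values, all_p_values, n_random=1000, seed=0):
    """Compare the subnetwork score with random gene sets of the same size."""
    rng = random.Random(seed)
    k = len(subnet_p_values)
    background = [aggregate_z(rng.sample(all_p_values, k)) for _ in range(n_random)]
    return (aggregate_z(subnet_p_values) - mean(background)) / stdev(background)


# Hypothetical p-values: a subnetwork of strongly changed genes scores higher.
print(aggregate_z([0.001, 0.002, 0.01]) > aggregate_z([0.4, 0.5, 0.6]))  # True
```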

41 jActive - Inputs
• Search from highlighted nodes
• Select expression significance (p-values)

42 jActive - Results
• Subnetworks listed here
• Highlight result and click “Create Network”

43 Functional Enrichment
Also called over-representation analysis
• Searches for common or related functions in a gene set
• Is there a common annotation (e.g. pathway, GO term) for a set of genes that is more frequent than you would expect by chance?

44 Gene Ontology
• Controlled vocabulary describing functions, processes and cell components
• Consistency between organisms and gene products
• GO terms linked by relationships (is-a, part-of) and have hierarchy (parent – child)
[Diagram: example GO hierarchy with is-a and part-of links among organelle, mitochondrion, protein complex, fatty acid beta-oxidation multienzyme complex, [other organelles] and [other protein complexes]]

45 Functional Enrichment
BiNGO:
• Looks for GO terms that are over-represented in a set of genes.
• Displays the results in two ways:
  – A table with p-values
  – A graph showing relationships between terms
• Uses the hypergeometric test to statistically test for over-representation of each GO term.
• Performs multiple hypothesis correction (since we are testing multiple GO terms for over-representation).
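The two statistical steps named above can be sketched in a few lines. The hypergeometric test asks how probable it is to see at least k annotated genes in a selected set of n, when K of the N reference genes carry the GO term. For the multiple hypothesis correction, the Benjamini-Hochberg procedure is used here as one common choice; the slide does not say which correction was applied, and the function names and numbers below are illustrative, not BiNGO's own code.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical numbers: 8 of 40 selected genes carry a term that annotates
# 100 of 6000 reference genes.
p = hypergeom_enrichment_p(k=8, n=40, K=100, N=6000)
print(p < 0.05)  # True: far more annotated genes than expected by chance
```

With these numbers the expected count is 40 × 100 / 6000, roughly 0.67 annotated genes, so observing 8 gives a very small p-value before correction.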

46 BiNGO - Inputs
• Click Start BiNGO
• Fill in Name
• Lower significance level
• Select “Custom” and then load go.annot file

47 BiNGO - Results

48 [Figure labels: General GO Terms; Specific GO Terms; Significance]

49 EGAN: Exploratory Gene Association Networks http://akt.ucsf.edu/EGAN/

50 METAGENOMICS NGS Bioinformatics Workshop 2.5 Meta-Analysis of Genomic Data

51 What is Metagenomics?
• The culture-independent isolation and characterization of DNA from uncultured microorganism communities
• Nice reading list on the topic: http://www.cbcb.umd.edu/confcour/CMSC828G-materials/reading-list.html
• See also: Torsten Thomas, Jack Gilbert and Folker Meyer. 2012. Metagenomics - a guide from sampling to data analysis. Microb. Inform. Exp. doi:10.1186/2042-5783-2-3 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351745/
• I will just mention a few relevant bioinformatics tools here (no specific endorsements implied).

52 MG-RAST server http://metagenomics.nmpdr.org/ Meyer, F. et al. 2008. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 9:386 doi:10.1186/1471-2105-9-386

53 MEGAN - MEtaGenome ANalyzer http://ab.inf.uni-tuebingen.de/software/megan/ Huson DH et al. 2007. MEGAN analysis of metagenomic data. Genome Res. 17: 377-386

