Presentation on theme: "Analysis of Membrane Proteins in Metagenomics: Networks of correlated environmental features and protein families Prianka Patel, Thesis Defense Yale University."— Presentation transcript:
Analysis of Membrane Proteins in Metagenomics: Networks of correlated environmental features and protein families Prianka Patel, Thesis Defense Yale University Molecular Biophysics and Biochemistry 2.17.10
2 Projects: (1) Analysis of Membrane Protein Structures (Bowie, James, Nature, 2005; plot: Sequence Separation vs. # Coevolving pairs). (2) Metagenomics of Ocean Microbes: Co-variation with Environment (Photosynthesis)
3 What is Metagenomics? Traditional Genomics: (1) Select organism and culture; (2) Extract DNA and sequence; (3) Assemble and annotate. Estimated that less than 1% of microbes can be cultured. Metagenomics: (1) Collect sample from environment; (2) Extract DNA and sequence (reads such as atgctcgatctcg, atcgatctcgctg, atgccgatctaa); (3) Assemble and annotate (Contig 1, Contig 2, ...). Lose information about which gene belongs to which microbe.
4 Comparative Metagenomics Foerstner et al., EMBO Rep, 2005. An amino acid change in Proteorhodopsin proteins is linked to abundant wavelengths in the sample of origin. GC content is shaped by environment. Very different environments: whale bone associated, ocean, acid mine, soil. Samples: Sargasso Sea 2, Sargasso Sea 4, Sargasso Sea 3, Whale 1 (bone), Whale 2 (bone), Whale 1 (microbial mat), Acid Mine Drainage, Minnesota farm soil (legend: = Average)
5 Comparative Metagenomics Dinsdale et al., Nature 2008: There are microbial pathways that discriminate between categorically different environments (variant, invariant). Gianoulis et al., PNAS 2009: There are microbial pathways that discriminate between similar environments (Photosynthesis)
6 Motivation Variation in membrane proteins across different environments may give insight into the adaptations that allow microbes to survive in specific habitats. Membrane proteins interact with the environment: transporting available nutrients, sensing environmental signals, and responding to changes. Engelman et al., Nature, 2005
7 Sorcerer II Global Ocean Survey Sorcerer II journey August 2003- January 2006 Sample approximately every 200 miles Rusch, et al., PLOS Biology 2007
8 Sorcerer II Global Ocean Survey Metagenomic Sequence 0.1–0.8 μm size fraction (bacteria) 6.3 billion base pairs (7.7 million reads) Reads were assembled and genes annotated Metadata GPS coordinates, Sample Depth, Water Depth, Salinity, Temperature, Chlorophyll Content The majority of samples are from open ocean, with a few estuaries and lakes Each site has its own metadata Assembly was done over all locations, but can be mapped back to a particular site Rusch, et al., PLOS Biology 2007
9 Extracting environmental data using GPS Coordinates GOS sample metadata: Sample Depth: 1 meter; Water Depth: 32 meters; Chlorophyll: 4.0 ug/kg; Salinity: 31 psu; Temperature: 11 C; Location: 41°5'28"N, 71°36'8"W. GPS coordinates allow us to extract information from other sources: * World Ocean Atlas * National Center for Ecological Analysis and Synthesis
10 World Ocean Atlas 2005 NOAA (National Oceanic and Atmospheric Administration) and NODC (National Oceanographic Data Center) Nutrient Features Extracted: Phosphate Silicate Nitrate Apparent Oxygen Utilization Dissolved Oxygen * Cumulative annual data at the ocean surface * Resolution is 1 degree latitude/longitude... no simple geometric shape matches the Earth Annual Phosphate [umol/l] at the surface
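A minimal sketch of how a GPS position could be mapped onto a 1-degree latitude/longitude grid like the one described on this slide. The grid layout (row 0 at 90°S, column 0 at 180°W) and the function names are assumptions made for illustration; they are not taken from the World Ocean Atlas file format or from the thesis.

```python
import re

def dms_to_decimal(dms: str) -> float:
    """Convert a coordinate such as 41°5'28"N to signed decimal degrees."""
    m = re.fullmatch(r"""(\d+)°(\d+)'(\d+(?:\.\d+)?)"([NSEW])""", dms.strip())
    if m is None:
        raise ValueError(f"unrecognized coordinate: {dms!r}")
    deg, minutes, seconds, hemi = m.groups()
    value = int(deg) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative by convention.
    return -value if hemi in "SW" else value

def grid_cell(lat: float, lon: float) -> tuple:
    """Return (row, col) of the 1-degree cell containing the point.

    Assumed layout: 180 rows x 360 columns, row 0 starts at 90°S,
    column 0 starts at 180°W.
    """
    row = min(int(lat + 90), 179)
    col = min(int(lon + 180), 359)
    return row, col

# Sample location shown on the preceding slide.
lat = dms_to_decimal('41°5\'28"N')
lon = dms_to_decimal('71°36\'8"W')
row, col = grid_cell(lat, lon)
print(round(lat, 4), round(lon, 4), row, col)
```

With a gridded nutrient field loaded as a 180 x 360 array (hypothetical name `phosphate`), the value for the sample would then be `phosphate[row][col]`.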
11 National Center for Ecological Analysis and Synthesis (NCEAS) Anthropogenic Features Extracted: Ultraviolet radiation, Shipping, Pollution, Climate Change, Ocean Acidification. Halpern et al. (2008), Science * Resolution is 1 km square * Value of an activity at a particular location is determined by the type of ecosystem present: Impact = ∑ Features * Ecosystem * impact weight (map panels: Shipping, Climate Change)
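The impact formula on this slide can be sketched as a double sum over features (human activities) and ecosystems. The driver names, ecosystem names, and weights below are invented for illustration only; they are not values from the NCEAS data set.

```python
def cumulative_impact(drivers, ecosystems, weights):
    """Impact = sum over features and ecosystems of
    feature intensity * ecosystem presence * impact weight."""
    total = 0.0
    for d_name, intensity in drivers.items():
        for e_name, present in ecosystems.items():
            # Missing (feature, ecosystem) pairs are treated as weight 0 (assumption).
            total += intensity * present * weights.get((d_name, e_name), 0.0)
    return total

# Hypothetical values for a single 1 km cell.
drivers = {"shipping": 0.5, "uv": 0.2}                # normalized intensities
ecosystems = {"surface_pelagic": 1, "coral_reef": 0}  # 1 = present, 0 = absent
weights = {
    ("shipping", "surface_pelagic"): 2.0,
    ("uv", "surface_pelagic"): 1.0,
    ("shipping", "coral_reef"): 3.0,
}
score = cumulative_impact(drivers, ecosystems, weights)
print(score)
```

Because the ecosystem term acts as a presence indicator here, the same activity contributes nothing in a cell where the affected ecosystem is absent.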
12 Predicting membrane proteins in GOS data - TMHMM (Transmembrane Hidden Markov Model): finds hydrophobic stretches of amino acids - COG (Clusters of Orthologous Groups): orthologous groups of protein families. Pipeline: Metagenomic Reads → Protein Clusters (GOS) → TMHMM Filtering → Membrane Protein Clusters → Mapping to COG → Family 1, Family 2, ... * 151 Families
13 Predicting membrane proteins in GOS data 22% of unique proteins in membrane protein clusters map to COG
14 What is the Relationship? Correlation of Sites based on environmental features or protein families Discriminative Partition Matching Canonical Correlation Analysis/Protein Features and Environmental Features Network Environmental Features Membrane Protein Families ?
15 How Similar are the Sites to each other? (similarity scale: 1 to 0)
16 Species Distribution The 16S rRNA gene is a component of the small prokaryotic ribosomal subunit Bacteria with 16S rRNA gene sequences more similar than 97% are considered the same ‘species’ 10,025 16S genes found and classified 20% level, “phylum” Biers et al. App. Env. Microbiology, 2009
17 Method: For each site, we correlated its membrane protein family (MPF) frequency profile distances with its environmental feature (EF) profile distances and with its 16S profile distances. This suggests that the observed membrane protein variation is more a function of the measured environmental features than of phylogenetic diversity.
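One reading of this method, as a sketch: compute site-to-site distances separately for each profile type, then, for every site, correlate its row of distances in one profile space with its row in another. Euclidean distance, Pearson correlation, and the random toy matrices are all assumptions; the slide does not state which distance or correlation measure was used.

```python
import numpy as np

def pairwise_distances(profiles):
    """Euclidean distance between every pair of sites (rows)."""
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def per_site_distance_correlation(a, b):
    """For each site, Pearson correlation between its distances to all
    other sites in profile space `a` and in profile space `b`."""
    da, db = pairwise_distances(a), pairwise_distances(b)
    n = da.shape[0]
    out = np.empty(n)
    for i in range(n):
        others = np.arange(n) != i  # drop the zero self-distance
        out[i] = np.corrcoef(da[i, others], db[i, others])[0, 1]
    return out

# Toy data: 10 sites, 15 environmental features, 151 families, 8 taxa.
rng = np.random.default_rng(0)
env = rng.normal(size=(10, 15))
mpf = env @ rng.normal(size=(15, 151)) + 0.1 * rng.normal(size=(10, 151))
rrna = rng.normal(size=(10, 8))

mpf_vs_env = per_site_distance_correlation(mpf, env)
mpf_vs_16s = per_site_distance_correlation(mpf, rrna)
print(mpf_vs_env.mean(), mpf_vs_16s.mean())
```

Comparing the two sets of per-site correlations is what would support a statement that membrane protein variation tracks environment more closely than 16S composition.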
18 Discriminative Partition Matching Sites cluster into three distinct groups, and the groups are geographically separated. Which membrane protein families are discriminating between these clusters? We can partition the membrane protein family matrix by these site groupings, and then look for significantly different distributions of protein families between the clusters.
19 Discriminative Partition Matching First, we performed PCA on the membrane protein families matrix and grouped the first component scores by the environmental clustering. This revealed that the Mid-Atlantic and Pacific were more similar to each other in terms of membrane protein content, and these sites were grouped. Which families are discriminating between these two site-sets? (T-test)
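The two steps on this slide can be sketched as follows: project sites onto the first principal component of the family matrix, then compare each family between two site-sets with a t statistic. The toy data, the group sizes, and the choice of Welch's unequal-variance form are assumptions; the slide only says "T-test".

```python
import numpy as np

def first_pc_scores(matrix):
    """Scores of each site (row) on the first principal component."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    vx = x.var(ddof=1) / len(x)
    vy = y.var(ddof=1) / len(y)
    return (x.mean() - y.mean()) / np.sqrt(vx + vy)

# Toy data: 6 sites per site-set, 5 families; family 0 is enriched in set A.
rng = np.random.default_rng(0)
n_per_group, n_families = 6, 5
group_a = rng.normal(size=(n_per_group, n_families))
group_b = rng.normal(size=(n_per_group, n_families))
group_a[:, 0] += 10.0
families = np.vstack([group_a, group_b])

scores = first_pc_scores(families)
t_stats = np.array(
    [welch_t(group_a[:, j], group_b[:, j]) for j in range(n_families)]
)
print(scores[:n_per_group].mean(), scores[n_per_group:].mean())
print(np.argmax(np.abs(t_stats)))
```

A p-value for each family would come from the t distribution with the matching degrees of freedom, which a statistics library can supply; it is left out here to keep the sketch dependent on NumPy only.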
20 DPM results 30 families showed significant differences (p-value<0.01) between the site sets. Most were enriched in the North Atlantic (28/30). Higher pollution, chlorophyll, and possibly higher nutrients and cell abundance in the North Atlantic; microbes’ need to expel antimicrobials, by-products of metabolism, or environmental toxins; buffer against shifts in ocean solute concentrations, again alluding to the increased pollutants and possibly nutrient fluxes from land and rivers; chlorophyll content; stabilization of DNA and RNA; exchanges ATP for ADP in mitochondria and obligate intracellular parasites, may be nucleotide/H+ transporters
21 Simultaneous Correlations of Environmental Features and Membrane Proteins: Canonical Correlation Analysis We have addressed this question by: 1. Comparing site similarity based on these two sets of features 2. Finding particular discriminating families between environmental groupings. But we don’t know what particular features are associated with each other, and we know that they are all likely interdependent. Environmental Features (Salinity, Pollution, Temp) ? Membrane Protein Families (Family 1, Family 2, Family 5)
22 Canonical Correlation Analysis - CCA allows us to take advantage of the continuity of the features and observe which features are invariant or variant, and the type (positive, negative) of relationship between them. - We correlate all the variables, protein families and environmental features, simultaneously. - We have two sets of variables, X_1 ... X_15 (environmental features) and Y_1 ... Y_151 (membrane protein families). We are looking for two vectors, a and b (a set of weights for X and Y), such that the correlation between the weighted combinations of X and Y is maximized.
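The formula shown on this slide did not survive in the transcript. The standard textbook form of the CCA objective, which matches the description above, is given here as a stand-in; the covariance symbols are notation introduced for this note, with \Sigma_{XX} and \Sigma_{YY} the within-set covariance matrices and \Sigma_{XY} the cross-covariance matrix.

```latex
\rho = \max_{a,\,b}\ \operatorname{corr}\!\left(a^{\top}X,\ b^{\top}Y\right)
     = \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
       {\sqrt{a^{\top}\Sigma_{XX}\,a}\ \sqrt{b^{\top}\Sigma_{YY}\,b}}
```

Here a has one weight per environmental feature (15 entries) and b one weight per membrane protein family (151 entries).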
23 CCA results We are defining a change of basis of the cross-covariance matrix. We want the correlations between the projections of the variables, X and Y, onto the basis vectors to be mutually maximized. Eigenvalues: squared canonical correlations. Eigenvectors: normalized canonical correlation basis vectors. This plot shows the correlations in the first and second dimensions (legend: Environment, Family; Correlation = 0.3, Correlation = 1). Correlation Circle: the closer the point is to the outer circle, the higher the correlation. Variables projected in the same direction are correlated.
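A small sketch of the eigenvalue view of CCA described on this slide, using NumPy on toy data. It is a generic textbook construction, not the code used in the thesis; the tiny ridge term and the toy dimensions (3 and 4 variables, 200 samples) are assumptions. With 15 and 151 variables and few sites, the covariance matrices would be singular and some form of regularization would be needed.

```python
import numpy as np

def cca(x, y, ridge=1e-8):
    """Canonical correlations and weight vectors from the eigenproblem
    Sxx^-1 Sxy Syy^-1 Syx a = rho^2 a."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    sxx = x.T @ x / (n - 1) + ridge * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + ridge * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)
    m = np.linalg.solve(sxx, sxy) @ np.linalg.solve(syy, sxy.T)
    eigvals, eigvecs = np.linalg.eig(m)
    order = np.argsort(-eigvals.real)
    # Eigenvalues are the squared canonical correlations.
    rho = np.sqrt(np.clip(eigvals.real[order], 0.0, 1.0))
    a = eigvecs.real[:, order]             # weights for X, one column per dimension
    b = np.linalg.solve(syy, sxy.T) @ a    # matching weights for Y (up to scale)
    return rho, a, b

# Toy data: both variable sets share one latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
x = latent @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(200, 3))
y = latent @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(200, 4))

rho, a, b = cca(x, y)
first_pair_corr = np.corrcoef(x @ a[:, 0], y @ b[:, 0])[0, 1]
print(rho, first_pair_corr)
```

The correlation between the first pair of projections equals the first canonical correlation, which is the quantity the correlation circle is built from: each original variable is then plotted by its correlation with the first and second canonical variates.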
24 CCA results [Correlation circle, Dimension 1 vs. Dimension 2, with variant and invariant regions marked. Environmental features plotted: Water depth, Acidity, App. O2 util., Salinity, Pollution, Climate change, Shipping, Phosphate, Nitrate, Silicate, Temperature, Chlorophyll, Dissolved O2, UV, Sample Depth] 107 variant membrane protein families; 44 invariant membrane protein families. Difficult to see the strength and directionality of a relationship. Weights of the features are difficult to visualize and compare. There is no means of quantifying the variation between sets of features.
25 Protein Families and Environmental Features Network (PEN) Distance: Dot product between 1st and 2nd Dimension of CCA
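One possible reading of this distance, sketched below: every environmental feature and every protein family has coordinates in the first two CCA dimensions, and the edge weight between a feature and a family is the dot product of those two-dimensional vectors. The coordinates, the family names, and the 0.3 cutoff are hypothetical and chosen only to illustrate the idea.

```python
def pen_edges(env_coords, fam_coords, threshold=0.3):
    """Connect each environmental feature to each protein family whose
    2-D dot product has absolute value >= threshold (assumed cutoff)."""
    edges = []
    for env, (ex, ey) in env_coords.items():
        for fam, (fx, fy) in fam_coords.items():
            weight = ex * fx + ey * fy
            if abs(weight) >= threshold:
                edges.append((env, fam, weight))
    return edges

# Hypothetical coordinates in CCA dimensions 1 and 2.
env_coords = {"Phosphate": (0.8, 0.1), "Pollution": (-0.2, 0.7)}
fam_coords = {
    "family_A": (0.7, 0.2),
    "family_B": (0.1, -0.8),
    "family_C": (0.05, 0.05),
}

edges = pen_edges(env_coords, fam_coords)
for env, fam, weight in edges:
    sign = "positive" if weight > 0 else "negative"
    print(f"{env} -- {fam}: {weight:.2f} ({sign})")
```

Under this reading, the sign of the weight gives the direction of the relationship and its magnitude the strength, and variables near the center of the correlation circle (such as `family_C` here) drop out of the network.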
26 Protein Families and Environmental Features Network (PEN) “Bi-modules”: groups of environmental features and membrane protein families that are associated. UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the network. COG0598, Magnesium Transporter; COG1176, Polyamine Transporter
27 Bi-module 1: Phosphate/Phosphate Transporters Low Phosphate: high affinity phosphate transporters, which are induced during phosphate limitation. High Phosphate: low affinity inorganic phosphate ion transporters, which are constitutively expressed.
28 Microbes modulate content in response to phosphate Martiny et al. Env Microbiology, 2009 Van Mooy et al. Nature, 2009 Phosphate Concentration related to phosphate acquisition genes in Prochlorococcus Microbes modulate phospholipid content in response to phosphate concentrations
29 Bi-module 2: Iron Transporters/Pollution/Shipping Negative relationship between areas of high ocean-based pollution and shipping and transporters involved in the uptake of iron Pollution and Shipping may be a proxy for iron concentrations
30 Bi-module 2: Iron Transporters/Pollution/Shipping Ridgwell A. J. (2002) Phil. Trans. R. Soc. Lond. Iron is usually limiting in oceans: High Nitrate-Nutrient/Low Chlorophyll regions. Delivery of iron to the ocean is usually by: - terrestrial input - fluvial (rivers) input - upwelling from the ocean floor - aeolian dust from land
31 Bi-module 2: Iron Transporters/Pollution/Shipping [Plots: Pollution and Dust; N/C and Iron Transporters] - Negative correlation between COG4558 and COG0609 and dust/pollution values (p-value <0.01) - Searching the BRENDA database for enzymes using iron as a cofactor reveals that an increase in these two COGs is negatively correlated with the amount of enzymes present that require iron.
32 Conclusions New method (PEN) to visualize complex relationships in metagenomic data using explicit environmental variables. We show both known and intuitive relationships between features and genomic content. CCA also reveals the invariant fraction of environmental features and protein families (highlights important cellular processes): Chloride Channel, Type II secretion Proteins (virulence). Many variant ABC-type transporters (34/41): suggests streamlining for optimization and energy conservation.
33 Much of Membrane Protein Space Remains Uncharacterized 15% of predicted membrane proteins had NO homology to GenBank (e-value<1e-10). We used short motifs (PROSITE) to characterize a small fraction of these, including ABC Transporters, GPCRs, Lipocalins, beta-lactamases. 16% (29,384) were annotated.
34 Intraribotype diversity and the definition of a ‘species’ (Eugene V Koonin, Nat Biotechnology, 2007) 16S analysis of GOS data reveals that most sequences fall into 5 ribotypes. However, there were very few identical sequences, suggesting that no two cells have identical genome sequences. This suggests that ocean microbes are rather adaptive to their environments. We observe diversity in membrane protein content and abundance, and show that it is a reflection of different environmental conditions more than phylogenetic diversity (16S). These are mostly oligotrophic (nutrient poor) waters, and environmental conditions have likely been fairly constant over many years; genomes are “streamlining”.
35 Conclusions Integration of Environmental Features using GPS coordinates Environmental clusters show differences in membrane protein content which reflect environmental conditions (pollution/efflux proteins) Microbes from ocean surface samples show diversity in membrane protein content Diversity in membrane proteins was shown to be a reflection of different environmental conditions more than phylogenetic diversity Developed (PEN) and adapted techniques to connect features of environment to specific protein families Genotypic variation within similar natural populations occurs in response to environmental conditions Integration of geospatial data can highlight unexpected trends as anthropogenic factors seem to be reflected in microbial function
36 Advisors: Donald Engelman and Mark Gerstein Acknowledgements Committee Members: Jim Bowie (UCLA) Annette Molinaro Lynne Regan Mike Snyder Administrative Staff: Mary Backer Ann Nicotra Nessie Stewart Collaborators Gerstein Lab: Tara Gianoulis Kevin Yip Rob Bjornson Nicolas Carriero Philip Kim Jan Korbel Sam Flores Engelman Lab: Damien Thevenin Julia Rogers Past and Present members of Engelman and Gerstein Labs Yale University Biomedical High Performance Computing Facility NIH grant RR19895 which funded the instrumentation Yale Map Collection: Stacey Maples