Carlo Colantuoni & Rafael Irizarry April 19, 2006 Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor.


1 Carlo Colantuoni & Rafael Irizarry April 19, 2006 Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor

2 Biological Setup
Every cell in the human body contains the entire human genome: 3.3 Gb or ~30K genes. The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes.
Tasks necessary for gene expression analysis:
-- Define what a gene is.
-- Identify genes in a sea of genomic DNA where <3% of DNA is contained in genes.
-- Design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously.
-- Cross-reference these probes.

3 Cellular Biology, Gene Expression, and Microarray Analysis DNA RNA Protein

4 Gene: Protein coding unit of genomic DNA with an mRNA intermediate. Diagram labels: Genomic DNA (3.3 Gb, ~30K genes); mRNA: 5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA; DNA Probe. Sequence is a Necessity

5 From Genomic DNA to mRNA Transcripts
Diagram labels: EXONS; INTRONS; RNA editing & SNPs; Alternative splicing; Alternative start & stop sites in same RNA molecule; ~30K; >30K; Transcript coverage; Homology to other transcripts; Hybridization dynamics; 3’ bias
Protein-coding genes are not easy to find: gene density is low, and exons are interrupted by introns.

6 Sequence Quality! Redundancy! Completeness? Unsurpassed as source of expressed sequence Chaos?!?

7 From Genomic DNA to mRNA Transcripts ~30K >30K >>30K

8 Transcript-Based Gene-Centered Information

9 Design of Gene Expression Probes
Content: UniGene, Incyte, Celera
Expressed vs. Genomic
Source: cDNA libraries, clone collections, oligos
Cross-referencing of array probes (across platforms): Sequence <> GenBank <> UniGene <> HomoloGene
Possible mis-referencing:
-- Genomic GenBank Acc.#’s
-- Referenced ID has more NT’s than probe
-- Old DB builds
-- DB or table errors: copying and pasting 30K rows in Excel …
Using RefSeq’s can help.
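The cross-referencing chain on this slide (probe sequence <> GenBank <> UniGene <> HomoloGene) can be sketched as a series of table lookups. The following is a minimal Python illustration, not a tool from the presentation: every table, identifier, and function name below is hypothetical, and real tables would be loaded from a specific database build.

```python
# Hypothetical lookup tables; every identifier here is made up for illustration.
# In practice each table would come from one specific database build.
PROBE_TO_GENBANK = {
    "probe_A": "ACC0001",
    "probe_B": "ACC0002",
    "probe_C": "ACC0003",  # accession missing from the cluster table below
}
GENBANK_TO_UNIGENE = {
    "ACC0001": "Hs.1001",
    "ACC0002": "Hs.1001",  # two accessions in one cluster (redundancy)
}
UNIGENE_TO_HOMOLOGENE = {
    "Hs.1001": "HG:501",
}


def cross_reference(probe_id):
    """Follow probe -> GenBank -> UniGene -> HomoloGene, stopping at the first gap."""
    result = {"probe": probe_id, "genbank": None, "unigene": None, "homologene": None}
    accession = PROBE_TO_GENBANK.get(probe_id)
    if accession is None:
        return result
    result["genbank"] = accession
    cluster = GENBANK_TO_UNIGENE.get(accession)
    if cluster is None:
        return result
    result["unigene"] = cluster
    result["homologene"] = UNIGENE_TO_HOMOLOGENE.get(cluster)
    return result


def unresolved(probe_ids):
    """Return the probes that could not be mapped to a UniGene cluster."""
    return [p for p in probe_ids if cross_reference(p)["unigene"] is None]
```

Listing the unresolved probes explicitly is one way to catch the mis-referencing cases named above (old builds, table errors) instead of silently dropping rows.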

10 From Genomic DNA to mRNA Transcripts

16 Functional Annotation of Lists of Genes KEGG PFAM SWISS-PROT GO DRAGON DAVID BioConductor

17 Analysis of Functional Gene Groups

24 One of the largest challenges in analyzing genomic data is associating the experimental data with the available metadata, e.g. sequence, gene annotation, chromosomal maps, literature. The annotate and AnnBuilder packages provide some tools for carrying this out. Using AnnBuilder, it is possible to build associations with specific gene lists, e.g. the hgu95a package for Affymetrix HGU95A GeneChips. The annotate package maps to: GenBank accession number, LocusLink LocusID, gene symbol, gene name, UniGene cluster, chromosome, cytoband, physical distance (bp), orientation, Gene Ontology Consortium (GO), PubMed PMID.
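The idea of associating experimental identifiers with metadata can be illustrated with a small lookup table. This Python sketch is only an analogy and does not reproduce the annotate or AnnBuilder API (those are R packages); the table contents, field names, and function names are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical annotation table keyed by probe ID; all values are placeholders,
# not real gene symbols or GO identifiers.
ANNOTATION = {
    "probe_1": {"symbol": "GENE1", "chromosome": "1", "go": ["GO_TERM_A", "GO_TERM_B"]},
    "probe_2": {"symbol": "GENE2", "chromosome": "X", "go": ["GO_TERM_B"]},
    "probe_3": {"symbol": "GENE3", "chromosome": "7", "go": []},
}


def annotate_probes(probe_ids, field):
    """Map each probe ID to one annotation field; unknown probes map to None."""
    return {p: ANNOTATION.get(p, {}).get(field) for p in probe_ids}


def group_by_go(probe_ids):
    """Group probes by annotated GO term; probes without terms are left out."""
    groups = defaultdict(list)
    for p in probe_ids:
        for term in ANNOTATION.get(p, {}).get("go", []):
            groups[term].append(p)
    return dict(groups)
```

Grouping probes by a shared annotation term, as in group_by_go, is the step that turns a per-probe table into the functional gene groups discussed on the neighbouring slides.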

25 Analysis of Functional Gene Groups

26 Functional Gene/Protein Networks DIP BIND MINT HPRD PubGene Predicted Protein Interactions

27 Analysis of Gene Networks

29 9606 is the Taxonomy ID for Homo sapiens

33 Predicted Human Protein Interactions

34 High-throughput protein interaction experiments from fly, worm, and yeast were used to predict human protein interactions. A human protein interaction is predicted if both proteins in an interaction pair from another organism have high sequence homology to human proteins. >70K Hs interactions predicted; >6K Hs genes
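The prediction rule on this slide can be sketched as follows: for each interacting pair in a model organism, every combination of human homologs of the two partners becomes a predicted human interaction. This is a minimal Python sketch under stated assumptions: the protein IDs are hypothetical, the homolog table is assumed to have been computed beforehand with some sequence-similarity threshold, and excluding self-pairs is a choice made here, not something the slide specifies.

```python
# Hypothetical model-organism interactions (pairs of protein IDs).
MODEL_INTERACTIONS = [("y1", "y2"), ("y2", "y3"), ("f1", "f2")]

# Hypothetical table: model-organism protein -> human proteins with high
# sequence homology (assumed precomputed). "y3" has no human homolog.
HUMAN_HOMOLOGS = {
    "y1": {"H1"},
    "y2": {"H2", "H3"},
    "f1": {"H1"},
    "f2": {"H2"},
}


def predict_human_interactions(interactions, homologs):
    """Transfer each model-organism interaction to all pairs of human homologs."""
    predicted = set()
    for a, b in interactions:
        for human_a in homologs.get(a, ()):
            for human_b in homologs.get(b, ()):
                if human_a != human_b:  # assumption: self-pairs are skipped
                    # Store pairs in sorted order so (H1, H2) == (H2, H1).
                    predicted.add(tuple(sorted((human_a, human_b))))
    return predicted
```

An interaction is only transferred when both partners have a human homolog, so the ("y2", "y3") pair contributes nothing, and the same human pair supported by two organisms is counted once.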

35 Analysis of Gene Networks

36 Carlo Colantuoni: Clinical Brain Disorders Branch, NIMH, NIH; Dept. Biostatistics, JHSPH. Thanks to … Rafael Irizarry, Scott Zeger, Jonathan Pevsner

38 NCBI Web Links
FTP:
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene
ftp://ftp.ncbi.nih.gov/pub/HomoloGene/

39 More Web Links
NUCLEOTIDE:
PATHWAYS and NETWORKS: ftp://ftp.genome.ad.jp/pub/kegg/ (http://www.genome.ad.jp/anonftp/) (also.com)
PROTEIN: ftp://us.expasy.org/ ftp://us.expasy.org/databases/prosite/

40 SAVAGE: Detection of More Subtle Functionally Related Groups of Gene Expression Changes

41 Differential Expression of Functional Gene Groups within One Experiment. Figure labels: EXP#1; Swiss-Prot, PFAM, KEGG; 30K, ~3K, 10K, ~40K annotations; DRAGON; SAVAGE

42 Differential Expression of a Single Functional Gene Group Across Multiple Experiments. Figure labels: EXP#1, EXP#2, EXP#3, EXP#4; BioDB; DRAGON; SAVAGE

43 Similar Differential Expression Patterns Across Multiple Experiments. The distribution of gene expression values for each gene group in each sample is plotted as a single point in low dimensional space. This is achieved using Principal Components Analysis along with Non-Metric Multi-Dimensional Scaling. Figure labels: p value 0.0, <0.1; ALL; CN; EXP#1, EXP# X

44 PING: Detection of Differential Expression in Functional Networks of Proteins

45 Interaction Networks in Gene Expression Data

46 Large Protein Interaction Network Network Regulated in Sample #1

47 Network Regulated in Sample #2 Large Protein Interaction Network

48 Network Regulated in Sample #1 Network Regulated in Sample #2 Network Regulated in Sample #3 Large Protein Interaction Network

49 Network of Interest Network Regulated in Sample #1 Network Regulated in Sample #2 Network Regulated in Sample #3 Large Protein Interaction Network PING

51 Genomic DNA Content
1. Interspersed repeats (~1/2 Hs. genome)
2. (Processed) pseudogenes
3. Simple sequence repeats
4. Segmental duplications (~5% Hs. genome)
5. Blocks of tandem repeats (can be very large)
6. Genes: Promoters - Exons - Introns (<3%)
defining what a gene is - protein coding unit of genomic DNA with an mRNA intermediate
identifying genes within genomic DNA
protein-coding genes (mRNA)
functional RNA genes - tRNA, rRNA, snoRNA, snRNA, miRNA
prokaryotes
eukaryotes

52 Gene: Protein coding unit of genomic DNA with an mRNA intermediate. Diagram labels: Genomic DNA (3.3 Gb); mRNA: 5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA; Protein

53 Gene: Protein coding unit of genomic DNA with an mRNA intermediate. Diagram labels: Genomic DNA (3.3 Gb, ~30K genes); mRNA: 5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA; Protein. Sequence is a Necessity

54 How is a gene defined in “wet” biology and in silico?
Seq. from mRNA sample
Seq. on array
Array probe design:
Source – cDNA libraries, oligos, clone collections
Content – UniGene, Celera, Incyte
Transcript coverage
Homology to other transcripts
Hybridization dynamics – hyper-multiplex hyb rxn
Empirical validation
3’ bias
Alt. splicing - known and not
Alt. start / stop site in same RNA molecule
Less important: RNA editing, SNPs
Cross-referencing of array probes: GenBank <> UniGene <> HomoloGene
Possible mis-referencing:
-- Genomic GenBank Acc.#’s
-- Referenced ID has more NT’s than probe
-- Old DB builds
-- DB or table errors

55 Finding genes in eukaryotic DNA
ORF identification – Three Letter Genetic Code (codons): 4*4*4 = 64 possible codons. It is possible to translate any stretch of genomic DNA into protein, but that doesn’t mean we have identified a protein coding gene!
There are several kinds of exons:
-- non-coding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless
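The codon arithmetic and the ORF caveat can be made concrete with a small scan. This is a minimal Python sketch, assuming the standard genetic code (ATG start; TAA, TAG, TGA stops) and scanning only the forward strand; the function name and the min_codons parameter are illustrative choices. As the slide warns, an ORF found this way is only a candidate, since introns, non-coding exons, and chance ORFs are all ignored.

```python
from itertools import product

BASES = "ACGT"
# All three-letter combinations of four bases: 4 * 4 * 4 = 64 codons.
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_orfs(seq, min_codons=3):
    """Return (start, end) spans of ATG...stop ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon. The reverse strand is not
    scanned, and min_codons counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

For example, find_orfs("CCATGAAATAGG") reports one span starting at the ATG in the third frame, while a bare "ATGTAG" is rejected by the default length filter.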

56 What We Are Going To Cover
-- Cells, Genes, Transcripts -> Genomics Experiments
-- Sequence Knowledge Behind Genomics Experiments
-- Annotation of Genes in Genomics Experiments

