Human Genome
About 90,000 human proteins; the number of genes was initially assumed to be near that number (initial estimates: 153,000)
The 1,000-cell roundworm Caenorhabditis elegans has 19,500 genes; corn has 40,000 genes
Current estimates are 25,000 or fewer human genes
Alternative splicing allows different tissue types to perform different functions with the same assortment of genes
Implications
75% of human genes are subject to alternative splicing
Faulty gene splicing leads to cancer and congenital diseases
Gene therapy can make use of splicing
Application
We talked before about apoptosis, which is triggered when the cell determines that it cannot be repaired
Bcl-x, a regulator of apoptosis, is alternatively spliced to produce either Bcl-x(L), which suppresses apoptosis, or Bcl-x(S), which promotes it
Spliceosome
Five snRNA molecules (U1, U2, U4, U5, U6) combine with as many as 150 proteins to form the spliceosome
It recognizes the sites where introns begin and end
–Cuts introns out of the pre-mRNA
–Joins the exons
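The cut-and-join step can be illustrated with a toy string model. This is only a sketch: the sequence and the intron coordinates are invented for illustration, and the real spliceosome finds splice sites from sequence signals rather than from supplied coordinates.

```python
def splice(pre_mrna, introns):
    """Remove introns (0-based, end-exclusive intervals) and join the remaining exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # last exon
    return "".join(exons)

# Hypothetical pre-mRNA: exons in upper case, one intron in lower case.
# The toy intron starts with "gu" and ends with "ag", as most introns do.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUCUAA"
mature = splice(pre_mrna, [(6, 18)])
print(mature)
```

The function `splice` and its interval convention are assumptions made for this example, not part of any library.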
Spliceosome
The 5’ splice site is at the beginning of the intron; the 3’ splice site is at the end
The average human protein-coding gene is 28,000 nucleotides long, with 8.8 exons separated by 7.8 introns
Exons are about 120 nucleotides long, while introns range from 100 to 100,000 nucleotides
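A quick back-of-the-envelope check of these averages shows how little of a typical gene is exon. The calculation below uses only the figures quoted on this slide; treating them as exact averages is a simplifying assumption.

```python
gene_length = 28_000     # nucleotides, average protein-coding gene
exons_per_gene = 8.8
introns_per_gene = 7.8
exon_length = 120        # nucleotides, typical exon

exonic = exons_per_gene * exon_length      # about 1,056 nucleotides
intronic = gene_length - exonic            # everything else is intron
exon_fraction = exonic / gene_length       # share of the gene that is exon
mean_intron = intronic / introns_per_gene  # implied average intron length

print(f"exonic nucleotides: {exonic:.0f}")
print(f"exon fraction of gene: {exon_fraction:.1%}")
print(f"implied mean intron length: {mean_intron:.0f}")
```

Under these numbers, roughly 4% of an average gene is exon and the implied mean intron is about 3,500 nucleotides long.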
Splicing errors
Familial dysautonomia results from a single-nucleotide mutation that causes a gene to be alternatively spliced in nervous system tissue
The resulting decrease in the IKBKAP protein leads to abnormal nervous system development (half of patients die before age 30)
More than 15% of the gene mutations that cause genetic diseases and cancers act by causing splicing errors
Why splicing
Each gene generates about 3 alternatively spliced mRNAs
Why so much intron (only 1-2% of the genome is exons)?
Mouse and human differences are almost all in splicing
Half of the human genome is made up of transposable elements, Alus being the most abundant (1.4 million copies)
–They continue to multiply and insert themselves into the genome at a rate of one insertion per 100 human births
Mutations in an Alu can create a 5’ or 3’ splice site within an intron, causing part of the intron to be treated as an exon
Such a mutation does not affect existing exons
It only has an effect when the new exon is alternatively spliced in
Ideal Microarray Readings
Isoform 1: Exon 1, Exon 2, Exon 4, Exon 5
Isoform 2: Exon 1, Exon 3, Exon 5
Probe types: Constitutive, Exon Junction, Unique (“Cassette”)
[Figure: expression level for each of probes a, b, c, d, e]
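One way to think about "ideal" readings: a probe's signal is the summed abundance of every isoform that contains the probe's target. The sketch below applies that idea to the two isoforms on this slide. The probe names, the abundances, and the assumption of perfectly additive, noise-free signal are all hypothetical and do not correspond to the probes a to e in the figure.

```python
# Exon composition of the two isoforms shown on the slide.
isoforms = {
    "isoform_1": [1, 2, 4, 5],
    "isoform_2": [1, 3, 5],
}

def junctions(exons):
    """Exon-exon junctions present in a transcript, as (upstream, downstream) pairs."""
    return set(zip(exons, exons[1:]))

def ideal_reading(probe, abundance):
    """Sum the abundance of every isoform that contains the probe's target."""
    kind, target = probe
    total = 0.0
    for name, exons in isoforms.items():
        if kind == "exon":
            hit = target in exons
        else:  # junction probe
            hit = target in junctions(exons)
        if hit:
            total += abundance[name]
    return total

# Hypothetical abundances and probes, for illustration only.
abundance = {"isoform_1": 3.0, "isoform_2": 1.0}
probes = {
    "constitutive_exon_1": ("exon", 1),
    "cassette_exon_3": ("exon", 3),
    "junction_2_4": ("junction", (2, 4)),
    "junction_1_3": ("junction", (1, 3)),
}
readings = {name: ideal_reading(p, abundance) for name, p in probes.items()}
print(readings)
```

A constitutive probe reports the total of both isoforms, while cassette and junction probes report only the isoform that contains them, which is what makes isoform-specific expression recoverable.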
Motivation
Why alternatively splice? How does it affect the resulting proteins?
Look at domains:
–High-level summary of a protein
–~80% of eukaryotic proteins are multi-domain
–Domains are big relative to an exon
Some Previous Work
Signatures of domain shuffling in the human genome. Kaessmann, 2002.
–Intron phase symmetry around domain boundaries
The Effects of Alternative Splicing on Transmembrane Proteins in the Mouse Genome. Cline, 2004.
–Half of the TM proteins studied are affected by alt-splicing
Method
Predict alternative splicing
Predict protein domains
Look for effects of alt-splicing on predicted domains
–“Swapping”
–“Knockout”
–“Clipping”
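The slide does not define the three effect types, so the sketch below rests on assumed working definitions: "knockout" means the spliced-out region covers the whole domain, and "clipping" means it removes only part of it. "Swapping" is left out because it would require comparing the domains encoded by the alternative exons, which interval overlap alone cannot show. The function name and the coordinates are hypothetical.

```python
def domain_effect(domain, removed):
    """Classify how a removed protein region affects a predicted domain.

    Both arguments are (start, end) residue intervals, end-exclusive.
    The category definitions are assumptions made for this example.
    """
    d_start, d_end = domain
    r_start, r_end = removed
    if r_end <= d_start or d_end <= r_start:
        return "unaffected"   # no overlap at all
    if r_start <= d_start and d_end <= r_end:
        return "knockout"     # whole domain is removed
    return "clipping"         # only part of the domain is removed

domain = (50, 150)
for removed in [(0, 40), (40, 160), (100, 200)]:
    print(removed, domain_effect(domain, removed))
```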
Microarray Design
Genes based on mRNA and EST data in mouse
Mapped to the Feb. 2002 mouse genome freeze
~500,000 probes (~66,000 probe sets)
~100,000 transcripts
~13,000 gene models
Technical work
[Diagram: provided data (transcripts and probes in genome space) are overlapped with gene models to generate a probe to transcript mapping]
E@NM_021320 cc-chr10-000017.82.0
G6836022@J911445 cc-chr10-000017.91.1
G6807921@J911524_RC cc-chr10-000018.4.0
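The overlap step can be pictured as an interval test in genome coordinates. The sketch below is one plausible way to do it, not necessarily the procedure used for this array. All identifiers and coordinates are hypothetical, and a real probe set of this size would call for an indexed lookup rather than the all-pairs scan shown here.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) end-exclusive intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def map_probes_to_transcripts(probes, transcripts):
    """For each probe, list the transcripts with at least one exon overlapping it."""
    mapping = {}
    for probe_id, probe_loc in probes.items():
        mapping[probe_id] = sorted(
            t_id
            for t_id, exons in transcripts.items()
            if any(overlaps(probe_loc, exon) for exon in exons)
        )
    return mapping

# Hypothetical probes and transcript exons on one chromosome.
probes = {
    "probe_A": ("chr10", 1000, 1025),
    "probe_B": ("chr10", 5000, 5025),
    "probe_C": ("chr10", 9000, 9025),
}
transcripts = {
    "transcript_X": [("chr10", 900, 1100), ("chr10", 4900, 5100)],
    "transcript_Y": [("chr10", 900, 1100)],
}
mapping = map_probes_to_transcripts(probes, transcripts)
print(mapping)
```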
Predicting Alternative Splicing
Using mouse alt-splicing microarrays
Data from Manny Ares
–8 tissues
–3 replicates of each tissue
Predicting Alternative Splicing
General approach: clustering, then anti-clustering
[Figure: 107 clusters, with detail view]
Gene Expression Measurement
mRNA expression represents the dynamic aspects of the cell
mRNA expression can be measured with the latest technology
mRNA is isolated and labeled with a fluorescent protein
The labeled mRNA is hybridized to the target; the level of hybridization corresponds to the light emission, which is measured with a laser
Gene Expression Microarrays
The main types of gene expression microarrays:
–Short oligonucleotide arrays (Affymetrix)
–cDNA or spotted arrays (Brown/Botstein)
–Long oligonucleotide arrays (Agilent Inkjet)
–Fiber-optic arrays
–...
Affymetrix Microarrays
[Figure: raw image of the chip; scale labels 50 µm and 1.28 cm]
~10^7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM)
Raw gene expression is the intensity difference: PM - MM
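The PM - MM difference can be sketched for a single gene as follows. Averaging the differences over all probe pairs of a gene is an assumption made here for illustration, and the intensities are invented; production pipelines add background correction, normalization, and outlier handling.

```python
def raw_expression(pm, mm):
    """Average PM - MM difference over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)

# Hypothetical scanner intensities for three probe pairs of one gene.
pm = [1200, 900, 1500]
mm = [300, 400, 350]
value = raw_expression(pm, mm)
print(value)
```

Because the mismatch signal is subtracted, a gene's raw value can be negative when MM intensities exceed PM intensities.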
Microarray Potential Applications
Biological discovery
–new and better molecular diagnostics
–new molecular targets for therapy
–finding and refining biological pathways
Recent examples
–molecular diagnosis of leukemia, breast cancer, ...
–appropriate treatment for a genetic signature
–potential new drug targets
Microarray Data Analysis Types
Gene Selection
–find genes for therapeutic targets
–avoid false positives (FDA approval?)
Classification (Supervised)
–identify disease
–predict outcome / select best treatment
Clustering (Unsupervised)
–find new biological classes / refine existing ones
–exploration
…
Microarray Data Mining Challenges
Too few records (samples), usually < 100
Too many columns (genes), usually > 1,000
Too many columns are likely to lead to false positives
For exploration, a large set of all relevant genes is desired
For diagnostics or identification of therapeutic targets, the smallest set of genes is needed
The model needs to be explainable to biologists
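A small calculation makes the false-positive problem concrete. It assumes each gene is tested independently at a significance level of 0.05 and that no gene is truly different; the Bonferroni threshold shown is a standard correction brought in for illustration, not something the slides prescribe.

```python
n_genes = 1_000   # lower bound on column count from the slide
alpha = 0.05      # assumed per-gene significance level

# If no gene is truly different, each test still passes with probability alpha.
expected_false_positives = n_genes * alpha

# Bonferroni correction: per-gene threshold that keeps the chance of
# any false positive across all genes at or below alpha.
bonferroni_threshold = alpha / n_genes

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"Bonferroni per-gene threshold: {bonferroni_threshold:.0e}")
```

With 1,000 genes, about 50 would look significant by chance alone, which is why gene selection needs explicit control of false positives.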
Microarray Data Classification
[Figure: Microarray chips → Images scanned by laser → Datasets → Data Mining model; New sample → Prediction: ALL or AML]
Gene              Value
D26528_at         193
D26561_cds1_at    -70
D26561_cds2_at    144
D26561_cds3_at    33
D26579_at         318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at         707
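As a minimal illustration of the "data mining model" box, here is a nearest-centroid classifier for two classes. This is a stand-in chosen for brevity, not the model used in the original analysis, and every expression value below is invented; a real dataset would have thousands of genes per sample.

```python
import math

def centroid(samples):
    """Per-gene mean expression over a list of samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def classify(sample, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

# Hypothetical training data: three genes per sample, two samples per class.
training = {
    "ALL": [[200, 1500, 30], [180, 1700, 50]],
    "AML": [[900, 300, 400], [1100, 250, 380]],
}
centroids = {label: centroid(samples) for label, samples in training.items()}

new_sample = [210, 1600, 45]  # hypothetical new patient sample
prediction = classify(new_sample, centroids)
print(prediction)
```

A centroid-based model is also easy to explain to biologists, since each class is summarized by a typical expression profile.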