2: Large-Scale High-throughput technologies: sequencing; gene expression profiling; ChIP-chip and tiling arrays; whole-genome yeast two-hybrid scans; genomic knockout of all single genes; SNP/CGH; methylation profiling; …; proteome profiling.
Genomic sequencing – shotgun sequencing. A single sequencing run usually reads only ~700 bp. How, then, can we sequence a whole genome?
Genomic sequencing – walking. 1. Design a primer. 2. Sequence. 3. Design a new primer from the end of the new sequence. 4. Sequence. 5. … A new primer has to be designed at every step, and each design has to wait for the previous sequencing result, so the steps cannot run in parallel.
Genomic sequencing – shotgun sequencing. 1. Break the DNA into small pieces. 2. Sequence each piece. 3. Assemble.
Example (assembled sequence on top, the three overlapping reads below):
ACCCCGAGGAGACGAACACCCGTATACAGTCGACGTTTATATATA
ACCCCGAGGAGACGA
     GAGGAGACGAACACCCGTATACAGTCGACG
                     GTATACAGTCGACGTTTATATATA
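The "assemble" step can be illustrated with a toy greedy merger that repeatedly joins the two pieces with the longest suffix/prefix overlap. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the `min_len` parameter are hypothetical, and real assemblers are considerably more elaborate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of at least min_len remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# The three reads from the slide.
reads = [
    "GAGGAGACGAACACCCGTATACAGTCGACG",
    "GTATACAGTCGACGTTTATATATA",
    "ACCCCGAGGAGACGA",
]
print(greedy_assemble(reads))
```

With the slide's reads, the two overlaps (14 and 10 bases) are enough to recover a single sequence; with sequencing errors or repeats the same greedy strategy can easily merge the wrong pieces.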
1. Break the DNA into small pieces. After the DNA is isolated (from a tissue, cell or virus), it is fragmented either by restriction enzymes or by mechanical force. The fragments can have “frayed ends”, i.e. single-stranded overhangs (top strand / bottom strand): ACGTAACGTATACCCGAC / TATATGCATTGCATATG.
To blunt-end (“fix”) frayed ends, one needs a DNA polymerase. Polymerases always make the chain grow from the 5’ end towards the 3’ end (5’->3’). In this example, just adding a polymerase makes the ends blunt, because both recessed ends are 3’ ends that can be extended (arrows): <-ATACGTAACGTATACCCGAC / TATATGCATTGCATATGGG->.
Polymerases always make the chain grow from the 5’ end towards the 3’ end (5’->3’). But what about this case, where the overhangs are 3’ ends and the polymerase has nothing to extend? ACGTAACGTATACCCGAC / ATTGCATATGGGCTGAACAT.
E. coli DNA polymerase I has three activities: one does the replication (polymerase), one digests DNA 3’->5’ (exonuclease), and one digests DNA 5’->3’ (exonuclease). Klenow fragment = the part of this polymerase that lacks the 5’->3’ exonuclease activity.
Answer: Klenow has 3’->5’ exonuclease activity, so it can digest the protruding 3’ ends of ACGTAACGTATACCCGAC / ATTGCATATGGGCTGAACAT until the fragment is blunt.
2. Sequence each piece. The pieces are inserted into a vector, e.g., a plasmid, and sequencing is done from both sides. Because the primers match the vector rather than the insert, the same primers can be used for all the sequencing reactions, which can therefore run in parallel.
Shotgun sequencing – why isn’t it a trivial task? 1. By chance, some parts are not sequenced even once!
Shotgun sequencing – definition of coverage. 5X coverage: each base in the final sequence is present, on average, in 5 reads. Although the human genome was sequenced at 12X coverage, about 1% of the genome is still either not assembled or not reliable.
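A minimal sketch of the coverage arithmetic. The first function is just the definition (total bases read divided by genome length). The second uses the standard idealized assumption that reads land uniformly at random, so that the number of reads covering a base is roughly Poisson-distributed and a base is missed with probability about e^(-coverage). Function names and the example numbers are hypothetical.

```python
import math


def coverage(n_reads, read_length, genome_length):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_length


def expected_uncovered_fraction(c):
    """Fraction of bases expected to be hit by no read at coverage c (idealized random model)."""
    return math.exp(-c)


# Hypothetical example: 1,000 reads of 700 bp over a 140 kb region give 5X coverage.
c = coverage(1000, 700, 140_000)
print(c, expected_uncovered_fraction(c))
```

The model ignores repeats, sequencing errors and cloning bias, which is one reason real assemblies leave more of the genome unresolved than this estimate alone would suggest.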
Shotgun sequencing – why isn’t it a trivial task? 2. Some pieces do not align because of sequencing errors. Example: the read GAGGTGAGGAACACCCGTATACAGTCGACG contains two errors, so the assembly is uncertain at those positions: ACCCCGAGG?GA?GAACACCCGTATACAGTCGACGTTTATATATA.
Shotgun sequencing – why isn’t it a trivial task? 3. Repetitive sequences – satellite DNA. Reads that fall inside a long repeat such as GGGGGGGG all look alike, so the length of the repeat cannot be determined: ACCCCGGGGGGGGGGGGG????GGGGGGGGGGGGGA.
Shotgun sequencing – why isn’t it a trivial task? 4. Repetitive sequences (duplicated regions). The genome contains duplicated regions with almost identical sequences, so a read from one copy is hard to tell apart from a read from the other.
Shotgun sequencing – why isn’t it a trivial task? 5. Some fragments are never sequenced because, once inserted into a bacterium, they are toxic to it.
A contig: a section of the genome that could be reliably assembled.
Lander-Waterman estimate of the number of contigs as a function of genome coverage.
At 8X-10X coverage, ~5 contigs are expected -> some of the genome is expected to remain unsequenced.
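The Lander-Waterman estimate can be sketched as follows. In that model the expected number of contigs is N * exp(-c * sigma), where N is the number of reads, c the coverage and sigma = 1 - T/L, with L the read length and T the minimum overlap needed to join two reads. The function name and the example parameter values below (genome size, read length, minimum overlap) are assumptions for illustration, not values from the slides.

```python
import math


def expected_contigs(genome_length, read_length, coverage, min_overlap):
    """Lander-Waterman expected number of contigs under uniformly random reads."""
    n_reads = coverage * genome_length / read_length
    sigma = 1.0 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)


# Hypothetical 1 Mb genome, 700 bp reads, 40 bp minimum overlap.
for c in (1, 2, 4, 8, 10):
    print(c, round(expected_contigs(1_000_000, 700, c, 40), 2))
```

The count first rises with coverage (more reads, few of which overlap) and then falls steeply as the reads begin to merge; the exact number at a given coverage depends on the genome size and read length chosen.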
A vector (propagated in, e.g., E. coli) carries a cloned fragment of the genome (e.g., 10 kb). When sequencing a large genome, the inserts are often very large (10 kb). In such cases it is impossible to sequence the entire insert in one read, and only the ends are sequenced.
Mate pairs: short stretches from both ends of the insert are sequenced. Each stretch is a read, and the two reads from the same insert form a mate pair.
Mate pairs: the size of the insert (e.g., 10 kb) is also recorded.
Information from mate pairs is used to build a scaffold of the genome, in which the contigs are ordered relative to one another.
Comparative assembly. The human genome is the chimp genome with 99% accuracy. So if one sequences the chimp genome, the information from the human genome can aid in the assembly.
If someone offers to sequence your genome at 99.9% accuracy, don’t take it even for $5: with 3.2 x 10^9 nucleotides, a 0.1% error rate means about 3 million wrong bases.
Often, phages are used as cloning vectors in standard cloning experiments. For genomic sequencing, Bacterial Artificial Chromosomes (BACs) are often used. These are based on the F plasmid, a large plasmid that replicates stably in E. coli. Over 300 kb can be inserted into the plasmid.
BAC-by-BAC assembly of the genome. The idea is to first divide a big genome into overlapping regions, put each in a BAC, and then use the shotgun method to sequence each BAC. (Figure labels: into BACs; sequencing the edges; shotgun; assemble each BAC.)
Pyrosequencing: sequencing at the speed of light
Pyrosequencing: a relatively new technique (invented 1986) in which the sequence of a DNA molecule is determined by synthesizing its complementary strand (the "sequencing by synthesis" principle).
Pyrosequencing: gel-free; nucleotides are label-free; parallelism.
Pyrosequencing reactions:
dGTP + DNA(n) -> DNA(n+1) + PPi (enzyme: polymerase)
PPi -> ATP (enzyme: sulfurylase)
ATP -> light (enzyme: luciferase)
ATP -> AMP + 2Pi (enzyme: apyrase, which clears the leftover nucleotides before the next addition)
Pyrosequencing. Template: ACGTAACGTATACCCG. Primer: TGCATT?. Only if one adds G will there be light!
1. Add dATP -> no light
2. Add dCTP -> no light
3. Add dGTP -> light
4. Add dTTP -> no light
5. Add dATP -> no light
6. Add dCTP -> light
7. Add dGTP -> no light
8. Add dTTP -> no light
9. Add dATP -> light
Sequence = GCA
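The nine steps above can be reproduced with a small simulation: nucleotides are dispensed in a fixed cyclic order, and light is produced only when the dispensed nucleotide complements the next template base. This is a simplified sketch; the function names and the A, C, G, T dispensation order are assumptions, and a run of identical template bases is modeled as one proportionally brighter flash.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pyrosequence(template, n_dispensations, order="ACGT"):
    """Return a flowgram: (dispensed nucleotide, number of bases incorporated) per step.

    template is the unread part of the template strand, starting right after the primer.
    """
    pos = 0
    flowgram = []
    for step in range(n_dispensations):
        nucleotide = order[step % len(order)]
        incorporated = 0
        # Extend while the dispensed nucleotide pairs with the next template base.
        while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
            incorporated += 1
            pos += 1
        flowgram.append((nucleotide, incorporated))
    return flowgram


def call_sequence(flowgram):
    """Read the synthesized strand off the flowgram (light intensity = run length)."""
    return "".join(nucleotide * count for nucleotide, count in flowgram)


# The three template bases right after the primer on the slide are C, G, T.
flow = pyrosequence("CGT", 9)
print(flow)
print(call_sequence(flow))
```

For the slide's template the flashes fall on dispensations 3, 6 and 9, giving GCA. A template stretch such as CCC would produce a single flash about three times as bright, which is why long homopolymer runs are hard to count exactly with this method.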
Pyrosequencing. Each DNA fragment is amplified and attached to a bead separately (one fragment per bead). Each bead is placed in its own fibre-optic well.
Pyrosequencing. A computer can read the light pattern from more than a million wells simultaneously (sequencing of a bacterial genome in 7 h).
Bioinformatics and medicine: “Your chip analysis suggests stress.”
Bioinformatics and medicine. 1. Today, medicine is based on episodic treatment. 2. The first step, currently taking place, is the use of digital imaging and its analysis (e.g., optic fibers). 3. Next step: “digital health” – a person’s medical data will be shared by all doctors, no matter where the person is.
Bioinformatics and medicine. 4. Clinical genomics: fast and accurate identification of pathogens. 5. Clinical genomics: sequencing (part of) the genome to gain insight into which drugs are effective. 6. Predisposition analysis for diseases. 7. Towards “lifetime treatment”… 8. Relying less on the intuition of the doctor and more on quantitative parameters and statistical analysis.
Bioinformatics and medicine. Differences between humans: SNP – single nucleotide polymorphism; CGH – comparative genomic hybridization, which detects copy number variation; chromatin; epigenetics. We want to link these differences to diseases.
Some more important buzzwords: genomics; proteomics; metabolomics; systems biology; in silico (in vitro, in vivo); protein engineering; synthetic biology; post-genomic era.
Some important NUMBERS: human DNA = ~2 meters; 300 x 10^9 cells; 3.2 x 10^9 nucleotides.
Chip arrays and gene expression data
With chip array technology, one can measure the expression of 10,000 (~all) genes at once. It can answer questions such as: 1. Which genes are expressed in a muscle cell? 2. Which genes are expressed during the first week of pregnancy in the mother? In the new baby? 3. Which genes are expressed in cancer?
4. If one mutates a TF (transcription factor), which genes are no longer expressed? 5. Which genes are not expressed in the brain of a baby with an intellectual disability? 6. Which genes are expressed when a person is asleep versus when the same person is awake?
DNA chip: each cell of the chip holds a specific DNA molecule (a probe). Upon hybridization with an mRNA molecule (or a cDNA one), the intensity of the hybridization can be quantified by light.
Affymetrix: the base is a “wafer” (a thin crystalline semiconductor substrate). A light-sensitive chemical compound prevents coupling between the wafer and the first nucleotide of the DNA probe being created.
Affymetrix. The blue “cap” is light sensitive. A mask is placed over some of the cells. When the chip is illuminated, a reaction with the nucleotide can happen only where the light gets through.
Affymetrix. The nucleotide that is added is itself chemically linked to a new light-sensitive “cap”.
Affymetrix. The entire process is called photolithography.
Affymetrix: each probe is 25 nucleotides long and matches part of an exon. (Pictures: the reader; the chip itself.) In one cm^2: > 10^6 different oligos.
Affymetrix: each probe is 25 nucleotides. Above this length a technological problem arises: the synthesis becomes inaccurate. With such short probes, each mRNA can hybridize to more than one probe. The solution: each gene is “covered” by several probes.
Affymetrix: one can buy ready-made chips (human genome, mouse genome), or design (“print”) one’s own chip (more expensive).
Affymetrix. Detection: mRNA is isolated from the tissue (cells, viruses). cDNA is synthesized and fluorescently labeled. Sometimes the cDNA is amplified using PCR. The intensity in each cell (probe) is measured by “the reader”.
Agilent developed DNA printers: in each spot, picoliters of nucleotides are added, using standard phosphoramidite chemistry. They can make probes of up to 60-mers. (Agilent is derived from Hewlett-Packard.)
Agilent. Hybridization to Agilent probes is more accurate. If there is hybridization to a probe, the gene it represents is probably expressed.
Agilent. But it is impossible to know how many probes are in each cell, so absolute fluorescence intensities are meaningless.
Agilent. Solution: in the same experiment, hybridize samples from two conditions, e.g. mRNA from healthy cells (red) versus tumor cells (green). The Agilent reader gives the ratio of the two colors.
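A minimal sketch of the two-color ratio for one spot. Reporting the ratio on a log2 scale is a common convention rather than something stated on the slides, and the function name, the `floor` guard against zero intensities and the example intensities are all hypothetical.

```python
import math


def log2_ratio(tumor_green, healthy_red, floor=1.0):
    """log2(tumor / healthy) for one spot; positive means higher expression in the tumor."""
    return math.log2(max(tumor_green, floor) / max(healthy_red, floor))


# Hypothetical spot intensities (arbitrary units).
print(log2_ratio(400, 100))  # four-fold higher in the tumor sample
print(log2_ratio(100, 400))  # four-fold lower in the tumor sample
print(log2_ratio(50, 50))    # unchanged
```

Because both samples hybridize to the same spot, the unknown number of probe molecules in that spot affects both colors alike and cancels out of the ratio. On the log scale, a four-fold increase and a four-fold decrease are symmetric around zero.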
Stanford cDNA chips. In this approach, long cDNA sequences (>300 bp) are produced in a cell (a clone) and linked to each chip cell. Producing long cDNAs this way is cheaper than synthesizing them one nucleotide at a time! As in the case of Agilent, it is impossible to control the number of probes in each cell.
Output: a table with one row per gene (Gene 1, Gene 2, Gene 3, …, Gene 60000) and one column per condition (w.t., brain tumor males, brain tumor females). Each cell is either an absolute number or a relative one, depending on the technology used.
Repeats: the table gets one column per repeat (w.t., brain tumor male1, brain tumor male2, brain tumor female1). A repeat can be either the same sample on a different chip, or a “real” biological repeat, i.e. a different sample.
Expression profile
      bt4  bt3  bt2  bt1  wt4  wt3  wt2  wt1
g1     23   17   16   15    4    5    3    4
g2      9    7    3    6    6    4    5    7
g3     60   30   26   25    5    2    3    2
Genes 1 and 3 show the same trend (both go high under the same conditions). That is, they have the same expression profile.
Clustering (same table as above). In general, we want to find all the genes that share the same expression profile -> suggestive of a functional linkage. This is done by clustering genes with the same profile.
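One simple way to decide whether two genes "share a profile" is to correlate their expression values across the conditions. The sketch below uses Pearson correlation, which is an assumption (the slides do not name a similarity measure), and it assumes the flattened table on the slide splits into the values shown in the comments (columns bt4 ... wt1).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Columns: bt4, bt3, bt2, bt1, wt4, wt3, wt2, wt1 (assumed reading of the slide's table).
profiles = {
    "g1": [23, 17, 16, 15, 4, 5, 3, 4],
    "g2": [9, 7, 3, 6, 6, 4, 5, 7],
    "g3": [60, 30, 26, 25, 5, 2, 3, 2],
}

for a, b in (("g1", "g3"), ("g1", "g2"), ("g2", "g3")):
    print(a, b, round(pearson(profiles[a], profiles[b]), 2))
```

Under this reading, g1 and g3 come out strongly correlated while g2 correlates much more weakly with either, matching the slide's observation. A clustering method would then group genes whose pairwise similarity is high.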
Clustering
      bt4  bt3  bt2  bt1  wt4  wt3  wt2  wt1
g1     23    0   22    0    4    5    3    4
g2      9    0    8    0    6    4    5    7
g3     16    6   16    0    5    2    3    2
Clustering of the conditions can suggest two types of brain tumor (bt). Bi-clustering: clustering on both the conditions and the genes.
Applications. Think of growing E. coli at increasing glucose concentrations and running a chip array at each concentration. One can potentially discover all the genes in the glucose pathway. Knocking out a gene -> discover all the genes that interact with it.
Applications. Analyzing the expression of genes can help reveal the gene network of a given organism.
Clinical. Do I have a brain tumor?
      Tal  bt4  bt3  bt2  bt1  wt4  wt3  wt2  wt1
g1     11   23    0   22    0    4    5    3    4
g2      4    9    0    8    0    6    4    5    7
g3      0   16    6   16    0    5    2    3    2
Sequencing by hybridization. It was thought that the following procedure could work for sequencing a genome: 1. Make a chip containing all x-mers (e.g., x = 25). 2. Hybridize a genome to the chip. 3. By analyzing all the hybridizations and their overlaps, assemble the genome. Problem: it doesn’t work.