CO341: Introduction to Bioinformatics Prof. Yi-Ke Guo

1 CO341: Introduction to Bioinformatics Prof. Yi-Ke Guo (yg@ic.ac.uk)
Filling in: Ioannis Pandis, PhD

2 Sequencing and Genomics
DNA Sequencing
Sequencing Analysis
Gene Expression
Gene Expression Analysis
Functional Genomics

3 DNA Structure
Double Helix (Crick & Watson)
  2 coiled matching strands
  Backbone of sugar phosphate pairs
Nitrogenous Base Pairs
  Roughly 20 atoms in a base
  Adenine ↔ Thymine [A,T]
  Cytosine ↔ Guanine [C,G]
  Weak bonds (can be broken)
Form long chains called polymers
Read the sequence on 1 strand
  GATTCATCATGGATCATACTAAC

4 Differences in DNA
DNA differentiates:
  Species/race/gender
  Individuals
We share DNA with
  Primates, mammals
  Fish, plants, bacteria
Genotype
  DNA of an individual
  Genetic constitution
Phenotype
  Characteristics of the resulting organism
  Nature and nurture
[Figure labels: tiny 2%; Share Material; Roughly 4%]

5 Genes
Chunks of DNA sequence
  Between 600 and 1200 bases long
  22,000 human genes, 100,000 genes in tulips
Large percentage of human genome
  termed “junk”: does not code for proteins
“Simpler” organisms such as bacteria
  Are much more “evolved” (have hardly any junk)
  Viruses have overlapping genes (zipped/compressed)
Often the active part of a gene is split into exons
  Separated by introns

6 Transcription
Take one strand of DNA
Write out the counterparts to each base
  G becomes C (and vice versa)
  A becomes T (and vice versa)
Change Thymine [T] to Uracil [U]
You have transcribed DNA into messenger RNA
Example:
  Start: GGATGCCAATG
  Intermediate: CCTACGGTTAC
  Transcribed: CCUACGGUUAC
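The transcription recipe on this slide can be sketched in a few lines of Python. This is only an illustration of the slide's simplified rule (complement each base, then swap T for U); the function names are invented here, and like the slide it ignores strand orientation.

```python
# Complement each DNA base, then replace T with U, following the slide's recipe.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(dna: str) -> str:
    """Write out the counterpart of each base (G<->C, A<->T)."""
    return "".join(COMPLEMENT[base] for base in dna.upper())

def transcribe(dna: str) -> str:
    """Complement the strand, then change thymine to uracil."""
    return complement(dna).replace("T", "U")

print(complement("GGATGCCAATG"))   # the slide's "Intermediate" line
print(transcribe("GGATGCCAATG"))   # the slide's "Transcribed" line
```

Run on the slide's start sequence, the two calls reproduce the intermediate and transcribed strings shown in the example.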

7 The Synthesis of Proteins
Instructions for generating Amino Acid sequences
  (i) DNA double helix is unzipped
  (ii) One strand is transcribed to messenger RNA
  (iii) RNA acts as a template: ribosomes translate the RNA into the sequence of amino acids
Amino acid sequences fold into a 3d molecule
Gene expression
  Every cell has every gene in it (has all chromosomes)
  Which ones produce proteins (are expressed) & when?

8 Genetic Code
How the translation occurs
Think of this as a function:
  Input: triples of base letters (Codons)
  Output: amino acid
Example: ACC becomes threonine (T)
Gene sequences end with: TAA, TAG or TGA

9 Genetic Code
3 nt / aa = 4^3 = 64 aa???
A=Ala=Alanine
C=Cys=Cysteine
D=Asp=Aspartic acid
E=Glu=Glutamic acid
F=Phe=Phenylalanine
G=Gly=Glycine
H=His=Histidine
I=Ile=Isoleucine
K=Lys=Lysine
L=Leu=Leucine
M=Met=Methionine
N=Asn=Asparagine
P=Pro=Proline
Q=Gln=Glutamine
R=Arg=Arginine
S=Ser=Serine
T=Thr=Threonine
V=Val=Valine
W=Trp=Tryptophan
Y=Tyr=Tyrosine

10 Example Synthesis
TCGGTGAATCTGTTTGAT
Transcribed to: AGCCACUUAGACAAACUA
Translated to: SHLDKL
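The translation step of this example can be checked with a small sketch. It builds the standard genetic code from the conventional TCAG codon ordering and reads the mRNA three letters at a time. The names `CODON_TABLE` and `translate` are illustrative choices, and stop codons are simply shown as `*` rather than ending translation.

```python
# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(rna: str) -> str:
    """Translate mRNA codon by codon; '*' marks a stop codon."""
    seq = rna.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3   # ignore a trailing partial codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))

print(translate("AGCCACUUAGACAAACUA"))   # the slide's transcribed sequence
```

The table has one entry for each of the 4^3 = 64 possible codons, which is why several codons map to the same amino acid letter.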

11 Proteins
DNA codes for
  strings of amino acids
Amino acids strings
  Fold up into complex 3d molecule
  3d structures: conformations
  Between 200 & 400 “residues”
  Folds are proteins
Residue sequences
  Always fold to same conformation
Proteins play a part
  In almost every biological process

12 Evolution of Genes: Inheritance
Evolution of species
  Caused by reproduction and survival of the fittest
But actually, it is the genotype which evolves
  Organism has to live with it (or die before reproduction)
Three mechanisms: inheritance, mutation and crossover
Inheritance: properties from parents
  Embryo has cells with 23 pairs of chromosomes
  Each pair: 1 chromosome from father, 1 from mother
Most important factor in offspring’s genetic makeup

13 Evolution of Genes: Mutation
Genes alter (slightly) during reproduction
  Caused by errors, from radiation, from toxicity
3 possibilities: deletion, insertion, substitution
  Substitution: ACGTTGACTC → ACGATGACTT
  Deletion: ACGTTGACTC → ACGTGACTC
  Insertion: ACGTTGACTC → AGCGTTGACTC
  Frameshift: ACGTTGACTC → AGCGTTGACTC
Mutations are categorised into: Neutral or Deleterious
A single change can have a massive effect on translation
  Can cause a different protein conformation
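To see why the insertion example is also labelled a frameshift, it helps to split both sequences into codons. The helper below is a hypothetical illustration, not part of the lecture material.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive triplets, dropping any leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

original = "ACGTTGACTC"
inserted = "AGCGTTGACTC"   # the slide's insertion/frameshift example

print(codons(original))
print(codons(inserted))
```

A single inserted base near the start shifts the reading frame, so none of the three complete codons in the original survives unchanged in the mutated sequence.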

14 Evolution of Genes: Crossover (Recombination)
DNA sections are swapped From male and female genetic input to offspring DNA

15 Sequencing for Medical Study
[Diagram labels: Genotype; Phenotype; Hypothesis; Test Hypothesis By Genetic Manipulation]
How do we answer this question? Typically we start with a phenotype that we are interested in. We find individuals that have that phenotype and individuals that do not. Then we find a genetic change that is predominantly in the individuals with the phenotype but not in those without it. This leads us to a hypothesis that genotype X is responsible for phenotype Y. To test this we must now take an organism, make genotype X, and see if we observe phenotype Y. If true, we can hypothesize that changes in related genes may also cause the phenotype of interest, so we repeat the cycle.

16 Typical Cycle of the Study
[Diagram labels: Genotype: Mutation in APC Gene; Phenotype: Two groups: 1. Develop colorectal cancer at young age 2. Do not; Hypothesis: APC is a Tumor Suppressor Gene; Test Hypothesis By Genetic Manipulation: Delete APC in Mouse; Control: Isogenic APC+]
A quick example: Bert Vogelstein and colleagues were interested in a phenotype. They observed that certain families developed colorectal cancer at a very young age. They found a mutation in a gene called APC that was overrepresented in affected individuals. They hypothesized that APC suppresses tumors, and therefore that deleting APC promotes tumors. They tested this by deleting the gene in mice. Their control is genetically identical except at the APC locus. Only the APC- mouse gets cancer.

17 Technologies Required
[Diagram labels: In 2005 $9 million/genome, not feasible; ?Sequencing?; Genotype; Observation; Reading/Thinking; Phenotype; Hypothesis; Test Hypothesis By Genetic Manipulation; Gene Deletion/Replacement]
As someone who is interested in technology development, I am interested in understanding how we can speed up this cycle. So this slide shows what geneticists actually do at each point in the cycle.

18 Things are changing rapidly: bp/$ increases exponentially with time
In 1980, the sequencing cost per finished bp ≈ $1.00
In 2003, the sequencing cost per finished bp ≈ $0.01
>>> a 100-fold reduction in 23 years
Figure 1 | Exponential growth in computing and sequencing. The dark-blue plot indicates the Kurzweil/Moore’s Law: it describes the doubling of computer instructions per second per US dollar (IPS/US $) that has been occurring approximately every 18 months since [...]. The magenta plot indicates an exponential growth in the number of base pairs of accurate DNA sequence per unit cost (bp/US $) as a function of time. To some extent, the doubling time for DNA mimics the IPS/US $ curve because it is dependent on it. An even steeper segment occurs in the orange curve; this depicts the number of web sites (doubling time of four months) and shows how quickly a technology can explode when a protocol that can be shared spreads through an existing infrastructure. The turquoise plot is an ‘Open Source’ case study of ‘FLUORESCENT IN SITU SEQUENCING’ with polonies (see main text for details of this DNA-sequencing technology) in bp/min on simple test templates (doubling time of one month). Adapted from Shendure et al., 2004
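As a rough check on what a 100-fold reduction between 1980 and 2003 implies, the sketch below computes the cost-halving time. It assumes a steady exponential decline between the two data points on the slide, which is a simplification; the variable names are illustrative.

```python
import math

cost_1980 = 1.00   # $ per finished bp (from the slide)
cost_2003 = 0.01   # $ per finished bp (from the slide)
years = 2003 - 1980

fold_reduction = cost_1980 / cost_2003
# A 100-fold drop corresponds to log2(100) successive halvings.
halving_time = years / math.log2(fold_reduction)

print(f"{fold_reduction:.0f}-fold reduction over {years} years")
print(f"cost halves roughly every {halving_time:.1f} years")
```

Under that assumption the cost per base halves about every three and a half years over this period.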

19 History of DNA Sequencing
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
[Timeline figure]
Milestones:
  Miescher: Discovers DNA
  Avery: Proposes DNA as ‘Genetic Material’
  Watson & Crick: Double Helix Structure of DNA
  Holley: Sequences Yeast tRNA(Ala)
  Wu: Sequences λ Cohesive End DNA
  Sanger: Dideoxy Chain Termination
  Gilbert: Chemical Degradation
  Messing: M13 Cloning
  Hood et al.: Partial Automation
  Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
  Next Generation Sequencing; Improved enzymes and chemistry; Improved image processing
Years marked on the axis: 1870, 1940, 1953, 1965, 1970, 1977, 1980, 1986, 1990, 2002, 2008 (plus "1928???")
Efficiency (bp/person/year): 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000

20 History of DNA Sequencing
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998) Griffith's experiment, reported in 1928 by Frederick Griffith

21 History of DNA Sequencing
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
[Timeline figure]
Milestones:
  Miescher: Discovers DNA
  Avery: Proposes DNA as ‘Genetic Material’
  Watson & Crick: Double Helix Structure of DNA
  Holley: Sequences Yeast tRNA(Ala)
  Wu: Sequences λ Cohesive End DNA
  Sanger: Dideoxy Chain Termination
  Gilbert: Chemical Degradation
  Messing: M13 Cloning
  Hood et al.: Partial Automation
  Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
  Next Generation Sequencing; Improved enzymes and chemistry; Improved image processing
Years marked on the axis: 1870, 1940, 1953, 1965, 1970, 1977, 1980, 1986, 1990, 2002, 2008
Efficiency (bp/person/year): 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000

22 Sanger Sequencing (Chain-termination Methods)
DNA is fragmented
Cloned to a plasmid vector
Cyclic sequencing reaction
Separation by electrophoresis
Readout with fluorescent tags

23 Basics of the “old” technology
Clone the fragmented DNA.
Generate a ladder of labeled (colored) molecules that are different by 1 nucleotide.
Separate mixture on some matrix.
Detect fluorochrome by laser.
Interpret peaks as string of DNA.
Strings are 500 to 1,000 letters long
Assemble all strings into a genome
The Process Is Sequential

24 This is what the old technology takes
3 × 10^9 bp = 1x coverage
Time: (10x coverage × 3 × 10^9 bp) / (2 × 10^6 bp/day) ≈ 40 years
Cost: 10x coverage × 3 × 10^9 bp × $0.001/bp = $30 million
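The arithmetic behind these two figures can be reproduced directly. The sketch below uses only the numbers given on the slide; the variable names are illustrative.

```python
genome_bp = 3e9                # bases at 1x coverage (from the slide)
coverage = 10                  # 10x coverage (from the slide)
throughput_bp_per_day = 2e6    # sequencing rate (from the slide)
cost_per_bp = 0.001            # $ per bp (from the slide)

total_bp = coverage * genome_bp
days = total_bp / throughput_bp_per_day
years = days / 365
cost = total_bp * cost_per_bp

print(f"{days:,.0f} days = {years:.0f} years")
print(f"${cost / 1e6:,.0f} million")
```

The division gives 15,000 days of sequencing, which is where the slide's figure of about 40 years comes from.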

25 New Generation Sequencing

26 Basics of the “new” technology
Get DNA and fragment it
Attach all fragments to glass slides.
Perform amplification by some form of PCR
Sequence ALL these fragments in PARALLEL using chain termination or other methods such as pyrosequencing
Extend and amplify signal with some color scheme.
Detect fluorochrome by microscopy.
Interpret series of spots as short strings of DNA.
Strings are [...] letters long
Multiple images are interpreted as 0.4 to 1.2 GB/run (1,200,000,000 letters/day).
Map or align strings to one or many genomes.
Making Millions of Short Sequence Reads in Parallel

27 Technology Overview: Solexa/Illumina Sequencing
How does polony sequencing work? Here are the steps in our protocol.
Step 1: Make a library of linear DNA molecules. This library can be made by random priming or ligation. Every molecule in this library has a universal primer on either end. In the middle is the variable region that differs from molecule to molecule. This is the DNA that we want to sequence. The library is made without cloning into E. coli, so it has very high complexity.
Take a very dilute amount of this library and pour an acrylamide gel on a glass microscope slide. Include PCR reagents. Cycle in a PCR machine adapted for slides. Because the solution is so dilute, each template molecule is relatively far from other molecules. As PCR proceeds, the acrylamide restricts diffusion of the amplification products so that they remain localized to the template molecule. A polymerase colony (POLONY) forms.
[Figure label: 5,000,000]

28 Immobilize DNA to Surface
The goal is to amplify millions of “polonies” on a single slide. But then we want to perform enzymatic manipulations on this DNA To facilitate this, one of the primers used in the PCR has a 5’ acrylic group that allows copolymerization into the gel. Source:

29 Technology Overview: Solexa Sequencing
Now we’ve amplified polonies, how are we going to sequence them? Here’s how we do this.

30 Sequence Colonies
The first thing I want to ask is: can we amplify single molecules at high densities in acrylamide? Alexander Chetverin had previously demonstrated growing RNA colonies in agarose using the Q beta replicase system, so we were optimistic that DNA colonies could be grown. Add various amounts of template (1 kind, 236 base pairs), amplify, stain with the DNA dye SYBR Green: see the fluorescent spheres? The number of spheres is linear with the amount of template added. Pick a sphere and run it on a gel: right size. We conclude that these spheres were DNA colonies, or polonies. Now the size of the polonies shown here is much larger than we would like. We could fit 2100 of these polonies on a single slide and we want millions.

31 Sequence Colonies
But, by increasing the length of the DNA template and increasing the acrylamide concentration, we can get polonies as small as 5 microns. This would enable 15,000,000 distinguishable polonies assuming Poisson statistics.

32 Call Sequence
We established that we could amplify polonies. But can we sequence them? Rather than sequencing them directly, I decided to try to work out the protocol using oligonucleotide templates attached to acrylamide to work out issues with attachment chemistry, as well as the nucleotide and polymerase chemistries. First question, can we get a specific single base extension with our

33 From Debbie Nickerson, Department of Genome Sciences, University of Washington,

34 Sequence Alignment Meyerson et al, 2011

35 So, how fast is cost going down?
2006: $10 million 2008: $100,000 2009: $10,000 2010: $5,000 2012: $1,000 ??? $100

36 Informatics Informatics challenge : ample applications
All the genomics research can be uniformly done through sequencing (with the help of proper assay design) Bioinformatics turns the sequencer into universal genomics interpreter Not a challenge, rather a big opportunity!!! For Edison, phonograph was not primarily designed for playing music but …….

37 One Stone, Many Birds: NGS May Enable a Uniform Bioinformatics
Mapped Position: Structure/functionality (Mapping)
Read Numbers: Quantified Abundance (Counting)
BP Variant: SNP & Mutation Pattern (Detecting)

38 Match These Sequences How do we match this sequence: gattcagacctagct
With this sequence: gtcagatcct

39 Possible Answers
1. (no indels)
   gattcagacctagct
   gtcagatcct
2. (with indels)
   gattcaga-cctagct
   g-t-cagatcct
3. (no overhang)
   gattcagacctagc-t
4. (with overhang)
   gattcagacctagct

40 Sequence Matching Algorithms #1
Without indels
Hamming distance
Scoring schemes
  Certain changes in sequence more likely
  Due to chemical properties of the residues
BLAST algorithm
  Idea: match local regions and expand
  Seven part process
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other.
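The Hamming distance definition translates almost word for word into code. The function below is a minimal sketch with an illustrative name; it is applied to the substitution example from the mutation slide.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

# Substitution example from the mutation slide: ACGTTGACTC -> ACGATGACTT
print(hamming_distance("ACGTTGACTC", "ACGATGACTT"))
```

Because it compares position by position, this measure only makes sense without indels; a single inserted base would make every later position look like a mismatch.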

41 Sequence Matching Algorithms #2
With indels
Drawing of Dotplots
Dynamic Programming (getting from A to B)
  Quickest route to Z + Quickest route from Z
[Figure labels: VPFLLMMVLG; V P F M L G; A C D G Z E B F]
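One common way to apply dynamic programming to matching with indels is the edit distance, sketched below. It assumes a unit cost for every substitution, insertion and deletion, which is a simplification and not necessarily the scoring scheme used in the lecture. Each table entry is built from the best routes to its neighbouring entries, in the spirit of the "quickest route" idea on the slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(edit_distance("ACGTTGACTC", "ACGTGACTC"))         # deletion example from the mutation slide
print(edit_distance("gattcagacctagct", "gtcagatcct"))   # the two sequences to be matched
```

Only two rows of the table are kept in memory at a time; recovering the alignment itself, rather than just its cost, would require keeping the full table and tracing back through it.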

42 Searching Databases
We have ways to score how well 2 seqs match
Now want to use this in databases
  Given a known gene sequence
  Which genes in the database are closely related
Have to worry about:
  Repeated subsequences biasing matches
  Accuracy and significance of matches
  Sensitivity and specificity (false + and false -)
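Sensitivity and specificity can be made concrete with the usual definitions in terms of true/false positives and negatives. The counts in this sketch are hypothetical, invented purely for illustration, and the function names are not from the lecture.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly related sequences that the search reports."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of unrelated sequences that the search correctly rejects."""
    return tn / (tn + fp)

# Hypothetical counts for one database search, invented for illustration only.
tp, fn, tn, fp = 45, 5, 900, 50
print(sensitivity(tp, fn))
print(specificity(tn, fp))
```

False negatives lower sensitivity and false positives lower specificity, so loosening a match threshold typically trades one against the other.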

43 Functional Genomics—Transcriptomics
Transcriptome – the complete set of coding and non-coding RNA molecules in a cell at a particular time: Varies between cell types Transcriptomics – the study of the transcripts in a cell, cell type, organism, etc.

44 Methods for Transcriptomics
Microarray-based: High-throughput gene expression profiling
  Hybridization of labeled cDNAs to an array of complementary DNA probes
  Measurement of expression levels based on hybridization intensity
Sequence-based:
  Full-length cDNA (FLcDNA) sequencing: complete sequencing of cDNA clone
  Expressed sequence tag (EST) sequencing: Single-pass sequencing of cDNA clone
  Serial Analysis of Gene Expression (SAGE): Short sequence tags at 3’ end of transcript; tags concatenated and sequenced
NGS enables whole transcriptome sequencing: Sequence Census Method

45 Machine Learning
Machine learning (inductive reasoning)
  Automatic proposing of hypotheses based on data
  Has many applications in bioinformatics, such as microarray analysis
Example: predictive toxicology
  Given: set of toxic drugs and a set of non-toxic drugs
  Given: background information (chemistry, etc.)
  Produces: hypothesis why drugs are toxic/toxic mechanism
Overview of machine learning
  Aims, techniques, methodologies, representations
  Artificial neural networks
  Support vector machines, etc.

46 Machine Learning Larrañaga et al. 2005

47 The End! Questions?

