Schedule 10-feb, 9.00-12.00 (A1.14)  Chapter 5 Genetic Variation, SNPs, Haplotype, Genetic Distance  No computer practicum 11-feb, 9.00-11.30 (C1.112)

1 Schedule
10-feb, 9.00-12.00 (A1.14): Chapter 5, Genetic Variation, SNPs, Haplotype, Genetic Distance. No computer practicum.
11-feb, 9.00-11.30 (C1.112): Score matrices. Computer exercises: (B1.24H&L).
15-feb, (A1.10): Chapter 6, Natural Selection / Ka/Ks / HIV. No computer practicum.
16-feb, (A1.06): Chapter 7, Phylogenetics / SARS. Computer exercises: (B1.24H&L).

2 Mini projects: literature and computation study. Aim: To further demonstrate how bioinformatics is used in scientific research, we defined a complementary track during which you will perform (a) a literature study for a selected topic and (b) computations related to the topic investigated. Literature report: about 3000 words, in English. Computational exercise: programming language of choice (e.g., Perl, Matlab). Example topics for literature study: We selected several topics that you may choose for your literature study and computations. You are free to propose another topic, but it has to be approved first (please contact Antoine van Kampen). Before you start: contact Antoine van Kampen. Deadline: April 1, 2010.


4 Chapter 5 Are Neanderthals among us? Variation within and between species Prof. dr. Antoine van Kampen Biosystems Data Analysis Swammerdam Institute for Life Sciences Bioinformatics Laboratory Academic Medical Centre

5 Neanderthal Man Skeleton discovered in 1856 in Germany (Neander Thal) Skeleton dated to about 44 thousand years ago Unusual features of skeleton Belongs to ancient species of hominid: any member of the biological family Hominidae (the "great apes") Biologically different from modern humans First reconstruction of Neanderthal Man

6 Are Neanderthals our ancestors? Are modern Europeans the offspring of Neanderthals?  Has recently been settled by genetic analysis

7 Human evolution and spread

8 Consider dynamic nature of DNA Determine how DNA sequences change over time Use this information to infer the history and function of different parts of the genome Variation data is also used in  Medical research  Forensics  Genome annotation

9 Variation in DNA sequences 1 Questions about human origins can be answered  Exploit the fact that every genome has a slightly different genome sequence  Differences within and between species  Siblings with same parents have differences Variation in DNA accumulates via  Mutations (mistakes made by the cellular machinery that are then encoded in the genome)  Cell’s proof-reading machinery is very good  Estimate: one mistake for every 2 million – 1 billion bases  Introduced by recombination (when organism is diploid)

10 Variation in DNA sequences 2. Mutation rates: differ between organisms; differ between the mitochondrial and the nuclear genome; in most animals the rate in the mitochondrial genome is higher than in the nuclear genome. New mutations at any one nucleotide position are rare. Most genetic differences between individuals are inherited mutations and not newly arisen variants: we exploit this fact to study the history of species; shared mutations are indicative of shared ancestry.

11 Germline mutations. Mutations occur at every cell duplication: the genome is replicated each time. Creating a fully grown human: trillions of cells, each of which dies off and is replaced multiple times during a lifetime. This is one reason that cancer is an illness of the elderly. Mutations in skin cells or heart-muscle cells are not passed on to our offspring. Only mutations that occur in the germline cells have a chance of spreading through the population. There are exceptions, such as plants.

12 Mutations Can Occur in Germ-Line or Somatic Cells. Geneticists classify animal cells into two types. Germ-line cells: cells that give rise to gametes such as eggs and sperm. Somatic cells: all other cells. Germ-line mutations are those that occur directly in a sperm or egg cell, or in one of their precursor cells. Somatic mutations are those that occur directly in a body cell, or in one of its precursor cells.

13 Germ-line mutation: therefore, the mutation can be passed on to future generations. Somatic mutation: the size of the patch will depend on the timing of the mutation; the earlier the mutation, the larger the patch. Therefore, the mutation cannot be passed on to future generations.

14 Mutations. Neutral: no effect (this might still be a non-synonymous mutation!). Deleterious: disrupts some biological function. Advantageous: improves some biological function. If the mutation is not passed on to a child then it is lost. Polymorphism: any difference among individuals at a specific position in the genome (regardless of frequency). Point mutations: change of one base; when these mutations are polymorphic within a species we call them Single Nucleotide Polymorphisms (SNPs). The various versions of the DNA sequence are called alleles: we might find a SNP with an A allele and a T allele at a certain position.



17 Hemoglobin beta gene (HBB)

18 Human variation. SNPs: SNPs account for a large part of genetic variation. Humans: on average 1 SNP / 1500 bp, so any two sequences will differ at about 0.067% of positions. Short tandem repeats (STRs or microsatellites): repeats of short DNA words (e.g., CACACACACA), due to slippage during replication. The mutation rate at microsatellites is much higher than for SNPs. Rare types of variation: indels (insertions and deletions); rearrangements (inversions, duplications, transpositions: copy-paste or cut-paste of genome sequences).
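The description of short tandem repeats above can be illustrated with a small search routine. This is only a sketch: the function name, the unit-length range (2 to 6 bases) and the minimum of 4 repeats are assumptions chosen for the example, not values from the slides.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for each tandem repeat in seq.

    A repeat is a unit of min_unit..max_unit bases that occurs at least
    min_repeats times in a row. The shortest matching unit is preferred.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence (hypothetical): five copies of CA flanked by other bases.
example = "TTG" + "CA" * 5 + "GG"
print(find_microsatellites(example))
```

Comparing the repeat count at the same locus between individuals is the kind of length variation that DNA profiling relies on.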

19 Applications of microsatellites: Forensics. In forensic identification cases, the goal is typically to link a suspect with a sample of blood, semen or hair taken from a crime scene. Alternatively, the goal may be to link a sample found on a suspect's clothing with a victim. Relatedness testing in criminal work may involve investigating paternity in order to establish rape or incest. Because the lengths of microsatellites may vary from one person to the next, scientists have begun to use them for the above applications, a procedure known as DNA profiling or "fingerprinting".

20 Applications of microsatellites: Diagnosis and Identification of Human Diseases. Because microsatellites change in length early in the development of some cancers, they are useful markers for early cancer detection. Because they are polymorphic, they are useful in linkage studies, which attempt to locate genes responsible for various genetic disorders.

21 Transitions and transversions Not all point mutations are equally likely  Even if mutation has no effect  Due to molecular structure of nucleotides 4 transitions, 8 transversions Transitions are more common than transversions
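The count of 4 transitions and 8 transversions can be checked by enumerating all ordered base changes. The sketch below uses the standard definition (a transition stays within purines A/G or within pyrimidines C/T); the function and variable names are illustrative.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Return 'transition' or 'transversion' for the point mutation a -> b."""
    if a == b:
        raise ValueError("not a mutation: both bases are identical")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# All 12 ordered changes between two different bases.
changes = list(permutations("ACGT", 2))
transitions = [c for c in changes if classify(*c) == "transition"]
transversions = [c for c in changes if classify(*c) == "transversion"]
print(len(transitions), len(transversions))
```

Although there are twice as many possible transversions, transitions are observed more often, which is why models such as Kimura's treat the two classes separately.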

22 Genetic code is more robust to transitions When only two synonymous codons code for same AA they always differ by only one transition Transitions within coding sequence are on average less harmful than transversions

23 DNA and amino acid substitutions. Not all nucleotide mutations occur with the same frequency: due to chemical structure; due to the sometimes deleterious consequence of a change in DNA. Not all changes between amino acids are seen with the same frequency: sometimes because two AA are multiple nucleotide-mutation steps away (e.g., Ala → Cys via GCA → TGT requires 3 mutations); some AA are more interchangeable due to biochemical characteristics such as size, polarity and hydrophobicity.

24 Analysis of DNA-sequence variation Human DNA sequence is 99.9% identical between individuals → varying nucleotides Polymorphism: normal variation between individuals Genetic variation  May cause or predispose to inheritable diseases  Determines e.g. individual drug response  Used as markers to identify disease genes Genotyping

25 Genetic marker Polymorphisms that are highly variable between individuals: Microsatellites and single nucleotide polymorphisms (SNPs) Marker may be inherited together with the disease predisposing gene because of linkage disequilibrium (LD) Important terms Allele Alternative form of a gene or DNA sequence at a specific chromosomal location (locus) at each locus an individual possesses two alleles, one inherited from each parent Genotype genetic constitution of an individual, combination of alleles

26 Linkage disequilibrium, LD Alleles are in LD, if they are inherited together more often than could be expected based on allele frequencies Two loci are inherited together, because recombination during meiosis (formation of gametes) separates them only seldom
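The definition above compares how often two alleles occur together with what the allele frequencies alone would predict. One common way to quantify this, not named on the slide and used here as an assumption, is the coefficient D together with its normalized form r². A minimal sketch:

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r2) for two bi-allelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    D is the observed haplotype frequency minus the frequency expected
    under independence (p_a * p_b).
    """
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical frequencies: alleles always travel together vs. independently.
print(linkage_disequilibrium(0.5, 0.5, 0.5))
print(linkage_disequilibrium(0.25, 0.5, 0.5))
```

A value of D near zero means the alleles are inherited independently; r² close to 1 means knowing one allele almost determines the other.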

27 Haplotype. A haplotype is a series of genetic variants (e.g., SNPs) on one chromosome that are inherited from one parent. In subsequent generations the chromosomal haplotype is broken up by crossing-over events in meiosis. In practice, "haplotype" refers to closely linked genetic loci. SNPs that are located in close proximity tend to travel together; this is known as linkage disequilibrium (LD). In general, loci that are located more closely together on a chromosome will be in stronger LD. The correlation between LD and the physical distance separating two loci is modest: some loci that are separated by 20 bp will not be in LD, while other loci separated by a far larger distance will be in tight LD.

28 Haplotype. Multiple loci on the same chromosome that are inherited together. Usually a string of SNPs that are linked. [Figure labels: alleles, locus, haplotypes (combination of three alleles on one chromosome)]

29 Haplotype construction No good experimental methods available to identify haplotypes → Computational methods to create haplotypes from genotype data

30 ...Haplotype construction. Family-based haplotype construction: linkage analysis software (Simwalk, Merlin, Genehunter, Allegro...). Population-based haplotype construction: not as reliable as family-based; EM-algorithm (expectation maximization algorithm), described in  SnpHap  PHASE

31 Haplotype blocks Low recombination rate in the region Strong LD Low haplotype diversity Small number of SNPs in the block are enough to identify common haplotypes; tag SNPs

32 Formation of haplotype blocks. Recombination events that shuffle the components of a haplotype do not occur at random. Some locations in the genome have much higher recombination rates: recombination hotspots. The occurrence of recombination hotspots has contributed to the limited haplotype diversity of the genome: there are fewer observed haplotypes than would be expected by chance. The size of haplotype blocks varies from about 9 kb to over 100 kb. What is the size of a gene?

33 Average gene size: 10-15kb

34 Formation of haplotype blocks. [Figure: recombination (x) between chromosomes during meiosis]

35 [Figure panels: few generations; hundreds of generations]

36 Average block size. African populations: 11 kb. Non-African populations: 22 kb. 60%-80% of the genome is in blocks of > 10 kb.

37 Block frequencies Typically, only 3-5 common haplotypes account for >90% of the observed haplotypes

38 Benefits of haplotypes instead of individual SNPs. Information content is higher. Gene function may depend on more than one SNP. Smaller number of required markers, so the number of false-positive associations is reduced. Missing genotypes can be replaced by computational methods. Elimination of genotyping errors. Challenges: haplotypes are difficult to define directly in the lab (computational methods are needed); defining block borders is ambiguous (several different algorithms exist).

39 Haplotype example. Example: β2-adrenergic receptor gene (ADRB2). Consider 8 SNPs, each with two alleles. One would expect 2^8 = 256 haplotypes. The observed number of haplotypes is much smaller, and only 3 haplotypes are estimated to occur with large frequency.

40 Genotypes and haplotypes Diploid organism. 2 bi-allelic loci on the same chromosome (e.g., SNPs). First locus alleles A and T: 3 genotypes AA, AT, and TT. Second locus alleles G and C: 3 genotypes GG, GC, and CC. Individual: 9 possible configurations for the genotypes at these two loci. Punnett square (next slide) shows the possible genotypes that an individual may carry and the corresponding haplotypes.

41 Genotypes and haplotypes. [Figure: a pair of homologous chromosomes with their alleles at two loci; labels: haplotype, genotype, heterozygous] The number of haplotypes grows exponentially with the number of polymorphisms. Question: what is the haplotype given the genotype?

42 Nuclear DNA: Genotypes and haplotypes. Homologous chromosomes shown as haplotype pairs: AG | AG, AC | AG, AC | AC.

43 Nuclear DNA: Genotypes and haplotypes. Homologous chromosomes shown as haplotype pairs: AG | AG, AC | AG, AC | AC, TG | TG, TC | TG, TC | TC.

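45 Nuclear DNA: Genotypes and haplotypes. Homologous chromosomes for the double heterozygote (A/T at the first locus, G/C at the second): haplotype pairs AC | TG or AG | TC. The haplotypes are ambiguous.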
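The two-locus example can be reproduced by enumerating every haplotype pair that is consistent with a given genotype. The sketch below is illustrative only (the function name and data layout are assumptions); it shows that a genotype with at most one heterozygous locus has a single resolution, while the double heterozygote has two.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Return the set of unordered haplotype pairs consistent with a genotype.

    genotype: list with one (allele, allele) tuple per locus,
              e.g. [("A", "T"), ("G", "C")].
    """
    pairs = set()
    # For each locus, choose which of the two alleles sits on chromosome 1;
    # the other allele then sits on chromosome 2.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        hap2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return pairs


print(compatible_haplotype_pairs([("A", "A"), ("G", "C")]))  # one resolution
print(compatible_haplotype_pairs([("A", "T"), ("G", "C")]))  # two resolutions
```

With k heterozygous loci the number of possible resolutions is 2^(k-1), which is why haplotypes usually have to be inferred statistically from population or family data.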

46 Mitochondrial DNA (mtDNA): a model for the analysis of variation. mtDNA is ideal for studying human evolution because of its high mutation rate and other technical advantages (e.g., easier to isolate). Mitochondria contain a high number of mutagenic oxygen molecules, which leads to the high mutation rate. Small circular genome: 16,569 bases long in humans; 37 genes, of which 13 are protein coding genes and the rest RNA genes; slightly different genetic code than the nuclear genome.

47 Mitochondrial DNA (mtDNA). D-loop: non-coding sequence; origin of replication; promoter; hypervariable regions (HVR-I, HVR-II); L= bp. HVR: high variability among humans. Ideal for studying the relationships among individuals.

48 Advantage of mtDNA mtDNA inherited only from mother Every individual will have only one version of mtDNA  We automatically know the haplotype

49 mtDNA: Genotypes and haplotypes. [Figure: mtDNA genotype and the single corresponding haplotype] mtDNA is only passed down through the mother. Since we only have one version, we automatically know the haplotype if we know the genotype.

50 Variation between species. Genetic differences between species are responsible for differences in behavior, morphology and physiology. Variation between species: tells us about relationships between species (closely related species have on average more similar DNA); tells us how evolution proceeded over millions of years. Key to understanding differences between species: the number of nucleotide substitutions that separate two DNA sequences.

51 Substitution rate Substitution rate between homologous sequences from different species, tells us about  Time since divergence of species  Biological function of genomic sequences  Relationships among species Mutation  Originates in a single individual  may be lost if individual leaves no offspring  may become fixed throughout the species  every individual in the species will have the new allele at the specific nucleotide position. Substitution rate: rate at which species accumulate such fixed differences

52 Substitution vs. polymorphism. What happens after a mutation changes a nucleotide in a locus? Polymorphism: the mutant allele is one of several present in the population. Substitution: the mutant allele fixes in the population. (New mutations at other nucleotides may occur later.)

53 Substitution schematic (each column is one individual)
Time 0:  aaat aaat aaat aaat aaat aaat aaat
Time 10: aaat aaat aaat aaat acat aaat aaat
Time 20: aaat aaat acat aaat acat acat acat
Time 30: acat acat acat acat acat acat acat
Time 40: acat acat actt acat acat acat acat
(1) Times 10-29: polymorphism
(2) Time 30: mutation fixed -> substitution
(3) Time 40: new mutation: polymorphism

54 Substitution rates for neutral mutations Most neutral mutations are lost Only 1 out of 2N fix Most that are lost go quickly (< 20 generations for population sizes from ) Substitution rate is expressed as number of substitutions per site per million years

55 Neutral theory (neutral mutations). 2 copies of each gene: a population of size N = 10 carries 2N = 20 alleles. Any new mutation (rate = µ) is initially present at a frequency of 1/(2N), which equals its chance of becoming fixed in the population. Number of new mutations: 2Nµ. Substitution rate: ρ = 2Nµ(1/2N) = µ.

56 Neutral theory. ρ = 2Nµ(1/2N) = µ (e.g., 2 mutations per site per Myear). The substitution rate of new mutations is independent of the population size and is equal to the neutral mutation rate. Larger populations: create more mutations, each with a smaller chance of becoming fixed. Smaller populations: create fewer mutations, each with a larger chance of becoming fixed.

57 Genetic drift Change in gene frequency due to chance fluctuations in a finite population
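Genetic drift and the 1/(2N) fixation probability of a new neutral mutation can be illustrated with a small simulation. The sketch below assumes a Wright-Fisher style model (non-overlapping generations, each of the 2N allele copies drawn at random from the previous generation); that model, the function name and the parameter values are assumptions for illustration and are not taken from the slides.

```python
import random


def fixation_fraction(n_individuals, trials, seed=1):
    """Estimate the fraction of new neutral mutations that reach fixation."""
    rng = random.Random(seed)
    n_alleles = 2 * n_individuals  # diploid: 2 copies per individual
    fixed = 0
    for _ in range(trials):
        count = 1  # a new mutation starts as a single copy, frequency 1/(2N)
        while 0 < count < n_alleles:
            freq = count / n_alleles
            # Next generation: each copy is the mutant with probability freq.
            count = sum(1 for _copy in range(n_alleles) if rng.random() < freq)
        if count == n_alleles:
            fixed += 1
    return fixed / trials


estimate = fixation_fraction(n_individuals=10, trials=5000)
print(estimate, "vs. theoretical", 1 / (2 * 10))
```

The estimate should scatter around 1/(2N) = 0.05 for N = 10; most mutations are lost within a few generations purely by chance.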

58 Number of substitutions per site (K) Genetic distance (K) K = number of substitutions per site  number of substitutions / length of sequence (this controls for the length of sequence) If divergence time (T) is known then Substitution rate ρ = K / (2T) (we divide by 2T because both lineages that come from common ancestor can accumulate mutations independently) Substitution rate is expressed as number of substitutions per site per million years
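The two formulas on this slide translate directly into code. The numbers in the example (12 substitutions over 800 sites, a divergence time of 5 million years) are hypothetical and only serve to show the units.

```python
def genetic_distance(n_substitutions, sequence_length):
    """K: number of substitutions per site."""
    return n_substitutions / sequence_length


def substitution_rate(k, divergence_time_myr):
    """rho = K / (2T): substitutions per site per million years.

    The factor 2 accounts for both lineages accumulating
    substitutions independently since the common ancestor.
    """
    return k / (2 * divergence_time_myr)


k = genetic_distance(12, 800)   # hypothetical counts
rho = substitution_rate(k, 5)   # hypothetical divergence time in Myr
print(k, rho)
```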

59 Estimating genetic distance Genetic distance (K) between two homologous sequences  number of substitutions since they diverged from common ancestor Problem Simple count will underestimate the true number of differences when multiple substitutions have occurred at the same site.

60 Multiple substitutions
G A C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C C G G A C T
T T C C T T C A A T C T C C G G A C T
C A C C T T C A A T C T C C G G A C T
Observed: 3. Actual: 6. Observed: compare the first and the last sequence. Intermediate substitutions are not observed.
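The counts on this slide can be verified by comparing the sequences directly. The sketch below counts the differences between the first and last sequence (observed) and sums the differences between each consecutive pair (actual); the helper name is illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# The five sequences from the slide, oldest first.
lineage = [
    "GACCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACCGGACT",
    "TTCCTTCAATCTCCGGACT",
    "CACCTTCAATCTCCGGACT",
]

observed = hamming(lineage[0], lineage[-1])
actual = sum(hamming(a, b) for a, b in zip(lineage, lineage[1:]))
print(observed, actual)  # 3 6
```

The second position changes A to T and later back to A, so it contributes two actual substitutions but no observed difference.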

61 Multiple substitutions Need probabilistic model to correct for multiple changes True genetic distance: K Observed differences: d Due to back-mutations K ≥ d

62 Saturation. At the extreme: on average one substitution per site across the sequence during evolution (saturation). Two random sequences of equal length will match at approximately ¼ of their sites. At saturation, therefore, the observed proportional distance is ¾. The process of substitution is random: genetic distance (K) and observed differences (d) are random variables; there are various ways to estimate K from d. Sequence evolution is a Markov process: the sequence at time t depends only on the sequence at time t-1.

63 The Jukes-Cantor model Correction for multiple substitutions Assumes that all substitutions are equally likely (e.g. transitions and transversions) Substitution probability per site per second is α Substitution means there are 3 possible replacements  (e.g. C → {A,G,T}) Non-substitution means there is 1 possibility  (e.g. C → C)

64 Transition matrix. Therefore, the one-step Markov process has the following transition matrix M_JC:
      A     C     G     T
A    1-α   α/3   α/3   α/3
C    α/3   1-α   α/3   α/3
G    α/3   α/3   1-α   α/3
T    α/3   α/3   α/3   1-α
This leads to the Jukes-Cantor formula.

65 Jukes-Cantor formula: K = -0.75 ln(1 - (4/3)d). For small d, using ln(1+x) ≈ x: K ≈ d. So: actual distance ≈ observed distance. For saturation: d ↑ ¾: K → ∞. So: if the observed distance corresponds to the distance between random sequences, then the actual distance becomes indeterminate.
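The two limiting cases can be checked numerically. The sketch below assumes the standard form of the Jukes-Cantor correction, K = -(3/4) ln(1 - (4/3) d), which is consistent with both limits stated on this slide; the function name is illustrative.

```python
import math


def jukes_cantor(d):
    """Corrected distance K from the observed proportion of differing sites d."""
    if not 0 <= d < 0.75:
        raise ValueError("d must be in [0, 0.75); K is undefined at saturation")
    return -0.75 * math.log(1 - 4 * d / 3)


for d in (0.01, 0.025, 0.14, 0.5, 0.7):
    print(d, round(jukes_cantor(d), 4))
```

For small d the correction is negligible, while values of d approaching 0.75 give rapidly growing K, matching the statement that the actual distance becomes indeterminate at saturation.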

66 The Kimura two-parameter model. Assumes that transitions are more likely than transversions. Uses two probabilities: transitions α, transversions β.
      a        g        c        t
a   1-α-2β    α        β        β
g    α       1-α-2β    β        β
c    β        β       1-α-2β    α
t    β        β        α       1-α-2β
K = -0.5 ln(1-2P-Q) - 0.25 ln(1-2Q), where P is the fraction of transitions, Q is the fraction of transversions, and d = P + Q.
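The Kimura distance can be computed from two aligned sequences by counting transitions and transversions separately. This is a sketch under the assumption of gap-free, equal-length sequences; the function name and the toy sequences are hypothetical.

```python
import math

# Unordered base pairs that differ by a transition (A<->G, C<->T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def kimura_distance(seq1, seq2):
    """K = -0.5 ln(1 - 2P - Q) - 0.25 ln(1 - 2Q) for two aligned sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    n_transitions = n_transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        if frozenset((a, b)) in TRANSITIONS:
            n_transitions += 1
        else:
            n_transversions += 1
    p = n_transitions / len(seq1)    # P: fraction of transitions
    q = n_transversions / len(seq1)  # Q: fraction of transversions
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Toy example: 20 sites, 2 transitions (A->G) and 1 transversion (A->T).
print(kimura_distance("A" * 20, "GGT" + "A" * 17))
```

Here P = 0.1 and Q = 0.05, so the observed distance is d = 0.15 and the corrected distance comes out slightly larger.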

67 Case study: are Neanderthals still among us? Are Homo sapiens related to Neanderthals? From GenBank: take 206 mtDNAs (modern humans) and 2 Neanderthal mtDNAs. Extract comparable parts from the hypervariable regions for the modern humans (only parts of the HVR were available for Neanderthals): 208 sequences of 800 bp. Compute the genetic distance corrected by the Jukes-Cantor formula.

68 Results. Average distance between any two Homo sapiens: 0.025 (out of 1000 bases, 25 will be different on average). Average distance between Homo sapiens and Neanderthal: 0.14 (140 out of 1000 bases), much higher. Make a matrix of all pairwise distances between sequences and use multi-dimensional scaling for visualization. Alternative: use a phylogenetic tree.
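The comparison of within-group and between-group distances can be sketched as follows. The sequences here are short hypothetical toy strings, not the GenBank data used in the case study, and the correction assumes the standard Jukes-Cantor form K = -(3/4) ln(1 - (4/3) d).

```python
import math
from itertools import combinations, product


def jc_distance(a, b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    d = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1 - 4 * d / 3)


def mean_distance(pairs):
    """Average corrected distance over an iterable of sequence pairs."""
    pairs = list(pairs)
    return sum(jc_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data standing in for the two groups of sequences.
humans = ["ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA", "ACGTACGTACGTACGTTCGT"]
neanderthals = ["ACGAACTTACGGACGTCCGT", "ACGAACTTACGGACGTCCGA"]

within = mean_distance(combinations(humans, 2))
between = mean_distance(product(humans, neanderthals))
print(round(within, 3), round(between, 3))
```

Filling a full matrix with `jc_distance` for every pair of sequences gives the input needed for multi-dimensional scaling or for building a distance-based tree.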

69 Results: multi-dimensional scaling. Neanderthals: not a sub-population of human (different species). Distances between points reflect genetic distance.

70 Results: phylogenetic distance. Neanderthals: more closely related to human. [Tree labels: Homo sapiens, apes]
