1
Schedule
10-feb, 9.00-12.00 (A1.14): Chapter 5, Genetic Variation, SNPs, Haplotype, Genetic Distance. No computer practicum.
11-feb, 9.00-11.30 (C1.112): Score matrices. Computer exercises: (B1.24H&L)
15-feb, (A1.10): Chapter 6, Natural Selection / Ka/Ks / HIV
16-feb, (A1.06): Chapter 7, Phylogenetics / SARS. Computer exercises: (B1.24H&L)
2
Mini projects: literature and computation study
Aim
To further demonstrate how bioinformatics is used in scientific research, we defined a complementary track during which you will perform (a) a literature study on a selected topic and (b) computations related to the topic investigated.
Literature report: about 3000 words, in English
Computational exercise: programming language of choice (e.g., Perl, Matlab)
Example topics for literature study
We selected several topics that you may choose for your literature study and computations. You are free to propose another topic, but it has to be approved first (please contact Antoine van Kampen).
Before you start: contact Antoine van Kampen
Deadline: April 1, 2010
3
Course material EDUCATION BIOMEDICAL SCIENCES
4
Chapter 5 Are Neanderthals among us
Chapter 5: Are Neanderthals among us? Variation within and between species
Prof. dr. Antoine van Kampen
Biosystems Data Analysis, Swammerdam Institute for Life Sciences
Bioinformatics Laboratory, Academic Medical Centre
5
Neanderthal Man
Skeleton discovered in 1856 in Germany (Neander Thal)
Skeleton dated to about 44 thousand years ago
Unusual features of the skeleton
Belongs to an ancient species of hominid: any member of the biological family Hominidae (the "great apes")
Biologically different from modern humans
[Figure: first reconstruction of Neanderthal Man]
6
Are Neanderthals our ancestors?
Are modern Europeans the offspring of Neanderthals?
This question has recently been settled by genetic analysis.
7
Human evolution and spread
8
Consider dynamic nature of DNA
Determine how DNA sequences change over time.
Use this information to infer the history and function of different parts of the genome.
Variation data is also used in:
  Medical research
  Forensics
  Genome annotation
9
Variation in DNA sequences 1
Questions about human origins can be answered by exploiting the fact that every individual has a slightly different genome sequence.
Differences exist within and between species; even siblings with the same parents differ.
Variation in DNA accumulates via:
  Mutations (mistakes made by the cellular machinery that are then encoded in the genome). The cell's proof-reading machinery is very good. Estimate: one mistake for every 2 million to 1 billion bases.
  Recombination (when the organism is diploid).
10
Variation in DNA sequences 2
Mutation rates differ between organisms.
The mutation rate also differs between the mitochondrial and the nuclear genome. In most animals the rate in the mitochondrial genome is higher than in the nuclear genome.
New mutations at any one nucleotide position are rare.
Most genetic differences between individuals are inherited mutations and not newly arisen variants.
  We exploit this fact to study the history of species.
  Shared mutations are indicative of shared ancestry.
11
Germline mutations
Mutations occur at every cell duplication, because the genome is replicated each time.
Creating a fully grown human takes trillions of cells, each of which dies off and is replaced multiple times during a lifetime.
This is one reason that cancer is an illness of the elderly.
Mutations in skin cells or heart-muscle cells are not passed on to our offspring.
Only mutations that occur in the germline cells have a chance of spreading through the population.
There are exceptions, such as plants.
12
Mutations Can Occur in Germ-Line or Somatic Cells
Geneticists classify animal cells into two types:
  Germ-line cells: cells that give rise to gametes such as eggs and sperm
  Somatic cells: all other cells
Germ-line mutations are those that occur directly in a sperm or egg cell, or in one of their precursor cells.
Somatic mutations are those that occur directly in a body cell, or in one of its precursor cells.
13
The size of the patch will depend on the timing of the mutation
The earlier the mutation occurs, the larger the patch of affected cells.
Germ-line mutation: the mutation can be passed on to future generations.
Somatic mutation: the mutation cannot be passed on to future generations.
14
Mutations
Neutral: no effect (this might still be a non-synonymous mutation!)
Deleterious: disrupts some biological function
Advantageous: improves some biological function
If the mutation is not passed on to a child, it is lost.
Polymorphism: any difference among individuals at a specific position in the genome (regardless of frequency)
Point mutation: change of one base. When such mutations are polymorphic within a species, we call them Single Nucleotide Polymorphisms (SNPs).
The various versions of the DNA sequence are called alleles: we might find a SNP with an A allele and a T allele at a certain position.
15
Example: sequence polymorphism
GTCCTTCATAATCATCACGGGACT
GACCTTCATAACCATCACGGGACT
AACCTTCATAACCATCTCCGGACC
3 sequences, 6 polymorphisms
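The count of 6 polymorphisms can be checked by scanning the alignment column by column. The following is a minimal sketch; the function name `polymorphic_sites` is chosen here for illustration and is not part of the course material.

```python
def polymorphic_sites(seqs):
    """Return the 1-based positions at which aligned sequences are not all identical."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) walks through the alignment one column at a time
    return [i + 1 for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


seqs = [
    "GTCCTTCATAATCATCACGGGACT",
    "GACCTTCATAACCATCACGGGACT",
    "AACCTTCATAACCATCTCCGGACC",
]
sites = polymorphic_sites(seqs)
print(len(sites), sites)  # 6 [1, 2, 12, 17, 19, 24]
```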
17
Hemoglobin beta gene (HBB)
18
Human variation
SNPs
  SNPs account for a large part of genetic variation.
  Humans: on average 1 SNP per 1500 bp, so any two sequences will differ at about 0.067% of positions.
Short tandem repeats (STRs or microsatellites)
  Repeats of short DNA words (e.g., CACACACACA)
  Due to slippage during replication
  The mutation rate at microsatellites is much higher than for SNPs.
Rare types of variation
  Indels: insertions and deletions
  Rearrangements: inversions, duplications, transpositions (copy-paste or cut-paste of genome sequences)
19
Applications of microsatellites
Forensics. In forensic identification cases, the goal is typically to link a suspect with a sample of blood, semen or hair taken from a crime scene. Alternatively, the goal may be to link a sample found on a suspect's clothing with a victim. Relatedness testing in criminal work may involve investigating paternity in order to establish rape or incest. Because the lengths of microsatellites may vary from one person to the next, scientists use them for these applications, a procedure known as DNA profiling or "fingerprinting".
20
Applications of microsatellites
Diagnosis and identification of human diseases. Because microsatellites change in length early in the development of some cancers, they are useful markers for early cancer detection. Because they are polymorphic, they are useful in linkage studies, which attempt to locate genes responsible for various genetic disorders.
21
Transitions and transversions
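Not all point mutations are equally likely, even if the mutation has no effect.
This is due to the molecular structure of the nucleotides.
Transitions exchange a purine for a purine (A↔G) or a pyrimidine for a pyrimidine (C↔T); transversions exchange a purine for a pyrimidine or vice versa.
There are 4 possible transitions and 8 possible transversions.
Transitions are more common than transversions.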
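The counts of 4 transitions and 8 transversions follow from enumerating all 12 ordered base changes. The sketch below assumes the usual chemical classes (purines A and G, pyrimidines C and T); the helper name `classify` is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Classify the point mutation a -> b as a 'transition' or a 'transversion'."""
    if a == b:
        raise ValueError("not a mutation: both bases are identical")
    # A transition stays within one chemical class; a transversion switches class
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


counts = {"transition": 0, "transversion": 0}
for a in "ACGT":
    for b in "ACGT":
        if a != b:
            counts[classify(a, b)] += 1
print(counts)  # {'transition': 4, 'transversion': 8}
```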
22
Genetic code is more robust to transitions
When exactly two synonymous codons code for the same amino acid (AA), they always differ by a single transition.
Transitions within a coding sequence are therefore on average less harmful than transversions.
23
DNA and amino acid substitutions
Not all nucleotide mutations occur with the same frequency:
  Due to chemical structure
  Due to the sometimes deleterious consequence of a change in DNA
Not all changes between amino acids are seen with the same frequency:
  Sometimes because two AA are multiple nucleotide-mutation steps apart (e.g., Ala → Cys via GCA → TGT requires 3 mutations)
  Some AA are more interchangeable due to biochemical characteristics such as size, polarity and hydrophobicity
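The number of mutation steps between two codons is simply the number of positions at which they differ. The sketch below checks the GCA/TGT example; the codon lists for Ala and Cys come from the standard genetic code and are added here for illustration, they are not listed on the slide.

```python
def codon_distance(c1, c2):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(c1, c2))


# Standard genetic code (assumption: nuclear code, DNA alphabet)
ALA_CODONS = ["GCT", "GCC", "GCA", "GCG"]
CYS_CODONS = ["TGT", "TGC"]

print(codon_distance("GCA", "TGT"))  # 3, the example from the slide

# Fewest steps over all Ala/Cys codon pairs
closest = min(codon_distance(a, c) for a in ALA_CODONS for c in CYS_CODONS)
print(closest)  # 2 (e.g. GCT -> TGT)
```

Note that the 3-step figure applies to the specific codon pair GCA and TGT; starting from GCT or GCC, two changes are enough.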
24
Genotyping Analysis of DNA-sequence variation
Human DNA sequence is 99.9% identical between individuals; the remainder consists of varying nucleotides.
Polymorphism: normal variation between individuals
Genetic variation:
  May cause or predispose to inheritable diseases
  Determines e.g. individual drug response
  Is used as markers to identify disease genes
25
Important terms
Allele
  Alternative form of a gene or DNA sequence at a specific chromosomal location (locus)
  At each locus an individual possesses two alleles, one inherited from each parent
Genotype
  Genetic constitution of an individual; combination of alleles
Genetic marker
  Polymorphisms that are highly variable between individuals: microsatellites and single nucleotide polymorphisms (SNPs)
  A marker may be inherited together with the disease-predisposing gene because of linkage disequilibrium (LD)
26
Linkage disequilibrium, LD
Alleles are in LD if they are inherited together more often than would be expected based on their allele frequencies.
Two loci are inherited together because recombination during meiosis (formation of gametes) only rarely separates them.
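"More often than expected based on allele frequencies" can be made concrete with the commonly used coefficient D = p(AB) - p(A)·p(B) and its normalised form r². These measures are not defined on the slide; the sketch below introduces them as an assumption, and the haplotype counts are made up for illustration.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two bi-allelic loci, given counts of the four haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total                 # observed frequency of haplotype AB
    p_A = (n_AB + n_Ab) / total         # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total         # frequency of allele B at locus 2
    D = p_AB - p_A * p_B                # excess over the expectation p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, r2


# Hypothetical counts: alleles always travel together (complete LD)
print(ld_stats(50, 0, 0, 50))    # (0.25, 1.0)
# Hypothetical counts: alleles combine independently (no LD)
print(ld_stats(25, 25, 25, 25))  # (0.0, 0.0)
```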
27
Haplotype
A haplotype is a series of genetic variants (e.g., SNPs) on one chromosome that are inherited from one parent.
In subsequent generations the chromosomal haplotype is broken up by crossing-over events in meiosis.
In practice, "haplotype" refers to closely linked genetic loci.
SNPs that are located in close proximity tend to travel together; this is known as linkage disequilibrium (LD).
In general, loci that are located more closely together on a chromosome will be in stronger LD.
The correlation between LD and the physical distance separating two loci is modest: some loci separated by only 20 bp will not be in LD, while other loci separated by a much larger distance will be in tight LD.
28
Haplotype
Multiple loci on the same chromosome that are inherited together
Usually a string of SNPs that are linked
[Figure: loci, their alleles, and the resulting haplotypes (combinations of three alleles on one chromosome)]
29
Haplotype construction
No good experimental methods available to identify haplotypes → Computational methods to create haplotypes from genotype data
30
...Haplotype construction
Family-based haplotype construction
  Linkage analysis software: Simwalk, Merlin, Genehunter, Allegro...
Population-based haplotype construction
  Not as reliable as family-based
  EM algorithm (expectation-maximization algorithm), described in SnpHap
  PHASE
31
Haplotype blocks
Low recombination rate in the region
Strong LD
Low haplotype diversity
A small number of SNPs in the block is enough to identify the common haplotypes; these are called tag SNPs.
32
Formation of haplotype blocks
Recombination events that shuffle the components of a haplotype do not occur at random.
Some locations in the genome have much higher recombination rates: recombination hotspots.
The occurrence of recombination hotspots has contributed to the limited haplotype diversity of the genome.
There are fewer observed haplotypes than would be expected by chance.
The size of haplotype blocks varies from about 9 kb to over 100 kb.
What is the size of a gene?
33
Average gene size: 10-15kb
34
Formation of haplotype blocks
[Figure: recombination between two chromosomes (1 and 2) during meiosis]
35
[Figure: haplotype structure after a few generations and after hundreds of generations]
36
Block size: 1-150 kb
Average block size
  African populations: 11 kb
  Non-African populations: 22 kb
60%-80% of the genome is in blocks of > 10 kb
37
Block frequencies Typically, only 3-5 common haplotypes account for >90% of the observed haplotypes
38
Benefits of haplotypes instead of individual SNPs
Information content is higher.
Gene function may depend on more than one SNP.
A smaller number of markers is required.
The number of false-positive associations is reduced.
Missing genotypes can be replaced by computational methods.
Genotyping errors can be eliminated.
Challenges:
  Haplotypes are difficult to determine directly in the lab, so computational methods are needed.
  Defining block borders is ambiguous; several different algorithms exist.
39
Haplotype example
Example: β2-adrenergic receptor gene (ADRB2)
Consider 8 SNPs, each with two alleles.
One would expect 2^8 = 256 haplotypes.
The observed number of haplotypes is much smaller, and only 3 haplotypes are estimated to occur with large frequency.
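The expected number of haplotypes is a simple enumeration: every bi-allelic SNP doubles the number of possible combinations. A minimal sketch, with the two alleles of each SNP coded as "0" and "1" purely for illustration:

```python
from itertools import product


def all_haplotypes(n_snps, alleles="01"):
    """Enumerate every possible haplotype over n_snps bi-allelic SNPs."""
    return ["".join(h) for h in product(alleles, repeat=n_snps)]


for n in (1, 2, 3, 8):
    print(n, len(all_haplotypes(n)))  # 2, 4, 8, 256
```

Against these 256 theoretical combinations, the handful of haplotypes actually observed for ADRB2 shows how strongly LD restricts haplotype diversity.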
40
Genotypes and haplotypes
Diploid organism. 2 bi-allelic loci on the same chromosome (e.g., SNPs). First locus alleles A and T: 3 genotypes AA, AT, and TT. Second locus alleles G and C: 3 genotypes GG, GC, and CC. Individual: 9 possible configurations for the genotypes at these two loci. Punnett square (next slide) shows the possible genotypes that an individual may carry and the corresponding haplotypes.
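The 9 genotype configurations can be listed by combining the 3 unordered genotypes at each locus. This is a small sketch; the helper name `genotypes` is illustrative.

```python
from itertools import combinations_with_replacement, product


def genotypes(alleles):
    """Unordered genotypes at one locus of a diploid organism."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


locus1 = genotypes("AT")  # ['AA', 'AT', 'TT']
locus2 = genotypes("GC")  # ['GG', 'GC', 'CC']

configurations = list(product(locus1, locus2))
for g1, g2 in configurations:
    print(g1, g2)
print(len(configurations))  # 9
```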
41
Genotypes and haplotypes
The number of haplotypes grows exponentially with the number of polymorphisms.
[Figure: a pair of homologous chromosomes with the haplotype, the genotype, and heterozygous loci labeled]
Question: what is the haplotype given the genotype?
42
Nuclear DNA: Genotypes and haplotypes
[Figure: pairs of homologous chromosomes carrying alleles G/C at one locus and A at the other, with the haplotype labeled]
Homologous chromosomes are chromosomes in a biological cell that pair (synapse) during meiosis.
Humans have 22 pairs of homologous non-sex chromosomes (called autosomes).
During meiosis, a homologous pair consists of two pairs of sister chromatids that go through the process of crossing over.
An organism's genotype may not uniquely define its haplotype.
43
Nuclear DNA: Genotypes and haplotypes
[Figure: the same chromosome pairs, extended with alleles A/T at the second locus; the haplotype is labeled]
45
Nuclear DNA: Genotypes and haplotypes
[Figure: an individual heterozygous at both loci (G/C and A/T). The haplotypes may be G-A and C-T, or C-A and G-T: the phase is ambiguous.]
An organism's genotype may not uniquely define its haplotype.
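The ambiguity can be shown by listing every pair of haplotypes that is compatible with a given genotype. The sketch below is an illustration only; the function name and the representation of a genotype as a list of allele pairs are assumptions made here.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: one (allele, allele) tuple per locus, e.g. [("A", "T"), ("G", "C")].
    """
    pairs = set()
    # For each locus, choose which of its two alleles sits on the first chromosome
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        hap2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Heterozygous at one locus only: the haplotypes are determined
print(compatible_haplotype_pairs([("A", "A"), ("G", "C")]))  # [('AC', 'AG')]
# Heterozygous at both loci: two phasings are possible
print(compatible_haplotype_pairs([("A", "T"), ("G", "C")]))  # [('AC', 'TG'), ('AG', 'TC')]
```

With k heterozygous loci there are 2^(k-1) possible phasings, which is why computational haplotype construction is needed.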
46
Mitochondrial DNA (mtDNA): a model for the analysis of variation
mtDNA is ideal for studying human evolution because of:
  Its high mutation rate: mitochondria contain a high number of mutagenic oxygen molecules, which leads to a high mutation rate
  Other technical advantages (e.g., easier to isolate)
Small circular genome with 37 genes in humans, comprising protein-coding genes and RNA genes
Slightly different genetic code than the nuclear genome
47
Mitochondrial DNA (mtDNA)
D-loop
  Non-coding sequence
  Origin of replication
  Promoter
  Hypervariable regions (HVR-I, HVR-II)
HVR: high variability among humans. Ideal for studying the relationships among individuals.
48
Advantage of mtDNA mtDNA inherited only from mother
Every individual will have only one version of mtDNA We automatically know the haplotype
49
mtDNA: Genotypes and haplotypes
[Figure: a single mtDNA molecule with its alleles]
mtDNA is only passed down through the mother. Since we have only one version, we automatically know the haplotype if we know the genotype.
50
Variation between species
Genetic differences between species are responsible for differences in:
  Behavior
  Morphology
  Physiology
Variation between species:
  Tells us about relationships between species (closely related species have on average more similar DNA)
  Tells us how evolution proceeded over millions of years
Key to understanding differences between species: the number of nucleotide substitutions that separate two DNA sequences
51
Substitution rate
The substitution rate between homologous sequences from different species tells us about:
  Time since divergence of the species
  Biological function of genomic sequences
  Relationships among species
A mutation:
  Originates in a single individual
  May be lost if the individual leaves no offspring
  May become fixed throughout the species: every individual in the species will then have the new allele at the specific nucleotide position
Substitution rate: the rate at which species accumulate such fixed differences
52
Substitution vs. polymorphism
What happens after a mutation changes a nucleotide in a locus?
Polymorphism: the mutant allele is one of several present in the population.
Substitution: the mutant allele fixes in the population. (New mutations at other nucleotides may occur later.)
53
Substitution schematic
Individuals (columns) over time (rows):
Time 0:  aaat aaat aaat aaat aaat aaat aaat
Time 10: aaat aaat aaat aaat acat aaat aaat  (1)
Time 20: aaat aaat acat aaat acat acat acat  (1)
Time 30: acat acat acat acat acat acat acat  (2)
Time 40: acat acat actt acat acat acat acat  (3)
(1) times 10-29: polymorphism
(2) time 30: mutation fixed -> substitution
(3) time 40: new mutation: polymorphism
54
Substitution rates for neutral mutations
Most neutral mutations are lost; only 1 out of 2N becomes fixed.
Most of those that are lost disappear quickly (in fewer than 20 generations).
The substitution rate is expressed as the number of substitutions per site per million years.
55
Neutral theory (neutral mutations)
Population size N = 10; each individual carries 2 copies of each gene, so there are 2N = 20 alleles.
[Figure: ten individuals with genotype Aa; after a mutation (rate = µ) one individual carries a new allele B]
Any new mutation is initially present at a frequency of 1/(2N), which equals its chance of becoming fixed in the population.
Number of new mutations: 2Nµ
Substitution rate: ρ = 2Nµ · 1/(2N) = µ
56
Neutral theory
ρ = 2Nµ · 1/(2N) = µ (e.g., 2 mutations per site per Myear)
The substitution rate of new neutral mutations is independent of the population size and is equal to the neutral mutation rate.
Larger populations:
  Create more mutations
  Each mutation has a smaller chance of becoming fixed
Smaller populations:
  Create fewer mutations
  Each mutation has a larger chance of becoming fixed
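The claim that a new neutral mutation fixes with probability 1/(2N) can be checked by simulation. The sketch below assumes a simple Wright-Fisher model (each generation, all 2N allele copies are resampled at random from the previous generation); this model, the function names, and the fixed random seed are choices made here, not part of the slides.

```python
import random


def mutation_fixes(two_n, rng):
    """Follow one new neutral mutation until it is lost or fixed; return True if fixed."""
    count = 1  # the mutation starts as a single copy among two_n alleles
    while 0 < count < two_n:
        freq = count / two_n
        # Each allele copy in the next generation descends from a random parent copy
        count = sum(1 for _ in range(two_n) if rng.random() < freq)
    return count == two_n


def fixation_fraction(n_individuals, trials, seed=0):
    """Fraction of independent new mutations that reach fixation."""
    rng = random.Random(seed)
    two_n = 2 * n_individuals
    return sum(mutation_fixes(two_n, rng) for _ in range(trials)) / trials


frac = fixation_fraction(n_individuals=10, trials=5000)
print(frac, "expected about", 1 / 20)
```

For N = 10 the simulated fraction should come out near 1/(2N) = 0.05, the value used in the formula ρ = 2Nµ · 1/(2N) = µ.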
57
Genetic drift
Change in gene frequency due to chance fluctuations in a finite population.
The 'population' on the left is genetically variable; it contains three alleles (blue, white and yellow). By chance alone, the blue alleles are overrepresented in the new population and the yellow allele has been lost completely.
There are two results of genetic drift:
1. The surviving population differs genetically from the original population.
2. Genetic variation has been lost.
58
Number of substitutions per site (K)
Genetic distance (K)
K = number of substitutions per site = number of substitutions / length of sequence (this controls for the length of the sequence)
If the divergence time (T) is known, then the substitution rate is ρ = K / (2T)
(we divide by 2T because both lineages that descend from the common ancestor accumulate mutations independently)
The substitution rate is expressed as the number of substitutions per site per million years.
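The two formulas K = substitutions / length and ρ = K / (2T) amount to a few lines of arithmetic. In the sketch below the input values (30 substitutions, 1000 sites, 5 million years) are hypothetical and chosen only to make the numbers easy to follow.

```python
def genetic_distance(n_substitutions, seq_length):
    """K: number of substitutions per site."""
    return n_substitutions / seq_length


def substitution_rate(k, divergence_time_myr):
    """rho = K / (2T): substitutions per site per million years.

    The factor 2 accounts for the two lineages that diverged from the common ancestor.
    """
    return k / (2 * divergence_time_myr)


k = genetic_distance(30, 1000)    # hypothetical: 30 substitutions over 1000 sites
rho = substitution_rate(k, 5.0)   # hypothetical: divergence 5 million years ago
print(k, rho)                     # K = 0.03, rho = 0.003 per site per Myr
```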
59
Estimating genetic distance
Genetic distance (K) between two homologous sequences: the number of substitutions since they diverged from their common ancestor.
Problem: a simple count will underestimate the true number of differences when multiple substitutions have occurred at the same site.
60
Multiple substitutions
G A C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C C G G A C T
T T C C T T C A A T C T C C G G A C T
C A C C T T C A A T C T C C G G A C T
Observed: 3
Actual: 6
Observed: compare the first and the last sequence. Intermediate substitutions are not observed.
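The gap between observed and actual substitutions can be reproduced from the five sequences of this example. A minimal sketch, reading the sequences as successive states of one lineage (the variable names are illustrative):

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# The five successive states of the sequence, oldest first
history = [
    "GACCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACCGGACT",
    "TTCCTTCAATCTCCGGACT",
    "CACCTTCAATCTCCGGACT",
]

observed = hamming(history[0], history[-1])                      # first vs. last only
actual = sum(hamming(a, b) for a, b in zip(history, history[1:]))  # every single step
print(observed, actual)  # 3 6
```

Positions 1 and 2 each change twice; position 2 even returns to its original base, so it contributes nothing to the observed count.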
61
Multiple substitutions
True genetic distance: K
Observed differences: d
Because of multiple substitutions and back-mutations at the same site, K ≥ d.
A probabilistic model is needed to correct for multiple changes.
62
Saturation
At the extreme: on average one substitution per site across the sequence during evolution (saturation).
Two random sequences of equal length will match at approximately ¼ of their sites.
At saturation, therefore, the observed proportion of differing sites (d) approaches ¾.
The process of substitution is random, so genetic distance (K) and observed differences (d) are random variables.
There are various ways to estimate K from d.
Sequence evolution is modeled as a Markov process: the sequence at time t depends only on the sequence at time t-1.
63
The Jukes-Cantor model
Correction for multiple substitutions
Assumes that all substitutions are equally likely (e.g., transitions and transversions)
The substitution probability per site per second is α
Substitution means there are 3 possible replacements (e.g., C → {A, G, T})
Non-substitution means there is 1 possibility (e.g., C → C)
64
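Transition matrix M_JC
Therefore, the one-step Markov process has the following transition matrix (rows and columns ordered A, C, G, T):
$$
M_{JC} =
\begin{pmatrix}
1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\
\alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\
\alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\
\alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha
\end{pmatrix}
$$
This leads to the Jukes-Cantor formula.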
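To see how this one-step matrix behaves over many steps, it can be multiplied with itself repeatedly. The sketch below does this in plain Python and compares the probability that a site still shows its original base with the closed form 1/4 + 3/4·(1 - 4α/3)^t. That closed form follows from the matrix but is not stated on the slide, and the value α = 0.01 is an arbitrary choice for illustration.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor matrix: 1 - alpha on the diagonal, alpha/3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def matrix_power(m, t):
    """Multiply m with itself t times (t-step transition probabilities)."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = matmul(result, m)
    return result


alpha = 0.01  # illustrative value only
for t in (1, 10, 100, 1000):
    p_same = matrix_power(jc_matrix(alpha), t)[0][0]
    closed_form = 0.25 + 0.75 * (1 - 4 * alpha / 3) ** t
    print(t, round(p_same, 4), round(closed_form, 4))
```

As t grows, the probability of still seeing the original base tends to 1/4, which is the match rate of two random sequences.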
65
Jukes-Cantor formula
$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$$
For small d, using ln(1+x) ≈ x: K ≈ d
So: actual distance ≈ observed distance
For saturation: d ↑ ¾ : K → ∞
So: if the observed distance corresponds to the distance between random sequences, the actual distance becomes indeterminate.
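The two limits can be checked numerically with the standard form of the Jukes-Cantor correction, K = -(3/4)·ln(1 - (4/3)·d). A minimal sketch (the function name is illustrative):

```python
import math


def jukes_cantor(d):
    """Jukes-Cantor distance K for an observed proportion of differing sites d."""
    if not 0 <= d < 0.75:
        raise ValueError("d must lie in [0, 0.75); at d = 3/4 the distance is undefined")
    return -0.75 * math.log(1 - 4 * d / 3)


for d in (0.01, 0.1, 0.3, 0.5, 0.7, 0.74):
    print(d, round(jukes_cantor(d), 4))
```

For d = 0.01 the correction is negligible (K is about 0.0101), while for d = 0.74 the estimate is already above 3 substitutions per site.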
66
The Kimura two-parameter model
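Assumes that transitions are more likely than transversions.
Uses two probabilities: transitions α, transversions β.
Transition matrix (rows and columns ordered a, g, c, t):
$$
\begin{pmatrix}
1-\alpha-2\beta & \alpha & \beta & \beta \\
\alpha & 1-\alpha-2\beta & \beta & \beta \\
\beta & \beta & 1-\alpha-2\beta & \alpha \\
\beta & \beta & \alpha & 1-\alpha-2\beta
\end{pmatrix}
$$
$$K = -\frac{1}{2}\ln(1-2P-Q) - \frac{1}{4}\ln(1-2Q)$$
P: fraction of transitions
Q: fraction of transversions
d = P + Q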
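The Kimura formula K = -0.5 ln(1-2P-Q) - 0.25 ln(1-2Q) translates directly into code. The sketch below also checks a useful sanity case: when transversions are exactly twice as frequent as transitions (P = d/3, Q = 2d/3, as expected if all substitutions are equally likely), the result coincides with the Jukes-Cantor value. The input fractions are hypothetical.

```python
import math


def kimura_2p(p, q):
    """Kimura two-parameter distance from the fractions of transitions (p) and transversions (q)."""
    a = 1 - 2 * p - q
    b = 1 - 2 * q
    if a <= 0 or b <= 0:
        raise ValueError("distance undefined: sequences are too divergent")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def jukes_cantor(d):
    """Jukes-Cantor distance, used here only for comparison."""
    return -0.75 * math.log(1 - 4 * d / 3)


# Hypothetical data: more transitions than transversions
print(round(kimura_2p(0.10, 0.05), 4))

# Equal substitution probabilities: P = d/3, Q = 2d/3 reproduces Jukes-Cantor
d = 0.3
print(round(kimura_2p(d / 3, 2 * d / 3), 6), round(jukes_cantor(d), 6))
```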
67
Case study: are Neanderthals still among us?
Are Homo sapiens related to Neanderthals?
From GenBank:
  Take 206 mtDNA sequences (modern humans)
  Take 2 Neanderthal mtDNA sequences
Extract comparable parts from the hypervariable regions for the modern humans (only parts of the HVR were available for Neanderthals).
This gives 208 sequences of 800 bp.
Compute the genetic distance, corrected by the Jukes-Cantor formula.
68
Results
Average distance between any two Homo sapiens: 0.025 (out of 1000 bases, 25 will be different on average)
Average distance between Homo sapiens and Neanderthal: 0.140 (140 out of 1000 bases), which is much higher
Make a matrix of all pair-wise distances between the sequences and use multi-dimensional scaling for visualization.
Alternative: use a phylogenetic tree.
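The analysis pipeline (pairwise differences, Jukes-Cantor correction, within-group versus between-group averages) can be sketched on toy data. The four 20-base sequences below are invented for illustration and are not the GenBank sequences used in the case study; the group labels are likewise hypothetical.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Observed proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(d):
    """Jukes-Cantor corrected distance."""
    return -0.75 * math.log(1 - 4 * d / 3)


# Toy sequences (hypothetical), labelled by group
sequences = {
    "human_1": "ACGTACGTACGTACGTACGT",
    "human_2": "ACGTACGTACGTACGTACGA",
    "human_3": "ACGTACGTACGAACGTACGT",
    "neanderthal_1": "TCGAACGTTCGTACCTACGG",
}

within, between = [], []
for (name1, s1), (name2, s2) in combinations(sequences.items(), 2):
    k = jukes_cantor(p_distance(s1, s2))
    same_group = name1.split("_")[0] == name2.split("_")[0]
    (within if same_group else between).append(k)

avg_within = sum(within) / len(within)
avg_between = sum(between) / len(between)
print(round(avg_within, 3), round(avg_between, 3))
```

In the real study the same comparison is made over all 208 sequences, and the full distance matrix is then passed to multi-dimensional scaling or tree building.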
69
Results: multi-dimensional scaling
Neanderthals: not a sub-population of humans (different species)
[Figure: multi-dimensional scaling plot; distances between points reflect genetic distance]
70
Results: phylogenetic distance
Neanderthals: more closely related to humans
[Figure: phylogenetic tree; labeled groups include Homo sapiens and apes]