Analyzing human population genetic history through the study of genetic variation
Mark Mata
Mentor: Eleazar Eskin
UCLA ZarLab
SoCalBSI 2009
Background
To study human population genetic history is to study part of human evolution.
Human evolution is one of the fundamental questions in science.
We ask ourselves questions such as: Where do we come from? Why are we all different? How are we all different?
Background
The ZarLab studies the most recent events in human evolution: now that modern humans exist, what variations have arisen in our genes since our ancient African ancestors?
To answer this question, our group is looking at human variation to produce a genetic history of these changes.
Why do we care?
Many diseases are caused by variations that have occurred in our genetic history.
A better understanding of our genetic history and human variation may eventually lead to better treatment plans.
Personalized medicine: “The right drug, in the right dose, to the right person, at the right time.” (PerkinElmer website)
Human Variation
Modern humans share 99.9% of their DNA; the remaining 0.1% accounts for the variation between humans.
Of this variation, 80% is the result of SNPs.
SNP (single-nucleotide polymorphism): a position in the genome where two different bases are present in the population.
The base at a SNP on a chromosome is referred to as the “allele”.
A haplotype is the sequence of alleles along a chromosome.
The other 20% comes from deletions or insertions in the genome.
(Source: PerkinElmer website)
Human Variation
We study the 80% of variation that comes in the form of SNPs.
The alleles at these SNPs are compiled into ordered lists, which are called haplotypes.
Deletions and insertions are “ignored” because of the limitations of the microarrays from which the data are generated.
International HapMap Project
Study done by the International HapMap Consortium to “…create a public, genome-wide database of common human sequence variation…”
Identified SNPs and compiled the SNP alleles into a database of haplotypes for four different populations (Phase 1).
The population used here was a group of 60 Mormons in Utah:
  Widely studied in the past
  Of Western and Northern European descent
  Have very detailed records
We used their chromosome 19.
“A haplotype map of the human genome”, The International HapMap Consortium, Nature, published 27 October 2005.
My Project Goals
Reconstruct human genetic history. This is a very difficult problem.
Sub-problem: identify recent genetic events.
  We assume that these new genetic events are rare (few in number).
  They are easier to classify and relate to one another than older, more common haplotypes.
These new events are important because they identify shared recent ancestry.
Disease-causing variations could come from recent events.
Identifying Recent Genetic Events
1. Select a region in a haplotype and find the frequency of variation
2. Group variations into common and rare
3. Find recent point mutations
4. Find recent recombination events
Workflow
[Figure: three-column workflow diagram. Column 1, individual's haplotypes: 15-SNP sequences such as TTTTTTTTTTTTTTT, AAAAAAAAAAAAAAA, AAAAAAAAATTTTTT, AATTTTTTTTTTTTT and TTTTTTATTTTTTTT. Column 2, frequency of variation in the first 10 SNPs: common AAAAAAAAAA (49%) and TTTTTTTTTT (48%); rare AAAAAAAAAT, AATTTTTTTT and TTTTTTATTT (1% each). Column 3, identified events: AAAAAAAAAT* and TTTTTTA*TTT (point mutations), AA|TTTTTTTT (recombination).]
Choosing a region size
The region must be large enough to pick up many different variations but small enough to see what caused each variation.
Tests started with a region of 20 nucleotides and used progressively smaller regions; a region size of 10 nucleotides worked best.
Region Size 20
Region Size 10
Frequency of Variation
Each individual's haplotype is cut to the selected region (the first 10 SNPs), and identical regions are counted across the 120 haplotypes:

Region        How many   Frequency of variation
AAAAAAAAAA    59         59/120 ~ 49%
TTTTTTTTTT    58         58/120 ~ 48%
AAAAAAAAAT     1          1/120 ~ 1%
AATTTTTTTT     1          1/120 ~ 1%
TTTTTTATTT     1          1/120 ~ 1%
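The counting step on this slide can be sketched in Python. This is a minimal illustration, not the project's actual code: the function name `region_frequencies` and the toy data are assumptions, built only to reproduce the 59/58/1/1/1 counts shown above.

```python
from collections import Counter

def region_frequencies(haplotypes, start, size=10):
    """Count each distinct allele string found in haplotype[start:start+size].

    Returns {region: (count, fraction_of_all_haplotypes)}.
    """
    regions = [h[start:start + size] for h in haplotypes]
    counts = Counter(regions)
    total = len(regions)
    return {region: (n, n / total) for region, n in counts.items()}

# Toy data (assumed) mirroring the slide: 120 haplotypes of 15 SNPs each.
haplotypes = (
    ["A" * 15] * 59
    + ["T" * 15] * 58
    + ["AAAAAAAAATTTTTT", "AATTTTTTTTTTTTT", "TTTTTTATTTTTTTT"]
)

freqs = region_frequencies(haplotypes, start=0, size=10)
for region, (n, fraction) in sorted(freqs.items(), key=lambda kv: -kv[1][0]):
    print(f"{region}  {n:3d}/{len(haplotypes)}  {fraction:.1%}")
```

Keeping the raw count next to the fraction makes it easy to check the percentages against the 120 haplotypes in the sample.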
Grouping Variations
Regions are classified as either common or rare haplotypes.
We assume that new genetic events are rare (few in number).
A cutoff of 5% frequency separates common subsequences (5% or higher) from rare subsequences (below 5%).
The 5% figure comes from the International HapMap Consortium study.
“A haplotype map of the human genome”, The International HapMap Consortium, Nature, published 27 October 2005.
Grouping Variations
[Figure: the frequency table split into two groups.]
Common: AAAAAAAAAA (49%), TTTTTTTTTT (48%)
Rare: AAAAAAAAAT (1%), AATTTTTTTT (1%), TTTTTTATTT (1%)
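The grouping step can be sketched as a simple threshold on frequency. The function name `group_by_frequency` is hypothetical; the 5% cutoff and the counts are the ones given on the slides.

```python
def group_by_frequency(counts, cutoff=0.05):
    """Split regions into (common, rare) using a frequency cutoff.

    A region is common when its frequency is at or above the cutoff.
    """
    total = sum(counts.values())
    common = [region for region, n in counts.items() if n / total >= cutoff]
    rare = [region for region, n in counts.items() if n / total < cutoff]
    return common, rare

# Region counts from the frequency slide (120 haplotypes in total).
counts = {
    "AAAAAAAAAA": 59,
    "TTTTTTTTTT": 58,
    "AAAAAAAAAT": 1,
    "AATTTTTTTT": 1,
    "TTTTTTATTT": 1,
}

common, rare = group_by_frequency(counts)
print("Common:", common)
print("Rare:  ", rare)
```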
Recent Events
Make comparisons between the two groups to identify two forms of variation: point mutations and recombination events.
Common: AAAAAAAAAA, TTTTTTTTTT
Rare: AAAAAAAAAT, AATTTTTTTT, TTTTTTATTT
Point Mutations
[Figure: the workflow diagram with the point-mutation events highlighted. Rare AAAAAAAAAT is one base from common AAAAAAAAAA and is marked AAAAAAAAAT*; rare TTTTTTATTT is one base from common TTTTTTTTTT and is marked TTTTTTA*TTT.]
Recent Events
Point mutations
  Found by comparing a common haplotype with a rare haplotype.
  A difference of one base shows that the rare haplotype is a point mutation of the common haplotype.
  Marked by a “*” next to the mutated base.
Common: TTTTTTTTTT
Rare:   TTTTTTATTT
Marked: TTTTTTA*TTT
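The comparison described here amounts to looking for a Hamming distance of exactly one. A minimal sketch, assuming a hypothetical helper `find_point_mutation` that returns the matching common haplotype and the rare haplotype with a “*” after the changed base:

```python
def find_point_mutation(rare, common_haplotypes):
    """Return (common, marked_rare) if `rare` differs from a common
    haplotype at exactly one position, otherwise None."""
    for common in common_haplotypes:
        if len(common) != len(rare):
            continue
        diffs = [i for i, (a, b) in enumerate(zip(common, rare)) if a != b]
        if len(diffs) == 1:
            i = diffs[0]
            # Place the "*" immediately after the mutated base.
            return common, rare[:i + 1] + "*" + rare[i + 1:]
    return None

common = ["AAAAAAAAAA", "TTTTTTTTTT"]
for rare in ["AAAAAAAAAT", "AATTTTTTTT", "TTTTTTATTT"]:
    print(rare, "->", find_point_mutation(rare, common))
```

In this toy data AATTTTTTTT is two or more bases away from both common haplotypes, so it is not reported as a point mutation and is left for the recombination step.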
Recombination
[Figure: the workflow diagram with the recombination event highlighted. Rare AATTTTTTTT is built from the start of common AAAAAAAAAA and the end of common TTTTTTTTTT and is marked AA|TTTTTTTT.]
Recent Events
Recombination
  Combine portions of two common haplotypes and see if they form a rare haplotype.
Common: AAAAAAAAAA, TTTTTTTTTT
Possible recombinations: AA|TTTTTTTT, AAA|TTTTTTT, AAAA|TTTTTT, AAAAA|TTTTT, AAAAAA|TTTT, AAAAAAA|TTT, AAAAAAAA|TT
Rare Mutations
A recombination is marked by a “|” at the border between one haplotype and another.
Possible recombinations: AA|TTTTTTTT, AAA|TTTTTTT, AAAA|TTTTTT, AAAAA|TTTTT, AAAAAA|TTTT, AAAAAAA|TTT, AAAAAAAA|TT
Actual recombinations: AA|TTTTTTTT
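The recombination search can be sketched as trying every breakpoint between ordered pairs of common haplotypes. The function name `find_recombination` and the `min_flank` parameter are assumptions: the slide lists breakpoints that keep at least two bases from each parent, so the sketch defaults to `min_flank=2` to match that list.

```python
from itertools import permutations

def find_recombination(rare, common_haplotypes, min_flank=2):
    """Return a list of (left_parent, right_parent, marked_rare) for every
    way `rare` can be formed from the start of one common haplotype and
    the end of another. `min_flank` is the minimum number of bases taken
    from each parent (an assumption inferred from the slide's list)."""
    hits = []
    for left, right in permutations(common_haplotypes, 2):
        for k in range(min_flank, len(rare) - min_flank + 1):
            if left[:k] + right[k:] == rare:
                hits.append((left, right, rare[:k] + "|" + rare[k:]))
    return hits

common = ["AAAAAAAAAA", "TTTTTTTTTT"]
print(find_recombination("AATTTTTTTT", common))
print(find_recombination("TTTTTTATTT", common))
```

With `min_flank=1` a rare haplotype such as AAAAAAAAAT would also match as AAAAAAAAA|T, which is the ambiguity between a point mutation and a recombination near the edge of a region.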
Sample input and output
Individual   chr-haplotypes.txt (input)   new_chr-haplotypes.txt (output)
Indv1        TTTTTTTTTTTTTTT              T T T T T T T T T T
Indv1        AAAAAAAAATTTTTT              A A A A A A A A A T*
Indv2        AATTTTTTTTTTTTT              A A|T T T T T T T T
Indv2        TTTTTTATTTTTTTT              T T T T T T A*T T T
Visualization Tool
Expanding to the Whole Chromosome
Now that we have a way to look for variations in one region of a chromosome, we can expand the technique to look for variations across a whole chromosome.
We used a technique of overlapping windows.
[Figure: a 10-SNP window |AAAAAAAAAA| positioned on a 30-SNP haplotype AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA.]
Overlapping Windows
[Figure: the workflow diagram repeated, with the 10-SNP window placed on the 15-SNP haplotypes and the identified events AAAAAAAAAT*, AA|TTTTTTTT and TTTTTTA*TTT.]
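Overlapping
Overlapping windows catch recombination events that looked like point mutations.
Common: AAAAAAAAAAAAAAA, TTTTTTTTTTTTTTT
Rare:   AAAAAAAAATTTTTT
First 10 SNPs: common AAAAAAAAAA, rare AAAAAAAAAT, which looks like a point mutation (AAAAAAAAAT*).
Slide over 5 and take the next 10: common AAAAAAAAAA and TTTTTTTTTT, rare AAAATTTTTT, which is a recombination (AAAA|TTTTTT).
Combined: AAAAAAAAA|T*TTTTT resolves to AAAAAAAAA|TTTTTT.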
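The window layout in this example (10 SNPs wide, sliding over 5 each time) can be sketched as below. The function name `overlapping_windows` is hypothetical, and the sketch only produces the windows; how events from neighbouring windows are merged is not specified on the slide beyond the example shown.

```python
def overlapping_windows(length, size=10, step=5):
    """Yield (start, end) index pairs for windows of `size` SNPs that
    advance by `step` SNPs, so consecutive windows overlap by size - step."""
    for start in range(0, length - size + 1, step):
        yield start, start + size

rare = "AAAAAAAAATTTTTT"
for start, end in overlapping_windows(len(rare)):
    print(start, end, rare[start:end])
```

For the 15-SNP rare haplotype above this yields the two regions from the slide, AAAAAAAAAT and AAAATTTTTT. Note that any SNPs left over at the end of a haplotype are not covered when the length minus the window size is not a multiple of the step.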
Applying to a Population’s Chromosome
Now that we have a technique to look for new variations in a whole chromosome, we can apply it to a population and identify regions where recent genetic events took place.
Identified Recent Genetic Events
In chromosome 19:
Unique point mutations =
Unique recombination events = 4065
Total unique events =
Total point mutations =
Total recombination events =
Total number of events =
Average point mutations per individual = 383
Average recombination events per individual = 94
Average events per individual = 478
[Figure: Point Mutations. Axes: number of events vs. SNP position in the haplotype.]
[Figure: Recombination Events. Axes: number of events vs. SNP position in the haplotype.]
[Figure: Point Mutations and Recombination Events. Axes: number of events vs. SNP position in the haplotype.]
Conclusion
We have developed an algorithm for identifying recent genetic events in an individual.
More point mutations were identified than recombination events.
Some regions of the genome contain many recent genetic events, while other regions contain few.
Future Work
Run the algorithm over the whole genome
Extend the algorithm to multiple populations
Identify recent events that are unique to a population vs. ones that are shared
Identify genetic relations between common haplotypes
Create a chronological order of recent events in an individual
Adapt the algorithm for high-throughput sequencing data
UCLA ZarLab: Dr. Eleazar Eskin, all the lab people
SoCalBSI: Dr. Jamil Momand, Dr. Sandra Sharp, Dr. Nancy Warter-Perez, Dr. Wendie Johnston, Dr. Beverly Krilowicz, Dr. Silvia Heubach, Dr. Jennifer Faust, Ronnie Cheng, SoCalBSI 2009 Interns
Funded By:
Determining ancestors
The other ancestors are determined through SNP differences of 2 or more.
My Project
Red line: point mutation
Blue line: ancestor-to-common relationship
Black dashed line: haplotype resulting from a crossover (recombination) event
Graph
The graph is generated with Graphviz, a graph visualization program.
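As a rough illustration of how such a graph could be handed to Graphviz, the sketch below writes a DOT description using the line styles from the legend slide (red for point mutations, blue for ancestor-to-common relationships, black dashed for recombination). The function name `to_dot`, the edge directions, and the example edges are assumptions; the slides do not show the actual input given to Graphviz.

```python
def to_dot(edges):
    """Build a Graphviz DOT string from (source, target, kind) triples.

    `kind` is one of "point_mutation", "ancestor", "recombination".
    """
    styles = {
        "point_mutation": 'color="red"',
        "ancestor": 'color="blue"',
        "recombination": 'color="black", style="dashed"',
    }
    lines = ["digraph haplotypes {"]
    for source, target, kind in edges:
        lines.append(f'    "{source}" -> "{target}" [{styles[kind]}];')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical edges based on the toy haplotypes from the earlier slides.
edges = [
    ("TTTTTTTTTT", "TTTTTTATTT", "point_mutation"),
    ("AAAAAAAAAA", "AATTTTTTTT", "recombination"),
    ("TTTTTTTTTT", "AATTTTTTTT", "recombination"),
]
dot = to_dot(edges)
print(dot)
```

Saving the printed text to a file such as `haplotypes.dot` and running `dot -Tpng haplotypes.dot -o haplotypes.png` renders it as an image.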
Graph