The Human Genome Project Lecture 4 Strachan and Read Chapter 8
The HGP's primary aims
The main aims of the Human Genome Project (HGP) were to:
–Construct maps of the genome (genetic and physical)
–Identify all the genes (then estimated at about 30,000; later counts put protein-coding genes nearer 20,000)
–Determine the entire DNA sequence (3,000,000,000 bp)
Other aims of the HGP
As well as the genome sequence, the aims were:
–Technology development
–Model organism genome projects (E. coli, yeast, mouse, fruit fly, C. elegans)
–Ethical, legal and social implications (ELSI)
The linkage map
The map was built by linkage studies in 60 large families with grandparents and large numbers of children, collected by the University of Utah and the Centre d'Étude du Polymorphisme Humain (CEPH), Paris.
Families were typed with over 5000 polymorphic DNA sequences: 60% were microsatellite repeats (mostly dinucleotide (CA) repeats, also some tri- and tetranucleotides). Only about 400 of the markers were actual genes.
Construction of the genetic map:
–Obtain genotypes of all markers on all family members (PCR and gel electrophoresis, using robots and automated gel apparatus)
–Calculate recombination fractions between markers
–Observe crossovers between closely linked markers and use this information to confirm the order of markers
Constructing the linkage map is a very large computational problem; sophisticated software, using advanced statistical methods and algorithms, was used to work out the "best fit" map of all the markers.
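The recombination fraction between two markers can be illustrated with a minimal Python sketch. It assumes each informative meiosis has already been scored as recombinant or non-recombinant; the function name and the data are hypothetical. The simple proportion shown here only conveys the idea and does not reproduce the statistical methods the mapping software used.

```python
def recombination_fraction(recombinant_flags):
    """Fraction of scored meioses that are recombinant between two markers.

    recombinant_flags: list of booleans, one per informative meiosis
    (True = a crossover was observed between the two markers).
    """
    if not recombinant_flags:
        raise ValueError("no informative meioses scored")
    recombinants = sum(1 for flag in recombinant_flags if flag)
    return recombinants / len(recombinant_flags)


# Hypothetical data: 100 informative meioses, 8 of them recombinant.
scored = [True] * 8 + [False] * 92
theta = recombination_fraction(scored)
print(f"recombination fraction = {theta:.2f}")
```

A small recombination fraction means the two markers are rarely separated by a crossover, i.e. they are closely linked.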
STSs and ESTs
Sequence tagged sites (STSs) are specific loci in the genome for which enough DNA sequence is available to make PCR primers to amplify the locus (usually as a fragment of a few hundred bp). These include microsatellites (e.g. CA repeats) that can be used for linkage studies. The information required to use an STS is just the sequences of the PCR primers; therefore it is very easy to make databases of STSs that can be used by anyone. No actual DNA needs to change hands. This is crucial in allowing genome projects to proceed as international collaborations, with many laboratories participating in a co-ordinated way.
Expressed sequence tags (ESTs) act as specific tags for each human gene, since they are derived by sequencing cDNA clones which came from mRNA and therefore represent the actual transcribed sequences (as opposed to STSs, which can be derived from anywhere in the genome and are mostly non-coding). They allow rapid access to the actual genes, ignoring introns and junk DNA.
ESTs can be 3' or 5' depending on which end of the cDNA was sequenced. Because of the methods used to make cDNA libraries, parts of the 5' end of the gene are often lost during cloning, whereas the 3' end is more reliable. Therefore, the same gene may give different 5' ESTs and it will be difficult to deduce whether they have come from the same gene. This is shown on the diagram by the white boxes representing cDNA clones being of different lengths. Another complication is due to alternative splicing. On the left is shown the genomic structure of a gene, with the exons as boxes; the red one is subject to alternative splicing.
X-ray hybrid mapping
X-ray hybrids are made by irradiating a human cell line with 3000 rad of X-rays, fusing the irradiated cells to hamster cells, and isolating hybrid cell lines in culture. A panel of 100-200 hybrids with 5-10 different fragments of human DNA in each gives about 1000 fragments in total, i.e. the human genome has been divided into about 1000 pieces. The closer together two markers are in the genome, the more likely it is that they will be present in the same hybrids (since they are less likely to be separated by an X-ray induced break). By doing a PCR assay for each marker on all the hybrids, a map can be made. The units are called cR (centiray, where 1 cR is a 1% chance that the markers will be separated by X-ray breakage).
For each pair of markers in turn the "co-retention frequency" is the number of hybrids in which both markers are present, divided by the number of hybrids in which one or other (or both) markers are present. On the figure, there are 5 hybrids containing both markers B and C, and 6 containing B and/or C. Therefore the co-retention frequency is 5/6 or 0.83. Likewise it is 6/7 for markers E and F, and 2/10 for markers C and E. This shows that B and C are close together, E and F are close together, but C and E are further apart. The analysis is extended to all the markers and their order is worked out by considering all the co-retention frequencies.
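The co-retention calculation can be sketched in a few lines of Python. The panel below is made-up toy data, not the panel in the figure; it was chosen so that the B/C and C/E pairs give the same ratios quoted above (5/6 and 2/10). The function and variable names are hypothetical.

```python
def co_retention(hybrids, marker_a, marker_b):
    """Hybrids containing both markers, divided by hybrids containing either."""
    both = sum(1 for markers in hybrids if marker_a in markers and marker_b in markers)
    either = sum(1 for markers in hybrids if marker_a in markers or marker_b in markers)
    if either == 0:
        raise ValueError("neither marker is present in any hybrid")
    return both / either


# Toy panel: each set lists the markers that gave a PCR product in one hybrid.
panel = [
    {"B", "C"}, {"B", "C"}, {"B", "C"},
    {"B", "C", "E"}, {"B", "C", "E"},
    {"B"},
    {"E"}, {"E"}, {"E"}, {"E"}, {"E"},
]

for a, b in [("B", "C"), ("C", "E")]:
    print(f"{a}-{b}: {co_retention(panel, a, b):.2f}")
```

In this toy panel B and C score 0.83 and C and E score 0.20, so B and C would be placed closer together than C and E.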
Clone contigs
A clone contig is a series of cloned DNA segments that overlap each other, assembled in the correct order along the genome.
The clones are made using vectors:
–cosmids (capacity 45 kb)
–BACs or YACs (Bacterial or Yeast Artificial Chromosomes), which can clone 100s of kb of DNA and are more suitable for dealing with large stretches of mammalian DNA.
Putting it together
The physical map consists of 1000s of cloned genomic DNA fragments, in E. coli host cells (BACs, cosmids, 40-250 kb) or yeast (100-1500 kb: "Yeast artificial chromosomes" or YACs), X-ray hybrids, and hundreds of thousands of STSs and ESTs. The linkage map contains several thousand STSs. All of these can be linked together to produce an integrated genome map. The presence or absence of each STS or EST in each X-ray hybrid and cloned DNA is simply determined by PCR. Because of the huge numbers involved, automation of the assays is required.
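One way the PCR presence/absence results can tie clones together is by shared STS content: two clones that both contain the same STS must overlap at that locus. The sketch below is a minimal illustration under that assumption; the clone names, STS names, and function name are all hypothetical, and real contig building also has to deal with assay errors and chimeric clones.

```python
from itertools import combinations


def shared_sts_overlaps(clone_sts):
    """Return {(clone_a, clone_b): shared STSs} for every pair sharing an STS.

    clone_sts: dict mapping a clone name to the set of STSs detected in it by PCR.
    """
    overlaps = {}
    for a, b in combinations(sorted(clone_sts), 2):
        shared = clone_sts[a] & clone_sts[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical PCR results for four clones.
clones = {
    "BAC1": {"sts1", "sts2"},
    "BAC2": {"sts2", "sts3"},
    "BAC3": {"sts3", "sts4"},
    "BAC4": {"sts9"},
}

for pair, shared in shared_sts_overlaps(clones).items():
    print(pair, sorted(shared))
```

Here BAC1, BAC2 and BAC3 chain into one contig through sts2 and sts3, while BAC4 shares no STS with the others and stays unplaced.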
Sequencing
There was a great deal of human genome to sequence (3000 Mb, or 3 x 10^9 bp). Due to the limitations of the techniques, each sequencing reaction can only generate up to 700 bp of DNA sequence. So the total sequence must be assembled from millions of short, overlapping pieces of sequence. The starting point for this is the contigs of overlapping BAC clones. Each clone in the contig is subcloned into 100s of smaller fragments, using a plasmid vector suitable for preparing templates for the DNA sequencing reactions.
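The "millions of reads" figure follows directly from the two numbers above. The short Python calculation below is a rough lower bound only: it assumes every read reaches the full 700 bp, and the `coverage` parameter (how many times each base is read on average) is an illustrative assumption, not a figure from the lecture.

```python
import math

GENOME_BP = 3_000_000_000  # genome size given above
MAX_READ_BP = 700          # maximum read length given above


def min_reads(genome_bp, read_bp, coverage=1.0):
    """Smallest number of full-length reads giving the requested average coverage."""
    return math.ceil(genome_bp * coverage / read_bp)


print(min_reads(GENOME_BP, MAX_READ_BP))              # each base read once
print(min_reads(GENOME_BP, MAX_READ_BP, coverage=2))  # hypothetical 2x coverage
```

Even with no overlap at all, more than four million reads are needed, and the overlaps required for assembly push the real number higher.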