Human Genome Project

Basic Strategy
The goal: determine the sequence of the roughly 3 billion base pairs of the human genome. The project started in 1990.
Side projects: genetic diseases, variation between individuals, ethnic variation, comparison to other species.
Strategy:
–1. Build a physical map relating specific DNA markers to their proper chromosomal positions.
–2. Assemble overlapping sets of cloned DNAs (contigs).
–3. Sequence the clones and assemble the sequence.
–4. Find the genes in the sequence.
–5. Annotate gene function.

Physical Maps
A genetic map uses recombination (crossing over during meiosis) to determine how frequently two genes (or markers) are inherited together.
A physical map determines where a given DNA marker is located on the DNA of the chromosome.
Genetic and physical maps are (supposed to be) colinear: all the genes appear in the same order in both maps.
But the distances are quite different: there is very little recombination near the centromeres, so large DNA distances there correspond to very short recombination distances.
Genetic maps built from microsatellite (SSR) markers were used to develop physical maps: each SSR site was expected to be found on the corresponding cloned DNA.
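
To make the difference between genetic and physical distance concrete, here is a minimal sketch. The offspring counts and interval sizes are invented for illustration, and the conversion of a recombination fraction to centimorgans (100 x fraction) is only a reasonable approximation for small fractions.

```python
def recombination_fraction(parental, recombinant):
    """Fraction of offspring that are recombinant for a pair of markers."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no offspring counted")
    return recombinant / total

def centimorgans_per_mbp(parental, recombinant, physical_mbp):
    """Genetic distance (cM, small-fraction approximation) per Mbp of DNA."""
    return 100 * recombination_fraction(parental, recombinant) / physical_mbp

# Hypothetical intervals: one on a chromosome arm, one spanning a centromere.
arm = centimorgans_per_mbp(parental=980, recombinant=20, physical_mbp=2)
centromere = centimorgans_per_mbp(parental=999, recombinant=1, physical_mbp=5)
print(f"arm: {arm:.2f} cM/Mbp, centromere: {centromere:.2f} cM/Mbp")
```

With these made-up numbers the centromeric interval is physically larger but genetically much shorter, which is the pattern the slide describes.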

Sequence Tagged Sites
A sequence tagged site (STS) is a short sequence that is unique in the genome. You obtain the sequence information from cloned DNA, and then locate it in the genome. Using PCR, it is then possible to determine whether your STS is present in any other clone or cell line.
Obtaining STS: sequence the ends of large cloned DNAs (BACs or YACs, for example).
Uniqueness: use the cloned DNA from the STS as a probe on a Southern blot of genomic DNA. If the STS is unique, only 1 band will hybridize. Repetitive DNA is very common in the human genome, and many DNA sequences are not unique. A good source of unique DNA is EST clones: cDNA made from messenger RNA.
Size: a DNA sequencing run will usually give 500-600 bp of good, reliable sequence information. On the other hand, consider the size of the genome: 3 x 10^9 bp. Each base is one of 4 choices, so a given 16 bp sequence is expected to appear by chance about once in 4^16 = 4.3 x 10^9 bp. In practice, 20 bp is about the minimum size for good PCR amplification, and 24 bp is about the minimum that will give a good BLAST hit.
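
The size argument above can be checked with a few lines of arithmetic. This sketch assumes all four bases are equally frequent and independent, which real genomes (with their repeats) do not satisfy, so treat the numbers as a rough lower bound on the length needed.

```python
GENOME_SIZE = 3e9  # approximate human genome size in bp, as given in the text

def sequence_space(length):
    """Number of distinct sequences of this length over A, C, G, T."""
    return 4 ** length

def expected_random_hits(length, genome_size=GENOME_SIZE):
    """Expected chance occurrences of one specific sequence in the genome,
    assuming equal base frequencies and independent positions."""
    return genome_size / sequence_space(length)

for n in (12, 16, 20, 24):
    print(f"{n} bp: {sequence_space(n):.3g} possible sequences, "
          f"{expected_random_hits(n):.3g} expected chance hits")
```

At 16 bp the expected number of chance hits drops below one, which is why sequences of about that length and longer can be unique.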

Somatic Cell Hybrids
Human and mouse (or hamster) cultured cells can be fused together using polyethylene glycol.
–The resulting fused cell is a heterokaryon: it has 2 nuclei from different species.
–If the heterokaryon undergoes mitosis, the nuclei fuse.
–Human chromosomes are unstable in a mixed nucleus, and most of them are randomly lost. The mouse chromosomes all stay.
–Different cell lines can be established that contain different combinations of human chromosomes.
–You can identify which human chromosomes remain using chromosome banding techniques.
This is a good way to determine which chromosome a DNA sequence is on. It sometimes also works for gene products or phenotypes.
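
Assigning a sequence to a chromosome with a hybrid panel comes down to finding the chromosome whose presence matches the marker's presence in every cell line. The sketch below uses an invented five-line panel and invented PCR results; it assumes error-free scoring, which real panels do not guarantee.

```python
# Hypothetical panel: human chromosomes retained by each hybrid cell line.
panel = {
    "line_A": {"1", "7", "X"},
    "line_B": {"2", "7", "11"},
    "line_C": {"1", "11", "X"},
    "line_D": {"7", "X"},
    "line_E": {"2", "11"},
}

# Hypothetical PCR results: is the marker detected in each line?
marker_present = {
    "line_A": True,
    "line_B": True,
    "line_C": False,
    "line_D": True,
    "line_E": False,
}

def concordant_chromosomes(panel, marker_present):
    """Chromosomes present in exactly the lines where the marker is present."""
    all_chromosomes = set().union(*panel.values())
    return [
        chrom
        for chrom in sorted(all_chromosomes)
        if all((chrom in retained) == marker_present[line]
               for line, retained in panel.items())
    ]

print(concordant_chromosomes(panel, marker_present))
```

In this toy panel only chromosome 7 is perfectly concordant with the marker.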

Radiation Hybrids
Standard somatic cell fusions contain entire human chromosomes. To locate a gene more closely, you need to use chromosome fragments.
Start by irradiating human cells with a controlled dose of X-rays: the chromosomes break up. Then fuse the cells to mouse cells. The human chromosome fragments get integrated into the mouse chromosomes.
Create a panel of mouse/human hybrid cell lines.
–The current standard panels contain about 100 cell lines.
–Each line contains about 32% of the human genome.
–Average size of a human genome fragment = 25 Mbp.
–More radiation = smaller fragments.
Mapping: the hybrid cell lines contain random human chromosome fragments, but closely linked sites are usually in the same cell line (same basic principle as recombination mapping).
–Until you have located some of the markers on the chromosomes, radiation hybrid mapping only tells you whether any two sequences are close together on the chromosome.
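
The mapping principle can be illustrated by comparing how two markers are retained across the panel. This is a crude sketch with invented retention patterns for a ten-line panel; real radiation hybrid analysis uses statistical models of breakage and retention rather than a raw discordance count.

```python
def discordance(pattern_a, pattern_b):
    """Fraction of cell lines in which exactly one of two markers is retained.

    Each pattern is a string of '1' (retained) or '0' (absent), one character
    per cell line, in the same line order for both markers.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must cover the same cell lines")
    differing = sum(a != b for a, b in zip(pattern_a, pattern_b))
    return differing / len(pattern_a)

# Hypothetical retention patterns across 10 hybrid lines.
marker_1 = "1100101000"
marker_2 = "1100100000"  # almost always co-retained with marker_1
marker_3 = "0011010010"  # rarely co-retained with marker_1

print(discordance(marker_1, marker_2))  # low: probably close together
print(discordance(marker_1, marker_3))  # high: probably far apart or unlinked
```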

Contigs
A contig is a contiguous set of partially overlapping clones, with no gaps between them. Contigs allow you to build up the sequence of the chromosome over much larger regions than any single clone.
The first reasonably complete physical map of the human genome involved contigs generated from YACs (yeast artificial chromosomes).
Initially, you have a collection of clones with no information about how they are ordered on the chromosome. Contigs are built up by using PCR to identify unique sequences (STS or EST) on each clone, and then looking for overlaps between the clones.
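
A minimal sketch of contig building from STS content: two clones are treated as overlapping if they share at least one STS, and clones connected through a chain of overlaps form one contig. The clone and STS names are invented, and the sketch only groups clones; it does not order them along the chromosome.

```python
from itertools import combinations

# Hypothetical PCR results: the STS markers detected on each clone.
clones = {
    "BAC1": {"sts1", "sts2"},
    "BAC2": {"sts2", "sts3"},
    "BAC3": {"sts3", "sts4"},
    "BAC4": {"sts7", "sts8"},
    "BAC5": {"sts8", "sts9"},
}

def build_contigs(clones):
    """Group clones into contigs by following shared-STS overlaps."""
    neighbors = {name: set() for name in clones}
    for a, b in combinations(clones, 2):
        if clones[a] & clones[b]:  # at least one STS in common
            neighbors[a].add(b)
            neighbors[b].add(a)

    contigs, seen = [], set()
    for start in clones:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over overlapping clones
            clone = stack.pop()
            if clone not in group:
                group.add(clone)
                stack.extend(neighbors[clone] - group)
        seen |= group
        contigs.append(sorted(group))
    return contigs

print(build_contigs(clones))
```

Here BAC1 to BAC3 form one contig and BAC4 and BAC5 another; the gap between them would need an additional clone carrying markers from both sides.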

Sequencing Strategy
Once a contig map of the genome was obtained, it was necessary to sequence each individual clone.
Most of the actual human genome sequencing was done on BAC clones, which are less prone to rearrangement than YAC clones. BACs are about 100-200 kbp long.
Large clones are generally sequenced by shotgun sequencing: the large cloned DNA is randomly broken up into a series of small fragments (less than 1 kb). These fragments are cloned and sequenced. A computer program then assembles them based on overlaps between the sequences of each clone.
To ensure that every bit has been covered, you need to sequence random clones until you have covered each spot 5-10 times on average.
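
Why 5-10 times coverage? If reads land at random, the chance that a given base is missed by every read falls off roughly as e^(-coverage) (a Poisson approximation, as in the Lander-Waterman model). The sketch below assumes uniformly random reads of a fixed length; the 150 kbp clone and 500 bp read length are example values chosen from the ranges in the text.

```python
import math

def reads_needed(clone_length, read_length, coverage):
    """Number of reads required to reach the given average coverage."""
    return math.ceil(coverage * clone_length / read_length)

def fraction_uncovered(coverage):
    """Expected fraction of bases hit by no read (Poisson approximation)."""
    return math.exp(-coverage)

BAC_LENGTH = 150_000  # bp, example value within the 100-200 kbp range
READ_LENGTH = 500     # bp, example value for one sequencing run

for c in (1, 5, 8, 10):
    missed = fraction_uncovered(c) * BAC_LENGTH
    print(f"{c}x: {reads_needed(BAC_LENGTH, READ_LENGTH, c)} reads, "
          f"about {missed:.1f} bp expected uncovered")
```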

Whole Genome Shotgun Sequencing
Why bother with creating a large-scale physical map: all that YAC and BAC cloning, radiation hybrids, STS comparisons, etc.? Why not just fragment the whole genome into 1 kb pieces, sequence them all, and let the computer assemble the whole genome?
In practice, the genome is cloned into large fragments first, and then each large fragment is broken up for shotgun sequencing. But the large fragments are not ordered: no physical map or set of contigs is created.
This approach requires a lot of overlapping coverage and good software.
Very successful for prokaryotic genomes (10 Mbp or less).
–But the human genome is 300 times larger.
Big problem: repeat sequence DNA, which is everywhere, and especially near the centromere. To find overlaps between clones, you need unique regions.
It remains unclear whether whole genome shotgun sequencing will work if there is no other information available to provide order. It has not been widely adopted for eukaryotic projects (so far).
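
The repeat problem can be seen in a toy overlap calculation. In this sketch (invented sequences, exact matching only, no sequencing errors), one read ends inside a repeat and overlaps equally well with two different reads that begin with copies of that repeat, so the assembler cannot tell which one follows.

```python
def overlap(a, b, min_len=4):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

REPEAT = "GATTACAGATTACA"          # hypothetical repeat element

read_x = "CCGTAAGC" + REPEAT       # unique sequence running into a repeat copy
read_y = REPEAT + "TTGGCCAA"       # one repeat copy and its unique flank
read_z = REPEAT + "ACGTTGCA"       # another repeat copy, different flank

# Both candidate successors overlap read_x by the full repeat length.
print(overlap(read_x, read_y), overlap(read_x, read_z))
```

Only a read that spans the whole repeat and reaches unique sequence on both sides would resolve the ambiguity, which is why unique regions are needed to find reliable overlaps.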

Gene Detection
The best evidence that a given DNA sequence is expressed is to find an EST (cDNA copy of mRNA) that matches it. Large numbers of EST libraries have been constructed and sequenced.
–The primary result of this was to determine that many genes have several different intron splicing patterns: sequences are exons in some tissues but introns in others.
Homology searches, using BLAST, are a good way to find genes. If a DNA sequence closely matches a sequence from another organism, it has been evolutionarily conserved, and that usually means that it is an expressed gene.
Exon prediction: exons need to be open reading frames (no stop codons), and they display patterns of nucleotide usage different from random DNA. Several different programs exist, and they give somewhat varying results.
“Hypothetical genes” are genes whose existence has been predicted by computer but which lack any experimental or cross-species data to confirm them.
–A “conserved hypothetical gene” is a sequence that matches other species even though there is no EST or other experimental evidence for its expression.
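
The open-reading-frame criterion for exon prediction can be sketched as a scan for runs of codons with no stop codon. This is a deliberately simplified illustration: it looks only at the three forward-strand frames and ignores splice signals and nucleotide-usage statistics, which real gene-finding programs rely on. The function name and the `min_codons` threshold are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_stretches(seq, min_codons=3):
    """Return (frame, start, end) for stop-free codon runs on the forward strand.

    start is inclusive, end is exclusive, both 0-based positions in seq.
    Runs shorter than min_codons are dropped.
    """
    seq = seq.upper()
    results = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    results.append((frame, run_start, pos))
                run_start = pos + 3  # next run begins after the stop codon
            pos += 3
        if (pos - run_start) // 3 >= min_codons:
            results.append((frame, run_start, pos))
    return results

print(open_stretches("ATGGCCAAATAAGGG"))
```

On genomic DNA the threshold would be set far higher than 3 codons, since short stop-free runs occur constantly by chance.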

Gene Annotation
Computer predictions of gene function are mediocre at best. Humans, especially those who are experts in the field, do a much better job of evaluating evidence and deciding what a given gene’s function is.
There is a big problem of too much information that is not uniformly coded or maintained. The scientific literature contains numerous examples of the same gene or protein with several different names, and getting common definitions of functions is even harder.
To counter this, the Gene Ontology Consortium (GO) has created a controlled vocabulary of about 11,000 terms. Every gene product (protein) can be annotated in three general categories:
–molecular function: what the protein actually does, such as “kinase activity”
–biological process: what cellular process the protein participates in, such as “signal transduction”
–cellular component: where the protein is found in the cell, such as “integral to the plasma membrane”
Each gene product can have multiple descriptive terms. The terms are hierarchical: more specific terms are contained within less specific terms. But a given term can have more than one parent and more than one child term.
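
Because a term can have more than one parent, the vocabulary is a directed acyclic graph rather than a simple tree. The sketch below shows how annotating a product with one specific term implies all of its less specific ancestors. The term names and parent links are a toy example invented for illustration, not actual Gene Ontology entries or identifiers.

```python
# Toy hierarchy: each term maps to its set of direct parents.
parents = {
    "molecular function": set(),
    "catalytic activity": {"molecular function"},
    "binding": {"molecular function"},
    "kinase activity": {"catalytic activity"},
    "ATP-dependent kinase (toy term)": {"kinase activity", "binding"},
}

def ancestors(term, parents):
    """All less specific terms reachable by following parent links."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

print(sorted(ancestors("ATP-dependent kinase (toy term)", parents)))
```

The toy term reaches "molecular function" by two different routes, through "kinase activity" and through "binding", which a strict tree could not represent.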