Presentation on theme: "Ka-Lok Ng Dept. of Bioinformatics Asia University"— Presentation transcript:
1 Genome Science. Ka-Lok Ng, Dept. of Bioinformatics, Asia University
2 The Core Aims of Genomics Science: (1) An integrated web-based database and research interface. It gives access to the enormous volume of data through web interfaces and relational databases. Generic Model Organism Database (GMOD): a project to develop reusable components suitable for creating new community databases of biology.
3 The Core Aims of Genomics Science: (2) To assemble physical and genetic maps. These give the location of genes in a genome: physical distance, and relative position defined by recombination frequencies. The map is crucial for comparing the genomes of related species and for relating phenotypic and genetic data. It is used in animal and plant breeding. The aim is to extend maps to more species with greater accuracy.
4 The Core Aims of Genomics Science: (3) To generate and order genomic and expressed gene sequences. High-volume sequencing: the basic technique was developed by Fred Sanger. The "shotgun" approach assembles reads into contigs, then scaffolds (sets of contigs), then whole chromosomes. mRNA is unstable, so the coding parts are captured as cDNA clones (cloned from mRNA transcripts) and as expressed sequence tags (ESTs). Obtaining full-length cDNA is not easy because of mRNA structure.
5 The Core Aims of Genomics Science: (3, continued) To generate and order genomic and expressed gene sequences. mRNA → cDNA → EST. Whole genome reconstruction. Reverse transcription converts mRNA into cDNA. ESTs are partial cDNA sequences, sequenced from either the 5' or the 3' end. Because of alternative splicing, there is not a one-to-one correspondence between ESTs and genes.
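The mRNA → cDNA → EST steps on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the toy sequence are hypothetical, and a real EST is a single-pass read of a few hundred bases, not the two-base read used here.

```python
# Base-pairing rules for copying an RNA template into complementary DNA.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(mrna: str) -> str:
    """Return the first-strand cDNA (written 5'->3') for an mRNA written 5'->3'."""
    # The cDNA strand is antiparallel to the mRNA, so read the template backwards.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


def est_reads(cdna: str, read_length: int) -> tuple[str, str]:
    """Return short reads taken from the 5' end and the 3' end of a cDNA clone."""
    return cdna[:read_length], cdna[-read_length:]


cdna = reverse_transcribe("AUGGCC")       # toy mRNA, hypothetical
five_prime, three_prime = est_reads(cdna, 2)
print(cdna, five_prime, three_prime)
```

Because each EST covers only one end of one transcript, several different ESTs can come from the same gene, which is the point the slide makes about alternative splicing.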
6 The Core Aims of Genomics Science: (4) To identify and annotate the complete set of genes encoded within a genome. From the complete sequence of a genome → gene identification. Alignment of cDNA, DNA and protein sequences: BLAST. Gene-finding software looks for ORFs, transcription start and termination sites, and exon/intron boundaries. Gene annotation then links sequence to genetic function, expression, locus information, and comparative data from homologous proteins.
7 The Core Aims of Genomics Science: (5) To characterize DNA sequence diversity. Single-nucleotide polymorphisms (SNPs): about 90 percent of human genome variation comes in the form of single nucleotide polymorphisms (generally neither harmful nor beneficial). Theoretically, a SNP could have four possible forms, or alleles (alternative sequences), since there are four types of bases in DNA. In reality, most SNPs have only two alleles. For example, if some people have a T at a certain place in their genome while everyone else has a G, that place in the genome is a SNP with a T allele and a G allele. The human genome contains more than 10 million SNPs, roughly one in every 100 to 300 bp. A further goal is to find associations between SNP variation and phenotypic variation, e.g. sickle-cell anemia (SNP → mutation).
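The T/G example above can be made concrete with a small Python sketch that scans aligned sequences for positions carrying more than one base. The sample sequences and the function name are hypothetical, and the sketch assumes the sequences are already aligned and of equal length.

```python
from collections import Counter


def snp_sites(aligned_seqs):
    """Return {position: Counter of alleles} for every column with more than one base."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        alleles = Counter(seq[pos] for seq in aligned_seqs)
        if len(alleles) > 1:          # more than one base seen: polymorphic site
            sites[pos] = alleles
    return sites


# Four hypothetical individuals; position 3 (0-based) has a T allele and a G allele.
samples = ["ACGTTA", "ACGGTA", "ACGGTA", "ACGTTA"]
sites = snp_sites(samples)
for pos, alleles in sites.items():
    print(pos, dict(alleles))
```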
9 The Core Aims of Genomics Science: (5) To characterize DNA sequence diversity. Characterize the level of haplotype structure due to linkage disequilibrium (LD). Haplotype = a set of adjacent polymorphisms found on a single chromosome. LD = the tendency of groups of closely linked alleles to be inherited together; it can be used to map human disease genes very accurately. Knowledge of LD is used in disease locus mapping. In the human genome, haplotypes tend to be approximately 60,000 bp in size and therefore contain up to 60 SNPs that travel as a group.
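One common way to quantify how strongly two alleles "travel as a group" is the pair of statistics D and r². The slide does not name these measures, so the sketch below is an assumption about how LD might be computed, and the haplotype strings are hypothetical.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    haplotypes: list of 2-character strings; character 0 is the allele at
    locus 1 and character 1 is the allele at locus 2 on one chromosome.
    The alleles of the first haplotype are used as the reference alleles.
    """
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]
    p_a = sum(h[0] == a for h in haplotypes) / n      # frequency of allele a
    p_b = sum(h[1] == b for h in haplotypes) / n      # frequency of allele b
    p_ab = sum(h == a + b for h in haplotypes) / n    # frequency of haplotype ab
    d = p_ab - p_a * p_b                              # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Complete association: T at locus 1 always occurs with G at locus 2.
linked = ["TG"] * 5 + ["CA"] * 5
# No association: all four combinations are equally frequent.
unlinked = ["TG", "TA", "CG", "CA"]
print(linkage_disequilibrium(linked))
print(linkage_disequilibrium(unlinked))
```

An r² near 1 means one SNP predicts the other almost perfectly, which is why a single tag SNP can stand in for a whole haplotype block in disease mapping.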
10 The Core Aims of Genomics Science: Mendel's Laws enable the outcome of genetic crosses to be predicted. (Figure: genes A and B on different chromosomes.)
11 The Core Aims of Genomics Science: Genes on the same chromosome should display linkage. Genes A and B are on the same chromosome and so should be inherited together. Mendel's Second Law should therefore not apply to the inheritance of A and B, but holds for the inheritance of A and C, or B and C. Mendel did not discover linkage because the seven genes that he studied were each on a different pea chromosome. Partial linkage: partial linkage was discovered in the early 20th century. The cross shown here was carried out by Bateson, Saunders and Punnett in 1905 with sweet peas. The parental cross gives the typical dihybrid result (see Figure on the right), with all the F1 plants displaying the same phenotype, indicating that the dominant alleles are purple flowers and long pollen grains. The F1 cross gives unexpected results, as the progeny show neither a 9 : 3 : 3 : 1 ratio (expected for genes on different chromosomes) nor a 3 : 1 ratio (expected if the genes are completely linked). An unusual ratio is typical of partial linkage.
12 The Core Aims of Genomics Science: (5) To characterize DNA sequence diversity. The farther apart two genes are, the more they tend to assort independently (randomly), so the recombination frequency increases. A higher frequency means the genes are farther apart. (Figure label: vermilion, a bright red colour.)
13 The Core Aims of Genomics Science: (6) To compile atlases of gene expression by analyzing profiles of transcription and protein synthesis. Traditional methods: Northern blots, hybridization. Modern technology: microarrays, which measure the relative level of expression (differential expression). Patterns of covariation in gene expression give clues to unknown gene function (guilt by association).
14 The Core Aims of Genomics Science: (7) To accumulate functional data, including biochemical and phenotypic properties of genes. Near-saturation mutagenesis (screening hundreds of thousands of mutants to identify genes that affect traits as diverse as embryogenesis, immunology, and behavior). High-throughput reverse genetics (methods to systematically and specifically inactivate individual genes): Yeast Genome Deletion Project; mouse. Proteomics: detecting protein expression and protein-protein interactions. Pharmacogenomics: the study of interactions between small molecules (i.e. potential drugs) and proteins. Functional genomics: a crucial component is the study of various model organisms. Clone library: a collection of DNA fragments cloned into a vector.
15 The Core Aims of Genomics Science: Site-directed mutagenesis. With Smith's site-directed mutagenesis, researchers can study in detail how proteins function and how they interact with other biological molecules. Site-directed mutagenesis can be used, for example, to systematically change amino acids in enzymes, in order to better understand the function of these important biocatalysts. Researchers can also analyze how a protein is folded into its biologically active three-dimensional structure. The method can also be used to study the complex cellular regulation of genes and to increase our understanding of the mechanisms behind genetic and infectious diseases, including cancer. Example: GTC (valine) → GCC (alanine).
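The GTC → GCC example is a single-base change at the second position of the codon. A minimal Python sketch of that substitution follows; the helper name is hypothetical and the lookup table deliberately holds only the two codons named on the slide.

```python
# Only the two codons named on the slide; the full genetic code has 64 codons.
CODON_TO_AMINO_ACID = {"GTC": "Valine", "GCC": "Alanine"}


def substitute_base(codon: str, position: int, new_base: str) -> str:
    """Return a copy of the codon with the base at a 0-based position replaced."""
    if not 0 <= position < len(codon):
        raise ValueError("position outside codon")
    return codon[:position] + new_base + codon[position + 1:]


wild_type = "GTC"
mutant = substitute_base(wild_type, 1, "C")   # T -> C at the second base
print(CODON_TO_AMINO_ACID[wild_type], "->", CODON_TO_AMINO_ACID[mutant])
```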
16 The Core Aims of Genomics Science: (8) To provide the resources for comparison with other genomes. Comparative maps allow genetic data from one species to be used in another species. Comparative maps show that local gene order along a chromosome tends to be conserved: synteny (e.g. the human and mouse genomes). Even without synteny, gene function is known to be conserved (say, from fly to primate). Gene order conservation (GOC).
17 Mapping Genomes – Genetic Maps: Genetic map: the relative order of genetic markers in linkage groups, in which the distance between markers is expressed in units of recombination. Genetic markers: sequence tags, repeats, restriction enzyme polymorphisms (cutting sites). In diploid organisms (those with two sets of chromosomes), genetic maps are assembled from data on the co-segregation of genetic markers, either in pedigrees or in the progeny of controlled crosses. Genetic distance unit: centimorgan (cM). 1 cM = 1% recombination frequency; in humans, 1 cM ~ 1 Mbp. 100 cM: 1 crossover occurs per chromosome per generation. Markers on different chromosomes co-segregate by chance, which corresponds to 50 cM (0.5 crossover per generation).
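The unit definitions on this slide (1 cM = 1% recombination, roughly 1 Mbp in humans) translate directly into arithmetic. The sketch below is a simplification: the function names are hypothetical, and capping the estimate at 50 cM is an assumption reflecting the fact that unlinked markers recombine 50% of the time.

```python
def map_distance_cm(recombinants: int, total: int) -> float:
    """Estimate genetic distance in cM from a count of recombinant progeny."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    percent = 100.0 * recombinants / total      # 1 cM = 1% recombination
    return min(percent, 50.0)                   # unlinked markers appear as 50 cM


def approx_physical_distance_mbp(distance_cm: float) -> float:
    """Rough human conversion from the slide: 1 cM ~ 1 Mbp (varies along a chromosome)."""
    return distance_cm * 1.0


d = map_distance_cm(11, 100)                    # 11 recombinants among 100 progeny
print(d, "cM, roughly", approx_physical_distance_mbp(d), "Mbp")
```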
18 Mapping Genomes – Genetic Maps: Figure 1.1. (A) A pair of different parental chromosomes (green and blue). (B) A table showing the frequency of recombinants between each pair of markers. A larger number indicates that the genes are farther apart. (C) The most likely genetic map from the entire data. In this hypothetical example, two linkage groups are inferred; the top one is longer than 50 cM. Genetic distances: 0.11 → 11 cM, 0.22 → 21 cM, 0.25 → 24 cM, 0.33 → 33 cM.
19 Mapping Genomes – Genetic Maps: Software for the assembly of genetic maps. Multiple factors lead to high variation in the correspondence between physical and genetic distances. Recombination rate varies along a chromosome (centromeres and telomeres are less recombinogenic than general euchromatin), giving hot spots and cold spots of recombination.
20 Exercise 1.1 (Part 1) Constructing a genetic map: A breeder studies four recessive loci: thickskin, reddish, sour, petite. After identifying two true-breeding trees that are either completely wild-type or mutant for all four loci, the breeder crosses them, and then plants an orchard of F2 (second generation) trees. Q. Based on the following frequencies of mutant classes, determine which loci are likely to be on the same chromosome and which are the most closely linked. (Figure: table of mutant class frequencies.)
21 Exercise 1.1 (Part 2) Constructing a genetic map: Assuming independent assortment, each recessive phenotype is expected in 1/4 of the trees: 242 petite, 249 reddish, 247 sour and 236 thickskin. Unlinked loci would segregate independently, so about 60 trees (that is, 1/4 × 1/4 × 968) are expected in each double-mutant class.
22 Exercise 1.1 (Part 2) Constructing a genetic map: Q. Determine which of the four recessive loci (thickskin, reddish, sour, petite) are likely to be on the same chromosome and which are the most closely linked. Answer: The total number of trees is 968. Assuming independent assortment, each recessive phenotype is expected in 1/4 of the trees: 242 petite, 249 reddish, 247 sour and 236 thickskin. Unlinked loci would segregate independently, so about 60 trees (that is, 1/4 × 1/4 × 968) are expected in each double-mutant class.
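The expected counts in this answer can be checked with a short Python sketch. The tree total and single-mutant counts are taken from the slide; the double-mutant counts come from a figure that is not reproduced here, so the helper below (a hypothetical name) takes them as input instead of assuming values.

```python
TOTAL_TREES = 968
# Single-mutant counts as given on the slide.
OBSERVED_SINGLE = {"petite": 242, "reddish": 249, "sour": 247, "thickskin": 236}

expected_single = TOTAL_TREES / 4     # one recessive phenotype: 1/4 of the F2
expected_double = TOTAL_TREES / 16    # two unlinked recessive phenotypes: 1/4 * 1/4


def double_mutant_enrichment(observed_double):
    """Map each locus pair to observed / expected count.

    Ratios well above 1 suggest the two loci are linked, because the
    mutant alleles entered the cross together on the same parental chromosome.
    """
    return {pair: count / expected_double for pair, count in observed_double.items()}


for locus, count in OBSERVED_SINGLE.items():
    print(f"{locus}: observed {count}, expected {expected_single:.0f}")
print(f"expected per double-mutant class if unlinked: {expected_double}")
```

The exact expectation is 60.5 trees per double-mutant class, which the slide rounds to about 60.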
24 Mapping Genomes – Physical Maps: A physical map is an assembly of contiguous stretches of chromosomal DNA (contigs) in which the distance between landmark sequences of DNA is expressed in kilobases. The ultimate physical map is the complete sequence. Applications: (1) provide a scaffold upon which polymorphic markers can be placed; (2) facilitate finer-scale linkage mapping; (3) confirm linkages inferred from recombination frequencies; (4) resolve ambiguities about the order of closely linked genes; (5) enable detailed comparisons of regions of synteny between genomes.
25 Mapping Genomes – Physical Maps: Two strategies are used to assemble contigs. (1) Alignment of randomly isolated clones based on shared restriction fragment length profiles. Clone types: YAC, ~1 Mbp long fragments; BAC, ~100 kbp long fragments; plasmid, ~ kbp long fragments. Automatic restriction profiling (Ch. 2) is used to assemble contigs (short for "contiguous sequences").
26 Genomic clone library: Unlike the case of φX174, no large genome could be completely sequenced without an extra round of fragmentation into manageable sized chunks. In other words, it had to be transferred into one or more clone libraries from which individual clones were picked to be "subcloned" in M13 for sequencing. The general outline of the procedure is shown at right. You can see that φX174 bypassed the first stage, the construction of a clone library from the target genome. cDNA library: made from RNA that has been reverse transcribed into cDNA; used for EST sequencing projects.
28 Mapping Genomes – Physical Maps: (2) Hybridization-based approaches: chromosome walking. Chromosome walking is used as a means of finding adjacent genes (positional cloning), or parts of a gene which are missing in the original clone, as well as to analyze long stretches of eukaryotic DNA. This task requires finding a set of overlapping fragments of DNA that spans the distance between the marker and the gene. Figure: genomic DNA is shown in blue. Selected clones from a library of cloned genomic DNA fragments are shown in red. The initial probe, probe a, is specific to gene A or exon A and allows identification of clones 1 and 2. A new probe, probe b, is prepared from one end of clone 2 and used to isolate new clones 3 and 4 from the genomic library. Probe c, prepared from clone 4, is used to identify clone 5, etc. The orientation of the clones is determined by restriction mapping of the clones. Clone 6 contains the desired gene B or exon B.
29 Mapping Genomes – Cytogenetic Maps: Historically, cytogenetic maps aided the alignment of physical and genetic maps. Cytogenetic maps are the banding patterns observed through a microscope on stained chromosome spreads. Traditional preparations: salivary gland polytene chromosomes of insects (greatly enlarged relative to their usual condition) and Giemsa-banded mammalian metaphase karyotypes. Chromosomes carry the genetic material, so phenotypes or medical conditions correlate with the deletion or rearrangement of chromosome sections. Cytogenetic maps are aligned with the physical map through in situ hybridization: a cloned fragment is annealed to a single location on the cytogenetic map. NCBI Genomic Biology, keyword: HOX AND homo[ORGN]. Karyotypes.
30 Mapping Genomes – Cytogenetic Maps: Alignment of cytological, physical, and genetic maps. Cytological map: a representation of a chromosome based on the pattern of staining of bands. Physical map: the location of transcripts and sites of insertions and deletions. Genetic map: recombination rates vary along a chromosome, typically reduced near the telomere and centromere. Distances between genetic, physical and cytological markers are not uniform. How to search for genes on a genome map? See my lecture notes for the Bioinformatics class.
31 Comparative Genomics: Synteny: conservation of gene order between chromosome segments of two or more organisms. Homologs: highly conserved loci derived from a common ancestral locus. Orthologs: similar genes that arose as a result of speciation (an evolutionary split). Paralogs: similar genes that arose as a result of duplication. Conservation of gene order is an inverse function of the time since divergence from the ancestral locus. Note: rates of divergence vary considerably at all taxonomic levels. Japanese pufferfish: its genome is 7.5 times smaller than the human genome, yet it shows extensive gene order similarity with humans; around 50% - 80% of genes are in the same order as in the human genome.
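A figure such as "50% - 80% in the same order" needs some definition of gene order conservation. One simple possibility, assumed here purely for illustration, is the fraction of adjacent gene pairs in one genome that are also adjacent in the other. The gene names and function names are hypothetical.

```python
def adjacent_pairs(order):
    """Return the set of unordered neighbouring gene pairs along a chromosome."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def gene_order_conservation(order_a, order_b):
    """Fraction of adjacent gene pairs in genome A that are also adjacent in genome B."""
    pairs_a = adjacent_pairs(order_a)
    if not pairs_a:
        return 0.0
    return len(pairs_a & adjacent_pairs(order_b)) / len(pairs_a)


# Orthologous genes in two hypothetical species; g4 and g5 are swapped in species B.
species_a = ["g1", "g2", "g3", "g4", "g5"]
species_b = ["g1", "g2", "g3", "g5", "g4"]
print(gene_order_conservation(species_a, species_b))
```

Using unordered pairs means a local inversion only breaks adjacency at its two ends, not inside the inverted block.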
32 Comparative Genomics: Chromosome painting is used to define regions of synteny; it resolves regions of about 0.1 of a chromosome arm. Each chromosome of one species is labeled with a set of fluorescent dyes and hybridized to chromosome spreads of the other genome. It uses the fluorescent in situ hybridization (FISH) technique to detect DNA sequences in metaphase spreads of animal cells. The fluorescently labeled hybrid karyotype is shown at the bottom.
33 Comparative Genomics: Synteny between cat and human genomes. Ideograms (schematic chromosome diagrams) for each of the 24 chromosomes, shown on the right in each pair, are aligned against color-coded representations of the corresponding cat chromosomes. Cat: six groups (A – F) of 2 – 4 chromosomes each. Top row: 12 autosomes that are essentially syntenic along their length, except for some rearrangements. Bottom row: 10 autosomes that have at least one major rearrangement. The two sex chromosomes are essentially syntenic between cat and human.
34 Comparative Genomics: Sequence conservation = functional importance. High-resolution comparative physical mapping found ~1 Mbp regions of synteny between human and mouse. These may contain hundreds of genes, with local inversions and insertions/deletions involving one or a few genes. Families of genes are organized in tandem clusters. There is considerable size variation in intergenic "junk" DNA.
36 Comparative Genomics: Identifying genes and regulatory regions in sequenced genomes is challenging. Open reading frames (ORFs) are usually a good indication of genes. However, it is difficult to determine which ORFs belong to a gene. Many mammalian genes have small exons and large introns. Regulatory sequences are even more difficult. One of the major problems facing genomic researchers is to annotate sequenced genomes. A first step in this process is the identification of genes and the regulatory sequences that govern their expression. In general, the presence of an open reading frame (ORF) is a good indicator that a peptide is encoded in the DNA. However, it is frequently not straightforward to identify which ORFs belong to a gene. This problem is particularly acute in mammalian genomes, in which genes tend to have a lot of small exons (an average of 9-10) and large introns. Regulatory sequences such as transcription-factor binding sites are particularly difficult to identify when only a single organism's genomic sequence is available.
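To show what "looking for ORFs" means at its simplest, here is a minimal Python sketch that scans the three forward reading frames for an ATG followed in-frame by a stop codon. It is not how any particular gene finder works: it ignores the reverse strand, introns and splice sites, which is exactly why ORFs alone are an unreliable guide in mammalian genomes. The test sequence and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is the 0-based index of the ATG, end is one past the stop codon,
    and min_codons counts codons before the stop (including the ATG).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                          # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                       # close it and keep scanning
    return orfs


sequence = "CCATGAAATTTGGGTAACC"                   # toy sequence, hypothetical
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```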
37 Comparative Genomics: Computer programs analyze genomic sequence: GRAIL, GeneFinder. They look for ORFs, splice sites, poly A addition sites, etc., and predict gene structure. They are frequently wrong: they usually miss exons at the beginning or end of a gene, and sometimes predict an exon when one doesn't really exist. It is not easy to know whether an ORF is really part of a gene or is there fortuitously. Computer programs like GRAIL and GeneFinder will search DNA sequences for ORFs. In addition, these programs search for regulatory sequences that can help in the identification of genes. In particular, the sequences found around splice junctions are very useful in distinguishing real exons from ORFs that happen to be present in an intron. Nevertheless, most computer programs designed to identify genes by using only a single genomic sequence are wrong at least 10% of the time. They frequently miss exons at the beginning or end of a gene, or they predict an exon where one doesn't really exist.
38 Comparative Genomics: When comparing genomes of different species, the genes normally have the same exon–intron structure. Look for conserved ORFs in both genomes; this frequently permits accurate identification of genes. Fugu–human comparison found >1,000 genes. Mouse–human comparison indicates only 25,000 genes in the genome. When genomes of more than one species are compared, coding regions and exon–intron structures tend to be conserved. Thus, it becomes fairly straightforward to look for ORFs that are conserved in both genomes and found in similar positions relative to surrounding ORFs. Cross-genome comparisons have been shown to be an effective means of accurately annotating genes as well as identifying new genes. For example, in the comparison of the Fugu genome with the human genome, over 1,000 putative genes were discovered that had been missed by the annotation programs. Comparison of the mouse genome with the human genome indicates that there are only about 25,000 genes in both genomes. This finding is consistent with the predictions made from the draft human sequence, but runs contrary to earlier pregenome predictions that set the number of genes as closer to 100,000.
39 Example of sequence comparison: Comparison of the human and mouse spermidine synthase genes revealed an additional intron in the human gene that is not found in the mouse homologue. (Figure: human and mouse gene structures; scale 5,500 bp.) When the human spermidine synthase gene, involved in the synthesis of polyamines, was compared with the homologous gene in the mouse genome, it was found that there is an additional intron in the human gene that interrupts the fifth exon in the mouse gene. This is an example of a comparison that highlights how gene structure can change as species evolve.
40 The Human Genome Project (HGP): Objectives. (1) Generation of high-resolution genetic and physical maps that will help in the localization of disease-associated genes. (2) The attainment of sequence benchmarks, leading to generation of a complete genome sequence by the year (a draft version was achieved in June 2000, but finished sequence required an error rate of less than 1 in 10,000 bp). (3) Identification of each and every gene in the genome by a combination of bioinformatic identification of open reading frames (ORFs), generation of voluminous EST databases, and collation of functional data, including comparative data from other animal genome projects. (4) Compilation of exhaustive polymorphism databases, in particular of SNPs, to facilitate integration of genomic and clinical data, as well as studies of human diversity and evolution.
41 The Human Genome Project (HGP): Table 1.1 Initial Goals of the HGP, from the First 5-Year Plan. Table 1.2 A Blueprint for the Future of the HGP: 15 Grand Challenges in the Third 5-Year Plan, 2003 – 2005. HGP budget: set aside for research on the ethical, legal, and social implications of genetic research (the ELSI project).
47 The Human Genome Project: The architecture of the Human Genome Project in the twenty-first century. Three major themes for future genome research are founded on six pillars of genome resources.
48 ELSI: Box 1.1 The Ethical, Legal, and Social Implications of the HGP. Funding: the National Human Genome Research Institute (NHGRI) commits 5% of its annual budget to ELSI, funding three types of activities: regular research grants, education grants, and intramural programs at the NIH campus. Web sites: 4 major objectives; 4 main subject areas.
49 ELSI: A great concern is the privacy and confidentiality of genetic information, especially in Iceland and Estonia, where government-sponsored databases of medical records have been supplied to medical research companies. Psychological impact and potential for stigmatization inherent in the generation of genetic data: racial mistrust and socioeconomic differences in the gathering of and access to genetic information. Reproductive issues. Potential moral (and possibly legal) obligations once data has been obtained. Philosophical discussions: human responsibility, the human right to "play God" with genetic material, the meaning of free will in relation to genetically influenced behaviors. Genetically Modified Organisms (GMOs). 1998: five new major aims.
51 The content of the Human Genome (Figure 1.7 (Part 1) Whose genome was sequenced?): Completion of the first draft of the HGP was announced at a press conference in June 2000, but publication of the result was delayed until Feb. of 2001. The draft needed refinement of the sequence assembly, including gap closure, gene annotation, and prediction. It is estimated that the total number of genes is somewhere around 25,000 (~ two times greater than the gene content of the fruit fly and C. elegans, and five times greater than yeast; see Table 1.3, Comparison of Gene Content in Some Representative Genomes, for more details). There are no dramatic differences in gene content between humans and other mammals. Sep: the first high-resolution genetic map of the complete genome, with 23 linkage groups (one per chromosome) and 1200 markers at an average of 1 cM intervals. Around 1995: physical map with sequence tag sites (STS) at ~60 kbp intervals. 1998: 3000 SNPs. Middle of 2004: 1.8 million mapped SNPs (see The SNP Consortium, TSC), providing polymorphic markers at 2 kb intervals and placing 85% of all exons within 5 kb of a SNP. 2000: the first draft of the smallest human chromosome, chromosome 21, was published.
53 The content of the Human Genome: Two questions for the HGP. (1) Whose genome was sequenced? The sequence is derived from a collection of several libraries obtained from a set of anonymous donors. Both the IHGSC and the private firm Celera Genomics assembled their sequences from multiple libraries of ethnically diverse individuals. One particular individual's DNA contributed 3/4 and 2/3 of the raw sequence, respectively. (Figure: size of shaded sector ~ amount of sequence contributed by a single individual.)
54 The content of the Human Genome: The Celera sample included at least one individual from each of four ethnic groups, as well as both males and females. Craig Venter admitted that his own DNA contributed substantially to the Celera sequence. Their own poodle contributed to the first-draft canine genome sequence.
55 The Human Genome Project: (2) When can we regard it as finished? The complete sequence of 99% of human euchromatin has been published to an estimated error rate of ~ 1 event in 100,000 bases. Human polymorphism is an order of magnitude greater than this: at least 10 SNPs for each sequencing error. Extensive tracts of heterochromatin (regions with few or no genes, such as centromeres and telomeres), mostly associated with centromeres, may account for as much as 20% of the total genome and will probably never be sequenced. Since the completion of the first draft, the HGP has focused on characterizing human diversity. International HapMap project: map all of the major haplotypes in the human genome and characterize their distribution among populations, as a step toward identification of human disease susceptibility factors.
56 Internet Resources – NCBI and Ensembl: NCBI (http://www.ncbi.nlm.nih.gov). Ensembl: a collaboration between EMBL-EBI and the Sanger Center in the UK. Both sites provide high-resolution physical maps of any segment of the genome, with several genome views. UCSC Genome Browser. Commercial web sites: Incyte Genomics, Celera, Rosetta Inpharmatics, Informax, and LION Biosciences. Figure 1.8 The National Center for Biotechnology Information (NCBI) Web site.
57 Internet Resources – NCBI and Ensembl: Ex. 1.2 Use the NCBI and Ensembl genome browsers to examine a human disease gene. Use OMIM to identify a gene that is implicated in the etiology (causation) of the disease. Ans. Go to Asthma and find a gene of interest, for example Interleukin 13 (IL13). This page gives a lot of textual information plus links to other sites, including the Human Gene Mutation Database (HGMD) and Entrez Gene. (a) What are the various identifiers of the gene? *147683. (b) Where is the gene located on the chromosome (cytologically and physically)? The cytological location is 5q31 (chromosome 5, long arm). Click on Gene map locus 5q31 → click location 5q31 → click NCBI MapViewer → position Mb. Gene ID for IL13 is 3596. Gene aliases: ALRH; P600; IL-13; MGC116786; MGC116788; MGC116789. (c) What is the RefSeq for the gene? The RefSeq is NM_002188, an mRNA sequence.
58 Internet Resources – NCBI and Ensembl: (d) How many exons are there in the major transcript, and how long is it? From Entrez Gene → Display 'Gene table': 4 exons, 1282 bp long, encoding a 146 amino acid protein; or use NCBI MapViewer → Consensus CDS (CCDS). From the RefSeq ID NM_ → link to GenBank: signal peptide (interleukin 13 precursor), 34 aa (seq. 15 – 116); mat_peptide (interleukin 13 precursor), 98 aa. (e) What is known about the function of the gene? See the NCBI description: this gene encodes an immunoregulatory cytokine produced primarily by activated Th2 cells. This cytokine is involved in several stages of B-cell maturation and differentiation. (f) Do the two annotations agree? Which browser do you prefer, and why? Ensembl → select gene → type IL-13 → Ensembl gene ID ENSG → GeneView shows: Exons: 4; Transcript length: 1,282 bps; Protein length: 146 residues.
59 Internet Resources - OMIM: Online Mendelian Inheritance in Man. A database that provides text summarizing recent genetic research in response to a query about a particular disease, as well as links to MedLine, GenBank and other information. Intended for physicians and human geneticists. Covers disease types such as muscle, metabolism, cardiovascular, and physiological disorders. OMIM lists in excess of 15,000 known Mendelian disorders. GEO BLAST tool: search for all genes in the gene expression database that have similar sequences, and then compare levels of expression of the genes across species and experimental conditions. Figure 1.9 The Mendelian Inheritance in Man (OMIM) Web site.
61 Internet Resources - OMIM: OMIM has a defined numbering system: certain positions within the number indicate information about the genetic disorder itself. The first digit gives the mode of inheritance of the disorder: 1 = autosomal dominant; 2 = autosomal recessive; 3 = X-linked locus or phenotype; 4 = Y-linked locus or phenotype; 5 = mitochondrial; 6 = autosomal locus or phenotype.
62 Internet Resources - OMIM: The distinction between 1 or 2 and 6 is that entries cataloged before May 1994 were assigned either a 1 or 2, whereas entries after that date were assigned a 6 regardless of whether the mode of inheritance was dominant or recessive. * = the phenotype caused by the gene at this locus is not influenced by genes at other loci; however, the disorder itself may be caused by mutations at multiple loci. # = the phenotype is caused by two or more genetic mutations.
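The first-digit rule described on these slides is a simple lookup, sketched below in Python. The table follows the slide's wording and not OMIM's current documentation, the function name is hypothetical, and the example number is the IL13 entry quoted in Ex. 1.2.

```python
# First-digit meanings as listed on the slide.
FIRST_DIGIT_MEANING = {
    "1": "autosomal dominant",
    "2": "autosomal recessive",
    "3": "X-linked locus or phenotype",
    "4": "Y-linked locus or phenotype",
    "5": "mitochondrial",
    "6": "autosomal locus or phenotype",
}


def describe_mim_number(mim: str) -> str:
    """Return the inheritance category implied by the first digit of a MIM number."""
    digits = mim.strip().lstrip("*#")          # drop the * or # prefix symbol
    if not digits.isdigit():
        raise ValueError("expected a numeric MIM number, optionally prefixed by * or #")
    if digits[0] not in FIRST_DIGIT_MEANING:
        raise ValueError(f"no category listed for first digit {digits[0]}")
    return FIRST_DIGIT_MEANING[digits[0]]


print(describe_mim_number("*147683"))          # IL13 entry from Ex. 1.2
```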
63 Internet Resources - OMIM: Example: (MKKS). Display allelic variants: a description of the clinical or biochemical outcome of that particular mutation is given after each allelic variant. (Figure: allelic variants for MKKS.)
64 Internet Resources - OMIM: OMIM indicates that the gene SRY encodes a transcription factor that is a member of the high-mobility group-box family of DNA binding proteins. Mutations in this gene give rise to XY females with gonadal dysgenesis, and translocation of the part of the Y chromosome containing this gene to the X chromosome gives rise to XX males. Q1a. An allelic variant of SRY causing sex reversal with partial ovarian function has been cataloged in OMIM. What was the mutation at the amino acid level, and what is observed in XY mice carrying this mutation? Ans. Use "SRY AND human" for the OMIM search, then view the list of allelic variants. Variant 0020 is the correct entry. The mutation is Gln2Ter; XY mice are fertile females, although fertility is reduced and the ovaries fail early.
65 Internet Resources - OMIM: Q1b. Follow the Gene Map link in the left sidebar to access the MIM gene map. One other gene is found at the same cytogenetic map location. What is the name of this gene, and what methods were used to map the gene to this location? Ans. Click GeneMap in the left sidebar. The correct gene is ZFY. Under the Methods column, REn and A are listed. Clicking on the Methods hyperlink at the top of the column shows the key to the abbreviations. REn stands for neighbor analysis in restriction fragments; A stands for in situ hybridization.
66 Animal Genome Projects: The International Sequencing Consortium (ISC) maintains a database of animal and plant genome sequencing projects. Some of these organisms are shown in Figure 1.10. Figure 1.10 (Part 1) A gallery of animal genome sequencing projects.
67 Animal Genome Projects: At the National Human Genome Research Institute (NHGRI), the decision to commit the tens of millions of dollars required for any new genome is made by a council of senior genome scientists, based on a 10-page "white paper". The council weighs the expected impact of the sequence on enabling biomedical research and the annotation of sequence function. A draft genome can be produced for most animals within 3-6 months. Figure 1.10 (Part 2) A gallery of animal genome sequencing projects.
68 GenBank Files – Box 1.2 (Figure 1.10 (Part 3) A gallery of animal genome sequencing projects): There are many ways to present the structure and annotation of a gene or sequence, because of alternative splicing and alternative transcription start sites (TSS), small errors that occur during cDNA cloning, and the fact that all genomes are full of polymorphism. The same gene may therefore be represented by multiple different sequences or annotations in the genome database. RefSeq: hand curation by experts. Example: human HoxA1. Go to LOCUS: XM_004915, GI: . This is followed by the reference, then the Features section (CDS, misc_feature, etc.) with links to GeneID, MIM and CDD. Next comes the sequence, available in FASTA format; use 'Display' for XML or ASN.1 file format.
69 GenBank Files – Box 1.2: Use Entrez Gene: HOXA1. Two isoforms. GenBank format. Graph display: HOXA1.
72 Rodent Genome Projects: Mouse Genome Informatics (MGI). Three major advantages of rodent research are: (1) the existence of a large number of mutant strains that, combined with whole-genome mutagenesis, lead to genetic analysis of every identified locus in the genome; (2) the existence of a panel of approximately 100 commonly used laboratory mouse strains with well-characterized genealogy, a resource for the study of genetic variation; (3) the existence of conserved sequence blocks, which is generally an indicator of functional constraint. 2002: draft of the mouse genome. 2004: draft of the rat genome. Figure 1.11 The Mouse Genome Informatics (MGI) Web site.
73Rodent Genome Projects: Functional genomic analysis of the rat has been stimulated by three major advances achieved in the 1990s. 1. The technology for targeted (site-directed) mutagenesis by homologous recombination of the wild-type locus with a disrupted copy. 2. Saturation random (unbiased) mutagenesis programs, which gather information about the entire “sequence space”, i.e., the relationship between amino acid sequence, 3D protein structure, and function. 3. The emergence of ‘phenomic’ analysis, in which mutagenized lines are subjected to biochemical, physiological, immunological, morphological, and behavioral tests in parallel, allowing large-scale identification of genes required for non-lethal phenotypes.
74Figure 1.12 Mouse-human synteny and sequence conservation. Rodent Genome Projects: Conservation of gene order and DNA sequence between the human and mouse genomes. Blocks of synteny between mouse chr. 11 and parts of five different human chromosomes. Enlarged view of a small region, human 5q31: in this approximately 1 Mb region there is almost perfect correspondence in the order, orientation, and spacing of 23 putative genes, including four interleukins. Enlargement of the alignment of 50 kb that includes the genes KIF-3A, IL-4, and IL-13: blue dots show the distribution of conserved sequence (with 50%-100% identity). Two of the conserved blocks (red bars) fall between genes, whereas most of the others (blue bars) are in the introns and exons of the genes. Use PipMaker.
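The 50%-100% identity cutoff used for the conserved blocks can be illustrated with a small calculation. The sketch below is a simplification and not PipMaker's own procedure: it assumes two sequences that are already aligned (gaps written as "-"), slides over them in fixed, non-overlapping windows, and keeps windows at or above the cutoff. The function names, the window size, and the two short sequences are all hypothetical.

```python
def percent_identity(seg_a, seg_b):
    """Percent of gap-free aligned columns in which both sequences agree."""
    if len(seg_a) != len(seg_b):
        raise ValueError("aligned segments must have equal length")
    columns = [(a, b) for a, b in zip(seg_a.upper(), seg_b.upper())
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def conserved_windows(aln_a, aln_b, window=10, min_identity=50.0):
    """Return (start, identity) for non-overlapping windows at or above the cutoff."""
    hits = []
    for start in range(0, len(aln_a) - window + 1, window):
        pid = percent_identity(aln_a[start:start + window],
                               aln_b[start:start + window])
        if pid >= min_identity:
            hits.append((start, pid))
    return hits


# Made-up aligned sequences: the first 10 columns are similar, the last 10 are not.
human = "ACGTACGTAC" + "TTTTGGGGCC"
mouse = "ACGTACGAAC" + "AAAACCCCGG"

hits = conserved_windows(human, mouse, window=10, min_identity=50.0)
print(hits)
```

Only the first window passes the cutoff here, which mirrors how a percent identity plot shows dots only where the two genomes are at least 50% identical.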
75Exercise 1.3 Compare the structure of a gene in a mouse and a human
76Rodent Genome Projects: Use NCBI. Choose Genome biology, then mouse chr. 11. Use ‘Maps and options’ to add the human gene map.
77Rodent Genome Projects: Mouse Genome Informatics (MGI). Integrates physical and genetic maps. Search for orthologous genes. Online comparison of the mouse and human genomes.
78Rodent Genome Projects: Ex. 1.3. Using either the NCBI or the Ensembl browser, explore the structure of the gene used in Fig. 1.2 in a mouse and a human (and other vertebrates). Ans.: In Ensembl, type in human IL13 (ENSG ), open ‘Orthologue Prediction’, and view all genes in ‘MultiContigView’. IL13 is on mouse chr. 11, human chr. 5, and rat chr. 10.
80Other Vertebrate Biomedical Models: 2004: the chicken (G. gallus) and dog (C. familiaris) genomes are fully sequenced. Motivation: biomedical. Chicken: a model for oncogenesis and virology. Dog: a model for complex diseases such as asthma, parasite infection, cancer, arthritis, diabetes, and behavioral disorders. Applications: artificial selection on breed diversity; research into avian evolution. Vertebrate development: zebrafish, with transparent embryogenesis, ease of culture, and the existence of a dense genetic map. Thousands of genes have been found to be required for proper development of organs. A variety of ecologically and commercially important fish species, such as sticklebacks, cichlids, and salmonids.
83Other Vertebrate Biomedical Models: Sequencing of nonhuman primates, such as the rhesus macaque and the chimpanzee, is intended to help understand the origins of diversity in the immune system as well as mechanisms of pathogen resistance. Comparison of human and chimp sequences: many genes seem to have been positively selected. Humans are differentiated from chimps by small deletions up to 10 kb in length, which occur on average every 500 kb along chromosome 21.
84Animal Breeding Projects: OMIA (Australia): genome maps for over a dozen species of agricultural importance. Access data on inheritance patterns for species other than human and mouse. Benefits of breeding programs lie in improvements in yield, infectious disease resistance, adaptation to climatic conditions, improved food quality, and maximizing the benefits of transgenic technology. These goals will be met both through enhanced genetic map development and through association studies using SNP technology. ArkDBs (UK, Roslin Institute in Edinburgh): genome resources for ~10 species.
85Invertebrate Model Organisms: Generic Model Organism Database (GMOD). A coordinated effort of the mammalian, invertebrate, and plant genome communities to standardize web tool construction and implementation and to provide open source software for database management. Figure The GMOD project
86Invertebrate Model Organisms: A 40 kb region of cytological band 43E of the fruit fly, centered on the saxophone gene. Figure Drosophila gene annotation
87Invertebrate Model Organisms: FlyBase. Search for the gene symbol sax, then click the ‘gene region map’. Each gene either has a number beginning with CG or is identified by its standard name (e.g. sax). The map shows genes and mRNAs, as well as transposable element insertions (Burdock; one is shown in pink).
88Invertebrate Model Organisms: The first multicellular eukaryote to be sequenced completely was C. elegans, in 1998. Fruit fly: sequence completed in 2000. Decades of genetic analysis have led to the molecular characterization of up to 20% of the complement of genes in these two organisms. Over 90% of the true genes seem to have been identified. Tentative functions are assigned based on sequence similarity. Between 1/4 and 1/3 of the predicted genes remain ‘orphans’, with no known sequence similarity to genes in any other organism and without functional data.
89Invertebrate Model Organisms: Ongoing EST sequencing, gene structure and mutational analysis. Unexpectedly, there may be roughly 40% more genes in the C. elegans genome (19,000) than in the fly genome (13,500), despite the fact that the fly is much more complex at several levels, including (1) the number of cells, (2) the number of cell types, and (3) the organization of the nervous system. Nematode: a surprising surplus of steroid hormone receptors. Fruit fly: olfactory receptor family. There is no simple relationship between gene number and tissue complexity. There is a high degree of conservation of all the major regulatory and biochemical pathways, most of which are identifiable not only in both nematodes and flies but also in the unicellular eukaryote yeast and in vertebrate genomes.
90Invertebrate Model Organisms: Functional genomics. A major impact of the invertebrate genome projects is the prospect of obtaining mutations in every single gene of the genomes. In the fly: a combination of saturation mutagenesis and a library of overlapping deficiencies (deletions) that remove every segment of each chromosome. In the nematode: saturation mutagenesis plus RNAi (double-stranded RNA fed to the worms).
91Figure 1.15 Human disease genes in model organisms. Invertebrate Model Organisms: >60% of a sample of 289 human disease genes have orthologous genes in the fly, <60% in the nematode, and ~20% in yeast. The figure shows the fraction of human disease genes in each of six categories that have orthologs in the fly, nematode, and yeast genomes, as detected by sequence similarity at three levels of significance. Conservation of genetic interactions across the animal kingdom can uncover genes that interact with known disease-promoting loci. Pharmaceutical companies are interested in invertebrate genomics for its potential to identify drugs that affect neural function. Examples: fluoxetine resistance in nematodes, alcohol tolerance in flies. Molecular interactions between gene products can be conserved, which allows the functional comparison of genes across species.
94Box 1.3 Managing and Distributing Genome Data: Internet technology is essential for genomic scientists. NCBI, EBI, LIMS (laboratory information management systems). Databases: RDB (relational DB) and OODB (object-oriented DB). RDB: very effective for sorting, searching, and distributing data that fit into table form. OODB: good at handling complex data structures and useful for performing analyses on sequence ‘objects’ (data together with functions for operating on the data), a very efficient programming approach. DB query language: SQL (Structured Query Language). Scripting language (no need to compile): Perl, good for extracting and processing text files.
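To make the "data that fit into table form" point concrete, here is a minimal sketch of a relational table queried with SQL, using Python's built-in sqlite3 module and an in-memory database. The table name `gene_location` and its columns are hypothetical; the three rows reuse the IL13 chromosome assignments given in the answer to Exercise 1.3.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per gene per organism.
conn.execute(
    "CREATE TABLE gene_location (symbol TEXT, organism TEXT, chromosome TEXT)"
)
conn.executemany(
    "INSERT INTO gene_location VALUES (?, ?, ?)",
    [
        ("IL13", "human", "5"),
        ("IL13", "mouse", "11"),
        ("IL13", "rat", "10"),
    ],
)

# SQL query: where is IL13 located in each organism?
rows = conn.execute(
    "SELECT organism, chromosome FROM gene_location "
    "WHERE symbol = ? ORDER BY organism",
    ("IL13",),
).fetchall()

for organism, chromosome in rows:
    print(organism, chromosome)

conn.close()
```

The same query would work unchanged on a much larger table, which is the main appeal of the relational model for sorting and searching.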
95Box 1.3 Managing and Distributing Genome Data
96Plant Genome Projects: Arabidopsis thaliana: the first plant genome to be sequenced, between 1999 and 2000. ~115 Mb, ~25,000 genes, about twice the number of fly genes. Evolved via two rounds of whole-genome duplication, shuffling of chromosome regions, and considerable gene loss. >1500 tandem arrays (generally 2 or 3 copies) of repeated genes have been identified; ~11,000 gene families. Some geneticists regard this number as representative of the minimal complexity required to support multicellularity. It is believed that all plant and animal genomes represent modifications of a ‘toolkit’ of gene families that evolved >10^9 years ago.
97Figure 1.16 Chromosome duplications in the Arabidopsis thaliana genome. Plant Genome Projects: >30 segmental duplications. Seven intra-chromosomal duplications are shown as duplicated blocks of color within three of the five chromosomes; five duplications occur in the first chromosome, and the fourth and fifth chromosomes each display one duplication. Another two dozen inter-chromosomal segmental duplications are shown. A twist in the band indicates that an inversion accompanied the duplication event.
98Plant Genome Projects: Plant genomes contain plant-specific genes. Enzymes required for cell wall biosynthesis. Transport proteins that move organic nutrients, inorganic ions, toxic compounds, metabolites, and even proteins and nucleic acids between cells. Enzymes required for photosynthesis, such as Rubisco and electron transport proteins. Products involved in plant turgor and in phototropic and gravitropic responses. Enzymes and cytochromes involved in the production of secondary metabolites found in flowering plants. A large number of pathogen resistance (R) genes, analogous to the mammalian immune system; R genes are dispersed throughout the genome rather than localized in a single complex.
99Plant Genome Projects: Plants share many gene families with animals: intercellular communication, transcriptional regulation, signal transduction. A. thaliana lacks homologs of the Ras G-protein family and of tyrosine kinase receptors, as well as Rel, forkhead, and nuclear steroid receptor transcription factors. TAIR: The Arabidopsis Information Resource. UK CropNet.
100Grasses and Legumes: Genome projects for >50 different plant species are under way. The most important are the major feed crops: the grasses maize, rice, wheat, sorghum, and barley; the forage legumes soybean and alfalfa; forage rye grasses; and fescues. Several genomes are very large, so whole-genome sequencing is impractical. Both rice (Oryza sativa) and maize (Zea mays) have relatively small genomes. Two major rice cultivars: japonica rice and indica rice. MaizeGDB. Waxy (glutinous) rice.
101Figure 1.17 Rice-Arabidopsis synteny. Comparison of the genome sequences of rice and Arabidopsis reveals extensive, complex patterns of synteny. 20 of 54 genes in a 340 kb stretch of the rice genome (top) retain the same order in five different 80- to 200-kb regions of the Arabidopsis genome (below). Conserved genes (red and green boxes) are found on both rice and Arabidopsis strands, but are interspersed by a variable number of different genes (yellow boxes) in Arabidopsis. Shaded boxes above the rice chromosome indicate that the conserved gene is in the opposite relative orientation on the Arabidopsis chromosomes.
102Grasses and Legumes: Economically important traits include resistance to a broad range of pathogens; flowering time, seed set, grain morphology, and related yield traits; tolerance to drought, salt, heavy metals, and other extreme environmental circumstances; and measures of feed quality such as protein and sugar content. These traits are improved through genetic engineering plus specialized plant breeding techniques. Genome projects reveal much information regarding the evolution of domesticated species.
103Grasses and Legumes: Teosinte versus Maize. Modern maize is a derivative of the wild progenitor teosinte, which had multiple tillers. Throughout the coding region of tb1, the level of polymorphism is substantially the same in a sample of maize and teosinte. However, in the 5’ UTR, there is a dramatic reduction in the level of polymorphism in maize relative to that seen in teosinte. Figure Teosinte branched 1 and the evolution of maize
104Other Flowering Plants: >90 angiosperm genome projects are listed on the US Department of Agriculture web site. African, Australian, European, and US projects. Genetic maps and the search for a common set of plant genes. For some species, large EST sequencing projects are also in place, enabling comparative genomic analysis. Arabidopsis, the grasses, and several model organisms shed light on plant evolution.
105Figure 1.19 Forest genomics. Other Flowering Plants: Forest trees: potential for economic impact. High-density genetic maps of spruce, loblolly and several other pines, and a few species of Eucalyptus. Traits: wood quality, growth and flowering parameters. Dendrome web site. Comparative analyses and transcription profiling of genes involved in wood properties, including lignins and enzymes that regulate cell wall biosynthesis. Crop plants: potato, tomato, tobacco, beans, cotton. Analysis of genome diversity informs productivity, yield, and quality improvements. No plant equivalent of the HGP’s ELSI initiative has been established.
106Microbial Genome Projects: The minimal genome. 1995: the first complete genome, H. influenzae, followed by M. genitalium and 3 other bacteria. 1997: E. coli. Sequence information: genome structure (GC content, transposable elements, recombination) and genome content (total number of genes, conserved gene families). Gene annotation for prokaryotes is more straightforward: ORFs tend to be uninterrupted and genes tend to be closely spaced; however, the assignment of genes to operons is not trivial. About 3/4 of the genes in a microbial genome can be assigned a function based on their similarity to genes in other organisms or by identifying protein domains. TIGR.
107Microbial Gene Contents: M. genitalium: 0.6 Mb, 471 genes. H. influenzae: 1.8 Mb, 1750 genes. E. coli K Mb, 4288 genes. Average gene length ~1.1 kb. Gene duplication and divergence in large genomes, gene loss in small genomes.
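The genome sizes and gene counts on this slide can be turned into a rough gene density, which gives a feel for the ~1.1 kb average gene length in closely packed prokaryotic genomes. The sketch below is only illustrative: it divides genome size by gene number, so the result is kilobases of genome per gene (an upper bound on mean gene length), and it uses only the two entries whose values are complete on the slide. The name `kb_per_gene` is made up.

```python
# Genome size in Mb and gene count, as listed on the slide.
genomes = {
    "M. genitalium": (0.6, 471),
    "H. influenzae": (1.8, 1750),
}


def kb_per_gene(size_mb, n_genes):
    """Kilobases of genome per annotated gene (1 Mb = 1000 kb)."""
    return size_mb * 1000.0 / n_genes


spacing = {name: round(kb_per_gene(size, genes), 2)
           for name, (size, genes) in genomes.items()}

for name, value in spacing.items():
    print(f"{name}: {value} kb of genome per gene")
```

Both values come out close to 1 kb per gene, consistent with genes that are tightly spaced and uninterrupted.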
108Exercise 1.4 Compare two microbial genomes using the CMR
109The minimal genome: the minimum complement of genes that are necessary and sufficient to maintain a living organism. To define genetically ‘What is life?’ Two general strategies. Bioinformatics strategy: identify which genes are present in each and every sequenced genome. Some functions can be performed by non-orthologous genes. Conserved orthologs plus a small number of alternatives: ~256 genes.
110The minimal genome: Experimental strategy: systematically knock out the function of individual genes; mutations that cannot be recovered define genes that are likely to be components of the minimal genome. M. genitalium: mutations were recovered in 120 of the 470 genes. B. subtilis (~4100 genes): 271 genes are indispensable under favorable growth conditions, covering metabolism, cell division and shape, and synthesis of the cellular envelope. Synthetic lethality: the nonviability in combination of two or more individually viable mutations. Infer that life can be supported by a genome of between 250 and 350 genes. Build a viable organism from scratch by stitching together artificially synthesized genes, as was done to build a poliovirus.
111Figure 1.20A Describing the minimal genome. Deeper color: presence of a gene. Pale color: the gene is absent in that species. Genes a, d, and f are present in all species, so they are inferred to be necessary for life.
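The bioinformatics strategy amounts to intersecting the gene sets of all sequenced genomes. A minimal sketch, assuming each genome is represented as a set of gene labels: the species names and the presence/absence pattern below are hypothetical, chosen only so that genes a, d, and f are shared by all species as in the figure.

```python
# Hypothetical presence/absence data: one set of gene labels per species.
gene_content = {
    "species_1": {"a", "b", "d", "f"},
    "species_2": {"a", "c", "d", "e", "f"},
    "species_3": {"a", "d", "f", "g"},
}

# Genes present in every genome are candidates for the minimal genome.
core_genes = set.intersection(*gene_content.values())

print(sorted(core_genes))
```

In practice the comparison is made on orthologous groups rather than exact labels, and, as noted on the earlier slide, some essential functions are carried out by non-orthologous genes, so a plain intersection would underestimate the minimal set.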
112Figure 1.20B Describing the minimal genome. Mutagenesis experiments: establish which genes are essential by systematically knocking out each functional gene and seeing whether the organism can survive without it. The overlap of these two approaches may define the minimal genome.
113Figure 1.21 TIGR representation of a typical microbial genome. Sequenced Microbial Genomes: TIGR Comprehensive Microbial Resource (CMR). New site. 39 genomes were generated by TIGR, and the rest by Brazil, Japan, …. Omniome DB. Streptococcus pneumoniae TIGR4. The outer and inner circles represent genes encoded on the two strands of the chromosome. Genes from HMM: blue; BLAST: yellow; Omniome: pink. Click ‘align genome’ for MUMMER. Click ‘Analyses’ for more tools, such as COG/TIGRFAM/PFAM.
117Environmental Sequencing: Sequencing DNA extracted from an environment such as ocean, soil, or intestinal flora. The main reason is that the vast majority of bacteria cannot be cultured in vitro, so our knowledge of microflora is both limited and biased by sampling. Pilot projects: identifying novel genes has the potential to change oceanographers’ understanding of the mechanisms of photosynthesis and of global carbon and nitrogen cycling. Proteorhodopsin genes: suggest that light harvesting need not be coupled to chlorophyll in cyanobacteria. C. Venter: identified >1M new genes and almost 150 new types of bacteria. Fecal material: the human gut contains >500 different species of bacteria, of which <30% can be cultured outside the body.
119Parasite Genomics: World Health Organization (WHO): 10 tropical diseases that affect billions of people worldwide. Eradicating the pathogenic agents. Crop damage caused by parasitic plant nematodes costs billions of dollars.
120Parasite Genomics: Aims: Identify species-specific genes. Understand the developmental genetics. Polymorphism surveys that address the population biology of the parasites. Mapping the genomics of the mosquito.