Making sense of genomic sequences
Look for repeats
HMM analysis
Compare genomes to each other
Compare to other kinds of experimental data
–Which kinds of data can you think of?
Other kinds of data
1. mRNA (EST)
2. RNA sequences & structures
3. Protein sequences
4. Protein structures
5. SNPs, polymorphisms
6. Gene expression (microarray)
7. Protein expression (2D protein gels)
8. Protein interaction
9. Metabolic pathways
10. Regulatory pathways
Transcript databases
RefSeq contains full-length mRNA sequences, carefully reviewed
–Currently human sequences
dbEST contains 5’ and 3’ reads of random cDNAs
–Currently 3.7 million human sequences
ESTs
UniGene: merge (cluster) any two ESTs that share more than 100 bp of identical sequence
3.7 million ESTs -> clusters
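The merging rule on this slide can be illustrated with a small sketch. This is not the actual UniGene build procedure: it is a toy single-linkage clustering that assumes same-strand reads, exact matches only, and a hypothetical `min_identical` parameter (set to 101 to mean "more than 100 bp"; the toy call uses 6 so the example stays readable). Two reads share an identical stretch of at least k bases exactly when they share a k-mer, which is what the code tests.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, min_identical=101):
    """Toy single-linkage clustering of ESTs.

    Two ESTs are linked when they share an identical stretch of at least
    min_identical bases (assumption: same strand, no mismatches allowed).
    Returns clusters as sorted lists of indices into ests.
    """
    parent = list(range(len(ests)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(seq, min_identical) for seq in ests]
    for i, j in combinations(range(len(ests)), 2):
        if kmer_sets[i] & kmer_sets[j]:  # a shared k-mer = shared identical stretch
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up toy reads; the threshold is lowered to 6 bases for illustration.
toy_ests = ["AAACCCGGGTTT", "CCCGGGTTTAAA", "TTTAAACGCGCG", "GATTACAGATTACA"]
toy_clusters = cluster_ests(toy_ests, min_identical=6)
```

Reads 0 and 2 share no 6-base stretch directly, but both overlap read 1, so all three end up in one cluster. This transitive merging is why a few chimeric or repetitive reads can join otherwise unrelated clusters.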
ESTs
UniGene: total number of clusters
[Table: cluster size vs. number of clusters]
Some statistics
[Table: copies per cell, number of different mRNA species, and number of mRNAs per abundance level, with a total row]
More statistics
[Figure: abundance level vs. size of EST database]
Solutions
Sequence ESTs from many cell types: transcripts that are rare in one tissue may be abundant in another.
Use subtraction / normalization procedures prior to sequencing.
Transcripts: what can we learn?
Comparing genome sequences to transcripts allows:
–Confirmation of gene predictions
–Experimental identification of exons/introns, 5’ UTRs, 3’ UTRs
–Detection of alternative splicing
Assess the relative abundance of transcripts.
Protein databases
SwissProt: carefully curated / annotated database of experimentally determined protein sequences: entries
PIR: Protein Information Resource: entries
Translated nucleotide databases: nr, trEMBL, RefSeqP, etc.
Gene ontology
Gene Ontology (GO) is a controlled vocabulary that can be applied to all organisms
The three organizing principles of GO are molecular function, biological process and cellular component
Combining EST frequencies with GO
EST frequencies from blood-fed vs. non-blood-fed mosquitoes, grouped by function
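As a rough illustration of how EST counts from two libraries might be grouped by functional category, here is a minimal sketch. All names and data are hypothetical placeholders: `gene_to_category` stands in for a GO-style annotation table, each library is simply a list with one gene identifier per EST, and unannotated genes are lumped under "unknown".

```python
from collections import Counter


def category_frequencies(est_genes, gene_to_category):
    """Fraction of ESTs in one library that fall into each category.

    est_genes: one gene identifier per sequenced EST.
    gene_to_category: hypothetical gene -> functional category mapping.
    """
    counts = Counter(gene_to_category.get(gene, "unknown") for gene in est_genes)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def compare_libraries(lib_a, lib_b, gene_to_category):
    """Return {category: (frequency in lib_a, frequency in lib_b)}."""
    freq_a = category_frequencies(lib_a, gene_to_category)
    freq_b = category_frequencies(lib_b, gene_to_category)
    categories = sorted(set(freq_a) | set(freq_b))
    return {c: (freq_a.get(c, 0.0), freq_b.get(c, 0.0)) for c in categories}


# Placeholder data only; not taken from any real experiment.
annotation = {"g1": "category A", "g2": "category B"}
library_1 = ["g1", "g1", "g2", "g3"]
library_2 = ["g1", "g2", "g2", "g2"]
comparison = compare_libraries(library_1, library_2, annotation)
```

Working with fractions rather than raw counts keeps libraries of different sizes comparable. A real analysis would also need a statistical test before calling a difference significant.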
What's new?
Mass spectrometry was invented at the turn of the 20th century (Thomson)
Nobel Prize to Aston, 1922
MALDI-TOF (Henzel et al., 1993)
Nano-electrospray (Wilm, Mann 1996) coupled to tandem mass spectrometer
Nobel Prize 2002 to John B. Fenn and Koichi Tanaka
Nanoelectrospray ionization – tandem mass spectrometry (MS-MS)
Positive ESI-MS m/z spectrum of the protein hen egg white lysozyme. The sample was analysed in a solution of 1:1 (v/v) acetonitrile : 0.1% aqueous formic acid, and the m/z spectrum shows a Gaussian-type distribution of multiply charged ions. Each peak represents the intact protein molecule carrying a different number of charges (protons).

The peak width is greater than that of the singly charged ions seen in the leucine enkephalin spectrum, because the isotopes associated with these multiply charged ions are not clearly resolved as they were for the singly charged ions. The individual peaks in the multiply charged series become closer together at lower m/z values and, because the molecular weight is the same for all of the peaks, those with more charges appear at lower m/z values than those with fewer charges (M. Mann, C. K. Meng, J. B. Fenn, Anal. Chem., 1989, 61, 1702).

The m/z values can be expressed as follows:

$$ m/z = \frac{MW + nH^+}{n} $$

where
m/z = the mass-to-charge ratio marked on the abscissa of the spectrum
MW = the molecular weight of the sample
n = the integer number of charges on the ions
H⁺ = the mass of a proton (1.008 Da)

If the number of charges on an ion is known, determining the molecular weight of the sample is simply a matter of reading the m/z value from the spectrum and solving the above equation. Usually the number of charges is not known, but it can be calculated on the assumption that any two adjacent members in the series of multiply charged ions differ by one charge.
For example, if the ions appearing at m/z 1431.6 in the lysozyme spectrum have n charges, then the ions at m/z 1301.4 will have n+1 charges, and the above equation can be written again for these two ions:

$$ 1431.6 = \frac{MW + nH^+}{n} \qquad \text{and} \qquad 1301.4 = \frac{MW + (n+1)H^+}{n+1} $$

These simultaneous equations can be rearranged to exclude the MW term:

$$ n(1431.6) - nH^+ = (n+1)(1301.4) - (n+1)H^+ $$

and so:

$$ n(1431.6) = n(1301.4) + 1301.4 - H^+ $$

therefore:

$$ n(1431.6 - 1301.4) = 1301.4 - H^+ $$

and:

$$ n = \frac{1301.4 - H^+}{1431.6 - 1301.4} $$

hence the number of charges on the ions at m/z 1431.6 is approximately 1300.4 / 130.2, which rounds to 10.

Putting the value of n back into the equation:

$$ 1431.6 = \frac{MW + nH^+}{n} $$

gives

$$ 1431.6 \times 10 = MW + (10 \times 1.008) $$

and so

$$ MW = 14{,}316 - 10.08 $$

therefore

$$ MW = 14{,}305.9 \ \text{Da} $$

The observed molecular weight is in good agreement with the theoretical molecular weight of hen egg lysozyme (based on average atomic masses).

This may seem long-winded, but fortunately the molecular weight of the sample can be calculated automatically, or at least semi-automatically, by the processing software associated with the mass spectrometer. This is of great help for multi-component mixture analysis, where the m/z spectrum may well contain several overlapping series of multiply charged ions, with each component exhibiting completely different charge states.

Using electrospray or nanospray ionisation, a mass accuracy of within 0.01% of the molecular weight should be achievable, which in this case represents +/- 1.4 Da.
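The charge-state calculation above can be sketched in a few lines of code. This is only an illustration of the arithmetic, not the algorithm used by any instrument's processing software. The function names are made up for this example, and the proton mass is taken as the rounded 1.008 Da used in the worked example.

```python
PROTON_MASS = 1.008  # Da; rounded value used in the worked example (assumption)


def charge_from_adjacent_peaks(mz_high, mz_low, proton=PROTON_MASS):
    """Estimate the charge n of the peak at mz_high.

    Assumes the adjacent peak at the lower m/z value (mz_low) carries
    exactly one more charge, n + 1.
    """
    return (mz_low - proton) / (mz_high - mz_low)


def molecular_weight(mz, n, proton=PROTON_MASS):
    """Solve m/z = (MW + n*H) / n for MW."""
    return n * (mz - proton)


# The two lysozyme peaks from the worked example.
n_estimate = charge_from_adjacent_peaks(1431.6, 1301.4)
n = round(n_estimate)              # charges must be a whole number
mw = molecular_weight(1431.6, n)   # molecular weight in Da

print(f"n = {n}, MW = {mw:.1f} Da")
```

Rounding the estimate to the nearest integer is what makes the method tolerant of small errors in the peak positions. In practice, the molecular weights computed from every peak in the series would be averaged.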