1 Analysis of single sequences

2 Toolboxes EMBOSS (many portals). Biology Workbench. ExPASy proteomics tools. Biotools @ U. Mass. Med. School. Many, many more…

3 Before we start - VecScreen When you get a DNA sequence from the sequencer, make sure it is really the sequence you think it is. If you don’t, you may spend a lot of time analysing the wrong sequence! Possible problems: Contamination! Work clean. Always check for vector contamination.

4 Vector contamination Failure to recognize foreign segments in a sequence can: –Lead to erroneous conclusions about the biological significance of the sequence –Waste time and effort in analysis of contaminated sequence –Delay the release of the sequence in a public database –Pollute public databases with contaminated sequence

5 Reminder: Cloning procedure The DNA of interest is cloned into a vector. The resultant DNA may (probably does) contain sections from the vector.

6 VecScreen VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. NCBI developed VecScreen to combat the problem of vector contamination in public sequence databases.

7 VecScreen

8 EMBOSS European Molecular Biology Open Software Suite. Built for use from the command line. Many EMBOSS portals, servers and mirrors are available. Each program has its own help file. One server: http://emboss.bioinformatics.nl/ Examples of a few EMBOSS programs:

9 Briefly – What is PCR The polymerase chain reaction (PCR) is a technique to amplify a single copy of a piece of DNA.

10 Briefly – What is PCR The number of copies of the target DNA increases exponentially. After 35 cycles: 2^36 ≈ 68 billion copies.
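The arithmetic behind the slide's figure can be sketched in a few lines. This is only one reading of the number, under assumptions the slide does not state: the reaction starts from a single double-stranded molecule, single strands are counted, and every cycle doubles perfectly. The function name `strands_after` is made up for this illustration.

```python
def strands_after(cycles: int, start_molecules: int = 1) -> int:
    """Single DNA strands after `cycles` rounds of perfect doubling.

    Assumption (not stated on the slide): we start from double-stranded
    molecules, count single strands, and every cycle is 100% efficient.
    """
    strands_at_start = 2 * start_molecules
    return strands_at_start * 2 ** cycles


if __name__ == "__main__":
    for n in (1, 2, 3, 35):
        print(f"after {n:2d} cycles: {strands_after(n):,} strands")
```

Real reactions fall short of perfect doubling, so the value for 35 cycles is an upper bound rather than an expected yield.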

11 Primer design Primer length: the optimal length is 18-22 bp. Primer melting temperature (Tm): the temperature at which one half of the DNA duplex will dissociate. A Tm of 52-58 °C produces the best results. Also consider: GC content, primer secondary structures, repeats.

12 Primer design Avoid Template secondary structure. Avoid Cross homology: –Commonly, primers are BLASTed to test the specificity.
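The length, Tm, GC and repeat criteria above can be checked with a short script. This is a minimal sketch, not what primer3 does: it assumes the Wallace rule (Tm = 2 °C per A/T + 4 °C per G/C), a rough estimate for short oligos, whereas tools such as primer3 use more detailed thermodynamic models. The function names and the example primer are invented for illustration.

```python
from itertools import groupby


def gc_content(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def wallace_tm(primer: str) -> int:
    """Rough Tm in degrees C by the Wallace rule (an assumption of this sketch)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def longest_run(primer: str) -> int:
    """Length of the longest stretch of one repeated base."""
    return max(len(list(group)) for _, group in groupby(primer.upper()))


def check_primer(primer: str) -> dict:
    """Compare a primer with the ranges quoted on the slide."""
    tm = wallace_tm(primer)
    return {
        "length_ok": 18 <= len(primer) <= 22,
        "tm": tm,
        "tm_ok": 52 <= tm <= 58,
        "gc_percent": gc_content(primer),
        "longest_run": longest_run(primer),
    }


if __name__ == "__main__":
    # Hypothetical 20-mer, used only to exercise the checks.
    print(check_primer("AGCTTGACCATGCAAGTCAT"))
```

Secondary structure and cross homology are not covered here; those need a folding model and a database search such as BLAST.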

13 primer3 A program from the Whitehead Institute, written by Steve Rozen and Helen J. Skaletsky, for finding primers and oligonucleotide probes. One interface to primer3 is eprimer3, an EMBOSS program. Primer3Plus is a nicer interface to primer3, from Biotools (U. Mass. Med. School). We will use it.

14

15 Gene prediction Identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced (annotation).

16 Gene prediction Prokaryotes: –The sequence coding for a protein occurs as one contiguous open reading frame (ORF), typically many hundreds or thousands of bp. Eukaryotes: –CpG islands and binding sites for a poly(A) tail. –Difficult to use ORF detection because of splicing.

17 Gene prediction Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. Genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes. (Alexandre Lomsadze et al., 2005)

18 SixPack “Display a DNA sequence with 6-frame translation and ORFs” Set “Minimum size of ORFs” to 300, to obtain only meaningful ORFs (Proteins are usually longer than 100 aa). Set “ORF start with an M?” to “Yes” to obtain only ORFs that begin with a Methionine.
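What a six-frame translation with an ORF filter involves can be sketched in plain Python. This is an illustration of the idea, not sixpack's actual implementation: it assumes the standard genetic code, treats an ORF as a stretch between stop codons, and trims it to the first methionine when `start_with_m` is set. The names `six_frames` and `orfs` are invented here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> dict:
    """Translations of frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def orfs(seq: str, min_nt: int = 300, start_with_m: bool = True) -> list:
    """ORFs of at least min_nt nucleotides, as (frame, protein) pairs."""
    min_aa = min_nt // 3
    found = []
    for frame, protein in six_frames(seq).items():
        for segment in protein.split("*"):
            if start_with_m:
                start = segment.find("M")
                if start == -1:
                    continue
                segment = segment[start:]
            if segment and len(segment) >= min_aa:
                found.append((frame, segment))
    return found


if __name__ == "__main__":
    # Toy sequence; a low threshold is needed because it is so short.
    print(orfs("ATGAAATTTGGGTAA", min_nt=9))
```

With the default `min_nt=300` the filter keeps only ORFs of at least 100 amino acids, which matches the rule of thumb on the slide.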

19 SixPack The 1st section in the results page lists all the ORFs discovered.
>NM_118742.2_1_ORF1 Translation of NM_118742.2 in frame 1, ORF 1, threshold 500, 42aa
GIKRLLEGQFCYRAFTWPVEITSMQTTVRDFEEDSYLSLLVS
>NM_118742.2_1_ORF2 Translation of NM_118742.2 in frame 1, ORF 2, threshold 500, 909aa
MDFISSLIVGCAQVLCESMNMAERRGHKTDLRQAITDLETAIGDLKAIRDDLTLRIQQDG
LEGRSCSNRAREWLSAVQVTETKTALLLVRFRRREQRTRMRRRYLSCFGCADYKLCKKVS
AILKSIGELRERSEAIKTDGGSIQVTCREIPIKSVVGNTTMMEQVLEFLSEEEERGIIGV
YGPGGVGKTTLMQSINNELITKGHQYDVLIWVQMSREFGECTIQQAVGARLGLSWDEKET
GENRALKIYRALRQKRFLLLLDDVWEEIDLEKTGVPRPDRENKCKVMFTTRSIALCNNMG

20 SixPack The 2nd section shows a map of where the ORFs are in the actual sequence.

21 EMBOSS / plotorf Plots the ORFs found by sixpack:

22 ORF finder

23

24

25 Problems with ORF finding ORF finding can detect only 85% of genes. Problem cases: short proteins; more than one long ORF; an alternative start codon (the true start is not always the one furthest from the stop codon).

26 Possible solutions Searching the databases for similar proteins. Existence of such a protein will indicate this is a true gene. Gene prediction tools: –GeneMark: http://opal.biology.gatech.edu/GeneMark/ –Many more (e.g. see the CBCB website)

27 GeneMark The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods… Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification. (Alexandre Lomsadze et al., 2005)

28 GeneMark Sample output: exon prediction for PIP1B (remember the Gene entry?)
Gene  Exon  Strand  Exon      Exon Range    Exon    Start/End
#     #             Type                    Length  Frame
1     1     +       Initial    156   483    328     1 1  - -
1     2     +       Internal   709  1004    296     2 3  - -
1     3     +       Internal  1085  1225    141     1 3  - -
1     4     +       Terminal  1314  1409     96     1 3  - -
# protein sequence of predicted genes
>gene_1|GeneMark.hmm|286_aa
MEGKEEDVRVGANKFPERQPIGTSAQSDKDYKEPPPAPLFEPGELASWSFWRAGIAEFIA
TFLFLYITVLTVMGVKRSPNMCASVGIQGIAWAFGGMIFALVYCTAGISGGHINPAVTFG
LFLARKLSLTRAVYYIVMQCLGAICGAGVVKGFQPKQYQALGGGANTIAHGYTKGSGLGA
EIIGTFVLVYTVFSATDAKRNARDSHVPILAPLPIGFAVFLVHLATIPITGTGINPARSL
GAAIIFNKDNAWDDHWVFWVGPFIGAALAALYHVIVIRAIPFKSRS
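A quick consistency check on the sample output: the exon lengths should follow from the exon ranges, and the total coding length should match the 286 aa protein. The sketch below uses the coordinates from the slide and assumes that ranges are 1-based and inclusive and that the terminal exon includes the stop codon; both assumptions are inferred from the numbers, not stated in the output.

```python
# Exon type, start and end as listed in the GeneMark sample output.
EXONS = [
    ("Initial", 156, 483),
    ("Internal", 709, 1004),
    ("Internal", 1085, 1225),
    ("Terminal", 1314, 1409),
]


def exon_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive range (assumed convention)."""
    return end - start + 1


def coding_length(exons) -> int:
    return sum(exon_length(start, end) for _, start, end in exons)


def protein_length(exons) -> int:
    """Codons minus one, assuming the last codon is the stop codon."""
    total = coding_length(exons)
    if total % 3 != 0:
        raise ValueError("coding length is not a multiple of 3")
    return total // 3 - 1


if __name__ == "__main__":
    print([exon_length(s, e) for _, s, e in EXONS])  # per-exon lengths
    print(coding_length(EXONS), protein_length(EXONS))
```

The four exons add up to 861 nt, that is 287 codons, or 286 amino acids plus a stop codon, in agreement with the `286_aa` label in the FASTA header.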

29 extractseq Usually one would use sequence editing software like BioEdit. Extractseq is one editing tool available from EMBOSS. Many more options are available on the command line (see manual).

30 BioEdit

31 Multiple sequence alignment using EMBOSS seqret generates a multiple sequence file. emma aligns the sequences in that file. prettyplot generates a graphical alignment. Usually, one uses better tools for this. We’ll see them later on in the course.

32 Restriction maps Represent the locations in a DNA sequence cut by restriction enzymes. Are used, for example, in identifying whether DNA in a test-tube is the same as its putative sequence. Can be used in cloning to design inserts for plasmids.

33 ReMap Display sequence with restriction sites, translation etc. Useful in identification of single nucleotide polymorphisms (SNPs). If a SNP changes a restriction site, it will cause that RE to cut the DNA differently compared with the wild type. Other RE programs: redata, restrict …
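How a SNP shows up in a restriction map can be sketched by locating recognition sites in a wild-type and a variant sequence and comparing them. This is a toy illustration, not how remap or restrict work internally: it knows only three common enzymes, reports where a recognition site starts rather than where the enzyme cuts, and assumes both sequences have the same length (a substitution, not an indel). The example sequences are invented.

```python
import re

# Recognition sequences of three common enzymes (all palindromic,
# so searching one strand is enough).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(seq: str, sites: dict = SITES) -> dict:
    """1-based start positions of each recognition site."""
    seq = seq.upper()
    return {
        enzyme: [m.start() + 1 for m in re.finditer(f"(?={site})", seq)]
        for enzyme, site in sites.items()
    }


def lost_sites(wild_type: str, variant: str) -> dict:
    """Sites present in the wild type but missing in the variant."""
    wt, var = find_sites(wild_type), find_sites(variant)
    lost = {}
    for enzyme, positions in wt.items():
        missing = sorted(set(positions) - set(var[enzyme]))
        if missing:
            lost[enzyme] = missing
    return lost


if __name__ == "__main__":
    wild_type = "TTGAATTCAAGGATCCTT"
    variant = "TTGAGTTCAAGGATCCTT"  # hypothetical A>G SNP at position 5
    print(find_sites(wild_type))
    print(lost_sites(wild_type, variant))
```

In the example the SNP destroys the EcoRI site, so an EcoRI digest of the variant would give a different fragment pattern from the wild type, while the BamHI site is unaffected.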

34 PepStats PEPSTATS of AAU04762.1 from 1 to 1020
Molecular weight = 116504.98    Residues = 1020
Average Residue Weight = 114.221    Charge = 7.5
Isoelectric Point = 6.9295
A280 Molar Extinction Coefficient = 102130
A280 Extinction Coefficient 1mg/ml = 0.88
Improbability of expression in inclusion bodies = 0.709
Residue    Number  Mole%   DayhoffStat
A = Ala    29      2.843   0.331
B = Asx    0       0.000   0.000
…
Y = Tyr    22      2.157   0.634
Z = Glx    0       0.000   0.000
Property   Residues                  Number  Mole%
Tiny       (A+C+G+S+T)               237     23.235
Small      (A+B+C+D+G+N+P+S+T+V)     444     43.529
Aliphatic  (A+I+L+V)                 300     29.412
Aromatic   (F+H+W+Y)                 107     10.490
Non-polar  (A+C+F+G+I+L+M+P+V+W+Y)   508     49.804
Polar      (D+E+H+K+N+Q+R+S+T+Z)     512     50.196
Charged    (B+D+E+H+K+R+Z)           289     28.333
Basic      (H+K+R)                   156     15.294
Acidic     (B+D+E+Z)                 133     13.039
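The property rows of the report are simple counts over residue groups. The sketch below reproduces that part of the calculation, using the group definitions exactly as they appear in the output above; it is an illustration of the arithmetic rather than pepstats itself, and the function names and the four-residue test peptide are invented.

```python
from collections import Counter

# Residue groups copied from the PepStats property table.
PROPERTIES = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "Non-polar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def mole_percent(count: int, total: int) -> float:
    """Share of residues as a percentage, rounded to three decimals."""
    return round(100.0 * count / total, 3)


def property_table(protein: str) -> dict:
    """Map each property to (number of residues, mole%)."""
    counts = Counter(protein.upper())
    total = len(protein)
    table = {}
    for name, residues in PROPERTIES.items():
        n = sum(counts[r] for r in residues)
        table[name] = (n, mole_percent(n, total))
    return table


if __name__ == "__main__":
    for name, (n, pct) in property_table("ACDK").items():
        print(f"{name:10s} {n:3d} {pct:7.3f}")
```

As a check against the report, 29 alanines in 1020 residues give `mole_percent(29, 1020)` = 2.843, the value listed for Ala.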

35 PepInfo Plot panels: Tiny, Small, Aliphatic, Aromatic, Non-polar, Polar, Charged, Basic, Acidic.

36 PepInfo Protein with transmembrane sections

37 PepInfo Protein without transmembrane sections

38 TMHMM Very good tool for identifying transmembrane segments. http://www.cbs.dtu.dk/services/TMHMM/

39 Conclusion A tip of the iceberg of what can be done with a sequence. If you start working with sequences, you will have to decide which tools suit you best. It has a lot to do with personal preference and something to do with algorithm accuracy.

